Krzysztof P. Jasiutowicz/Proposed keyword search mechanism

From Wikipedia


Proposed keyword search mechanism
Metadata in Wikipedia

The search mechanism on Wikipedia is rather basic. As Wikipedia grows, it will become more and more inefficient and inadequate.
We are not aware of any plans to make the Wikipedia software more robust, to take into account the amount of encyclopedic data being gathered (at an astonishing speed) and the hopefully growing number of Wikipedia end users.
BTW, we have no idea of even the approximate hit figures for different Wikipedia pages. Sigh.
Another point is that Wikipedia is weak at categorizing its data.
Some efforts are being made, but with the arrival of the EB articles they will definitely lag behind.
Why is Google successful while AltaVista is not?
Because Google has an ingenious system for categorizing and ranking Web pages.
I have only a vague idea of how the Wikipedia software works, but perhaps my idea is viable.

The mechanism:

  • Each page should have a subpage called "Keywords"
  • on that page the author/editor puts keywords describing the page's data
  • on each line there are one or more keywords
  • keywords are delimited by whitespace or commas
  • the keywords are ranked
  • ranking is in top down order
  • the higher a keyword on the page the higher its relevance
  • each line ends with
  • lines beginning with "-" are skipped
  • ranking is absolute: keywords on the same line have the same weight on every page
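
To make the list above concrete, here is a minimal sketch of how such a "Keywords" subpage might be read. The function name `parse_keywords` is made up for illustration. Two things are assumptions, not part of the proposal: the line terminator is left unspecified above, so a plain newline is used, and skipped lines are assumed not to use up a rank.

```python
import re


def parse_keywords(text):
    """Return {keyword: rank}; rank 1 is the first keyword line on the subpage."""
    ranks = {}
    rank = 0
    for line in text.splitlines():
        line = line.strip()
        # Blank lines hold no keywords; lines beginning with "-" are skipped.
        if not line or line.startswith("-"):
            continue
        rank += 1  # assumption: skipped lines do not consume a rank
        # Keywords are delimited by whitespace or commas.
        for word in re.split(r"[,\s]+", line):
            if word:
                # Keep the best (lowest) rank if a keyword is repeated.
                ranks.setdefault(word.lower(), rank)
    return ranks


subpage = "poland, history\n- a skipped line\nwarsaw krakow\n"
print(parse_keywords(subpage))
```

Because ranking is absolute, a rank taken from one page can be compared directly with a rank taken from any other page.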

Then suitable software can easily search those pages, digest them and present nice search results. Ha.
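
As a rough illustration of what that "suitable software" could do, the sketch below scores pages against a query using keyword ranks that have already been read from the subpages. The `search` function, the shape of `index` and the 1/rank weighting are all assumptions; the proposal only says that keywords higher on the page are more relevant.

```python
def search(index, query):
    """index maps page title -> {keyword: rank}; returns titles, best match first."""
    terms = query.lower().split()
    scored = []
    for title, ranks in index.items():
        # Assumed weighting: a keyword on line n contributes 1/n to the score.
        score = sum(1.0 / ranks[t] for t in terms if t in ranks)
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


index = {
    "Poland": {"poland": 1, "warsaw": 2},
    "Warsaw": {"warsaw": 1, "poland": 3},
    "Google": {"search": 1},
}
print(search(index, "warsaw"))
```

Only the small keyword subpages are read here, never the article text itself.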

What do you think?
Or maybe a much, much better software solution is already under way at Bomis?
My 2 cents.

I like the idea of a "hidden" subpage. I'd make it "/Metadata", and put all kinds of metadata there, including keywords, evaluations, etc., in a simple text format that was easy to parse so that the software could make use of it. The server fetching keywords for headers is one such use. I can imagine many others. --LDC

Because this information would mostly, if not entirely, be maintained manually, I'd prefer to see it as part of the main page; otherwise, I might have to edit two separate pages, with a greater risk of them getting out of sync. I'd use a first-column keyword (sorry Clifford), such as "$METADATA", that would end the displayed page; anything after "$METADATA" would be, well, metadata, only visible in an editing session.

This would also provide a smoother transition, because in the worst case we could begin using it before the feature is actually in the software, and searches, for instance, would then already work for the keywords. The first phase would be a SMOP to just stop display at the special keyword. Later phases could begin actually using the information. For example, this wouldn't be too obnoxious if visible at the bottom of a page:

+- - - - - -
| (body of article here)
|$METADATA (html generation/page display would cease here)
|Keyword: blah; blah; blah ...
|Title: This is a Pretty Title for This Page
|Abstract: A one-line abstract.  Well, maybe
|  two lines.

This follows the convention of many RFC-style headers: "Tag: value" starting in column one, with leading whitespace marking a continuation line. I guess you could use multiple "Keyword:" tags if you wanted to rank them, as suggested above. Also, perhaps use semicolons as keyword separators to allow embedded whitespace and commas. With much more elaboration you'd probably need database support, and would likely have to change wiki software (PHP- or Zope-based). --loh (2001-06-20)
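
A minimal sketch of both phases, assuming the layout shown in the example: the page is cut at the first line that starts with "$METADATA", and the rest is read as "Tag: value" lines with whitespace-led continuations. The function names (`split_page`, `parse_metadata`, `ranked_keywords`) are hypothetical, and treating the order of repeated "Keyword:" tags as the ranking is only one reading of the suggestion.

```python
MARKER = "$METADATA"


def split_page(source):
    """Return (body, metadata_text); display would stop at the marker line."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(MARKER):  # the marker must begin in column one
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    return source, ""  # no marker: the whole page is body


def parse_metadata(text):
    """Parse "Tag: value" lines into {tag: [values]}, repeated tags kept in order."""
    fields = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Leading whitespace continues the previous value.
            if tag is not None:
                fields[tag][-1] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line; ignored in this sketch
        tag = name.strip()
        fields.setdefault(tag, []).append(value.strip())
    return fields


def ranked_keywords(fields):
    """One list per "Keyword:" tag, first tag ranked highest (an assumption)."""
    return [
        [k.strip() for k in value.split(";") if k.strip()]
        for value in fields.get("Keyword", [])
    ]


page = (
    "Body of article here.\n"
    "$METADATA\n"
    "Keyword: search engine; Google, Inc.\n"
    "Keyword: wiki\n"
    "Abstract: A one-line abstract.  Well, maybe\n"
    "  two lines.\n"
)
body, meta = split_page(page)
print(body)
print(ranked_keywords(parse_metadata(meta)))
```

Splitting on semicolons is what lets "Google, Inc." survive as a single keyword with its comma and space intact.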

The Perl wiki has been quite effective, as we have already seen.
But what about the future?
I just thought that a subpage would mean less overhead for the wiki software than looking through all the pages.
Thanks for your comments.