Malcolm Farmer/How many Wikipedia pages are there

Estimating article numbers raises some interesting questions.

Fiddling about with a Perl script to count the number of pages, I get the following counts, as of the morning of 12th December.

Selecting "comma pages" - pages where there's at least enough text to include a comma, which filters out redirects and a load of one-liner pages, we have

19931 comma pages found:
    4 were /Talk subpages
  472 were author pages (other than those already excluded above)
  261 were Wikipedia pages (other than those already excluded above)
----
19194 remaining


Subtract 27 for the Biographical Listing indexes, which are just lists of links even though they have commas, and 36 for the Complete list of encyclopedia topics pages and subpages, and that leaves 19131 articles. The current rate of addition of new articles is such that barring the wikipedia server going down in the next few hours, there will be more than 19000 by Thursday 13th December.

  • I'm now betting on 20,000 articles by Christmas.

Notes:


  1. I have a list of those signing their contributions, and any page with one of those names gets excluded from the count as a Wikipedian. The list now has 250 login names, but there are probably a handful of new wikipedians since I updated the list last week.
  2. Page titles containing the term "Wikipedia" are about Wikipedia itself, so such self-referential pages will be mostly excluded. There will be a handful of pages, such as the one on how to edit a page, that won't get counted as Wikipedia pages, but they are so few that I can't be bothered to adjust the numbers for them. Dec 12th: The number of Wikipedia pages has decreased since I last ran the script: this reflects the fact that many of them have been moved over to http://meta.wikipedia.com to give Magnus's new wikipedia code a workout.
  3. Some articles fall under two or more of these headings, but only get counted once above. Once a page title is found to match one of the above categories the title needs no further analysis, so the first matching category is the one in which a page will be counted.
  4. As Wikipedia grows, the old-style search is getting so slow that the comma search times out, more often than not, without returning a result. When we convert to Magnus's new code the problem will go away: his code can do the above counting automatically, at the server end.
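
The counting rules in the notes above can be sketched roughly as follows. This is not the original Perl script, just a minimal Python illustration: the `pages` dictionary, the `KNOWN_AUTHORS` set and the exact matching rules (a title whose top-level part is a known login name counts as an author page) are all assumptions made for the example. The categories are tested in the same order as the table, so a page is counted under the first one it matches.

```python
# Hypothetical stand-in for the list of about 250 login names.
KNOWN_AUTHORS = {"Malcolm Farmer", "Larry Sanger"}


def classify(title, text, authors=KNOWN_AUTHORS):
    """Return the first category a page matches, or 'article'."""
    if "," not in text:
        return "no comma"  # redirects and one-liners
    if title.endswith("/Talk"):
        return "talk"
    # Assumption: an author page is one whose top-level title is a login name.
    if title.split("/")[0] in authors:
        return "author"
    if "wikipedia" in title.lower():
        return "wikipedia"
    return "article"


def tally(pages, authors=KNOWN_AUTHORS):
    """Count pages per category; `pages` maps title -> page text."""
    counts = {}
    for title, text in pages.items():
        category = classify(title, text, authors)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Because the first match wins, a page such as "Malcolm Farmer/Talk" would be counted once, as a /Talk page, and never again as an author page.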


Possible pages to exclude to refine the counting

  • Historical anniversary pages - Page titles beginning January, February etc., as they are mostly links?
  • other suggestions?

The historical anniversaries pages are mostly links, I think; most of them have just one or two events each. But some of them have been rather aggressively expanded, e.g. June 28. If it can be programmed--and IANAP--you could have your script search for the number of asterisks on each page to determine how many events are included on it. Pages with a result of "0" will be there either to encourage people to use a certain format (i.e. they'll include the statement at the top saying what day of the year it is in which calendar, etc. but no events yet) or will have used colons to indent the event summaries, rather than asterisks to indent and demarcate them. --KQ
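
For what it's worth, the asterisk count suggested above might look something like this in Python. It is only a sketch: it counts lines that begin with an asterisk rather than every asterisk on the page, which is an assumption about what "number of events" should mean, and the `markers` parameter is a made-up option for also catching pages that indent events with colons.

```python
def count_events(page_text, markers=("*",)):
    """Count lines that begin with one of the given list markers."""
    return sum(
        1
        for line in page_text.splitlines()
        if line.lstrip().startswith(markers)
    )


sample = "June 28 is the 179th day of the year.\n* 1914 - ...\n* 1919 - ...\n"
print(count_events(sample))  # 2
```

A page that uses colons instead of asterisks would give 0 with the default setting and could be re-checked with `markers=("*", ":")`.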


I think the historical anniversary pages have useful encyclopedic-type info, myself, so I'd include them.

I'll say we have "over 6,000 articles" on the main page. Any objections? --LMS


A very interesting acquaintance of mine had a reputation for conservative estimates. If some people boasted, he "reverse-boasted". This was always to deprecate his personal ability and, at least for a while, was done with some semblance of humility (now it's an in-joke, but that's a completely different story). For this, he is renowned and celebrated within his peer group.

I see no harm if Wikipedia underestimates the number of articles that have been written. Such a refreshing change from the usual advertising hyperbole bombarding us from every direction.

Even having 4,000-5,000 articles is really quite a feat!


Accuracy beats out conservatism every day, I say. --LMS


Hi Malcolm. A 2:06 PM search on Sept. 19th reveals 12,502 comma articles. Time to up the front page count to 11,000? Well, extrapolating from your earlier counts, it seems very reasonable to think we passed the 11,000 mark a while ago. So I'm going to change the HomePage. --Larry Sanger

You're right; I was going to up the count this morning. -- Malcolm Farmer

I tried to use the old search engine to find articles with 2,000 characters (=about three paragraphs), per the instructions on Wikipedia Announcements/March 2001, but wasn't successful somehow. I'm not sure what that number is, but given that there were about 500 articles with 2000 characters when there were about 2000 comma articles, and given that there are now something like 12,000 comma articles, it seems to follow that there are now something like 3000 articles with 2000 characters.
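
Spelled out, the extrapolation is a simple proportion. The sketch below assumes the share of 2000-character articles among comma articles has stayed the same since the earlier count, which is a guess rather than something measured; the function name is made up for the example.

```python
def estimate_long_articles(earlier_long, earlier_comma, current_comma):
    """Scale an earlier count of long articles to the current comma count."""
    return round(current_comma * earlier_long / earlier_comma)


print(estimate_long_articles(500, 2000, 12000))  # 3000
```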

Why do I care? See Wikipedia commentary/Kill the Stub Pages. Recent criticisms of Wikipedia on K5 and Usenet make it clear that our PR might be improved by our counting up more substantial articles ("three paragraphs" is obviously arbitrary, but it's reasonably credible). So we could say "We have 11,000 articles, of which 3,000 have three or more paragraphs."

If we decide to do this, it will be psychologically important what number we choose to advertise on the front page, because that will set a length benchmark that will make an article seem officially "substantial." It might be better, instead, for somebody to (finally!) program a statistics page which gives various article number estimates. Then, on the front page, we could say just "11,000 articles" but link that to the statistics page, where the real deal would be stated.

Ideas??? --Larry Sanger

Perhaps the most informative solution would be a histogram of page sizes, with cumulative totals working backwards. I.e., 10 pages of 10k or more, 300 pages of 3k or more, 1000 pages of 2k or more, etc. -J
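
A rough Python sketch of such a cumulative table, assuming we already have a list of page sizes in characters (how to get that list from the server is left open, and the thresholds are just the ones mentioned above):

```python
def cumulative_size_table(sizes, thresholds=(2000, 3000, 10000)):
    """Return (threshold, pages at least that large), largest threshold first."""
    return [
        (t, sum(1 for s in sizes if s >= t))
        for t in sorted(thresholds, reverse=True)
    ]


sizes = [500, 1999, 2000, 2500, 3500, 12000]
for threshold, count in cumulative_size_table(sizes):
    print(f"{count} pages of {threshold} characters or more")
```

Each row counts every page at or above its threshold, so the totals grow as the threshold drops, which is the "working backwards" reading.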


Ye Olde "500 word essay" springs to mind... That would seriously address the "just hype" numbers. Also, maybe we should exclude CIA factbook text as well as pages from the 1911 encyclopedia, to be fairer? Regarding the conservatism argument, I agree; if our numbers are in error, it might be better to err on the small side.

-- BryceHarrington

Many CIA pages were edited a lot after import. --Taw