From WikiApiary, monitoring the MediaWiki universe

Revision as of 10:53, 15 April 2013 by Ete (talk | contribs) (Site Statistics are completely wrong: reply)

Site Statistics are completely wrong

To whomever,

I love the concept of this site and in general think it's a great thing, but sadly your statistics about my site are completely wrong. The issue is the way the MediaWiki software does page counts. It's a fault of the design of the magic word {{NUMBEROFARTICLES}}, which only counts a page if it contains at least one wiki link or belongs to at least one category. Normally, for wikis that grow organically, this is a perfectly justified method, but we uploaded, via a special script, the entire contents of the United States Yellow Pages. This created over 10 million pages, not the 1,400 our current statistics show us as having. Part of the problem is that when we created all those pages we did not record them in the category tables in MySQL -- that normally happens when someone edits and saves a page for the first time. I'm not sure how this can be corrected on this site, or whether it can be at all, since this data is collected via automatic bots, so we're leaving this note. Chris Tharp (talk) 03:18, 9 April 2013 (UTC)

Hmm, indeed Bumble Bee does collect articles, but it also collects pages. When I request statistics via the API I get <statistics pages="1464" articles="35" views="484733" edits="3635" images="742" users="48" activeusers="20" admins="3" jobs="0" />. WikiApiary collects and graphs both pages and articles. I am aware of the article-count issue, but I don't see how that would alter the page count returned via the API. This is actually why the pages line is highlighted more strongly than the articles line. You can see some discussion about this on my talk page. With all that said, I'm not at all sure why the page count from your statistics endpoint is returning a lower number. Perhaps others might know? 🐝 thingles (talk) 20:21, 9 April 2013 (UTC)
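For reference, the numbers above come from the standard siteinfo statistics query (api.php?action=query&meta=siteinfo&siprop=statistics). A minimal sketch of reading both counts from the XML fragment quoted in this thread -- the XML string below is copied verbatim from the reply above:

```python
# Parse the <statistics/> fragment returned by MediaWiki's siteinfo API.
import xml.etree.ElementTree as ET

response = ('<statistics pages="1464" articles="35" views="484733" '
            'edits="3635" images="742" users="48" activeusers="20" '
            'admins="3" jobs="0" />')

# The element's attributes carry all the counters as strings.
stats = ET.fromstring(response).attrib
pages = int(stats["pages"])        # total pages in all namespaces
articles = int(stats["articles"])  # only pages passing the article-count test
```

The gap between the two numbers (1464 pages vs. 35 articles) is exactly the effect of the article-count method discussed above; the puzzle in this thread is why even the pages figure is far below 10 million.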
Thingles, I wish I had a clue why the page count in the statistics is so far off, but so far I've been unable to find an answer to the question. Since it's not my top concern, I decided in the end not to worry about it. I'm sure the problem lies with the fact that we "cheated" when we added all our data and didn't record everything to every table. These days, if anyone questions my statements on the size of Yellpedia, I just tell them to do a search by state -- entering State Ca, for example, returns 1,180,695 page results. (Not implying you're questioning my statements, but some have.) All the best with your project here. Chris Tharp (talk) 21:16, 9 April 2013 (UTC)
It would be interesting to see what Semantic usage would report (see the note below this). I'm curious: by "cheated", did you insert content directly into MediaWiki's database when you uploaded via your script? If so, I'm guessing it's exactly as you are thinking -- some internal record is not getting updated to reflect the page count correctly. For what it's worth, I would suggest doing that via the MediaWiki edit API instead. That would ensure you are protected from internal database changes and that all internal references are handled properly. 🐝 thingles (talk) 21:54, 9 April 2013 (UTC)
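To make the suggestion concrete, here is a rough sketch of creating pages through the Action API rather than with direct SQL inserts. The endpoint URL is a placeholder, and obtaining the CSRF token (via action=query&meta=tokens) and login are omitted; the point is that saving through action=edit makes MediaWiki update the page, revision, link, and category tables in one step.

```python
# Sketch: page creation via the MediaWiki Action API instead of raw MySQL.
# API_URL is a placeholder; replace it with your wiki's api.php.
import urllib.parse
import urllib.request

API_URL = "https://example.org/w/api.php"  # placeholder endpoint

def edit_params(title, text, token):
    """Parameters for one action=edit request. Saving through this path
    keeps every internal table (and the site statistics) consistent."""
    return {
        "action": "edit",
        "title": title,
        "text": text,
        "token": token,      # CSRF token from action=query&meta=tokens
        "createonly": 1,     # refuse to overwrite an existing page
        "format": "json",
    }

def create_page(title, text, token):
    """POST one page-creation request to the wiki."""
    data = urllib.parse.urlencode(edit_params(title, text, token)).encode()
    return urllib.request.urlopen(urllib.request.Request(API_URL, data=data))
```

Bulk imports built this way are slower than direct inserts, but they never leave the database in the half-recorded state described above.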
Just thinking about this further: that is likely also why your edit count is too low. Each one of those new pages should count as an edit. 🐝 thingles (talk) 21:56, 9 April 2013 (UTC)
Yes, we inserted content directly into the MediaWiki database, which may or may not have been the ideal way to do it, but it's done now. Adding data via the method you suggested is most likely the better way to go, and in the future we probably will use it. The thing I'm curious about is the speed of the API -- as I recall, we were adding something like 120,000 pages an hour when we uploaded our data. If all goes well, we are going to face this problem again, since we hope to expand beyond just the directory listings of the United States. Chris Tharp (talk) 00:07, 10 April 2013 (UTC)
You should be able to correct this by setting $wgArticleCountMethod = 'any'; and then running updateArticleCount.php. Though for a site with as many pages as yours, it may take some time to run.--Ete (talk) 20:59, 14 April 2013 (UTC)
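Spelled out, the suggested fix is a two-step change (paths are relative to the wiki's installation directory; depending on the MediaWiki version, the maintenance script may need an --update flag to actually write the new count rather than just report it):

```php
// In LocalSettings.php: count any content page as an article,
// instead of only pages with at least one link or category
// (the default 'link' counting method).
$wgArticleCountMethod = 'any';

// Then, from the wiki root, rebuild the cached article count:
//   php maintenance/updateArticleCount.php --update
```

This only recomputes the cached article figure from the existing tables; it does not repair rows that were never written in the first place, which is the complication discussed in the replies below.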
Ete, thanks for the suggestion, but unfortunately setting the page count method to 'any' via $wgArticleCountMethod doesn't work. The problem lies with the fact that we didn't write to every table in the database when we created the pages, which causes this error no matter which method is used. Actually, I'm now thinking that, when I get time, the answer lies in writing a bot that opens each page for editing and then saves it, which would correct how every page is recorded. But since my primary focus is getting traffic and improving the user experience, this correction is a low-importance problem for me. Thanks, however, for reminding me why I love the wiki community -- everybody working and throwing stuff together. I was going to stop there, but as I was writing I realized I could throw this problem out to the community: does anyone know of, or have, the code for a bot like the one I described that I can steal, borrow, or copy? Chris Tharp (talk) 14:26, 15 April 2013 (UTC)
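For anyone picking this up: the "open and resave every page" bot described above could be sketched against the Action API roughly as follows. The endpoint, throttle delay, and continuation handling are assumptions to adapt to the target wiki; an edit with appendtext="" is the usual API idiom for a null edit, which leaves the text unchanged but makes MediaWiki re-parse the page and rewrite its link and category rows as if it had been edited and saved.

```python
# Hypothetical resave ("null edit") bot sketch for the MediaWiki Action API.
# API is a placeholder endpoint; token is a CSRF token from meta=tokens.
import json
import time
import urllib.parse
import urllib.request

API = "https://example.org/w/api.php"  # placeholder endpoint

def allpages_params(cont=None, limit=500):
    """Query parameters for one list=allpages batch."""
    params = {"action": "query", "list": "allpages",
              "aplimit": limit, "format": "json"}
    if cont is not None:
        params["apcontinue"] = cont  # continuation value from previous batch
    return params

def api_get(params):
    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as r:
        return json.load(r)

def api_post(params):
    data = urllib.parse.urlencode(params).encode()
    urllib.request.urlopen(urllib.request.Request(API, data=data))

def resave_all(token, delay=0.5):
    """Walk every page and submit a null edit so each one is re-recorded."""
    cont = None
    while True:
        data = api_get(allpages_params(cont))
        for page in data["query"]["allpages"]:
            api_post({"action": "edit", "title": page["title"],
                      "appendtext": "", "token": token, "format": "json"})
            time.sleep(delay)  # throttle: this wiki has ~10 million pages
        if "continue" not in data:
            break
        cont = data["continue"]["apcontinue"]
```

At 10 million pages even an aggressive throttle means days of runtime, so batching by namespace or resuming from a saved apcontinue value would be worth adding.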
Hm, have you tried just running updateArticleCount.php? Even if $wgArticleCountMethod is not relevant, that script may work by rebuilding the tables you did not write to. I'm not certain where it draws its data from or which tables you've written to, but it seems worth trying before looking into a bot to resave everything. I don't know of any bot like that, but I will let you know if I come across one.--Ete (talk) 14:53, 15 April 2013 (UTC)

Semantic usage?

I see that Yellpedia uses Semantic MediaWiki. You may find it useful to also enable collection of semantic usage information. I would be curious to see whether that returns radically different statistics than siteinfo, per the issue above. 🐝 thingles (talk) 20:22, 9 April 2013 (UTC)