I would like to store a list of all en.wikipedia articles in my database. For each article I want to store the pageid, the title and the popularity. I thought about using the view count over the last month as the measure of popularity, but if that is not possible I could go for something else (maybe the number of revisions). I'm aware of http://dumps.wikimedia.org/enwiki/latest/ and that I can get a full list of articles from there (current count 36508337). However, I cannot find a clever way to get the view count for each article.
// Updates, Edits, ... The suggested duplicate does not help me because a) I was looking for a popularity measurement. The answers to the other question just state that it is not possible to get the number of watchers for a page, which is fine with me. b) No answer there gives me the page views (or any other metric) for every page.
Okay, I'm finally done. Here is what I did:
I found http://dumps.wikimedia.org/other/pagecounts-ez/ which provides page views per month. This seems promising, but the files don't mention the pageid, so what I'm doing is getting a list of all articles from http://dumps.wikimedia.org/enwiki/latest/, creating a mapping name->pageid and then parsing the pagecount dump. This takes about 30 minutes. Here are some statistics:
1. 68% of the articles in the page count file do not exist in the latest dump. This is probably because some users link to, for example, Misfits_(TV_series) while others link to Misfits_(tv_series) or even to something like Misfits_%28TV_series%29. I did not bother with those because my program already took long enough to run.
2. The top 3 pages are:
2.1. Front page with 639 million views (in the last month)
2.2. Malware with 8.5 million views
2.3. Falcon 9 v1.1 with 4.7 million views (cool!)
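For anyone who wants to recover some of those unmatched titles, here is a rough sketch of the name->pageid join with a bit of title normalization. It is not the program I actually ran. The function names, the example pageid and the input shapes (a dict of title to pageid, and (title, views) pairs already parsed out of the pagecount file) are my own assumptions for illustration. It only undoes percent-encoding and capitalizes the first letter, so a variant like Misfits_(tv_series) would still be unmatched unless you also resolve redirects.

```python
from collections import defaultdict
from urllib.parse import unquote


def normalize_title(raw):
    """Percent-decode a title, use underscores for spaces, uppercase the first letter."""
    title = unquote(raw).strip().replace(" ", "_")
    return title[:1].upper() + title[1:]


def join_views(title_to_pageid, view_rows):
    """Sum views per pageid; return (views_by_pageid, unmatched_row_count).

    title_to_pageid: dict built from the article list (assumed shape).
    view_rows: iterable of (raw_title, view_count) pairs (assumed shape).
    """
    views = defaultdict(int)
    unmatched = 0
    for raw_title, count in view_rows:
        pageid = title_to_pageid.get(normalize_title(raw_title))
        if pageid is None:
            unmatched += 1
        else:
            # Several spellings of the same title end up on the same pageid.
            views[pageid] += count
    return dict(views), unmatched


if __name__ == "__main__":
    # 123 is a made-up pageid, used only for this example.
    mapping = {"Misfits_(TV_series)": 123}
    rows = [
        ("Misfits_%28TV_series%29", 10),
        ("misfits_(TV_series)", 5),
        ("Misfits_(tv_series)", 2),
    ]
    print(join_views(mapping, rows))
```

Summing into the pageid rather than keeping one row per spelling means the encoded and lower-case-first-letter variants add to the article's total instead of being thrown away.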
I made a histogram for the number of pages with a certain view count, here it is:
I also plotted the number of pages I would have to deal with when I disregard all articles below a certain view count. Here it is:
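The numbers behind that second plot can be recomputed without plotting anything. This is a small sketch, assuming you already have one monthly view count per page in a list; the function name and the sample values are made up for illustration.

```python
from bisect import bisect_left


def pages_at_or_above(view_counts, thresholds):
    """For each threshold, count how many pages have at least that many views."""
    ordered = sorted(view_counts)
    # bisect_left gives the index of the first count >= threshold.
    return {t: len(ordered) - bisect_left(ordered, t) for t in thresholds}


if __name__ == "__main__":
    sample_counts = [0, 5, 5, 100, 1000]  # made-up view counts
    print(pages_at_or_above(sample_counts, [1, 5, 6, 1000, 1001]))
```

Sorting once and using a binary search per cutoff keeps this cheap even with millions of pages and many candidate cutoffs.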