The other day I told Oliver that I was tired of being on break and wanted to map something. In response, he gave me a full extract of the geolocated Wikipedia articles.
The file is huge, but with some work I managed to import it into ArcMap. From the general kernel density, it looked like the majority of articles in the United States are found near New York City. But NYC is the largest city in the US by population. What happens when you normalize by population? (Full size Map)
The easiest way to do this is to divide the article count in each area by that area's population, which gives the number of articles per person. Since my county data layer included 2010 population, it seemed like a decent fit for the job.
I did a spatial join from the county data to the article points and came up with the article count for each county. This gave me broadly the same view as the kernel density, though it showed some quirks along the US-Canada border. (Full size Map)
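For anyone without ArcMap, the counting step of a spatial join can be sketched in plain Python. This is only an illustration of the idea, not what ArcMap does internally: the square "counties", their names, and the article coordinates are made up, and the helper functions `point_in_polygon` and `count_points_per_county` are hypothetical names of my own. It also assumes simple, non-overlapping polygons in flat coordinates.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray running right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_points_per_county(counties, points):
    """Return {county name: number of points that fall inside it}."""
    counts = Counter({name: 0 for name in counties})
    for x, y in points:
        for name, ring in counties.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # assumption: counties do not overlap
    return dict(counts)

# Made-up square "counties" and article coordinates, for illustration only.
counties = {
    "County A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "County B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
articles = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]

article_counts = count_points_per_county(counties, articles)
print(article_counts)
```

The last point falls outside both polygons and is simply dropped. Checking every point against every polygon is fine for a toy example, but a real extract of this size would want a spatial index.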
I then did my normalization to figure out which counties have the highest article count per person. I expected DC, Boston, Chicago, and Philadelphia to pop out due to their historical significance. Strangely, the major West Coast population centers and New York City are notably absent, while the Twin Cities are included. What is most interesting is the number of articles found along the US-Canada border. Why are there so many articles in central Montana? What happened in Northern Maine? I have no idea. (Full size Map)
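The normalization itself is just a division per county. Here is a minimal sketch, assuming the counts and the 2010 populations are already in two dictionaries keyed by county. The county names and numbers below are invented for illustration, and `articles_per_capita` is a hypothetical helper, not an ArcMap tool.

```python
def articles_per_capita(article_counts, populations, per=1000):
    """Return {county: articles per `per` residents}.

    Counties with a missing or zero population are skipped so the
    division never blows up.
    """
    rates = {}
    for county, count in article_counts.items():
        population = populations.get(county, 0)
        if population > 0:
            rates[county] = count * per / population
    return rates

# Invented numbers, for illustration only.
article_counts = {"Big City County": 5000, "Rural Border County": 300}
populations = {"Big City County": 2_000_000, "Rural Border County": 10_000}

rates = articles_per_capita(article_counts, populations)
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

With these invented numbers the rural county ranks first even though it has far fewer articles. A small population needs only a modest article count to produce a high ratio, which may be part of what the map is showing.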
The data looks prettier with 20 natural breaks.
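Natural breaks classification picks class boundaries so that values within each class are as similar as possible. The sketch below shows that idea with a small dynamic program that minimizes the within-class sum of squared deviations. It is my own illustration, not ArcMap's implementation; the function name `natural_breaks` and the sample values are made up.

```python
def natural_breaks(values, n_classes):
    """Return the upper bound of each class for an optimal 1-D split.

    "Optimal" here means the split of the sorted values into n_classes
    contiguous groups with the smallest total within-group sum of
    squared deviations from each group's mean.
    """
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for v in data:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting data[:j] into k classes.
    # split[k][j]: index where the last of those k classes starts.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i

    # Walk the split table backwards to recover each class's top value.
    bounds = []
    j = n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]

# Made-up per-capita rates with three obvious clusters.
sample = [1, 2, 3, 10, 11, 12, 50, 52]
breaks = natural_breaks(sample, 3)
print(breaks)
```

The triple loop is quadratic in the number of values, so this is only practical for small inputs. For a few thousand counties and 20 classes, the classification built into the GIS is the better choice.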