As an avid reader of Runner’s World, I was excited when the current issue contained not only an article titled “America’s 50 Best Running Cities” but also the data and methodology used to choose the cities. If you read this blog, you know how I love me some reproducible research.
The methodology, according to the magazine:
> We started with a list of 250 U.S. cities with populations of more than 160,000 that had the highest number of households per capita reporting participation in running within the last 12 months (according to the SimplyMap 2014 census study). Then we gathered data from myriad sources to create five indexes of special importance to runners, ranking the cities in each index from 1 to 150. We then weighted the indexes and tallied up the scores to create the final list.
The indexes and weights are described below (with more-detailed descriptions in the article):
- Run (40%) – Presence of RRCA- and USATF-sanctioned clubs, as well as races and running stores.
- Parks (20%) – The number of (and access to) trails, open spaces, running tracks, and other fitness facilities.
- Climate (20%) – An index of ideal running weather, including precipitation levels, air quality and daily average temperatures closest to 55 degrees Fahrenheit.
- Food (10%) – Analysis of residents’ access to healthful food and farmer’s markets.
- Safety (10%) – Measure of crime and traffic incidents involving pedestrians.
My original idea for this post was to explore the different variables on a map and maybe adjust the weights a little to see different outcomes. (Safety, in my opinion, was underweighted, especially since traffic deaths and injuries are a major problem in San Francisco.) But I ran into an issue.
Using the data (the full table was included in the magazine), I decided to recreate the weighted scores using the R script below:
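A minimal sketch of that kind of calculation, not the original script: it assumes a data frame named `cities` with one column per index (the column names are hypothetical) and uses three rows from the results table.

```r
# Index weights as published by the magazine (they sum to 1).
weights <- c(run = 0.40, parks = 0.20, climate = 0.20, food = 0.10, safety = 0.10)

# Three rows from the results table; column names are assumed for illustration.
cities <- data.frame(
  city    = c("San Francisco, CA", "San Diego, CA", "New York, NY"),
  run     = c(1, 10, 4),
  parks   = c(5, 34, 20),
  climate = c(6, 9, 42),
  food    = c(19, 30, 39),
  safety  = c(146, 97, 108)
)

# Weighted score = sum of (index rank * weight) for each city.
cities$score <- as.vector(as.matrix(cities[, names(weights)]) %*% weights)

# Lower scores are better, so sort ascending.
print(cities[order(cities$score), c("city", "score")])
```

Selecting the columns with `names(weights)` keeps the index columns and the weights in the same order, so a reordered data frame cannot silently pair a rank with the wrong weight.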
Note: I also tried the script with R’s `weighted.mean` function and received the same results. The math is the same; I was just double-checking my arithmetic.
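To illustrate why the two approaches agree, here is a small sketch for a single city, using San Francisco's index ranks from the results table. The vector names are assumptions for illustration. Because the weights sum to 1, dividing by their sum (which is what `weighted.mean` does) changes nothing.

```r
# Published weights and San Francisco's rank on each index.
weights <- c(run = 0.40, parks = 0.20, climate = 0.20, food = 0.10, safety = 0.10)
sf      <- c(run = 1, parks = 5, climate = 6, food = 19, safety = 146)

# Manual weighted sum versus R's built-in weighted mean.
manual  <- sum(sf * weights)
by_mean <- weighted.mean(sf, weights)

# TRUE when the two agree up to floating-point tolerance.
print(all.equal(manual, by_mean))
```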
The table below shows the outcome of the script, sorted by the weighted final score (ascending). You can see the issue right away by looking at the “Ranking” column. Portland, which the magazine ranked 6th, now lands above Washington, D.C., which it ranked 5th. (And as a current resident of the D.C. area who has visited Portland, I agree with that new ranking because … well, weather, mostly.)
| Ranking | City | Run | Parks | Climate | Food | Safety | Weighted score |
|---|---|---|---|---|---|---|---|
| 1 | San Francisco, CA | 1 | 5 | 6 | 19 | 146 | 19.10 |
| 4 | San Diego, CA | 10 | 34 | 9 | 30 | 97 | 25.30 |
| 8 | New York, NY | 4 | 20 | 42 | 39 | 108 | 28.70 |
| 13 | Colorado Springs, CO | 12 | 71 | 67 | 41 | 10 | 37.50 |
| 14 | San Jose, CA | 43 | 29 | 24 | 12 | 92 | 38.20 |
| 15 | Los Angeles, CA | 33 | 48 | 12 | 44 | 128 | 42.40 |
| 24 | St. Louis, MO | 17 | 8 | 127 | 92 | 101 | 53.10 |
| 26 | Virginia Beach, VA | 46 | 67 | 36 | 53 | 18 | 46.10 |
| 27 | St. Paul, MN | 53 | 21 | 94 | 14 | 50 | 50.60 |
| 29 | Santa Rosa, CA | 103 | 63 | 14 | 5 | 3 | 57.40 |
| 31 | Las Vegas, NV | 19 | 126 | 27 | 108 | 135 | 62.50 |
| 39 | Des Moines, IA | 69 | 62 | 110 | 7 | 145 | 77.20 |
| 41 | Salt Lake City, UT | 58 | 76 | 43 | 83 | 46 | 59.90 |
| 47 | San Antonio, TX | 16 | 94 | 111 | 122 | 130 | 72.60 |
| 49 | Oklahoma City, OK | 27 | 103 | 101 | 93 | 94 | 70.30 |
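One way to surface the mismatches without eyeballing the table is to compare each city's position by published rank with its position by recomputed score. The sketch below does that for four rows of the table above. The data frame and column names are assumptions for illustration, and positions are computed within this subset only.

```r
# Published rank and recomputed weighted score for four rows of the table.
results <- data.frame(
  city      = c("Los Angeles, CA", "St. Louis, MO", "Virginia Beach, VA", "St. Paul, MN"),
  published = c(15, 24, 26, 27),
  score     = c(42.4, 53.1, 46.1, 50.6)
)

# Position within this subset by recomputed score (lower is better)
# and by the magazine's published rank.
results$rank_by_score     <- rank(results$score)
results$rank_by_published <- rank(results$published)

# Cities whose position changes when ordered by the recomputed score.
moved <- results[results$rank_by_score != results$rank_by_published, ]
print(moved)
```

In this subset, Los Angeles keeps its place while St. Louis, Virginia Beach, and St. Paul change positions, which mirrors the Portland and Washington, D.C. swap described above.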
I thought it might be a data entry error, so I took 10 random observations from the data frame and double-checked the numbers. Everything checked out.
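A spot check like that can be sketched as follows. The `cities` data frame here is a hypothetical stand-in built from the rows shown in the table, and the seed is arbitrary, set only so the same rows are drawn on every run.

```r
# Hypothetical stand-in for the full data frame: cities and recomputed scores.
cities <- data.frame(
  city = c("San Francisco, CA", "San Diego, CA", "New York, NY",
           "Colorado Springs, CO", "San Jose, CA", "Los Angeles, CA",
           "St. Louis, MO", "Virginia Beach, VA", "St. Paul, MN",
           "Santa Rosa, CA", "Las Vegas, NV", "Des Moines, IA",
           "Salt Lake City, UT", "San Antonio, TX", "Oklahoma City, OK"),
  score = c(19.1, 25.3, 28.7, 37.5, 38.2, 42.4, 53.1, 46.1,
            50.6, 57.4, 62.5, 77.2, 59.9, 72.6, 70.3)
)

set.seed(2016)  # arbitrary seed for a repeatable draw

# Draw 10 distinct rows at random to compare by hand against the printed table.
spot_check <- cities[sample(nrow(cities), 10), ]
print(spot_check)
```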
So what did I do wrong?