Create maps in R in 10 (fairly) easy steps

Sure, you can use this technique for election maps, but it’s also handy for sales figures, mobile data coverage and many other types of data.

One problem with your typical election map is that it lets you see which candidate won where, but that's about it. In classic presidential election maps, that means it's easy to see if a state voted Democrat or Republican, but not necessarily which candidate actually won, thanks to large differences in U.S. population densities — giving us states like Montana, which is 4th-largest in area but tied for dead last in number of electoral votes.

That's why, if you're a data geek, it can be interesting to make your own election maps to visualize the questions that are important to you. For example: Where were a candidate's real areas of strength and weakness? Which areas contributed most to victory? This goal of better mapping trends can work with many other types of data, from sales figures to mobile data coverage.

There are many options for mapping. If you do this kind of thing often or want to create a map with lots of slick bells and whistles, it could make more sense to learn GIS software like Esri's ArcGIS or open-source QGIS. If you care only about well-used geographic areas such as cities, counties or zip codes, software like Tableau and Microsoft Power BI may have easier interfaces. If you don't mind drag-and-drop tools and having your data in the cloud, there are still more options such as Google Fusion Tables.

But there are also advantages to using R -- a language designed for data analysis and visualization. It's open source, which means you don't have to worry about ever losing access to (or paying for) your tools. All your data stays local if you want it to. It's fully command-line scripted end-to-end, making an easily repeatable process in a single platform from data input and re-formatting through final visualization. And, R's mapping options are surprisingly robust.

Ready to code your own election results maps -- or any other kind of color-coded choropleth map? Here’s how to handle a straightforward two-person race and a more complex race with three or more candidates in R.

We'll be using two mapping packages in this tutorial: tmap for quick static maps and leaflet for interactive maps. You can install and load them now with:

install.packages("tmap")
install.packages("leaflet")
library("tmap")
library("leaflet")

(Skip the install.packages lines for any R packages that are already on your system.)

Step 1: Get election results data

I'll start with the New Hampshire Democratic primary results, which are available from the NH secretary of state's office as a downloadable Excel spreadsheet.

Getting election data into the proper format for mapping is one of this project's biggest challenges -- more so than actually creating the map. For simplicity, let's stick to results by county instead of drilling down to individual towns and precincts.

One common problem: Results data need to have one column with all election district names — whether counties, precincts or states — and candidate names as column headers. Many election returns, though, are reported with each election district in its own column and candidate results by row.

That's the case with the official NH results. I transposed the data to fix that and otherwise cleaned up the spreadsheet a bit before importing it into R (such as removing ", d" after each candidate's name). The first column now has county names, while every additional column is a candidate name; each row is a county result. I also got rid of the "total" row at the bottom, which can interfere with data sorting.
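If you'd rather do the transposing in R than in a spreadsheet, here is a minimal sketch using base R's t(). The toy data frame, its column names and its vote counts are invented for illustration; they are not the actual NH results, and the real spreadsheet may need more cleanup than this.

```r
# Toy results laid out the "wrong" way: one row per candidate,
# one column per county. All numbers are invented for illustration.
wide <- data.frame(
  Candidate = c("Clinton, d", "Sanders, d"),
  Belknap   = c(100, 150),
  Carroll   = c(120, 180),
  stringsAsFactors = FALSE
)

# Strip the ", d" party suffix from the candidate names
candidates <- sub(", d$", "", wide$Candidate)

# Transpose the numeric part so that each county becomes a row
votes <- t(as.matrix(wide[, -1]))
colnames(votes) <- candidates

# Rebuild a data frame with county names in the first column
results <- data.frame(County = rownames(votes), votes,
                      row.names = NULL, stringsAsFactors = FALSE)
print(results)
```

The result has one row per county and one column per candidate, which is the layout the rest of this tutorial assumes.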

You can do the same -- or, if you'd like to download the data file and all the other files I'm using, including R code, head to the "Mapping with R" file download page below. (Free Insider registration needed. Bonus: You'll be helping me convince my boss that I ought to write more of these types of tutorials). If you download and unzip the mapping with R file, look for NHD2016.xlsx in the zip file.

To make your R mapping script as re-usable as possible, I suggest putting data file names at the top of the script -- that makes it easy to swap in different data files without having to hunt through code to find where a file name appears. You can put this toward the top of your R script:

datafile <- "data/NHD2016.xlsx"

Note: My data file isn't in the same working directory as my R script; I have it in a data subdirectory. Make sure to include the appropriate file path for your system, using forward slashes even on Windows.

There are several packages for importing Excel files into R; but for ease of use, you can't beat rio. If it's not already on your system, install it with:

install.packages("rio")

and then run:

nhdata <- rio::import(datafile)

to store data from the election results spreadsheet into a variable called nhdata.

There were actually 28 candidates in the results; but to focus on mapping instead of data wrangling, let's not worry about the many minor candidates and pretend there were just two: Hillary Clinton and Bernie Sanders. Select just the County, Clinton and Sanders columns with:

nhdata <- nhdata[,c("County", "Clinton", "Sanders")]

Step 2: Decide what data to map

Now we need to think about what exactly we'd like to color-code on the map. We need to pick one column of data for the map's county colors, but all we have so far is raw vote totals. We probably want to calculate either the winner's overall percent of the vote, the winner's percentage-point margin of victory or, less common, the winner's margin expressed by number of votes (after all, winning by 5 points in a heavily populated county might be more useful than winning by 10 points in a place with way fewer people if the goal is to win the entire state).

It turns out that Sanders won every county; but if he didn't, we could still map the Sanders "margin of victory" and use negative values for counties he lost.

Let's add columns for candidates' margins of victory (or loss) and percent of the vote, again for now pretending there were votes cast only for the two main candidates:

nhdata$SandersMarginVotes <- nhdata$Sanders - nhdata$Clinton
nhdata$SandersPct <- nhdata$Sanders / (nhdata$Sanders + nhdata$Clinton) # Will use formatting later to multiply by 100
nhdata$ClintonPct <- nhdata$Clinton / (nhdata$Sanders + nhdata$Clinton)
nhdata$SandersMarginPctgPoints <- nhdata$SandersPct - nhdata$ClintonPct
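To see how these measures relate to each other, here is a small worked sketch on made-up numbers. The data frame toy, its two counties and their vote counts are hypothetical; the column names simply mirror the ones used in this step.

```r
# Hypothetical two-candidate results; counties and counts are invented.
toy <- data.frame(
  County  = c("A", "B"),
  Clinton = c(400, 700),
  Sanders = c(600, 300)
)

total <- toy$Clinton + toy$Sanders

# Margin in raw votes: positive where Sanders won, negative where he lost
toy$SandersMarginVotes <- toy$Sanders - toy$Clinton

# Each candidate's share of the two-candidate vote, on a 0-to-1 scale
toy$SandersPct <- toy$Sanders / total
toy$ClintonPct <- toy$Clinton / total

# Margin in percentage points, still on the 0-to-1 scale
toy$SandersMarginPctgPoints <- toy$SandersPct - toy$ClintonPct

print(toy)
```

County B shows how a loss turns up as a negative margin, and the two shares in each row add up to 1.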

Step 3: Get your geographic data

Whether you're mapping results for your city, your state or the nation, you need geographic data for the area you'll be mapping in addition to election results. There are several common formats for such geospatial data; but for this tutorial, we'll focus on just one: shapefiles, a widely used format developed by Esri.

If you want to map results down to your city or town's precinct level, you'll probably need to get files from a local or state GIS office. For mapping by larger areas like cities, counties or states, the Census Bureau is a good place to find shapefiles.

For this New Hampshire mapping project by county, I downloaded files from the Cartographic Boundary shapefiles page -- these are smaller, simplified files designed for mapping projects where extraordinarily precise boundaries aren't needed. (Files for engineering projects or redistricting tend to be considerably larger).

I chose the national county file at http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_5m.zip and unzipped it within my data subdirectory. With R, it's easy to create a subset for just one state, or more; and now I've got a file I can re-use for other state maps by county as well.

There are a lot of files in that newly unzipped subdirectory; the one you want has the .shp extension. I'll store the name of this file in a variable called usshapefile:

usshapefile <- "data/cb_2014_us_county_5m/cb_2014_us_county_5m.shp"

Several R packages have functions for importing shapefiles into R. I'll use tmap's read_shape(), which I find quite intuitive:

usgeo <- read_shape(file=usshapefile)

If you want to check to see if the usgeo object looks like geography of the U.S., run tmap's quick thematic map command: qtm(usgeo). This may take a while to load and appear small and rather boring, but if you've got a map of the U.S. with divisions, you're probably on the right track.

If you run str(usgeo) to see the data structure within usgeo, it will look pretty unusual if you haven't done GIS in R before. usgeo contains a LOT of data, including slots accessed with @ as well as more familiar entries accessed with $. If you're interested in the ins and outs of this type of geospatial object, known as a SpatialPolygonsDataFrame, see Robin Lovelace's excellent Creating maps in R tutorial, especially the section on "The structure of spatial data in R."

For this tutorial, we're interested in what's in usgeo@data -- the object's "data slot." (Mapping software will need spatial data in the @polygons slot, but that's nothing we'll manipulate directly). Run str(usgeo@data) and that structure should look familiar -- much more like a typical R data frame, including columns for STATEFP (state FIPS code), COUNTYFP (county FIPS code) and NAME (in this case county names, making it easy to match up with county names in election results).

Extracting geodata just for New Hampshire is similar to subsetting any other type of data in R. We just need the state FIPS code for New Hampshire, which turns out to be 33 -- or in this case "33", since the codes aren't stored as integers in usgeo.

Here's the command to extract New Hampshire data using FIPS code 33:

nhgeo <- usgeo[usgeo@data$STATEFP=="33",]
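The reason for quoting the code is easier to see with a state whose FIPS code starts with a zero. The sketch below uses a small made-up data frame, counties, as a stand-in for usgeo@data; it is a plain data frame rather than a spatial object, and it assumes the codes are stored as text.

```r
# Toy stand-in for the attribute table; a plain data frame, not spatial data.
counties <- data.frame(
  STATEFP = c("01", "33", "33"),
  NAME    = c("Autauga", "Belknap", "Carroll"),
  stringsAsFactors = FALSE
)

# Comparing against a string keeps leading zeros intact
nh <- counties[counties$STATEFP == "33", ]

# A number is converted to text before the comparison, so 1 becomes "1",
# which does not match "01"
al_wrong <- counties[counties$STATEFP == 1, ]
al_right <- counties[counties$STATEFP == "01", ]

print(nrow(nh))
print(nrow(al_wrong))
print(nrow(al_right))
```

Sticking with quoted codes means the same subsetting line works for every state, not just those whose codes have no leading zero.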

If you want to do a quick check to see if nhgeo looks correct, run the quick thematic map function again, qtm(nhgeo), and you should see something like this:

[Figure: basic map of New Hampshire counties. Credit: Sharon Machlis]

Still somewhat boring, but it looks like the Granite State with county-sized divisions, so it appears we've got the correct file subset.

Step 4: Merge spatial and results data

Like any database join or merge, this has two requirements: 1) a column shared by each data set, and 2) records stored exactly the same way in both data sets. (Having a county listed as "Hillsborough" in one file and FIPS code "011" in another wouldn't give R any idea how to match them up without some sort of translation table.)
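Before merging, it can help to check the second requirement directly. The following is a minimal sketch with hypothetical name vectors standing in for the county columns of the two data sets; the misspelling "Hillsboro" is invented to show what a mismatch looks like.

```r
# Hypothetical key columns from the two data sets
geo_names     <- c("Belknap", "Carroll", "Hillsborough")
results_names <- c("Carroll", "Belknap", "Hillsboro")

# Names that appear on one side only will not find a partner in a merge
only_in_geo     <- setdiff(geo_names, results_names)
only_in_results <- setdiff(results_names, geo_names)
print(only_in_geo)
print(only_in_results)

# After fixing the mismatch, confirm both sides hold the same set of names
results_names[results_names == "Hillsboro"] <- "Hillsborough"
same_keys <- setequal(geo_names, results_names)

# identical() is stricter because it also compares order, so sort first
same_sorted <- identical(sort(geo_names), sort(results_names))
print(c(same_keys, same_sorted))
```

An empty result from both setdiff() calls means every record on each side has a match on the other.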
