Python vs. R: The battle for data scientist mind share

Here’s how the general-purpose favorite of scientists stacks up against the stat head’s data-honed tool of choice


The boss’s boss looks out across the server farm and sees data—petabytes and petabytes of data. That leads to one conclusion: There must be a signal in that noise. There must be intelligent life in that numerical world—a strategy to monetize all those hard disks filling up with numbers.

That job falls on your desk, and you must now find a way to poke around the digital rat’s nest and find a gem to hand the boss.

How? If you’re a developer, there are two major contenders: R and Python. There are plenty of other solutions that help crunch data, and they live under rubrics like business intelligence or data visualization, but they are often full-service solutions. If they do what you want, you should choose them. But if you want something different, well, writing your own code is the only solution. Full-service tools do a good job when the data is cleaned, buffed, and ready, but they tend to hiccup and even throw up when everything is not quite perfect.

The difference between Python and R is largely philosophical. One is a full-service language developed by Unix scripters that happened to be adopted by stat heads, big data junkies, and social scientists. The other is a tool for data analysis designed and built by stat heads, big data junkies, and social scientists.

The crowds couldn’t be more similar, but the approach is very different. One is a general-purpose tool with many libraries that can help. The other is built specifically for big data analysis.

Which should you choose? Following is a head-to-head comparison to make your decision easier.

Python makes preprocessing easy

Some say 50 percent of data analysis is cleaning up the data beforehand; some say 99 percent. Whatever the exact figure, it’s nicer to clean data with a full-service language that will let you perform arbitrary tasks just in case you need to. Python is a full-service, imperative language, so you’re probably familiar with its structure and approach even if you’ve never used it. It’s easy to add new functions and new layers to take apart your data and clean it up. If these functions need local storage, access to a web service, or anything else that computer programs normally do, you can include it without too much fuss. It’s simply another language.
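To make that concrete, here is a minimal sketch of a cleanup pass using only Python’s standard library. The column names (`name`, `value`), the list of missing-value markers, and the sample data are all invented for illustration; real data will need its own rules.

```python
import csv
import io

# Hypothetical markers treated as "missing" in this sketch.
MISSING = {"", "na", "n/a", "null", "-"}

def clean_rows(text):
    """Parse CSV text, trim whitespace, and drop rows whose
    'name' is empty or whose 'value' is missing or not numeric."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = (row.get("name") or "").strip()
        raw = (row.get("value") or "").strip()
        if not name or raw.lower() in MISSING:
            continue  # skip incomplete rows
        try:
            # Remove thousands separators before converting.
            value = float(raw.replace(",", ""))
        except ValueError:
            continue  # skip values that are not numbers
        cleaned.append({"name": name.lower(), "value": value})
    return cleaned

# Invented sample input with the usual kinds of mess.
sample = 'name,value\n Alice ,"1,200"\nBob,N/A\n,5\nCarol,3.5\nDave,abc\n'
print(clean_rows(sample))
```

Only the Alice and Carol rows survive. Each rule is an ordinary line of code, so adding another one (a lookup against a web service, say) is a matter of adding another function.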

R lets you preprocess with anything

Yes, Python makes preprocessing easy, but that doesn’t mean you can’t use R if you need to clean up your data. You can use any language. In fact, in many cases, it’s structurally unsound to mix up your data purification routines with your analysis routines. It’s better to separate them. And if you’re going to separate them, why not use any language you like? That may indeed be Python, but it could be Java, C, or even assembly code. Or maybe you even want to preprocess your data within the database or some other storage layer. R doesn’t care.

Python has a bazillion libraries

Python is popular, and the statistics for the common repositories make it clear. The Python Package Index (PyPI) has 102,199 packages on offer as I’m writing this, and the number will almost certainly be larger by the time you read it. That’s the tip of the iceberg because the code is everywhere, from GitHub to social science websites. There’s plenty of good Python code out of view of PyPI. Almost all of it is open source and available to make your life a bit easier.

R has a bazillion libraries for statistical analysis

R has packages, too. The Comprehensive R Archive Network (CRAN) offers 10,033 packages at this writing, and as with Python, the list is getting longer. These packages are built for one task only: statistical analysis of data. There aren’t packages for, say, running file system checks or server maintenance because R doesn’t do that. While there may be a few duds—as there are in all open source repositories—the code has been for the most part written and vetted by statisticians.

Python is evolving

If you say “le weekend” in France, everyone understands. That’s what it means to be a living language. Python is evolving and getting better, like French. The jump from version 2.x to version 3.0 broke old code, yes, but many Python lovers say the change was worth it. A living pile of code gets better, even if it breaks old code. A living language means that people want to use and improve it. That translates into more open source code and more solutions. Just as political memes on Facebook are the price we pay for democracy, shifting standards and broken code are the cost of working with a popular, evolving, and living language.

R is staying pure

It’s not fair to say that R isn’t changing. Indeed, it is a variant of S with lexical scoping to make large code bases cleaner. Even then, many are able to run S code in an R interpreter. There simply aren’t paradigm shifts of the magnitude that leave Python programmers scratching their heads about whether the code base is 2.x or 3.x. R code is more likely to stay familiar and less likely to break. There are no guarantees, because R is living too, but the steps are not as large or revolutionary.

Python does everything any language can do

Python is a general-purpose language designed by programmers to do everything a programmer wants to do. This isn’t the same as being Turing-complete. The Game of Life is Turing-complete, but you wouldn’t want to write a function to calculate a Fibonacci sequence with it. You can usually accomplish something with many of the options out there, but Python is designed to make it easy. Python is designed for real projects filled with lots of code. This may not seem useful at the start of a project when all you need to do is write a few lines to clean up some tiny detail. But it becomes important later when those few lines grow into thousands, and the mess turns into spaghetti code. Python was built for bigger projects, and you might need those features, eventually.
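For comparison, here is roughly what that Fibonacci function looks like in Python. It is a small sketch; the function name and the choice to start the sequence at 0 are ours.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0 and 1."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b  # advance the pair one step
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```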

R does statistics well

R was built for statistical analysis. What is the job on your desk? Statistical analysis? Look no further. Choose the right tool for the job. A wrench can do the job of a hammer, but a hammer was built to do a hammer’s job.

Python has the command line

Kids raised on pointing and clicking usually can’t grok the command line at first, but eventually they understand the power and expressiveness of a command line coupled with a good keyboard. The combinatorics of language are so amazing. You have to click through dozens of menu pages to accomplish the same thing that a good string of characters can do with the command line. Python lives in this world. Python is built for the command line and flourishes there. The technology may look incredibly dated, but it is efficient and powerful.

R has that and RStudio

R is also built around a command line of sorts, although it builds up quite a bit of state inside it. But many people work inside environments like RStudio or R Commander, two worlds that wrap up everything in a nice bow. There’s a command line but also a data editor, some debugging support, and a window to hold graphics. The Python world has been trying to catch up lately by working with existing IDEs like Eclipse or Visual Studio.

Python has the web

Perhaps it is only natural that there will be websites for developing Python, a scripting language that has co-evolved with Unix web servers. There’s Rodeo and Jupyter for starters, and more will probably come. It’s easy to link up port 80 with the interpreter, so Python works quite well with the web. Yes, you can use Jupyter for other languages like Scala, Julia, and R if you must, but remember how the name is spelled if you’re wondering who got there first.

R loves LaTeX

Many people who use R to crunch their data also use LaTeX to write the papers reporting the signals found in said data. It’s only natural that someone would create Sweave, a very clever system that mixes data analysis with paper layout. Your R commands for analyzing the data and creating graphs are mixed into your text reporting the results. It’s all in one place, minimizing the danger of corruption or sticky caches. You push one button, and the software re-analyzes your data and dumps the result in the final document.

Use both

Why not make the best of both worlds, as many data scientists already do? The first stage of data aggregation can be accomplished with Python. Then the data is fed into R, which applies the well-tested, optimized statistical analysis routines built into the language. It’s as if R is a library for Python. Or maybe Python is a preprocessing library for R. Choose the best language for the particular layer and build up a layer cake. Is Python the frosting and R the cake? Or is it the other way around? You decide.
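One simple way to build the Python layer is to aggregate the raw records and hand R a tidy CSV. The sketch below assumes an invented data shape (pairs of region and amount) and invented column names. The R side is left out, but text like this is what R’s `read.csv` is meant to load.

```python
import csv
import io
from collections import defaultdict

def aggregate_to_csv(records):
    """Sum amounts per region and return the totals as CSV text."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", "total"])  # header row for the R side
    for region in sorted(totals):
        writer.writerow([region, totals[region]])
    return out.getvalue()

# Hypothetical raw records gathered in the first stage.
records = [("east", 10.0), ("west", 4.5), ("east", 2.5)]
csv_text = aggregate_to_csv(records)
print(csv_text)
```

Writing the same text to a file instead of a string gives the R stage a plain input it can read without knowing anything about the Python code that produced it.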