Python vs. R — Choosing the Best Programming Language for Data Science
A wrestling match unlike any other.
After a few years of programming in both Python and R, I still struggle with this question: which language is the best one to use for Data Science? I like to think of myself as a technologist with a Statistics degree who is dabbling in Data Science. But even I can’t shy away from the easy-to-use aspects of R. The allure of the R programming language stayed with me even as I ventured into Pandas, NumPy and SciPy. Python’s robust machine learning packages frankly amaze me. It’s a one-stop shop. At the same time, it takes me less than 30 minutes to run a simple statistical analysis in R to explore my dataset. Machine Learning and Deep Learning packages are now becoming the norm in R as well.
For a while, like any data enthusiast, I tested both programming languages side by side on everyday data science tasks. Using them across datasets large and small has really made a difference in how I look at each programming language.
Before we delve into the programming languages, I want you to understand that a Data Scientist or Data Analyst or a Data Enthusiast does not actually develop software.
He or she might inform the process of software development and help software developers with data logic that’s needed inside the software, but he or she does not actually develop software.
There’s a real difference between programming for software development and programming for data analysis.
Software Development requires extensive design of code for simplicity and efficiency. Object-oriented languages tend to lend themselves to software development simply because the code is more scalable as the system grows.
Data Science Programming requires the ability to do everything it takes to analyze the data. By everything, I mean using a multidisciplinary set of knowledge from almost every walk of life to figure out the true nature of the data. The code is used to solve an equation in the form of a dataset.
Functional Programming vs. OOP
A functional programming language is a language that:
- treats computation as the evaluation of mathematical functions
- avoids changing state and mutable data
- expresses programs with expressions or declarations rather than statements
- keeps functions pure: a function’s return value depends only on its arguments
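The properties listed above can be illustrated with a minimal Python sketch. The function names here are hypothetical and chosen only for illustration. The first function is pure, so repeated calls with the same arguments give the same result. The second reads and changes shared state, so the same call can return different values.

```python
def scale(values, factor):
    """Pure: builds a new list and never modifies its inputs."""
    return [v * factor for v in values]


running_total = 0  # shared mutable state


def add_to_total(value):
    """Impure: the result depends on, and changes, a global variable."""
    global running_total
    running_total += value
    return running_total


data = [1, 2, 3]
first = scale(data, 2)
second = scale(data, 2)      # same arguments, same result

call_one = add_to_total(5)
call_two = add_to_total(5)   # same argument, different result
```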
Functional Programming is good when you have a fixed set of things. As your code evolves, you add new operations on existing things.
In contrast, in Object-Oriented Programming, data can be mutable as well as immutable. Programming is done with statements and expressions. Global program state can affect a function’s resulting value.
OOP is good when you have a fixed set of operations on things. As your code evolves, you add new things.
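One way to picture this trade-off is the rough sketch below, which uses hypothetical shape types. In the functional style, adding a new operation means writing one new function over the same data. In the object-oriented style, adding a new thing means writing one new class that implements the existing operations.

```python
import math
from dataclasses import dataclass


# Functional style: a fixed set of "things" (tuples tagged by kind).
# A new operation is just one more function over the same data.
def area(shape):
    kind, size = shape
    if kind == "circle":
        return math.pi * size ** 2
    if kind == "square":
        return size ** 2
    raise ValueError(f"unknown shape: {kind}")


def perimeter(shape):  # new operation, added without touching area()
    kind, size = shape
    if kind == "circle":
        return 2 * math.pi * size
    if kind == "square":
        return 4 * size
    raise ValueError(f"unknown shape: {kind}")


# Object-oriented style: a fixed set of operations (here, just area()).
# A new thing is one more class that implements them.
@dataclass
class Square:
    side: float

    def area(self):
        return self.side ** 2


@dataclass
class Rectangle:  # new thing, added without touching Square
    width: float
    height: float

    def area(self):
        return self.width * self.height


shapes = [Square(2), Rectangle(2, 3)]
areas = [s.area() for s in shapes]
```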
Why does Data Science lend itself to Functional Programming?
Data Science’s objective is often to solve a problem, and the work is often functional in nature. Models themselves are essentially equations where the same inputs need to produce the same return values. Even in deep learning, the data itself does not change: new values are added, but the existing data stays the same. The immutable state is essential for the output of the model to be consistent. Functional Programming is all about chaining together functions to operate over a simple data structure. This design makes it easy to implement parallelism. In any machine learning or deep learning project, parallelism is essential when working with large sets of data.
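As a small illustration of chaining functions over immutable data, here is a sketch in Python. The helper `compose` and the two step functions are hypothetical names, and a thread pool is used only to keep the example self-contained. For CPU-heavy work, a process pool or a distributed framework would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def normalize(x):
    """Scale a raw value into 0..1 (assumes raw values lie in 0..100)."""
    return x / 100


def square(x):
    return x * x


def compose(*funcs):
    """Chain functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


pipeline = compose(normalize, square)

raw = (10, 50, 100)  # a tuple, so the input cannot be modified in place

# Sequential and parallel runs agree because the pipeline is pure:
# no step reads or writes shared state.
sequential = [pipeline(x) for x in raw]
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(pipeline, raw))
```

Because no step depends on anything except its input, the work can be split across workers in any order without changing the result.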
The Nature of R and Python
Python is an interpreted, high-level, general-purpose programming language. You can do some functional programming in Python, but Python is not a functional programming language. It does not meet the technical specifications of “purity” in the context of a functional programming language. Python caters to many more object-oriented use cases and is actually a good language for Object-Oriented Programming. Because Python is versatile, you will find that it is often used in Software Development. Even though it’s not a strictly functional programming language, it has robust packages for Data Science.
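A short sketch of what “some functional programming” looks like in Python, using only the standard library. The variable and function names are made up for illustration. The first half uses anonymous functions with `filter`, `map`, and `reduce`. The second half shows that nothing in the language enforces purity.

```python
from functools import reduce

readings = [3, 8, 1, 6]

# Functional style: anonymous functions and higher-order built-ins.
evens = list(filter(lambda x: x % 2 == 0, readings))
doubled = list(map(lambda x: x * 2, evens))
total = reduce(lambda a, b: a + b, doubled, 0)


# Nothing enforces purity, though: this function mutates its argument.
def drop_first(items):
    items.pop(0)
    return items


snapshot = list(readings)  # copy taken before the mutation
drop_first(readings)       # readings itself is now one element shorter
```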
R is primarily a functional programming language. It contains many tools for the creation and manipulation of functions. You can do anything with functions that you can do with vectors. Anonymous functions give you the ability to use functions without giving them a name. This makes possible the chaining of functions that is useful in machine learning and deep learning. Almost all R objects are immutable. R environments, however, are mutable. R has robust visualization tools such as ggplot2, lattice, and base graphics (plot). Statisticians use R to visualize data. Often, a quick visualization can provide insights into the data that lead to further statistical analysis.
Which One is Better: R vs. Python?
In the real world, it’s often difficult to choose R or Python for all of your Data Science efforts.
At the end of the day, the purpose of the programming language is to allow for the simplest and the most efficient code to be used for the job at hand.
Personally, for my Data Science projects, I have taken to using both R and Python, in conjunction with other languages, for the different steps of the Data Science process.
Exploration of Unstructured Data
By a commonly cited estimate, about 80% of the world’s data is unstructured. Text, video, and images are all unstructured data. Python has a multitude of packages on PyPI, such as NLTK and scikit-image, for natural language processing, image processing, and voice analysis. Making sense of unstructured data often means that the data needs to be converted into structured data. Python is very useful for this conversion.
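To show what converting unstructured data into structured data can mean in practice, here is a minimal sketch that turns free text into one record per document. It deliberately uses only the standard library (`re` and `collections.Counter`) as a stand-in for a full NLP package such as NLTK. The sample sentences and field names are invented for illustration.

```python
import re
from collections import Counter

documents = [
    "R makes quick statistical analysis easy.",
    "Python makes machine learning easy, and Python scales.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# One row per document: a structured record built from free text.
rows = [
    {
        "doc_id": i,
        "n_tokens": len(tokens),
        "top_word": Counter(tokens).most_common(1)[0][0],
    }
    for i, tokens in enumerate(map(tokenize, documents))
]
```

Rows like these can then be loaded into a table for the cleaning and analysis steps that follow.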
Data Cleaning: Structured Data or Semi-Structured Data
With large sets of data, Python is unbeatable in data cleaning. You can use packages such as Pandas and NumPy to easily clean up large sets of data. Frequently, I will also use Perl one-liners for specific data cleaning purposes.
The combination of the two often produces “clean data” in a short span of time. This way, most of my Data Science effort can be focused on Analysis.
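As a rough idea of what such a cleaning step can look like, here is a small Pandas sketch. The column names and values are made up for illustration, and a real dataset would need its own rules. It drops rows with a missing key field, normalizes text, coerces a numeric column, and removes duplicates.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": [" Boston", "boston ", "Chicago", None],
    "temp_f": ["71", "71", "n/a", "65"],
})

clean = (
    raw
    .dropna(subset=["city"])  # drop rows with no city
    .assign(
        # trim whitespace and normalize capitalization
        city=lambda d: d["city"].str.strip().str.title(),
        # convert to numbers; unparseable values such as "n/a" become NaN
        temp_f=lambda d: pd.to_numeric(d["temp_f"], errors="coerce"),
    )
    .drop_duplicates()        # the two Boston rows are now identical
    .reset_index(drop=True)
)
```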
Exploration and Modeling in R
Once you have structured or semi-structured data, it’s much easier to do data exploration in R. I can write clean code for a multitude of statistical analyses to get to know my data: ANOVA, multivariate correlations and regressions, factor analysis, and geostatistics. It’s also easy to use the visualization packages to visualize the data to help with my analysis. Logistic regression and time series analysis are both simple to implement in R with easy visualizations. Feature selection is easily done in R using the caret and fscaret packages. Model selection is easily implemented. Machine learning models such as LDA, CART, kNN, SVM and RF are all easily implemented in R, and each algorithm has its own packages. Training on the dataset and cross-validation take just a few lines of code. Even deep learning is now an easier endeavor in R, thanks to the Keras library with TensorFlow.
Exploration and Modeling in Python
Data exploration and modeling are not limited to R. Python has packages such as NumPy, Matplotlib and Pandas that can help with the data exploration process. Seaborn is used for visualization in much the same way as ggplot2 in R. SciPy provides all you need for traditional statistical analysis. scikit-learn provides machine learning algorithm implementations, cross-validation and more. With Keras, TensorFlow and PyTorch, deep learning in Python is now also a much easier process. Machine Learning and Deep Learning often mean that you are working with large sets of data that sit in the Cloud. Most likely, the infrastructure aspect of a Data Science project will drive any Data Scientist to AWS, Azure, or Google Cloud. This will mean that Python will be the default language to use in such large-scale Data Science projects.
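To make the idea of cross-validation concrete, here is a plain-Python sketch of k-fold splitting and scoring. It is not the scikit-learn API. The helper `k_fold_indices`, the toy data, and the trivial “predict the training mean” model are all assumptions made for illustration.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy target values and a deliberately trivial "model":
# predict the mean of the training fold.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

fold_errors = []
for train, test in k_fold_indices(len(y), k=3):
    prediction = mean(y[i] for i in train)
    fold_errors.append(mean(abs(y[i] - prediction) for i in test))

cv_score = mean(fold_errors)  # mean absolute error averaged over the folds
```

Each observation is held out exactly once, so the averaged score reflects how the model does on data it was not fitted to.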
In conclusion, working with real-world data presents complex problems. These problems often can’t be solved with one programming language alone. Understanding the nature of R and Python can help any programmer, data scientist or data analyst choose the best programming language for the task at hand. The hybrid nature of tasks in Data Science means that there will always be a wrestling match between Python and R.
That is a good thing. The competing nature of the two languages might help us produce the simplest and the most efficient code for our purposes.