Data Science sounds like something cool and awesome. It’s pictured as something cool and awesome. It is a sexiest job of 21st century as we all know (I won’t even add the link to that article :D). All the cool terms are related to this field — Machine Learning, Deep Learning, AI, Neural Networks, algorithms, models…
But all this is just a top of an iceberg. 70–80% of our work is data preprocessing, data cleaning, data transformation, data reprocessing — all these boring steps to make our data suitable for the model that will make some modern magic.
And today I would like to list all the methods and functions that can help us to clean and prepare the data.
So what can be wrong with our data? A lot of things actually:
- Irrelevant column names
- Missing data
- Columns that have to be processed
- Unexpected data values
For the quick overview we can use following methods and attributes of a DataFrame:
df.head() # show first 5 rows df.tail() # last 5 rows df.columns # list all column names df.shape # get number of rows and columns df.info() # additional info about dataframe df.describe() # statistical description, only for numeric values df['col_name'].value_counts(dropna=False) # count unique values in a column
The output of at least one of these will give us first clues where we want to start our cleaning.
Another way to quickly check the data is by visualizing it. We use bar plots for discrete data counts and histogram for continuous.
df['col_name'].plot('hist') df.boxplot(column='col_name1', by='col_name2')
Histogram and box plot can help to spot visually the outliers. The scatter plot shows relationship between 2 numeric variables.
df.plot(kind='scatter', x='col1', y='col2')
Visualizing data can bring some unexpected results as it gives you a perspective of what’s going on in your dataset. Kind of view from the top.
Tidy data is the data obtained as a result of a process called data tidying. It is one of the important cleaning processes during big data processing and is a recognized step in the practice of data science. Tidy data sets have structure and working with them is easy; they’re easy to manipulate, model and visualize. Tidy data sets main concept is to arrange data in a way that each variable is a column and each observation (or case) is a row.
Tidy data provide standards and concepts for data cleaning, and with tidy data there’s no need to start from scratch and reinvent new methods for data cleaning.
- Each variable you measure should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each “kind” of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
- (all these taken from Wikipedia)
To transform our data and make it tidy we can use melting.
pd.melt() # transform columns to rows
There are two parameters you should be aware of: id_vars
. The id_vars
represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars
represent the columns you do wish to melt into rows. By default, if no value_vars
are provided, all columns not set in the id_vars
will be melted.
new_df = pd.melt(df, id_vars = 'do not melt')
The opposite operation to melting is pivoting. Here we turn unique values into separate columns. We use it when we want to transform our data from analysis shape to reporting shape, from easy-machine-readable to easy-human-readable form.
df.pivot(index='date', columns='element', values='value')
Although, this method cannot handle duplicate values. When this is a case we should use .pivot_table()
that has an additional parameter — aggfunc
, which will handle those duplicates based on a function we provide (sum, count, mean, min, max or user defined function).
df.pivot_table(index='date', columns='element', values='value', aggfunc=np.mean)
Sometimes we might have data stored incorrectly. For example, values for the gender groups that stored as ‘m014’, ‘f014’, ‘m1528’, ‘f1528’. You cannot tell it’s completely wrong, but it would be better to split these values into ‘gender’ and ‘age’ columns. To do this we use Python slicing sintaxis by accessing .str attribute of column of object type.
df['gender'] = df.variable.str df['age_group'] = df.variable.str[1:]
Also data might not come in one huge file and be separated into few different chunks, so we have to be able to concatenate all that data and clean it or clean the first sample and then apply the same process on remaining parts. To do this we can use pandas .concat
method, which provided with the list of dataframes will concatenate them all. By default it will store the original indexes what will result in duplicate index values. To prevent this we have to reset index of a new dataframe by passing an additional parameter ignore_index=True.
concatenated = pd.concat([df1, df2], ignore_index=True)
But what if we have thousands of files? It’ll be dumb to import them one by one, clean them and repeat it again. And we are not stupid, we know loops in Python. The only missing part is to find all those files for import. We can do this with glob library. So the process will be the following: write a pattern, save all files into a list, iterate over csv files, import each file and concatenate the dataframes into one. Doesn’t seem that difficult, but with the code sample it’s much better:
# Import necessary modules import glob import pandas as pd
# Write the pattern: pattern pattern = '*.csv'
# Save all file matches: csv_files csv_files = glob.glob(pattern)
# Create an empty list: frames frames = 
# Iterate over csv_files for csv in csv_files:
# Read csv into a DataFrame: df df = pd.read_csv(csv)
# Append df to frames frames.append(df)
# Concatenate frames into a single DataFrame: final_df final_df = pd.concat(frames)
Merging is the same as SQL join operation. You combine two or more tables into one by key which is present in every table. There are three types of joins: one-to-one, one-many, many-to-many. In SQL this process is advanced, with a lot of options, modifications, where you have to explicitly specify what and how you want to join. Here, all is done for you by one function and the type of join will only depend on a data in a dataframe. If the column name to be used as a key is the same in both dataframes ‘on’
parameter is used.
merged = pd.merge(left=df1, right=df2, on=None, left_on='col1', right_on='col2')
Converting data types
It is very important to have data with correct data types as later it may play a bad joke with you and your analysis. Remember how once I didn’t convert one column to the correct data type and spent 1 hour trying to subtract float from string :D. So be careful :D. That’s a typical error in data, which can be fixed with pd.to_numeric()
and with errors=’coerce’
it will convert all the err values into NaNs.
To convert data we can use .astype()
method on a series. Also keep in mind the ‘category’ type — it reduces size of a dataframe and makes computations faster. We can convert to any value that can be used as category — days of the week, gender, continent abbreviations — depends on a context.
df['column1'] = df['column1'].astype(str) df['column1'] = df['column1'].astype('category') df['column1'] = pd.to_numeric(df['column1'], errors='coerce')
Duplicates and missing values
Our favourite part, isn’t it? To drop duplicates we can use druuuuuuuuuums drop_duplicates()
df = df.drop_duplicates()
With missing values it is little bit more complicated. In general there are three ways to deal with missing data:
- leave as-is
- drop them
- fill missing values
To drop missing values we can use .dropna()
, but careful with it — it may delete up to 50% of your data — which is not really good. But again, depends on a context.
To fill missing values we use .fillna()
, careful with it as well — if we fill missing values they have to be reasonable and make sense.
There is a wonderful article on handling missing data and I really have nothing to add on this part. Here is thelink and don’t forget to give an author a round of applause, because that work is just amazing.
We can also programmatically check our data using assert statements. It will return nothing if the result is True and error otherwise.
assert 1 == 1 # returns nothing assert 1 == 2 # returns error assert df.notnull().all().all() # returns error if at least one column has one missing value
I skipped regular expressions as it deserves a separate article, but in general summarized tools I use for data preprocessing. Please, let me know if something else can be added, have a great day and use data science for good