A framework for separating insight from noise in your data.
Oftentimes, as a data scientist or machine-learning engineer, the key to gaining more insight into a problem is examining the data. It’s important to understand the problem before a solution can be determined, and to get the most insight, it helps to dive into the data and analyze it from different angles. This stage is typically called exploratory data analysis (EDA).
An EDA consists of multiple steps to analyze the data. The type of data available (image, text, numerical, audio, etc.) determines how an EDA should be approached. For the purposes of this post, the framework is focused on tabular data with scalar and discrete features. The dataset used throughout the post is available from Kaggle (great site) here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The objective of this Kaggle competition is to use the features to predict housing prices. The label to be predicted is SalePrice. To understand how to best forecast the sale price of a home, an EDA will help uncover the features and transformations that may need to be performed to increase the accuracy of a model of the data.
Through this EDA we will take the following steps:
- Load the data (sounds silly but half the battle is getting data sometimes)
- Explore the features
- Analyze the missing features
- Look at the distributions of the data
- Find any correlations between variables
- Examine outliers
Load the Data
This step is quick and easy: download the data, load it, and the EDA may begin.
import pandas as pd

# This file was downloaded from the link at the top of the page
data = pd.read_csv('train.csv')
Explore the Features
Here the analysis is focused on all the different features that are contained in the dataset.
Listing the DataFrame’s columns will show all the different features in the dataset:
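The original command isn’t reproduced in the post; the most likely candidate is pandas’ columns attribute. A minimal sketch, using a tiny stand-in DataFrame (the real training file has 81 columns):

```python
import pandas as pd

# Tiny stand-in for the Kaggle training set (assumption: real file is train.csv)
data = pd.DataFrame({
    'OverallQual': [7, 6, 7],
    'GrLivArea': [1710, 1262, 1786],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],
    'SalePrice': [208500, 181500, 223500],
})

# List out all the features (column names) in the dataset
print(list(data.columns))
# → ['OverallQual', 'GrLivArea', 'Neighborhood', 'SalePrice']
```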
Next taking a look at the data types of each feature
# Prints out data types for all the features
data.dtypes

# Prints out how many of each data type are present in the data
data.dtypes.value_counts()
Running these commands is important for understanding which variables are scalar (numbers) and which are discrete (categorical).
This is a good point to look at summary statistics of all the features using the command below:
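The command itself is not shown in the post; it is almost certainly pandas’ describe(), sketched here on a small stand-in DataFrame since the full training file isn’t bundled with the post:

```python
import pandas as pd

# Small stand-in for the training set (assumption: real data comes from train.csv)
data = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786, 1717],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric features
summary = data.describe()
print(summary)
```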
describe() is a method built into pandas DataFrames that provides summary statistics such as the mean, median, percentiles, etc.
Analyze the Missing Features
Now that there’s a better understanding of the individual features, the next step is analyzing the missing values. This step is important because values can be missing for multiple reasons: instrumentation failure, data-collection error, or sometimes the value simply doesn’t exist. There are also multiple ways to replace missing values, and the right choice depends on several factors, including how many values are missing.
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

# Display the features with the most missing values
missing_data.head(20)
This output gives a rundown of the top missing features:
Here we see that there are some features with quite a lot of missing data.
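One common follow-up, sketched below on a synthetic frame, is to drop features whose missing fraction exceeds some threshold and impute the rest. The 15% cutoff and the median imputation are illustrative assumptions, not rules from the competition:

```python
import numpy as np
import pandas as pd

# Stand-in frame: one heavily missing feature, one lightly missing feature
data = pd.DataFrame({
    'PoolQC': [np.nan] * 7 + ['Gd'],                               # 87.5% missing
    'LotFrontage': [65.0, 80.0, np.nan, 60.0, 84.0, 85.0, 75.0, 70.0],  # 12.5% missing
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Fraction of missing values per feature
percent = data.isnull().mean()

# Drop features missing in more than 15% of rows (illustrative threshold)
to_drop = percent[percent > 0.15].index
cleaned = data.drop(columns=to_drop)

# Fill remaining numeric gaps with the column median (one simple strategy among many)
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
print(cleaned.columns.tolist())
```

Whether dropping or imputing is appropriate depends on why the values are missing in the first place, which is exactly what this step of the EDA is meant to surface.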
Look at the distributions of the data
To get insight into what the data looks like at a higher level, the data can be graphed with distribution plots. Visualization helps greatly at this step because it allows the data to be seen all at once. The target variable is a good place to start.
# import a great library for graphing called seaborn
import seaborn as sns

# used in a jupyter notebook to display the graph in the notebook
%matplotlib inline

# plot the distribution of the target variable
sns.histplot(data['SalePrice'], kde=True)
This shows that the target deviates from a normal distribution: it has positive skewness and high peakedness (kurtosis). These characteristics suggest that we may want to perform some sort of transformation on the label to normalize it and improve our ability to predict it.
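A common normalizing transformation for a right-skewed, strictly positive target like SalePrice is log1p. The sketch below uses a synthetic right-skewed sample (an assumption, since the real data isn’t bundled here) to show the skewness dropping after the transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal, like many price distributions)
prices = pd.Series(np.exp(rng.normal(loc=12, scale=0.4, size=1000)))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()  # log1p handles zero safely, unlike plain log

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

If the label is transformed for modeling, remember to invert the transform (np.expm1) on the model’s predictions.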
Find any correlations between variables
After looking at all the various features, their statistics, and which values are missing, it’s time to look at feature correlation. This is important because it indicates which features correlate with the label and which features can give the most insight for future forecasting. It is also helpful to look at the relationships between the features themselves, since collinear features may be giving insight into the same underlying variable.
import matplotlib.pyplot as plt

# numeric_only=True restricts the correlation to numeric columns (required on newer pandas)
correlation_matrix = data.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(correlation_matrix, vmax=.8, square=True)
The graph below shows the correlation between all the features; the color scale indicates how strongly the features are correlated.
In the graph above, features that are most correlated appear lighter in color, and vice versa. The plot above indicates there are some highly correlated features such as GarageYrBlt (garage year built) and YearBuilt (year the house was built). This makes sense because in most cases the garage was built at the same time as the house.
The most interesting features here are the ones that are related to SalePrice. OverallQual (overall quality of the house) is highly correlated to the sale price of the house. This feature would be a good candidate in a model that can predict the price of the house.
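To read the heatmap numerically, one can sort the SalePrice column of the correlation matrix. This is sketched on a synthetic frame where quality drives price strongly and area weakly (the feature names are borrowed from the real dataset for flavor):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic numeric features: price depends strongly on quality, weakly on area
qual = rng.normal(size=500)
area = rng.normal(size=500)
noise = rng.normal(size=500)
data = pd.DataFrame({
    'OverallQual': qual,
    'GrLivArea': area,
    'SalePrice': 3 * qual + 0.5 * area + noise,
})

# Features ranked by correlation with the label (SalePrice itself is trivially 1.0)
ranked = data.corr()['SalePrice'].sort_values(ascending=False)
print(ranked)
```

On the real dataset, the same one-liner surfaces OverallQual and GrLivArea near the top, matching what the heatmap shows visually.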
Examine Outliers
Understanding and dealing with outliers is another challenge in the process. Sometimes there are values in the data that fall outside the normal range compared to the rest of the data. An example is a house whose sale price is far higher than that of other houses with the same features. There may be various reasons for outliers, such as a mis-recorded feature. An easy way to look at outliers is to plot a bivariate graph. In this case there are a couple of outliers in GrLivArea:
var = 'GrLivArea'
outlier = pd.concat([data['SalePrice'], data[var]], axis=1)
outlier.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))
In the graph above there are a couple of outliers with a GrLivArea much higher than the rest of the rows and a relatively low sale price. A possible explanation for these outliers is large agricultural properties in a relatively cheap market. Whether to keep outliers in your dataset is a matter of understanding your data and your domain: in some cases it makes sense to keep them, and in others to remove them. This is where the EDA can be truly useful.
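If, after inspection, the decision is to remove them, a simple boolean filter works. The GrLivArea > 4000 cutoff below is just the threshold those two points visually suggest, not an official rule, and the rows are a stand-in for the real data:

```python
import pandas as pd

# Stand-in rows: two typical houses plus one huge-but-cheap outlier
data = pd.DataFrame({
    'GrLivArea': [1710, 1262, 5642],
    'SalePrice': [208500, 181500, 160000],
})

# Drop rows whose living area is implausibly large for their price (illustrative cutoff)
filtered = data[~((data['GrLivArea'] > 4000) & (data['SalePrice'] < 300000))]
print(len(filtered))  # the outlier row is gone
```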
By using the framework provided in this blog post, certain patterns and trends will emerge that can be explored further. After completing an EDA, you should have a better understanding of your data, which will make it easier to build a model that solves your underlying problem.
Finally, if looking at data, analyzing it, and modeling it sounds interesting to you, we’re always looking for like-minded data scientists at Apteo. Send us a message at firstname.lastname@example.org for more information and open roles.