How to Approach Data: Tabular Data

explore the data

Intro

Often times as a data scientist or machine-learning engineer the key to gaining more insight into a problem is examining data. It’s important to understand the problem before a solution can be determined. To get the most insight it’s key to jump into the data, and analyze it from different angles. Typically this stage is called the exploratory data analysis (EDA).

Goal

The objective of this Kaggle competition is to use features to predict housing prices. The label that is to be predicted is the SalesPrice label in the dataset. To understand how to best forecast the sales price of a home, an EDA will help uncover the features and transformations that may need to be performed to help increase the accuracy of modeling the data.

EDA

Through this EDA we will take the following steps:

  1. Load the data (sounds silly but half the battle is getting data sometimes)
  2. Explore the features
  3. Analyze the missing features
  4. Look at the distributions of the data
  5. Find any correlations between variables
  6. Examine outliers

Load the Data

This step is quick and easy, by downloading the data and loading it, the EDA may begin.

import pandas as pd# This file was downloaded from the link at the top of the page
data = pd.read_csv('train.csv')

Explore the Features

Here the analysis is focused on all the different features that are contained in the dataset.

data.columns
# Prints out data types for all the features
data.dtypes
# Prints out how many of each data type are present in the data.
data.dtypes.value_counts()
A good mix of scalar and discrete features
data.describe()
describe() displays some summary statistics for the scalar features

Analyze the Missing Features

Now that there’s a better understanding of the individual features, the next step is analyzing the missing values. This step is important because there are multiple reasons why there are missing values: instrumentation failure, data collection error, sometimes it plain doesn’t exist. There are also multiple ways to try to replace the missing values, this depends on a few different factors including how many values are missing.

total = data.isnull().sum().sort_values(ascending = False)
percent = (data.isnull().sum() / data.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
missing_data.head(20)
PoolQC is the most missing feature

Look at the distributions of the data

To get insight into what the data looks like at a higher level, the data can be graphed on distribution plots. Visualization helps greatly at this step, it allows for the data to be seen all at once. Looking at the target variable is a pretty good place to start.

# import a great library for graphing called seaborn
import seaborn as sns
# used in a jupyter notebook to display the graph in the notebook
%matplotlib inline
sns.distplot(data['SalePrice'])
distribution of the target variable

Find any correlations between variables

After looking at all the various features, their metrics, and which values are missing, it’s time to look at feature correlation. This is important because it will give you an indication of which features correlate with the label and which features can help give the most insight into future forecasting. It is also helpful to look at the relationship between the features them selves, since sometimes there are collinear features that may be giving insight into the same variable.

import matplotlib.pyplot as pltcorrelation_matrix = data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(correlation_matrix, vmax=.8, square=True)
Correlation Matrix

Examine Outliers

Understanding and dealing with outliers is another challenge in the process. Sometimes there are values in our data that are outside of the normal range, when compared to the rest of the data. An example of this is a house that’s sale price is way higher then the rest of the houses with the same features. There may be various reasons for outliers, such as a mis-recorded feature. An easy way to look at the outliers is to plot a bivariate graph. In this case there are a couple of outliers in GrLivingArea:

var = 'GrLivArea'
outlier = pd.concat([data['SalePrice'], data[var]], axis=1)
outlier.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))
Example of outliers

Conclusion

By using this framework provided by this blog post, certain patterns and trends will emerge that can be further explored. After completing an EDA, you should have a better understanding of your data, which will make it easier for you to build a model that can solve your underlying problem.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store