1. Profiling the pandas dataframe
Profiling is a process that helps us understand our data, and Pandas Profiling is a Python package that does exactly that. It is a simple and fast way to perform exploratory data analysis (EDA) of a pandas DataFrame. Pandas functions such as df.info() are normally used as a first step in the EDA process. However, they give only a very basic overview of the data and don't help much with large datasets. Pandas Profiling, on the other hand, extends the pandas DataFrame with df.profile_report() for quick data analysis. It displays a lot of information with a single line of code, and presents it in an interactive HTML report.
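For comparison, here is a minimal sketch of the basic first-pass overview that plain pandas provides. The tiny DataFrame is made-up toy data (not the Titanic dataset), its column names are assumptions for illustration only, and df.describe() is included as a common companion to df.info().

```python
import io

import pandas as pd

# Hypothetical toy data, used only to have something to inspect
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

# df.info() prints column dtypes and non-null counts; capture it as text
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Missing values per column
missing = df.isna().sum()
print(missing)

# describe() summarizes numeric columns only by default
# (count, mean, std, min, quartiles, max)
summary = df.describe()
print(summary)
```

This output is useful but flat: there are no charts, no correlations, and no per-column drill-down. That is the gap a profiling report is meant to fill.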
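For a given dataset, the pandas profiling package computes statistics such as column types, unique and missing values, quantile and descriptive statistics, most frequent values, histograms, and correlations.
The package can be installed with pip or conda:
pip install pandas-profiling
conda install -c anaconda pandas-profiling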
Let’s use the age-old Titanic dataset to demonstrate the capabilities of this versatile Python profiler.
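# Import the necessary packages
import pandas as pd
import pandas_profiling  # also adds df.profile_report() to DataFrames (2.0.0+)

df = pd.read_csv('titanic/train.csv')

# Deprecated: pre-2.0.0 usage
pandas_profiling.ProfileReport(df)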
Edit: A week after this article was published, Pandas-Profiling released a major upgrade, version 2.0.0. The syntax has changed a bit: importing the package now adds a profile_report() method to the pandas DataFrame itself, and the report has become more comprehensive. Below is the latest usage syntax:
To display the report in a Jupyter notebook, run:
df.profile_report()
This single line of code is all that you need to display the data profiling report in a Jupyter notebook. The report is pretty detailed including charts wherever necessary.
The report can also be exported into an interactive HTML file with the following code.
profile = df.profile_report(title='Pandas Profiling Report')
# Version 2.0.0 names this argument outputfile; later releases renamed it to output_file
profile.to_file(outputfile="Titanic data profiling.html")
Refer to the documentation for more details and examples.