
[Python] How to get an overview of the whole data in pandas


Introduction


When you analyze data, you often want to check what the dataset as a whole looks like. This article describes how to get an overview of the whole data in pandas. I first summarize the existing methods and then introduce my own.

Environment: Python 3.7.4, pandas 0.25.1



Existing method


pandas.DataFrame already has the methods .info() and .describe() for summarizing data. Since they are explained in many places, I only show the results briefly here. The Titanic dataset is used as the data.

Click here for details, for example


import pandas as pd
data = pd.read_csv("train.csv") #read data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()
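
As a rough illustration of what describe() covers, here is a minimal sketch on a small stand-in table. The `toy` DataFrame and its values are made up for this example and are not the Titanic file. By default describe() summarizes only the numeric columns, and `include="all"` adds the object columns, but neither form reports the dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: a few rows with Titanic-like columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric_summary = toy.describe()            # numeric columns only by default
full_summary = toy.describe(include="all")  # also covers object columns

print(numeric_summary)
print(full_summary)
```

The "count" row gives the number of non-null values per column, so missing values can only be inferred by comparing it with the number of rows.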


My own method


However, these alone leave something to be desired. For example, describe() does not show dtype or missing-value information, and it is tedious to call both info() and describe() every time.

So I wrote a function that combines info() and describe().


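def summarize_data(df):
    # One row per column of df; each statistic becomes a column of the summary.
    df_summary = pd.DataFrame(index=df.columns)

    df_summary['nunique'] = df.nunique()                # number of distinct values
    df_summary['dtype'] = df.dtypes                     # column dtype
    df_summary['isnull'] = df.isnull().sum()            # number of missing values
    df_summary['first_val'] = df.iloc[0]                # value in the first row
    df_summary['max'] = df.max(numeric_only=True)       # numeric columns only
    df_summary['min'] = df.min(numeric_only=True)
    df_summary['mean'] = df.mean(numeric_only=True)
    df_summary['std'] = df.std(numeric_only=True)
    df_summary['mode'] = df.mode().iloc[0]              # most frequent value (first one if tied)

    # Raise the display limit so that no row of the summary is omitted.
    # Note: this changes the global pandas display setting.
    pd.set_option('display.max_rows', len(df.columns))

    return df_summary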

summarize_data(data)

Note that in a Kaggle kernel the display is truncated when there are many rows, so the pd.set_option call at the end of summarize_data() raises the limit so that nothing is omitted.
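
If changing the global display setting is not desired, one possible alternative (not what the function in this article does) is to lift the limit only temporarily with pd.option_context. The sketch below uses a made-up 100-row table named `summary` in place of a real summary result.

```python
import pandas as pd

# Hypothetical summary table with 100 rows, standing in for a real summary.
summary = pd.DataFrame(
    {"isnull": range(100)},
    index=[f"col_{i}" for i in range(100)],
)

before = pd.get_option("display.max_rows")

# Inside the block no row is omitted; the previous setting is restored on exit.
with pd.option_context("display.max_rows", None):
    full_text = repr(summary)

after = pd.get_option("display.max_rows")
print(full_text)
```

Here `before` and `after` hold the same value, which shows that the setting outside the `with` block is left untouched.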



Summary


I introduced the existing pandas.DataFrame methods for summarizing data and my own function that combines them. Besides getting an overview at the start, you can also use it to check whether scaling and missing-value handling were done properly.
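
As a minimal sketch of such a check, the example below compares a summary before and after preprocessing. The data, the `quick_summary` helper (a cut-down stand-in for summarize_data), and the choice of median filling and standardization are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def quick_summary(df):
    # Compact stand-in for summarize_data: only the columns needed for the check.
    return pd.DataFrame({
        "isnull": df.isnull().sum(),
        "mean": df.mean(numeric_only=True),
        "std": df.std(numeric_only=True),
    })

# Hypothetical toy data, not the Titanic file.
raw = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

processed = raw.copy()
# Missing-value processing: fill Age with its median.
processed["Age"] = processed["Age"].fillna(processed["Age"].median())
# Scale conversion: standardize Fare to zero mean and unit standard deviation.
processed["Fare"] = (processed["Fare"] - processed["Fare"].mean()) / processed["Fare"].std()

summary_before = quick_summary(raw)
summary_after = quick_summary(processed)
print(summary_before)
print(summary_after)
```

After processing, the "isnull" entry for Age should be 0, and the "mean" and "std" entries for Fare should be close to 0 and 1.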


The code in this article is uploaded on GitHub.
