[python] How to get an overview of the whole data in pandas
Introduction
When analyzing data, you often want to check what the dataset as a whole looks like. This article describes how to get an overview of a whole dataset in pandas. First I summarize the existing methods, then I introduce my own function.
The environment is Python 3.7.4 and pandas 0.25.1.
Existing method
pandas.DataFrame already provides the methods .info() and .describe() for summarizing data. Since they are covered in many other places, only their results are shown briefly here. The Titanic dataset is used as the data.
import pandas as pd
data = pd.read_csv("train.csv") #read data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()
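The output of describe() is not reproduced here. As a rough illustration of what it returns, the sketch below runs it on a small hypothetical DataFrame (the name `toy` and its values are made up for this example and are not the Titanic data). It shows that, by default, only numeric columns are summarized and that `count` excludes missing values.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used above
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

# By default describe() summarizes numeric columns only, so "sex" is dropped
stats = toy.describe()
print(stats)

# "count" is the number of non-missing values per column
print(stats.loc["count"])
```

Note that the result contains neither the dtype of each column nor an explicit count of missing values.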
My own method
However, these two methods alone leave something to be desired. For example, describe() does not show dtype or missing-value information, and calling both info() and describe() every time is tedious.
So I wrote a function that combines info() and describe().
import numpy as np

def summarize_data(df):
    # Build a summary table with one row per column of df
    df_summary = pd.DataFrame({'nunique': np.zeros(df.shape[1])}, index=df.keys())
    df_summary['nunique'] = df.nunique()            # number of distinct values
    df_summary['dtype'] = df.dtypes                 # column type
    df_summary['isnull'] = df.isnull().sum()        # number of missing values
    df_summary['first_val'] = df.iloc[0]            # value in the first row
    # Basic statistics, computed for numeric columns only
    df_summary['max'] = df.max(numeric_only=True)
    df_summary['min'] = df.min(numeric_only=True)
    df_summary['mean'] = df.mean(numeric_only=True)
    df_summary['std'] = df.std(numeric_only=True)
    df_summary['mode'] = df.mode().iloc[0]          # most frequent value
    # Do not omit rows when the summary is displayed
    pd.set_option('display.max_rows', len(df.keys()))
    return df_summary

summarize_data(data)
In addition, in a Kaggle kernel the display is truncated when the summary has many rows, so the pd.set_option call near the end of summarize_data() makes sure that no rows are omitted.
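pd.set_option changes the display setting for the rest of the session. If that side effect is unwanted, one possible alternative is pd.option_context, which applies the setting only inside a with block. The following is a minimal sketch on a hypothetical 100-column DataFrame (`wide` and `summary` are illustrative names, not part of the function above).

```python
import pandas as pd

# Hypothetical frame with 100 columns, so a per-column summary has 100 rows
wide = pd.DataFrame({f"col{i}": [i, i + 1] for i in range(100)})
summary = pd.DataFrame({"dtype": wide.dtypes, "isnull": wide.isnull().sum()})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted (None means no limit)
with pd.option_context("display.max_rows", None):
    full_text = repr(summary)
    print(full_text)

# Outside the block the previous setting is restored automatically
print(pd.get_option("display.max_rows") == default_max_rows)
```

This keeps the untruncated display local to the place where the summary is printed.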
Summary
I introduced the existing methods for summarizing a pandas.DataFrame and a self-made function that combines them. Besides giving an overview at the start of an analysis, it can also be used to check whether scaling and missing-value handling were done properly.
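As a rough sketch of such a check, the example below compares a summary before and after preprocessing. It assumes a small hypothetical DataFrame and a simplified stand-in function, `quick_summary`, which reports only a few of the columns that summarize_data produces. Median imputation and min-max scaling are chosen here only for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing value in "age"
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

def quick_summary(df):
    # Simplified, hypothetical stand-in for summarize_data
    return pd.DataFrame({
        "isnull": df.isnull().sum(),
        "min": df.min(),
        "max": df.max(),
    })

before = quick_summary(toy)

# Fill missing ages with the median, then min-max scale each column to [0, 1]
filled = toy.fillna({"age": toy["age"].median()})
scaled = (filled - filled.min()) / (filled.max() - filled.min())
after = quick_summary(scaled)

print(before)
print(after)
```

If the preprocessing worked, the isnull column in the second summary is all zeros and every column has min 0 and max 1.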
The functions in this article are uploaded on GitHub.