- Aug 21, 2021

[python] Visualize data and understand correlation at the same time

Introduction

When analyzing data, you usually plot it first to see what it looks like. It is convenient if a statistic that measures the correlation between the two variables is shown on the same plot. I therefore wrote a function that picks the appropriate graph and the appropriate statistic depending on whether each variable is categorical or numerical.

Review so far

Here is a summary of the graph types suited to each combination of variable types, and of the statistics that express their correlation, as covered in my earlier articles. See the links below for details.

Legend: 説明変数 = explanatory variable, 目的変数 = objective variable, 離散 = discrete, 連続 = continuous

[python]How to visualize data

[python]How to find the correlation for categorical variables
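As a rough text version of that summary, the pairing of graph and statistic can be written as a lookup table. This is only a sketch read off the branches of `visualize_data` in this article; the names `CHART_AND_STATISTIC` and `choose` are hypothetical and are not part of the original code.

```python
# Hypothetical lookup: (explanatory type, objective type) -> (graph, statistic).
# The pairs mirror the four branches of visualize_data in this article.
CHART_AND_STATISTIC = {
    ("discrete", "discrete"): ("count plot split by objective variable", "Cramer's V"),
    ("discrete", "continuous"): ("violin plot", "correlation ratio"),
    ("continuous", "discrete"): ("histogram split by objective variable", "correlation ratio"),
    ("continuous", "continuous"): ("joint plot", "Pearson correlation coefficient"),
}


def choose(explanatory_is_discrete, objective_is_discrete):
    """Return the (graph, statistic) pair for one combination of variable types."""
    def kind(flag):
        return "discrete" if flag else "continuous"

    return CHART_AND_STATISTIC[(kind(explanatory_is_discrete), kind(objective_is_discrete))]


# A discrete explanatory variable against a continuous objective variable.
print(choose(True, False))
```

Keeping the mapping in one place makes it easy to see which statistic belongs to which graph before reading the plotting code.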

Put the right statistics on the right graph

For each combination of variables, the function below selects and displays an appropriate graph and an appropriate statistic, depending on whether each variable is discrete or continuous.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st

def visualize_data(data, target_col, categorical_keys=None):

    keys=data.keys()

    if categorical_keys is None:

        categorical_keys=keys[[is_categorical(data, key) for key in keys]]

    for key in keys:

        if key==target_col:
            continue

        length=10
        subplot_size=(length, length/2)

        if (key in categorical_keys) and (target_col in categorical_keys):

            r=cramerV(key, target_col, data)

            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.countplot(x=key, data=data, hue=target_col, ax=axes[1])
            plt.title(r)
            plt.tight_layout()
            plt.show()

        elif (key in categorical_keys) and not (target_col in categorical_keys):

            r=correlation_ratio(cat_key=key, num_key=target_col, data=data)

            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.violinplot(x=key, y=target_col, data=data, ax=axes[1])
            plt.title(r)
            plt.tight_layout()
            plt.show()

        elif not (key in categorical_keys) and (target_col in categorical_keys):

            r=correlation_ratio(cat_key=target_col, num_key=key, data=data)

            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            # histplot replaces distplot, which is deprecated in recent seaborn
            sns.histplot(data=data, x=key, ax=axes[0])
            # hue draws one histogram per class on the same axes, with a legend
            sns.histplot(data=data, x=key, hue=target_col, ax=axes[1])
            axes[1].set_title(r)
            plt.tight_layout()
            plt.show()

        else:

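            # Pearson correlation of these two columns only, so that
            # non-numeric columns elsewhere in data cannot raise an error
            r=data[key].corr(data[target_col])

            sg=sns.jointplot(x=key, y=target_col, data=data, height=length*2/3)
            # Put the title on the top marginal axes so it sits above the plot
            sg.ax_marg_x.set_title(r)
            plt.show()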

The function above uses the following helper functions.

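def is_categorical(data, key):  # Decide whether a column is a categorical variable

    col=data[key]

    # The dtype helpers behave the same on every platform,
    # unlike comparing the dtype with the string 'int'
    if pd.api.types.is_integer_dtype(col):

        # An integer column with few distinct values is treated as categorical
        nunique=col.nunique()
        return nunique<6

    elif pd.api.types.is_float_dtype(col):
        return False

    else:
        return True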

def correlation_ratio(cat_key, num_key, data):  # Find the correlation ratio

    categorical=data[cat_key]
    numerical=data[num_key]

    mean=numerical.dropna().mean()
    all_var=((numerical-mean)**2).sum()

    unique_cat=pd.Series(categorical.unique())
    unique_cat=list(unique_cat.dropna())

    categorical_num=[numerical[categorical==cat] for cat in unique_cat]
    categorical_var=[len(x.dropna())*(x.dropna().mean()-mean)**2 for x in categorical_num]    

    r=sum(categorical_var)/all_var

    return r

def cramerV(x, y, data):  # Cramér's V (coefficient of association)

    table=pd.crosstab(data[x], data[y])
    # correction=False turns off Yates' continuity correction
    x2, p, dof, e=st.chi2_contingency(table, correction=False)

    n=table.sum().sum()
    r=np.sqrt(x2/(n*(np.min(table.shape)-1)))

    return r
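To check what these two statistics measure, it can help to compute them by hand on tiny inputs. The following is a minimal, dependency-free sketch of the same formulas, not the code used in this article: the names `correlation_ratio_plain` and `cramers_v_plain` are hypothetical, and the sketch assumes there are no missing values and that each variable takes at least two distinct values.

```python
from collections import Counter
from math import sqrt


def correlation_ratio_plain(categories, values):
    """Between-group sum of squares divided by the total sum of squares."""
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    between = 0.0
    for cat in set(categories):
        group = [v for c, v in zip(categories, values) if c == cat]
        group_mean = sum(group) / len(group)
        between += len(group) * (group_mean - mean) ** 2
    return between / total


def cramers_v_plain(xs, ys):
    """sqrt(chi2 / (n * (min(rows, cols) - 1))), without continuity correction."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    observed = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # count expected under independence
            chi2 += (observed[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(x_counts), len(y_counts)) - 1)))


# The groups fully separate the values, so the ratio is 1.
print(correlation_ratio_plain(["a", "a", "b", "b"], [1, 1, 3, 3]))
# Both groups have the same mean as the whole sample, so the ratio is 0.
print(correlation_ratio_plain(["a", "a", "b", "b"], [1, 3, 1, 3]))
# y is fully determined by x, so V is 1.
print(cramers_v_plain([0, 0, 1, 1], [0, 0, 1, 1]))
# x and y are independent in this sample, so V is 0.
print(cramers_v_plain([0, 0, 1, 1], [0, 1, 0, 1]))
```

Both values lie between 0 and 1, which makes them easy to compare across the different graph types.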

Let's apply it to the Titanic data (only part of the result is shown).

train_data=pd.read_csv("train.csv")
train_data=train_data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

categories=["Survived", "Pclass", "Sex", "Embarked"]
visualize_data(train_data, "Survived", categories)

Lastly

I combined the methods covered so far into a single function. I run it at the start of a data analysis to get an overview of the data.

The functions used in this article are available on my GitHub.
