- Aug 21, 2021

[python] How to find the correlation for categorical variables

Introduction

When analyzing data, you often look at the correlation between the variables in a dataset. For two numerical variables you can check the correlation coefficient, but what if one or both of them are categorical? I looked into it, so I will summarize what I found.

Number vs Number

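This is the well-known case: you can check the (Pearson) correlation coefficient. For n pairs of values $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, it is defined as follows.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

To find the correlation coefficient in Python, use the corr() method of pandas.DataFrame.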

import numpy as np
import pandas as pd

x=np.random.randint(1, 10, 100)
y=np.random.randint(1, 10, 100)

data=pd.DataFrame({'x':x, 'y': y})

data.corr()

A value near 0 means there is no linear correlation, a value close to 1 means a positive correlation, and a value close to -1 means a negative correlation.
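To make the three cases concrete, here is a minimal sketch that computes the coefficient directly from the sums of deviations and compares it with pandas. The helper name pearson_r and the toy data are my own assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation coefficient from sums of deviations (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

rng = np.random.default_rng(0)
x = rng.integers(1, 10, 100)
y_pos = 2 * x + 1                   # exact increasing linear function of x
y_neg = -3 * x + 5                  # exact decreasing linear function of x
y_rand = rng.integers(1, 10, 100)   # drawn independently of x

r_pos = pearson_r(x, y_pos)    # should be 1 up to rounding
r_neg = pearson_r(x, y_neg)    # should be -1 up to rounding
r_rand = pearson_r(x, y_rand)  # some value between -1 and 1

# The same value taken from pandas, for comparison
r_pandas = pd.DataFrame({'x': x, 'y': y_rand}).corr().loc['x', 'y']
```

The hand-written value and the pandas value should agree up to floating-point error, since DataFrame.corr() uses the Pearson coefficient by default.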

Category vs Number

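The relationship is expressed by a statistic called the correlation ratio. It is defined as follows.

$$\eta^2 = \frac{\sum_{c} n_c (\bar{x}_c - \bar{x})^2}{\sum_{i} (x_i - \bar{x})^2}$$

Here $n_c$ is the number of samples in category $c$, $\bar{x}_c$ is the mean of the numerical variable within category $c$, and $\bar{x}$ is the overall mean.

See the references at the end of this article for a concrete example.

The numerator represents how far apart the categories are from each other. The farther apart the category means are, the larger the numerator and the stronger the association.

The correlation ratio is 0 when there is no association and approaches 1 as the association becomes stronger. Unlike the correlation coefficient, it is never negative.

In Python, it can be calculated as follows (see the references).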

def correlation_ratio(cat_key, num_key, data):

    categorical=data[cat_key]
    numerical=data[num_key]

    mean=numerical.dropna().mean()
    # Total sum of squared deviations from the overall mean
    all_var=((numerical-mean)**2).sum()

    unique_cat=pd.Series(categorical.unique())
    unique_cat=list(unique_cat.dropna())

    # Numerical values split by category
    categorical_num=[numerical[categorical==cat] for cat in unique_cat]
    # Number of samples in the category x (category mean - overall mean)^2
    categorical_var=[len(x.dropna())*(x.dropna().mean()-mean)**2 for x in categorical_num]

    r=sum(categorical_var)/all_var

    return r
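To see the two extremes of the correlation ratio, here is a small self-contained sketch. It uses groupby instead of the loop above, the name eta_squared and the toy data are assumptions of mine, and it drops rows where either value is missing, which is a simplification compared with the function above.

```python
import pandas as pd

def eta_squared(categories, values):
    """Correlation ratio: between-category sum of squares / total sum of squares."""
    df = pd.DataFrame({'cat': categories, 'val': values}).dropna()
    overall_mean = df['val'].mean()
    total_ss = ((df['val'] - overall_mean) ** 2).sum()
    groups = df.groupby('cat')['val']
    # Samples per category x (category mean - overall mean)^2, summed over categories
    between_ss = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    return between_ss / total_ss

cats = ['a', 'a', 'a', 'b', 'b', 'b']

# Category means are far apart and there is no spread inside a category
separated = eta_squared(cats, [1, 1, 1, 5, 5, 5])

# Both categories have the same mean, so the category explains nothing
overlapping = eta_squared(cats, [1, 3, 5, 1, 3, 5])
```

In the first case all of the variation comes from the difference between categories, so the ratio is 1. In the second case the category means coincide and the ratio is 0.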

Category vs Category

Here we use a statistic called Cramér's coefficient of association (Cramér's V). It is defined as

$$V = \sqrt{\frac{\chi^2}{n\,(k-1)}}$$

where $\chi^2$ is the chi-square statistic of the contingency table, n is the number of data items, and k is the smaller of the numbers of categories of the two variables. Please see the references at the end of this article for the chi-square statistic.

Roughly speaking, it is a quantity that expresses how different the distribution within each category is from the overall distribution. Again, a value close to 0 means there is no association and a value close to 1 means a strong association. It is never negative.

To calculate it in Python, do the following (see the references).

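import scipy.stats as st

def cramerV(x, y, data):

    # Contingency table of the two categorical columns
    table=pd.crosstab(data[x], data[y])
    # Chi-square statistic without Yates' continuity correction
    x2, p, dof, e=st.chi2_contingency(table, correction=False)

    n=table.sum().sum()
    r=np.sqrt(x2/(n*(np.min(table.shape)-1)))

    return r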
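To show what the chi-square statistic measures here, the following sketch computes it by hand from the observed and expected counts instead of calling scipy. The name cramers_v_manual and the toy data are assumptions for illustration, and no continuity correction is applied.

```python
import numpy as np
import pandas as pd

def cramers_v_manual(x, y):
    """Cramér's V with the chi-square statistic computed by hand (illustrative)."""
    table = pd.crosstab(pd.Series(x, name='x'), pd.Series(y, name='y'))
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Counts expected if the two variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape)  # smaller of the two numbers of categories
    return np.sqrt(chi2 / (n * (k - 1)))

# y is completely determined by x
v_dependent = cramers_v_manual(['a', 'a', 'b', 'b'], ['c', 'c', 'd', 'd'])

# every combination appears equally often, so x says nothing about y
v_independent = cramers_v_manual(['a', 'a', 'b', 'b'], ['c', 'd', 'c', 'd'])
```

When y is completely determined by x, the observed counts differ from the expected ones as much as possible and V is 1. When every combination appears equally often, observed and expected counts are identical and V is 0.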

Find each index together

On its own this would just be a rehash of earlier articles, so I wrote a function that calculates all of these indices for a DataFrame at once. You no longer have to look them up one by one!

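def is_categorical(data, key):

    # Guess from the dtype whether a column should be treated as categorical
    col=data[key]

    if pd.api.types.is_integer_dtype(col):

        # Integer columns count as categorical only if they have few distinct values
        nunique=col.nunique()
        return nunique<6

    elif pd.api.types.is_float_dtype(col):
        return False

    else:
        return True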

def get_corr(data, categorical_keys=None):

    keys=data.keys()

    if categorical_keys is None:

        # Guess the categorical columns from their dtypes
        categorical_keys=keys[[is_categorical(data, key) for key in keys]]

    corr=pd.DataFrame({})
    corr_ratio=pd.DataFrame({})
    corr_cramer=pd.DataFrame({})

    for key1 in keys:
        for key2 in keys:

            if (key1 in categorical_keys) and (key2 in categorical_keys):

                # Category vs category: Cramer's V
                r=cramerV(key1, key2, data)
                corr_cramer.loc[key1, key2]=r

            elif (key1 in categorical_keys) and (key2 not in categorical_keys):

                # Category vs number: correlation ratio
                r=correlation_ratio(cat_key=key1, num_key=key2, data=data)
                corr_ratio.loc[key1, key2]=r

            elif (key1 not in categorical_keys) and (key2 in categorical_keys):

                r=correlation_ratio(cat_key=key2, num_key=key1, data=data)
                corr_ratio.loc[key1, key2]=r

            else:

                # Number vs number: correlation coefficient of the two columns only,
                # so that non-numeric columns in data do not get in the way
                r=data[key1].corr(data[key2])
                corr.loc[key1, key2]=r

    return corr, corr_ratio, corr_cramer

Unless you specify them, the categorical columns are determined automatically from each column's dtype.
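As a rough illustration of that automatic rule, the sketch below applies the same idea (floats are numerical, integers with fewer than 6 distinct values are categorical, everything else is categorical) to a made-up DataFrame. The name guess_categorical, the max_levels parameter, and the toy columns are hypothetical and only serve this example.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def guess_categorical(series, max_levels=5):
    """Dtype-based guess: is this column categorical? (illustrative helper)"""
    if is_float_dtype(series):
        return False
    if is_integer_dtype(series):
        # Few distinct integer values are taken to be category codes
        return series.nunique() <= max_levels
    # Strings, booleans, and other dtypes are treated as categories
    return True

# Toy data, not the real Titanic file
df = pd.DataFrame({
    'Pclass': [1, 2, 3, 3, 1, 2, 3, 1],                       # integer, 3 distinct values
    'Age': [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],   # float
    'Sex': ['m', 'f', 'f', 'm', 'm', 'f', 'f', 'm'],          # strings
    'Ticket_no': [11, 12, 13, 14, 15, 16, 17, 18],            # integer, 8 distinct values
})

guessed = {col: guess_categorical(df[col]) for col in df.columns}
```

A rule like this is only a guess. For example, pandas reads an integer column that contains missing values as float by default, so such a column would be treated as numerical. Passing categorical_keys explicitly is safer when you know your data.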

Let's apply it to the Titanic data.

data=pd.read_csv(r"train.csv")
data=data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
category=["Survived", "Pclass", "Sex", "Embarked"]

corr, corr_ratio, corr_cramer=get_corr(data, category)

corr

corr_ratio

corr_cramer

In addition, it can be visualized with the seaborn heatmap.

import seaborn as sns
sns.heatmap(corr_cramer, vmin=-1, vmax=1)

Lastly

My explanation of each statistic ended up rather rough, so please see the pages listed in the references for details. Even when I write things up like this, I end up forgetting them and looking them up again, so I try to write functions that automate as much as possible.

The functions used in this article are uploaded on GitHub.

Reference

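Calculating the relationship between variables of various scales (Python) (様々な尺度の変数同士の関係を算出する(Python))

Correlation analysis (相関分析)

Correlation ratio (相関比)

Chi-square test and Cramér's coefficient of association (カイ２乗検定・クラメール連関係数)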
