
[Python] Visualize the distribution of data according to the values of two items


Overview


You can use seaborn's countplot or distplot to visualize the distribution of data. If you want to visualize the distribution according to the value of a certain item, you can use jointplot or the hue argument of countplot (example: Fig. 1).

Previously, I introduced methods for visualizing the distribution according to the value of a certain item, chosen by the attribute (discrete or continuous) of each variable.

This time, I will extend it and introduce a method to visualize the distribution of data according to the values of two items.


Fig. 1: Distribution of "Survived" according to the value of the variable "Embarked"
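A plot of the kind described for Fig. 1 can be sketched roughly as follows. The small DataFrame is made-up stand-in data that only borrows the column names "Embarked" and "Survived"; it is not the dataset used in this article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in data (assumption): only the column names follow Fig. 1.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1],
})

fig, ax = plt.subplots(figsize=(5, 3))
# hue splits the count of each "Embarked" value by the value of "Survived"
sns.countplot(x="Embarked", hue="Survived", data=toy, ax=ax)
fig.tight_layout()
```

Call plt.show() afterwards to display the figure in an interactive session.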


Method


The appropriate method depends on the attributes of the explanatory variable (the variable whose distribution you want to check) and the objective variable (the variable whose values you split that distribution by), as follows.

目的変数: objective variable, 説明変数: explanatory variable, 離散: discrete, 連続: continuous

Add a new argument, hue, to the previously created visualize_data function, and pass it the key name of the second classification item.

def visualize_data(data, target_col, categorical_keys=None, hue=None):

Case of jointplot, violinplot


These functions did not use the hue argument so far, so hue can be passed to them directly.

sg=sns.jointplot(x=key, y=target_col, data=data, height=length*2/3, 
                hue=hue)

Case of countplot


In the case of countplot, the hue argument is already in use. Therefore, I draw a separate plot for each value of the second item.

Change the function to catplot and set kind to 'count'. If you pass a variable to the row argument of catplot, a separate plot is drawn for each value of that variable. (If you pass it to row, each value gets its own row, so the plots are stacked vertically; if you pass it to col, each value gets its own column, so the plots line up horizontally.)


sns.catplot(x=key, data=data, hue=target_col, row=hue, kind='count')
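The arrangement of the facets can be checked with a small sketch like the one below, which inspects the shape of the axes array that catplot returns. The data and the column names key, target and second are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in data (assumption): "second" plays the role of the hue argument
# of visualize_data and has two distinct values.
toy = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "target": [0, 0, 1, 1, 0, 1, 1, 0],
    "second": ["u", "u", "u", "u", "v", "v", "v", "v"],
})

# row=...: one facet per value, stacked vertically
g_row = sns.catplot(x="key", data=toy, hue="target", row="second",
                    kind="count", height=2)
# col=...: one facet per value, placed side by side
g_col = sns.catplot(x="key", data=toy, hue="target", col="second",
                    kind="count", height=2)

print(g_row.axes.shape)  # (2, 1): two rows, one column
print(g_col.axes.shape)  # (1, 2): one row, two columns
```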

Case of distplot


Similarly, since the hue argument is already in use, I draw a separate plot for each value of the second item.

g=sns.FacetGrid(data, hue=target_col, row=hue)
# FacetGrid supplies the axes of each facet itself, so ax must not be passed
g.map(sns.distplot, key, kde=False)
g.add_legend()

The whole code is as follows. See the previous page for the original methods used internally.

import matplotlib.pyplot as plt
import seaborn as sns

# is_categorical, cramerV and correlation_ratio are the helper functions
# from the previous page.
# Note: distplot is deprecated since seaborn 0.11; histplot is its replacement.

def visualize_data(data, target_col, categorical_keys=None, hue=None):

    keys=data.keys()

    if categorical_keys is None:
        categorical_keys=keys[[is_categorical(data, key) for key in keys]]

    length=10
    subplot_size=(length, length/2)

    for key in keys:
        if key==target_col or key==hue:
            continue

        if (key in categorical_keys) and (target_col in categorical_keys):

            r=cramerV(key, target_col, data)

            # Overall counts of key
            fig, ax=plt.subplots(figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=ax)
            # catplot is figure-level: it creates its own figure and cannot
            # draw into an existing Axes. One facet row per value of hue.
            g=sns.catplot(x=key, data=data, hue=target_col, row=hue,
                          kind='count')
            g.fig.suptitle(r, y=1.02)
            plt.show()

        elif (key in categorical_keys) and not (target_col in categorical_keys):

            r=correlation_ratio(cat_key=key, num_key=target_col, data=data)

            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.violinplot(x=key, y=target_col, data=data, ax=axes[1],
                           hue=hue)
            axes[1].set_title(r)
            plt.tight_layout()
            plt.show()

        elif not (key in categorical_keys) and (target_col in categorical_keys):

            r=correlation_ratio(cat_key=target_col, num_key=key, data=data)

            # Overall distribution of key
            fig, ax=plt.subplots(figsize=subplot_size)
            sns.distplot(data[key], ax=ax, kde=False)
            # FacetGrid also creates its own figure, so ax must not be passed
            # to g.map. One facet column per value of hue.
            g=sns.FacetGrid(data, hue=target_col, col=hue)
            g.map(sns.distplot, key, kde=False)
            g.add_legend()
            g.fig.suptitle(r, y=1.02)
            plt.show()

        else:

            # Pearson correlation between the two continuous variables
            r=data[key].corr(data[target_col])

            sg=sns.jointplot(x=key, y=target_col, data=data,
                             height=length*2/3, hue=hue)
            sg.fig.suptitle(r, y=1.02)
            plt.show()
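The helper functions is_categorical, cramerV and correlation_ratio are defined on the previous page and are not reproduced here. For readers who want to run the code without that page, the following is a rough stand-in sketch with the same call signatures. The bodies are my assumptions, not the original implementations: the max_unique threshold is invented, correlation_ratio returns the squared ratio (eta squared), and missing values are not handled.

```python
import numpy as np
import pandas as pd


def is_categorical(data, key, max_unique=10):
    # Assumed rule: non-numeric columns, or numeric columns with few distinct values.
    col = data[key]
    return (not pd.api.types.is_numeric_dtype(col)) or col.nunique() <= max_unique


def cramerV(x, y, data):
    # Cramer's V from the chi-squared statistic of the contingency table.
    # Assumes both columns have at least two distinct values.
    table = pd.crosstab(data[x], data[y]).to_numpy(dtype=float)
    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))


def correlation_ratio(cat_key, num_key, data):
    # Eta squared: between-group sum of squares divided by total sum of squares.
    values = data[num_key]
    grouped = values.groupby(data[cat_key])
    between = (grouped.count() * (grouped.mean() - values.mean()) ** 2).sum()
    total = ((values - values.mean()) ** 2).sum()
    return between / total
```

Both measures range from 0 (no association) to 1 (the categorical variable fully determines the other variable).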

The result is as follows (only a part is displayed). The distribution of the data is plotted according to the values of the two items 'Cover_Type' and 'result'.

visualize_data(data=data_res, target_col='Cover_Type', hue='result')




Lastly


The most common use is to add a model's prediction result as a column of the data and view the distribution of the data according to two values: the prediction and the target value.
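Preparing such a DataFrame can be sketched as follows. The data, the column name feature_a and the prediction values are all made up for illustration; in practice the predictions would come from your own model.

```python
import pandas as pd

# Made-up stand-in data (assumption); "Cover_Type" is the target column.
data = pd.DataFrame({
    "feature_a": [2.5, 2.8, 2.7, 2.5, 3.0, 2.9],
    "Cover_Type": [1, 2, 2, 1, 3, 3],
})

# Placeholder predictions (assumption): replace with the output of your model.
predictions = [1, 2, 1, 1, 3, 2]

# Copy first so the original data stays unchanged, then add the new column.
data_res = data.copy()
data_res["result"] = predictions

# data_res can now be passed to visualize_data:
# visualize_data(data=data_res, target_col='Cover_Type', hue='result')
```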
