- Aug 18, 2021

[python] When performing k-means clustering, normalize and standardize the data.

Overview

While working on a machine learning problem, I tried k-means clustering of the data. However, it seemed that it depended too much on the value of one column. To prevent this from happening, the size of the data must be the same in advance.

What is k-means clustering

k-means clustering is an algorithm that divides data into multiple clusters. Create a number of cluster centroids specified in advance and classify each data into the nearest cluster.

Why normalization is necessary

The distance between data points in the coordinate space is used in cluster classification in k-means. Therefore, if the absolute value of only one piece of data is large, the distance between points will be large even with the same variance, which will affect the calculation of cluster classification.

Even with the same 10% difference, the distance between points is significantly different.

The data I was actually using was such that columns A, C, and D ranged from 0 to 2, and column B ranged from 1000 to 3000. When I forgot to standardize and tried clustering, the clustering seemed to be affected only by column B. I'm glad I noticed something was wrong this time.

Colclusion

Be sure to normalize and standardize before performing kmeans clustering. And be sure to check the result of clustering.

[python] When performing k-means clustering, normalize and standardize the data.

Overview

What is k-means clustering

Why normalization is necessary

Colclusion

Recent Posts

コメント

category

article

Make a "don't forget to add to list" shopping list app with Flutter + Raspberry pi

I made a towel exchange monitoring app with Flutter and Raspberry Pi

[Flutter] Manage status by linking Firestore and Redux

[python] Visualize data and grasp correlation at the same time

Let's do our best with our partner: ChatReminder

It is an application that achieves goals in a chat format with partners.

Let's do our best with our partner: ChatReminder

It is an application that achieves goals in a chat format with partners.

Theme diary: Decide the theme and record for each genre

It is a diary application that allows you to post and record with themes and sub-themes for each genre.

[python] When performing k-means clustering, normalize and standardize the data.

Overview

What is k-means clustering

Why normalization is necessary

Colclusion

Recent Posts

コメント

category

article

Make a "don't forget to add to list" shopping list app with Flutter + Raspberry pi

I made a towel exchange monitoring app with Flutter and Raspberry Pi

[Flutter] Manage status by linking Firestore and Redux

[python] Visualize data and grasp correlation at the same time

Let's do our best with our partner:​ ChatReminder

It is an application that achieves goals in a chat format with partners.

Let's do our best with our partner:​ ChatReminder

It is an application that achieves goals in a chat format with partners.

Theme diary: Decide the theme and record for each genre

It is a diary application that allows you to post and record with themes and sub-themes for each genre.

Let's do our best with our partner: ChatReminder

Let's do our best with our partner: ChatReminder