
[Python] When performing k-means clustering, normalize and standardize the data


Overview


While working on a machine learning problem, I tried k-means clustering on my data. However, the result seemed to depend almost entirely on the values of one column. To prevent this, the columns must be brought to a comparable scale before clustering.



What is k-means clustering


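k-means clustering is an algorithm that divides data into a number of clusters specified in advance. It creates that many cluster centroids, assigns each data point to the nearest centroid, and then moves each centroid to the mean of its assigned points, repeating until the assignments stop changing.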
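To make the procedure concrete, here is a minimal sketch in plain Python. It is not the implementation I used in my project. The function name `kmeans`, the fixed iteration count, and the sample points are all assumptions chosen for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Made-up sample: two well-separated groups of 2D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The assignment step relies only on `math.dist`, the Euclidean distance, which is why the scale of each column matters so much.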


Why normalization is necessary


k-means assigns points to clusters based on the distance between data points in the coordinate space. Therefore, if only one column has large absolute values, differences in that column produce large distances even when the relative spread is the same as in the other columns, and that column dominates the cluster assignment.

Even for the same 10% difference, the distance between points differs greatly depending on the scale of the column.
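The 10% point can be checked with a few lines of arithmetic. The values below are hypothetical: one column with values around 1 and another with values around 2000.

```python
import math

# Hypothetical point: a small-scale column (about 1) and a large-scale column (about 2000).
base = (1.0, 2000.0)
small_changed = (1.1, 2000.0)  # +10% in the small-scale column only
large_changed = (1.0, 2200.0)  # +10% in the large-scale column only

# Euclidean distance from the base point in each case.
d_small = math.dist(base, small_changed)
d_large = math.dist(base, large_changed)
print(d_small, d_large)
```

The same relative change moves the point by about 0.1 in one case and by 200 in the other, so the large-scale column decides which centroid is nearest.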

In the data I was actually using, columns A, C, and D ranged from 0 to 2, while column B ranged from 1000 to 3000. When I forgot to standardize and ran the clustering, the result seemed to be determined by column B alone. I'm glad I noticed that something was wrong this time.
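As a rough sketch of the two usual rescalings, the functions below apply min-max normalization and z-score standardization to a single column. The function names and the sample values are made up for illustration and only mimic the ranges described above. Libraries such as scikit-learn provide equivalent scalers (`MinMaxScaler`, `StandardScaler`) that are likely more convenient in practice.

```python
import statistics


def min_max_normalize(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    # Note: a constant column (hi == lo) would divide by zero here.
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Rescale values to mean 0 and standard deviation 1 (z-score)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(x - mean) / std for x in column]


# Made-up sample values in the ranges described above.
col_a = [0.1, 0.5, 1.2, 1.9]
col_b = [1000.0, 1800.0, 2500.0, 3000.0]

std_a = standardize(col_a)
std_b = standardize(col_b)
norm_b = min_max_normalize(col_b)
print(std_a)
print(std_b)
print(norm_b)
```

After either rescaling, columns A and B span comparable ranges, so neither one dominates the distance calculation by scale alone.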



Conclusion


Be sure to normalize and standardize the data before performing k-means clustering, and always check the clustering result.
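One simple way to check a result, sketched here under assumptions, is to compare the per-column means of each cluster. If the clusters differ in only one column, that column is probably dominating the distance. The helper name `cluster_column_means`, the rows, and the labels are all hypothetical.

```python
from collections import defaultdict


def cluster_column_means(rows, labels):
    """Return {label: [mean of each column]} for the rows in each cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }


# Made-up rows (columns A, B) and labels such as a clustering on raw data might produce.
rows = [(0.2, 1100.0), (1.8, 1200.0), (0.3, 2900.0), (1.7, 2800.0)]
labels = [0, 0, 1, 1]
means = cluster_column_means(rows, labels)
print(means)
```

In this made-up case the two clusters have nearly the same mean in column A and very different means in column B, which is the pattern I would treat as a warning sign.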
