[python] When performing k-means clustering, normalize and standardize the data.
Overview
While working on a machine learning problem, I tried k-means clustering on my data. However, the result seemed to depend far too much on the values of a single column. To prevent this, the columns must be brought to a common scale before clustering.
What is k-means clustering
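k-means clustering is an algorithm that divides data into a number of clusters. It creates the number of cluster centroids specified in advance, assigns each data point to the cluster with the nearest centroid, and repeats this while moving each centroid to the mean of its assigned points.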
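To make the procedure concrete, here is a minimal sketch of k-means using only the Python standard library. The function name `kmeans`, its parameters, and the toy points are my own choices for illustration, not a reference implementation. In practice a library such as scikit-learn would normally be used instead.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Start from k data points chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy data (made up): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Note that the nearest centroid is chosen with a plain Euclidean distance (`math.dist`), so every column contributes in its raw units.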
Why normalization is necessary
k-means assigns points to clusters based on the distance between data points in the coordinate space. Therefore, if only one column has large absolute values, differences along that column produce large distances even when its relative spread is the same as the other columns, and that column ends up dominating the cluster assignment.
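A small calculation shows the effect. The rows below are made-up values, assuming three columns on a scale of about 0 to 2 and one column on a scale of about 1000 to 3000. A change across the full range of a small column moves the point far less than a 5% change in the large column.

```python
import math

# Hypothetical rows (A, B, C, D): A, C, D lie in [0, 2], B lies in [1000, 3000].
p = (0.0, 2000.0, 1.0, 1.0)
q = (2.0, 2000.0, 1.0, 1.0)  # differs from p by the full range of A
r = (0.0, 2100.0, 1.0, 1.0)  # differs from p by 5% of the range of B

dist_pq = math.dist(p, q)  # 2.0
dist_pr = math.dist(p, r)  # 100.0

print(dist_pq, dist_pr)
```

As far as the Euclidean distance is concerned, `r` is fifty times farther from `p` than `q` is, even though `q` differs by a much larger share of its column's range.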
In the data I was actually using, columns A, C, and D ranged from 0 to 2, while column B ranged from 1000 to 3000. When I forgot to standardize and ran the clustering, the result appeared to be driven almost entirely by column B. I'm glad I noticed something was wrong this time.
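The two usual ways to put columns on a common scale can be sketched as follows. The helper names `standardize` and `min_max_normalize` and the sample values are assumptions for illustration. They are not the actual data or the exact code I used.

```python
import statistics

def standardize(column):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # Assumes the column is not constant; otherwise std is 0.
    return [(x - mean) / std for x in column]

def min_max_normalize(column):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    # Assumes the column is not constant; otherwise hi - lo is 0.
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical columns on very different scales.
col_a = [0.0, 0.5, 1.0, 1.5, 2.0]
col_b = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]

norm_a = min_max_normalize(col_a)
norm_b = min_max_normalize(col_b)
std_a = standardize(col_a)
std_b = standardize(col_b)
```

After either transformation the two columns cover the same range, so neither dominates the distance. With scikit-learn, `StandardScaler` and `MinMaxScaler` in `sklearn.preprocessing` apply the same kind of scaling column by column.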
Conclusion
Be sure to normalize and standardize before performing k-means clustering. And be sure to check the result of the clustering.