python - 计算k均值的方差百分比？ - 码农俱乐部 - Golang中国

在Wikipedia page上，描述了确定k-means中簇数的肘形方法。The built-in method of scipy提供了一个实现，但我不确定我是否了解如何计算它们所称的失真。
更准确地说，如果你用图表来表示
根据集群的数量，第一个集群将
添加很多信息（解释很多差异），但在某一点上
边际增益将下降，在图中给出一个角度。
假设我有以下点及其相关的质心，那么计算这个度量的一个好方法是什么？

points = numpy.array([[ 0,  0],
       [ 0,  1],
       [ 0, -1],
       [ 1,  0],
       [-1,  0],
       [ 9,  9],
       [ 9, 10],
       [ 9,  8],
       [10,  9],
       [10,  8]])

kmeans(pp,2)
(array([[9, 8],
   [0, 0]]), 0.9414213562373096)

我特别关注的是0.94的计算。只测量点和质心。我不确定是否可以使用scipy的任何内置方法，或者我必须自己写。对于如何有效地为大量的点做这件事有什么建议吗？
简而言之，我的问题（所有相关）如下：
给定一个距离矩阵和一个点所属的映射
集群，计算一个可以使用的度量的好方法是什么？
绘制肘形图？
如果使用不同的距离函数，如余弦相似性，该方法将如何改变？
编辑2：失真

from scipy.spatial.distance import cdist
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))

第一组点的输出是精确的。但是，当我尝试不同的设置时：

>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
       [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(points, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797

我猜最后一个值不匹配，因为kmeans似乎在按数据集中的点数潜水。
编辑1：百分比差异
到目前为止，我的代码（应该添加到Denis的k-means实现中）：

centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)

    indices = [i for i,x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())

下面是k=2的输出：

Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437

在我的真实数据集上（对我来说不合适！）：

Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

最佳答案：

对于Kmeans而言，畸变被用作停止标准（如果两次迭代之间的变化小于某个阈值，我们假设收敛）。
如果要从一组点和质心计算它，可以执行以下操作（代码在matlab中使用pdist2函数，但在python/numpy/scipy中重写应该很简单）：

% data
X = [0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

% centroids
C = [9 8 ; 0 0];

% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');

% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);

结果是：

% total distortion
>> sum(distortions)
ans =
           9.4142135623731

编辑1：
我有时间玩这个……下面是应用于'Fisher Iris Dataset'的kmeans集群示例（4个功能，150个实例）。我们迭代k=1..10，绘制肘形曲线，选取K=3作为簇数，并显示结果的散点图。
注意，在给定点和质心的情况下，我包含了许多计算簇内方差（畸变）的方法。scipy.cluster.vq.kmeans函数默认返回此度量值（使用欧几里德作为距离度量值计算）。您也可以使用scipy.spatial.distance.cdist函数来计算距离，使用您选择的函数（前提是您使用相同的距离测量获得了簇形心：@Denis有解决方案），然后从中计算畸变。

import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

##### cluster data into K=1..10 clusters #####
K = range(1,10)

# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM]   # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares

# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

##### plot ###
kIdx = 2

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()

plt.show()

编辑2：
作为对这些评论的回应，我在下面给出了另一个完整的示例，使用“NIST hand-written digits dataset”：它有1797个数字图像，从0到9，每个数字大小为8×8像素。我重复上面的实验，稍加修改：Principal Components Analysis用于将维数从64减少到2：

import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from matplotlib import pyplot as plt
from matplotlib import cm

##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']

# perform PCA dimensionality reduction
pca = RandomizedPCA(n_components=2).fit(data['data'])
X = pca.transform(data['data'])

##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)

KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]

tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
betweenss = totss - tot_withinss          # The between-cluster sum of squares

##### plots #####
kIdx = 9        # K=10
clr = cm.spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )

# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])

plt.show()

您可以看到一些集群实际上是如何对应于可分辨数字的，而其他集群则不匹配单个数字。
注：K-means的实现包括在scikit-learn中（以及许多其他群集算法和各种clustering metrics中）。Here是另一个类似的例子。