How to make SGDClassifier reflect uncertainty

Question

How do you get sklearn's SGDClassifier to report uncertainty in its predictions?

I'm trying to confirm that SGDClassifier reports a 50% probability for input data that does not correspond strictly to either label. However, I'm finding that the classifier is always 100% certain.

I'm testing this with the following script:

from sklearn.linear_model import SGDClassifier

c = SGDClassifier(loss="log")
#c = SGDClassifier(loss="modified_huber")

X = [
    # always -1
    [1,0,0],
    [1,0,0],
    [1,0,0],
    [1,0,0],

    # always +1
    [0,0,1],
    [0,0,1],
    [0,0,1],
    [0,0,1],

    # uncertain
    [0,1,0],
    [0,1,0],
    [0,1,0],
    [0,1,0],
    [0,1,0],
    [0,1,0],
    [0,1,0],
    [0,1,0],
]
y = [
    -1,
    -1,
    -1,
    -1,
    +1,
    +1,
    +1,
    +1,

    -1,
    +1,
    -1,
    +1,
    -1,
    +1,
    -1,
    +1,
]

def lookup_prob_class(c, dist):
    a = sorted(zip(dist, c.classes_))
    best_prob, best_class = a[-1]
    return best_prob, best_class

c.fit(X, y)

probs = c.predict_proba(X)
print('probs:')
for dist, true_value in zip(probs, y):
    prob, value = lookup_prob_class(c, dist)
    print('%.02f' % prob, value, true_value)

As you can see, my training data always associates -1 with the input [1,0,0] and +1 with [0,0,1], while [0,1,0] is split 50/50 between the two labels.

So for the input [0,1,0], I would expect predict_proba() to return 0.5. Instead, it reports 100% probability. Why is that, and how can I fix it?

Interestingly, swapping SGDClassifier for DecisionTreeClassifier or RandomForestClassifier does produce the output I expect.
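For reference, a minimal sketch of that comparison, using the same data shape as above (variable names here are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Same structure as the training data above: [1,0,0] is always -1,
# [0,0,1] is always +1, and [0,1,0] is an even 50/50 split.
X = [[1, 0, 0]] * 4 + [[0, 0, 1]] * 4 + [[0, 1, 0]] * 8
y = [-1] * 4 + [+1] * 4 + [-1, +1] * 4

tree = DecisionTreeClassifier()
tree.fit(X, y)

# The leaf holding the [0,1,0] samples contains 4 of each class,
# so the tree reports the empirical class frequencies: 0.5 / 0.5.
probs = tree.predict_proba([[0, 1, 0]])[0]
print(probs)  # roughly [0.5, 0.5]
```

A decision tree reports raw class frequencies at each leaf, so it reflects this kind of label ambiguity directly, whereas the logistic model's probabilities depend on the learned weights.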

Best answer

It does show some uncertainty:

>>> c.predict_proba(X)
array([[  9.97254333e-01,   2.74566740e-03],
       [  9.97254333e-01,   2.74566740e-03],
       [  9.97254333e-01,   2.74566740e-03],
       [  9.97254333e-01,   2.74566740e-03],
       [  1.61231111e-06,   9.99998388e-01],
       [  1.61231111e-06,   9.99998388e-01],
       [  1.61231111e-06,   9.99998388e-01],
       [  1.61231111e-06,   9.99998388e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01],
       [  1.24171982e-04,   9.99875828e-01]])

If you want the model to be more uncertain, you have to regularize it more strongly. That's done by adjusting the alpha parameter:

>>> c = SGDClassifier(loss="log", alpha=1)
>>> c.fit(X, y)
SGDClassifier(alpha=1, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
       loss='log', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5,
       random_state=None, shuffle=False, verbose=0, warm_start=False)
>>> c.predict_proba(X)
array([[ 0.58782817,  0.41217183],
       [ 0.58782817,  0.41217183],
       [ 0.58782817,  0.41217183],
       [ 0.58782817,  0.41217183],
       [ 0.53000442,  0.46999558],
       [ 0.53000442,  0.46999558],
       [ 0.53000442,  0.46999558],
       [ 0.53000442,  0.46999558],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761],
       [ 0.55579239,  0.44420761]])

alpha is a penalty on large feature weights, so the higher alpha is, the less the weights can grow, the less extreme the linear model's values become, and the closer the logistic probability estimates get to ½. Typically, this parameter is tuned using cross-validation.