Why is there this inconsistency between a DataFrame and its columns?

While debugging a nasty bug in my code, I ran into the following, which looks like an inconsistency in how DataFrames work (using pandas 1.0.3):

import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]

Then I tried to add a column to y (forgetting that y is a Series, not a DataFrame):

y['d'] = df['d']

I'm now aware that this adds a weird row to the Series; y is now:

0                                                   11
1                                                   11
2                                                   11
3                                                   11
d    0     0
1    10
2    20
3    30
Name: d, dtype...
Name: k, dtype: object

But the strange thing is that now:

>>> df.shape, df['k'].shape
((4, 4), (5,))

And df and df['k'] look like:

    d   k  c1  c2
0   0  11  22  33
1  10  11  22  33
2  20  11  22  33
3  30  11  22  33

0                                                   11
1                                                   11
2                                                   11
3                                                   11
d    0     0
1    10
2    20
3    30
Name: d, dtype...
Name: k, dtype: object
Comments
  • et_et
    et_et replied:

    There are a few things at play here:

    • A pandas Series can store objects of arbitrary types.

    • y['d'] = _ adds a new entry to the Series y under the label 'd'.

    • Thus, y['d'] = df['d'] adds a new entry to y whose label is 'd' and whose value is the whole Series df['d'].
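The first two points can be seen in isolation with a small sketch. The names `mixed` and `s` and the sample values are made up for illustration and are not from the question; the sketch only assumes that pandas is installed.

```python
import pandas as pd

# A Series whose elements have different types falls back to dtype=object,
# so each slot can hold an arbitrary Python object.
mixed = pd.Series([11, "text", [1, 2, 3], {"a": 1}])
print(mixed.dtype)        # object

# Assigning to a label that is not yet in the index does not raise;
# it appends a new entry under that label.
s = pd.Series([11, 11, 11, 11], name="k")
s["d"] = "extra"
print(len(s))             # 5
print(s["d"])             # extra
```

The same enlargement is what happens in the question, except that the appended value is itself a Series.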

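    So you have added a Series as the last entry of the Series y. df['k'] shows the same five entries presumably because y = df.k is the very Series object that df caches for that column (at least in the pandas 1.0.3 used here), so the in-place change is visible through df['k'] even though df itself keeps its four rows. You can verify that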

    • (y['d'] == y.iloc[-1]).all() == True and

    • (y.iloc[-1] == df['d']).all() == True.
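If the intent was to keep d next to k as a second column, one option is to make y a one-column DataFrame first. This is a sketch of an approach that is not part of the original reply; it reuses the df from the question.

```python
import pandas as pd

df = pd.DataFrame([[10 * k, 11, 22, 33] for k in range(4)],
                  columns=["d", "k", "c1", "c2"])

# Double brackets select a one-column DataFrame rather than a Series;
# .copy() makes y independent of df.
y = df[["k"]].copy()

# On a DataFrame this is an ordinary column assignment, aligned on the index.
y["d"] = df["d"]

print(y.shape)            # (4, 2)
print(df["k"].shape)      # (4,)
```

Because y is an independent DataFrame here, df['k'] keeps its original four entries.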