Fastest, most efficient way to aggregate a large dataset in Python

Suppose I am measuring the speed of a car moving along a single axis, with a new reading every 10 minutes.

I have a column in my DataFrame called delta_x, which contains how far the car moved along the axis in the last 10 minutes. The values are integers only.

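Now, let's say I want to aggregate my data so that I have only one movement value per hour. My dataset is very large, so I want to optimize the code as much as possible. What is the most efficient way to achieve this?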

df.head(9)

    date        time    delta_x
0   01/01/2018  00:00   9
1   01/01/2018  00:10   9
2   01/01/2018  00:20   9
3   01/01/2018  00:30   9
4   01/01/2018  00:40   11
5   01/01/2018  00:50   12
6   01/01/2018  01:00   10
7   01/01/2018  01:10   10
8   01/01/2018  01:20   10

Currently, my solution is to do the following:

import os
from datetime import datetime
import pandas as pd

for file in os.listdir('temp'):
    if file.endswith('.txt'):
        # Files are whitespace-delimited with no header row
        df = pd.read_csv(os.path.join('temp', file), header=None, sep=r'\s+')
        df.columns = ['date', 'time', 'delta_x']
        # Parse each "HH:MM" string and keep only the hour
        df['hour'] = [datetime.strptime(x, "%H:%M").hour for x in df['time'].values]
        df = df.groupby(['date', 'hour']).agg({'delta_x': 'sum'})

Which outputs the correct result:


date        hour   delta_x
01/01/2018  0      59

But I would like to know whether there is a better, faster, and more efficient way to do this, perhaps using NumPy?
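One direction that might be worth trying is sketched below. It replaces the per-row strptime call with a vectorized string slice and then sums delta_x per (date, hour) group. This is only a sketch under two assumptions: the time strings are always zero-padded "HH:MM", and the sample rows (copied from df.head(9) above) stand in for one input file. Whether it is actually faster should be measured on the real data.

```python
import io

import pandas as pd

# Sample rows copied from the question; in practice each file in ./temp
# would be read instead of this in-memory buffer (an assumption for the demo).
raw = io.StringIO(
    "01/01/2018 00:00 9\n"
    "01/01/2018 00:10 9\n"
    "01/01/2018 00:20 9\n"
    "01/01/2018 00:30 9\n"
    "01/01/2018 00:40 11\n"
    "01/01/2018 00:50 12\n"
    "01/01/2018 01:00 10\n"
    "01/01/2018 01:10 10\n"
    "01/01/2018 01:20 10\n"
)

df = pd.read_csv(raw, header=None, sep=r"\s+",
                 names=["date", "time", "delta_x"])

# Vectorized hour extraction: take the "HH" prefix of every time string
# at once instead of calling datetime.strptime row by row.
df["hour"] = df["time"].str.slice(0, 2).astype(int)

# One sum per (date, hour); sort=False skips sorting the group keys.
hourly = (
    df.groupby(["date", "hour"], sort=False)["delta_x"]
      .sum()
      .reset_index()
)
print(hourly)
```

If the files are known to contain exactly six readings per hour with no gaps, reshaping the delta_x values with NumPy and summing along one axis could be another option, but that relies on an assumption the question does not confirm.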