使用多处理并行对Pandas DataFrame进行排序

I have a large pandas DataFrame that I’d like to divide into smaller DFs and then sort each of them in parallel using multiprocessing.

我的代码的简化版本如下:

from multiprocessing import Process

import numpy as np
import pandas as pd


def sort_df(df):
    df.sort_values(by=["b"], inplace=True)
    print(df)


if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(20, 3), columns=['a', 'b', 'c'])
    gb = df.groupby(pd.cut(df["b"], 4))
    # copy() is to suppress SettingWithCopyWarning
    partitioned_dfs = [gb.get_group(g).copy() for g in gb.groups]

    procs = []
    for df in partitioned_dfs:
        proc = Process(target=sort_df, args=(df,))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

当我使用大型DF作为输入时,似乎我的代码使用了太多的内存。

  1. The partitioned_dfs takes up an equally large amount of memory as the input DF;
  2. When spawning a new process, since a child process is created as a copy of the parent process, it also has a copy of the input DF and partitioned_dfs in its address space, although it is responsible for only a subset of the entire DF.

我想知道如何减少代码的内存使用量。多处理在这里合适吗?

评论