Improving performance when processing large files - Python

I know there are a few questions on this topic but I can't seem to get it going efficiently. I have large input datasets (2-3 GB) running on my machine, which has 8 GB of memory. I'm using a version of Spyder with pandas 0.24.0 installed. I initially ran a script over the input file and it took around an hour to generate an output file of around 10 MB.

I then attempted to optimise the process by chunking the input file using the code below. Essentially, I split the input file into smaller segments, run each through some code, and export the smaller output. I then delete the chunked data to release memory. But memory usage still builds throughout the operation, and it ends up taking a similar amount of time. I'm not sure what I'm doing wrong:

import time

import numpy as np
import pandas as pd

# Because the column headers change from file to file, I use location
# indexing to select the column headers I need
df_cols = pd.read_csv('file.csv')
# Keep only the columns to be used
df_cols = df_cols.iloc[:, np.r_[1, 3, 8, 12, 23]]
# Save the column headers
cols_to_keep = df_cols.columns
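As a side note on the header-reading step above: a plain `pd.read_csv('file.csv')` parses the entire multi-GB file just to get the column names. A minimal sketch of reading only the header row with `nrows=0` (using a small in-memory CSV here as a stand-in for the real file):

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the real multi-GB file; only the header line matters here.
csv_data = io.StringIO("a,b,c,d,e\n1,2,3,4,5\n")

# nrows=0 parses only the header row, so no data rows are loaded into memory.
df_cols = pd.read_csv(csv_data, nrows=0)

# Positional selection of the wanted columns, as in the snippet above.
cols_to_keep = df_cols.columns[np.r_[0, 2, 4]]
print(list(cols_to_keep))  # ['a', 'c', 'e']
```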


PATH = '/Volume/Folder/Event/file.csv'
chunksize = 10000

df_list = []  # list to hold the processed chunks

t = time.perf_counter()  # time.clock() was removed in Python 3.8
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Measure time taken to process each chunk
    print("chunk processing time:", time.perf_counter() - t)

    # Execute func1
    df1 = func1(df_chunk)

    # Execute func2
    df2 = func2(df1)

    # Append the processed chunk to the list
    df_list.append(df2)

# Merge all dataframes into one dataframe
df = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
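For reference, here is a self-contained sketch of the chunk-process-append pattern I am trying to achieve, with a dummy `process` function standing in for `func1`/`func2` and in-memory buffers standing in for the real files. Each processed chunk is written straight to the output, so nothing accumulates in a list:

```python
import io

import pandas as pd

# Dummy stand-in for func1/func2: keep rows where column 'a' is even.
def process(chunk):
    return chunk[chunk['a'] % 2 == 0]

# Stand-ins for the real input and output files.
src = io.StringIO("a,b\n" + "\n".join(f"{i},{i * i}" for i in range(10)))
out = io.StringIO()

for i, chunk in enumerate(pd.read_csv(src, chunksize=3)):
    # Write the header only for the first chunk, then append data rows.
    process(chunk).to_csv(out, index=False, header=(i == 0))

# Round-trip to check the result.
out.seek(0)
result = pd.read_csv(out)
print(result['a'].tolist())  # [0, 2, 4, 6, 8]
```

With a real output file you would open it once outside the loop, or call `to_csv(path, mode='a', header=(i == 0))` on each iteration.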