I know there are a few questions on this topic, but I can't seem to get it working efficiently. I have large input datasets (2-3 GB) and my machine has 8 GB of memory. I'm using Spyder with pandas 0.24.0 installed. Running my original script over one input file took around an hour to generate an output file of around 10 MB.
I then attempted to optimise the process by chunking the input file using the code below. Essentially, I chunk the input file into smaller segments, run each segment through some code, and export the smaller output. I then delete the chunked data to release memory, but memory usage still builds up throughout the operation and the run ends up taking a similar amount of time. I'm not sure what I'm doing wrong:
import time
import numpy as np
import pandas as pd

PATH = '/Volume/Folder/Event/file.csv'

# Because the column headers change from file to file I use location indexing
# to read the col headers I need
df_cols = pd.read_csv(PATH)
# Read cols to be used
df_cols = df_cols.iloc[:, np.r_[1, 3, 8, 12, 23]]
# Keep the selected col headers
cols_to_keep = df_cols.columns

chunksize = 10000
df_list = []  # list to hold the processed chunks

t = time.time()
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Execute func1
    df1 = func1(df_chunk)
    # Execute func2
    df2 = func2(df1)
    # Append the processed chunk to the list; everything is merged after the loop
    df_list.append(df2)
    # Measure time taken to read and process this chunk
    print("chunk read + processing time:", time.time() - t)
    t = time.time()
# Merge all dataframes into one dataframe
df = pd.concat(df_list)
# Delete the dataframe list to release memory
del df_list
del df_chunk
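To clarify the "export the smaller output" part: instead of collecting every processed chunk in df_list and concatenating at the end, I was also considering appending each chunk's result straight to the output file, roughly like the sketch below (a minimal sketch only, reusing PATH, chunksize and cols_to_keep from above; output.csv is just a placeholder name, and I'm assuming func1 and func2 each take and return a DataFrame):

import pandas as pd

OUT_PATH = 'output.csv'  # placeholder output file name

first_chunk = True
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    df2 = func2(func1(df_chunk))
    # Append each processed chunk to the output file instead of keeping
    # every chunk in memory; write the header only for the first chunk
    df2.to_csv(OUT_PATH, mode='a', header=first_chunk, index=False)
    first_chunk = False

Would streaming the output out per chunk like this actually keep memory flat, or is the build-up I'm seeing coming from somewhere else?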