Python: huge time difference between writelines() and write()

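I am writing a script that reads a folder of files (each ranging from 20 MB to 100 MB in size), modifies some data in each line, and writes the result back to a copy of the file.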

with open(inputPath, 'r+') as myRead:
     my_list = myRead.readlines()
     new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
     tempT = time.time()
     myWrite.writelines('\n'.join(new_my_list) + '\n')
     print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

On running this code with a 90 MB file (~900,000 lines), it printed 140 seconds as the time taken to write to the file. Here I used writelines(). So I searched for different ways to improve file writing speed, and in most of the articles that I read, it said write() and writelines() should not show any difference since I am writing a single concatenated string. I also checked the time taken for only the following statement:

new_string = '\n'.join(new_my_list) + '\n'

It took only 0.4 seconds, so the long write time was not caused by building the joined string. Just to try out write(), I tried this code:

with open(inputPath, 'r+') as myRead:
     my_list = myRead.readlines()
     new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
     tempT = time.time()
     myWrite.write('\n'.join(new_my_list) + '\n')
     print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

And it printed 2.5 seconds. Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data? Is this normal behaviour or is there something wrong in my code? The output file seems to be the same for both cases, so I know that there is no loss in data.

Best answer

file.writelines() expects an iterable of strings. It then loops over that iterable and calls file.write() for each string. Expressed in Python, the method is equivalent to this:

def writelines(self, lines):
    for line in lines:
        self.write(line)

You are passing in a single large string, and a string is an iterable of strings too. When iterating you get individual characters, strings of length 1. So in effect you are making len(data) separate calls to file.write(). And that is slow, because you are building up a write buffer a single character at a time.
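The effect described above can be checked with a small sketch. CountingWriter and writelines_sketch are hypothetical names introduced only for this illustration; the loop mirrors the equivalent Python shown earlier rather than the real file implementation, and the three sample lines are made up.

```python
class CountingWriter:
    """Hypothetical stand-in for a file: records every write() call."""

    def __init__(self):
        self.calls = 0
        self.parts = []

    def write(self, s):
        self.calls += 1
        self.parts.append(s)


def writelines_sketch(f, lines):
    # Same loop as the Python equivalent of file.writelines()
    for line in lines:
        f.write(line)


lines = ["alpha", "beta", "gamma"]
data = "\n".join(lines) + "\n"

# Passing one big string: iteration yields single characters.
as_string = CountingWriter()
writelines_sketch(as_string, data)

# Passing a list of lines: one write() per line.
as_list = CountingWriter()
writelines_sketch(as_list, [line + "\n" for line in lines])

print(as_string.calls, as_list.calls)
```

Both writers end up holding the same text, but the single-string version needs one write() call per character, while the list version needs one per line.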

Don't pass in a single string to file.writelines(). Pass in a list or tuple or other iterable instead.

You can pass in the individual lines and add the newlines in a generator expression, for example:

myWrite.writelines(line + '\n' for line in new_my_list)

Now, if you could make clean_data() a generator, yielding cleaned lines, you could stream data from the input file, through your data cleaning generator, and out to the output file without using any more memory than is required for the read and write buffers and however much state is needed to clean your lines:

with open(inputPath, 'r+') as myRead, open(outPath, 'w+') as myWrite:
    myWrite.writelines(line + '\n' for line in clean_data(myRead))

In addition, I'd consider updating clean_data() to emit lines with newlines included.
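As a rough sketch of that last suggestion, a generator version of clean_data() could look like the following. The cleaning step here (trimming whitespace) is only a placeholder assumption, since the question does not show what clean_data() actually does, and io.StringIO objects stand in for the real input and output files.

```python
import io


def clean_data(lines):
    """Hypothetical generator: yields each cleaned line with its newline attached."""
    for line in lines:
        # Placeholder cleaning step (an assumption): trim surrounding whitespace.
        yield line.strip() + "\n"


# In-memory stand-ins for the files opened from inputPath and outPath.
source = io.StringIO("  a,1 \n b,2\n")
target = io.StringIO()

# Lines are read, cleaned, and written one at a time; no full list is built.
target.writelines(clean_data(source))
result = target.getvalue()
```

With the newline emitted by the generator itself, the call site reduces to myWrite.writelines(clean_data(myRead)).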