Applying a custom function to a large list takes a very long time

Question:

I have a list of 48,000 words and I am trying to group, for each word, the 4 words that are closest to it (fewer if 4 are not present). I am using the difflib module for this.

I had two approaches in mind: get the 4 closest matches using difflib.get_close_matches(), or build the Cartesian product of the words list with itself and compute a score for each tuple in the product.
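For context, a toy illustration of the two difflib primitives I mean (the words here are made up just to show the return values):

import difflib

difflib.SequenceMatcher(None, 'apple', 'apply').ratio()  # pairwise similarity in [0, 1]; here 0.8
difflib.get_close_matches('apple', ['apply', 'ample', 'zebra'], n=4, cutoff=0.4)  # ['apply', 'ample']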

I have code that works for smaller lists, but as the length of the list grows (48k in my case) it takes a huge amount of time. I am looking for a scalable solution to this problem.

Code to reproduce such a list:

import random, string, itertools, difflib
from functools import partial
N = 10  # 48,000 in my real case
random.seed(123)
words = [''.join(random.choice(string.ascii_lowercase) for i in range(5)) for j in range(N)]  # N random 5-letter words

My attempts:

1: Created a function that returns the score for each pair after building the Cartesian product. After that I can group by the first element and pick the top n as needed (sketched right after the snippet below).

def fun(x):
    return difflib.SequenceMatcher(None, *x).ratio()

products = list(itertools.product(words, words))
scores = list(map(fun, products))
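The grouping step I mention is not shown above, so here is a minimal sketch of what I have in mind, using words, products and scores from the snippet (top 4 to mirror attempt 2's n=4):

from collections import defaultdict

grouped = defaultdict(list)
for (a, b), score in zip(products, scores):
    if a != b:  # skip the trivial self-match (ratio 1.0)
        grouped[a].append((score, b))

top4 = {a: [b for _, b in sorted(pairs, reverse=True)[:4]]
        for a, pairs in grouped.items()}  # 4 highest-scoring neighbours per word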

2: A function that directly returns the best n (4) matches

f = partial(difflib.get_close_matches, possibilities=words, n=4, cutoff=0.4)
matches = list(map(f, words))  # gives up to 4 matches per word, if present

Both work for smaller lists, but as the list size increases they take far too long. So I tried to resort to multiprocessing:

Multiprocessing attempt 1:

Save the first function (fun) from attempt 1 in a .py file and then import it:

import multiprocessing
import fun  # the module holding fun() from attempt 1

if __name__ == '__main__':
    pool = multiprocessing.Pool(8)
    score_mlt_pr = pool.map(fun.fun, products)  # products is the Cartesian product, same as attempt 1
    scores_mlt = list(score_mlt_pr)

Multiprocessing attempt 2:

Using the same f from attempt 2 above, but with the pool:

close_matches = list(pool.map(f, words))

With multiprocessing the time taken does go down, but for a combination of 1000 * 48,000 words it still takes about an hour.
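One variant of attempt 1 I could try: stream the product lazily instead of materializing the full 48,000 x 48,000 list (about 2.3 billion tuples), and batch the inter-process traffic with a chunksize. A sketch, assuming the same fun module and words list as above; the chunksize value is an arbitrary starting point to tune:

import itertools, multiprocessing
import fun  # the module holding fun() from attempt 1

if __name__ == '__main__':
    with multiprocessing.Pool(8) as pool:
        pairs = itertools.product(words, words)  # lazy, no giant list in memory
        scores = list(pool.imap(fun.fun, pairs, chunksize=10_000))  # large batches cut IPC overhead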

I hope I have given a clear example of my problem. Please advise how I can speed up my code.
