Is there a faster way to iterate over a set and replace MWEs in a sentence? - Python

Question

The task is to group expressions that consist of several words, also known as multi-word expressions (MWEs).

Given a dictionary of MWEs, I need to add hyphens to the input sentence wherever an MWE is detected, for example:

**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .

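Currently, I loop over the whole dictionary, check whether each MWE appears in the sentence, and replace it when it does. But this wastes a lot of iterations, because most entries never occur in a given sentence.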

Is there a better way? One solution is to generate all possible n-grams of the sentence first, as in chunker2() below:

import codecs, re, time

# Load the MWE dictionary, one expression per line, into a set.
with codecs.open("wn-mwe-en.dic", "r", "utf8") as f:
    mwe_list = set(line.strip() for line in f)

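def chunker(sentence):
    for item in mwe_list:
        # Test both spellings explicitly; a bare `item or ...` is always true.
        if item in sentence or item.replace("-", " ") in sentence:
            mwe_item = '-'.join(item.split(" "))
            # Match the MWE whether its words are joined by hyphens or spaces.
            r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
            sentence = re.sub(r, mwe_item, sentence)
    return sentence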

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(len(tokens)):
        # j is an exclusive end index: starting at i + 2 keeps only n-grams
        # of two or more tokens, and len(tokens) + 1 includes the last token.
        for j in range(i + 2, len(tokens) + 1):
            nodes.append(" ".join(tokens[i:j]))

    # Only the n-grams that are also dictionary entries need replacing.
    intersect = mwe_list.intersection(nodes)

    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(" ", "-"))

    return sentence

s = "i have got an ace of diamonds in my wet suit ."

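# Time each call separately; time.clock() was removed in Python 3.8.
start = time.time()
print(chunker(s))
print(time.time() - start)

start = time.time()
print(chunker2(s))
print(time.time() - start)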

Best answer

I would try it like this:

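- For each sentence, build the set of n-grams up to a given maximum length (the length of the longest MWE in your list).
- Then simply compute mwe_ngrams.intersection(sentence_ngrams) and search/replace the matches.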

That way you don't waste time iterating over every item in the original set.
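The two steps above can be sketched roughly as follows. This is only an illustration, written for Python 3: the helper names `max_mwe_length` and `sentence_ngrams` and the small `sample_mwes` set are assumptions made for the example, not part of the original code, and the real entries would come from wn-mwe-en.dic.

```python
def max_mwe_length(mwes):
    """Number of tokens in the longest MWE (0 for an empty set)."""
    return max((len(m.split(" ")) for m in mwes), default=0)


def sentence_ngrams(tokens, max_len):
    """All n-grams of 2..max_len tokens, joined with single spaces."""
    ngrams = set()
    for i in range(len(tokens)):
        # The exclusive end index never runs more than max_len tokens past i.
        for j in range(i + 2, min(i + max_len, len(tokens)) + 1):
            ngrams.add(" ".join(tokens[i:j]))
    return ngrams


# Hypothetical sample entries standing in for the real dictionary.
sample_mwes = {"ace of diamonds", "wet suit", "in my opinion"}

tokens = "i have got an ace of diamonds in my wet suit .".split(" ")
candidates = sentence_ngrams(tokens, max_mwe_length(sample_mwes))
found = sample_mwes.intersection(candidates)
print(sorted(found))  # ['ace of diamonds', 'wet suit']
```

Capping the n-gram length keeps the number of candidates roughly proportional to the sentence length times the longest MWE, instead of growing quadratically with the sentence length.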

Here is a faster version of chunker2:

def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    nodes = set()

    for i in range(len_tokens):
        # j is an exclusive end index: starting at i + 2 keeps only n-grams
        # of two or more tokens, and len_tokens + 1 includes the last token.
        for j in range(i + 2, len_tokens + 1):
            nodes.add(' '.join(tokens[i:j]))

    # Intersect with the n-grams collected above (`nodes`, not `n`).
    intersect = mwe_list.intersection(nodes)

    for i in intersect:
        print(i)
        sentence = sentence.replace(i, i.replace(' ', '-'))

    return sentence
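
One caveat with the final loop: str.replace works on substrings rather than tokens, so an entry such as "wet suit" would also match inside "wet suitcase", and the set is iterated in arbitrary order when MWEs overlap. A possible variation, sketched here for Python 3 under the assumption that the dictionary entries are space-separated, joins the tokens directly and tries the longest window first. The function name `chunk_tokens`, its parameters, and `sample_mwes` are hypothetical and only serve the example.

```python
def chunk_tokens(sentence, mwes, max_len):
    """Join the tokens of each detected MWE with hyphens, scanning left to right."""
    tokens = sentence.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so longer MWEs win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if " ".join(tokens[i:i + n]) in mwes:
                out.append("-".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWE starts at this token; copy it unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical sample entries standing in for the real dictionary.
sample_mwes = {"ace of diamonds", "wet suit"}
print(chunk_tokens("i have got an ace of diamonds in my wet suit .", sample_mwes, 3))
# i have got an ace-of-diamonds in my wet-suit .
```

Each token starts at most max_len - 1 set lookups, so the work per sentence stays independent of the dictionary size, just as with the intersection approach.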