I have the following dataset:
ID Text
12 Coolest fan we’ve ever seen.
12 SHARE this with anyone you know who can use this tip!
31 Time for a Royal Celebration! Save the date.
54 The way to a sports fan’s heart? Behind-the-scenes content from their favourite teams.
419 Start asking your questions now for tomorrow’s LIVE Q&A on careers you can do without going to university.
451 Save the date, we’re hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned
I have used n-grams to analyse the text and the frequency of words/phrases.
from nltk import ngrams

text = df.Text.tolist()
list_n = []
for i in text:
    n_grams = ngrams(i.split(), 3)
    for grams in n_grams:
        list_n.append(grams)
list_n
Since I am interested in finding which texts use a particular word or sequence of words, I need to create an association between each text (i.e. its ID) and the n-grams it contains.
For example, I am interested in finding the texts which contain "Save the date", i.e. ID=31 and ID=451.
To find the n-grams for one single word, I have been using this:
def ngram_filter(col, word, n):
    tokens = col.split()
    all_ngrams = ngrams(tokens, n)
    filtered_ngrams = [x for x in all_ngrams if word in x]
    return filtered_ngrams
However, I do not know how to find the ID associated with each text, or how to match more than one word in the function above.
What should I do? Any ideas?
Please feel free to change the tags if needed. Thanks.
I don't have much experience with ngrams, but you could get what you want with str.contains, like:
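A minimal sketch of that approach, assuming the data sits in a pandas DataFrame `df` with `ID` and `Text` columns as in the question. The helper name `ids_containing` is hypothetical, and the sample rows are a subset of the question's data with plain apostrophes:

```python
import pandas as pd

# Sample rows taken from the question (ID, Text); a subset, for illustration only.
df = pd.DataFrame({
    "ID": [12, 12, 31, 54, 451],
    "Text": [
        "Coolest fan we've ever seen.",
        "SHARE this with anyone you know who can use this tip!",
        "Time for a Royal Celebration! Save the date.",
        "The way to a sports fan's heart? Behind-the-scenes content from their favourite teams.",
        "Save the date, we're hosting a fabulous & fun meetup at Coffee Bar Bryant on 9/20. Stay tuned",
    ],
})


def ids_containing(frame, phrase):
    """Return the IDs of rows whose Text contains `phrase` (hypothetical helper)."""
    # regex=False treats the phrase as a literal string, so characters such as
    # "&" or "?" need no escaping; case=False makes the match case-insensitive.
    mask = frame["Text"].str.contains(phrase, case=False, regex=False)
    return frame.loc[mask, "ID"].tolist()


matches = ids_containing(df, "Save the date")
print(matches)
```

Because this is a plain substring match, a single word such as "date" would also match inside "update". If whole-word matching matters, a regular expression with word boundaries would probably be a better fit than a literal search.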