I'm cleaning up a long list of noun-phrases for further text mining. They're supposed to be 1- or 2-word phrases, but some have /
in a conjunction. Here's what I've got:
library(tidyverse)
conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))
And here's what I want:
wanted <- tibble(usecase = c(1,1,2,2,3),
                 classes = c("Insulators/Insulation",
                             "Insulators/Insulation",
                             "Optic/light fiber",
                             "Optic/light fiber",
                             "Magnets"),
                 bigrams = c("Insulators", "Insulation",
                             "Optic fiber", "Light fiber", NA))
I have something that works, but it's both ugly and not scalable.
patternSplit <- function(class){
  regexs <- c("(?x) ^ (\\w+) / (\\w+) $",
              "(?x) ^ (\\w+) / (\\w+) \\s+ (\\w+) $")
  if(str_detect(class, regexs[1])){
    extr <- str_match(class, regexs[1])
    list(extr[1,2],
         extr[1,3])
  } else if(str_detect(class, regexs[2])){
    extr <- str_match(class, regexs[2])
    list(paste(extr[1,2], extr[1,4]),
         paste(extr[1,3], extr[1,4]))
  } else {
    list(NA_character_)
  }
}
anx <- conjuncts %>%
  mutate(bigrams = map(classes, patternSplit)) %>%
  unnest(cols = "bigrams") %>%
  unnest(cols = "bigrams")
This gives me what I want, but blecchh!
# A tibble: 5 x 3
  usecase classes               bigrams
    <int> <chr>                 <chr>
1       1 Insulators/Insulation Insulators
2       1 Insulators/Insulation Insulation
3       2 Optic/light fiber     Optic fiber
4       2 Optic/light fiber     light fiber
5       3 Magnets               NA
The top two problems: (1) I have to run the regex twice, once with str_detect to get the logical for the if / else and again with str_match to pull out the tokens. (2) I have to do the double unnest to unwind the list structure. And a smaller problem: (3) can I get out of if / else and into case_when or switch?
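One direction that might address problems (1) and (2), as a minimal sketch rather than a settled answer: str_match returns NA in its first column when there is no match, so that NA can stand in for the str_detect call. Returning a plain character vector instead of a list also seems to leave only one level to unnest. The name patternSplit2 is hypothetical, and I dropped the (?x) free-spacing flag from the regexes for brevity.

```r
library(tidyverse)

conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))

# Hypothetical rewrite of patternSplit: each regex runs once via str_match(),
# and a failed match is detected from the NA in the first column.
patternSplit2 <- function(class) {
  # Pattern 1: "word/word" yields the two words on their own
  m <- str_match(class, "^(\\w+)/(\\w+)$")
  if (!is.na(m[1, 1])) return(c(m[1, 2], m[1, 3]))

  # Pattern 2: "word/word head" yields "word head" for each alternative
  m <- str_match(class, "^(\\w+)/(\\w+)\\s+(\\w+)$")
  if (!is.na(m[1, 1])) return(paste(c(m[1, 2], m[1, 3]), m[1, 4]))

  # No pattern applies
  NA_character_
}

# The list column now holds character vectors, so one unnest() is enough
anx2 <- conjuncts %>%
  mutate(bigrams = map(classes, patternSplit2)) %>%
  unnest(cols = bigrams)
```

On the three example rows this should give the same five rows as the double-unnest version, but it still relies on an if chain, so it does not touch problem (3).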
I'll eventually be extending this to about a dozen patterns and use cases.
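For the dozen-pattern case, one possible shape is a table-driven dispatcher: keep the regexes in a character vector and the matching "builder" functions in a parallel list, then loop until the first pattern matches. This is only a sketch under assumptions. The names patterns, builders, and splitByRules are made up for illustration, and it assumes the first matching pattern should win.

```r
library(tidyverse)

conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))

# Hypothetical rule table: patterns[i] pairs with builders[[i]].
# Each builder receives the str_match() row: m[1] is the full match,
# m[2], m[3], ... are the capture groups.
patterns <- c("^(\\w+)/(\\w+)$",
              "^(\\w+)/(\\w+)\\s+(\\w+)$")
builders <- list(
  function(m) c(m[2], m[3]),
  function(m) paste(c(m[2], m[3]), m[4])
)

# Try each pattern in order; the first one that matches builds the result.
splitByRules <- function(class, patterns, builders) {
  for (i in seq_along(patterns)) {
    m <- str_match(class, patterns[i])
    if (!is.na(m[1, 1])) return(builders[[i]](m[1, ]))
  }
  NA_character_
}

anx3 <- conjuncts %>%
  mutate(bigrams = map(classes, function(x) splitByRules(x, patterns, builders))) %>%
  unnest(cols = bigrams)
```

Adding a use case then means appending one regex and one builder instead of another else-if branch. As far as I can tell, case_when is a less natural fit here because it expects every branch to return one value per input, whereas each split returns one or two phrases.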