Extracting a string between strings with or without spaces and punctuation

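I am trying to use str_extract to pull the word "Present" or "Absent" out of text strings in which the spacing and punctuation around a given pattern vary. Where is my logic going wrong?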

test<-c("as follows: ABC Staining Absent in Tissue","as follows: ABC:   StainingPresent in Tissue","as follows: ABC:   Staining Present in Tissue","as follows ABC Staining Present in Tissue extra words here in Present")

library(stringr)
pattern<-"(?<=ABC[:]|[\\s]* Staining ).*(?=in)"
unique(str_extract(string = test, pattern))
cvelit

You may use stringr::str_match:

test<-c("as follows: ABC Staining Absent in Tissue","as follows: ABC:   StainingPresent in Tissue","as follows: ABC:   Staining Present in Tissue","as follows ABC Staining Present in Tissue extra words here in Present")

library(stringr)
pattern<-"ABC[:\\s]*Staining[:\\s]*(.*?)(?=\\s*\\bin\\b)"
unique(str_match(test, pattern)[,2])
## => [1] "Absent"  "Present"

See the R demo online and the regex demo.
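
If stringr is not available, roughly the same capture can be done in base R. This is a minimal sketch, assuming PCRE mode (perl = TRUE) so that the lazy quantifier and the lookahead are supported; the variable names m and captured are only illustrative.

```r
test <- c("as follows: ABC Staining Absent in Tissue",
          "as follows: ABC:   StainingPresent in Tissue",
          "as follows: ABC:   Staining Present in Tissue",
          "as follows ABC Staining Present in Tissue extra words here in Present")

pattern <- "ABC[:\\s]*Staining[:\\s]*(.*?)(?=\\s*\\bin\\b)"

# regexec() records the position of the whole match and of each capture group;
# regmatches() turns those positions into one character vector per input string
m <- regmatches(test, regexec(pattern, test, perl = TRUE))

# Element 1 is the whole match, element 2 is Group 1 (NA when there is no match)
captured <- vapply(m, function(x) x[2], character(1))
unique(captured)
```

As with str_match, the value of interest is the capture group rather than the whole match, which is why the second element is taken.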

Details

  • ABC - the literal string ABC
  • [:\s]* - zero or more colons or whitespace characters (the backslash is doubled, [:\\s]*, inside the R string literal)
  • Staining - the literal string Staining
  • [:\s]* - zero or more colons or whitespace characters
  • (.*?) - Group 1: zero or more characters other than line break characters, as few as possible
  • (?=\s*\bin\b) - a positive lookahead that requires zero or more whitespace characters followed by the whole word in immediately to the right of the current location.
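
To see why the "as few as possible" part of Group 1 matters, the lazy pattern can be compared with a greedy variant on the same test vector. This is only an illustrative sketch: the greedy pattern and the names lazy, greedy, lazy_hits and greedy_hits are not part of the answer above.

```r
library(stringr)

test <- c("as follows: ABC Staining Absent in Tissue",
          "as follows: ABC:   StainingPresent in Tissue",
          "as follows: ABC:   Staining Present in Tissue",
          "as follows ABC Staining Present in Tissue extra words here in Present")

lazy   <- "ABC[:\\s]*Staining[:\\s]*(.*?)(?=\\s*\\bin\\b)"  # pattern from the answer
greedy <- "ABC[:\\s]*Staining[:\\s]*(.*)(?=\\s*\\bin\\b)"   # hypothetical greedy variant

# Column 2 of the str_match() result holds Group 1
lazy_hits   <- str_match(test, lazy)[, 2]
greedy_hits <- str_match(test, greedy)[, 2]

# The lazy group stops at the first whole word "in" in every string
unique(lazy_hits)

# In the last string the greedy group runs on to the final whole word "in";
# str_trim() removes the whitespace the greedy group picks up before "in"
str_trim(greedy_hits[4])
```

The last test string contains the word in twice, so only the lazy quantifier keeps the capture down to the single word after Staining.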