Filter text and store the filtered sentences/paragraphs in a new column

I am trying to extract some sentences from text data. Specifically, I want the sentences that contain the phrase "medical device company released". I can run the following code:

df_text <- unlist(strsplit(df$TD, "\\."))
df_text

df_text <- df_text[grep(pattern = "medical device company released", df_text, ignore.case = TRUE)]
df_text

This gives me:

[1] "\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday"

So I extracted the sentences that contain the phrase "medical device company released". However, I want to store the results in a new column, so that I can see which grp each sentence came from.

Expected output:

grp     TD     newCol
3613    text   NA                                # does not contain the sentence
4973    text   medical device company released
5570    text   NA                                # does not contain the sentence

Data:

df <- structure(list(grp = c("3613", "4973", "5570"), TD = c(" Wal-Mart plans to add an undisclosed number of positions in areas including its store-planning operation and New York apparel office.\n\nThe moves, which began Tuesday, are meant to \"increase operational efficiencies, support our strategic growth plans and reduce overall costs,\" Wal-Mart spokesman David Tovar said.\n\nWal-Mart still expects net growth of tens of thousands of jobs at the store level this year, Tovar said.\n\nThe reduction in staff is hardly a new development for retailers, which have been cutting jobs at their corporate offices as they contend with the down economy. Target Corp. (TGT), Saks Inc. (SKS) and Best Buy Co. (BBY) are among retailers that have said in recent weeks they plan to pare their ranks.\n\nTovar declined to say whether the poor economy was a factor in Wal-Mart's decision.\n\nWal-Mart is operating from a position of comparative strength as one of the few retailers to consistently show positive growth in same-store sales over the past year as the recession dug in.\n\nWal-Mart is \"a fiscally responsible company that will manage its capital structure appropriately,\" said Todd Slater, retail analyst at Lazard Capital Markets.\n\nEven though Wal-Mart is outperforming its peers, the company \"is not performing anywhere near peak or optimum levels,\" Slater said. \"The consumer has cut back significantly.\"\n\nWal-Mart indicated it had regained some footing in January, when comparable-store sales rose 2.1%, after a lower-than-expected 1.7% rise in December.\n\nWal-Mart shares are off 3.2% to $47.68.\n\n-By Karen Talley, Dow Jones Newswires; 201-938-5106; karen.talley@dowjones.com [ 02-10-09 1437ET ]\n    ", 
" --To present new valve platforms Friday\n\n(Updates with additional comment from company, beginning in the seventh paragraph.)\n\n\n \n   By Anjali Athavaley \n   Of DOW JONES NEWSWIRES \n \n\nNEW YORK (Dow Jones)--Edwards Lifesciences Corp. (EW) said Friday that it expects earnings to grow 35% to 40%, excluding special items, in 2012 on expected sales of its catheter-delivered heart valves that were approved in the U.S. earlier this year.\n\nThe medical device company released its financial outlook in a press release before an investor conference Friday. The catheter-delivered heart valve market is considered to have a multibillion-dollar market potential, but questions have persisted on how quickly the Edwards device, called Sapien, will be rolled out and who will be able to receive it.\n\nEdwards said it expects transcatheter valve sales between $560 million and $630 million in 2012, with $200 million to $260 million coming from the U.S.\n\nOverall, for 2012, Edwards sees total sales between $1.95 billion and $2.05 billion, above the $1.68 billion to $1.72 billion expected this year and bracketing the $2.01 billion expected on average by analysts surveyed by Thomson Reuters.\n\nThe company projects 2012 per-share earnings between $2.70 and $2.80, the midpoint of which is below the average analyst estimate of $2.78 on Thomson Reuters. Edwards estimates a gross profit margin of 73% to 75%.\n\nEdwards also reaffirmed its 2011 guidance, which includes earnings per share of $1.97 to $2.02, excluding special items.\n\nThe company said it continues to expect U.S. approval of its Sapien device for high-risk patients in mid-2012. Currently, the device is only approved in the U.S. for patients too sick for surgery.\n\nThe company added that a separate trial studying its newer-generation valve in a larger population is under way in the U.S. It expects U.S. 
approval of that device in 2014.\n\nEdwards also plans to present at its investor conference two new catheter-delivered valve platforms designed for different implantation methods. European trials for these devices are expected to begin in 2012.\n\nShares of Edwards, down 9% over the past 12 months, were inactive premarket. The stock closed at $63.82 on Thursday.\n\n-By Anjali Athavaley, Dow Jones Newswires; 212-416-4912; anjali.athavaley@dowjones.com [ 12-09-11 0924ET ]\n    ", 
" In September, the company issued a guidance range of 43 cents to 44 cents a share.  \n\nFor the year, GE now sees earnings no lower than $1.81 a share to $1.83 a share. The previous forecast called for income of $1.80 to $1.83 a share. The new range brackets analyst projections of $1.82 a share.  \n\nThe new targets represent double-digit growth from the respective year-earlier periods. Last year's third-quarter earnings were $3.87 billion, or 36 cents a share, excluding items; earnings for the year ended Dec. 31 came in at $16.59 billion, or $1.59 a share. [ 10-06-05 0858ET ]  \n\nGeneral Electric also announced Thursday that it expects 2005 cash flow from operating activities to exceed $19 billion.  \n\nBecause of the expected cash influx, the company increased its authorization for share repurchases by $1 billion to more than $4 billion.  \n\nGE announced the updated guidance at an analysts' meeting Thursday in New York. A Web cast of the meeting is available at  .  \n\nThe company plans to report third-quarter earnings Oct. 14.  \n\nShares of the Dow Jones Industrial Average component recently listed at $33.20 in pre-market trading, according to Inet, up 1.6%, or 52 cents, from Wednesday's close of $32.68.  \n\nCompany Web site:  \n\n-Jeremy Herron; Dow Jones Newswires; 201-938-5400; Ask Newswires@DowJones.com  \n\nOrder free Annual Report for General Electric Co.  \n\nVisit   or call 1-888-301-0513 [ 10-06-05 0904ET ]  \n    "
)), class = "data.frame", row.names = c(NA, -3L))
Comments
  • Basil replied:

    We can split the text into separate rows, one per sentence, keeping grp intact, and then keep only the sentences that contain "medical device company released".

    library(dplyr)
    
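    df %>%
      # one row per sentence; grp is repeated for every piece
      tidyr::separate_rows(TD, sep = "\\.") %>%
      group_by(grp) %>% 
      # keep the matching sentences per grp, comma-joined ("" when none match)
      summarise(newCol = toString(grep(pattern = "medical device company released", 
                                  TD, ignore.case = TRUE, value = TRUE)))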
    
    
    #  grp   newCol                                                
    #  <chr> <chr>                                                 
    #1 3613  ""                                                    
    #2 4973  "\n\nThe medical device company released its financia…
    #3 5570  ""
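
The result above has "" where the question asked for NA, and it drops the TD column. If that matters, one possible variation is sketched below in base R. It is only a sketch under some assumptions: the helper name `extract_matches` is made up for illustration, and `demo` is a shortened stand-in for the real `df`, not the original data. Multiple matching sentences in one row are joined with " | ", which is an arbitrary choice.

```r
# Shortened stand-in for df (hypothetical text, same column names as the question)
demo <- data.frame(
  grp = c("3613", "4973", "5570"),
  TD = c(
    "Wal-Mart plans to add positions. Shares fell.",
    "Edwards expects growth.\n\nThe medical device company released its outlook Friday. More text follows.",
    "GE updated its guidance."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one text on literal periods, keep the pieces
# that match `pattern` (case-insensitive), and return NA when none match.
extract_matches <- function(text, pattern) {
  sentences <- unlist(strsplit(text, ".", fixed = TRUE))
  hits <- grep(pattern, sentences, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) {
    NA_character_
  } else {
    # trimws() drops the leading "\n\n" and spaces left over from the split
    paste(trimws(hits), collapse = " | ")
  }
}

# Apply row by row so every result stays aligned with its grp and TD
demo$newCol <- vapply(
  demo$TD,
  extract_matches,
  character(1),
  pattern = "medical device company released",
  USE.NAMES = FALSE
)

demo[, c("grp", "newCol")]
```

Because the helper works on one row at a time, the original data frame keeps all its columns and simply gains `newCol`. Note that splitting on every period also cuts at abbreviations such as "Corp." or "U.S.", just as the `"\\."` split in the answer does.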