R: Parsing a text file of quotations / splitting it into paragraphs

Posted by 冬莲 on 2020-03-27 22:27:58

I'm looking for an R solution to the problem of parsing a text file of quotations (as below) giving a data.frame with one observation per quote, and variables text and source as described below.

DIAGRAMS are of great utility for illustrating certain questions of vital statistics by
conveying ideas on the subject through the eye, which cannot be so readily grasped when
contained in figures.
--- Florence Nightingale, Mortality of the British Army, 1857

To give insight to statistical information it occurred to me, that making an
appeal to the eye when proportion and magnitude are concerned, is the best and
readiest method of conveying a distinct idea. 
--- William Playfair, The Statistical Breviary (1801), p. 2


Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
--- William Playfair, Elemens de statistique, Paris, 1802, p. XX.

The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
--- Charles Joseph Minard

Here, each quotation is a paragraph, separated from the next by "\n\n". Within the paragraph, all lines up to the one beginning --- comprise the text and what follows --- is the source.

I imagine I could solve this if I could first split the text lines into paragraphs (separated by '\\n\\n+', i.e. two or more consecutive newlines), but I'm having trouble doing that.

snon

2020-03-27 22:27:58

Assuming your text file is quote.txt in the working directory.

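Base R solution: split twice, (1) into paragraphs on runs of two or more newlines and (2) at ---, then combine the pieces into a data frame.

# Read all lines and rejoin them into a single string
quote <- readLines("quote.txt")
quote <- paste(quote, collapse = "\n")

# Split into paragraphs, then split each paragraph into text and source at "---"
DF <- strsplit(unlist(strsplit(quote, "\n{2,}")), "---")
DF <- data.frame(text   = trimws(sapply(DF, "[[", 1)),
                 source = trimws(sapply(DF, "[[", 2)))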

Output

DF
                                                                                                                                                                                                                                                                                 # text
# 1     DIAGRAMS are of great utility for illustrating certain questions of vital statistics by\nconveying ideas on the subject through the eye, which cannot be so readily grasped when\ncontained in figures.
# 2 To give insight to statistical information it occurred to me, that making an\nappeal to the eye when proportion and magnitude are concerned, is the best and\nreadiest method of conveying a distinct idea.
# 3                                                                                                           Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
# 4                                                                     The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
#                                                          source
# 1     Florence Nightingale, Mortality of the British Army, 1857
# 2       William Playfair, The Statistical Breviary (1801), p. 2
# 3 William Playfair, Elemens de statistique, Paris, 1802, p. XX.
# 4                                         Charles Joseph Minard
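
Since the question mentions trouble splitting the lines into paragraphs, here is an alternative sketch (not the method above) that works directly on the line vector returned by readLines(), without pasting it back into one string. It groups lines by counting blank lines with cumsum(), then treats everything from the first line starting with --- as the source. The names lines, paras and parse_quote are illustrative assumptions, and the sample input is two of the quotations from the question.

```r
# Stand-in for readLines("quote.txt"): one element per line,
# with two blank lines between the paragraphs.
lines <- c(
  "To give insight to statistical information it occurred to me, that making an",
  "appeal to the eye when proportion and magnitude are concerned, is the best and",
  "readiest method of conveying a distinct idea. ",
  "--- William Playfair, The Statistical Breviary (1801), p. 2",
  "",
  "",
  "The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.",
  "--- Charles Joseph Minard"
)

# The running count of blank lines serves as a paragraph id;
# the blank lines themselves are dropped before grouping.
is_blank <- !nzchar(trimws(lines))
paras <- split(lines[!is_blank], cumsum(is_blank)[!is_blank])

# Hypothetical helper: turn one paragraph (character vector) into a one-row data frame.
parse_quote <- function(p) {
  i <- match(TRUE, startsWith(p, "---"))  # first source line, NA if there is none
  if (is.na(i)) {
    text <- p
    src <- NA_character_
  } else {
    text <- p[seq_len(i - 1)]
    src <- sub("^-+\\s*", "", paste(trimws(p[i:length(p)]), collapse = " "))
  }
  data.frame(text = paste(trimws(text), collapse = " "),
             source = src,
             stringsAsFactors = FALSE)
}

quotes <- do.call(rbind, lapply(paras, parse_quote))
rownames(quotes) <- NULL
```

Because paragraphs are identified by the blank-line count rather than by a regular expression, any number of blank lines between quotations is tolerated, and a quotation with no source line gets NA in the source column.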

fharum

2020-03-27 22:27:58

This should do the bulk of what you need to achieve. I assume you already have the file in a length-1 character vector called txt:

library(tidyverse)

txt                                             %>% 
strsplit("\n{2,5}")                             %>% 
unlist()                                        %>% 
lapply(function(x) unlist(strsplit(x, "--- "))) %>%
{do.call("rbind", .)}                           %>%
as.data.frame(stringsAsFactors = FALSE)         %>%
setNames(c("Text", "Source"))                    ->
df

If you then tidy up the text by replacing the newlines with spaces, you get the following:

df$Text <- gsub("\n", " ", df$Text)
as_tibble(df)
#> # A tibble: 4 x 2
#>   Text                                              Source                             
#>   <chr>                                             <chr>                              
#> 1 "DIAGRAMS are of great utility for illustrating ~ Florence Nightingale, Mortality of~
#> 2 "To give insight to statistical information it o~ William Playfair, The Statistical ~
#> 3 "Regarding numbers and proportions, the best way~ William Playfair, Elemens de stati~
#> 4 "The aim of my carte figurative is to convey pro~ Charles Joseph Minard

fdolor

2020-03-27 22:27:58

Assuming you have the initial text loaded in a variable called rawText:

library(stringr)

strsplit(rawText, "\n\n")[[1]] %>% 
  str_split_fixed("\n--- ", 2) %>% 
  as.data.frame() %>% 
  setNames(c("text", "source"))
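
A possible refinement of the stringr approach, offered as a sketch under a few assumptions: the sample text in the question has two blank lines between some quotations, so splitting on exactly "\n\n" can leave a stray leading newline in the text column. The variant below splits on runs of two or more newlines and uses str_squish() to collapse the line breaks inside each quotation into single spaces. The sample rawText is built from two of the quotations in the question; the names paras, parts and quotes are illustrative.

```r
library(stringr)

# Stand-in for the file contents: three newlines between the
# paragraphs and a trailing newline at the end.
rawText <- paste(c(
  "DIAGRAMS are of great utility for illustrating certain questions of vital statistics by",
  "conveying ideas on the subject through the eye, which cannot be so readily grasped when",
  "contained in figures.",
  "--- Florence Nightingale, Mortality of the British Army, 1857",
  "",
  "",
  "Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.",
  "--- William Playfair, Elemens de statistique, Paris, 1802, p. XX.",
  ""
), collapse = "\n")

# Trim the ends first so no empty paragraph is produced,
# then split on any run of two or more newlines.
paras <- str_split(str_trim(rawText), "\n{2,}")[[1]]

# Split each paragraph once, at the line that starts with "---".
parts <- str_split_fixed(paras, "\n---\\s*", 2)

quotes <- data.frame(text = str_squish(parts[, 1]),
                     source = str_squish(parts[, 2]),
                     stringsAsFactors = FALSE)
```

Anchoring the second split on "\n---" rather than a bare "---" means a dash sequence inside the quotation text itself is less likely to be mistaken for the source marker.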
