I'm attempting my first NLP project but am still fairly new to R. I've been following DataScienceDojo's process and have pulled in three files, organized them, and tokenized them.
I also have a data frame of profanity.
I want to remove any columns from the tokenized data that show up in the profanity list. I've come up with a few ideas, but nothing has worked.
My gut tells me I made a mistake when tidying the dataset or when tokenizing, but I'm still too rusty to pin it down.
Code:
# This will be the initial exploratory analysis of our project.
# Install all required packages.
install.packages(c("ggplot2", "e1071", "caret", "quanteda",
                   "irlba", "randomForest", "tidyr", "dplyr", "data.table"))
library("ggplot2")
library("e1071")
library("caret")
library("quanteda")
library("irlba")
library("randomForest")
library("tidyr")
library("dplyr")
library("data.table")
# Load in the data files
blogs.raw <- read.csv("en_US.blogs.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
news.raw <- read.csv("en_US.news.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
twitter.raw <- read.csv("en_US.twitter.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
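# (Aside: these files are plain text with one document per line, so I suspect
# read.csv is splitting documents on embedded commas, which is why each raw
# frame arrives with several ragged columns. A sketch of the readLines
# alternative I've been considering, which would make the unite() step below
# unnecessary:
# blogs.raw <- data.frame(Text = readLines("en_US.blogs.txt", skipNul = TRUE,
#                                          encoding = "UTF-8"),
#                         stringsAsFactors = FALSE)
# For now I've kept read.csv to match the walkthrough.)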
# Pre-processing cleanup
# Rename columns for ease of uniting later
names(blogs.raw) <- c("a","b","c","d")
names(news.raw) <- c("a","b","c")
names(twitter.raw) <- c("a","b")
# Unite columns so that each dataset is just a single column, titled "Text"
blogs.raw <- blogs.raw %>% unite("Text", a:d, sep = " ", remove = TRUE, na.rm = TRUE)
news.raw <- news.raw %>% unite("Text", a:c, sep = " ", remove = TRUE, na.rm = TRUE)
twitter.raw <- twitter.raw %>% unite("Text", a:b, sep = " ", remove = TRUE, na.rm = TRUE)
# Combine all three into a single dataset
merged_dataset <- rbind(blogs.raw, news.raw, twitter.raw)
# Tokenize the dataset
# Because we are using the quanteda package, we can use the arguments to further
# clean up the dataset
train.tokens <- tokens(merged_dataset$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)
# Make all the tokens lower-case
train.tokens <- tokens_tolower(train.tokens)
# Remove stopwords - words like "the", "a", "an" etc., which carry little semantic information
train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")
# Now stemming the words -- so that "run", "ran", "running" etc become one token
train.tokens <- tokens_wordstem(train.tokens, language = "english")
# Build the document-feature matrix (tokens are already lower-cased above)
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
train.tokens.matrix <- as.matrix(train.tokens.dfm)
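# (Note: as.matrix() on a large dfm is very memory-hungry; the sparse dfm
# supports the same summaries directly, e.g. nfeat(train.tokens.dfm) instead
# of ncol(train.tokens.matrix).)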
# Now let's do some initial exploratory data summaries. Some of this will be on the raw data, some on the cleaned data.
number_tokens <- ncol(train.tokens.matrix)  # number of unique token types (dfm features)
#Remove profanity as per Google's list https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
profanity <- read.csv("base-list-of-bad-words_CSV-file_2018_07_30.csv",
                      header = FALSE, stringsAsFactors = FALSE)  # header = FALSE is a guess at the file's layout
# This is where I get stuck and nothing I try works.
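# From reading the quanteda docs, I *think* the shape of the fix is something
# like the sketch below. Assumptions: the bad-words CSV is a single headerless
# column of words (I haven't verified its layout), and the tokens/dfm objects
# are as built above.
profanity_vec <- tolower(trimws(profanity[[1]]))  # coerce first column to a plain character vector
# Either drop the offending tokens before building the dfm...
train.tokens <- tokens_remove(train.tokens, pattern = profanity_vec, valuetype = "fixed")
# ...or drop the matching columns straight from the existing dfm:
train.tokens.dfm <- dfm_remove(train.tokens.dfm, pattern = profanity_vec)
# I'm not certain these calls are right, though, which is where I'd appreciate a pointer.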