In R, removing columns from a matrix that match a list of words in an unrelated data frame

I'm attempting my first NLP project and am still a bit new to R. I've been following DataScienceDojo's process, and have pulled in three files, organized them, and tokenized them.

I also have a data frame of profanity.

I want to remove any columns in the tokenized data that appear in the profanity list. I've come up with a few ideas, but none of them has worked.

My gut tells me I made a mistake while tidying the dataset or tokenizing, but I'm still too rusty to pin it down.

Code:

# This will be the initial exploratory analysis of our project.


# Install all required packages.
install.packages(c("ggplot2", "e1071", "caret", "quanteda",
                   "irlba", "randomForest", "tidyr", "dplyr", "data.table"))

library("ggplot2")
library("e1071")
library("caret")
library("quanteda")
library("irlba")
library("randomForest")
library("tidyr")
library("dplyr")
library("data.table")


# Load in the data files

blogs.raw <- read.csv("en_US.blogs.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
news.raw <- read.csv("en_US.news.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
twitter.raw <- read.csv("en_US.twitter.txt", header = FALSE, stringsAsFactors = FALSE, skipNul = TRUE, fileEncoding = "UTF-8")
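# Aside: these inputs are plain text files, not true CSVs, so read.csv()
# splits every line at commas -- that is why the three files come in with
# different column counts and need re-uniting below. If that split is the
# tidying mistake I suspect, a line-per-document read would sidestep it
# entirely (a sketch, not what I actually ran):
# blogs.lines   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
# news.lines    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
# twitter.lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)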


# Pre-processing cleanup


# Rename columns for ease of uniting later
names(blogs.raw) <- c("a","b","c","d")
names(news.raw) <- c("a","b","c")
names(twitter.raw) <- c("a","b")


# Unite columns so that each dataset is just a single column, titled "Text"
blogs.raw <- blogs.raw %>% unite("Text", a:d, sep = " ", remove = TRUE, na.rm = TRUE)
news.raw <- news.raw %>% unite("Text", a:c, sep = " ", remove = TRUE, na.rm = TRUE)
twitter.raw <- twitter.raw %>% unite("Text", a:b, sep = " ", remove = TRUE, na.rm = TRUE)




# Combine all three into a single dataset
merged_dataset <- rbind(blogs.raw, news.raw, twitter.raw)


# Tokenize the dataset
# Because we are using the quanteda package, we can use the arguments to further 
# clean up the dataset
train.tokens <- tokens(merged_dataset$Text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)


# Make all the tokens lower-case
train.tokens <- tokens_tolower(train.tokens)


# Remove stopwords - words like "the", "a", "an" etc., which carry little semantic information
train.tokens <- tokens_select(train.tokens, stopwords("en"), selection = "remove")


# Now stemming the words -- so that "run", "ran", "running" etc become one token
train.tokens <- tokens_wordstem(train.tokens, language = "english")

train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
train.tokens.matrix <- as.matrix(train.tokens.dfm)



# Now let's do some initial exploratory data summaries. Some of this will be on the raw data, some on the cleaned data.

number_tokens <- ncol(train.tokens.matrix)
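# Quick sanity check on the vocabulary: the 20 most frequent features,
# computed on the sparse dfm directly (topfeatures() is quanteda; the dense
# as.matrix() copy above can get very large for a corpus of this size).
topfeatures(train.tokens.dfm, 20)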


# Remove profanity as per Google's list: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
profanity <- read.csv("base-list-of-bad-words_CSV-file_2018_07_30.csv", stringsAsFactors = FALSE)


# This is where I get stuck and nothing I try works.
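For reference, the kind of thing I've been attempting looks like the sketch below. It's a sketch, not verified code: it assumes the profanity CSV reads in as a single column of words (the [[1]] column reference is a guess - check names(profanity)), and that the list needs the same lower-casing and stemming as the tokens before any patterns can match.

# Pull the profanity words out of the data frame and normalise them the same
# way the tokens were normalised (char_wordstem() is quanteda).
bad.words <- tolower(as.character(profanity[[1]]))
bad.words <- char_wordstem(bad.words, language = "english")

# Option 1: let quanteda drop the features from the dfm...
train.tokens.dfm <- dfm_remove(train.tokens.dfm, pattern = bad.words)
train.tokens.matrix <- as.matrix(train.tokens.dfm)

# ...or Option 2: drop matching columns from the plain matrix with base R.
keep <- !(colnames(train.tokens.matrix) %in% bad.words)
train.tokens.matrix <- train.tokens.matrix[, keep, drop = FALSE]

# Removing at the token stage, before building the dfm, should also work:
# train.tokens <- tokens_remove(train.tokens, pattern = bad.words)

Comparing ncol(train.tokens.matrix) against number_tokens afterwards should confirm whether the columns actually went away - for me it doesn't.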