Error message when running a t-test in R because of a character variable

I have been trying to run a two-sided t-test in R but keep running into an error. Below are my process, the dataset details, and my script from RStudio. I used a dataset called LungCapacity that I downloaded from this website: https://www.statslectures.com/r-scripts-datasets.

#Imported data set into RStudio.

# Ran a summary report to see the data and class.
summary(LungCapData)

# Here I could see that the Smoke column is a character, so I converted it to a factor.
LungCapData$Smoke <- factor(LungCapData$Smoke)

# On checking the summary, I see it's converted to a factor with levels yes and no.

# I want to run a t-test between lung capacity and smoking. 
t.test(LungCapData$LungCap, LungCapData$Smoke, alternative = c("two.sided"), mu=0, var.equal = FALSE, conf.level = 0.95, paired = FALSE)

Now, when I run it, I get the following error.

Error in var(y) : Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA

I tried converting the Smoke variable from Yes and No to 1 and 0. The code then runs, but the result is not correct. What am I doing wrong?

Comments
  • Doyle
    Doyle replied

    You're very close, you just need to call t.test with a formula:

    t.test(LungCap ~ Smoke, data = LungCapData,
           alternative = c("two.sided"), mu=0, var.equal = FALSE,
           conf.level = 0.95, paired = FALSE)
    
    #   Welch Two Sample t-test
    #
    #data:  LungCap by Smoke
    #t = -3.6498, df = 117.72, p-value = 0.0003927
    #alternative hypothesis: true difference in means is not equal to 0
    #95 percent confidence interval:
    # -1.3501778 -0.4003548
    #sample estimates:
    # mean in group no mean in group yes 
    #         7.770188          8.645455 
    

    With your current approach, you're trying to compare LungCapData$LungCap which is a numeric vector:

    LungCapData$LungCap[1:10]
    # [1]  6.475 10.125  9.550 11.125  4.800  6.225  4.950  7.325  8.875  6.800
    

    With LungCapData$Smoke, which is a factor:

    LungCapData$Smoke[1:10]
    # [1] no  yes no  no  no  no  no  no  no  no 
    

    Instead, you want to instruct t.test to compare LungCapData$LungCap when grouping by LungCapData$Smoke. That is achieved with a formula.

    The formula LungCap ~ Smoke says that LungCap should depend on Smoke. When you use bare column names in a formula, you also need to supply data = so that t.test knows which data frame to look them up in.
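
    As a minimal sketch of how the formula interface finds its variables, the example below uses a small simulated data frame. The names toy, cap and smoke, and the simulated values, are made up for illustration and are not part of LungCapData.

    ```r
    # Hypothetical toy data: 15 "no" and 15 "yes" observations
    set.seed(42)
    toy <- data.frame(
      cap   = c(rnorm(15, mean = 8), rnorm(15, mean = 9)),
      smoke = factor(rep(c("no", "yes"), each = 15))
    )

    # Bare column names are looked up in the data frame given by data =
    with_data <- t.test(cap ~ smoke, data = toy)

    # Fully qualified columns also work, at the cost of more typing
    without_data <- t.test(toy$cap ~ toy$smoke)

    # Both calls run the same Welch test on the same two groups
    same_t <- all.equal(with_data$statistic, without_data$statistic)
    print(same_t)
    ```

    In other words, data = mainly lets you write the short column names in the formula.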

    When you try to convert LungCapData$Smoke to numeric, you get the wrong result because you're just getting the factor level indices which have no biological significance.

    as.numeric(LungCapData$Smoke)[1:10]
    # [1] 1 2 1 1 1 1 1 1 1 1
    

    You're basically asking whether the mean of the factor level indices we assigned differs from the mean of lung capacity.
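
    To make the point about level indices concrete, here is a short sketch on a hand-written factor. The vector smoke below is a hypothetical stand-in for LungCapData$Smoke, not the real column.

    ```r
    # Hypothetical stand-in for the Smoke column
    smoke <- factor(c("no", "yes", "no", "no"))

    # Levels are sorted alphabetically, so "no" is level 1 and "yes" is level 2
    lvls <- levels(smoke)

    # as.numeric() returns the level indices (1 and 2), not a 0/1 coding
    idx <- as.numeric(smoke)

    # An explicit 0/1 indicator has to be built from a comparison
    smoke01 <- as.integer(smoke == "yes")

    print(lvls)
    print(idx)
    print(smoke01)
    ```

    Even with a proper 0/1 coding, passing that vector as the second argument of t.test would still compare the mean of lung capacity with the mean of the indicator, which is not the comparison between smokers and non-smokers.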

    The other way is to subset LungCapData$LungCap yourself, but that's a lot more typing:

    t.test(LungCapData$LungCap[LungCapData$Smoke == "yes"],
           LungCapData$LungCap[LungCapData$Smoke == "no"],
           alternative = c("two.sided"), mu=0, var.equal = FALSE,
           conf.level = 0.95, paired = FALSE)
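
    The sketch below checks, on simulated data, that the formula call and the manual subsetting run the same test. The data frame toy and its columns cap and smoke are made-up names for illustration. One detail to watch: the formula version subtracts in factor level order ("no" minus "yes"), while the call above passes the "yes" group first, so the sign of t and of the confidence interval is flipped.

    ```r
    # Hypothetical toy data, not the real LungCapData
    set.seed(1)
    toy <- data.frame(
      cap   = c(rnorm(20, mean = 8), rnorm(20, mean = 9)),
      smoke = factor(rep(c("no", "yes"), each = 20))
    )

    # Formula interface: groups are taken in level order, "no" then "yes"
    by_formula <- t.test(cap ~ smoke, data = toy)

    # Manual subsetting with the "yes" group passed first
    by_subset <- t.test(toy$cap[toy$smoke == "yes"],
                        toy$cap[toy$smoke == "no"])

    # Same p-value, and the same t statistic up to its sign
    same_p <- all.equal(by_formula$p.value, by_subset$p.value)
    flipped_t <- all.equal(unname(by_formula$statistic),
                           -unname(by_subset$statistic))
    print(c(same_p, flipped_t))
    ```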
    

  • zamet
    zamet replied

    As called in the OP, t.test() attempts to compare two vectors and expects both of them to be numeric.

    Instead, use the formula version of t.test().

    data <- read.table(file = "./data/LungCapData.txt", header = TRUE)
    t.test(LungCap ~ Smoke, data = data)
    

    ...and the output:

    > t.test(LungCap ~ Smoke,data = data)
    
        Welch Two Sample t-test
    
    data:  LungCap by Smoke
    t = -3.6498, df = 117.72, p-value = 0.0003927
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -1.3501778 -0.4003548
    sample estimates:
     mean in group no mean in group yes 
             7.770188          8.645455 
    
    >