Iterating the computation of averages over a large dataset with dplyr's mutate

Suppose this is my dataset:

dataset

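I want to compute new variables (with mutate) following a pattern that is always the same: each new variable must be the mean of one column and the column two positions ahead of it, and so on. [In the real dataset, there are 30 columns between the two columns I want to average.]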

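In Excel, I would select the two variables and then drag the formula to the right. The algorithm should therefore stop once the result becomes missing.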

new dataset

I would like to stay within the tidyverse. Any suggestions?

Code:

ds <- structure(list(
  identificacao = c("3004U", "77584X", "25917G", "39895C", "20597Y", "64085M",
                    "51573F", "42221E", "58658E", "8983C", "18516K", "27050E"),
  lh_aparc_volume = c(2112, 2081, 2050, 2350, 2250, 1730,
                      1874, 1821, 2004, 1928, 1844, 2900),
  lh_bankssts_volume = c(1750, 1654, 1344, 1876, 1366, 1424,
                         1416, 1521, 1231, 2415, 938, 1356),
  rh_aparc_volume = c(1797, 1895, 1386, 1875, 2123, 1457,
                      1754, 2478, 1670, 1613, 1702, 1873),
  rh_bankssts_volume = c(1951, 1991, 1774, 2539, 1830, 2564,
                         2433, 1092, 1803, 2009, 1609, 1787)
), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))
ds
Comments
  • gcum
    gcum replied

    Here's an approach with bind_cols and map2:

    library(dplyr)
    library(purrr)
    ds %>%
    bind_cols(., map2(seq(2,ncol(.),by = 2),seq(3,ncol(.),by = 2),
                      ~ setNames((ds[,.x]+ds[,.y])/2,
                                 paste0(gsub("(\\w+?_).+","\\1",names(ds)[.x]),"mean"))))
    # A tibble: 12 x 7
       identificacao lh_aparc_volume lh_bankssts_volume rh_aparc_volume rh_bankssts_volume lh_mean rh_mean
       <chr>                   <dbl>              <dbl>           <dbl>              <dbl>   <dbl>   <dbl>
     1 3004U                    2112               1750            1797               1951   1931    1874 
     2 77584X                   2081               1654            1895               1991   1868.   1943 
     3 25917G                   2050               1344            1386               1774   1697    1580 
     4 39895C                   2350               1876            1875               2539   2113    2207 
     5 20597Y                   2250               1366            2123               1830   1808    1976.
     6 64085M                   1730               1424            1457               2564   1577    2010.
     7 51573F                   1874               1416            1754               2433   1645    2094.
     8 42221E                   1821               1521            2478               1092   1671    1785 
     9 58658E                   2004               1231            1670               1803   1618.   1736.
    10 8983C                    1928               2415            1613               2009   2172.   1811 
    11 18516K                   1844                938            1702               1609   1391    1656.
    12 27050E                   2900               1356            1873               1787   2128    1830 
    

    Another "tidyverse" approach would be tidyr::pivot_longer:

    library(dplyr)
    library(tidyr)
    ds %>%
      pivot_longer(-identificacao, names_to = "variable", values_to = "values") %>%
      separate(variable, into = c("group","variable"),
               sep = "_", extra = "merge") %>%
      pivot_wider(id_cols = c("identificacao","group"),
                  names_from = "variable", values_from = "values") %>%
      mutate(mean = (aparc_volume + bankssts_volume)/2) %>%
      pivot_wider(id_cols = "identificacao",
                  names_from = "group",
                  values_from = c("aparc_volume","bankssts_volume","mean"))
    # A tibble: 12 x 7
       identificacao aparc_volume_lh aparc_volume_rh bankssts_volume_lh bankssts_volume_rh mean_lh mean_rh
       <chr>                   <dbl>           <dbl>              <dbl>              <dbl>   <dbl>   <dbl>
     1 3004U                    2112            1797               1750               1951   1931    1874 
     2 77584X                   2081            1895               1654               1991   1868.   1943 
     3 25917G                   2050            1386               1344               1774   1697    1580 
     4 39895C                   2350            1875               1876               2539   2113    2207 
     5 20597Y                   2250            2123               1366               1830   1808    1976.
     6 64085M                   1730            1457               1424               2564   1577    2010.
     7 51573F                   1874            1754               1416               2433   1645    2094.
     8 42221E                   1821            2478               1521               1092   1671    1785 
     9 58658E                   2004            1670               1231               1803   1618.   1736.
    10 8983C                    1928            1613               2415               2009   2172.   1811 
    11 18516K                   1844            1702                938               1609   1391    1656.
    12 27050E                   2900            1873               1356               1787   2128    1830 
    

    Obviously this moves whatever lh and rh are to the end of the column name. If this is a dealbreaker, you could use rename_at.
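
    As a minimal sketch of that renaming step: in current dplyr, rename_with supersedes rename_at, so the example below uses it to move the lh/rh suffix back to the front of each name. The toy tibble `wide` and its values are made up for illustration and only mimic the column names produced above; the regular expression assumes the suffix is always `_lh` or `_rh`.

    ```r
    library(dplyr)

    # Hypothetical stand-in for the pivoted result (values are invented)
    wide <- tibble(
      identificacao   = c("a", "b"),
      aparc_volume_lh = c(10, 20),
      aparc_volume_rh = c(50, 60),
      mean_lh         = c(20, 30),
      mean_rh         = c(60, 70)
    )

    # Move a trailing "_lh" / "_rh" to the front: "mean_lh" becomes "lh_mean"
    renamed <- wide %>%
      rename_with(~ sub("^(.+)_(lh|rh)$", "\\2_\\1", .x), .cols = -identificacao)

    names(renamed)
    ```

    The id column is excluded from `.cols`, so only the measurement columns are touched.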

  • kqui
    kqui replied

    What's wrong with a simple mutate?

    View(ds %>% mutate(col1 = (lh_aparc_volume + rh_aparc_volume) /2 ,col2 = (lh_bankssts_volume + rh_bankssts_volume)/2))
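
    With many columns, typing each pair by hand gets tedious. The following is a hedged sketch of how the same lh/rh pairing could be generated from a fixed column offset, stopping when a column no longer has a partner (the Excel "drag right" idea from the question). The names `toy`, `offset`, `first`, and `second` are illustrative, the values are invented, and the `_mean` naming scheme is an assumption. For the real data described in the question (30 columns in between), `offset` would presumably be 31.

    ```r
    library(dplyr)
    library(purrr)

    # Toy data with the same layout as ds (values are invented)
    toy <- tibble(
      identificacao      = c("a", "b"),
      lh_aparc_volume    = c(10, 20),
      lh_bankssts_volume = c(30, 40),
      rh_aparc_volume    = c(50, 60),
      rh_bankssts_volume = c(70, 80)
    )

    offset <- 2  # distance between a column and its partner (assumption)
    value_cols <- setdiff(names(toy), "identificacao")
    stopifnot(offset < length(value_cols))

    # Pair column i with column i + offset; columns without a partner are dropped
    first  <- value_cols[seq_len(length(value_cols) - offset)]
    second <- value_cols[offset + seq_along(first)]

    # Row-wise mean of each pair, named after the shared part of the column name
    means <- map2(first, second, ~ (toy[[.x]] + toy[[.y]]) / 2) %>%
      set_names(paste0(sub("^[^_]+_", "", first), "_mean"))

    result <- bind_cols(toy, as_tibble(means))
    result
    ```

    This relies on the columns being in a consistent order (all lh columns first, then the matching rh columns), which is worth checking before trusting the output.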