r - Why does the same dplyr query return different results in different R sessions?

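While working on a project with a colleague that involves manipulating data frames with dplyr (from the tidyverse), I noticed that we got different results from the same query, even though we were using the same code and the same data.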

Session info from the two R sessions:

Desktop:

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 
[2] LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [4] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [7] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
[10] sp_1.3-2      

RStudio Cloud:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomNames_1.4-0.0  plotly_4.9.2.1       lubridate_1.7.9     
 [4] openintro_2.0.0      usdata_0.1.0         cherryblossom_0.1.0 
 [7] airports_0.1.0       leaflet_2.0.3        forcats_0.5.0       
[10] stringr_1.4.0        dplyr_1.0.0          purrr_0.3.4         
[13] readr_1.3.1          tidyr_1.1.0          tibble_3.0.2        
[16] ggplot2_3.3.2        tidyverse_1.3.0      shinydashboard_0.7.1
[19] shiny_1.5.0         

Reproducible example using iris:


library(tidyverse)

# let's say that each flower in the iris data frame had a name


iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
  

# and that, for some reason, the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)

iris_big <- rbind(iris,iris[sample_index,])

I wanted to know how many unique flowers there are per species, so I wrote the following query:

 
iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>% 
  count(Species)

The problem is that it returns two different results: one on my desktop and another on my friend's machine (he is using RStudio Cloud).

My desktop:

# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

RStudio Cloud:


Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        83
2 versicolor    80
3 virginica     87

I eventually worked around the problem with the following query:

iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>%
  select(Species) %>% 
  group_by(Species) %>% 
  count()

# A tibble: 3 x 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

But I would like to know why this happens.

Best answer

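(First off, I'm submitting this as an alternative answer, because my first answer (about the change to sample.int between R-3.5 and R-3.6) still seems relevant to the question of "why does the same query return different results in different R sessions". It is not the cause of the symptoms here, but it easily could have been, given that the first version of your question used sample. Instead, the real culprit is a similarly "major" version change in dplyr.)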

The behavior of dplyr::count changed significantly.

In dplyr-0.8.3, ?count says:

      wt: (Optional) If omitted (and no variable named 'n' exists in
          the data), will count the number of rows. If specified, will
          perform a "weighted" tally by summing the (non-missing)
          values of variable 'wt'. A column named 'n' (but not 'nn' or
          'nnn') will be used as weighting variable by default in
          'tally()', but not in 'count()'. This argument is
          automatically quoted and later evaluated in the context of
          the data frame. It supports unquoting. See
          'vignette("programming")' for an introduction to these
          concepts.

In dplyr-1.0.0:

      wt: <'data-masking'> Frequency weights. Can be a variable (or
          combination of variables) or 'NULL'. 'wt' is computed once
          for each unique combination of the counted variables.

            • If a variable, 'count()' will compute 'sum(wt)' for each
              unique combination.

            • If 'NULL', the default, the computation depends on
              whether a column of frequency counts 'n' exists in the
              data frame. If it exists, the counts are computed with
              'sum(n)' for each unique combination. Otherwise, 'n()' is
              used to compute the counts. Supply 'wt = n()' to force
              this behaviour even if you have an 'n' column in the data
              frame.

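The important part is that in 0.8.3 it says "A column named 'n' ... will be used ... in 'tally()', but not in 'count()'". In 1.0.0 that wording is gone: when a column named n exists and wt is not supplied, count() now sums n instead of counting rows. Your first count() creates exactly such an n column, so the second count(Species) counts rows under 0.8.3 but sums n under 1.0.0. I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.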
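
To make the difference concrete, here is a minimal sketch on a made-up three-row table (the tibble `toy` and the column names `rows` and `weighted` are hypothetical, not part of the question). It computes both quantities explicitly with summarise(), so it should not depend on which default count() picks in a given dplyr version:

```r
library(dplyr)

# Hypothetical toy data that already carries a column named `n`,
# just like the output of the first count() in the question.
toy <- tibble(
  g = c("a", "a", "b"),
  n = c(10L, 20L, 30L)
)

both <- toy %>%
  group_by(g) %>%
  summarise(
    rows     = n(),    # number of rows per group (what 0.8.3's count() reports)
    weighted = sum(n)  # sum of the `n` column (what 1.0.0's count() reports)
  )

both
```

In the question, the first number corresponds to the 50/50/50 result and the second to 83/80/87, which adds up to 250, the number of rows in iris_big.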

There are two ways around it:

  1. Use count(..., wt = n()):

    R.version$version.string
    # [1] "R version 3.5.3 (2019-03-11)"
    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      ungroup() %>%
      count(Species, wt = n())
    # # A tibble: 3 x 2
    #   Species        n
    #   <fct>      <int>
    # 1 setosa        50
    # 2 versicolor    50
    # 3 virginica     50
    
    R.version$version.string
    # [1] "R version 4.0.2 (2020-06-22)"
    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      ungroup() %>%
      count(Species, wt = n())
    # # A tibble: 3 x 2
    #   Species        n
    #   <fct>      <int>
    # 1 setosa        50
    # 2 versicolor    50
    # 3 virginica     50
    
  2. Switch to using tally on the grouped data, as in

    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      group_by(Species) %>%
      tally()
    

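Or you can pick another option:

  1. Recognize that this is issue dplyr#5298, which has been fixed in the not-yet-released dplyr-1.0.1 (I don't know the timeline). RStudio Cloud users can therefore install the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should restore count's behavior to what it was before 1.0.0 (despite Hadley's opinion on the matter).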
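
As a further sketch that is not part of the original answer: the intermediate `n` column can be avoided altogether by deduplicating with distinct() before counting, so count() never sees a column named `n`. The data below (`flowers`, `flowers_big`) is a hypothetical stand-in for iris_big, kept small so the result is easy to check by hand:

```r
library(dplyr)

# Hypothetical stand-in for the question's data: five named flowers ...
flowers <- tibble(
  name    = c("Ann", "Bob", "Cy", "Dee", "Eve"),
  Species = c("setosa", "setosa", "virginica", "virginica", "virginica")
)

# ... some of which appear more than once, as in iris_big.
flowers_big <- bind_rows(flowers, flowers[c(1, 1, 3), ])

unique_per_species <- flowers_big %>%
  distinct(name, Species) %>%  # one row per unique flower, no `n` column created
  count(Species)               # plain row count per species

unique_per_species
```

Applied to the question, the same two steps would be `iris_big %>% distinct(name, Species) %>% count(Species)`.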

Source: the original question on Stack Overflow, "Why does the same dplyr query return different results in different R sessions?": https://stackoverflow.com/questions/62941250/
