I have this dataset:
df <- tibble(id, event, duration)
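For each "dive" row, I need to calculate the proportion of time spent at the surface, using the "surface" row that follows it, and insert the result into a new column. All of this should be done separately for each "id".
proportion = surface / (dive + surface)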
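As a quick check of the formula, here is a small sketch using the first dive/surface pair of the example data (55 and 56). The variable names are only illustrative. The parentheses matter, because division binds tighter than addition in R.

```r
# First dive/surface pair from the example data (illustrative names)
dive    <- 55
surface <- 56

# Without parentheses the division happens first, so this is not a proportion
wrong <- surface / dive + surface

# Intended formula: surface time over the total time of the pair
proportion <- surface / (dive + surface)
print(round(proportion, 3))   # 0.505
```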
#Output dataframe
# A tibble: 8 x 4
id event duration proportion
1 A surface 56 x
2 A surface 96 x
3 A surface 14 x
4 A surface 77 x
5 B surface 28 x
6 B surface 63 x
7 B surface 47 x
8 B surface 90 x
############################################################
Edit:
In my original data there are some "dive" events that have no "surface" after them, and the suggested code fails with an error:
Error in `dplyr::mutate()`:
! Problem while computing `proportion = DurationMin[What ==
"Surface"]/sum(DurationMin)`.
✖ `proportion` must be size 2 or 1, not 0.
ℹ The error occurred in group 2803: ptt = "2017111870", grp = 1015.
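Within an "id" there can be an odd number of rows, where a "dive" event is not followed by a "surface" in its sequence. So every time an unpaired event is encountered, I need to either ignore it or insert an NA. Is that possible?
Here is an example data frame: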
id <- c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B")
event <- c("dive", "surface", "dive", "surface", "dive", "surface", "dive", "surface", "dive", "surface", "dive", "surface", "dive", "surface", "dive", "dive", "surface")
duration <- c(55, 56, 40, 96, 58, 14, 43, 77, 19, 28, 34, 63, 29, 47, 61, 45, 30)
df <- tibble(id, event, duration)
> df
id event duration
1 A dive 55
2 A surface 56
3 A dive 40
4 A surface 96
5 A dive 58
6 A surface 14
7 A dive 43
8 A surface 77
9 B dive 19
10 B surface 28
11 B dive 34
12 B surface 63
13 B dive 29
14 B surface 47
15 B dive 61
16 B dive 45
17 B surface 30
>
Best answer
We can use gl to create a grouping index every 2 rows, then create the column 'proportion' by dividing the 'duration' where the event value is 'surface' (event == 'surface') by the sum of 'duration':
library(dplyr)
df %>%
group_by(id) %>%
group_by(grp = as.integer(gl(n(), 2, n())), .add = TRUE) %>%
mutate(proportion = duration[event == 'surface'][1]/sum(duration)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 16 × 4
id event duration proportion
<chr> <chr> <dbl> <dbl>
1 A dive 55 0.505
2 A surface 56 0.505
3 A dive 40 0.706
4 A surface 96 0.706
5 A dive 58 0.194
6 A surface 14 0.194
7 A dive 43 0.642
8 A surface 77 0.642
9 B dive 19 0.596
10 B surface 28 0.596
11 B dive 34 0.649
12 B surface 63 0.649
13 B dive 29 0.618
14 B surface 47 0.618
15 B dive 61 0.596
16 B surface 90 0.596
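To see what the gl call contributes, here is a minimal base R sketch. The sizes 6 and 5 are hypothetical row counts for a single id, not values taken from the data above.

```r
# gl(n, k, length) builds a factor whose levels each repeat k times,
# truncated to `length` elements; as.integer() keeps only the level codes.
n <- 6                               # hypothetical number of rows for one id
grp <- as.integer(gl(n, 2, n))       # pair index: rows 1-2, 3-4, 5-6
print(grp)                           # 1 1 2 2 3 3

# With an odd number of rows, the last index covers a single row only
grp_odd <- as.integer(gl(5, 2, 5))
print(grp_odd)                       # 1 1 2 2 3
```

This pairing is purely positional, so it only lines up dives with their surfaces when the rows strictly alternate within each id.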
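For the new dataset, where a 'dive' may have no 'surface' after it, we can group with cumsum(event == 'dive') instead, so that every 'dive' starts a new group. An unpaired 'dive' then forms a group of its own, and indexing with [1] returns NA for it: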
df %>%
group_by(id) %>%
group_by(grp = cumsum(event == 'dive'), .add = TRUE) %>%
mutate(proportion = duration[event == 'surface'][1]/sum(duration)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 17 × 4
id event duration proportion
<chr> <chr> <int> <dbl>
1 A dive 55 0.505
2 A surface 56 0.505
3 A dive 40 0.706
4 A surface 96 0.706
5 A dive 58 0.194
6 A surface 14 0.194
7 A dive 43 0.642
8 A surface 77 0.642
9 B dive 19 0.596
10 B surface 28 0.596
11 B dive 34 0.649
12 B surface 63 0.649
13 B dive 29 0.618
14 B surface 47 0.618
15 B dive 61 NA
16 B dive 45 0.4
17 B surface 30 0.4
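The NA in row 15 can be traced with a small base R sketch. It uses the last five rows of id "B" as plain vectors and swaps in ave() for the dplyr pipeline, purely for illustration.

```r
# Last five rows of id "B": the dive lasting 61 has no surface after it
event    <- c("dive", "surface", "dive", "dive", "surface")
duration <- c(29, 47, 61, 45, 30)

# Every dive starts a new group, so the unpaired dive sits alone in group 2
grp <- cumsum(event == "dive")
print(grp)                                   # 1 1 2 3 3

# Inside group 2 there is no surface row, so the subset has length zero;
# a zero-length result is what mutate() rejects in the error shown above
lonely <- duration[grp == 2][event[grp == 2] == "surface"]
print(length(lonely))                        # 0

# Indexing a zero-length vector with [1] yields a single NA instead
print(lonely[1])                             # NA

# Same proportion per group in base R (ave() stands in for group_by/mutate)
proportion <- ave(seq_along(duration), grp, FUN = function(i) {
  duration[i][event[i] == "surface"][1] / sum(duration[i])
})
print(round(proportion, 3))
```

To ignore unpaired dives rather than keep them as NA, the rows where proportion is NA can be dropped afterwards.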
-data
df <- structure(list(id = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B"), event = c("dive",
"surface", "dive", "surface", "dive", "surface", "dive", "surface",
"dive", "surface", "dive", "surface", "dive", "surface", "dive",
"dive", "surface"), duration = c(55L, 56L, 40L, 96L, 58L, 14L,
43L, 77L, 19L, 28L, 34L, 63L, 29L, 47L, 61L, 45L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17"))
https://stackoverflow.com/questions/72860923/