Suppose I have the following data.
library(data.table)

dt = data.table(
  date = c("2020-10-01", "2020-10-02", "2020-10-03", "2020-10-01", "2020-10-01",
           "2020-10-03", "2020-10-04", "2020-10-04", "2020-10-05", "2020-10-05"),
  client = sample(LETTERS[1:3], 10, replace = TRUE),  # random draw, no seed is set
  vals = rnorm(10))
dt[order(date)]
# total vals per date and client
dt2 = dt[order(date), .(sum_vals = sum(vals)), by = .(date, client)]
         date client    sum_vals
1: 2020-10-01      B  2.53737527
2: 2020-10-01      C  0.64366866
3: 2020-10-02      A  1.01776243
4: 2020-10-03      C -0.06303562
5: 2020-10-03      A  0.63702089
6: 2020-10-04      B  0.12681052
7: 2020-10-04      A  0.82889616
8: 2020-10-05      B -1.45734539
9: 2020-10-05      C  0.02594185
What I want is the cumulative number of distinct clients by date.
So in this case, it would look like this.
         date acts
1: 2020-10-01    2 # B and C were active on 10/01 or before
2: 2020-10-02    3 # A, B and C were active on 10/02 or before
3: 2020-10-03    3 # A, B and C were active on 10/03 or before
4: 2020-10-04    3 # A, B and C were active on 10/04 or before
5: 2020-10-05    3 # A, B and C were active on 10/05 or before
Any ideas on how to do this with data.table or dplyr?
Best answer
We can do
# for each distinct date, count the distinct clients seen on or before it
dt2[, .(date = unique(date),
        acts = unlist(lapply(unique(date),
                             function(x) uniqueN(client[date <= x]))))]
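Output (client is sampled at random without a seed, so these counts come from a different draw than the table printed in the question)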
date acts
1: 2020-10-01 2
2: 2020-10-02 2
3: 2020-10-03 3
4: 2020-10-04 3
5: 2020-10-05 3
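An alternative that avoids re-scanning the client column for every date is to flag each client's first appearance and take a running sum. This is only a sketch, not part of the original answer: to make it reproducible, dt2 is rebuilt by hand from the date and client columns printed in the question (an assumption, since the real data are random), and the name res is just illustrative.

```r
library(data.table)

# Hand-typed copy of the date/client columns of the dt2 printed in the question
# (assumption: the original data are random, so this is only one possible draw).
dt2 <- data.table(
  date   = c("2020-10-01", "2020-10-01", "2020-10-02", "2020-10-03", "2020-10-03",
             "2020-10-04", "2020-10-04", "2020-10-05", "2020-10-05"),
  client = c("B", "C", "A", "C", "A", "B", "A", "B", "C")
)

# Sort by date, mark the first row on which each client appears,
# and accumulate those marks into a running count of distinct clients.
res <- dt2[order(date)][, acts := cumsum(!duplicated(client))]

# Keep one row per date: the running count reached by the end of that date.
res <- res[, .(acts = max(acts)), by = date]
res
```

For this copy of the data, acts comes out as 2, 3, 3, 3, 3, which matches the table asked for in the question. The sort matters: ISO-formatted date strings order correctly as plain text, but other date formats would need converting with as.Date() first.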
https://stackoverflow.com/questions/69170009/