r - 从 R 中的区间 [start, stop] 数据估计密度

描述

这个问题的动机来自临床/流行病学研究，其中研究通常招募患者，然后对他们进行不同时间的跟踪。

研究开始时的年龄分布通常很有趣并且很容易评估，但偶尔也会对研究期间任何时间的年龄分布感兴趣。

我的问题是，是否有一种方法可以根据区间数据(例如 [age_start, age_stop])估计这种密度，而无需如下扩展数据？长格式的方法看起来不够优雅，更不用说它的内存占用了!

使用生存包中的数据的可重现示例

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
  geom_density() +
  labs(title = "Fig 1.",
       x = "Age at Entry (years)",
       y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",
       x = "Age (years)",
       y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/365.25))

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",
       x = "Age (years)",
       y = "Density")

注意:虽然我将这个问题标记为“生存”是因为我认为它会吸引熟悉该领域的人，但我对这里的事件发生时间不感兴趣，我只对所有研究时间的总体年龄分布感兴趣。

最佳答案

与其计算越来越精细的时间间隔，您可以只对特定年龄的患者数量进行累积计数

setDT(ccdeath)
x <- rbind(
  ccdeath[, .(age = age, num_patients = 1)],
  ccdeath[, .(age = age_last, num_patients = -1)]
)[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = unique(age))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

锯齿模式是因为假定每个患者的起始年龄都是整数。对如何平滑它有一些想法并提出了这个想法 - 将相等的概率分配给给定 age 和 age+1 之间的一组均匀间隔的年龄。你得到这样的东西，

smooth_param <- 12
x <- rbindlist(lapply(
  (1:smooth_param-0.5)/smooth_param,
  function(s) {
    rbind(
      ccdeath[, .(age = age+s, num_patients = 1/smooth_param)],
      ccdeath[, .(age = age_last+s, num_patients = -1/smooth_param)]
    )
  }
))[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = sort(unique(age)))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

https://stackoverflow.com/questions/63746044/

相关文章：

javascript - 是否可以使用 javascript 将整个网页静音？

sql-server - T-SQL Json_modify 将属性附加到每个对象

python - 如何找到 DataFrame 行的所有组合？

apache-spark - 有什么方法可以使用 spark 从 s3 并行读取多个 Parquet

java - 如何在 Java 中测试不同的 "tables input"？

post - 错误启动内核 : '_xsrf' argument missing from POST

python - 将 ASN.1 字符串与 python 正则表达式匹配

c - 尝试将 'insert' 或 'add' 写入文本文件 - 一个小问题

amazon-web-services - 具有自定义授权方的无服务器 lambda 单元测试处理程

reactjs - Jest 测试因 gatsby webpack 配置而抛出错误