Simulator usage • gamlss.longitudinal

This article covers two simulation use cases:

Simulation of a new dataset with gamlss margins and copula dependence
Simulation from a fitted gamlss.longitudinal model for the fitted dataset
Simulation from a fitted gamlss.longitudinal model for new data (e.g. new timepoints, subjects, or different covariates)

The following packages are required for the code in this vignette:

library(gamlss.longitudinal)
library(gamlss.dist)

Simulating a new dataset

simulate_longitudinal_dataset() is a flexible framework for generating new datasets with gamlss marginal distribution and first-order copula dependence. It includes functionality for reasonasbly simply specifying formulas for every available parameter of the margin and copula, and includes helper functions for simply inserting smooth terms which can be used in testing.

Creating a minimal longitudinal dataset

simulate_longitudinal_dataset() returns one row per subject and time combination. We specify the number of subjects with n=100 and the number of observations per timepoint with an ordered time variable, here times = 0:4 which will result in a dataset with 500 observations total.

Any non-mixed marginal distribution from gamlss.dist is supported (e.g. NO(), GA(), GG()), and copulas supported are the same as for the overall package:

"N" Normal
"C" Clayton
"F" Frank
"G" Gumbel
"J" Joe
"t" Student t

For a simple model we can just specify intercepts for mu and sigma for a standard normal margin with an intercept for the theta parameter of a normal copula as below.

dataset <- simulate_longitudinal_dataset(
  n = 100,
  times = 0:4,
  margin_dist = NO(),
  margin_params = list(mu = 0, sigma = 1),
  copula_dist = "N",
  copula_params = list(theta = 0.35),
  seed = 1
)

head(dataset)
table(dataset$time)

The output includes the subject, time and response variable alongside a column for the transformed copula input u based on the simulated margin. The simulated dataset by default also includes the true values for each parameter for every row for ease of comparison. Note that true_theta will have an NA for the first timepoint as for a given dataset with T timepoints, there will only be T-1 copulas fit and we assign the copula for time pair 0,1 to time 1.

Time-varying parameters

Formulas for parameters based only on time have a simpler interpretation than other covariates since it’s such a common feature, and can be incorporated as:

an intercept for all time, margin_params = list(mu = 0, sigma = 1),
a time-based vector with length for margin equal to T or for copula T-1, margin_params = list(mu = 0, sigma = c(0.35, 0.45, 0.55, 0.65)), or
a function of the design matrix time variable d$.sim_time_index, mu = function(d) exp(1 + 0.15 * d$.sim_time_index)
a matrix with the same time/subject index as the dataset specifying the value for the parameter

dataset <- simulate_longitudinal_dataset(
  n = 80,
  times = 1:4,
  margin_dist = GA(mu.link = "log", sigma.link = "log"),
  margin_params = list(
    mu = function(d) exp(1 + 0.15 * d$.sim_time_index),
    sigma = c(0.35, 0.45, 0.55, 0.65)
  ),
  copula_dist = "C",
  copula_params = list(tau = c(0.15, 0.25, 0.35)),
  seed = 2
)

head(dataset)
aggregate(true_sigma ~ time, dataset, unique)
aggregate(true_theta ~ time, dataset, unique, na.rm = TRUE)

Covariates

Covariates can either be subject-level (so don’t change over time) or observation-level which can change over time.

We pass in covariates to the simulation via covariates = and pass the relationship between parameters and the covariates with margin_params = and copula_params =. Inside the helpers, d refers to the subject/time matrix generated for the margins and e is the relevant edge subject/time for the copula. We also provide internal helper in he edge matrix for time_left and time_right when these are needed since the copula can refer to either timepoints variables if required.

We can either specify our own covariates or use the function simulate_longitudinal_covariates() which requires the subject/time matrix to create lists of subject level covariates, here a treatment group and a normally distributed age, and observation level covariates, here just a scaled version of the time covariate.

treatment_effect <- c(control = 0, active = 0.4)

dataset <- simulate_longitudinal_dataset(
  n = 120,
  times = seq(0, 1, length.out = 5),
  margin_dist = GA(mu.link = "log", sigma.link = "log"),
  covariates = function(base) {
    simulate_longitudinal_covariates(
      base,
      subject = list(
        treatment = function(s) {
          factor(
            sample(c("control", "active"), nrow(s), replace = TRUE),
            levels = c("control", "active")
          )
        },
        age = function(s) rnorm(nrow(s), mean = 55, sd = 10)
      ),
      observation = list(
        time_scaled = function(d) sim_rescale01(d$time)
      )
    )
  },
  margin_params = list(
    mu = function(d) {
      age_scaled <- sim_rescale01(d$age)
      eta <- 1.2 +
        sim_factor_effect(d$treatment, treatment_effect) +
        0.35 * d$time_scaled +
        sim_smooth_bump(age_scaled, amplitude = 0.25)
      exp(eta)
    },
    sigma = function(d) {
      exp(-0.8 + 0.25 * d$time_scaled)
    }
  ),
  copula_dist = "N",
  copula_params = list(
    theta = function(e) {
      0.2 + 0.25 * sim_rescale01(e$time_left)
    }
  ),
  seed = 3
)

head(dataset)

Simulation from a fitted model

The alternative simulation use case is for simulating from a fitted gamlss.longitudinal model which is supported through simulate(). This is often for inference or prediction/forecasting rather than generating new data for testing purposes.

Simulation from a fitted model (on the fitted dataset)

The simplest simulation option is to directly simulate new replicates of a fitted model, essentially a parametric bootstrap simulation against the fitted data covariates.

sim_fit <- simulate(fit, nsim = 5, seed = 10)

head(sim_fit)
dim(sim_fit)

The ouptut is a dataframe with one column for each simulation replicate with rows which match the original dataset supplied for the model fit.

Simulation from a fitted model (for new subjects)

We can pass simulate() a new dataset with the same timepoints but different subjects through newdata =. This will generate simulations which respect the dependence structure in the fitted model and apply it to the new simulated cohort.

new_cohort <- dat[dat$subject %in% unique(dat$subject)[1:20], ]
new_cohort$subject <- paste0("new_", new_cohort$subject)
new_cohort$response <- NA_real_

sim_new <- simulate(
  fit,
  nsim = 10,
  seed = 11,
  newdata = new_cohort
)

head(cbind(new_cohort[c("subject", "time")], sim_new))

Simulation from a fitted model (for new timepoints)

Alternatively, we can pass a new dataset with the same set of subjects but new timepoints (as long as it makes sense to do this, e.g. for a factor time this won’t make much sense, but for a linear fit time it does). To take advantage of this type of fit, time will need to be fit as a structure that extrapolates to the new timepoints cleanly.

subject_covariates <- dat[!duplicated(dat$subject), c("subject", "treatment", "age")]
future_grid <- expand.grid(
  subject = subject_covariates$subject[1:8],
  time = c(sort(unique(dat$time)), 1.25),
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)

future_panel <- merge(future_grid, subject_covariates, by = "subject", sort = FALSE)
future_panel <- future_panel[order(future_panel$subject, future_panel$time), ]
future_panel$time_scaled <- future_panel$time
future_panel$response <- NA_real_

sim_future <- simulate(
  fit,
  nsim = 20,
  seed = 12,
  newdata = future_panel
)

head(cbind(future_panel[c("subject", "time")], sim_future))

Missing data / observations handling

The gamlss.longitudinal model relies on first order copula dependence to generate adjacent time points. If there are missing adjacent observations then the link between observations for an individual before and after that missing timpoint will be broken (i.e. independent) as we do not model dependence beyond the first order, but remaining observations that are adjacent will respect the dependence structure.