Minimal workflow • gamlss.longitudinal

This vignette gives the minimal workflow for applied longitudinal analyses with gamlss.longitudinal.

We include a suggested set of steps and functions including short descriptions of each, while excluding outputs inline to minimise the complexity of reviewing the file. The idea is to allow the user to quickly pickup the relevant code and apply to either their own dataset or the included simulated dataset.

For a worked through example with more detailed review of data, outputs and diagnostics, including their interpretation, please review the detailed worked example: Detailed worked example.

The key parts are:

Select marginal distribution and copula function
Fit model, review outputs and compare fits
Review model diagnostics
(Optional) Predict and simulate
(Optional) Compare to common models: gee, glmm, gam

1-3 are the core model building workflow, while 4-5 provide additional tools for analysis depending on user needs.

Installation

For now installation is available directly from the project’s github repository. You may need to install the package ‘remotes’ through install.packages("remotes").

remotes::install_github("ahibbert/gamlss.longitudinal")
library(gamlss.longitudinal)

Data Requirements

The dataset should be structured as one row per subject and timepoint. The dataset should include:

the response variable;
a subject identifier;
a visit, time, or occasion identifier;
any covariates for analysis.

This workflow is designed to be able to be used on a dataset of your choice (with the suggested structure). Alternatively, the small simulated dataset below has the required structure and can be used to run the workflow end to end.

# Simulate example dataset
dat <- simulate_longitudinal_dataset(
  n = 100,
  times = 0:3,
  margin_dist = GG(),
  copula_dist = "N",
  margin_params = list(
    mu = function(d) exp(1.6 + 0.35 * (d$treatment == "active") + 0.12 * d$time + d$subject_shift),
    sigma = function(d) exp(-0.5 + 0.08 * d$time)
  ),
  copula_params = list(theta = 0.4),
  covariates = function(d) {
    simulate_longitudinal_covariates(d, subject = list(
      treatment = function(s) factor(rep(c("control", "active"), length.out = nrow(s))),
      age_scaled = function(s) as.numeric(scale(rnorm(nrow(s), mean = 55, sd = 10))),
      subject_shift = function(s) rnorm(nrow(s), sd = 0.35)
    ))
  },
  seed = 1
)

# Review data completeness / counts by time, subject
str(dat)
table(dat$subject)
table(dat$time)

Missing responses are allowed but the missingness mechanism still needs to be reviewed as part of model assessment. Missingness at random is generally well managed by the model when the relevant observed predictors are included, but missingness not at random may affect results. Use check_missingness() as a first screen for observed-feature associations with response missingness.

check_missingness(
  dat,
  response_var = "response",
  time_var = "time",
  subject_var = "subject"
)

Step 1: Select margin distribution and copula function

First, we review the shape of the response distribution (margin) and choose the best fitting distribution. gamlss_longitudinal supports all non-mixed response types from the gamlss.dist library of distributions.

For simplicity, we provide the function select_margin() which fits the available gamlss distributions and provides a sorted list of fits by AIC, and confirms if they are supported by gamlss_longitudinal. This can be limited to certain fits such as specific distributions, e.g. families = c("PO", "GA") or response types, e.g. type=realplus.

In general, we would suggest using the top fitting distributions as a shortlist that can later be revisited once covariates are incorporated into the fit.

# Select best-fitting margin distribution
margin_selection <- select_margin(
  dat,
  response_var = "response"
)

print(margin_selection)

margin_dist <- best_fit_family(margin_selection)

We then select our starting copula based on the chosen marginal distribution. The function select_copula() is used to compare dependence copula functions. When response_var and margin_dist are supplied, pseudo-observations are created automatically from a temporary marginal GAMLSS fit.

# Select best-fitting copula distribution
copula_selection <- select_copula(
  data = dat,
  response_var = "response",
  margin_dist = margin_dist,
  subject_var = "subject",
  time_var = "time"
)

print(copula_selection)

copula_dist <- best_fit_family(copula_selection)

The selected family is a starting point. Dependence diagnostics after the full fit matter more than the screen as dependence shape can change a lot after covariates are accounted for in the margins.

Step 2: Fit the longitudinal gamlss model and review key outputs

Depending on the margin distribution selected and the copula function selected, it’s possible to depend variables on any parameter for the margin: mu.formula which is generally the mean, sigma.formula which is generally the variance parameter, nu.formula, tau.formula, and for the copula: theta.formula and zeta.formula.

Other than the covariate formulas, the model requires the user specify the subject and time variables in the provided dataset through parameters time_var and subject_var. The user should also specify the chosen margin distribution in functional form (i.e. GA()) as well as the selected distributions for the margin and copula (initially based on screening in the previous step).

For the included simulated example, we know there is a binary treatment effect and a time effect on both the mean, mu.formula, and the variance, sigma.formula. In addition, there is a smooth age-effect on the mean. There is also increasing correlation over time which we capture with a covariate in theta.formula.

# Fit longitudinal gamlss model
fit <- gamlss_longitudinal(
  dataset = dat,
  margin_dist = margin_dist,
  copula_dist = copula_dist,
  time_var = "time",
  subject_var = "subject",
  mu.formula = response ~ treatment + time + s(age_scaled, bs = "ps"),
  sigma.formula = ~ treatment + time,
  theta.formula = ~ time
)

You can then review a summary of the fit including covariate estimates, standard errors and significance alongside model selection criteria: AIC, BIC and log likelihood.

# Model summary
summary(fit)

Other standard functions for outputs of regression models are also available, such as confidence intervals, wald tests, likelihood ratios and bootstrap inference.

# Confidence intervals
confint(fit)

#Wald test
wald_test(fit)

# Likelihood ratio test
fit_reduced <- gamlss_longitudinal(
  dataset = dat,
  margin_dist = margin_dist,
  copula_dist = copula_dist,
  time_var = "time",
  subject_var = "subject",
  mu.formula = response ~ treatment + time,
  sigma.formula = ~ treatment + time,
  theta.formula = ~ time
)
likelihood_compare(fit_reduced, fit)

# Bootstrapped confidence intervals (WARNING: Can take a long time to run since it's refitting for each R)
boot <- bootstrap_inference(fit, R = 20)
print(boot)

See vignette("inference-uncertainty", package = "gamlss.longitudinal") for the reporting language around Hessian, numerical-Hessian, delta-method, and simulation-based uncertainty.

See vignette("diagnostics-decisions", package = "gamlss.longitudinal") for the decision language around marginal, tail, dependence, and inference warnings.

Step 3: Diagnostics

We provide a built-in check_model() function which highlights basic automated check statuses and failed-check warnings in plain language from the fit diagnostics including convergence errors or unusual looking fit statistics.

# Model checks
check <- check_model(fit, include_vcov = TRUE, dependence_cor_cutoff = 0.25)
check
check$basic_checks
check$basic_checks_result
check$warnings

To visually review the fits for both margin and the dependence copula we provide plotting functions directly on the fit object: plot(fit) and plot_copula_diagnostics(fit). For the margin plots, we include standard margin diagnostics:

PIT histogram,
QQ plot,
Worm plot, and
Rootogram.

For the copula we include empirical copula v fitted density and correlation by quantile, Rosenblatt residuals and lag plot, ordered fitted versus observed density, tail co-occurrence and exceedance, and residual dependence by lag.

# View margin diagnostics
plot(fit)

# View dependence diagnostics
plot_copula_diagnostics(fit)

Treat the output as a decision aid: - if the model is not converging, for a highly complex model more iterations may be needed which can be specified by max_outer_iter=N. Usually, however, iteration warnings are either due to a model which is not ideally specified (e.g. incorrect copula or margin shape) or covariates which are adding complexity to the model without adding fit resulting in optimizer issues - lack of fit in the marginal diagnostics, plot(fit) or marginal warnings, generally indicate a poor selection of marginal distribution or marginal covariates - lack of fit in the copula diagnostics, plot_copula_diagnostics(fit) or residual dependence warnings, could either be choice of copula or copula covariates, or marginal distributions which are not fully accounting for pre-correlation margin effects resulting in a misshapen copula. The residual-dependence warning uses a practical correlation cutoff, not a formal test, so small remaining correlations should be reviewed visually - tail fit warnings mean quantile or probability claims may not be reliable

Ideally all warnings should be rectified and diagnostics accepted as satisfactory prior to moving to reporting.

Optional: Simulation and prediction

Multiple options are provided for prediction including on response and parameter scale, by quantiles, cdf, density or probability, and reporting by marginal effect.

# Predict the fitted response mean with confidence interval
predict(
  fit,
  newdata = dat,
  type = "mean",
  se.fit = TRUE,
  interval = "confidence"
)

# Predict the estimated GAMLSS parameter values for mu, sigma etc.
predict(
  fit,
  newdata = dat,
  type = "parameters"
)

# Predict the fitted GAMLSS mu parameter explicitly
predict(
  fit,
  newdata = dat,
  type = "mu"
)

# Predict specified quantiles
predict(
  fit,
  newdata = dat,
  type = "quantile",
  probs = c(0.1, 0.5, 0.9)
)

# Predict cdf
predict(
  fit,
  newdata = dat,
  type = "cdf",
  q = 10
)

# Predict probability density
predict(
  fit,
  newdata = dat,
  type = "density",
  y = 10
)

# Predict observation probability (either above or below response value)
predict(
  fit,
  newdata = dat,
  type = "probability",
  q = 10,
  direction = "above"
)

# Calculate average response across a covariate for the margin
marginal_effects(
  fit,
  newdata = dat,
  variable = "treatment",
  parameter = "mu",
  se.fit = TRUE
)

Simulate from the fitted model as desired, for a total of nsim replications. Note that simulation uses both copula and margin structure so dependence is maintained in simulations.

sim_model <- simulate(fit, nsim = 100, seed = 1)

Optional: Compare/benchmark to familiar models: gee (geepack), glmm (lme4), gam (mgcv)

Since we know most users would be considering a longitudinal gamlss model as an alternative to existing solutions using gees, gam or glmms we provide easy to use comparison tools to benchmark longitudinal gamlss models to existing methods.

# Fit comparison models and summarise fits for a model with linear age term
bench_nosmooth <- benchmark_standard_models(
  data = dat,
  formula = response ~ treatment + time + age_scaled,
  subject_var = "subject",
  family = "gamma",
  fit = fit_reduced
)

print(bench_nosmooth)

# Fit comparison models and summarise fits for a model with smooth age term
bench_smooth <- benchmark_standard_models(
  data = dat,
  formula = response ~ treatment + time + s(age_scaled),
  subject_var = "subject",
  family = "gamma",
  fit = fit
)

print(bench_smooth)

In general, note that the purpose of the gamlss.longitudinal model is to accurately capture the full joint distribution shape. If the item of interest is just the mean, then gamlss.longitudinal is unlikely to provide large gains except for where distributions are clearly non-standard. By modelling the whole joint outcome distribution we can more accurately answer questions such as whether quantiles, tail risks, or uncertainty (SEs, confidence intervals, prediction intervals) change in a meaningful way. In addition, note that the consumption of degrees of freedom for calculation of accurate conditional AICs for random effect models can be substantial compared to the small number of parameters needed to fit a copula-based dependence structure.

We provide additional information on comparisons in other vignettes: vignette("benchmarking-adoption", package = "gamlss.longitudinal") for scenario-level benchmark reporting.

For a more detailed walk-through example, including interpretation of model fit and diagnostics, please review the Detailed worked example.