This vignette gives the minimal workflow for applied longitudinal
analyses with gamlss.longitudinal.
We include a suggested set of steps and functions including short descriptions of each, while excluding outputs inline to minimise the complexity of reviewing the file. The idea is to allow the user to quickly pickup the relevant code and apply to either their own dataset or the included simulated dataset.
For a worked through example with more detailed review of data, outputs and diagnostics, including their interpretation, please review the detailed worked example: Detailed worked example.
The key parts are:
- Select marginal distribution and copula function
- Fit model, review outputs and compare fits
- Review model diagnostics
- (Optional) Predict and simulate
- (Optional) Compare to common models: gee, glmm, gam
1-3 are the core model building workflow, while 4-5 provide additional tools for analysis depending on user needs.
Installation
For now installation is available directly from the project’s github
repository. You may need to install the package ‘remotes’ through
install.packages("remotes").
remotes::install_github("ahibbert/gamlss.longitudinal")
library(gamlss.longitudinal)Data Requirements
The dataset should be structured as one row per subject and timepoint. The dataset should include:
- the response variable;
- a subject identifier;
- a visit, time, or occasion identifier;
- any covariates for analysis.
This workflow is designed to be able to be used on a dataset of your choice (with the suggested structure). Alternatively, the small simulated dataset below has the required structure and can be used to run the workflow end to end.
# Simulate example dataset
dat <- simulate_longitudinal_dataset(
n = 100,
times = 0:3,
margin_dist = GG(),
copula_dist = "N",
margin_params = list(
mu = function(d) exp(1.6 + 0.35 * (d$treatment == "active") + 0.12 * d$time + d$subject_shift),
sigma = function(d) exp(-0.5 + 0.08 * d$time)
),
copula_params = list(theta = 0.4),
covariates = function(d) {
simulate_longitudinal_covariates(d, subject = list(
treatment = function(s) factor(rep(c("control", "active"), length.out = nrow(s))),
age_scaled = function(s) as.numeric(scale(rnorm(nrow(s), mean = 55, sd = 10))),
subject_shift = function(s) rnorm(nrow(s), sd = 0.35)
))
},
seed = 1
)Missing responses are allowed but the missingness mechanism still
needs to be reviewed as part of model assessment. Missingness at random
is generally well managed by the model when the relevant observed
predictors are included, but missingness not at random may affect
results. Use check_missingness() as a first screen for
observed-feature associations with response missingness.
check_missingness(
dat,
response_var = "response",
time_var = "time",
subject_var = "subject"
)Step 1: Select margin distribution and copula function
First, we review the shape of the response distribution (margin) and choose the best fitting distribution. gamlss_longitudinal supports all non-mixed response types from the gamlss.dist library of distributions.
For simplicity, we provide the function select_margin()
which fits the available gamlss distributions and provides a sorted list
of fits by AIC, and confirms if they are supported by
gamlss_longitudinal. This can be limited to certain fits such as
specific distributions, e.g. families = c("PO", "GA") or
response types, e.g. type=realplus.
In general, we would suggest using the top fitting distributions as a shortlist that can later be revisited once covariates are incorporated into the fit.
# Select best-fitting margin distribution
margin_selection <- select_margin(
dat,
response_var = "response"
)
print(margin_selection)
margin_dist <- best_fit_family(margin_selection)We then select our starting copula based on the chosen marginal
distribution. The function select_copula() is used to
compare dependence copula functions. When response_var and
margin_dist are supplied, pseudo-observations are created
automatically from a temporary marginal GAMLSS fit.
# Select best-fitting copula distribution
copula_selection <- select_copula(
data = dat,
response_var = "response",
margin_dist = margin_dist,
subject_var = "subject",
time_var = "time"
)
print(copula_selection)
copula_dist <- best_fit_family(copula_selection)The selected family is a starting point. Dependence diagnostics after the full fit matter more than the screen as dependence shape can change a lot after covariates are accounted for in the margins.
Step 2: Fit the longitudinal gamlss model and review key outputs
Depending on the margin distribution selected and the copula function
selected, it’s possible to depend variables on any parameter for the
margin: mu.formula which is generally the mean,
sigma.formula which is generally the variance parameter,
nu.formula, tau.formula, and for the copula:
theta.formula and zeta.formula.
Other than the covariate formulas, the model requires the user
specify the subject and time variables in the provided dataset through
parameters time_var and subject_var. The user
should also specify the chosen margin distribution in functional form
(i.e. GA()) as well as the selected distributions for the
margin and copula (initially based on screening in the previous
step).
For the included simulated example, we know there is a binary
treatment effect and a time effect on both the mean,
mu.formula, and the variance, sigma.formula.
In addition, there is a smooth age-effect on the mean. There is also
increasing correlation over time which we capture with a covariate in
theta.formula.
# Fit longitudinal gamlss model
fit <- gamlss_longitudinal(
dataset = dat,
margin_dist = margin_dist,
copula_dist = copula_dist,
time_var = "time",
subject_var = "subject",
mu.formula = response ~ treatment + time + s(age_scaled, bs = "ps"),
sigma.formula = ~ treatment + time,
theta.formula = ~ time
)You can then review a summary of the fit including covariate estimates, standard errors and significance alongside model selection criteria: AIC, BIC and log likelihood.
# Model summary
summary(fit)Other standard functions for outputs of regression models are also available, such as confidence intervals, wald tests, likelihood ratios and bootstrap inference.
# Confidence intervals
confint(fit)
#Wald test
wald_test(fit)
# Likelihood ratio test
fit_reduced <- gamlss_longitudinal(
dataset = dat,
margin_dist = margin_dist,
copula_dist = copula_dist,
time_var = "time",
subject_var = "subject",
mu.formula = response ~ treatment + time,
sigma.formula = ~ treatment + time,
theta.formula = ~ time
)
likelihood_compare(fit_reduced, fit)
# Bootstrapped confidence intervals (WARNING: Can take a long time to run since it's refitting for each R)
boot <- bootstrap_inference(fit, R = 20)
print(boot)See
vignette("inference-uncertainty", package = "gamlss.longitudinal")
for the reporting language around Hessian, numerical-Hessian,
delta-method, and simulation-based uncertainty.
See
vignette("diagnostics-decisions", package = "gamlss.longitudinal")
for the decision language around marginal, tail, dependence, and
inference warnings.
Step 3: Diagnostics
We provide a built-in check_model() function which highlights basic automated check statuses and failed-check warnings in plain language from the fit diagnostics including convergence errors or unusual looking fit statistics.
# Model checks
check <- check_model(fit, include_vcov = TRUE, dependence_cor_cutoff = 0.25)
check
check$basic_checks
check$basic_checks_result
check$warningsTo visually review the fits for both margin and the dependence copula
we provide plotting functions directly on the fit object:
plot(fit) and plot_copula_diagnostics(fit).
For the margin plots, we include standard margin diagnostics:
- PIT histogram,
- QQ plot,
- Worm plot, and
- Rootogram.
For the copula we include empirical copula v fitted density and correlation by quantile, Rosenblatt residuals and lag plot, ordered fitted versus observed density, tail co-occurrence and exceedance, and residual dependence by lag.
# View margin diagnostics
plot(fit)
# View dependence diagnostics
plot_copula_diagnostics(fit)Treat the output as a decision aid: - if the model is not converging,
for a highly complex model more iterations may be needed which can be
specified by max_outer_iter=N. Usually, however, iteration
warnings are either due to a model which is not ideally specified
(e.g. incorrect copula or margin shape) or covariates which are adding
complexity to the model without adding fit resulting in optimizer issues
- lack of fit in the marginal diagnostics, plot(fit) or
marginal warnings, generally indicate a poor selection of marginal
distribution or marginal covariates - lack of fit in the copula
diagnostics, plot_copula_diagnostics(fit) or residual
dependence warnings, could either be choice of copula or copula
covariates, or marginal distributions which are not fully accounting for
pre-correlation margin effects resulting in a misshapen copula. The
residual-dependence warning uses a practical correlation cutoff, not a
formal test, so small remaining correlations should be reviewed visually
- tail fit warnings mean quantile or probability claims may not be
reliable
Ideally all warnings should be rectified and diagnostics accepted as satisfactory prior to moving to reporting.
Optional: Simulation and prediction
Multiple options are provided for prediction including on response and parameter scale, by quantiles, cdf, density or probability, and reporting by marginal effect.
# Predict the fitted response mean with confidence interval
predict(
fit,
newdata = dat,
type = "mean",
se.fit = TRUE,
interval = "confidence"
)
# Predict the estimated GAMLSS parameter values for mu, sigma etc.
predict(
fit,
newdata = dat,
type = "parameters"
)
# Predict the fitted GAMLSS mu parameter explicitly
predict(
fit,
newdata = dat,
type = "mu"
)
# Predict specified quantiles
predict(
fit,
newdata = dat,
type = "quantile",
probs = c(0.1, 0.5, 0.9)
)
# Predict cdf
predict(
fit,
newdata = dat,
type = "cdf",
q = 10
)
# Predict probability density
predict(
fit,
newdata = dat,
type = "density",
y = 10
)
# Predict observation probability (either above or below response value)
predict(
fit,
newdata = dat,
type = "probability",
q = 10,
direction = "above"
)
# Calculate average response across a covariate for the margin
marginal_effects(
fit,
newdata = dat,
variable = "treatment",
parameter = "mu",
se.fit = TRUE
)Simulate from the fitted model as desired, for a total of
nsim replications. Note that simulation uses both copula
and margin structure so dependence is maintained in simulations.
sim_model <- simulate(fit, nsim = 100, seed = 1)Optional: Compare/benchmark to familiar models: gee (geepack), glmm (lme4), gam (mgcv)
Since we know most users would be considering a longitudinal gamlss model as an alternative to existing solutions using gees, gam or glmms we provide easy to use comparison tools to benchmark longitudinal gamlss models to existing methods.
# Fit comparison models and summarise fits for a model with linear age term
bench_nosmooth <- benchmark_standard_models(
data = dat,
formula = response ~ treatment + time + age_scaled,
subject_var = "subject",
family = "gamma",
fit = fit_reduced
)
print(bench_nosmooth)
# Fit comparison models and summarise fits for a model with smooth age term
bench_smooth <- benchmark_standard_models(
data = dat,
formula = response ~ treatment + time + s(age_scaled),
subject_var = "subject",
family = "gamma",
fit = fit
)
print(bench_smooth)In general, note that the purpose of the gamlss.longitudinal model is to accurately capture the full joint distribution shape. If the item of interest is just the mean, then gamlss.longitudinal is unlikely to provide large gains except for where distributions are clearly non-standard. By modelling the whole joint outcome distribution we can more accurately answer questions such as whether quantiles, tail risks, or uncertainty (SEs, confidence intervals, prediction intervals) change in a meaningful way. In addition, note that the consumption of degrees of freedom for calculation of accurate conditional AICs for random effect models can be substantial compared to the small number of parameters needed to fit a copula-based dependence structure.
We provide additional information on comparisons in other vignettes:
vignette("benchmarking-adoption", package = "gamlss.longitudinal")
for scenario-level benchmark reporting.
For a more detailed walk-through example, including interpretation of model fit and diagnostics, please review the Detailed worked example.