Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 9)

Blog

Introduction

5.6 Tackling Multicollinearity

5.7 Testing the Proportional Hazards Assumption

5.8 Checking Model Fit with Residual Diagnostics

5.9 Interpreting Martingale Residuals Plots

Conclusion

March 3, 2025

Research

Tutorials

Introduction

Survival analysis is a powerful tool for understanding time-to-event data, and the Cox proportional hazards (PH) model is a cornerstone of this field. However, like any statistical model, the Cox model relies on certain assumptions. If these assumptions are violated, our results could be misleading or even incorrect. In this post, we'll dive into a practical guide for model diagnostics using R, focusing on a real-world example related to moderate-to-severe traumatic brain injury (msTBI). We'll cover how to assess multicollinearity, verify the proportional hazards assumption, and examine residuals to ensure our model fits the data well. By rigorously evaluating these assumptions, we pave the way for robust and interpretable results.

Why Diagnostics Matter: Building Trust in Our Results

Before we jump into the how-to, let's talk about the why. Model diagnostics are not just a statistical formality; they are essential for ensuring the trustworthiness of our analysis. Here's what we need to confirm before we start drawing conclusions:

Low Multicollinearity: If our predictor variables are too closely related (high multicollinearity), it becomes difficult to disentangle their individual effects on survival. This can also inflate the variance of our estimates, making them less precise.
Proportional Hazards: The Cox model assumes that the hazard ratio comparing two groups remains constant over time. If this assumption is violated, our estimated hazard ratios may not accurately reflect the true relationship between our predictors and survival.
Good Model Fit: We need to make sure our model adequately captures the relationshpi between our predictors and survival time. Residual diagnostics help us identify potential problems like non-linearity or influential observations that could skew our results.

5.6 Tackling Multicollinearity

This section evaluates the Cox regression model for multicollinearity—a critical step in survival analysis. It employs Variance Inflation Factors (VIF) and a correlation matrix to detect problematic correlations among predictors. Results reveal low VIF values and weak correlations, indicating no significant multicollinearity. This ensures the independence and reliability of estimated effects, particularly for key variables like depression and age. These findings bolster confidence in the model's accuracy in assessing factors influencing post-TBI mortality, providing a strong foundation for subsequent analysis and interpretation. The absence of multicollinearity validates the integrity of the model, paving the way for meaningful insights into survival outcomes.

5.7 Testing the Proportional Hazards Assumption

This section examines the proportional hazards assumption, which is fundamental to Cox regression. We utilize Schoenfeld residuals, alongside statistical tests and visualizations, to assess whether the effects of predictors on the hazard ratio are constant over time. This step ensures the reliability of our hazard ratio estimates and confirms the validity of applying the Cox model to our data.

5.8 Checking Model Fit with Residual Diagnostics

In this section, we delve into using Martingale residuals as a diagnostic tool to evaluate the overall fit of our Cox models, particularly with respect to the functional form of our continuous predictors. By plotting these residuals against our continuous covariates and adding a smooth loess curve, we can visually inspect whether a linear relationship is appropriate or if a transformation is needed to improve model fit. This helps us ensure that our model accurately represents the underlying data structure.

5.9 Interpreting Martingale Residual Plots

Building on the previous section, we provide a detailed guide on how to interpret Martingale residual plots. We'll look for patterns, trends, and outliers that could indicate non-linearity, model misspecification, or influential observations. This section will equip you with the skills to critically assess these plots and make informed decisions about potential model adjustments.

The Payoff

By performing these diagnostic checks, we're ensuring that our Cox regression models are not just statistically sound but also capable of yielding meaningful and trustworthy insights. We're building a bridge between complex statistical methods and real-world implications, allowing us to confidently answer our research question about the impact of depression on post-TBI survival.

As always, we'll guide you through each step with clear R code examples and intuitive explanations. By the end of this post, you'll have the skills and knowledge to perform essential diagnostics on your own survival models, ensuring that your findings are both accurate and impactful.

5.6 Tackling Multicollinearity

Introduction

In our ongoing investigation into the impact of depression one year after a traumatic brain injury (TBI) on mortality, we've explored many facets of survival analysis. Now, as we approach the culmination of our series, we need to address a crucial assumption underlying multivariable Cox regression: the absence of significant multicollinearity among our predictor variables. Multicollinearity—a situation where predictors are highly correlated—can muddy the waters, making it difficult to understand the true impact of each variable on survival. In this section, we'll use the Variance Inflation Factor (VIF) and correlation matrices to detect and address this potential issue, ensuring the robustness of our findings.

Why Should We Care About Multicollinearity?

In the realm of survival analysis—particularly when dealing with clinical and sociodemographic data—it's common for predictors to be correlated to some extent. While minor correlations are usually harmless, severe multicollinearity can wreak havoc on our analysis by:

Obscuring True Effects: It becomes difficult to determine which variable is truly driving the observed effect on survival. Imagine trying to figure out which ingredient is making a cake taste great when two ingredients are almost identical!
Inflating Uncertainty: Multicollinearity widens our confidence intervals, making our results less precise and reliable.
Distorting Hazard Ratios: Highly correlated predictors can lead to inaccurate estimates of hazard ratios, potentially leading to incorrect conclusions.

Our Arsenal for Detecting Multicollinearity

We'll employ two powerful tools to uncover multicollinearity:

Variance Inflation Factor (VIF): This metric quantifies how much the variance of a predictor's estimated coefficient is inflated due to its correlation with other predictors.
Correlation Matrix: This visual tool displays the pairwise correlations between all of our numeric predictors, helping us quickly spot strong relationships.

Step 1: Unveiling Multicollinearity with the Variance Inflation Factor (VIF)

The VIF is our first line of defense. It tells us how much the variance of a regression coefficient is increased because of multicollinearity. Generally:

VIF > 5: This is a warning sign, suggesting that multicollinearity might be influencing the results.
VIF > 10: This is a serious red flag, indicating severe multicollinearity that needs to be addressed.

Calculating VIF in R

# Define the VIF model using all model covariates
model_4_vif <- lm(as.formula(paste("event_status ~", paste(all_select_covariates, collapse = " + "))),
                   data = complete_cases_select)

# Calculate VIF values
vif_values <- car::vif(model_4_vif)
print(vif_values)

Key Insights in Interpreting the VIF

Purpose: VIF values reveal the extent to which the variance of each predictor is inflated due to its linear relationships with other predictors.
Thresholds:
- 1-5: Generally considered acceptable.
- >5: Suggests multicollinearity that warrants attention.
- >10: Indicates severe multicollinearity requiring corrective action (e.g., removing or transforming variables).

Presenting VIF Results Clearly

To make our VIF results easily digestible, we'll create a formatted table using the gtsummary package:

# Create a gtsummary table for the VIF results
vif_table <- tbl_regression(model_4_vif, label = var_name_mapping, pvalue_fun = purrr::partial(style_pvalue, digits = 3)) |>
  modify_header(label ~ "**VIF Summary Table for Model 4**", p.value ~ "**_p_-value**") |>
  bold_labels() |>
  bold_p() |>
  add_vif() |>
  as_gt()

# Save the table as an image
gtsave(vif_table, here(tables_dir, "table_3_vif_summary_table_model_4_final.png"))

Why This Table Matters: This table provides a clear and concise summary of our VIF results, making it easy to identify any covariates with potentially problematic VIF values. This guides our decisions on how to refine our model.

Interpreting the VIF Summary Table for Model 4

Let's examine the VIF summary table for Model 4 to understand the extent of multicollinearity in our current model:

Depression Level at Year 1: The Adjusted GVIF values for both "Minor Depression" and "Major Depression" (compared to "No Depression") are close to 1. This indicates no significant multicollinearity related to depression level.
Age at Injury: The Adjusted GVIF is 1.1, well within the acceptable range. Age does not appear to be involved in problematic multicollinearity.
Sex: Similar to age, the Adjusted GVIF for sex is 1.0, indicating no issues.
Educational Attainment at Injury: The Adjusted GVIF is 1.1, suggesting no multicollinearity concerns.
Medicaid Status: The Adjusted GVIF is 1.0, indicating no issues.
History of Mental Health Treatment: The Adjusted GVIF is 1.0, suggesting no multicollinearity.
History of Suicide Attempt: The Adjusted GVIF is 1.0, indicating no issues.
Problematic Substance Use at Injury: The Adjusted GVIF is 1.1, suggesting no multicollinearity.
Function Factor Score at Year 1 Quintiles: The Adjusted GVIF is 1.1, also within the acceptable range.

Overall Assessment of Multicollinearity Based on VIF

Based on the VIF values, our model does not show evidence of significant multicollinearity. All GVIF and Adjusted GVIF values are well below the commonly used threshold of 5. This suggests that the variances of our regression coefficients are not substantially inflated due to correlations among the predictors. We can be more confident that the estimated effects of each predictor are relatively independent and reliable.

Step 2: Visualizing Relationships with the Correlation Matrix

The correlation matrix is our visual detective—helping us identify pairs of numeric predictors that are strongly related.

Calculating the Correlation Matrix

# Select numeric predictors
numeric_predictors <- complete_cases_select |>
  select(-id, -event_status, -time_to_event_in_years, -func_score_at_year_1) |>
  select(where(is.numeric))

# Calculate the correlation matrix
correlation_matrix <- cor(numeric_predictors, use = "complete.obs")
print(correlation_matrix)

Creating a Heatmap for Easy Interpretation

We'll use the corrplot package to generate a visually appealing heatmap of our correlation matrix:

# Define a color palette
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))

# Save the correlation matrix plot
png(filename = file.path(multicollinearity_plots_dir, "correlation_matrix_final.png"),
    width = 4000, height = 3000, res = 300)
par(family = "Proxima Nova")
# Visualize the correlation matrix
corrplot(correlation_matrix, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45,
         col = col(200), addCoef.col = "black", cl.pos = "n", order = "AOE")
dev.off()

Interpreting the Correlation Matrix

Purpose: The correlation matrix and its heatmap provide a quick overview of the pairwise linear relationships between our numeric predictors.
Thresholds:
- |r| > 0.8: This is a common threshold for concern. Correlations this strong (either positive or negative) suggest that two variables might be measuring very similar things, potentially leading to multicollinearity issues.

Interpreting the Correlation Matrix for Model 4

Let's analyze the correlation matrix image that we generated, which shows the correlations between the numeric variables "Function Factor Score at Year 1 Quintiles," "Educational Attainment at Injury," and "Age at Injury."

Function Factor Score and Educational Attainment: The correlation is 0.22. This indicates a weak positive correlation, suggesting that individuals with higher functional scores tend to have slightly higher educational attainment. This correlation is not strong enough to raise concerns about multicollinearity.
Function Factor Score and Age at Injury: The correlation is -0.10. This indicates a weak negative correlation, suggesting that older individuals at the time of injury tend to have slightly lower functional scores. Again, this correlation is not strong enough to be problematic.
Educational Attainment and Age at Injury: The correlation is 0.14. This indicates a weak positive correlation, suggesting a slight tendency for older individuals to have higher educational attainment. This correlation is also not a cause for concern regarding multicollinearity.

Overall Assessment Based on the Correlation Matrix

The correlation matrix reveals only weak correlations between the three variables examined. None of the correlation coefficients are close to the |r| > 0.8 threshold that would suggest significant multicollinearity. This aligns with the findings from the VIF analysis, further supporting the conclusion that multicollinearity is not a major issue in our model.

Strategies for Addressing Multicollinearity

If we detected significant multicollinearity, we have several options:

Variable Selection: Carefully consider which variables are most clinically relevant or theoretically important and remove those that are highly correlated with these key predictors.
Combine Predictors: If several variables are measuring similar underlying constructs, consider creating a composite score or index that captures the shared information.
Regularization Techniques: Methods like Lasso regression can automatically shrink the coefficients of highly correlated predictors, reducing their impact and mitigating multicollinearity.

Outputs and Next Steps

After thoroughly evaluating multicollinearity:

Document: Keep a record of any variables with high VIF values or strong correlations.
Refine: Adjust your model by removing or combining variables as needed to address multicollinearity, ensuring that you don't lose essential information.
Reassess: After making changes, rerun your diagnostics (VIF, correlation matrix) to confirm that the updated model meets the assumptions of Cox regression.

Conclusion

By assessing and addressing multicollinearity, we've taken a crucial step to ensure the integrity and reliability of our multivariable Cox regression model. This enhances the trustworthiness of our findings regarding the relationship between depression and mortality after TBI. Since our VIF table and correlation matrix showed no significant issues, we can proceed with greater confidence in our model.

5.7 Testing the Proportional Hazards Assumption

Introduction

As we near the end of our exploration of survival analysis, we turn our attention to a critical assumption underlying the Cox proportional hazards model: the proportional hazards assumption. This assumption dictates that the hazard ratio for any given predictor variable remains constant over the study's duration. In simpler terms, the effect of a predictor on the risk of an event should not change with time. Failing to meet this assumption can compromise the model's validity, highlighting the need for rigorous testing before drawing conclusions.

Why Should We Care About the Proportional Hazards Assumption?

The Cox model relies on the concept of proportional hazards to produce meaningful and interpretable hazard ratios. If this assumption is violated:

Hazard ratios become unreliable: They may fluctuate over time, leading to misleading interpretations about the impact of predictor variables.
Model accuracy suffers: The model's fit and predictive power can diminish, making it less useful for forecasting outcomes.

To assess this critical assumption, we employ Schoenfeld residuals. These residuals help us determine if the impact of each predictor variable on the hazard function changes over time.

Putting the Assumption to the Test: Using Schoenfeld Residuals

Step 1: Conducting the Schoenfeld Test

The Schoenfeld test is our primary tool for investigating the time-dependency of predictors in the Cox model. It assesses whether the effect of each variable on the hazard changes significantly over time. A significant result (typically a p-value below 0.05) suggests a violation of the proportional hazards assumption for that specific variable.

In our R code, we use the cox.zph function to perform the Schoenfeld test for each of our models:

# Perform the Schoenfeld test on Models 1-4
schoenfeld_test_results_model_1 <- cox.zph(model_1, transform = "identity")
schoenfeld_test_results_model_2 <- cox.zph(model_2, transform = "identity")
schoenfeld_test_results_model_3 <- cox.zph(model_3, transform = "identity")
schoenfeld_test_results_model_4 <- cox.zph(model_4, transform = "identity")

The transform = "identity" argument ensures that the test is performed on the original time scale.

Step 2: Extracting and Organizing the Test Results

To effectively interpret the results, we need to:

Isolate Key Metrics:
- Chi-Square (chisq): This statistic tests the proportional hazards assumption for each variable.
- p-value: This value indicates whether the assumption is violated (a p-value less than 0.05 is generally considered significant).
Present Results Clearly:
- We utilize the gt package to create a well-structured table that clearly presents the results and highlights any variables with significant violations.

The following helper functions streamline this process:

extract_schoenfeld_results(): This function neatly extracts the Schoenfeld test results into a tidy data frame.

# Define a function to extract results into a data frame
extract_schoenfeld_results <- function(test_result) {
  data.frame(
    Variable = rownames(test_result$table),
    chisq = test_result$table[, "chisq"],
    df = test_result$table[, "df"],
    p = test_result$table[, "p"]
  )
}

create_gt_table(): This function transforms the results into a polished, publication-ready table.

# Define a function to create a gt table with better formatting and styling
create_gt_table <- function(results_df, model_name, var_name_mapping) {
  results_df <- results_df |>
    mutate(Variable = var_name_mapping[Variable])

  results_df |>
    gt() |>
    tab_header(
      title = paste("Schoenfeld Residuals Test Results for", model_name)
    ) |>
    fmt_number(
      columns = c(chisq, df),        # Format only chisq and df
      decimals = 3
    ) |>
    fmt(                           # New formatting for p-values
      columns = "p",
      fns = function(x) ifelse(x < 0.001, "<0.001", sprintf("%.3f", x))  # Conditional formatting
    ) |>
    cols_label(
      Variable = "Characteristic",
      chisq = "Chi-Square",
      df = "Degrees of Freedom",
      p = "p-value"
    ) |>
    tab_style(
      style = cell_text(weight = "bold"),
      locations = cells_column_labels(everything())
    ) |>
    tab_style(
      style = cell_text(weight = "bold"),
      locations = cells_body(
        columns = "p",
        rows = p < 0.05
      )
    ) |>
    tab_footnote(
      footnote = "p-values less than 0.05 indicate violation of the proportional hazards assumption",
      locations = cells_column_labels("p")
    )
}

Step 3: Generating User-Friendly Results Tables

The formatted tables effectively pinpoint variables that might have time-dependent effects, making it straightforward to identify potential violations. These tables are saved as high-resolution .png images, facilitating their inclusion in reports or presentations.

# Save gt tables to HTML files (intermediate step)
html_files <- list(
  file.path(schoenfeld_dir, "schoenfeld_test_model_1_final.html"),
  file.path(schoenfeld_dir, "schoenfeld_test_model_2_final.html"),
  file.path(schoenfeld_dir, "schoenfeld_test_model_3_final.html"),
  file.path(schoenfeld_dir, "schoenfeld_test_model_4_final.html")
)

gtsave(gt_table_model_1, filename = html_files[[1]])
gtsave(gt_table_model_2, filename = html_files[[2]])
gtsave(gt_table_model_3, filename = html_files[[3]])
gtsave(gt_table_model_4, filename = html_files[[4]])

# Convert HTML files to high-resolution PNG images
png_files <- sub(".html", ".png", html_files)

for (i in seq_along(html_files)) {
  webshot2::webshot(
    html_files[[i]], 
    file = png_files[[i]], 
    vwidth = 1200, 
    vheight = 800,
    zoom = 3  # Higher zoom for better resolution (effectively increases DPI)
  )
}

# Remove HTML files after PNG files are created
for (html_file in html_files) {
  if (file.exists(html_file)) {
    file.remove(html_file)
  }
}

Interpreting the Results: Focus on Model 4

Let's take a closer look at the Schoenfeld residuals test results for Model 4, as an example, to understand how to interpret the output. The table below summarizes the findings:

Overall Model Assessment (GLOBAL Test)

The "GLOBAL" test at the bottom of the table evaluates the proportional hazards assumption for the model as a whole.
It has a Chi-square statistic of 8.545 with 12 degrees of freedom, resulting in a p-value of 0.741.
Interpretation: This non-significant p-value (p > 0.05) suggests that there is no evidence to reject the proportional hazards assumption for the overall model. In other words, we do not have enough evidence to claim that the effect of the predictors changes significantly over time when considering the model as a whole.

Individual Predictor Variables:

We also examine the p-values for each individual predictor variable to check for potential violations of the proportional hazards assumption:

Depression Level at Year 1: Chi-Square = 1.095, df = 2, p = 0.578
Age at Injury: Chi-Square = 0.011, df = 1, p = 0.916
Sex: Chi-Square = 0.037, df = 1, p = 0.847
Educational Attainment at Injury: Chi-Square = 2.662, df = 1, p = 0.103
Medicaid Status: Chi-Square = 0.054, df = 1, p = 0.815
History of Mental Health Treatment: Chi-Square = 1.455, df = 2, p = 0.483
History of Suicide Attempt: Chi-Square = 0.185, df = 2, p = 0.912
Problematic Substance Use at Injury: Chi-Square = 2.004, df = 1, p = 0.157
Function Factor Score at Year 1 Quintiles: Chi-Square = 0.800, df = 1, p = 0.371

For all predictor variables, the p-values are greater than 0.05. This indicates that there is no statistically significant evidence to suggest that the effect of any of these individual predictors on the hazard (risk of the event) changes over time. The proportional hazards assumption appears to be met for each predictor.

Conclusion of Model 4 Analysis

Based on the Schoenfeld residuals test results, both the global test and the individual variable tests suggest that the proportional hazards assumption is not violated in Model 4. This means we can be more confident in the interpretability and validity of the hazard ratios estimated by this Cox regression model. The effects of the predictors appear to be relatively constant over the study period.

Important Note: While the statistical tests provide evidence supporting the proportional hazards assumption, it's always a good practice to visually inspect the Schoenfeld residual plots (as described in the next section) to further confirm these findings.

Seeing is Believing: Visualizing Schoenfeld Residuals

While the statistical tests provide crucial evidence, visualizing the Schoenfeld residuals can offer a more intuitive grasp of any time-dependent effects.

Step 4: Creating Informative Residual Plots

These plots visually depict the time-varying behavior of each predictor:

Y-axis (Beta(t)): Represents the scaled Schoenfeld residuals, illustrating how the effect of a variable changes over time.
Horizontal Line (Beta(t) = 0): This line signifies no time-dependency. Deviations from this line suggest a violation of the proportionality assumption.

Here's how we generate these plots in R:

# Loop through each variable and create residual plots
for (i in seq_len(num_elements)) {
  create_plot(i, var_labels[i])
}

  # Save the plot as a PNG file
  ggsave(
    filename = file.path(schoenfeld_dir, paste0("schoenfeld_plot_", gsub(" ", "_", var_name), ".png")),
    plot = plot,
    width = 8,
    height = 6,
    dpi = 300
  )
}

Interpreting the Plots

Look for Flat Lines: Lines that remain relatively flat and close to zero suggest that the proportionality assumption holds.
Identify Deviations: Systematic deviations from the zero line indicate that a variable's effect is changing over time, necessitating corrective measures (such as stratification or including time-dependent covariates in the model).

A Closer Look: Schoenfeld Residual Plot for "Depression Level at Year 1"

Let's examine the Schoenfeld residual plot for the variable "Depression Level at Year 1" from Model 4:

Interpretation of the "Depression Level at Year 1" Plot

Overall Trend: The smoothed line (dashed black line) in the plot is relatively flat and close to the horizontal line at zero. This suggests that the effect of depression level on the hazard does not change drastically over time.
Confidence Bands: The dashed lines above and below the smoothed line represent the confidence interval. The fact that the horizontal line at zero is generally within the confidence bands further supports the assumption of proportionality.
Individual Residuals: While there is some scattering of the individual residual points (open circles), they are fairly evenly distributed around the zero line and don't show any systematic upward or downward trends.

Conclusion from the Plot: The plot for "Depression Level at Year 1" visually supports the findings of the Schoenfeld test, which indicated no significant violation of the proportional hazards assumption for this variable in Model 4 (p = 0.578). The effect of depression level appears to be relatively constant over time.

Conclusion

Our thorough examination of the proportional hazards assumption using Schoenfeld residuals, coupled with our assessment of model fit using Martingale residuals, has provided strong evidence that our Cox model (Model 4) is well-specified and that its assumptions are met. We also previously determined that there is no significant multicollinearity among the predictors.

Therefore, we can now proceed with confidence to the next crucial stage: interpreting the results of our Cox regression model. This involves examining the estimated hazard ratios, their confidence intervals, and the associated p-values to understand the relationships between our predictor variables and the outcome of interest. We will carefully analyze the magnitude and direction of the effects, paying close attention to the clinical and practical significance of the findings. This interpretation will allow us to draw meaningful conclusions from our analysis and answer our research questions.

5.8 Checking Model Fit with Residual Diagnostics

Introduction

Before we can confidently interpret the results of our Cox regression models, we need to make sure the models are a good fit for the data. This is where model diagnostics come in, and one particularly useful tool is the Martingale residual.

In this section, we'll focus on using Martingale residuals to assess the relationship between our continuous covariates and the survival outcome. By examining these residuals, we can identify potential issues with our model and ensure that our findings are reliable. Specifically, we will use the Martingale residuals to assess the functional form of continuous variables in our model.

Why Do Martingale Residuals Matter?

Martingale residuals are a type of residual specifically designed for Cox regression models. They essentially represent the difference between the observed number of events for a participant and the expected number of events based on the model.

Here's why they're valuable for diagnostics:

Detecting Non-Linearity: Martingale residuals can help us detect whether the relationship between a continuous predictor and the hazard of the event is truly linear (as assumed by the standard Cox model) or if a transformation (e.g., logarithmic, quadratic) might be needed to better capture the relationship.
Identifying Outliers: Extreme values of Martingale residuals can point to influential observations that might be unduly affecting our model.
Assessing Overall Model Fit: By examining the patterns (or lack thereof) in Martingale residual plots, we can get a sense of how well our model fits the data overall.

Calculating Martingale Residuals

Here's how we calculate Martingale residuals in R, using our final Cox model (model_4) as an example:

martingale_residuals <- residuals(model_4, type = "martingale")

What's happening here?

residuals(model_4, type = "martingale"): This function extracts the Martingale residuals from our fitted Cox model (model_4). The type = "martingale" argument specifies that we want Martingale residuals.

Conceptual Understanding

Martingale residuals can range from negative infinity to positive infinity. Values close to zero suggest a good model fit for that particular observation.

Preparing Data for Plotting

To visualize the relationship between our continuous covariates and the Martingale residuals, we need to combine them into a single data frame:

martingale_plot_data <- data.frame(
  Age_at_Injury = complete_cases_select$age_at_injury,
  Education_Level = complete_cases_select$education_level_at_injury,
  Function_Factor_Score_Q5 = complete_cases_select$func_score_at_year_1_q5,
  Martingale_Residuals = martingale_residuals
)

What's happening here?

We create a new data frame called martingale_plot_data that includes our continuous covariates (age_at_injury, education_level_at_injury, and the quintiles of the func_score_at_year_1) and the calculated Martingale_Residuals.

Plotting Martingale Residuals: Visualizing the Relationship

Now, we'll create scatter plots of Martingale residuals against each of our continuous covariates, adding a smooth loess curve to highlight any potential non-linear patterns.

Example: Age at Injury

# Filter and plot residuals against age at injury
martingale_plot_data <- martingale_plot_data |>
  filter(is.finite(Age_at_Injury) & is.finite(Martingale_Residuals)) |>
  filter(!is.na(Age_at_Injury) & !is.na(Martingale_Residuals))

gg.age_at_injury_vs_martingale <- ggplot(martingale_plot_data, aes(x = Age_at_Injury, y = Martingale_Residuals)) +
  geom_point() +
  geom_smooth(method = "loess", color = "#EF5350", linewidth = 1.25) + # Add a Loess curve
  labs(x = "Age at Injury", y = "Martingale Residuals", title = "Martingale Residuals vs. Age at Injury") +
  scale_x_continuous(breaks = seq(16, 96, by = 10), limits = c(15, 100)) + # Adjust x-axis tick values and limits
  scale_y_continuous(breaks = seq(-3.5, 2.0, by = 1.0), limits = c(-3.5, 2.0)) + # Adjust y-axis tick values
  theme_minimal(base_family = "Proxima Nova") +
  theme(plot.title = element_text(size = 18)) +
  customization_martingale

ggsave(here(martingale_residuals_plots_dir, "figure_19-1_age_at_injury_vs_martingale_residuals.png"),
       plot = gg.age_at_injury_vs_martingale,
       width = 10, # Specify the width in inches
       height = 6, # Specify the height in inches
       bg = "#FFFFFF",
       dpi = 300)

Conceptual Breakdown

Filtering Data: We remove any rows with missing or non-finite values for Age_at_Injury and Martingale_Residuals to ensure a clean plot.
ggplot(): We initiate a ggplot object, mapping Age_at_Injury to the x-axis and Martingale_Residuals to the y-axis.
geom_point(): We create a scatter plot of the data points.
geom_smooth(method = "loess", …): We add a loess smoother, which is a non-parametric curve that helps us visualize potential non-linear trends in the relationship between the variables. The argument color = "#EF5350" specifies the color of the smoother, and linewidth = 1.25 makes it a bit thicker than the default.
labs(): We add informative labels for the x and y axes and a title for the plot.
scale_x_continuous(…) and scale_y_continuous(…): We adjust the x- and y-axis scales to ensure that the plot displays a relevant range of values. The breaks argument determines where the tick marks are placed on the axes.
theme_minimal() and customization_martingale: We apply a minimal theme and our custom theme modifications for a clean and consistent look. We also adjust the size of the plot title.
ggsave(): We save the plot as a high-resolution .png file.

Example: Education Level at Injury

We repeat a similar process to create a plot for Education_Level_at_Injury:

# Filter and plot residuals against education level at injury
martingale_plot_data <- martingale_plot_data |>
  filter(is.finite(Education_Level) & is.finite(Martingale_Residuals)) |>
  filter(!is.na(Education_Level) & !is.na(Martingale_Residuals))

gg.education_level_at_injury_vs_martingale <- ggplot(martingale_plot_data, aes(x = Education_Level, y = Martingale_Residuals)) +
  geom_point() +
  geom_smooth(method = "loess", color = "#EF5350", linewidth = 1.25) + # Add a Loess curve
  labs(x = "Education Level at Injury", y = "Martingale Residuals", title = "Martingale Residuals vs. Education Level at Injury") +
  scale_x_continuous(breaks = seq(0, 20, by = 5), limits = c(0, 20)) + # Adjust x-axis tick values and limits
  scale_y_continuous(breaks = seq(-3.5, 2.0, by = 1.0), limits = c(-3.5, 2.0)) + # Adjust y-axis tick values
  theme_minimal(base_family = "Proxima Nova") +
  theme(plot.title = element_text(size = 18)) +
  customization_martingale

ggsave(here(martingale_residuals_plots_dir, "figure_19-2_education_level_at_injury_vs_martingale_residuals.png"),
       plot = gg.education_level_at_injury_vs_martingale,
       width = 10, # Specify the width in inches
       height = 6, # Specify the height in inches
       bg = "#FFFFFF",
       dpi = 300)

Interpretation

The code and the process are very similar to the previous example, but now we're examining the relationship between Education_Level_at_Injury and the Martingale residuals.

Example: Function Factor Score Quintiles

Finally, we create a plot for the quintiles of the Function_Factor_Score_at_Year_1:

# Filter and plot residuals against function factor score at year 1
martingale_plot_data <- martingale_plot_data |>
  filter(is.finite(Function_Factor_Score_Q5) & is.finite(Martingale_Residuals)) |>
  filter(!is.na(Function_Factor_Score_Q5) & !is.na(Martingale_Residuals))

gg.function_factor_score_q5_at_year_1_vs_martingale <- ggplot(martingale_plot_data, aes(x = Function_Factor_Score_Q5, y = Martingale_Residuals)) +
  geom_point() +
  geom_smooth(method = "loess", color = "#EF5350", linewidth = 1.25) + # Add a Loess curve
  labs(x = "Function Factor Score at Year 1 Quintiles", y = "Martingale Residuals", title = "Martingale Residuals vs. Function Factor Score at Year 1") +
  scale_y_continuous(breaks = seq(-3.5, 2.0, by = 1.0), limits = c(-3.5, 2.0)) + # Adjust y-axis tick values
  theme_minimal(base_family = "Proxima Nova") +
  theme(plot.title = element_text(size = 18)) +
  customization_martingale

ggsave(here(martingale_residuals_plots_dir, "figure_19-3_function_factor_score_at_year_1_q5_vs_martingale_residuals.png"),
       plot = gg.function_factor_score_q5_at_year_1_vs_martingale,
       width = 10, # Specify the width in inches
       height = 6, # Specify the height in inches
       bg = "#FFFFFF",
       dpi = 300)

Interpretation

Again, the code and process are similar, but this time we're looking at the relationship between the quintiles of Function_Factor_Score_at_Year_1 and the Martingale residuals.

Interpreting Martingale Residual Plots: What to Look For

These plots help us assess whether our continuous covariates are appropriately incorporated into the model. Here's what to look for:

Non-Linearity:
- If the relationship between a covariate and the hazard is linear, the Martingale residuals should be randomly scattered around zero.
- A curved or non-random pattern in the loess smoother suggests that a transformation of the variable (e.g., taking the logarithm, adding a quadratic term) might improve the model fit.
Outliers:
- Extreme values of Martingale residuals (far above or below zero) can indicate outliers or influential observations. These points might warrant further investigation to ensure that they are not unduly influencing the model.
Overall Model Fit:
- While individual plots are helpful, it's also important to consider the overall pattern of residuals across all covariates. If multiple plots show non-random patterns, it might suggest a more fundamental issue with the model specification.

Moving Forward: From Diagnostics to Visualization

By examining Martingale residual plots, we gain valuable insights into the fit of our Cox regression models. These diagnostics provide a critical checkpoint, ensuring that our models are well-suited for the data and that our findings are reliable.

In the next section, we'll transition from model diagnostics to visualizing the results of our Cox models using Kaplan-Meier curves and other plots. This will allow us to communicate our findings effectively and explore the impact of depression on survival after TBI in greater detail.

5.9 Interpreting Martingale Residuals Plots

We've created Martingale residual plots for our three continuous covariates: age_at_injury (Figure 19-1), education_level_at_injury (Figure 19-2), and the quintiles of the func_score_at_year_1 (Figure 19-3). These plots are essential diagnostic tools for assessing whether our Cox regression model adequately captures the relationship between these predictors and the hazard of mortality. Specifically, they help us evaluate the assumption of linearity—whether a linear term for each variable is sufficient or if a transformation might be needed.

Recall: When examining Martingale residual plots, we're primarily looking for non-linear patterns (curves, trends) in the smoothed loess line. Random scatter around the zero line is expected and suggests a good fit.

Figure 19-1: Martingale Residuals vs. Age at Injury

Key Observations

Overall Trend: The loess smoother is relatively flat and close to zero across the entire age range, suggesting that a linear term for age_at_injury is likely appropriate in our model.
Slight Curvature: There's a hint of a downward curve at the extremes of the age range, but the deviation from linearity is minimal.
Scatter: The residuals show a fair amount of scatter, which is expected in survival data, but the scatter appears relatively random.

Implications

The plot provides evidence supporting the use of a linear term for age_at_injury in our Cox model.
The slight curvature could be investigated further, but it doesn't strongly suggest the need for a transformation.

Figure 19-2: Martingale Residuals vs. Education Level at Injury

Key Observations

Overall Trend: Similar to the age_at_injury plot, the loess smoother for education_level_at_injury is relatively flat and close to zero across most of the education range. This suggests that a linear term for education level might be sufficient in our model.
Possible Deviation at Lower End: There's a slight upward trend in the smoother at the lower end of the education range (around 0-5 years). However, there is also a scarcity of data points in this range, making it difficult to determine if this is a real trend.
Scatter: The scatter of residuals around the smoother appears to be relatively random, with no major deviations from the zero line.

Implications

The plot generally supports using a linear term for education_level_at_injury in our model.
The slight upward trend at the lower end could be explored further, although the sparsity of data in that range might make any transformation based on this trend unreliable.

Figure 19-3: Martingale Residuals vs. Function Factor Score at Year 1 Quintiles

Key Observations

Overall Trend: The loess smoother is almost perfectly flat and very close to zero across all quintiles of the func_score_at_year_1 variable. This strongly suggests that modeling this variable as a continuous variable using the quintiles is appropriate and that there are no issues with the functional form.
Scatter: There's some scatter in the residuals, as expected, but it appears relatively random and evenly distributed around the zero line.

Implications

Linearity in Categories: The plot suggests that the relationship between the hazard of mortality and functional status at Year 1, as measured by the function factor score quintiles, is adequately captured by our model.
Model Adequacy: The random scatter around zero provides further evidence that our model is fitting well with respect to this variable.

Overall Assessment for Age, Education, and Functional Status

The Martingale residual plots for age_at_injury, education_level_at_injury, and the quintiles of func_score_at_year_1 provide encouraging evidence that our Cox model is appropriately handling these important predictors. The smoothers are relatively flat for all three variables, and the residuals are scattered around zero without any strong non-linear patterns. This increases our confidence in the model's fit with respect to these key covariates.

Caveats

Visual Inspection: While these plots are informative, they are based on visual inspection. Formal statistical tests, such as those based on Schoenfeld residuals (which we covered previously), are also necessary to rigorously assess the proportional hazards assumption.
Other Covariates: These plots only examine three covariates in isolation. The relationships between these variables and the hazard might be influenced by other covariates in the model.

Moving Forward: Continuing Model Diagnostics

These Martingale residual plots provide valuable insights into the functional form of our continuous predictors. While we are making good progress in building our Cox regression models, it's essential to remember that model building is an iterative process. We might revisit these plots and consider transformations if other diagnostics suggest a need for model refinement.

By carefully examining these diagnostic plots, we're ensuring that our survival models are well-specified and that our findings are as reliable as possible. In the next blog post, we'll interpret the nested Cox regression models.

Conclusion

In this installment of our survival analysis series, we scrutinized the crucial assumptions underpinning our multivariable Cox regression model. This examination ensures that our analysis rests upon a solid foundation, paving the way for valid interpretation and trustworthy conclusions. Our comprehensive evaluation encompassed multicollinearity diagnostics, the proportional hazards assumption, and an assessment of model fit through residual analysis. Let's recap the key findings:

Absence of Multicollinearity: Through Variance Inflation Factors (VIF) and correlation matrices, we confirmed the absence of significant multicollinearity among our predictor variables. This assures us that the estimated effects of each predictor are independent and reliable.
Upholding the Proportional Hazards Assumption: Both formal statistical tests and a visual inspection of Schoenfeld residual plots lent strong support to the proportional hazards assumption. We can be confident that the effects of our predictors on the hazard of the outcome remain constant over the study's duration.
Confirmation of Model Fit: Analysis of Martingale residuals revealed no evidence of misspecified relationships between our continuous predictors and the survival outcome. This affirms the adequacy of our model's functional form.

These diagnostic steps have fortified the validity and interpretability of our findings, providing a robust framework for drawing meaningful insights.

Looking Ahead

With the foundational checks complete, we are poised to enter the final and most illuminating stage of our analysis: a deep dive into the results of our multivariable Cox regression model. We will interpret the hazard ratios, confidence intervals, and p-values for each predictor. This will allow us to quantify the impact of our variables of interest, with a particular focus on their clinical and practical implications. Ultimately, this detailed examination will enable us to address our central research question: How does depression at one year post-TBI influence the risk of all-cause mortality? The insights gleaned will contribute significantly to our understanding of the long-term consequences of TBI and the crucial role of mental health in patient outcomes.

Comments

Newer

Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 10)

Older

Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 8)