Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 8)
February 24, 2025
Introduction
Welcome back to our exploration of survival analysis! We've journeyed through the intricacies of data preparation, exploration, and visualization. Now, in this eighth installment, we're shifting our focus to the crucial steps that ensure a smooth, efficient, and reproducible workflow as we move into building and refining our Cox regression models.
While these steps might seem like behind-the-scenes work, they are essential for setting the stage for impactful analysis. We'll be focusing on the foundational elements that underpin robust survival modeling, all while continuing to investigate our core research question: How do depression levels one year after a traumatic brain injury (TBI) influence long-term mortality?
This post will guide you through a series of essential tasks, providing you with the tools and techniques to create a well-oiled analytical machine:
5.1 Initial Setup and Library Loading
This section configures the R environment for this project. It installs and loads necessary libraries using the pacman package, establishes a structured directory system for data organization, loads preprocessed data for analysis, and customizes plot fonts. These steps lay a solid foundation for data analysis.
5.2 Customizing Plot Aesthetics
A custom theme is introduced to enhance plot aesthetics. This ensures consistency in titles, axes, legends, and text styling across all visualizations, improving interpretability.
5.3 Defining and Preparing Covariates
This section involves defining and organizing covariates for nested models in survival analysis. Three groups of covariates are specified: sociodemographic variables (e.g., age, sex, education, Medicaid status), mental health variables (e.g., mental health treatment history, suicide attempts, substance use), and functional variables (e.g., function factor score at year 1). These groups are combined into a comprehensive list of all proposed covariates for model building. Additionally, user-friendly labels are created for covariates and their factor levels to improve the clarity and interpretability of outputs such as tables and plots.
5.4 Univariate Cox Proportional Hazards Models
This section introduces univariate Cox proportional hazards models as an essential first step in our survival analysis. These models examine the individual effect of each covariate on the risk of the event of interest (i.e., death), providing initial insights into which factors—such as age at injury or depression level—significantly influence survival probability after moderate-to-severe traumatic brain injury (msTBI). Emphasis is placed on rounding and formatting results for clear interpretation. By isolating the impact of each variable, we identify promising predictors for subsequent multivariable modeling, laying the groundwork for a deeper understanding of the drivers of survival outcomes.
5.5 Multivariable Cox Proportional Hazards Models
To understand how multiple factors collectively impact survival following moderate-to-severe TBI, this section employs multivariable Cox proportional hazards models. Using a nested approach, we progressively incorporate sociodemographic, mental health, and functional variables, estimating their independent effects while controlling for potential confounders. Results are rounded, clearly formatted, and visualized with comparative tables. Hazard ratios (HRs) and their corresponding confidence intervals (CIs) are interpreted to highlight both statistical and clinical significance. This step-by-step process illuminates the sophisticated network of factors affecting survival and lays the groundwork for addressing multicollinearity in the upcoming blog post section.
5.1 Initial Setup and Library Loading
Initialization and Environment Setup
This script establishes the foundational environment for data analysis by loading essential R libraries, setting up a structured directory system for data management, loading preprocessed data, and configuring plot aesthetics. These steps ensure a reproducible, organized, and visually consistent workflow.
Library Management with pacman
The first step involves ensuring all necessary libraries are installed and loaded. The pacman
package simplifies this process, providing a unified interface for library management. If pacman
is not already installed, it can be installed using:
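A minimal one-time installation check might look like this:

```r
# Install pacman if it is not already available
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
```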
Subsequently, pacman::p_load
is used to load the following libraries:
autoReg: Simplifies regression modeling workflows by automating the generation of comprehensive and customizable summary tables.
broom: Bridges the gap between statistical modeling and tidy data principles, converting model outputs into tidy data frames for seamless analysis and visualization.
corrplot: Enhances data exploration with elegant visualization tools for correlation matrices, enabling intuitive analysis of variable relationships.
extrafont: Enables the import and management of custom fonts, enhancing the visual appeal of plots.
forestplot: Creates publication-ready forest plots, enabling clear visualization of estimates and confidence intervals for regression and meta-analysis results.
ggsurvfit: Extends ggplot2 for survival analysis, providing tools to craft customizable Kaplan-Meier plots and survival curves with ease.
grid: Empowers advanced graphics manipulation by enabling precise layout design and control of graphical elements in R plots.
gridExtra: Facilitates the arrangement of multiple grid-based plots, streamlining the creation of composite visualizations and complex layouts.
gt: Creates highly customizable and presentation-ready tables.
gtsummary: Simplifies the creation of publication-ready tables for summarizing data and regression model results.
here: Enables robust file path management, ensuring project portability across different computing environments.
patchwork: Simplifies the creation of multi-panel plots by seamlessly combining ggplot2 visualizations into cohesive layouts.
scales: Provides functions for customizing plot scales, allowing for fine-grained control over plot aesthetics.
showtext: Enhances text rendering in R graphics by enabling the use of custom fonts, improving aesthetic appeal and readability.
survival: Provides comprehensive tools for survival analysis, supporting Kaplan-Meier curves, Cox proportional hazards models, and more.
survminer: Streamlines the visualization of survival analysis results, offering intuitive tools for creating and customizing survival curves and diagnostic plots.
tidyverse: A collection of R packages designed for data manipulation, exploration, and visualization, promoting a streamlined and consistent workflow.
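The full set of packages above can be loaded in a single call:

```r
# Install (if needed) and load all required packages in one call
pacman::p_load(
  autoReg, broom, corrplot, extrafont, forestplot, ggsurvfit,
  grid, gridExtra, gt, gtsummary, here, patchwork, scales,
  showtext, survival, survminer, tidyverse
)
```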
Using pacman
guarantees that all library dependencies are correctly managed, minimizing issues arising from missing or incompatible packages.
Directory Structure for Organized Data Management
To maintain a well-organized project, the following directory structure is established:
Data/Processed: This directory stores the cleaned, transformed, and analysis-ready datasets.
Output/Tables: This directory stores the generated summary tables and regression results, providing a centralized location for easily accessible tabular outputs.
Output/Plots/Missingness: This directory stores visualizations of missing data patterns, aiding in the assessment and handling of missingness in datasets.
Output/Plots/Schoenfeld_Residuals: This directory stores Schoenfeld residual plots, enabling diagnostics for the proportional hazards assumption in Cox regression models.
Output/Plots/Martingale_Residuals: This directory stores Martingale residual plots, facilitating the evaluation of model fit and identification of outliers or nonlinearity.
Output/Plots/Multicollinearity: This directory stores visualizations of multicollinearity diagnostics, including correlation matrices and VIF results, to assess predictor relationships.
Output/Plots/Survival_Curves: This directory stores survival curve plots, illustrating survival probabilities over time for different covariates or groups.
Output/Plots/Forest_Plots: This directory stores forest plots, providing a clear summary of hazard ratios and confidence intervals for regression model covariates.
The here
package is employed to dynamically generate these directories. The dir.create
function, with recursive = TRUE
, ensures that all necessary subdirectories are created. This approach guarantees consistent file path construction across different operating systems, enhancing project reproducibility.
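A sketch of this setup, using the directory names listed above:

```r
# Define the project's directory structure relative to the project root
dirs <- c(
  here::here("Data", "Processed"),
  here::here("Output", "Tables"),
  here::here("Output", "Plots", "Missingness"),
  here::here("Output", "Plots", "Schoenfeld_Residuals"),
  here::here("Output", "Plots", "Martingale_Residuals"),
  here::here("Output", "Plots", "Multicollinearity"),
  here::here("Output", "Plots", "Survival_Curves"),
  here::here("Output", "Plots", "Forest_Plots")
)

# Create each directory (and any missing parents) if it does not exist
for (dir in dirs) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
}
```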
Loading Preprocessed Data
The core dataset, which has undergone prior cleaning and transformation and includes only the covariates selected for the Cox regression models, is loaded using the readRDS
function:
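A minimal sketch of this step, with an illustrative object and file name (the actual file name will depend on your project):

```r
# Load the analysis-ready dataset (object and file names are illustrative)
tbi_data <- readRDS(here::here("Data", "Processed", "tbi_data_processed.rds"))
```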
The .rds
format is efficient for storing and retrieving R objects, preserving data structure and metadata. This step ensures that the analysis-ready data are available for subsequent steps. It's recommended to verify that the data frame has loaded correctly using functions like str() or head().
Customizing Plot Aesthetics with Imported Fonts
To enhance the visual appeal and consistency of plots, system fonts are imported using:
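One way this might look, combining extrafont for font registration with showtext for rendering (the exact calls may differ in your setup, and font_import() can take several minutes on first run):

```r
# Import system fonts (one-time step; can be slow) and register them
extrafont::font_import(prompt = FALSE)
extrafont::loadfonts()

# Let showtext render the registered fonts in plots
showtext::showtext_auto()
```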
This makes a wider range of fonts available for use in R visualizations, enabling greater control over plot aesthetics and improving the overall quality of the figures.
Conclusion
By executing these initial steps, we establish a reproducible, organized, and visually consistent foundation for the subsequent data analysis, ensuring the integrity and efficiency of the workflow. This step lays the groundwork for a robust and insightful analysis.
5.2 Customizing Plot Aesthetics
Establishing a Unified Visual Style with Custom Plotting Preferences
This section focuses on establishing a consistent and visually appealing aesthetic for all plots generated throughout the analysis. By defining a custom theme, we aim to enhance the readability, professionalism, and overall impact of our visualizations. Predefining these plotting preferences streamlines the process of creating multiple plots and ensures a cohesive visual identity across the entire project.
Crafting a Custom Theme
The foundation of our visual style is a custom theme, designed to optimize the presentation of our data. This theme, named customization
, dictates the appearance of various plot elements, from titles and legends to axis labels and text. The customization
theme is defined as follows:
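Based on the components described below, the theme might be defined along these lines:

```r
# Custom ggplot2 theme applied to all plots in this project
customization <- theme(
  plot.title      = element_text(family = "Proxima Nova", face = "bold", size = 20),
  legend.title    = element_text(family = "Proxima Nova", face = "bold", size = 10),
  legend.text     = element_text(family = "Proxima Nova", size = 9.5),
  axis.title.x    = element_text(family = "Proxima Nova", face = "bold", size = 12,
                                 margin = margin(t = 10)),
  axis.title.y    = element_text(family = "Proxima Nova", face = "bold", size = 12,
                                 margin = margin(r = 10)),
  axis.text       = element_text(family = "Proxima Nova", size = 10),
  text            = element_text(family = "Proxima Nova"),
  legend.position = "top"
)
```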
Dissecting the Theme's Components
Each element within the customization
theme plays a specific role in shaping the overall look and feel of the plots. The title, for instance, is styled using element_text
to be rendered in a bold, 20-point "Proxima Nova" font. This choice of font, known for its clean and modern aesthetic, ensures that the title is both prominent and visually appealing. Similarly, the legend title is set in bold "Proxima Nova" with a font size of 10, while the legend text itself uses a slightly smaller font size of 9.5. This subtle difference in size creates a visual hierarchy within the legend, making it easier to distinguish between the title and the individual items.
Axis titles, both x and y, are also rendered in bold "Proxima Nova," but with a font size of 12. A margin is added around the axis titles—10 points of spacing above the x-axis title and 10 points to the right of the y-axis title—to enhance readability and prevent them from overlapping with other plot elements. The axis text, which includes the tick labels, is styled using a 10-point "Proxima Nova" font, ensuring consistency while maintaining a clear visual distinction from the axis titles.
The text
element within the theme sets the default font for all other text elements in the plot to "Proxima Nova," guaranteeing a uniform appearance across the entire visualization. Finally, the legend.position
is set to "top," positioning the legend above the plot. This placement helps to minimize visual clutter within the plot area, especially when dealing with complex visualizations or plots with a wide aspect ratio.
Advantages of a Unified Visual Style
Employing a custom theme offers several key advantages. First, it ensures consistency across all visualizations, creating a cohesive and professional look that is particularly important for reports, presentations, and publications. Second, the carefully chosen font sizes, spacing, and positioning enhance the readability of the plots, making them easier to interpret for a wider audience. Finally, defining the theme once streamlines the plotting process, eliminating the need to repeatedly specify styling options for each individual plot.
Applying the Custom Theme
Integrating this custom theme into your plots is straightforward. Simply add + customization
to the end of your ggplot
code:
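For example (the data and variable names here are illustrative):

```r
# Apply the custom theme to any ggplot by adding it as the final layer
ggplot(tbi_data, aes(x = age_at_injury)) +
  geom_histogram(bins = 30) +
  labs(title = "Age at Injury", x = "Age (years)", y = "Count") +
  customization
```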
This simple addition ensures that the plot adheres to the predefined styling preferences, guaranteeing a consistent and visually appealing presentation.
Practical Considerations
When working with custom fonts like "Proxima Nova," ensure that the font is installed on your system and properly registered in R using extrafont::loadfonts()
. Additionally, consider your target audience when selecting font sizes; for instance, slightly larger font sizes might be necessary for presentations to ensure visibility from a distance. If the legend interferes with the plot data in certain instances, consider adjusting its position dynamically using options like legend.position = "bottom"
or legend.position = "right"
as needed.
Conclusion
By establishing a custom theme, we've laid the groundwork for creating professional, visually consistent, and highly readable plots throughout the analysis. This proactive approach not only elevates the aesthetic quality of our visualizations but also streamlines the plotting process, making it a valuable investment in the overall workflow. This attention to visual detail ensures that the data are communicated effectively and engagingly.
5.3 Defining and Preparing Covariates
Introduction
This section outlines the selection, grouping, and labeling of key measures (covariates) used in our statistical models to analyze survival following traumatic brain injury (TBI). Properly defining covariates is crucial for a structured, interpretable analysis aligned with our research objectives.
Rationale for Covariate Definition
Covariates are variables that are hypothesized to influence the outcome of interest: survival time after TBI. Careful covariate selection and definition ensures:
Clarity and Consistency: Standardized definitions are used throughout the analysis.
Simplified Model Building: Related variables are grouped for efficient testing.
Enhanced Interpretability: Results are presented using clear, descriptive labels.
Strategic Covariate Grouping
To facilitate a stepwise approach to model building, covariates are organized into three logical groups based on their conceptual relationship to survival:
Sociodemographic Variables: Characteristics reflecting social or demographic factors, including age_at_injury, sex, education_level_at_injury, and rehab_payor_primary_type (as a proxy for Medicaid status).
Mental Health Variables: Indicators of psychological well-being, such as mental_health_tx_hx, suicide_attempt_hx, and problematic_substance_use_at_injury.
Functional Variables: Measures of functional status post-injury, represented by func_score_at_year_1_q5 (functional ability factor score one year post-injury, divided into quintiles).
The following R code defines these groups:
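A sketch of these group definitions (the vector names are illustrative; the covariate names come from the groups listed above):

```r
# Covariate groups for nested model building
sociodemographic_vars <- c(
  "age_at_injury", "sex", "education_level_at_injury",
  "rehab_payor_primary_type"
)

mental_health_vars <- c(
  "mental_health_tx_hx", "suicide_attempt_hx",
  "problematic_substance_use_at_injury"
)

functional_vars <- c("func_score_at_year_1_q5")

# Combined list of all proposed covariates
all_covariates <- c(sociodemographic_vars, mental_health_vars, functional_vars)
```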
Enhancing Interpretability with Descriptive Labels
To improve the readability of statistical outputs, we replace technical variable names with descriptive labels. For example, func_score_at_year_1_q5
becomes "Function Factor Score at Year 1 Quintiles."
These mappings are defined in R as follows:
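One way to define such a mapping (most of these labels are illustrative; only the func_score_at_year_1_q5 label is given in the text):

```r
# Descriptive labels for covariates, keyed by variable name
covariate_labels <- c(
  age_at_injury                       = "Age at Injury",
  sex                                 = "Sex",
  education_level_at_injury           = "Education Level at Injury",
  rehab_payor_primary_type            = "Primary Rehabilitation Payor (Medicaid Proxy)",
  mental_health_tx_hx                 = "Mental Health Treatment History",
  suicide_attempt_hx                  = "Suicide Attempt History",
  problematic_substance_use_at_injury = "Problematic Substance Use at Injury",
  func_score_at_year_1_q5             = "Function Factor Score at Year 1 Quintiles"
)
```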
These labels will be consistently applied across all tables, figures, and statistical summaries.
Best Practices for Covariate Management
Maintain Consistency: Employ the same labels throughout the analysis for clarity.
Align with Research Goals: Organize covariates to reflect your hypotheses and model building strategy.
Iterate and Refine: Review initial summaries and visualizations to ensure data quality and completeness before proceeding.
Next Steps
With covariates clearly defined and labeled, we can now explore their individual and combined influence on survival outcomes.
5.4 Univariate Cox Proportional Hazards Models
Introduction
This section details the construction of univariate Cox proportional hazards models—a crucial initial step in our survival analysis. By analyzing each covariate individually, we assess its independent association with survival time before proceeding to more complex multivariable models.
Rationale for a Univariate Approach
Isolate Potential Predictors: Identify individual covariates that may influence survival.
Prioritize Significant Variables: Focus on covariates with statistically significant associations for further investigation.
Validate Data Formatting: Ensure that each variable meets model assumptions and is appropriately formatted before inclusion in multivariable models.
Understanding the Cox Proportional Hazards Model
The Cox model estimates the hazard ratio (HR), which quantifies the effect of a covariate on the risk of the event of interest (e.g., mortality). An HR greater than 1 indicates an increased risk, while an HR less than 1 indicates a decreased risk associated with a one-unit increase in the covariate.
In this analysis, time_to_event_in_years
represents survival time, and event_status
indicates whether the event occurred (1 = event, 0 = censored).
Implementing Univariate Models in R
The following R code rounds and formats the numeric components of a Cox model summary to two decimal places for improved readability:
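A minimal sketch of such a helper, assuming it operates on a fitted coxph object:

```r
# Round the numeric components of a summary(coxph) object to two decimals
round_cox_summary <- function(cox_model) {
  cox_summary <- summary(cox_model)
  cox_summary$coefficients <- round(cox_summary$coefficients, 2)
  cox_summary$conf.int     <- round(cox_summary$conf.int, 2)
  cox_summary
}
```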
The following R code performs univariate Cox regression for a list of covariates, validates input data, handles errors, and saves the rounded summary of each model to the global environment:
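A sketch of this looping function, using the outcome variables named below (the prefix argument and exact validation logic are assumptions):

```r
# Fit a univariate Cox model for each covariate and store rounded summaries
univariate_cox_models <- function(data, covariates, prefix = "cox_uni") {
  for (covariate in covariates) {
    # Validate that the covariate exists in the data
    if (!covariate %in% names(data)) {
      warning("Covariate not found, skipping: ", covariate)
      next
    }
    tryCatch({
      formula <- as.formula(
        paste("Surv(time_to_event_in_years, event_status) ~", covariate)
      )
      model <- coxph(formula, data = data)
      # Save the rounded summary to the global environment
      assign(paste(prefix, covariate, sep = "_"),
             round_cox_summary(model), envir = .GlobalEnv)
    }, error = function(e) {
      warning("Model failed for ", covariate, ": ", conditionMessage(e))
    })
  }
}
```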
The following R code executes univariate Cox models for groups of covariates related to sociodemographics, mental health, and functional variables using the complete-case dataset:
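Something along these lines, assuming a complete-case data frame (here called complete_case_data) and the covariate group vectors defined earlier:

```r
# Run univariate models for each covariate group on the complete-case data
univariate_cox_models(complete_case_data, sociodemographic_vars)
univariate_cox_models(complete_case_data, mental_health_vars)
univariate_cox_models(complete_case_data, functional_vars)
```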
Code Functionality Breakdown
round_cox_summary(): This helper function formats the Cox model output by rounding numeric values to two decimal places for improved readability.
univariate_cox_models(): This function iterates through a list of covariates, fitting a separate Cox model for each using the formula Surv(time_to_event_in_years, event_status) ~ covariate.
Error Handling: A tryCatch block gracefully handles potential errors during model fitting, preventing the entire loop from halting.
Interpreting the Output
The script outputs the following for each covariate:
Hazard Ratio (HR): The estimated effect size of the covariate on survival.
Confidence Interval (CI): The likely range of the true HR.
p-value: Indicates the statistical significance of the association (e.g., p < 0.05 suggests a significant association).
Key Considerations
Data Type Verification: Ensure covariates are appropriately formatted (e.g., categorical, continuous) before model fitting.
Beyond Statistical Significance: A significant p-value does not automatically imply clinical or practical importance; consider the magnitude of the HR and its real-world implications.
Result Archiving: The
assign()
function saves each model's output for future reference or further analysis.
Moving Forward
The results of these univariate models help to inform the selection of promising covariates for inclusion in multivariable models. In the next section, we will construct nested multivariable models, strategically adding groups of covariates to refine our analysis and gain a more comprehensive understanding of survival.
By beginning with univariate models, we establish a crucial foundation for understanding the individual relationships between covariates and survival outcomes, setting the stage for more sophisticated analyses.
5.5 Multivariable Cox Proportional Hazards Models
Introduction
This section advances our analysis from individual covariate effects to a more comprehensive multivariable approach using Cox regression. By incorporating multiple covariates into a single model, we can estimate their independent contributions to survival while controlling for the influence of other factors. We will employ a nested modeling strategy to systematically assess the impact of adding successive groups of variables.
Advantages of Multivariable Modeling
Confounding Control: Adjusts for potential confounding factors, isolating the unique effect of each covariate.
Realistic Representation: Creates models that better reflect the complex interplay of factors influencing survival outcomes.
Interaction Exploration: Allows for the investigation of how groups of covariates jointly influence survival.
Our nested approach progressively adds covariate groups, enabling us to evaluate their incremental contribution to the model's explanatory power.
Constructing the Nested Model
The following R code builds our nested multivariable Cox regression models. Each model adds a new group of covariates to the preceding one:
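A sketch of the nested sequence, using the covariate groups from section 5.3 (the model names and data frame name are illustrative):

```r
# Model 1: sociodemographic covariates only
model_1 <- coxph(
  Surv(time_to_event_in_years, event_status) ~
    age_at_injury + sex + education_level_at_injury +
    rehab_payor_primary_type,
  data = complete_case_data
)

# Model 2: add the mental health covariates
model_2 <- update(
  model_1,
  . ~ . + mental_health_tx_hx + suicide_attempt_hx +
    problematic_substance_use_at_injury
)

# Model 3: add the functional covariate
model_3 <- update(model_2, . ~ . + func_score_at_year_1_q5)
```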
Summarizing and Formatting Model Results
We use a custom function, round_cox_summary()
, to neatly format the model summaries for improved readability:
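Applied to the nested models, this might look like:

```r
# Rounded summaries of each nested model (object names are illustrative)
model_1_summary <- round_cox_summary(model_1)
model_2_summary <- round_cox_summary(model_2)
model_3_summary <- round_cox_summary(model_3)
```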
Visualizing and Comparing Models with gtsummary
To facilitate model comparison, we present a combined table of all models using the gtsummary
package:
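One common pattern for this, assuming the nested models from above (the spanner labels are illustrative):

```r
# Build a regression table for each model, reporting hazard ratios
tbl_1 <- tbl_regression(model_1, exponentiate = TRUE)
tbl_2 <- tbl_regression(model_2, exponentiate = TRUE)
tbl_3 <- tbl_regression(model_3, exponentiate = TRUE)

# Merge the tables side by side for direct comparison
tbl_merge(
  tbls = list(tbl_1, tbl_2, tbl_3),
  tab_spanner = c("Model 1: Sociodemographic",
                  "Model 2: + Mental Health",
                  "Model 3: + Functional")
)
```

Setting exponentiate = TRUE reports hazard ratios rather than log-hazard coefficients, which matches the interpretation discussed below.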
Interpreting the Multivariable Model Results
Hazard Ratios (HRs): Each covariate's HR now represents its independent effect on survival after adjusting for all other variables in the model.
Confidence Intervals (CIs): Narrow CIs indicate more precise estimates. CIs that include 1 suggest that the covariate is not statistically significant in the presence of other variables.
Assessing Incremental Contributions: Comparing nested models reveals how much each group of covariates improves the model's ability to explain survival variation.
Key Considerations
Model Building Strategy: Starting with a key variable and incrementally adding complexity allows for a more nuanced understanding of covariate relationships.
Clinical and Statistical Significance: Statistical significance should be interpreted alongside practical implications and clinical relevance.
Model Diagnostics: Always verify that the proportional hazards assumption is met for each model (addressed in a later section).
Next Steps
Having constructed and interpreted our nested models, we now have a more comprehensive understanding of how different factors collectively influence survival after TBI. The next blog post will address multicollinearity, ensuring that our predictors are not excessively correlated, which can compromise model stability and interpretation.
By systematically building and carefully interpreting multivariable Cox models, we gain valuable and actionable insights into the web of factors driving survival outcomes in TBI research.
Conclusion
This post has provided a solid foundation for a comprehensive survival analysis within the TBIMS project. We began by establishing a reproducible R environment and carefully defining our covariates. We then systematically analyzed factors influencing survival after moderate-to-severe TBI, progressing from univariate to multivariable Cox proportional hazards models. This process allowed us to isolate the effects of individual predictors and understand their complex interplay, revealing insights with both statistical and clinical significance.
Reflecting on Our Progress: Key Accomplishments
Let's take a moment to reflect on the key steps we've conquered in this installment:
Streamlining Our R Environment: We mastered the initial setup, efficiently loading essential libraries and establishing a structured project directory. This might seem like a small detail, but it's the bedrock of a smooth and reproducible workflow, saving us time and headaches down the road.
Elevating Visual Consistency: We crafted a custom visual theme, ensuring that all of our plots will have a unified, professional look. This enhances the clarity and impact of our visualizations, making them more effective tools for communication.
Defining and Refining Our Covariates: We thoughtfully defined our covariates of interest, creating both comprehensive and focused lists for different stages of analysis. We also assigned clear, descriptive labels, ensuring that our variables are easily understood and consistently represented.
Building Univariate and Multivariable Models: We took our first steps into modeling, constructing both univariate and multivariable Cox regression models. This allowed us to explore the individual and combined effects of our predictors on survival, setting the stage for identifying key risk factors. We used a nested modeling approach to strategically evaluate the specific impact of groups of variables, such as our mental health-related variables.
Why This Matters: Building a Bridge from Data to Knowledge
These steps are far more than just technical procedures. They are about building a trustworthy bridge between our raw data and the insights we seek to uncover. We've ensured that:
Our Results Will Be Reliable: Thoughtful preparation, from organizing our environment to carefully defining our variables, minimizes errors and enhances the clarity and accuracy of our findings.
Our Workflow is Transparent and Reproducible: Clear documentation, a structured project directory, and well-commented code make it possible for others (or our future selves) to easily replicate and validate our analysis.
Our Findings Will Be Actionable: By building a solid foundation, we're setting the stage for uncovering patterns and relationships that can ultimately translate into real-world applications, potentially improving the lives of individuals with TBI.
Looking Ahead: Fine-Tuning Our Models and Unveiling the Story
Moving forward, our next steps will involve rigorous model validation. We will examine crucial assumptions underlying Cox regression, including the presence of multicollinearity and the validity of the proportional hazards assumption. Martingale residual diagnostics will be employed to assess model fit, and key findings will be illustrated through the visualization of survival curves. This thorough validation process is essential to ensuring the robustness and interpretability of our results as we continue to refine our understanding of post-TBI survival.