Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 6)
February 10, 2025
Research
Tutorials
Introduction
Welcome back to our journey into the world of survival analysis! We're moving beyond the initial stages of data cleaning and transformation, and into the crucial step of exploring our data through descriptive statistics. This phase is all about building a solid foundation for reproducible analysis, ensuring that our later survival models are built upon a bedrock of high-quality, well-understood data.
In this installment, we continue our exploration of the critical question: How do depression levels one year after a traumatic brain injury (TBI) influence all-cause mortality within the subsequent five years? The insights we uncover have the potential to inform interventions and improve patient care, making this not just an academic exercise but a journey with real-world implications.
This post will guide you through the essential steps required to create a robust and transparent analytical workflow. We'll cover:
3.1 Initial Setup and Library Loading
We'll show you how to create a clean and efficient R environment. This includes loading essential libraries like tidyverse, naniar, and gtsummary, setting up a structured directory system for managing data and outputs, and configuring plot aesthetics. These steps ensure a smooth and reproducible workflow.
3.2 Defining Covariates and Assigning Clear Labels
We'll walk through the process of defining our key variables of interest and assigning them clear, reader-friendly labels. This enhances the interpretability of our data and sets the stage for effective communication of our findings.
3.3 Creating a Complete-Case Sample
We'll create a subset of our data containing only participants with complete data on our key variables. While we acknowledge the limitations of listwise deletion, this complete-case sample provides a transparent and straightforward baseline for our initial analyses and descriptive statistics, allowing us to directly assess the impact of missing data.
3.4 Generating Descriptive Statistics Tables
We'll dive into the art of summarizing our data using the powerful gtsummary package. You'll learn how to create publication-ready tables that showcase the key characteristics of our study population, stratified by depression level at Year 1. These tables will provide crucial insights into the relationships between depression, demographics, injury characteristics, and other clinical factors.
3.5 Interpreting Descriptive Statistics Tables
We'll go beyond the numbers and explore what our descriptive statistics reveal about our data. We'll compare findings between the full analytic sample and the complete-case sample, highlighting potential biases and informing our modeling choices.
Why This Matters: More Than Just Housekeeping
These steps might appear to be mere "data housekeeping," but they are, in fact, the cornerstone of a successful survival analysis. A well-documented workflow offers several crucial benefits:
Enhanced Insights: Properly prepared data allows us to uncover patterns and relationships that might otherwise be obscured by missing values, inconsistencies, or poorly defined variables. These descriptive analyses provide a richer context for understanding our data before we even begin to model it.
Improved Reproducibility: A structured workflow, with clear documentation and well-organized code, makes it easy to replicate and validate our findings. This is essential for scientific rigor and ensures that others can build upon our work.
Streamlined Workflow: By organizing our workspace, defining variables clearly, and automating tasks, we save valuable time and reduce frustration during the later, more complex stages of analysis.
Actionable Results: Clean, well-understood data leads to models and visualizations that are more likely to yield actionable insights, ultimately guiding better decisions in patient care and policy.
Throughout this post, we'll provide detailed R code examples, accompanied by clear explanations of the "why" and "how" behind each step. Whether you're new to survival analysis or a seasoned data analyst, you'll find practical tips, tools, and strategies that you can apply to your own projects.
Let's dive in and build the foundation for an impactful survival analysis that can contribute to improving the lives of individuals with TBI!
3.1 Initial Setup and Library Loading
Introduction
This script establishes the foundational environment for data analysis by loading essential R libraries, setting up a structured directory system for data management, loading preprocessed data, and configuring table and plot aesthetics. These steps ensure a reproducible, organized, and visually consistent workflow.
Step 1: Equipping Ourselves - Loading Essential Libraries
Before we can start exploring our data, we need to ensure that we have the right tools at our disposal. We'll load a curated set of R libraries, each chosen for its specific role in data analysis, visualization, or reporting.
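A minimal sketch of this step (the package list matches the libraries described below; your own projects may need a different set):

```r
# Install pacman if it isn't already available, then use it to
# install (if needed) and load every required package in one call
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

pacman::p_load(
  extrafont,  # custom fonts for plots
  gt,         # publication-ready tables
  gtsummary,  # descriptive statistics tables
  here,       # reproducible, project-relative file paths
  naniar,     # missing-data analysis and visualization
  scales,     # axis scales and label formatting
  tidyverse   # dplyr, ggplot2, forcats, and friends
)
```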
Let's break down what's happening:
pacman: Our Package Manager: The pacman package simplifies the process of managing R packages. The code first checks if pacman is installed, and if not, it installs it.
Why It Matters: pacman streamlines our workflow by allowing us to install and load multiple packages with a single command (p_load). It also handles situations where a package is already installed, preventing unnecessary re-installations.
Our Arsenal of Libraries:
extrafont: This package allows us to customize our plots with specific fonts, giving our visualizations a polished and professional look.
gt and gtsummary: These packages are our tools for creating beautiful, publication-ready tables. They offer extensive customization options, making it easy to present our descriptive statistics in a clear and informative way.
here: This package is essential for creating reproducible file paths. It automatically detects the project's root directory, making our code portable across different computer environments.
naniar: This package specializes in working with missing data. We'll use it to analyze and visualize missingness patterns in our dataset.
scales: This package provides tools for customizing plot scales and labels, enhancing the clarity and readability of our visualizations.
tidyverse: This is a collection of essential R packages for data science, including dplyr (for data manipulation), ggplot2 (for data visualization), and many others. The tidyverse provides a cohesive and powerful framework for working with data in R.
Pro Tip: Using pacman::p_load is a best practice for managing package dependencies. It ensures that all necessary libraries are installed and loaded efficiently, saving you time and preventing potential errors.
Step 2: Building Our Home Base - Creating a Project Directory
A well-organized project directory is essential for keeping our files in order, ensuring reproducibility, and making collaboration easier. Let's create a clear structure for our project:
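A sketch of the directory setup (the directory names come from the structure described below; the object names are illustrative):

```r
library(here)

# Define project directories relative to the project root
processed_data_dir <- here("Data", "Processed")
tables_dir         <- here("Output", "Tables")
missingness_dir    <- here("Output", "Plots", "Missingness")

# Create each directory only if it doesn't already exist;
# recursive = TRUE also creates any missing parent directories
for (path in c(processed_data_dir, tables_dir, missingness_dir)) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
}
```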
What's happening here?
Defining Directories:
Data/Processed: This directory will house our preprocessed datasets, keeping them separate from the raw data.
Output/Tables: This directory will store our descriptive statistics tables.
Output/Plots/Missingness: This directory will store visualizations related to missing data patterns.
Automating Directory Creation:
here(): This function from the here package dynamically defines file paths relative to the project's root directory, ensuring portability.
dir.create(): This function creates the specified directories. The recursive = TRUE argument ensures that any necessary parent directories are also created. The if (!dir.exists(…)) checks prevent these directories from being recreated if they already exist.
Why It Matters
This structured approach eliminates confusion about file locations, ensures that outputs and intermediate datasets are systematically organized, and promotes reproducibility.
Step 3: Loading Our Preprocessed Data
Now that our environment is set up, let's load the preprocessed dataset that we've prepared in previous steps:
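A sketch of the loading step (the .rds file names are assumed to match the object names described below):

```r
# Load the preprocessed R objects saved in earlier installments
analytic_data_final <- readRDS(
  here("Data", "Processed", "analytic_data_final.rds")
)
na_counts_for_all_proposed_covariates <- readRDS(
  here("Data", "Processed", "na_counts_for_all_proposed_covariates.rds")
)
na_counts_for_select_covariates <- readRDS(
  here("Data", "Processed", "na_counts_for_select_covariates.rds")
)
```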
What's happening here?
readRDS(): This function reads R objects that were previously saved as .rds files. We're loading:
analytic_data_final: Our main dataset, which has undergone cleaning, transformation, and eligibility criteria application.
na_counts_for_all_proposed_covariates: A data frame containing missing value counts for all potential covariates.
na_counts_for_select_covariates: A data frame containing missing value counts for our selected set of covariates.
Why It Matters
These datasets are the result of our careful preprocessing efforts. They are now ready for exploration, visualization, and ultimately, survival modeling.
Using .rds files allows for efficient storage and retrieval of R objects, preserving all data structures, including factor levels, labels, and metadata.
Step 4: Polishing Our Tables and Plots - Configuring Aesthetics
To ensure that our tables and visualizations effectively communicate our findings, let's import some custom fonts:
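A sketch of the font setup (font_import() is shown commented out because it only needs to run once per machine and can take several minutes):

```r
library(extrafont)

# font_import()          # run once to register system fonts with R
loadfonts(quiet = TRUE)  # make imported fonts available this session
```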
What's happening here?
extrafont: This package allows us to use fonts beyond the standard R defaults.
loadfonts(): This function imports fonts installed on your system, making them available for use in R.
Why It Matters
Consistent aesthetics and enhanced readability make our tables and visualizations more professional and impactful.
Pro Tip: If you are sharing code with others, it is best to specify a font that is commonly available across systems.
The Big Picture: A Foundation for Discovery
These initial setup steps might seem like small details, but they are the cornerstone of a successful and reproducible analysis pipeline. By investing in this foundation, we ensure that:
Our workflow is efficient and organized.
Our project is reproducible.
Our data are readily accessible.
Our visualizations are polished and impactful.
Looking Ahead: Exploring and Visualizing Our Data
With our R environment configured and our data loaded, we're now ready to continue the exciting phase of exploratory data analysis. In the next sections, we will:
Prepare covariate sets for generating descriptive statistics tables.
Define preferred variable labels for clarity and consistency.
Generate comprehensive tables summarizing the key characteristics of our study population.
Each step builds upon this foundation, paving the way for survival models that will address our central research question.
3.2 Defining Covariates and Assigning Clear Labels
Introduction
We're now ready to continue our focus on exploratory data analysis, where we'll use descriptive statistics and visualization to delve into the characteristics of our study population and begin to uncover patterns in the data. But before we can start generating insightful tables and plots, we need to make sure our dataset is properly organized and that our variables are clearly defined.
In this section, we'll focus on two essential preparatory tasks:
Defining Our Covariates of Interest: We'll create specific lists of variables that will guide our exploratory analyses and inform our subsequent modeling choices.
Assigning Descriptive Variable Labels: We'll replace cryptic variable names with clear, reader-friendly labels that enhance the interpretability of our results.
Let's dive into how we accomplish these tasks.
Step 1: Defining Our Covariates of Interest
First, we need to explicitly define the variables that we'll be working with. We'll create two lists:
all_proposed_covariates: This is an exhaustive list of all potential predictor variables in our dataset that might be relevant to our research question. It includes a wide range of variables capturing demographic information, injury characteristics, functional status, and mental health history. Think of this as our initial long list of potential players for our analysis.
select_covariates: This is a more curated list, containing a subset of variables that we've deemed particularly important for our core research question or that are most suitable for initial exploration based on careful consideration of previous research and clinical knowledge. This is our starting lineup—the key players that we'll first focus on. It's important to note that this selection isn't set in stone; we refined it after our initial Cox regression analyses, as detailed below.
Here's how we define these lists in our R code:
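A sketch of these definitions (only a handful of variable names are shown; names not quoted elsewhere in this series are placeholders for the full lists used in the project):

```r
# Exhaustive list of candidate predictors (abbreviated here)
all_proposed_covariates <- c(
  "depression_level_at_year_1",
  "calendar_year_of_injury",
  "cause_of_injury",
  "employment_at_injury",
  "psych_hosp_hx",
  "gose_total_at_year_1",
  "event_status",
  "time_to_event_in_years"
  # ...plus the remaining demographic, injury, functional,
  # and mental health variables
)

# Curated subset for the primary analyses
select_covariates <- c(
  "depression_level_at_year_1",
  "gose_total_at_year_1",
  "event_status",
  "time_to_event_in_years"
  # ...plus the other selected covariates
)
```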
What's happening here?
We're creating two character vectors, all_proposed_covariates and select_covariates, that list the names of the variables that we'll be using. select_covariates is a subset of all_proposed_covariates.
Addressing Potential Overfitting
It's important to note that the select_covariates list was refined based on initial model diagnostics and concerns about potential overfitting. Overfitting occurs when a model is too complex relative to the amount of data, leading to poor generalization on new data.
One rule of thumb to mitigate overfitting is to have roughly 10-15 events (in our case, deaths) per predictor variable (or degree of freedom) in the model. Our initial 5-year dataset had approximately 4 events per degree of freedom (113 events and 26 df), falling short of this guideline.
To address this, we carefully considered the variables in our initial model and removed those that were deemed less critical or potentially redundant. This included:
calendar_year_of_injury: This variable might capture time trends that could be confounded with other factors.
psych_hosp_hx: This variable could be correlated with other mental health variables, leading to redundancy.
employment_at_injury
cause_of_injury: This variable, while potentially relevant, had many categories, increasing the degrees of freedom in our model and thus the risk of overfitting for this particular analysis.
By creating a more parsimonious model, we aim to improve its generalizability and robustness.
Why It Matters
Flexibility and Focus: Having both comprehensive and focused lists gives us flexibility. We can use all_proposed_covariates for broad exploratory analyses, generating hypotheses and examining a wide range of potential predictors. We can then use select_covariates for more targeted investigations related to our primary research question and for building our final survival models.
Organization and Clarity: Explicitly defining these lists makes our code more organized and easier to understand. It clearly signals which variables we're considering at each stage of the analysis.
Model Stability: The refined select_covariates list helps us build more stable and reliable survival models by reducing the risk of overfitting.
Step 2: Defining Preferred Variable Labels - Speaking a Common Language
Raw variable names are often cryptic and inconsistent. To make our data more accessible and interpretable, we'll assign clear, descriptive labels to our variables.
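A sketch of the label mapping (abbreviated; the labels shown for variables other than depression_level_at_year_1 are illustrative):

```r
# Named list: original variable names -> reader-friendly labels
var_name_mapping <- list(
  depression_level_at_year_1 = "Depression Level at Year 1",
  calendar_year_of_injury    = "Calendar Year of Injury",
  gose_total_at_year_1       = "GOSE Total Score at Year 1",
  time_to_event_in_years     = "Time to Event (Years)"
  # ...labels for the remaining covariates
)
```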
What's happening here?
var_name_mapping: We create a named list where the names are the original variable names in our dataset, and the values are the new, descriptive labels we want to assign. For example, we're mapping the variable depression_level_at_year_1 to the label "Depression Level at Year 1."
Why It Matters
Readability: Descriptive labels will make our tables, plots, and model outputs much easier to understand, especially for those who are not intimately familiar with the raw dataset.
Consistency: Using these labels ensures that our variables are consistently named throughout our analysis, reducing the risk of confusion.
Pro Tip: When creating labels, aim for a balance between brevity and informativeness. Choose labels that are both concise and easily understandable by a broad audience.
Step 3: Creating Data Frames for Analysis and Visualization
Before we can create our plots and tables, we will create two data frames tailored for these specific tasks:
analytic_data_for_tables_all: This data frame will include all variables in our all_proposed_covariates list, providing a comprehensive dataset for broad exploration.
analytic_data_for_tables_select: This data frame will include only the variables in our select_covariates list, offering a more focused dataset for targeted analyses related to our primary research question.
Here's how we create these data frames using R:
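A sketch of this step, assuming the objects defined above:

```r
library(tidyverse)  # dplyr and forcats

analytic_data_for_tables_all <- analytic_data_final |>
  # Convert NA values in the depression factor to an explicit level
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  ) |>
  select("id", all_of(all_proposed_covariates)) |>
  arrange(id)

analytic_data_for_tables_select <- analytic_data_final |>
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  ) |>
  select("id", all_of(select_covariates)) |>
  arrange(id)
```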
What's happening here?
analytic_data_for_tables_all:
We start with our analytic_data_final dataset (the result of all our previous preprocessing).
mutate(depression_level_at_year_1 = fct_na_value_to_level(depression_level_at_year_1, "Missing")): We take a variable called depression_level_at_year_1 and convert any missing values (NA) within it to a new level specifically labeled "Missing." This is performed using the fct_na_value_to_level function.
select("id", all_of(all_proposed_covariates)): We select only the id column and the columns listed in our all_proposed_covariates variable.
arrange(id): Finally, we sort the data by participant ID.
analytic_data_for_tables_select:
We create a similar data frame, but this time we select only the id column and the columns listed in our select_covariates variable, using select("id", all_of(select_covariates)).
We also use the fct_na_value_to_level function to convert missing values in depression_level_at_year_1 to a "Missing" level.
The data are also sorted by participant ID.
Why It Matters
Targeted Data Frames: We now have two data frames specifically designed for generating descriptive statistics tables. analytic_data_for_tables_all allows for a broad overview of all potential covariates, while analytic_data_for_tables_select focuses on the variables most relevant to our primary research question.
Handling Missing Data in Categorical Variables: By converting missing values in depression_level_at_year_1 to a distinct "Missing" level, we are preparing this variable for inclusion in our descriptive tables. This allows us to represent and analyze missingness within this key variable, rather than simply ignoring it.
Foundation for Exploration: These data frames will be the foundation for creating informative tables that summarize the characteristics of our study population, overall and stratified by depression level.
Conceptual Takeaways: Preparing for Insightful Exploration
These steps—defining our covariates, assigning clear labels, and creating tailored data frames—are essential for setting the stage for a robust and insightful exploratory data analysis.
Here's why this preparation is so critical:
Balancing Breadth and Focus: We've created both comprehensive and focused variable lists, allowing us to explore our data broadly while also maintaining a clear focus on our primary research question.
Model Stability: The refined select_covariates list helps us build more stable and reliable survival models by reducing the risk of overfitting.
Enhanced Communication: Clear and descriptive variable labels ensure that our findings will be accessible and interpretable by a wide audience.
Looking Ahead: Visualizing Missingness and Generating Descriptive Statistics
With our data frames prepared, we're now ready to continue our journey in exploratory data analysis. In the next section, we'll create comprehensive descriptive statistics tables that summarize the characteristics of our study population, overall and stratified by depression level at Year 1.
By combining careful data preparation with insightful visualizations and descriptive summaries, we're setting the stage for building robust survival models and uncovering meaningful insights into the relationship between depression and long-term survival after TBI.
3.3 Creating a Complete-Case Sample
Introduction
Before we dive into generating descriptive statistics and building our survival models, we need to address the issue of missing data. While more sophisticated methods like multiple imputation exist, for this analysis, we will focus on a simpler and more transparent approach: complete-case analysis, also known as listwise deletion.
This means that we'll be creating a subset of our data—a complete-case sample—that includes only those participants who have complete data on our key variables of interest. This sample will be used for generating our descriptive statistics tables and for our Cox regression models, providing a clear picture of the characteristics of participants with complete information.
Why Complete-Case Analysis (Listwise Deletion) in This Context?
While complete-case analysis has limitations (mainly a potential reduction in sample size and potential for bias if data are not missing completely at random), we are choosing this method here for several reasons:
Simplicity and Transparency: Complete-case analysis is straightforward to implement and understand. It involves simply removing any participant with missing values on the variables of interest. This transparency makes our analysis easier to interpret and reproduce. It also allows us to clearly see the impact of missing data on our sample size.
Consistency Across Analyses: By using the same complete-case sample for our descriptive tables and Cox regression models, we ensure that all analyses are based on the same group of participants. This makes our results directly comparable.
Foundation for Comparison: The complete-case analysis serves as an important baseline. While we will not perform multiple imputation in this blog series, we acknowledge that doing so is generally preferred. However, for illustrative purposes, presenting the complete-case analysis first provides a clear and simple starting point for understanding our data.
Important Note: We understand that listwise deletion can potentially reduce statistical power and may introduce bias if the data are not missing completely at random (MCAR). However, for the purposes of this blog series, we are prioritizing simplicity to illustrate the core concepts of survival analysis. A detailed exploration of multiple imputation will be the focus of a later blog series.
Step 1: Excluding Non-Essential Variables - Focusing on Key Predictors
Not all variables are equally important when defining our complete-case sample. For our descriptive statistics and our Cox regression models, we want to focus on the core set of predictor variables. Therefore, we'll exclude variables that are:
Not directly used in our descriptive tables or models: This helps to streamline the process and focus on the variables that matter most for these specific analyses.
Innately Incomplete Due to Study Design: Importantly, we will exclude time_to_censorship_in_years and time_to_expiration_in_years. These variables are, by definition, incomplete. Not all participants will have experienced the event (death), so some will have a time_to_censorship but no time_to_expiration, and vice versa. Instead, we will focus on the completeness of the combined variable, time_to_event_in_years, which captures each participant's time to either censorship or expiration, ensuring greater completeness. We will also exclude age_at_censorship and age_at_expiration for similar reasons.
Here's how we define the variables to exclude:
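A sketch of this definition (all four variable names appear in the discussion above):

```r
# Variables that are incomplete by design: each participant has
# either a censorship time or an expiration time, never both
variables_to_exclude <- c(
  "time_to_censorship_in_years",
  "time_to_expiration_in_years",
  "age_at_censorship",
  "age_at_expiration"
)
```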
We then use the setdiff function to create two new lists of variables that will be used when creating our complete-case datasets:
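A sketch, assuming the lists defined earlier:

```r
# Subtract the excluded variables from each covariate list
variables_for_cc_all    <- setdiff(all_proposed_covariates, variables_to_exclude)
variables_for_cc_select <- setdiff(select_covariates, variables_to_exclude)
```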
What's happening here?
We create a character vector called variables_to_exclude containing the names of variables we want to omit when checking for complete cases.
setdiff is used to create lists of variables for our two complete-case data frames, variables_for_cc_all and variables_for_cc_select, by subtracting the variables_to_exclude from all_proposed_covariates and select_covariates, respectively.
Why It Matters
Focus: By excluding these variables, we're focusing on the completeness of our core predictor variables and our key outcome variable, time_to_event_in_years.
Efficiency: This simplifies the process of identifying complete cases without affecting the integrity of our analysis for these specific tasks.
Step 2: Preparing Complete-Case Data Frames - Creating Subsets for Analysis
Now, we'll create two complete-case data frames based on our previously defined variable lists:
analytic_data_for_cc_all: Contains all variables in all_proposed_covariates (minus the excluded variables).
analytic_data_for_cc_select: Contains only the variables in select_covariates (minus the excluded variables).
What's happening here?
We're using select() to create subsets of our analytic_data_final data frame, including only the relevant variables for each complete-case analysis.
arrange(id) ensures that the data are sorted by participant ID, maintaining consistency across datasets.
Why It Matters
Tailored Datasets: We're creating data frames specifically designed for complete-case analysis, making our workflow more organized and efficient.
Foundation for Comparison: These data frames will serve as the basis for identifying complete cases in the next step.
Step 3: Isolating Complete Cases - The complete.cases Function
Now, we'll use the powerful complete.cases() function to identify and isolate the rows (participants) in our data frames that have no missing values across the selected variables.
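A sketch of this step:

```r
# complete.cases() returns TRUE for rows with no missing values;
# use it to subset each data frame to its complete cases
complete_cases_all <-
  analytic_data_for_cc_all[complete.cases(analytic_data_for_cc_all), ]

complete_cases_select <-
  analytic_data_for_cc_select[complete.cases(analytic_data_for_cc_select), ]
```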
What's happening here?
complete.cases(): This function checks each row of a data frame and returns TRUE if all columns in that row have non-missing values, and FALSE otherwise.
We apply complete.cases() to both analytic_data_for_cc_all and analytic_data_for_cc_select, creating two new data frames, complete_cases_all and complete_cases_select, that contain only the complete cases.
Why It Matters
Identifying Complete Cases: This step efficiently identifies the subset of participants for whom we have complete data on the variables of interest.
Creating Analysis-Ready Datasets: The resulting complete_cases_all and complete_cases_select data frames are now ready for generating descriptive statistics and for use in our Cox regression models.
Step 4: Enhancing the Complete-Case Data for Tables
For our descriptive statistics tables, we want to include some additional variables (like our time-to-event variables) that weren't used in defining complete cases. We'll add these variables back into our complete_cases_all and complete_cases_select datasets.
What's happening here?
We are going to merge our complete-case datasets with a subset of our main dataset (analytic_data_final) that only contains the id column and the variables that we want to add back in. We will then reorder the columns to ensure that the newly added variables are placed in a logical position within the data frame.
Why It Matters
Comprehensive Tables: This step ensures that our descriptive statistics tables include all relevant variables, even those not used in defining complete cases.
Contextual Information: Including variables like time_to_event_in_years in our tables provides important context when describing the characteristics of our complete-case sample.
Here's how we enhance both complete-case data frames:
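A sketch of the enhancement for the broader data frame (the same pattern, applied to complete_cases_select, yields complete_cases_for_tables_select):

```r
complete_cases_for_tables_all <- complete_cases_all |>
  # Re-attach the design-incomplete variables by participant ID
  left_join(
    analytic_data_final |>
      select(
        id,
        time_to_censorship_in_years, time_to_expiration_in_years,
        age_at_censorship, age_at_expiration
      ),
    by = "id"
  ) |>
  # Place the re-added variables after id, event_status, and
  # time_to_event_in_years, keeping the remaining columns in order
  select(
    id, event_status, time_to_event_in_years,
    time_to_censorship_in_years, time_to_expiration_in_years,
    age_at_censorship, age_at_expiration,
    everything()
  )
```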
What's happening here?
Joining Additional Variables:
left_join(…): We use left_join to merge our complete_cases_all (and complete_cases_select) data frame with a subset of analytic_data_final that contains the id column and the variables that we want to add back in (time_to_censorship_in_years, time_to_expiration_in_years, age_at_censorship, age_at_expiration). The merge is performed based on the common id column.
Reordering Columns:
select(…): We carefully reorder the columns to ensure that the newly added variables are placed in a logical position within the data frame. We place them after the first three columns (which are id, event_status, and time_to_event_in_years) and before the rest of the original columns.
Now, both complete_cases_for_tables_all and complete_cases_for_tables_select are ready for generating comprehensive descriptive statistics tables that include all relevant variables for describing our complete-case sample.
Step 5: Saving Complete-Case Data Frames for Future Use
Finally, we'll save our complete-case data frames to both .rds (for use in R) and .csv (for broader accessibility) files.
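A sketch of the save step (file names are illustrative):

```r
# .rds preserves factor levels and labels; .csv is portable
saveRDS(
  complete_cases_for_tables_all,
  here("Data", "Processed", "complete_cases_for_tables_all.rds")
)
write_csv(
  complete_cases_for_tables_all,
  here("Data", "Processed", "complete_cases_for_tables_all.csv")
)

saveRDS(
  complete_cases_for_tables_select,
  here("Data", "Processed", "complete_cases_for_tables_select.rds")
)
write_csv(
  complete_cases_for_tables_select,
  here("Data", "Processed", "complete_cases_for_tables_select.csv")
)
```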
Why It Matters
Reproducibility: Saving these data frames ensures that we can easily recreate our analyses and share our work with others.
Efficiency: We can load these data frames directly in future sessions, avoiding the need to repeat the complete-case selection process.
Key Takeaways: Building a Transparent Analysis
By creating these complete-case samples, we've taken a crucial step toward ensuring the robustness and transparency of our analysis. We've:
Defined Clear Criteria: We've established clear criteria for identifying participants with complete data on our key variables.
Created Tailored Datasets: We've generated data frames specifically designed for complete-case analysis, providing a solid foundation for our descriptive statistics and Cox regression models.
Prioritized Simplicity and Transparency: We've opted for a straightforward approach (listwise deletion) to enhance the interpretability and reproducibility of our findings.
Looking Ahead: Exploring Our Data Through Descriptive Statistics
With our complete-case datasets in hand, we're now ready to generate descriptive statistics tables. In the next section, we'll summarize the key characteristics of our study population, comparing the complete-case sample to the full analytic sample and exploring potential differences between participants with different levels of depression at Year 1. This crucial exploratory step will pave the way for a deeper understanding of our data and inform our subsequent survival modeling.
3.4 Generating Descriptive Statistics Tables
Introduction
We've prepared our data, and now comes the exciting part: exploring its characteristics through descriptive statistics tables! These tables will provide a comprehensive overview of our study population, summarizing key variables and revealing important patterns that will inform our survival models. We'll generate these descriptive statistics tables for both the full analytic sample and the complete-case sample (i.e., participants with no missing data on key variables) to assess the impact of missing data on the characteristics of the sample.
Think of this stage as getting to know the participants in our study. Who are they? What are their demographics, injury characteristics, and mental health histories? How do these characteristics differ across depression levels at Year 1? Descriptive statistics will help us answer these questions.
Why Descriptive Statistics Matter
Descriptive statistics tables are more than just lists of numbers. They provide a crucial foundation for understanding our data by:
Summarizing Key Variables: They provide a snapshot of the distribution of important variables like age, sex, functional independence scores, and mental health history.
Revealing Patterns and Trends: They help us identify potential relationships between variables and highlight differences between groups (e.g., participants with and without depression at Year 1).
Assessing Data Quality: They can reveal potential issues with our data, such as unexpected distributions or high levels of missingness.
Guiding Model Building: The insights gained from descriptive statistics inform the development of our survival models, helping us choose appropriate covariates and model specifications.
Creating Our Descriptive Tables: A Step-by-Step Guide
We'll be using the powerful gtsummary package in R to create our descriptive tables. gtsummary simplifies the process of generating beautiful, publication-ready tables with minimal code.
Step 1: Handling Missing Data in depression_level_at_year_1
Before we generate our tables, we need to make a decision about how to handle missing values in our key stratifying variable, depression_level_at_year_1. For the descriptive tables, we'll treat missing values as a separate category, allowing us to see the characteristics of participants for whom we don't have depression data.
Here is the code we will use to do this:
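A sketch (this repeats the transformation applied in Section 3.2, shown here for the table-ready data frame):

```r
# Convert NA values in the stratifying factor to a "Missing" level
analytic_data_for_tables_all <- analytic_data_for_tables_all |>
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  )
```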
What It Does
The fct_na_value_to_level() function from the forcats package takes any NA values in the depression_level_at_year_1 variable and converts them to a new factor level labeled "Missing."
Why It Matters
This ensures that participants with missing depression data are not excluded from our descriptive tables. Instead, they are included as a distinct category, allowing us to examine their characteristics and compare them to other groups.
Step 2: Generating Descriptive Tables for the Full Analytic Sample
Now, let's create our first descriptive table, summarizing the characteristics of our full analytic sample:
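A sketch of the table pipeline (the statistic and digits specifications, table title, and output file name are illustrative; the overall structure follows the breakdown below):

```r
table_full_sample <- analytic_data_for_tables_all |>
  select(-"id") |>
  tbl_summary(
    by = depression_level_at_year_1,
    type = list(
      calendar_year_of_injury ~ "continuous",
      gose_total_at_year_1    ~ "continuous"
    ),
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(
      all_continuous()  ~ 1,
      all_categorical() ~ c(0, 1)
    ),
    label = var_name_mapping
  ) |>
  add_overall() |>
  bold_labels() |>
  add_p() |>
  as_gt() |>
  tab_header(title = "Table 2-1. Characteristics of the Full Analytic Sample")

gtsave(table_full_sample, filename = here("Output", "Tables", "table_2_1.png"))
```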
What's happening here?
select(-"id")
: We remove theid
column, as it's not needed for our descriptive table.tbl_summary(…)
: This is the core function fromgtsummary
that generates the descriptive table.by = depression_level_at_year_1
: We stratify our table by depression level at Year 1, allowing us to compare characteristics across these groups.type = …
: We specify the data type for certain variables. Here, we indicate thatcalendar_year_of_injury
andgose_total_at_year_1
should be treated as continuous variables.statistic = …
: We define the summary statistics to display. For continuous variables, we use the median and interquartile range (IQR), which are appropriate for non-normally distributed data. For categorical variables, we show frequencies and percentages.digits = …
: We specify the number of decimal places to display for different variable types.label = var_name_mapping
: We use ourvar_name_mapping
list to apply descriptive labels to the variables in the table.add_overall()
: This adds a column summarizing the entire sample, without stratification.bold_labels()
: This bolds the variable labels in the table for better readability.add_p(…)
: This adds p-values for comparisons between depression levels, allowing us to assess the statistical significance of any observed differences.as_gt()
: This converts thegtsummary
table object to agt
table object, which offers more advanced formatting options.tab_header(…)
: This adds a title to our table.gtsave(…)
: This saves the table as a.png
file.
Why It Matters
Comprehensive Overview: This table provides a detailed overview of our full analytic sample, stratified by depression level.
Informs Hypothesis Generation: By examining differences between depression groups, we can start to generate hypotheses about the factors that might influence survival.
Step 3: Generating Descriptive Tables for the Complete-Case Sample
We repeat the process, this time using our complete_cases_for_tables_all dataset to create a table summarizing the characteristics of our complete-case sample:
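A sketch (the pipeline mirrors the one above, pointed at the complete-case data; the type and digits specifications are omitted here for brevity, and the title and file name are illustrative):

```r
table_complete_cases <- complete_cases_for_tables_all |>
  select(-"id") |>
  tbl_summary(
    by = depression_level_at_year_1,
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = var_name_mapping
  ) |>
  add_overall() |>
  bold_labels() |>
  add_p() |>
  as_gt() |>
  tab_header(title = "Table 2-2. Characteristics of the Complete-Case Sample")

gtsave(table_complete_cases, filename = here("Output", "Tables", "table_2_2.png"))
```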
Why It Matters
Assessing the Impact of Missingness: Comparing this table to the one generated from the full sample allows us to assess whether participants with complete data differ systematically from those with missing data. This helps us understand the potential impact of missingness on our findings.
Transparency: Presenting both tables provides a transparent view of our data and the potential limitations of using a complete-case approach.
Comparing Full and Complete Case Samples
By examining both the full analytic sample and the complete-case sample, we can:
Assess Potential Bias: We can observe whether the characteristics of participants with complete data differ systematically from the full dataset. This helps us understand potential biases introduced by listwise deletion.
Ensure Robustness: We can validate that key findings are consistent across both datasets, increasing our confidence in the results.
Pro Tips for Creating Effective Tables
Label Early and Often: Define your variable labels early in the preprocessing process (as we did with var_name_mapping) and apply them consistently throughout your analysis.
Automate Repetitive Tasks: Use functions or loops to streamline table generation, especially when creating multiple tables with similar structures.
Validate Your Tables: Always carefully check the output of your tables to ensure that all variables and categories are displayed as expected and that the numbers make sense.
Looking Ahead: Visualizing Our Data and Building Survival Models
With our descriptive statistics tables in hand, we're well-equipped to interpret the characteristics of our study population and begin formulating hypotheses about the relationship between depression and survival.
In the next blog section, we'll focus on interpreting the descriptive statistics tables that we just created. We'll carefully examine the characteristics of our study population, both overall and stratified by depression level at Year 1. This crucial step will involve:
Comparing Groups: We'll analyze differences in sociodemographics, mental health histories, and functional status between participants with different levels of depression at Year 1.
Assessing the Impact of Missingness: We'll compare the characteristics of the full analytic sample to the complete-case sample, helping us understand the potential impact of missing data on our findings.
Generating Hypotheses: The insights gained from these tables will inform our hypotheses about the relationship between depression and survival, guiding the development of our Cox regression models.
This in-depth exploration of our descriptive statistics will provide a crucial foundation for understanding our data and building robust survival models. We'll be transforming numbers into narratives, setting the stage for uncovering meaningful insights about the factors that influence long-term outcomes after TBI.
3.5 Interpreting Descriptive Statistics Tables
Introduction
We've reached a critical point in our survival analysis journey—interpreting our descriptive statistics tables. These tables, generated from our carefully prepared complete-case sample, provide a wealth of information about the characteristics of our study population. They offer a vital foundation for understanding the relationships between our key variables and for informing the development of our survival models.
In this section, we'll focus on interpreting Table 2-2, which summarizes the characteristics of our complete-case sample, stratified by depression level at Year 1. We'll also compare these findings to Table 2-1, which describes the full analytic sample, to assess the potential impact of missing data.
Table 2-2: A Snapshot of Our Complete-Case Sample

Table 2-2 provides a detailed overview of our complete-case sample (N = 1,549), broken down by depression level at Year 1: "No Depression," "Minor Depression," and "Major Depression." Let's examine some of the key findings:
Sociodemographic Characteristics:
Sex: The majority of participants in the complete-case sample are male (71%), which is consistent with the full analytic sample. However, the proportion of males is slightly lower in the major depression group (67%) compared to the other groups.
Age at Injury: The median age at injury is 36 years in the complete-case sample. The distribution of age at injury appears to vary across depression levels. Participants with minor depression tend to be slightly younger (median 32 years) compared to those with no depression (median 38) or major depression (median 37).
Educational Attainment: The median educational attainment is 12 years (equivalent to a high school diploma) across all depression levels. However, the interquartile range (IQR) suggests slightly less variability in educational attainment among those with major depression.
Medicaid Status: A statistically significant difference exists in Medicaid status across depression levels. Participants with major depression have a higher proportion enrolled in Medicaid (26%) compared to those with no depression (19%) or minor depression (23%).
Clinical and Functional Characteristics:
Mortality: Overall, 9.5% of participants in the complete-case sample died during the study period. There are no statistically significant differences in mortality rates across depression levels, though this could be related to the reduction in sample size with the complete-case sample.
Time to Event/Censorship/Expiration: The median time to event, time to censorship, and time to expiration are similar across all groups in the complete-case sample, as well as to the full analytic sample. However, the median age at expiration is notably lower in the major depression group (58 years) compared to the no depression (74 years) and minor depression (69 years) groups.
Function Factor Score at Year 1: This variable, reflecting functional independence, shows significant differences across depression levels. As expected, participants with major depression have a lower median score (indicating greater functional impairment) compared to those with minor or no depression. This pattern is consistent with what was observed in the full analytic sample.
Mental Health and Substance Use History:
History of Mental Health Treatment: Participants with major depression were more likely to report receiving mental health treatment within the year preceding their injury (18%) compared to those with no depression (7.2%) or minor depression (12%). A similar, though less pronounced, pattern is observed for mental health treatment received prior to the year preceding the injury.
History of Suicide Attempt: Participants with major or minor depression reported significantly higher rates of suicide attempts both prior to their injury and within the first year post-injury compared to those with no depression.
Problematic Substance Use at Injury: Individuals with major or minor depression had higher rates of problematic substance use at injury (61% and 60%, respectively) compared to those with no depression (52%).
Impact of Missingness: Comparing to the Full Analytic Sample (Table 2-1):
Reduced Sample Size: The complete-case sample (N = 1,549) is considerably smaller than the full analytic sample (N = 4,283), primarily due to missing data on mental health-related variables, including depression level at Year 1.
Potential for Bias: While the overall patterns are generally similar between the two samples, there is no longer a statistically significant difference in mortality rates across depression levels in the complete-case sample. Other differences in the magnitude of effects and p-values are also noted. This suggests that the missing data may not be completely random and that listwise deletion could have introduced some bias.
Lower Representation of Certain Groups: For instance, the proportion of the sample with problematic substance use at injury is slightly higher in the complete-case sample. This could indicate underlying differences between those with and without missing data that are important to keep in mind when interpreting the results of models fit to this sample.
Key Insights and Implications
Depression and Functioning: The descriptive statistics confirm the expected association between depression and functional status. Participants with greater depression severity tend to have lower functional independence scores.
Mental Health History: The data highlights the complex interplay between depression and other mental health factors. Individuals with major depression are more likely to have a history of mental health treatment and substance use issues. The association between depression and suicide attempts is also particularly strong.
Potential Confounders: The observed differences in demographic and clinical characteristics across depression levels suggest that these variables might be confounders in the relationship between depression and mortality. We'll need to account for these potential confounders in our survival models.
Limitations of Complete-Case Analysis: The comparison with the full analytic sample underscores the potential limitations of listwise deletion, particularly the reduction in sample size and the possibility of bias.
Moving Forward: Building Our Survival Models
These descriptive statistics tables have provided us with a foundation for understanding our study population and the potential relationships between depression, covariates, and survival. But numbers alone can only tell us so much.
In the next blog post, we'll bring our data to life through univariate and bivariate visualizations. We'll create plots that complement our descriptive tables, allowing us to:
Visualize distributions: We'll use histograms and boxplots to examine the distribution of key variables like age, functional scores, and time-to-event, both overall and stratified by depression level.
Explore relationships between variables: We'll create scatter plots and other visualizations to explore how different variables relate to each other, helping us identify potential confounders and interaction effects. For instance, we might plot the relationship between depression severity and functional independence scores, or between age and time-to-event.
Visualize survival patterns: We will use Kaplan-Meier curves to visualize the survival probabilities over time for different depression groups, providing an intuitive graphical representation of survival differences.
These visualizations will not only enhance our understanding of the data but also help us communicate our findings more effectively. They will also inform the development of our survival models in subsequent posts, providing a bridge between descriptive exploration and formal statistical modeling.
By combining the power of descriptive statistics with insightful visualizations, we're setting the stage for a more nuanced and impactful analysis of the relationship between depression and survival after TBI.
Conclusion
We've reached a pivotal point in our survival analysis journey! We've transitioned from the meticulous work of data cleaning and preparation to the exciting realm of data exploration. By generating and interpreting detailed descriptive statistics tables, we've gained a much richer understanding of our study population and the intricate relationships within our data. This isn't just about summarizing numbers; it's about unveiling the story hidden within our data, a story that will ultimately help us understand the impact of depression on long-term survival after TBI.
Reflecting on Our Progress: Building a Foundation for Discovery
Let's recap the crucial steps that have brought us to this point:
Establishing a Reproducible Workflow: We began by setting up a well-organized R environment, loading essential libraries, and creating a structured directory system. This ensures that our analysis is transparent, efficient, and easily replicable.
Defining and Refining Our Variables: We carefully curated our list of covariates, creating both comprehensive and focused sets for different stages of analysis. We also assigned clear, descriptive labels, ensuring that our data is readily interpretable.
Addressing Missing Data: We created complete-case samples, providing a transparent and straightforward way to explore our data without the complexities of imputation. This allowed us to directly assess the impact of missing data on our sample characteristics.
Generating Insightful Descriptive Tables: Using the gtsummary package, we created detailed tables summarizing the key demographic, injury-related, functional, and mental health characteristics of our study population, stratified by depression level at Year 1. These tables revealed important differences between groups and highlighted potential confounders to consider in our models.
Interpreting the Story in the Numbers: We went beyond simply reporting statistics; we analyzed the patterns in our tables, comparing the full and complete-case samples, and drawing initial insights into the potential relationship between depression and other key variables.
Why This Matters: Transforming Numbers into Narratives
These steps are far more than just technical exercises. They're about transforming raw data into a coherent narrative that we can understand and learn from. We've ensured that:
Our Data is High-Quality and Reliable: By addressing missingness and carefully defining our variables, we've built a foundation of trustworthy data.
We Have a Deeper Understanding of Our Population: Our descriptive tables have given us a nuanced view of the individuals in our study, their characteristics, and their experiences.
We're Ready to Ask More Complex Questions: The insights gained from our descriptive exploration have informed our hypotheses and prepared us for building sophisticated survival models.
Looking Ahead: Visualizing Patterns and Building Models
Our exploratory journey is far from over! In the next blog posts, we'll:
Bring Our Data to Life with Visualizations: We'll create a variety of plots, including histograms, scatter plots, and Kaplan-Meier curves, to visually explore distributions, relationships between variables, and survival patterns. These visualizations will complement our tables and provide a more intuitive understanding of our data.
Construct and Interpret Survival Models: Armed with the insights gained from our descriptive and visual exploration, we'll build Cox proportional hazards models to quantify the impact of depression on long-term survival, while controlling for other important factors.
We're now on the cusp of transforming our carefully prepared data into actionable insights that have the potential to improve clinical practice and enhance the lives of individuals recovering from TBI. The journey continues, and we're excited to share the next chapter with you.