Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 5)
February 3, 2025
Research
Tutorials
Introduction
Welcome to the fifth installment in our comprehensive series on survival analysis! In this series, we're exploring the practical steps involved in analyzing real-world data to uncover meaningful insights about how depression levels one year after a traumatic brain injury (TBI) influence all-cause mortality within the subsequent five years. By leveraging the power of R, we're delving into the nuances of survival analysis while maintaining a focus on reproducibility and clarity.
This post shifts our focus toward the critical task of exploratory data analysis, with an emphasis on understanding and addressing missing data. Missingness is a common yet challenging issue in longitudinal studies, and the decisions we make in handling it can have a profound impact on the reliability of our findings. Through visualization and descriptive statistics, we aim to uncover patterns in missingness and lay a solid foundation for imputing values or justifying exclusions.
Here's what we'll cover in this post:
2.1 Initial Setup and Library Loading
We'll set the stage by loading essential R libraries, organizing our project directory, and ensuring our environment is equipped for reproducible and efficient analysis.
2.2 Defining Covariates and Assigning Clear Labels
Next, we'll prepare the data by defining comprehensive and focused lists of covariates. This step involves assigning meaningful labels to variables, balancing breadth and focus to enable both exploratory and targeted analyses.
2.3 Crafting a Custom Theme
As we transition to visualization, we'll define a custom theme to ensure all plots have a consistent and professional appearance. Thoughtful aesthetics enhance the interpretability and impact of our findings.
2.4 Defining Helper Functions for Plotting Missingness
We'll introduce helper functions for handling missing data visualizations, streamlining the process and ensuring clarity in how we explore patterns of missingness.
2.5 Visualizing Missingness Counts
Using bar plots, we'll visualize the extent of missingness across variables, identifying key areas that require attention and providing an at-a-glance summary of data quality.
2.6 Visualizing Missingness Patterns
Finally, we'll dive into UpSet plots—a powerful tool for visualizing intersections of missing data. By examining these patterns, we'll uncover relationships between variables and guide imputation strategies.
Why This Matters: The Role of Exploratory Missing Data Analysis
Handling missing data is more than a technical hurdle—it's a critical step in ensuring the validity of our analysis. Through a combination of clear visualizations and targeted exploration, we can:
Identify Biases: Understand how and why data might be missing, and consider its implications on the representativeness of our findings.
Guide Imputation Strategies: Determine whether missingness is random or systematic, informing the choice of imputation methods.
Ensure Robust Models: Lay the groundwork for reliable survival models by proactively addressing gaps in our data.
Throughout this post, you'll find step-by-step R code, intuitive explanations, and practical insights to enhance your own survival analysis projects. Whether you're grappling with a similar dataset or exploring this field for the first time, these techniques will help you navigate the complexities of missing data with confidence.
2.1 Initial Setup and Library Loading
Introduction
This script establishes the foundational environment for data analysis by loading essential R libraries, setting up a structured directory system for data management, loading preprocessed data, and configuring the study timeline. These steps ensure a reproducible, organized, and visually consistent workflow.
Step 1: Equipping Ourselves - Loading Essential Libraries
Before we can start analyzing data, we need to ensure that we have the right tools at our disposal. We'll load a curated set of R libraries, each chosen for its specific role in data preprocessing, analysis, or visualization.
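Before breaking it down, here is a minimal sketch of what this setup step can look like; the exact call in the original analysis may differ slightly, but the package list mirrors the libraries described below:

```r
# Install pacman if it is not already available, then load every package we need.
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

pacman::p_load(
  ComplexUpset, # advanced UpSet plots
  extrafont,    # custom fonts for plots
  here,         # reproducible, project-relative file paths
  naniar,       # missing-data summaries and visualizations
  scales,       # axis scales and label formatting
  tidyverse     # dplyr, ggplot2, and friends
)
```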
Let's break down what's happening:
`pacman`: Our Package Manager: The `pacman` package simplifies the process of managing R packages. The first two lines of code check if `pacman` is installed and, if it is not, install it. Why It Matters: `pacman` streamlines our workflow by allowing us to install and load multiple packages with a single command (`p_load`). It also handles situations where a package is already installed, preventing unnecessary re-installations.
Our Arsenal of Libraries:
`ComplexUpset`: This package will help us create advanced UpSet plots, a powerful visualization technique for exploring complex patterns of missing data in our dataset.
`extrafont`: This package allows us to customize our plots with specific fonts, giving our visualizations a polished and professional look.
`here`: This package is essential for creating reproducible file paths. It automatically detects the project's root directory, making our code portable across different computer environments.
`naniar`: This package is specifically designed for working with missing data. We'll use it to analyze and visualize missingness patterns in our dataset.
`scales`: This package provides tools for customizing plot scales and labels, enhancing the clarity and readability of our visualizations.
`tidyverse`: This is a collection of essential R packages for data science, including `dplyr` (for data manipulation), `ggplot2` (for data visualization), and many others. The `tidyverse` provides a cohesive and powerful framework for working with data in R.
Pro Tip: Using `pacman::p_load` is highly recommended for managing package dependencies. It ensures that all required libraries are installed and loaded efficiently, saving you time and preventing potential errors.
Step 2: Building Our Home Base - Creating a Project Directory
A well-organized project directory is essential for managing our files, ensuring reproducibility, and collaborating effectively. Let's create a clear structure for our project:
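A minimal sketch of this step is shown below; the object names (`processed_data_dir`, `missingness_plots_dir`) follow the directories described next and are assumptions about the original code:

```r
# Define project-relative paths and create the directories if they do not exist.
processed_data_dir    <- here::here("Data", "Processed")
missingness_plots_dir <- here::here("Output", "Plots", "Missingness")

for (dir_path in c(processed_data_dir, missingness_plots_dir)) {
  if (!dir.exists(dir_path)) {
    dir.create(dir_path, recursive = TRUE)
  }
}
```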
What's happening here?
Defining Directories:
`Data/Processed`: This directory will house our preprocessed datasets, keeping them separate from the raw data.
`Output/Plots/Missingness`: This directory will store visualizations specifically related to missing data patterns.
Automating Directory Creation:
`here()`: This function from the `here` package dynamically defines file paths relative to the project's root directory. This makes our code more portable, as it will work correctly even if the project is moved to a different location.
`dir.create()`: This function creates the specified directories. The `recursive = TRUE` argument ensures that any necessary parent directories are also created. The `if (!dir.exists(…))` checks ensure that we don't accidentally recreate existing directories.
Why It Matters
This structured approach eliminates confusion about file locations, ensures that outputs and intermediate datasets are systematically organized, and promotes reproducibility.
Step 3: Loading Our Preprocessed Data
Now that our environment is set up, let's load the preprocessed dataset that we've been working with throughout this series:
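A sketch of the loading step, assuming the preprocessed data were saved as `analytic_data_final.rds` in the processed-data directory (the exact file name is an assumption):

```r
# Load the preprocessed analytic dataset saved in an earlier part of the series.
analytic_data_final <- readRDS(
  here::here("Data", "Processed", "analytic_data_final.rds")  # file name assumed
)
```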
What's happening here?
`readRDS()`: This function loads an R object that was previously saved as an `.rds` file. We're loading our `analytic_data_final` dataset, which contains the results of all of our previous preprocessing steps.
Why It Matters
This dataset is now ready for the next stages of our analysis: descriptive exploration, visualization, and eventually, survival modeling.
Using `.rds` files is an efficient way to store and retrieve R objects, preserving all data structures, including factor levels, labels, and metadata.
Step 4: Polishing Our Visualizations - Configuring Plot Aesthetics
To ensure that our visualizations effectively communicate our findings, let's import some custom fonts to give them a polished look:
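A sketch of the font configuration, assuming fonts were already imported once with `extrafont::font_import()` (a slow, one-time step):

```r
library(extrafont)

# Register the previously imported fonts with R's graphics devices.
loadfonts(quiet = TRUE)
```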
What's happening here?
Consistent branding and enhanced readability can make your visualizations more impactful and professional.
Pro Tip: Test font availability on different systems to avoid discrepancies when collaborating or sharing code. If you are sharing code with others, it is best to specify a font that is commonly available across systems.
The Big Picture: A Solid Foundation for Success
These initial setup steps are more than just technicalities; they establish a reproducible workflow, ensure that our project is well-organized, and equip us with the tools needed to handle complex datasets and analyses.
Looking Ahead: Exploring and Visualizing Our Data
With our R environment configured and our dataset loaded, we're now ready to delve into the exciting world of descriptive statistics and data visualization! In the next sections, we'll create comprehensive tables and insightful plots, uncovering key trends and relationships within our data. This exploratory phase will set the stage for building our robust survival models.
This foundational setup might seem like a small step, but it's the linchpin of a successful analysis pipeline. A solid foundation ensures that each subsequent step builds seamlessly on the last, culminating in meaningful insights and actionable results.
2.2 Defining Covariates and Assigning Clear Labels
Introduction
We're now ready to prepare our data for exploratory analysis and missing data visualization. This crucial step lays the groundwork for understanding the characteristics of our study population and identifying potential patterns in our data, ultimately informing our survival models. To do this effectively, we need to:
Define our covariates of interest.
Assign clear and descriptive labels to our variables.
Let's dive into how we accomplish these tasks.
Step 1: Defining Our Covariates of Interest
First, we need to explicitly define the variables that we'll be focusing on in our analyses. We'll create two lists:
`all_proposed_covariates`: This is an exhaustive list of all potential predictor variables in our dataset that might be relevant to our research question. It includes a wide range of variables capturing demographic information, injury characteristics, functional status, and mental health history. Think of this as our initial long list of potential players for our analysis.
`select_covariates`: This is a more curated list, containing a subset of variables that we've deemed particularly important for our core research question or that are most suitable for initial exploration based on careful consideration of previous research and clinical knowledge. This is our starting lineup—the key players that we'll first focus on. It's important to note that this selection isn't set in stone; we refined it after our initial Cox regression analyses, as detailed below.
Here's the code:
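The original code block is not reproduced here, but a sketch of its structure is below. The variable names are abridged and partly illustrative; only those mentioned elsewhere in this post (such as `depression_level_at_year_1`, `calendar_year_of_injury`, `psych_hosp_hx`, `employment_at_injury`, and `cause_of_injury`) come directly from the analysis:

```r
# Broad list of candidate predictors (abridged, partly illustrative names).
all_proposed_covariates <- c(
  "age_at_injury", "sex", "education_at_injury", "medicaid_status",
  "employment_at_injury", "cause_of_injury", "calendar_year_of_injury",
  "psych_hosp_hx", "mental_health_tx_hx", "suicide_attempt_hx",
  "problematic_substance_use_at_injury",
  "function_factor_score_at_year_1", "depression_level_at_year_1"
)

# Focused list used for the primary Cox models (a subset of the list above).
select_covariates <- c(
  "age_at_injury", "sex", "education_at_injury", "medicaid_status",
  "mental_health_tx_hx", "suicide_attempt_hx",
  "problematic_substance_use_at_injury",
  "function_factor_score_at_year_1", "depression_level_at_year_1"
)
```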
What's happening here?
We're creating two character vectors, `all_proposed_covariates` and `select_covariates`, that list the names of the variables that we'll be using. `select_covariates` is a subset of `all_proposed_covariates`.
Why It Matters
Flexibility and Focus: Having both comprehensive and focused lists gives us flexibility. We can use `all_proposed_covariates` for broad exploratory analyses, generating hypotheses and examining a wide range of potential predictors. We can then use `select_covariates` for more targeted investigations related to our primary research question.
Organization and Clarity: Explicitly defining these lists makes our code more organized and easier to understand. It clearly signals which variables we're considering at each stage of the analysis.
Addressing Potential Overfitting
It's important to note that the `select_covariates` list was refined after our initial Cox regression analyses. We were mindful of the potential for overfitting, which can occur when a model is too complex relative to the amount of data available. Overfit models tend to perform well on the training data but poorly on new, unseen data.
One rule of thumb to mitigate overfitting is to have roughly 10-15 events (in our case, deaths) per predictor variable (or degree of freedom) in the model. Our initial 5-year dataset had approximately 4 events per degree of freedom (113 events and 26 df), falling short of this guideline.
To address this, we carefully considered the variables in our initial model and removed those that were deemed less critical or potentially redundant. This included:
`calendar_year_of_injury`: While potentially relevant, this variable might capture secular trends that could be confounded with other factors.
`psych_hosp_hx`: This variable, while important, might be correlated with other mental health variables, leading to redundancy.
`employment_at_injury`
`cause_of_injury`: This variable, while potentially relevant, might introduce too many categories (and thus degrees of freedom) relative to the number of events, increasing the risk of overfitting for this particular analysis. We did, however, retain it in the `all_proposed_covariates` list for use in descriptive statistics tables and data visualizations.
By trimming down our variable list, we aimed to create a more parsimonious and robust model, improving its generalizability and reducing the risk of overfitting.
Pro Tip: Model diagnostics and careful consideration of the balance between model complexity and the available data are crucial for avoiding overfitting. It's often an iterative process, requiring adjustments and refinements as you explore your data and build your models.
Step 2: Defining Preferred Variable Labels - Speaking a Common Language
Raw variable names can often be cryptic or inconsistent. To make our data more user-friendly and our results more interpretable, we'll assign clear, descriptive labels to our variables.
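A sketch of the mapping is below; only a handful of entries are shown, and every variable name other than `depression_level_at_year_1` is illustrative:

```r
# Map raw variable names to the descriptive labels used in tables and plots.
var_name_mapping <- c(
  depression_level_at_year_1          = "Depression Level at Year 1",
  suicide_attempt_hx                  = "History of Suicide Attempt",
  mental_health_tx_hx                 = "History of Mental Health Treatment",
  problematic_substance_use_at_injury = "Problematic Substance Use at Injury",
  function_factor_score_at_year_1     = "Function Factor Score at Year 1",
  age_at_injury                       = "Age at Injury",
  sex                                 = "Sex"
)
```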
What's happening here?
`var_name_mapping`: We create a named list where the names are the original variable names in our dataset, and the values are the new, descriptive labels we want to assign. For example, we're mapping the variable `depression_level_at_year_1` to the label "Depression Level at Year 1."
Why It Matters
Clarity and Interpretability: Descriptive labels make our results much easier to understand, especially for those who are not familiar with the technical details of the dataset.
Consistency: Using these labels ensures that our variables are consistently named across all tables, plots, and reports, making our work more professional and easier to follow.
Pro Tip: When creating labels, strive for a balance between brevity and informativeness. Choose labels that are both concise and easily understandable by a broad audience.
Step 3: Creating Data Frames for Analysis and Visualization
Now, we'll create specific data frames tailored for different aspects of our exploratory analysis:
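A sketch of this step, assuming the tidyverse was loaded during setup; the ID and survival variable names inside `variables_to_exclude_from_plots` are illustrative:

```r
# Data frames for analysis (broad and focused covariate sets).
analytic_all_proposed_covariates <- analytic_data_final |>
  select(any_of(c("id", "time_to_event", "event_status", all_proposed_covariates)))

analytic_select_covariates <- analytic_data_final |>
  select(any_of(c("id", "time_to_event", "event_status", select_covariates)))

# Variables that are not informative in missing-data plots.
variables_to_exclude_from_plots <- c("id", "event_status", "time_to_event")

# Plot-specific data frames with those variables removed.
all_proposed_covariates_for_plots <- analytic_all_proposed_covariates |>
  select(-any_of(variables_to_exclude_from_plots))

select_covariates_for_plots <- analytic_select_covariates |>
  select(-any_of(variables_to_exclude_from_plots))
```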
What's happening here?
Creating Data Frames for Analysis:
`analytic_all_proposed_covariates`: This data frame will contain all of the variables listed in `all_proposed_covariates`, providing a dataset for broad exploration.
`analytic_select_covariates`: This data frame will contain only the variables in `select_covariates`, providing a more focused dataset for targeted analyses and our primary survival models.
Creating Data Frames for Visualization:
`variables_to_exclude_from_plots`: We define a list of variables that we generally don't want to include in our missing data visualizations (e.g., ID variables, event status, time-to-event).
`all_proposed_covariates_for_plots` and `select_covariates_for_plots`: We create two additional data frames, based on `all_proposed_covariates` and `select_covariates`, but with the `variables_to_exclude_from_plots` removed. These will be used specifically for our missing data visualizations.
Why It Matters
Tailored Datasets: We're creating data frames that are specifically designed for different analytical tasks. This keeps our workflow organized and efficient.
Optimized Visualizations: By excluding variables that are not informative for missing data visualizations (like ID numbers or time variables), we ensure that our plots are clear, focused, and easy to interpret.
Conceptual Takeaways: Preparing for Insightful Exploration
These steps—defining our covariates, assigning clear labels, and creating tailored data frames—are essential for setting the stage for a robust and insightful exploratory data analysis.
Here's why this preparation is so critical:
Balancing Breadth and Focus: We've created both comprehensive and focused variable lists, allowing us to explore our data broadly while also maintaining a clear focus on our primary research question.
Addressing Overfitting: We've taken proactive steps to mitigate the risk of overfitting in our survival models by carefully selecting the variables in `select_covariates`.
Enhancing Communication: Clear and descriptive variable labels ensure that our findings will be accessible and interpretable by a wide audience.
Looking Ahead: Visualizing Missingness and Summarizing Our Data
With our data frames prepared, we're now ready to embark on the exciting phase of exploratory data analysis! In the next sections, we'll:
Visualize Missing Data Patterns: We'll use specialized tools to examine the patterns of missingness in our dataset, helping us understand the potential impact of missing data on our analysis and choose appropriate imputation strategies.
Generate Descriptive Statistics: We'll create comprehensive tables that summarize the key characteristics of our study population, providing a detailed overview of our data.
By combining careful data preparation with insightful visualizations, we're setting the stage for building robust survival models and uncovering meaningful insights into the factors influencing long-term outcomes after TBI.
2.3 Crafting a Custom Theme
Introduction
As we prepare to delve into data visualization—particularly for exploring missing data patterns—it's essential to think about the aesthetics of our plots. A consistent and well-designed visual style not only makes our results more appealing but also enhances their interpretability and impact. In this section, we'll define a custom theme that will ensure our plots are both informative and visually engaging.
Think of this as choosing the right font, colors, and layout for a presentation. Just as a well-designed presentation can captivate an audience, a well-crafted visualization can make complex data more accessible and understandable.
Step 1: Defining a Custom Theme - The `customization` Object
Let's start by creating a custom theme object called `customization`. This object will store all of our aesthetic preferences, which we can then apply to our plots.
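A sketch of the theme object, with values mirroring the settings described below; setting the font family on the global `text` element is one way to apply it and is an assumption about the original code:

```r
customization <- theme(
  text            = element_text(family = "Proxima Nova"),
  title           = element_text(face = "bold", size = 20),
  axis.title.x    = element_text(face = "bold", size = 12, margin = margin(t = 10)),
  axis.title.y    = element_text(face = "bold", size = 12, margin = margin(r = 10)),
  axis.text       = element_text(size = 10),
  legend.title    = element_text(face = "bold", size = 10),
  legend.text     = element_text(size = 9.5),
  legend.position = "top"
)
```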
What's happening here?
We're using the `theme()` function from the `ggplot2` package to define various aspects of our plot's appearance:
Font Choice:
We've selected "Proxima Nova" as our primary font. It's a modern, clean, and highly readable font, making our plots visually appealing and easy to understand. (If this font is not available on your system, you can replace it with a similar sans-serif font like "Arial" or "Helvetica.")
Title Styling:
`title = element_text(…)`: We're making our plot titles bold and setting their font size to 20, ensuring they stand out.
Axis Labels:
`axis.title.x = element_text(…)` and `axis.title.y = element_text(…)`: We're making our x- and y-axis labels bold with a font size of 12. We've also added margins (`margin(t = 10)` and `margin(r = 10)`) to create some space between the labels and the axis lines, improving readability.
Axis Text:
`axis.text = element_text(…)`: We're setting the font size of the tick labels on our axes to 10.
Legend Formatting:
`legend.title = element_text(…)`: We're making the legend title bold with a font size of 10.
`legend.text = element_text(…)`: We're setting the font size of the legend text to 9.5 for better readability.
`legend.position = "top"`: We're placing the legend at the top of the plot. This is often a good choice when dealing with plots that have many elements, as it helps to avoid visual clutter.
Why It Matters
Consistency: Applying a custom theme ensures that all our plots have a consistent look and feel, making our work more professional and easier to follow.
Enhanced Interpretability: Clear, readable fonts, well-placed legends, and appropriate spacing make it easier for our audience to grasp the key insights from our visualizations.
Accessibility: A clean and well-designed visual style makes our plots more accessible to a wider audience, including those who may not be familiar with the technical details of our analysis.
Integrating the Theme into Our Workflow
The `customization` theme will be applied to all the plots we create during our missing data analysis. This includes:
Counts of Missing Values: Bar plots or other visualizations summarizing the amount of missing data for each variable.
Missingness Patterns: More complex visualizations, like UpSet plots, that reveal how missing values are distributed across different combinations of variables.
By applying this theme consistently, we ensure that all our visualizations are not only informative but also visually appealing and easy to understand.
Pro Tip: Saving and Reusing Your Theme
To make your own custom theme even more useful, you can save it as an R object and reuse it in future projects. Here's how:
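For example (the file path here is assumed):

```r
# Save the theme once...
saveRDS(customization, here::here("Output", "customization_theme.rds"))

# ...then reload it in a later session or project.
customization <- readRDS(here::here("Output", "customization_theme.rds"))
```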
This allows you to maintain a consistent visual style across all your analyses without having to redefine the theme each time.
Looking Ahead: Bringing Our Data to Life with Visualizations
With our custom theme defined, we're now fully equipped to create impactful visualizations that will help us understand the patterns of missing data in our dataset. In the next sections, we'll define helper functions to prepare our data for plotting, and then we'll generate insightful visualizations, including UpSet plots, to explore the intricacies of missingness.
By combining careful data preparation with a polished visual style, we're setting the stage for a deeper understanding of our data and, ultimately, more reliable survival models.
2.4 Defining Helper Functions for Plotting Missingness
Introduction
Understanding the patterns of missing data in our dataset is a crucial step in preparing for survival analysis. Visualizing these patterns helps us identify potential biases, choose appropriate imputation strategies, and ultimately build more reliable models. In this section, we'll focus on creating helper functions that streamline the process of generating insightful visualizations of missing data, particularly using UpSet plots.
Step 1: Ensuring Valid Inputs
Before we start creating visualizations, we need to make sure that our functions are robust. We'll define two simple helper functions to validate our inputs and prevent errors down the line:
`test_if_null`: Checking for NULL Inputs
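A sketch of the function, reconstructed from the description below; the exact error wording is an assumption:

```r
test_if_null <- function(x, arg_name = "data") {
  # Abort with an informative message if the input is NULL.
  if (is.null(x)) {
    cli::cli_abort(
      c("{.arg {arg_name}} must not be NULL.",
        "x" = "The input provided has class {.cls {class(x)}}.")
    )
  }
  invisible(x)
}
```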
What It Does
This function checks if the input `x` is `NULL`.
If it is `NULL`, it throws a clear error message using `cli::cli_abort`, indicating that the input must not be `NULL` and reporting the class of the provided input.
Why It's Important
`NULL` values can cause unexpected behavior in many R functions. By explicitly checking for them, we prevent our code from crashing and make debugging easier.
`test_if_dataframe`: Ensuring Data Frame Inputs
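A matching sketch for the data frame check; again, the message wording is an assumption:

```r
test_if_dataframe <- function(x, arg_name = "data") {
  # Abort with an informative message if the input is not a data frame.
  if (!inherits(x, "data.frame")) {
    cli::cli_abort(
      c("{.arg {arg_name}} must be a data frame.",
        "x" = "Expected class {.cls data.frame}, but got {.cls {class(x)}}.")
    )
  }
  invisible(x)
}
```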
What It Does
This function checks if the input `x` is a data frame using `inherits(x, "data.frame")`.
If it's not a data frame, it throws a clear error message, indicating the expected class ("data.frame") and the actual class of the input.
Why It's Important
Many data manipulation and visualization functions in R expect data frames as input. This check ensures that our functions are used correctly.
Step 2: Preparing Data for UpSet Plots - The `as_shadow_upset_custom` Function
UpSet plots are a powerful tool for visualizing the intersections of missing data across multiple variables. To create these plots, we need to transform our data into a specific format known as a "shadow matrix."
The `as_shadow_upset_custom` function handles this transformation:
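A sketch of the function, reconstructed from the steps described below; it assumes the tidyverse and naniar were loaded during setup, `preferred_labels` corresponds to the `var_name_mapping` defined earlier, and the error message wording is an assumption:

```r
as_shadow_upset_custom <- function(data, preferred_labels) {
  # Validate the input.
  test_if_null(data)
  test_if_dataframe(data)

  if (naniar::n_var_miss(data) <= 1) {
    cli::cli_abort("UpSet plots require at least two variables with missing values.")
  }

  # Build the shadow matrix: 1 = missing, 0 = observed.
  data_shadow <- is.na(data) * 1

  # Replace raw variable names with descriptive labels.
  colnames(data_shadow) <- sapply(colnames(data), function(x) preferred_labels[x])

  # Convert to a data frame of integer indicators for UpSetR.
  data_shadow <- as.data.frame(data_shadow)
  data_shadow <- data_shadow |> mutate(across(where(is.numeric), as.integer))

  data_shadow
}
```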
What It Does
Input Validation:
It first uses our helper functions, `test_if_null` and `test_if_dataframe`, to ensure that the input `data` is not `NULL` and is a data frame.
It then checks if the number of variables with missing data is less than 2 using `n_var_miss(data) <= 1` from the `naniar` package. If so, it throws an error because UpSet plots are most informative when visualizing the intersections of missingness between at least two variables.
Shadow Matrix Creation:
`data_shadow <- is.na(data) * 1`: This is the core of the transformation. It creates a new data frame called `data_shadow` where each cell indicates whether the corresponding cell in the original `data` is missing (`NA`). The `is.na(data)` part generates a logical matrix (`TRUE` for missing, `FALSE` for not missing), and multiplying by 1 converts `TRUE` to 1 and `FALSE` to 0.
Column Renaming:
`colnames(data_shadow) <- sapply(colnames(data), function(x) preferred_labels[x])`: This line renames the columns of the `data_shadow` data frame using the `preferred_labels` we defined earlier. This makes the resulting UpSet plot more interpretable.
Data Frame Conversion and Integer Type:
`data_shadow <- as.data.frame(data_shadow)`: The shadow matrix is converted to a data frame.
`data_shadow <- data_shadow |> mutate(across(where(is.numeric), as.integer))`: Any numeric columns in the data frame are then converted to integers. This step ensures that the data frame is in the correct format for the `upset` function we will use later.
Why It's Important
Prepares Data for UpSet Plots: The shadow matrix format is required by the `UpSetR` package, which we'll use to create UpSet plots.
Enhances Interpretability: Using our `preferred_labels` to rename columns makes the UpSet plots easier to understand.
Step 3: Creating Customized UpSet Plots - The `gg_miss_upset_custom` Function
Finally, we define a function called `gg_miss_upset_custom` to generate our customized UpSet plots:
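A sketch of the function, matching the behavior described below:

```r
gg_miss_upset_custom <- function(data, preferred_labels, ...) {
  # Transform the data into the shadow-matrix format UpSetR expects.
  data_shadow <- as_shadow_upset_custom(data, preferred_labels)

  UpSetR::upset(
    data_shadow,
    order.by              = "freq", # sort intersections by how common they are
    set_size.show         = TRUE,   # display each set's size
    set_size.numbers_size = 4.5,    # font size of the set-size labels
    ...                             # further customization passed to upset()
  )
}
```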
What It Does
Data Transformation: It calls our `as_shadow_upset_custom` function to transform the input `data` into the required shadow matrix format.
UpSet Plot Generation: It uses the `upset()` function from the `UpSetR` package to create the UpSet plot, with the following arguments:
`order.by = "freq"`: This argument sorts the intersections in the plot by frequency (how common they are).
`set_size.show = TRUE`: This argument ensures that the plot displays the size of each set (variable).
`set_size.numbers_size = 4.5`: This argument controls the font size of the set sizes.
`…`: This allows us to pass additional arguments to the `upset` function for further customization.
Why It's Important
Visualizes Missing Data Intersections: UpSet plots are excellent for visualizing how missing values are distributed across different combinations of variables. They reveal patterns of missingness that might not be apparent from simple summaries.
Customization: The function allows us to customize the plot's appearance and sorting, making it easier to highlight the most relevant patterns.
Conceptual Reasons: Why These Functions Matter
These helper functions embody important principles for good data analysis:
Reusability: We can easily reuse these functions with different datasets or different sets of variables, saving us time and effort in future analyses.
Error Handling: The input validation checks help prevent errors and make our code more robust.
Interpretability: By using `preferred_labels`, we ensure that our visualizations are clear and understandable to a wider audience.
Looking Ahead: Visualizing Missingness and Summarizing Our Data
With these helper functions in place, we're ready to generate insightful visualizations of missing data patterns and create comprehensive descriptive statistics tables. In the next section, we'll use these tools to explore the extent and nature of missingness in our TBIMS dataset, examining the intersections of missing data and summarizing the characteristics of our study population. This exploration will pave the way for building reliable survival models.
2.5 Visualizing Missingness Counts
Introduction
Before we can make informed decisions about how to handle missing data, we need to understand its extent and nature. Are values missing completely at random, or are there underlying patterns that could bias our results? This is where visualizing missingness comes in.
In this section, we'll focus on creating clear and informative plots that reveal the patterns of missing data in our dataset, specifically using bar plots to identify variables with substantial missingness. These visualizations will be crucial for guiding our choices regarding imputation or, in some cases, the exclusion of certain variables or participants.
Step 1: Defining a Helper Function to Count Missing Values - `prepare_na_counts_df`
First, let's create a helper function called `prepare_na_counts_df` that will streamline the process of calculating and storing the number of missing values for each variable.
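A sketch of the helper, matching the description below; the column names of the resulting data frame are assumptions:

```r
prepare_na_counts_df <- function(df) {
  # Count the NA values in every column of the data frame.
  na_counts <- sapply(df, function(x) sum(is.na(x)))

  # Package the variable names and their missing-value counts together.
  labels_df <- data.frame(
    variable  = names(na_counts),
    n_missing = as.integer(na_counts),
    row.names = NULL
  )

  labels_df
}
```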
What's happening here?
Purpose: This function takes a data frame (`df`) as input and calculates the number of missing values (`NA`s) in each column (variable).
`sapply()` for Efficiency: The `sapply()` function efficiently applies the `sum(is.na(x))` function to each column of the data frame. This counts the number of `NA` values in each column.
Storing Results: The function then neatly packages the variable names and their corresponding missing value counts into a data frame called `labels_df`.
Why It's Important
Centralized Information: This function consolidates all of the missingness information into a single, easy-to-use data frame.
Foundation for Visualization: The `labels_df` data frame will be used to create our visualizations.
Reusability: We can reuse this function with different data frames, making it a valuable tool for our workflow.
Step 2: Preparing Missing Value Counts for Different Covariate Sets
Now, we'll use our `prepare_na_counts_df` function to calculate missing value counts for both `all_proposed_covariates` and `select_covariates`, the two sets of variables that we defined earlier. We will also save these results to `.rds` files for future reference.
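A sketch of this step; the output file names are assumptions, and the counts are calculated here on the plot-specific data frames defined earlier:

```r
na_counts_for_all_proposed_covariates <-
  prepare_na_counts_df(all_proposed_covariates_for_plots)

na_counts_for_select_covariates <-
  prepare_na_counts_df(select_covariates_for_plots)

# Save the counts for later reference (file names assumed).
saveRDS(na_counts_for_all_proposed_covariates,
        file.path(processed_data_dir, "na_counts_all_proposed_covariates.rds"))
saveRDS(na_counts_for_select_covariates,
        file.path(processed_data_dir, "na_counts_select_covariates.rds"))
```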
What's happening here?
Calculating Missing Value Counts: We call `prepare_na_counts_df` twice—once for each of our covariate sets—creating two data frames: `na_counts_for_all_proposed_covariates` and `na_counts_for_select_covariates`.
Saving Results: We use `saveRDS()` to save these data frames as `.rds` files. This allows us to easily load them later without having to recalculate the missing value counts.
Why It Matters
Targeted Exploration: This allows us to examine missingness patterns specifically within our two sets of covariates, informing our decisions about which variables to prioritize in our analysis.
Reproducibility: Saving these intermediate results ensures that our analysis is reproducible and that we can easily revisit these missing value counts later if needed.
Step 3: Visualizing Missing Value Counts with Bar Plots
Now comes the exciting part: visualizing the missing data! We'll create bar plots that show the number of missing values for each variable, making it easy to identify variables that might require special attention.
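A sketch of the plotting code for Figure 1-2 is below; the Figure 1-1 version follows the same pattern with the all-proposed-covariates objects. The text labels are drawn from the `na_counts_for_select_covariates` data frame created above, and the object name, justification, and sizing values are assumptions:

```r
missingness_counts_plot <- gg_miss_var(select_covariates_for_plots) +
  labs(x = "Variable", y = "Number of Missing Values") +
  scale_x_discrete(labels = var_name_mapping) +
  scale_y_continuous(labels = label_comma()) +
  theme_classic() +
  customization +
  geom_text(
    data = na_counts_for_select_covariates,
    aes(x = variable, y = n_missing, label = comma(n_missing)),
    inherit.aes = FALSE, hjust = -0.2, size = 3.5  # print the count beside each bar
  )

missingness_counts_plot
```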
Figure 1-1: Missing Value Counts for All Proposed Covariates
Figure 1-2: Missing Value Counts for Select Covariates
What's happening here?
`gg_miss_var()`: This function from the `naniar` package creates a bar plot showing the number of missing values for each variable in the input data frame.
Customization:
`labs()`: We add clear labels for the x and y axes.
`scale_x_discrete(labels = var_name_mapping)`: We use our `var_name_mapping` to replace the original variable names with our descriptive labels on the x-axis.
`scale_y_continuous(labels = label_comma())`: We format the y-axis labels with commas for better readability of large numbers.
`theme_classic()`: We apply a clean, classic theme to the plot.
`customization`: We apply our custom theme, defined earlier, for consistent plot aesthetics.
`geom_text()`: We add text labels above each bar, displaying the exact number of missing values. This provides additional clarity and precision.
Why It Matters
At-a-Glance Summary: These plots provide a quick and intuitive overview of the extent of missingness for each variable.
Informs Imputation/Exclusion Decisions: Variables with high levels of missingness might require imputation or, in some cases, might need to be excluded from certain analyses. These plots help us make informed decisions about how to handle missing data.
Publication-Ready Visuals: The customizations ensure that our plots are clear, informative, and visually appealing, suitable for inclusion in reports or publications.
Interpreting the Bar Plot of Missing Values for Select Covariates
Before we move on to more complex visualizations of missing data, let's take a moment to interpret the bar plot we created for our `select_covariates` (Figure 1-2). This plot provides a clear, visual summary of the number of missing values for each variable in our focused set of covariates.

What the Plot Reveals
The bar plot instantly reveals several key insights into the missingness patterns within our `select_covariates`:
High Levels of Missingness in Mental Health and Function Factor Score Variables: The most striking observation is the substantial number of missing values in several key variables, particularly those related to mental health and the Function Factor Score:
Mental Health Variables:
Depression Level at Year 1 has 1,530 missing values.
History of Suicide Attempt has 1,522 missing values.
History of Mental Health Treatment has 1,447 missing values.
Problematic Substance Use at Injury has 1,131 missing values.
Functional Independence:
Function Factor Score at Year 1 (and its derived quintiles) has 845 missing values.
Relatively Low Levels of Missingness in Baseline Variables: Variables measured at baseline or during the initial rehabilitation period generally have much lower levels of missingness:
Educational Attainment at Injury has 27 missing values.
Medicaid Status has 17 missing values.
Sex has only 1 missing value.
Age at Injury has no missing values.
Why These Patterns Matter:
Imputation Needs: The high levels of missingness in these crucial variables, especially those related to mental health and functional status, underscore the need for careful imputation. Simply discarding participants with missing data on these variables would drastically reduce our sample size and may also introduce bias.
Potential Biases: The fact that missingness is concentrated in mental health-related variables raises concerns about potential biases. Stigma surrounding mental health issues might make participants less likely to disclose this information, leading to higher rates of missingness. Additionally, participants with more severe mental health challenges might be more difficult to reach for follow-up.
Informing Imputation Strategies: Understanding which variables have the most missing data, and the potential reasons for this, is crucial for selecting appropriate imputation methods. The high correlation between missingness in depression level, suicide attempt history, and mental health treatment history suggests that these variables might be missing for similar reasons. Furthermore, the correlation between missingness in the mental health variables and Function Factor Score at Year 1 suggests that these variables may be missing due to similar underlying factors. A multivariate imputation approach, which takes into account the relationships between variables, might be particularly appropriate here.
Prioritizing Variables: The plot helps us prioritize our efforts in addressing missing data. Clearly, the mental health and functional status variables require the most attention.
Figure 1-2: A Visual Guide
The bar plot provides a clear visual representation of these patterns. Each bar represents a variable, and the length of the bar corresponds to the number of missing values. The exact counts are also displayed above each bar for precision.
By examining this plot, we can quickly grasp the extent of missingness in our key variables and start planning our strategy for addressing it.
Looking Ahead: Unveiling Deeper Patterns with UpSet Plots
While this bar plot provides a valuable overview of missingness per variable, it doesn't reveal how missing values are related across variables. For instance, are the participants with missing Depression Level at Year 1 also missing Function Factor Score at Year 1? Or are these distinct groups?
To answer these questions, we'll turn to UpSet plots in the next section. These powerful visualizations will allow us to explore the intersection of missing data, revealing complex patterns that can inform our imputation strategies and help us build more robust survival models. We will also use the `gtsummary` package to create descriptive statistics tables that will help us further explore our data and prepare for survival modeling.
2.6 Visualizing Missingness Patterns
Introduction
We've calculated the amount of missing data for each variable, but to truly understand the nature of missingness in our dataset, we need to go a step further. We need to explore how missing values are related across variables. Are certain variables frequently missing together? Are there distinct patterns of missingness that could inform our imputation strategies?
This is where UpSet plots come in. These powerful visualizations are specifically designed to reveal the intersections of missing data, showing us which combinations of variables tend to be missing simultaneously. By visualizing these patterns, we can gain valuable insights that will guide our decisions about how to handle missing data in our survival analysis.
In this section, we'll generate UpSet plots for two key sets of covariates:
All Proposed Covariates: This gives us a broad overview of missingness across all of the variables that we initially considered for our analysis.
Select Covariates: This focuses on the final set of variables chosen for our Cox regression models after addressing potential overfitting.
Why UpSet Plots? A Powerful Tool for Exploring Missing Data
UpSet plots are particularly well-suited for visualizing missing data patterns because they:
Reveal Intersections: They show us which combinations of variables are frequently missing together, helping us understand the relationships between missing values. For example, we might discover that participants who are missing Depression Level at Year 1 are often also missing Function Factor Score at Year 1.
Guide Imputation Strategies: The patterns revealed by UpSet plots can inform our choice of imputation methods. For instance, if we see that several variables are often missing together, a multivariate imputation approach might be more appropriate than imputing each variable independently.
Highlight Potential Biases: UpSet plots can help us identify potential biases related to missing data. If certain groups of participants are more likely to have missing values on specific combinations of variables, this could affect the generalizability of our findings.
Step 1: Creating an UpSet Plot for All Proposed Covariates
Let's start by generating an UpSet plot for all of our proposed covariates. This will give us a broad overview of missingness patterns across the entire dataset.
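A sketch of this step; the file name and image dimensions are assumptions:

```r
file_path <- file.path(missingness_plots_dir, "upset_all_proposed_covariates.png")

png(file_path, width = 12, height = 8, units = "in", res = 300)
gg_miss_upset_custom(
  all_proposed_covariates_for_plots,
  preferred_labels = var_name_mapping,
  nsets = 5  # passed through to UpSetR::upset()
)
dev.off()
```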
What's happening here?
`file_path`: We define the file path where the plot will be saved, within our `missingness_plots_dir` directory.
`png(…)`: This function opens a PNG graphics device, which means that the plot we create will be saved as a `.png` file. We specify the `width`, `height`, and resolution (`res`) of the image.
`gg_miss_upset_custom(…)`: This is our custom function (defined in "Section 2.4 Defining Helper Functions for Plotting Missingness") that generates the UpSet plot.
`all_proposed_covariates_for_plots`: This is the data frame that contains all proposed covariates, excluding those we deemed irrelevant for plotting.
`var_name_mapping`: This provides user-friendly labels for the variables in the plot.
`nsets = 5`: This argument tells the function to display the top 5 most frequent missing data intersections.
`dev.off()`: This closes the PNG graphics device, saving the plot to the specified file.
Conceptual Breakdown
High-Resolution Plots: We save the plot as a high-resolution PNG file to ensure clarity and readability, especially since UpSet plots can become quite complex.
Customization: Our
gg_miss_upset_custom
function handles the data transformation needed for UpSet plots and applies our preferred variable labels for better interpretability. By limiting the plot to the top 5 intersections (nsets = 5
), we focus on the most prevalent missing data patterns.
Step 2: Creating an UpSet Plot for Select Covariates
Next, we'll generate an UpSet plot specifically for our `select_covariates`—the variables included in our final Cox regression models.
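The same pattern applies here, swapping in the select-covariates objects (file name assumed):

```r
file_path <- file.path(missingness_plots_dir, "upset_select_covariates.png")

png(file_path, width = 12, height = 8, units = "in", res = 300)
gg_miss_upset_custom(
  select_covariates_for_plots,
  preferred_labels = var_name_mapping,
  nsets = 5
)
dev.off()
```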
What's happening here?
This code is very similar to the previous example, but it uses `select_covariates_for_plots` as the input data, focusing on our core set of predictor variables.
Why It Matters
Focus on Key Variables: By examining missingness patterns in our `select_covariates`, we can directly address the missing data issues that are most relevant to our final models.
Model-Specific Insights: This plot helps us understand how missing data might impact the specific variables included in our Cox regression analysis.
Interpreting UpSet Plots: Decoding the Patterns
UpSet plots might look a bit complex at first, but they are incredibly informative once you understand how to read them. Here's a quick guide:
Rows in the Intersection Matrix (Bottom Part):
Each row represents a variable in our dataset.
Filled dots in a row indicate that the variable is part of a specific intersection (combination) of missing data.
Bars Above the Intersection Matrix (Top Part):
Each bar represents a unique combination of variables with missing data (an intersection).
The height of the bar indicates the number of observations that have that specific pattern of missingness.
Example:
Imagine a bar that corresponds to filled dots for Depression Level at Year 1, History of Suicide Attempt, and History of Mental Health Treatment. The height of that bar would tell us how many participants are missing data on all three of those variables simultaneously.
Insights We Can Gain
Common Missingness Patterns: We can identify which combinations of variables are frequently missing together. This can reveal underlying reasons for missing data (e.g., variables coming from the same questionnaire or requiring similar data collection procedures).
Isolated vs. Overlapping Missingness: We can see whether variables are often missing independently or if they tend to be missing in conjunction with other specific variables.
Guiding Imputation: These insights are invaluable for choosing appropriate imputation strategies. For example, if variables are frequently missing together, a multivariate imputation method might be necessary.
Practical Takeaways: Turning Visualizations into Action
The UpSet plots that we generate provide actionable insights that will directly inform our data preprocessing decisions:
Joint Imputation: If we observe that certain variables are frequently missing together, we might consider using imputation methods that can handle missing variables simultaneously (e.g., multivariate imputation by chained equations [MICE]).
Sensitivity Analyses: If critical variables have substantial missingness, we might need to perform sensitivity analyses to assess the potential impact of excluding these variables or using different imputation methods.
Efficiency and Transparency: Saving our plots as `.png` files ensures that we can easily share them with collaborators, include them in reports, and maintain a clear record of our data exploration process.
Interpreting the UpSet Plot of Missingness for Select Covariates
Now that we've generated our UpSet plot for the `select_covariates` (Figure 2-2), let's dive into its interpretation. This plot provides a powerful visual representation of how missing values overlap across our key variables. By understanding these patterns, we can make more informed decisions about how to handle missing data in our survival analysis.

Understanding the Structure of the Plot
Recall that the UpSet plot displays:
Rows (Left Side): Each row represents one of our `select_covariates`. The horizontal bars represent the number of missing observations for that variable (the "set size").
Intersection Matrix (Bottom Right): The matrix of dots shows the different combinations of variables where missingness overlaps. Each column represents a unique intersection.
Vertical Bars (Top Right): The vertical bars above the matrix represent the size (number of observations) of each intersection—in other words, the number of participants who have that specific pattern of missing data.
Key Observations from Figure 2-2
Let's analyze the key patterns revealed by our UpSet plot:
Dominant Missingness Patterns: The tallest bars on the plot highlight the most common missing data patterns. We can see that the most frequent pattern is having missing values in Problematic Substance Use at Injury (552 observations). The second most frequent pattern involves missingness across Function Factor Score at Year 1 Quintiles, History of Mental Health Treatment, History of Suicide Attempt, and Depression Level at Year 1 all at the same time (544 observations). The third and fourth most frequent patterns involve missingness only in Depression Level at Year 1, and in History of Mental Health Treatment together with History of Suicide Attempt, respectively (417 and 401 observations). These four patterns account for the majority of observations with missing data.
Clustering of Missingness in Mental Health Variables: The connected dots in the intersection matrix reveal a strong tendency for missing values to cluster within our mental health variables (History of Mental Health Treatment, History of Suicide Attempt, and Depression Level at Year 1) and Function Factor Score at Year 1 Quintiles. This suggests that participants who are missing data on one of these variables are also likely to be missing data on others.
Implications for Our Analysis
Multivariate Imputation: The strong clustering of missingness among our mental health variables and functional status variable suggests that a multivariate imputation approach might be the most appropriate. This type of imputation takes into account the relationship between variables when filling in missing values, potentially leading to more accurate and less biased results.
Potential Biases: The observed patterns raise concerns about potential biases. For instance, participants who are unwilling to disclose information about their mental health might also be less likely to participate in follow-up assessments, leading to missing data on other variables. We'll need to carefully consider these potential biases when interpreting our findings.
Focus on Key Variables: The UpSet plot confirms that missingness is concentrated in our `select_covariates`, particularly the mental health variables and the function factor score variable. This justifies our focus on these variables during the imputation process.
Figure 2-2: A Visual Guide to Missingness
By carefully examining the UpSet plot, we can quickly identify the dominant patterns of missing data and begin to formulate hypotheses about the underlying reasons for this missingness. This information is crucial for making informed decisions about how to proceed with imputation and ultimately build reliable survival models. The plot also allows us to quickly communicate these issues to others, including collaborators or individuals reviewing our work.
What's Next: Summarizing Our Data with Descriptive Statistics
Having visualized the patterns of missingness in our data, we're now ready to move on to the next crucial step of exploratory data analysis: generating descriptive statistics tables. These tables will provide a comprehensive summary of our study population's characteristics, stratified by depression levels at Year 1. This will allow us to further explore our data, identify potential confounders, and refine our hypotheses before proceeding to survival modeling.
Conclusion
We've reached a critical juncture in our survival analysis journey. We've taken raw, complex data and transformed it into a meticulously prepared, analysis-ready dataset. This hasn't just been about cleaning and organizing; it's been about crafting a powerful resource that will enable us to unlock meaningful insights into the relationship between depression one year post-TBI and long-term survival.
This installment focused on understanding and addressing the critical issue of missing data. We've explored its patterns, visualized its complexities using bar plots and UpSet plots, and made a key decision about how to handle it for this stage of our analysis. While we acknowledge that sophisticated methods like multiple imputation offer powerful ways to handle missing data—and we will explore them in detail in a forthcoming blog series—we opted for listwise deletion (complete-case analysis) in this specific study of depression's impact on all-cause mortality.
Why Listwise Deletion Here?
Our choice was driven by a need for straightforward interpretation and a streamlined analytic process. Listwise deletion, while potentially introducing bias if the data are not Missing Completely at Random (MCAR), allows us to work with a readily defined subset of our data where all variables of interest have complete information. This approach offers several advantages in the context of this introductory blog series:
Simplicity and Clarity: It provides a clear and easy-to-understand starting point for our analysis, making it easier to explain the core concepts of survival analysis without the added complexity of imputation.
Transparency: The impact of listwise deletion on our sample size is readily apparent, allowing readers to clearly see the population on which our findings are based.
Sufficient Power: Despite the reduction in sample size that comes with listwise deletion, our initial assessments indicated that we still retain sufficient statistical power for meaningful analysis. We will demonstrate this in the next blog post when we present our descriptive statistics tables.
We recognize the trade-off between the simplicity of listwise deletion and the potential for bias. However, for this specific analysis, we prioritized presenting a clear and direct examination of the relationship between depression and mortality in a readily interpretable manner.
A Dedicated Series on Multiple Imputation
It's important to emphasize that we do not dismiss the value of multiple imputation. In fact, we are planning a dedicated blog series that will delve into the intricacies of this powerful technique. That series will provide a comprehensive resource on multiple imputation, covering its theoretical underpinnings, practical implementation in R, and its application in the context of survival analysis. We will revisit the TBIMS data in that series to demonstrate the application of multiple imputation to this dataset.
For now, our complete-case sample provides a solid foundation for our initial exploration and modeling.
Reflecting on Our Progress: A Recap of Key Accomplishments
Let's take a moment to appreciate the significant strides we've made:
Setting Up a Robust R Environment: We began by establishing a reproducible R environment, organizing our project directory, and loading essential libraries. This seemingly simple step is the bedrock of a streamlined and efficient workflow.
Defining and Refining Our Variables: We carefully defined our covariates of interest, creating both comprehensive and focused lists to support different stages of analysis. We also assigned clear, descriptive labels to make our data more accessible and interpretable. We made key decisions to refine our variable list to avoid overfitting in our final models, and we made these decisions transparent and reproducible by documenting them in our code.
Creating a Polished Visual Style: We crafted a custom theme for our plots, ensuring that our visualizations will be both informative and visually engaging, effectively communicating our findings to a wide audience.
Illuminating Missing Data Patterns: We used bar plots and UpSet plots to visualize the extent and nature of missingness in our data. These visualizations provided crucial insights into which variables are most affected by missing data and how missing values cluster together, informing our imputation strategies.
Extracting, Imputing, and Transforming Key Variables: We extracted and imputed Year 1 variables, created comprehensive mental health history variables, and transformed a skewed continuous variable into quintiles.
Applying Eligibility Criteria and Handling Special Cases: We carefully applied our study's eligibility criteria, defining our analysis sample and thoughtfully handling cases with incomplete follow-up data.
Why This Matters
Addressing missing data is about ensuring that our analysis is scientifically sound and ethically responsible. By understanding and visualizing missingness patterns, we've taken crucial steps to:
Mitigate Potential Biases: We've gained a deeper understanding of why data might be missing, allowing us to choose appropriate imputation methods and minimize the risk of biased results.
Maximize the Value of Our Data: Thoughtful handling of missing data allows us to retain as much information as possible, enhancing the statistical power of our study and the generalizability of our findings.
Ensure Transparency and Reproducibility: Our detailed logging and documentation ensures that our entire process is transparent and reproducible, building trust in our findings and allowing others to build upon our work.
Looking Ahead: From Exploration to Modeling - Transforming Data into Insights
With our missing data patterns illuminated and our dataset prepared, we're now poised to enter the next phase of our analysis.
In the upcoming posts, we will:
Explore our data through descriptive statistics: We will generate comprehensive summaries of our study population, examining the distribution of key variables and identifying potential relationships. We will stratify these summaries by depression level at year 1 to gain a better understanding of how these groups may differ.
Visualize survival patterns: We'll use Kaplan-Meier curves and other powerful visualizations to bring our data to life, revealing survival trends and patterns that will inform our modeling choices.
Build and interpret our survival models: We will use Cox regression to examine the relationship between depression and mortality, ultimately answering our core research question with precision and clarity.
We're not just analyzing data; we're uncovering a story about recovery, resilience, and the factors that shape long-term outcomes for individuals with TBI. This carefully prepared dataset is the foundation of that story, and the forthcoming blog series on multiple imputation will add another layer of depth to this narrative.