Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 5)

February 3, 2025

Research

Tutorials

Introduction

Welcome to the fifth installment in our comprehensive series on survival analysis! In this series, we're exploring the practical steps involved in analyzing real-world data to uncover meaningful insights about how depression levels one year after a traumatic brain injury (TBI) influence all-cause mortality within the subsequent five years. By leveraging the power of R, we're delving into the nuances of survival analysis while maintaining a focus on reproducibility and clarity.

This post shifts our focus toward the critical task of exploratory data analysis, with an emphasis on understanding and addressing missing data. Missingness is a common yet challenging issue in longitudinal studies, and the decisions we make in handling it can have a profound impact on the reliability of our findings. Through visualization and descriptive statistics, we aim to uncover patterns in missingness and lay a solid foundation for imputing values or justifying exclusions.

Here's what we'll cover in this post:

2.1 Initial Setup and Library Loading

We'll set the stage by loading essential R libraries, organizing our project directory, and ensuring our environment is equipped for reproducible and efficient analysis.

2.2 Defining Covariates and Assigning Clear Labels

Next, we'll prepare the data by defining comprehensive and focused lists of covariates. This step involves assigning meaningful labels to variables, balancing breadth and focus to enable both exploratory and targeted analyses.

2.3 Crafting a Custom Theme

As we transition to visualization, we'll define a custom theme to ensure all plots have a consistent and professional appearance. Thoughtful aesthetics enhance the interpretability and impact of our findings.

2.4 Defining Helper Functions for Plotting Missingness

We'll introduce helper functions for handling missing data visualizations, streamlining the process and ensuring clarity in how we explore patterns of missingness.

2.5 Visualizing Missingness Counts

Using bar plots, we'll visualize the extent of missingness across variables, identifying key areas that require attention and providing an at-a-glance summary of data quality.

2.6 Visualizing Missingness Patterns

Finally, we'll dive into UpSet plots—a powerful tool for visualizing intersections of missing data. By examining these patterns, we'll uncover relationships between variables and guide imputation strategies.

Why This Matters: The Role of Exploratory Missing Data Analysis

Handling missing data is more than a technical hurdle—it's a critical step in ensuring the validity of our analysis. Through a combination of clear visualizations and targeted exploration, we can:

  • Identify Biases: Understand how and why data might be missing, and consider its implications on the representativeness of our findings.

  • Guide Imputation Strategies: Determine whether missingness is random or systematic, informing the choice of imputation methods.

  • Ensure Robust Models: Lay the groundwork for reliable survival models by proactively addressing gaps in our data.

Throughout this post, you'll find step-by-step R code, intuitive explanations, and practical insights to enhance your own survival analysis projects. Whether you're grappling with a similar dataset or exploring this field for the first time, these techniques will help you navigate the complexities of missing data with confidence.

2.1 Initial Setup and Library Loading

Introduction

This script establishes the foundational environment for data analysis by loading essential R libraries, setting up a structured directory system for data management, loading preprocessed data, and configuring the study timeline. These steps ensure a reproducible, organized, and visually consistent workflow.

Step 1: Equipping Ourselves - Loading Essential Libraries

Before we can start analyzing data, we need to ensure that we have the right tools at our disposal. We'll load a curated set of R libraries, each chosen for its specific role in data preprocessing, analysis, or visualization.

# Load the pacman package (install if necessary)
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install and load prerequisite libraries
pacman::p_load(ComplexUpset, extrafont, here, naniar, scales, tidyverse, UpSetR)

Let's break down what's happening:
  1. pacman: Our Package Manager:

    • The pacman package simplifies the process of managing R packages. The opening if block checks whether pacman is installed and installs it if it is not.

    • Why It Matters: pacman streamlines our workflow by allowing us to install and load multiple packages with a single command (p_load). It also handles situations where a package is already installed, preventing unnecessary re-installations.

  2. Our Arsenal of Libraries:

  • ComplexUpset and UpSetR: These packages support UpSet plots, a powerful visualization technique for exploring complex patterns of missing data; the helper functions we define in Section 2.4 call UpSetR::upset() directly.

  • extrafont: This package allows us to customize our plots with specific fonts, giving our visualizations a polished and professional look.

  • here: This package is essential for creating reproducible file paths. It automatically detects the project's root directory, making our code portable across different computer environments.

  • naniar: This package is specifically designed for working with missing data. We'll use it to analyze and visualize missingness patterns in our dataset.

  • scales: This package provides tools for customizing plot scales and labels, enhancing the clarity and readability of our visualizations.

  • tidyverse: This is a collection of essential R packages for data science, including dplyr (for data manipulation), ggplot2 (for data visualization), and many others. The tidyverse provides a cohesive and powerful framework for working with data in R.

Pro Tip: Using pacman::p_load is highly recommended for managing package dependencies. It ensures that all required libraries are installed and loaded efficiently, saving you time and preventing potential errors.

Step 2: Building Our Home Base - Creating a Project Directory

A well-organized project directory is essential for managing our files, ensuring reproducibility, and collaborating effectively. Let's create a clear structure for our project:

# Create the 'Data/Processed' subdirectory if not already accessible
data_processed_dir <- here("Data", "Processed")
if (!dir.exists(data_processed_dir)) {
  dir.create(data_processed_dir, recursive = TRUE)
}

# Create the 'Output/Plots/Missingness' subdirectory if not already available
missingness_plots_dir <- here("Output", "Plots", "Missingness")
if (!dir.exists(missingness_plots_dir)) {
  dir.create(missingness_plots_dir, recursive = TRUE)
}

What's happening here?
  1. Defining Directories:

    • Data/Processed: This directory will house our preprocessed datasets, keeping them separate from the raw data.

    • Output/Plots/Missingness: This directory will store visualizations specifically related to missing data patterns.

  2. Automating Directory Creation:

    • here(): This function from the here package dynamically defines file paths relative to the project's root directory. This makes our code more portable, as it will work correctly even if the project is moved to a different location.

    • dir.create(): This function creates the specified directories. The recursive = TRUE argument ensures that any necessary parent directories are also created. The if (!dir.exists(…)) checks ensure that we don't accidentally recreate existing directories.

Why It Matters

This structured approach eliminates confusion about file locations, ensures that outputs and intermediate datasets are systematically organized, and promotes reproducibility.

Step 3: Loading Our Preprocessed Data

Now that our environment is set up, let's load the preprocessed dataset that we've been working with throughout this series:

# Load the .rds file
analytic_data_final <- readRDS(file.path(data_processed_dir, "analytic_data_final.rds"))
What's happening here?
  • readRDS(): This function loads an R object that was previously saved as an .rds file. We're loading our analytic_data_final dataset, which contains the results of all of our previous preprocessing steps.

Why It Matters
  • This dataset is now ready for the next stages of our analysis: descriptive exploration, visualization, and eventually, survival modeling.

  • Using .rds files is an efficient way to store and retrieve R objects, preserving all data structures, including factor levels, labels, and metadata.
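
Here's a tiny, self-contained illustration (using throwaway toy data, not our TBIMS dataset) of why .rds files are so convenient—attributes such as factor levels survive the round trip:

# Toy example: factor levels are preserved across saveRDS()/readRDS()
toy <- factor(c("Mild", "Moderate"), levels = c("Mild", "Moderate", "Severe"))
saveRDS(toy, file.path(tempdir(), "toy.rds"))
identical(levels(readRDS(file.path(tempdir(), "toy.rds"))), levels(toy))
#> [1] TRUE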

Step 4: Polishing Our Visualizations - Configuring Plot Aesthetics

To ensure that our visualizations effectively communicate our findings, let's import some custom fonts to give them a polished look:

# Register fonts previously imported with extrafont for use in graphics devices
loadfonts(device = "all", quiet = TRUE)

What's happening here?
  • loadfonts(): This function from the extrafont package registers fonts that were previously imported (for example, with font_import()) so they are available to R's graphics devices. Consistent typography and enhanced readability make your visualizations more impactful and professional.

Pro Tip: Test font availability on different systems to avoid discrepancies when collaborating or sharing code; if you're sharing code with others, it's best to specify a font that is commonly available across platforms.
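
If you'd like to guard against a missing font programmatically, here's a small, optional sketch (assuming extrafont's font database has already been built with font_import()):

# Fall back to a widely available font if "Proxima Nova" isn't registered
plot_font <- if ("Proxima Nova" %in% extrafont::fonts()) "Proxima Nova" else "Arial"
# plot_font can then be supplied to element_text(family = plot_font)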

The Big Picture: A Solid Foundation for Success

These initial setup steps are more than just technicalities; they establish a reproducible workflow, ensure that our project is well-organized, and equip us with the tools needed to handle complex datasets and analyses.

Looking Ahead: Exploring and Visualizing Our Data

With our R environment configured and our dataset loaded, we're now ready to delve into the exciting world of descriptive statistics and data visualization! In the next sections, we'll create comprehensive tables and insightful plots, uncovering key trends and relationships within our data. This exploratory phase will set the stage for building our robust survival models.

This foundational setup might seem like a small step, but it's the linchpin of a successful analysis pipeline. A solid foundation ensures that each subsequent step builds seamlessly on the last, culminating in meaningful insights and actionable results.

2.2 Defining Covariates and Assigning Clear Labels

Introduction

We're now ready to prepare our data for exploratory analysis and missing data visualization. This crucial step lays the groundwork for understanding the characteristics of our study population and identifying potential patterns in our data, ultimately informing our survival models. To do this effectively, we need to:

  1. Define our covariates of interest.

  2. Assign clear and descriptive labels to our variables.

Let's dive into how we accomplish these tasks.

Step 1: Defining Our Covariates of Interest

First, we need to explicitly define the variables that we'll be focusing on in our analyses. We'll create two lists:

  • all_proposed_covariates: This is an exhaustive list of all potential predictor variables in our dataset that might be relevant to our research question. It includes a wide range of variables capturing demographic information, injury characteristics, functional status, and mental health history. Think of this as our initial long list of potential players for our analysis.

  • select_covariates: This is a more curated list, containing a subset of variables that we've deemed particularly important for our core research question or that are most suitable for initial exploration based on careful consideration of previous research and clinical knowledge. This is our starting lineup—the key players that we'll first focus on. It's important to note that this selection isn't set in stone; we refined it after our initial Cox regression analyses, as detailed below.

Here's the code:

# Define all covariates of interest
all_proposed_covariates <- c("id",
                             "event_status",
                             "time_to_event_in_years",
                             "time_to_censorship_in_years",
                             "time_to_expiration_in_years",
                             "age_at_censorship",
                             "age_at_expiration",
                             "calendar_year_of_injury",
                             "sex",
                             "age_at_injury",
                             "education_level_at_injury",
                             "employment_at_injury",
                             "marital_status_at_injury",
                             "rehab_payor_primary_type",
                             "cause_of_injury",
                             "drs_total_at_year_1",
                             "fim_total_at_year_1",
                             "gose_total_at_year_1",
                             "func_score_at_year_1",
                             "func_score_at_year_1_q5",
                             "mental_health_tx_lifetime_at_injury",
                             "mental_health_tx_past_year_at_injury",
                             "mental_health_tx_hx",
                             "psych_hosp_hx_lifetime_at_injury",
                             "psych_hosp_hx_past_year_at_injury",
                             "psych_hosp_hx",
                             "problematic_substance_use_at_injury",
                             "problematic_substance_use_at_year_1",
                             "suicide_attempt_hx_lifetime_at_injury",
                             "suicide_attempt_hx_past_year_at_injury",
                             "suicide_attempt_hx_past_year_at_year_1",
                             "suicide_attempt_hx",
                             "depression_level_at_year_1")

# Define select covariates of interest (remove variables excluded for overfitting concerns)
select_covariates <- c("id",
                        "event_status",
                        "time_to_event_in_years",
                        "time_to_censorship_in_years",
                        "time_to_expiration_in_years",
                        "age_at_censorship",
                        "age_at_expiration",
                        "sex",
                        "age_at_injury",
                        "education_level_at_injury",
                        "rehab_payor_primary_type",
                        "func_score_at_year_1",
                        "func_score_at_year_1_q5",
                        "mental_health_tx_hx",
                        "problematic_substance_use_at_injury",
                        "suicide_attempt_hx",
                        "depression_level_at_year_1")
What's happening here?
  • We're creating two character vectors, all_proposed_covariates and select_covariates, that list the names of the variables that we'll be using.

  • select_covariates is a subset of all_proposed_covariates.

Why It Matters
  • Flexibility and Focus: Having both comprehensive and focused lists gives us flexibility. We can use all_proposed_covariates for broad exploratory analyses, generating hypotheses and examining a wide range of potential predictors. We can then use select_covariates for more targeted investigations related to our primary research question.

  • Organization and Clarity: Explicitly defining these lists makes our code more organized and easier to understand. It clearly signals which variables we're considering at each stage of the analysis.

Addressing Potential Overfitting

It's important to note that the select_covariates list was refined after our initial Cox regression analyses. We were mindful of the potential for overfitting, which can occur when a model is too complex relative to the amount of data available. Overfit models tend to perform well on the training data but poorly on new, unseen data.

One rule of thumb to mitigate overfitting is to have roughly 10-15 events (in our case, deaths) per predictor variable (or degree of freedom) in the model. Our initial 5-year dataset had approximately 4 events per degree of freedom (113 events and 26 df), falling short of this guideline.
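
As a quick back-of-the-envelope check, the arithmetic behind that statement looks like this:

# Events per degree of freedom, using the numbers reported above
n_events <- 113
model_df <- 26
n_events / model_df
#> [1] 4.346154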

To address this, we carefully considered the variables in our initial model and removed those that were deemed less critical or potentially redundant. This included:

  • calendar_year_of_injury: While potentially relevant, this variable might capture secular trends that could be confounded with other factors.

  • psych_hosp_hx: This variable, while important, might be correlated with other mental health variables, leading to redundancy.

  • employment_at_injury

  • cause_of_injury: This variable, while potentially relevant, might introduce too many categories (and thus degrees of freedom) relative to the number of events, increasing the risk of overfitting for this particular analysis. We did, however, retain it in the all_proposed_covariates list for use in descriptive statistics tables and data visualizations.

By trimming down our variable list, we aimed to create a more parsimonious and robust model, improving its generalizability and reducing the risk of overfitting.

Pro Tip: Model diagnostics and careful consideration of the balance between model complexity and the available data are crucial for avoiding overfitting. It's often an iterative process, requiring adjustments and refinements as you explore your data and build your models.

Step 2: Defining Preferred Variable Labels - Speaking a Common Language

Raw variable names can often be cryptic or inconsistent. To make our data more user-friendly and our results more interpretable, we'll assign clear, descriptive labels to our variables.

# Define the preferred variable labels for all covariates
var_name_mapping <- c(
  depression_level_at_year_1 = "Depression Level at Year 1",
  calendar_year_of_injury = "Calendar Year of Injury",
  sex = "Sex",
  age_at_injury = "Age at Injury",
  education_level_at_injury = "Educational Attainment at Injury",
  employment_at_injury = "Employment Status at Injury",
  marital_status_at_injury = "Marital Status at Injury",
  rehab_payor_primary_type = "Medicaid Status",
  cause_of_injury = "Mechanism of Injury",
  drs_total_at_year_1 = "DRS Score at Year 1",
  fim_total_at_year_1 = "FIM Score at Year 1",
  gose_total_at_year_1 = "GOS-E Score at Year 1",
  func_score_at_year_1 = "Function Factor Score at Year 1",
  func_score_at_year_1_q5 = "Function Factor Score at Year 1 Quintiles",
  mental_health_tx_lifetime_at_injury = "Lifetime History of Mental Health Treatment at Injury",
  mental_health_tx_past_year_at_injury = "Past-Year History of Mental Health Treatment at Injury",
  mental_health_tx_hx = "History of Mental Health Treatment",
  psych_hosp_hx_lifetime_at_injury = "Lifetime History of Psychiatric Hospitalization at Injury",
  psych_hosp_hx_past_year_at_injury = "Past-Year History of Psychiatric Hospitalization at Injury",
  psych_hosp_hx = "History of Psychiatric Hospitalization",
  problematic_substance_use_at_injury = "Problematic Substance Use at Injury",
  problematic_substance_use_at_year_1 = "Problematic Substance Use at Year 1",
  suicide_attempt_hx_lifetime_at_injury = "Lifetime History of Suicide Attempt at Injury",
  suicide_attempt_hx_past_year_at_injury = "Past-Year History of Suicide Attempt at Injury",
  suicide_attempt_hx_past_year_at_year_1 = "Past-Year History of Suicide Attempt at Year 1",
  suicide_attempt_hx = "History of Suicide Attempt"
)

What's happening here?
  • var_name_mapping: We create a named character vector where the names are the original variable names in our dataset and the values are the new, descriptive labels we want to assign. For example, we're mapping the variable depression_level_at_year_1 to the label "Depression Level at Year 1."

Why It Matters
  • Clarity and Interpretability: Descriptive labels make our results much easier to understand, especially for those who are not familiar with the technical details of the dataset.

  • Consistency: Using these labels ensures that our variables are consistently named across all tables, plots, and reports, making our work more professional and easier to follow.

Pro Tip: When creating labels, strive for a balance between brevity and informativeness. Choose labels that are both concise and easily understandable by a broad audience.
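
Because var_name_mapping is a named character vector, labels can be looked up directly by variable name—handy for spot-checking:

# Look up the preferred label for a single variable
var_name_mapping[["depression_level_at_year_1"]]
#> [1] "Depression Level at Year 1"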

Step 3: Creating Data Frames for Analysis and Visualization

Now, we'll create specific data frames tailored for different aspects of our exploratory analysis:

# Prepare the data frames with the covariates of interest
analytic_all_proposed_covariates <- analytic_data_final |>
  select(all_of(all_proposed_covariates))

analytic_select_covariates <- analytic_data_final |>
  select(all_of(select_covariates))

# Define the undesired variables from the data frames to prepare for visualizations
variables_to_exclude_from_plots <- c("id",
                                     "event_status",
                                     "time_to_event_in_years",
                                     "time_to_censorship_in_years",
                                     "time_to_expiration_in_years",
                                     "age_at_censorship",
                                     "age_at_expiration")

all_proposed_covariates_for_plots <- analytic_all_proposed_covariates |>
  select(-all_of(variables_to_exclude_from_plots))

select_covariates_for_plots <- analytic_select_covariates |>
  select(-all_of(variables_to_exclude_from_plots))

What's happening here?
  1. Creating Data Frames for Analysis:

    • analytic_all_proposed_covariates: This data frame will contain all of the variables listed in all_proposed_covariates, providing a dataset for broad exploration.

    • analytic_select_covariates: This data frame will contain only the variables in select_covariates, providing a more focused dataset for targeted analyses and our primary survival models.

  2. Creating Data Frames for Visualization:

    • variables_to_exclude_from_plots: We define a list of variables that we generally don't want to include in our missing data visualizations (e.g., ID variables, event status, time-to-event).

    • all_proposed_covariates_for_plots and select_covariates_for_plots: We create two additional data frames, based on all_proposed_covariates and select_covariates, but with the variables_to_exclude_from_plots removed. These will be used specifically for our missing data visualizations.

Why It Matters
  • Tailored Datasets: We're creating data frames that are specifically designed for different analytical tasks. This keeps our workflow organized and efficient.

  • Optimized Visualizations: By excluding variables that are not informative for missing data visualizations (like ID numbers or time variables), we ensure that our plots are clear, focused, and easy to interpret.
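
A quick structural check (optional, but reassuring) confirms that the plotting data frames dropped the administrative variables:

# The difference should be exactly the seven excluded id/event/time/age variables
setdiff(names(analytic_all_proposed_covariates), names(all_proposed_covariates_for_plots))

# The focused plotting data frame should contain 10 covariates (17 minus the 7 excluded)
ncol(select_covariates_for_plots)
#> [1] 10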

Conceptual Takeaways: Preparing for Insightful Exploration

These steps—defining our covariates, assigning clear labels, and creating tailored data frames—are essential for setting the stage for a robust and insightful exploratory data analysis.

Here's why this preparation is so critical:

  • Balancing Breadth and Focus: We've created both comprehensive and focused variable lists, allowing us to explore our data broadly while also maintaining a clear focus on our primary research question.

  • Addressing Overfitting: We've taken proactive steps to mitigate the risk of overfitting in our survival models by carefully selecting the variables in select_covariates.

  • Enhancing Communication: Clear and descriptive variable labels ensure that our findings will be accessible and interpretable by a wide audience.

Looking Ahead: Visualizing Missingness and Summarizing Our Data

With our data frames prepared, we're now ready to embark on the exciting phase of exploratory data analysis! In the next sections, we'll:

  • Visualize Missing Data Patterns: We'll use specialized tools to examine the patterns of missingness in our dataset, helping us understand the potential impact of missing data on our analysis and choose appropriate imputation strategies.

  • Generate Descriptive Statistics: We'll create comprehensive tables that summarize the key characteristics of our study population, providing a detailed overview of our data.

By combining careful data preparation with insightful visualizations, we're setting the stage for building robust survival models and uncovering meaningful insights into the factors influencing long-term outcomes after TBI.

2.3 Crafting a Custom Theme

Introduction

As we prepare to delve into data visualization—particularly for exploring missing data patterns—it's essential to think about the aesthetics of our plots. A consistent and well-designed visual style not only makes our results more appealing but also enhances their interpretability and impact. In this section, we'll define a custom theme that will ensure our plots are both informative and visually engaging.

Think of this as choosing the right font, colors, and layout for a presentation. Just as a well-designed presentation can captivate an audience, a well-crafted visualization can make complex data more accessible and understandable.

Step 1: Defining a Custom Theme - The customization Object

Let's start by creating a custom theme object called customization. This object will store all of our aesthetic preferences, which we can then apply to our plots.

customization <- theme(
  title = element_text(family = "Proxima Nova", face = "bold", size = 20),
  legend.title = element_text(family = "Proxima Nova", face = "bold", size = 10),
  legend.text = element_text(family = "Proxima Nova", size = 9.5),
  axis.title.x = element_text(family = "Proxima Nova", face = "bold", size = 12, margin = margin(t = 10)),
  axis.title.y = element_text(family = "Proxima Nova", face = "bold", size = 12, margin = margin(r = 10)),
  axis.text = element_text(family = "Proxima Nova", size = 10),
  text = element_text(family = "Proxima Nova"),
  legend.position = "top"
)

What's happening here?

We're using the theme() function from the ggplot2 package to define various aspects of our plot's appearance:

  1. Font Choice:

    • We've selected "Proxima Nova" as our primary font. It's a modern, clean, and highly readable font, making our plots visually appealing and easy to understand. (If this font is not available on your system, you can replace it with a similar sans-serif font like "Arial" or "Helvetica.")

  2. Title Styling:

    • title = element_text(…): We're making our plot titles bold and setting their font size to 20, ensuring they stand out.

  3. Axis Labels:

    • axis.title.x = element_text(…) and axis.title.y = element_text(…): We're making our x- and y-axis labels bold with a font size of 12. We've also added margins (margin(t = 10) and margin(r = 10)) to create some space between the labels and the axis lines, improving readability.

  4. Axis Text:

    • axis.text = element_text(…): We're setting the font size of the tick labels on our axes to 10.

  5. Legend Formatting:

    • legend.title = element_text(…): We're making the legend title bold with a font size of 10.

    • legend.text = element_text(…): We're setting the font size of the legend text to 9.5 for better readability.

    • legend.position = "top": We're placing the legend at the top of the plot. This is often a good choice when dealing with plots that have many elements, as it helps to avoid visual clutter.

Why It Matters
  • Consistency: Applying a custom theme ensures that all our plots have a consistent look and feel, making our work more professional and easier to follow.

  • Enhanced Interpretability: Clear, readable fonts, well-placed legends, and appropriate spacing make it easier for our audience to grasp the key insights from our visualizations.

  • Accessibility: A clean and well-designed visual style makes our plots more accessible to a wider audience, including those who may not be familiar with the technical details of our analysis.

Integrating the Theme into Our Workflow

The customization theme will be applied to all the plots we create during our missing data analysis. This includes:

  • Counts of Missing Values: Bar plots or other visualizations summarizing the amount of missing data for each variable.

  • Missingness Patterns: More complex visualizations, like UpSet plots, that reveal how missing values are distributed across different combinations of variables.

By applying this theme consistently, we ensure that all our visualizations are not only informative but also visually appealing and easy to understand.
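
To see the theme in action before we build the real plots, here's a minimal sketch using R's built-in mtcars data (expect a font warning if "Proxima Nova" isn't installed on your system):

# Layer the custom theme onto any ggplot, just as we'll do for the missingness plots
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(title = "Theme Check", x = "Cylinders", y = "Count") +
  theme_classic() +
  customization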

Pro Tip: Saving and Reusing Your Theme

To make your own custom theme even more useful, you can save it as an R object and reuse it in future projects. Here's how:

# Save the theme for reuse
saveRDS(customization, file = "custom_theme.rds")

# Load and apply in other scripts
custom_theme <- readRDS("custom_theme.rds")

This allows you to maintain a consistent visual style across all your analyses without having to redefine the theme each time.

Looking Ahead: Bringing Our Data to Life with Visualizations

With our custom theme defined, we're now fully equipped to create impactful visualizations that will help us understand the patterns of missing data in our dataset. In the next sections, we'll define helper functions to prepare our data for plotting, and then we'll generate insightful visualizations, including UpSet plots, to explore the intricacies of missingness.

By combining careful data preparation with a polished visual style, we're setting the stage for a deeper understanding of our data and, ultimately, more reliable survival models.

2.4 Defining Helper Functions for Plotting Missingness

Introduction

Understanding the patterns of missing data in our dataset is a crucial step in preparing for survival analysis. Visualizing these patterns helps us identify potential biases, choose appropriate imputation strategies, and ultimately build more reliable models. In this section, we'll focus on creating helper functions that streamline the process of generating insightful visualizations of missing data, particularly using UpSet plots.

Step 1: Ensuring Valid Inputs

Before we start creating visualizations, we need to make sure that our functions are robust. We'll define two simple helper functions to validate our inputs and prevent errors down the line:

test_if_null: Checking for NULL inputs

test_if_null <- function(x){
  if (is.null(x)) {
    cli::cli_abort(
      c(
        "Input must not be NULL",
        "Input is {.cls {class(x)}}"
      )
    )
  }
}

What It Does
  • This function checks if the input x is NULL.

  • If it is NULL, it throws a clear error message using cli::cli_abort, indicating that the input must not be NULL and reporting the class of the provided input.

Why It's Important
  • NULL values can cause unexpected behavior in many R functions. By explicitly checking for them, we prevent our code from crashing and make debugging easier.

test_if_dataframe: Ensuring Data Frame Inputs

test_if_dataframe <- function(x){
  if (!inherits(x, "data.frame")) {
    cli::cli_abort(
      c(
        "Input must inherit from {.cls data.frame}",
        "We see class: {.cls {class(x)}}"
      )
    )
  }
}

What It Does
  • This function checks if the input x is a data frame using inherits(x, "data.frame").

  • If it's not a data frame, it throws a clear error message, indicating the expected class ("data.frame") and the actual class of the input.

Why It's Important
  • Many data manipulation and visualization functions in R expect data frames as input. This check ensures that our functions are used correctly.
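
A quick behavioral check (optional) shows both validators in action:

# Passes silently on a data frame; aborts with an informative message otherwise
test_if_dataframe(analytic_select_covariates)
try(test_if_dataframe(c(1, 2, 3)))  # errors: input must inherit from data.frame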

Step 2: Preparing Data for UpSet Plots - The as_shadow_upset_custom Function

UpSet plots are a powerful tool for visualizing the intersections of missing data across multiple variables. To create these plots, we need to transform our data into a specific format known as a "shadow matrix."

The as_shadow_upset_custom function handles this transformation:

as_shadow_upset_custom <- function(data, preferred_labels) {
  test_if_null(data)
  test_if_dataframe(data)

  if (n_var_miss(data) <= 1) {
    glu_st <- if (n_var_miss(data) == 1) {
      glue("upset plots for missing data require at least two variables to have missing data, only one variable, '{miss_var_which(data)}', has missing values.")
    } else {
      glue("upset plots for missing data require at least two variables to have missing data, there are no missing values in your data! This is probably a good thing.")
    }
    rlang::abort(message = glu_st)
  }

  data_shadow <- is.na(data) * 1
  colnames(data_shadow) <- sapply(colnames(data), function(x) preferred_labels[x])

  data_shadow <- as.data.frame(data_shadow)
  data_shadow <- data_shadow |>
    mutate(across(where(is.numeric), as.integer))

  return(data_shadow)
}

What It Does
  1. Input Validation:

    • It first uses our helper functions, test_if_null and test_if_dataframe, to ensure that the input data is not NULL and is a data frame.

    • It then checks if the number of variables with missing data is less than 2 using n_var_miss(data) <= 1 from the naniar package. If so, it throws an error because UpSet plots are most informative when visualizing the intersections of missingness between at least two variables.

  2. Shadow Matrix Creation:

    • data_shadow <- is.na(data) * 1: This is the core of the transformation. It creates a new data frame called data_shadow where each cell indicates whether the corresponding cell in the original data is missing (NA). The is.na(data) part generates a logical matrix (TRUE for missing, FALSE for not missing), and multiplying by 1 converts TRUE to 1 and FALSE to 0.

  3. Column Renaming:

    • colnames(data_shadow) <- sapply(colnames(data), function(x) preferred_labels[x]): This line renames the columns of the data_shadow data frame using the preferred_labels we defined earlier. This makes the resulting UpSet plot more interpretable.

  4. Data Frame Conversion and Integer Type:

    • data_shadow <- as.data.frame(data_shadow): The shadow matrix is converted to a data frame.

    • data_shadow <- data_shadow |> mutate(across(where(is.numeric), as.integer)): Then, any numeric columns in the data frame are converted to integers. This step ensures that the data frame is in the correct format for the upset function we will use later.

Why It's Important
  • Prepares Data for UpSet Plots: The shadow matrix format is required by the UpSetR package, which we'll use to create UpSet plots.

  • Enhances Interpretability: Using our preferred_labels to rename columns makes the UpSet plots easier to understand.
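
To make the transformation concrete, here's a toy example (hypothetical data, not our TBIMS dataset) showing the shadow matrix it produces—1 marks a missing cell, 0 an observed one:

# Toy illustration of the shadow-matrix transformation
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"))
toy_labels <- c(a = "var_a", b = "var_b")
as_shadow_upset_custom(toy, toy_labels)
#>   var_a var_b
#> 1     0     1
#> 2     1     1
#> 3     0     0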

Step 3: Creating Customized UpSet Plots - The gg_miss_upset_custom Function

Finally, we define a function called gg_miss_upset_custom to generate our customized UpSet plots:

gg_miss_upset_custom <- function(data, preferred_labels, order.by = "freq", set_size.show = TRUE, set_size.numbers_size = 4.5, ...) {
  data_shadow <- as_shadow_upset_custom(data, preferred_labels)
  UpSetR::upset(data_shadow, order.by = order.by,
                set_size.show = set_size.show,
                set_size.numbers_size = set_size.numbers_size,
                ...)
}

What It Does
  1. Data Transformation: It calls our as_shadow_upset_custom function to transform the input data into the required shadow matrix format.

  2. UpSet Plot Generation: It uses the upset() function from the UpSetR package to create the UpSet plot.

    • order.by = "freq": This argument sorts the intersections in the plot by frequency (how common they are).

    • set_size.show = TRUE: This argument ensures that the plot displays the size of each set (variable).

    • set_size.numbers_size = 4.5: This argument controls the font size of the set sizes.

    • ...: The ellipsis (dots) argument allows us to pass additional arguments through to the upset() function for further customization.

Why It's Important
  • Visualizes Missing Data Intersections: UpSet plots are excellent for visualizing how missing values are distributed across different combinations of variables. They reveal patterns of missingness that might not be apparent from simple summaries.

  • Customization: The function allows us to customize the plot's appearance and sorting, making it easier to highlight the most relevant patterns.

Conceptual Reasons: Why These Functions Matter

These helper functions embody important principles for good data analysis:

  1. Reusability: We can easily reuse these functions with different datasets or different sets of variables, saving us time and effort in future analyses.

  2. Error Handling: The input validation checks help prevent errors and make our code more robust.

  3. Interpretability: By using preferred_labels, we ensure that our visualizations are clear and understandable to a wider audience.

Looking Ahead: Visualizing Missingness and Summarizing Our Data

With these helper functions in place, we're ready to generate insightful visualizations of missing data patterns and create comprehensive descriptive statistics tables. In the next section, we'll use these tools to explore the extent and nature of missingness in our TBIMS dataset, examining the intersections of missing data and summarizing the characteristics of our study population. This exploration will pave the way for building reliable survival models.

2.5 Visualizing Missingness Counts

Introduction

Before we can make informed decisions about how to handle missing data, we need to understand its extent and nature. Are values missing completely at random, or are there underlying patterns that could bias our results? This is where visualizing missingness comes in.

In this section, we'll focus on creating clear and informative plots that reveal the patterns of missing data in our dataset, specifically using bar plots to identify variables with substantial missingness. These visualizations will be crucial for guiding our choices regarding imputation or, in some cases, the exclusion of certain variables or participants.

Step 1: Defining a Helper Function to Count Missing Values - prepare_na_counts_df

First, let's create a helper function called prepare_na_counts_df that will streamline the process of calculating and storing the number of missing values for each variable.

# Function to calculate and store the number of missing values for each measure
prepare_na_counts_df <- function(df) {
  # Calculate the number of missing values for each covariate
  na_counts <- sapply(df, function(x) sum(is.na(x)))

  # Store the NA counts for each covariate in a data frame
  labels_df <- data.frame(
    variable = names(na_counts),
    n_missing = na_counts
  )
  return(labels_df)
}

What's happening here?
  1. Purpose: This function takes a data frame (df) as input and calculates the number of missing values (NAs) in each column (variable).

  2. sapply() for Efficiency: The sapply() function efficiently applies the sum(is.na(x)) function to each column of the data frame. This function counts the number of NA values in each column.

  3. Storing Results: The function then neatly packages the variable names and their corresponding missing value counts into a data frame called labels_df.

Why It's Important
  • Centralized Information: This function consolidates all of the missingness information into a single, easy-to-use data frame.

  • Foundation for Visualization: The labels_df data frame will be used to create our visualizations.

  • Reusability: We can reuse this function with different data frames, making it a valuable tool for our workflow.
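
Here's a toy check (hypothetical data) of what the function returns:

# NA counts per column, packaged as a data frame
prepare_na_counts_df(data.frame(x = c(1, NA, NA), y = c("a", NA, "b")))
#>   variable n_missing
#> x        x         2
#> y        y         1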

Step 2: Preparing Missing Value Counts for Different Covariate Sets

Now, we'll use our prepare_na_counts_df function to calculate missing value counts for both all_proposed_covariates and select_covariates, the two sets of variables that we defined earlier. We will also save these results to .rds files for future reference.

# Prepare the NA counts for all and select covariates
na_counts_for_all_proposed_covariates <- prepare_na_counts_df(all_proposed_covariates_for_plots)
na_counts_for_select_covariates <- prepare_na_counts_df(select_covariates_for_plots)

# Save the NA counts data frames in single .rds files
saveRDS(na_counts_for_all_proposed_covariates, file.path(data_processed_dir, "na_counts_for_all_proposed_covariates.rds"))
saveRDS(na_counts_for_select_covariates, file.path(data_processed_dir, "na_counts_for_select_covariates.rds"))

What's happening here?

  1. Calculating Missing Value Counts: We call prepare_na_counts_df twice—once for each of our covariate sets—creating two data frames: na_counts_for_all_proposed_covariates and na_counts_for_select_covariates.

  2. Saving Results: We use saveRDS() to save these data frames as .rds files. This allows us to easily load them later without having to recalculate the missing value counts.

Why It Matters

  • Targeted Exploration: This allows us to examine missingness patterns specifically within our two sets of covariates, informing our decisions about which variables to prioritize in our analysis.

  • Reproducibility: Saving these intermediate results ensures that our analysis is reproducible and that we can easily revisit these missing value counts later if needed.

Step 3: Visualizing Missing Value Counts with Bar Plots

Now comes the exciting part: visualizing the missing data! We'll create bar plots that show the number of missing values for each variable, making it easy to identify variables that might require special attention.

Figure 1-1: Missing Value Counts for All Proposed Covariates

# Plot the number of missing values across all proposed covariates
figure_1_1 <- gg_miss_var(all_proposed_covariates_for_plots) +
  labs(x = "Covariates",
       y = "Number of Missing Values"
  ) +
  scale_x_discrete(labels = var_name_mapping) +
  scale_y_continuous(labels = label_comma()) +
  theme_classic() +
  customization +
  geom_text(
    data = na_counts_for_all_proposed_covariates,
    aes(x = variable, y = n_missing, label = scales::comma(n_missing)), # Position and format labels
    vjust = -0.75,
    hjust = 0.5,
    size = 3.25,
    family = "Proxima Nova"
  )

ggsave(file.path(missingness_plots_dir, "figure_1-1_all_proposed_covariates_missing_value_counts.png"),
       figure_1_1, dpi = 300)

Figure 1-2: Missing Value Counts for Select Covariates

# Plot the number of missing values across select covariates
figure_1_2 <- gg_miss_var(select_covariates_for_plots) +
  labs(x = "Covariates",
       y = "Number of Missing Values"
  ) +
  scale_x_discrete(labels = var_name_mapping) +
  scale_y_continuous(labels = label_comma()) +
  theme_classic() +
  customization +
  geom_text(
    data = na_counts_for_select_covariates,
    aes(x = variable, y = n_missing, label = scales::comma(n_missing)), # Position and format labels
    vjust = -0.75,
    hjust = 0.5,
    size = 3.25,
    family = "Proxima Nova"
  )

ggsave(file.path(missingness_plots_dir, "figure_1-2_select_covariates_missing_value_counts.png"),
       figure_1_2, dpi = 300)
What's happening here?
  1. gg_miss_var(): This function from the naniar package creates a bar plot showing the number of missing values for each variable in the input data frame.

  2. Customization:

    • labs(): We add clear labels for the x and y axes.

    • scale_x_discrete(labels = var_name_mapping): We use our var_name_mapping to replace the original variable names with our descriptive labels on the x-axis.

    • scale_y_continuous(labels = label_comma()): We format the y-axis labels with commas for better readability of large numbers.

    • theme_classic(): We apply a clean, classic theme to the plot.

    • customization: We apply our custom theme, defined earlier, for consistent plot aesthetics.

    • geom_text(): We add text labels above each bar, displaying the exact number of missing values. This provides additional clarity and precision.

Why It Matters
  • At-a-Glance Summary: These plots provide a quick and intuitive overview of the extent of missingness for each variable.

  • Informs Imputation/Exclusion Decisions: Variables with high levels of missingness might require imputation or, in some cases, might need to be excluded from certain analyses. These plots help us make informed decisions about how to handle missing data.

  • Publication-Ready Visuals: The customizations ensure that our plots are clear, informative, and visually appealing, suitable for inclusion in reports or publications.
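
If you also want percentages alongside the counts, naniar offers a complementary numeric summary that pairs nicely with these plots:

# Counts and percentages of missing values per variable
miss_var_summary(select_covariates_for_plots)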

Interpreting the Bar Plot of Missing Values for Select Covariates

Before we move on to more complex visualizations of missing data, let's take a moment to interpret the bar plot we created for our select_covariates (Figure 1-2). This plot provides a clear, visual summary of the number of missing values for each variable in our focused set of covariates.

What the Plot Reveals

The bar plot instantly reveals several key insights into the missingness patterns within our select_covariates:

  1. High Levels of Missingness in Mental Health and Function Factor Score Variables: The most striking observation is the substantial number of missing values in several key variables, particularly those related to mental health and the Function Factor Score:

    • Mental Health Variables:

      • Depression Level at Year 1 has 1,530 missing values.

      • History of Suicide Attempt has 1,522 missing values.

      • History of Mental Health Treatment has 1,447 missing values.

      • Problematic Substance Use at Injury has 1,131 missing values.

    • Functional Independence:

      • Function Factor Score at Year 1 (and its derived quintiles) has 845 missing values.

  2. Relatively Low Levels of Missingness in Baseline Variables: Variables measured at baseline or during the initial rehabilitation period generally have much lower levels of missingness:

    • Educational Attainment at Injury has 27 missing values.

    • Medicaid Status has 17 missing values.

    • Sex has only 1 missing value.

    • Age at Injury has no missing values.

Why These Patterns Matter:
  • Imputation Needs: The high levels of missingness in these crucial variables, especially those related to mental health and functional status, underscore the need for careful imputation. Simply discarding participants with missing data on these variables would drastically reduce our sample size and may also introduce bias.

  • Potential Biases: The fact that missingness is concentrated in mental health-related variables raises concerns about potential biases. Stigma surrounding mental health issues might make participants less likely to disclose this information, leading to higher rates of missingness. Additionally, participants with more severe mental health challenges might be more difficult to reach for follow-up.

  • Informing Imputation Strategies: Understanding which variables have the most missing data, and the potential reasons for this, is crucial for selecting appropriate imputation methods. The high correlation between missingness in depression level, suicide attempt history, and mental health treatment history suggests that these variables might be missing for similar reasons. Furthermore, the correlation between missingness in the mental health variables and Function Factor Score at Year 1 suggests that these variables may be missing due to similar underlying factors. A multivariate imputation approach, which takes into account the relationships between variables, might be particularly appropriate here.

  • Prioritizing Variables: The plot helps us prioritize our efforts in addressing missing data. Clearly, the mental health and functional status variables require the most attention.

Figure 1-2: A Visual Guide

The bar plot provides a clear visual representation of these patterns. Each bar represents a variable, and the length of the bar corresponds to the number of missing values. The exact counts are also displayed above each bar for precision.

By examining this plot, we can quickly grasp the extent of missingness in our key variables and start planning our strategy for addressing it.

Looking Ahead: Unveiling Deeper Patterns with UpSet Plots

While this bar plot provides a valuable overview of missingness per variable, it doesn't reveal how missing values are related across variables. For instance, are the participants with missing Depression Level at Year 1 also missing Function Factor Score at Year 1? Or are these distinct groups?

To answer these questions, we'll turn to UpSet plots in the next section. These powerful visualizations will allow us to explore the intersection of missing data, revealing complex patterns that can inform our imputation strategies and help us build more robust survival models. We will also use the gtsummary package to create descriptive statistics tables that will help us further explore our data and prepare for survival modeling.

2.6 Visualizing Missingness Patterns

Introduction

We've calculated the amount of missing data for each variable, but to truly understand the nature of missingness in our dataset, we need to go a step further. We need to explore how missing values are related across variables. Are certain variables frequently missing together? Are there distinct patterns of missingness that could inform our imputation strategies?

This is where UpSet plots come in. These powerful visualizations are specifically designed to reveal the intersections of missing data, showing us which combinations of variables tend to be missing simultaneously. By visualizing these patterns, we can gain valuable insights that will guide our decisions about how to handle missing data in our survival analysis.

In this section, we'll generate UpSet plots for two key sets of covariates:

  1. All Proposed Covariates: This gives us a broad overview of missingness across all of the variables that we initially considered for our analysis.

  2. Select Covariates: This focuses on the final set of variables chosen for our Cox regression models after addressing potential overfitting.

Why UpSet Plots? A Powerful Tool for Exploring Missing Data

UpSet plots are particularly well-suited for visualizing missing data patterns because they:

  1. Reveal Intersections: They show us which combinations of variables are frequently missing together, helping us understand the relationships between missing values. For example, we might discover that participants who are missing Depression Level at Year 1 are often also missing Function Factor Score at Year 1.

  2. Guide Imputation Strategies: The patterns revealed by UpSet plots can inform our choice of imputation methods. For instance, if we see that several variables are often missing together, a multivariate imputation approach might be more appropriate than imputing each variable independently.

  3. Highlight Potential Biases: UpSet plots can help us identify potential biases related to missing data. If certain groups of participants are more likely to have missing values on specific combinations of variables, this could affect the generalizability of our findings.

Step 1: Creating an UpSet Plot for All Proposed Covariates

Let's start by generating an UpSet plot for all of our proposed covariates. This will give us a broad overview of missingness patterns across the entire dataset.

# Generate an UpSet plot of missing values across all proposed covariates
file_path <- file.path(missingness_plots_dir, "figure_2-1_all_proposed_covariates_missing_patterns_plot.png")

# Open a PNG graphics device with adjusted parameters
png(file_path, width = 3500, height = 1600, res = 300)

# Generate the plot
gg_miss_upset_custom(all_proposed_covariates_for_plots, 
                     var_name_mapping,
                     nsets = 5)

# Close the graphics device
dev.off()
What's happening here?
  1. file_path: We define the file path where the plot will be saved, within our missingness_plots_dir directory.

  2. png(…): This function opens a PNG graphics device, which means that the plot we create will be saved as a .png file. We specify the width, height, and resolution (res) of the image.

  3. gg_miss_upset_custom(…): This is our custom function (defined in "Section 2.4 Defining Helper Functions for Plotting Missingness") that generates the UpSet plot.

    • all_proposed_covariates_for_plots: This is the data frame that contains all proposed covariates, excluding those we deemed irrelevant for plotting.

    • var_name_mapping: This provides user-friendly labels for the variables in the plot.

    • nsets = 5: This argument, passed through to UpSetR::upset(), limits the plot to the five variables (sets) with the most missing values, keeping the display focused and readable.

  4. dev.off(): This closes the PNG graphics device, saving the plot to the specified file.

Conceptual Breakdown
  • High-Resolution Plots: We save the plot as a high-resolution PNG file to ensure clarity and readability, especially since UpSet plots can become quite complex.

  • Customization: Our gg_miss_upset_custom function handles the data transformation needed for UpSet plots and applies our preferred variable labels for better interpretability. By limiting the plot to the five variables with the most missingness (nsets = 5), we keep the focus on the most prevalent missing data patterns.

Step 2: Creating an UpSet Plot for Select Covariates

Next, we'll generate an UpSet plot specifically for our select_covariates—the variables included in our final Cox regression models.

# Generate an UpSet plot of missing values across select covariates
file_path <- file.path(missingness_plots_dir, "figure_2-2_select_covariates_missing_patterns_plot.png")

# Open a PNG graphics device with adjusted parameters
png(file_path, width = 3500, height = 1600, res = 300)

# Generate the plot
gg_miss_upset_custom(select_covariates_for_plots, 
                     var_name_mapping,
                     nsets = 5)

# Close the graphics device
dev.off()
What's happening here?
  • This code is very similar to the previous example, but it uses select_covariates_for_plots as the input data, focusing on our core set of predictor variables.

Why It Matters
  • Focus on Key Variables: By examining missingness patterns in our select_covariates, we can directly address the missing data issues that are most relevant to our final models.

  • Model-Specific Insights: This plot helps us understand how missing data might impact the specific variables included in our Cox regression analysis.

Interpreting UpSet Plots: Decoding the Patterns

UpSet plots might look a bit complex at first, but they are incredibly informative once you understand how to read them. Here's a quick guide:

  1. Rows in the Intersection Matrix (Bottom Part):

    • Each row represents a variable in our dataset.

    • Filled dots in a row indicate that the variable is part of a specific intersection (combination) of missing data.

  2. Bars Above the Intersection Matrix (Top Part):

    • Each bar represents a unique combination of variables with missing data (an intersection).

    • The height of the bar indicates the number of observations that have that specific pattern of missingness.

Example:

Imagine a bar that corresponds to filled dots for Depression Level at Year 1, History of Suicide Attempt, and History of Mental Health Treatment. The height of that bar would tell us how many participants are missing data on all three of those variables simultaneously.

Insights We Can Gain
  • Common Missingness Patterns: We can identify which combinations of variables are frequently missing together. This can reveal underlying reasons for missing data (e.g., variables coming from the same questionnaire or requiring similar data collection procedures).

  • Isolated vs. Overlapping Missingness: We can see whether variables are often missing independently or if they tend to be missing in conjunction with other specific variables.

  • Guiding Imputation: These insights are invaluable for choosing appropriate imputation strategies. For example, if variables are frequently missing together, a multivariate imputation method might be necessary.
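
If you'd like a tabular cross-check of the intersection bars, here's a small sketch that labels each row by the set of variables it is missing and then tabulates the results—the top counts should line up with the tallest bars in the UpSet plot:

# Count distinct missingness patterns directly from the data
shadow <- is.na(select_covariates_for_plots)
pattern <- apply(shadow, 1, function(row) paste(colnames(shadow)[row], collapse = " + "))
pattern[pattern == ""] <- "(complete case)"
head(sort(table(pattern), decreasing = TRUE))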

Practical Takeaways: Turning Visualizations into Action

The UpSet plots that we generate provide actionable insights that will directly inform our data preprocessing decisions:

  • Joint Imputation: If we observe that certain variables are frequently missing together, we might consider using imputation methods that can handle missing variables simultaneously (e.g., multivariate imputation by chained equations [MICE]).

  • Sensitivity Analyses: If critical variables have substantial missingness, we might need to perform sensitivity analyses to assess the potential impact of excluding these variables or using different imputation methods.

  • Efficiency and Transparency: Saving our plots as .png files ensures that we can easily share them with collaborators, include them in reports, and maintain a clear record of our data exploration process.

Interpreting the UpSet Plot of Missingness for Select Covariates

Now that we've generated our UpSet plot for the select_covariates (Figure 2-2), let's dive into its interpretation. This plot provides a powerful visual representation of how missing values overlap across our key variables. By understanding these patterns, we can make more informed decisions about how to handle missing data in our survival analysis.

Understanding the Structure of the Plot

Recall that the UpSet plot displays:

  • Rows (Left Side): Each row represents one of our select_covariates. The horizontal bars represent the number of missing observations for that variable (the "set size").

  • Intersection Matrix (Bottom Right): The matrix of dots shows the different combinations of variables where missingness overlaps. Each column represents a unique intersection.

  • Vertical Bars (Top Right): The vertical bars above the matrix represent the size (number of observations) of each intersection—in other words, the number of participants who have that specific pattern of missing data.

Key Observations from Figure 2-2

Let's analyze the key patterns revealed by our UpSet plot:

  1. Dominant Missingness Patterns: The tallest bars on the plot highlight the most common missing data patterns. The most frequent pattern is missingness in Problematic Substance Use at Injury (552 observations). The second most frequent pattern involves simultaneous missingness across Function Factor Score at Year 1 Quintiles, History of Mental Health Treatment, History of Suicide Attempt, and Depression Level at Year 1 (544 observations). The third most frequent pattern is missingness only in Depression Level at Year 1 (417 observations), and the fourth involves missingness in History of Mental Health Treatment and History of Suicide Attempt (401 observations). Together, these four patterns account for the majority of observations with missing data. (A quick way to cross-check these counts in code is sketched after this list.)

  2. Clustering of Missingness in Mental Health Variables: The connected dots in the intersection matrix reveal a strong tendency for missing values to cluster within our mental health variables (History of Mental Health Treatment, History of Suicide Attempt, and Depression Level at Year 1) and Function Factor Score at Year 1 Quintiles. This suggests that participants who are missing data on one of these variables are also likely to be missing data on others.
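
If you want to verify these counts outside the plot, a small sketch like the one below tabulates every distinct missingness pattern with dplyr, assuming select_covariates_for_plots is a tibble of the plotted variables (the counts are computed from the data, not hard-coded):

# Tally each distinct pattern of missingness across the plotted variables
library(dplyr)

missingness_patterns <- select_covariates_for_plots %>%
  mutate(across(everything(), is.na)) %>%      # TRUE where a value is missing
  count(across(everything()), sort = TRUE)     # one row per distinct pattern

# The top rows (ignoring the all-FALSE complete-case pattern, if present)
# should mirror the tallest bars in the UpSet plot
head(missingness_patterns)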

Implications for Our Analysis
  • Multivariate Imputation: The strong clustering of missingness among our mental health variables and functional status variable suggests that a multivariate imputation approach might be the most appropriate. This type of imputation takes into account the relationship between variables when filling in missing values, potentially leading to more accurate and less biased results.

  • Potential Biases: The observed patterns raise concerns about potential biases. For instance, participants who are unwilling to disclose information about their mental health might also be less likely to participate in follow-up assessments, leading to missing data on other variables. We'll need to carefully consider these potential biases when interpreting our findings.

  • Focus on Key Variables: The UpSet plot confirms that missingness is concentrated in our select_covariates, particularly the mental health variables and the function factor score variable. This justifies our focus on these variables during the imputation process.

Figure 2-2: A Visual Guide to Missingness

By carefully examining the UpSet plot, we can quickly identify the dominant patterns of missing data and begin to formulate hypotheses about the underlying reasons for this missingness. This information is crucial for making informed decisions about how to proceed with imputation and ultimately build reliable survival models. The plot also allows us to quickly communicate these issues to others, including collaborators or individuals reviewing our work.

What's Next: Summarizing Our Data with Descriptive Statistics

Having visualized the patterns of missingness in our data, we're now ready to move on to the next crucial step of exploratory data analysis: generating descriptive statistics tables. These tables will provide a comprehensive summary of our study population's characteristics, stratified by depression levels at Year 1. This will allow us to further explore our data, identify potential confounders, and refine our hypotheses before proceeding to survival modeling.

Conclusion

We've reached a critical juncture in our survival analysis journey. We've taken raw, complex data and transformed it into a meticulously prepared, analysis-ready dataset. This hasn't just been about cleaning and organizing; it's been about crafting a powerful resource that will enable us to unlock meaningful insights into the relationship between depression one year post-TBI and long-term survival.

This installment focused on understanding and addressing the critical issue of missing data. We've explored its patterns, visualized its complexities using bar plots and UpSet plots, and made a key decision about how to handle it for this stage of our analysis. While we acknowledge that sophisticated methods like multiple imputation offer powerful ways to handle missing data—and we will explore them in detail in a forthcoming blog series—we opted for listwise deletion (complete-case analysis) in this specific study of depression's impact on all-cause mortality.

Why Listwise Deletion Here?

Our choice was driven by a need for straightforward interpretation and a streamlined analytic process. Listwise deletion, while potentially introducing bias if the data are not Missing Completely at Random (MCAR), allows us to work with a readily defined subset of our data where all variables of interest have complete information. This approach offers several advantages in the context of this introductory blog series:

  • Simplicity and Clarity: It provides a clear and easy-to-understand starting point for our analysis, making it easier to explain the core concepts of survival analysis without the added complexity of imputation.

  • Transparency: The impact of listwise deletion on our sample size is readily apparent, allowing readers to clearly see the population on which our findings are based.

  • Sufficient Power: Despite the reduction in sample size that comes with listwise deletion, our initial assessments indicated that we still retain sufficient statistical power for meaningful analysis. We will demonstrate this in the next blog post when we present our descriptive statistics tables.

We recognize the trade-off between the simplicity of listwise deletion and the potential for bias. However, for this specific analysis, we prioritized presenting a clear and direct examination of the relationship between depression and mortality in a readily interpretable manner.
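
For completeness, here is a minimal sketch of how the complete-case sample could be constructed in R. The object names analysis_data and select_covariates are assumptions standing in for the analysis dataset and the covariate vector defined earlier in this series:

# Keep only observations with complete data on the selected covariates
library(dplyr)
library(tidyr)

complete_case_data <- analysis_data %>%
  drop_na(all_of(select_covariates))

# How many observations did listwise deletion remove?
nrow(analysis_data) - nrow(complete_case_data)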

A Dedicated Series on Multiple Imputation

It's important to emphasize that we do not dismiss the value of multiple imputation. In fact, we are planning a dedicated blog series that will delve into the intricacies of this powerful technique. That series will provide a comprehensive resource on multiple imputation, covering its theoretical underpinnings, practical implementation in R, and its application in the context of survival analysis. We will revisit the TBIMS data in that series to demonstrate the application of multiple imputation to this dataset.

For now, our complete-case sample provides a solid foundation for our initial exploration and modeling.

Reflecting on Our Progress: A Recap of Key Accomplishments

Let's take a moment to appreciate the significant strides we've made:

  1. Setting Up a Robust R Environment: We began by establishing a reproducible R environment, organizing our project directory, and loading essential libraries. This seemingly simple step is the bedrock of a streamlined and efficient workflow.

  2. Defining and Refining Our Variables: We carefully defined our covariates of interest, creating both comprehensive and focused lists to support different stages of analysis. We also assigned clear, descriptive labels to make our data more accessible and interpretable. We made key decisions to refine our variable list to avoid overfitting in our final models, and we made these decisions transparent and reproducible by documenting them in our code.

  3. Creating a Polished Visual Style: We crafted a custom theme for our plots, ensuring that our visualizations will be both informative and visually engaging, effectively communicating our findings to a wide audience.

  4. Illuminating Missing Data Patterns: We used bar plots and UpSet plots to visualize the extent and nature of missingness in our data. These visualizations provided crucial insights into which variables are most affected by missing data and how missing values cluster together, informing our imputation strategies.

  5. Extracting, Imputing, and Transforming Key Variables: We extracted and imputed Year 1 variables, created comprehensive mental health history variables, and transformed a skewed continuous variable into quintiles.

  6. Applying Eligibility Criteria and Handling Special Cases: We carefully applied our study's eligibility criteria, defining our analysis sample and thoughtfully handling cases with incomplete follow-up data.

Why This Matters

Addressing missing data is about ensuring that our analysis is scientifically sound and ethically responsible. By understanding and visualizing missingness patterns, we've taken crucial steps to:

  • Mitigate Potential Biases: We've gained a deeper understanding of why data might be missing, allowing us to choose appropriate imputation methods and minimize the risk of biased results.

  • Maximize the Value of Our Data: Thoughtful handling of missing data allows us to retain as much information as possible, enhancing the statistical power of our study and the generalizability of our findings.

  • Ensure Transparency and Reproducibility: Our detailed logging and documentation ensure that our entire process is transparent and reproducible, building trust in our findings and allowing others to build upon our work.

Looking Ahead: From Exploration to Modeling - Transforming Data into Insights

With our missing data patterns illuminated and our dataset prepared, we're now poised to enter the next phase of our analysis.

In the upcoming posts, we will:

  • Explore our data through descriptive statistics: We will generate comprehensive summaries of our study population, examining the distribution of key variables and identifying potential relationships. We will stratify these summaries by depression level at year 1 to gain a better understanding of how these groups may differ.

  • Visualize survival patterns: We'll use Kaplan-Meier curves and other powerful visualizations to bring our data to life, revealing survival trends and patterns that will inform our modeling choices.

  • Build and interpret our survival models: We will use Cox regression to examine the relationship between depression and mortality, ultimately answering our core research question with precision and clarity.

We're not just analyzing data; we're uncovering a story about recovery, resilience, and the factors that shape long-term outcomes for individuals with TBI. This carefully prepared dataset is the foundation of that story, and the forthcoming blog series on multiple imputation will add another layer of depth to this narrative.
