Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 2)
January 13, 2025
Featured
Research
Tutorials
Introduction
Welcome back to our immersive journey into the world of survival analysis! We've covered the fundamentals of data import, cleaning, and merging. Now, it's time to delve into the more advanced, yet equally crucial, data preprocessing techniques that will elevate our analysis to the next level.
In this installment, we'll tackle the intricacies of handling missing data, transforming skewed variables, optimizing categorical data, and ensuring consistency across time points. These steps are not mere formalities; they are the essential ingredients that will allow us to build robust survival models and extract meaningful, actionable insights.
Our overarching goal remains the same: to understand how depression one year after a traumatic brain injury (TBI) influences all-cause mortality within the subsequent five years. By mastering these advanced preprocessing techniques, we're setting the stage for uncovering answers that could ultimately improve the lives of individuals with TBI.
The Power of Precision: Why These Steps Are Essential
Think of these preprocessing steps as fine-tuning a high-performance engine. Each adjustment, each refinement, ensures that our analysis runs smoothly and efficiently, producing the most accurate and reliable results.
Here's a glimpse of what we'll accomplish in this post:
1.6 Data Handling and Imputation
We'll confront the challenge of missing data head-on, focusing on the critical Year 1 follow-up interview date. This date is our anchor point, defining the start of each participant's five-year observation period. We'll learn how to:
Master Missing Data: Learn how to impute critical missing variables like the Year 1 follow-up interview date, anchoring our timeline for accurate survival analysis.
Retain Participants: Discover how imputation preserves sample size and prevents bias, ensuring the reliability of our findings.
1.7 Transforming Continuous Variables into Quintiles
Tackle Skewed Data: See how transforming skewed variables, like functional independence scores, into quintiles makes them more interpretable and robust for modeling.
Simplify Interpretation: Learn how quintiles reduce the impact of outliers and create meaningful comparisons across ordered categories.
1.8 Optimizing Factor Variables
Refine Categorical Data: Explore how to recode, collapse, and reorder categorical variables for interpretability and statistical power in Cox regression.
Set Meaningful References: Discover how setting strategic reference levels improves the clarity and significance of hazard ratio comparisons.
1.9 Carrying Forward Baseline Variables
Ensure Data Consistency: Use the Last Observation Carried Forward (LOCF) method to propagate baseline variables across time points.
Prepare for Analysis: Guarantee that critical baseline data are consistently represented, regardless of the observation selected for survival analysis.
Why This Matters: Building a Reliable Foundation
These data preprocessing steps are not just about cleaning and organizing data; they're about building a solid foundation for a reliable and insightful survival analysis. By mastering these techniques, we're ensuring that our models are built on the best possible data, leading to results that are:
Accurate and Trustworthy: Careful imputation and variable refinement minimize bias and enhance the statistical validity of our findings.
Interpretable and Meaningful: Well-defined categories and clear labels make our results easier to understand and communicate, both to technical and non-technical audiences.
Reproducible and Transparent: A well-documented preprocessing workflow ensures that our analysis can be replicated and validated by others, strengthening the credibility of our work.
Throughout this post, we'll provide clear, step-by-step R code examples and plain language explanations of the "why" behind each technique. Whether you're a seasoned data analyst or just starting your survival analysis journey, you'll gain valuable skills and insights that you can apply to your own projects.
1.6 Data Handling and Imputation
Introduction
We've reached a critical juncture in our data preprocessing journey: handling missing data and imputing key variables. In this section, we'll focus on one particularly important variable: the Year 1 follow-up interview date. This date is essential because it marks the beginning of each participant's five-year observation period in our survival analysis. Getting this right is crucial for the accuracy and reliability of our results.
Think of it like setting the starting line for a race. If the starting line is unclear or different for each runner, the race results won't be meaningful. Similarly, we need a clearly defined and consistent starting point for each participant's survival timeline.
Why This Matters: The Foundation of Time-to-Event
The Year 1 follow-up interview date is our anchor point. It's the reference from which we'll calculate other crucial time-related variables, such as time to death or time to censoring. If this date is missing or inconsistently defined, our entire survival analysis could be compromised.
Imputation—the process of filling in missing values with estimated ones—helps us ensure that every participant has a defined start date. This allows us to retain as many participants as possible in our analysis and maintain the integrity of our longitudinal data.
Step-by-Step: Creating and Imputing the Year 1 Follow-Up Date
Let's break down the process into two key steps:
Step 1: Pinpointing the Year 1 Follow-Up Date
First, we need to identify the precise date of each participant's Year 1 follow-up interview. Here's how we do it:
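Here's a minimal dplyr sketch of this step (the data frame and column names come from the explanation below; the exact pipeline may differ, and the handling of all-missing dates shown here is one reasonable approach rather than the only one):

```r
library(dplyr)

# Identify each participant's earliest Year 1 follow-up interview date
year_1_dates <- merged_data |>
  filter(data_collection_period == 1) |>
  group_by(id) |>
  summarise(
    date_of_year_1_followup = min(date_of_followup, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # min() returns Inf (with a warning) when all dates are missing; recode those to NA
  mutate(
    date_of_year_1_followup = if_else(
      is.infinite(date_of_year_1_followup), as.Date(NA), date_of_year_1_followup
    )
  ) |>
  # Drop participants with no usable Year 1 follow-up date
  filter(!is.na(date_of_year_1_followup))

# Merge the calculated Year 1 dates back into the main dataset
merged_data <- merged_data |>
  left_join(year_1_dates, by = "id")
```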
Explanation
Focus on Year 1: We filter our merged_data to select only those records where data_collection_period is equal to 1, representing the Year 1 assessment.
Find the Earliest Date: For each participant (grouped by id), we use the min() function to find the earliest date_of_followup within that year. The na.rm = TRUE argument ensures that min() ignores missing values.
Handle Missing Dates: If all follow-up dates are missing for a participant in Year 1, we assign NA to date_of_year_1_followup. We then remove those participants entirely with filter(!is.na(date_of_year_1_followup)).
Merge Back: We use a left_join to merge these calculated Year 1 follow-up dates back into our main merged_data, matching on id.
Step 2: Filling in the Gaps - Imputing Across Observations
Many participants have multiple records in our dataset, corresponding to different assessments or time points. To ensure consistency, we need to "fill in" the date_of_year_1_followup for all of a participant's records, even if it was only explicitly recorded in one.
Here's how we impute the data:
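A minimal sketch of the imputation step with tidyr's fill():

```r
library(dplyr)
library(tidyr)

# Propagate the Year 1 follow-up date to all of a participant's records
merged_data <- merged_data |>
  group_by(id) |>
  fill(date_of_year_1_followup, .direction = "downup") |>
  ungroup()
```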
Explanation
Group by Participant: We group the data by id so that the imputation happens within each participant's set of records.
Impute with fill(): The fill() function from the tidyr package is our workhorse here. It propagates non-missing values of date_of_year_1_followup both downwards and upwards (.direction = "downup") within each participant's records. This effectively fills in any gaps.
Ungroup: Finally, we ungroup the data to prepare it for subsequent steps.
Key Concept: Imputation
Imputation is a powerful technique for dealing with missing data. In this case, we're using a simple but effective method: carrying the Year 1 follow-up date forward and backward across all of a participant's records. This ensures that every observation for that individual is associated with the same starting point for the five-year observation period.
Why Imputation is Crucial for Survival Analysis
Missing data is a common challenge in longitudinal studies, and survival analysis is particularly sensitive to it. Here's why imputation of the Year 1 follow-up date is so important in this context:
Preserving Sample Size: If we simply discarded participants with any missing data, we might lose a substantial portion of our sample, reducing the statistical power of our analysis.
Avoiding Bias: Missingness is often not random. If the likelihood of a date being missing is related to the outcome (e.g., participants who died earlier were less likely to have a Year 1 follow-up), simply removing those cases could bias our results.
Consistency in Time: Survival analysis relies on accurately measuring time-to-event. Imputing the Year 1 follow-up date ensures that all of a participant's records are aligned to the same starting point, allowing for consistent time-to-event calculations.
Essential for Single-Record Selection: Later in our analysis, we will be selecting just one record per participant—their last available assessment during the study period. To calculate the time-to-event accurately from this single record, we need the Year 1 follow-up date to be present. Imputing this date across all of a participant's records ensures that this crucial information is retained, even if it wasn't explicitly recorded in their final assessment. This allows us to define a consistent starting point for every participant, regardless of when their last observation occurred.
Looking Ahead
By carefully creating and imputing the Year 1 follow-up date, we've established a crucial anchor point for our survival analysis. This seemingly small step has a significant impact on the accuracy and reliability of our results.
In the upcoming sections, we'll build on this foundation by:
Transforming and recoding other key variables to prepare them for modeling.
Defining our time_to_event variables, using the imputed Year 1 follow-up date as our reference point.
Exploring strategies for handling categorical and ordinal data.
This approach to data handling ensures that our survival analysis is both precise and meaningful, allowing us to confidently explore the relationship between depression at Year 1 and all-cause mortality in individuals with TBI.
1.7 Transforming Continuous Variables into Quintiles
Introduction
In our journey toward building robust survival models, we often encounter variables that don't quite fit the mold of a "normal" distribution. One such variable in our TBIMS dataset is func_score_at_year_1, which represents participants' functional independence scores one year after their traumatic brain injury (TBI). This variable is noticeably left-skewed, meaning that most participants have scores that are bunched up at the high end of the scale, while a smaller number have scores extending out in a longer tail toward the lower end.
Why does this skewness matter? And how can transforming this variable into quintiles help us build better models? Let's dive in.
The Challenge of Skewness: Why We Can't Just Use the Raw Data
Our func_score_at_year_1 variable ranges from -5.86 to 1.39, with a mean close to 0 but a median of -0.47. This discrepancy between the mean and median is a telltale sign of skewness. If we were to use this raw, skewed variable directly in our Cox regression models, we might run into several issues:
Violating Model Assumptions: Cox regression doesn't require predictors to be normally distributed, but it does assume that each continuous predictor has a roughly linear relationship with the log hazard. Heavy skew makes that linearity assumption harder to satisfy and harder to check, potentially leading to inaccurate or misleading results.
Difficult Interpretation: Imagine trying to explain the effect of a one-unit change in func_score_at_year_1 on survival. Because of the skew, a one-unit change might represent a small shift in functional independence for some participants but a huge leap for others. This makes it hard to interpret the model's coefficients in a meaningful way.
Overpowering Outliers: Skewed distributions often come with extreme values (outliers). These outliers can exert a disproportionate influence on our model, potentially masking the true relationship between functional independence and survival for the majority of participants.
The Solution: Quintiles to the Rescue!
To address these challenges, we'll transform func_score_at_year_1 into quintiles. This means dividing our participants into five equal-sized groups based on their functional independence scores, effectively creating an ordinal variable with five categories.
Here's why this is a smart move:
Groups Participants into Meaningful Categories: Instead of treating functional independence as a continuous spectrum, we create five distinct groups, ranging from the lowest 20% of scores to the highest 20%. This makes it easier to identify patterns and compare outcomes across different levels of functional independence.
Simplifies Interpretation: Quintiles provide a clear, ordinal scale. We can now talk about the relative risk of mortality for participants in different quintiles, making our results more intuitive and accessible.
Reduces Sensitivity to Outliers: By grouping participants, we minimize the impact of extreme scores. Outliers are now contained within the top or bottom quintile, preventing them from dominating our analysis.
Step-by-Step: Creating Quintiles in R
Step 1: Visualizing the Skewness - A Picture is Worth a Thousand Words
Before we transform the variable, let's visualize its distribution using a histogram. This will help us understand the extent of the skewness.
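Here's a sketch of how such a histogram might be built with ggplot2. The bin width, colors, axis limits, theme settings, and file name are illustrative choices; customization and plots_dir are assumed to be defined elsewhere in the project:

```r
library(dplyr)
library(ggplot2)

# A small custom theme layered on top of theme_minimal() (styling choices are illustrative)
customization <- theme(
  plot.title = element_text(face = "bold", hjust = 0.5),
  axis.title = element_text(size = 11)
)

# Year 1 functional independence scores as a simple vector
func <- merged_data |>
  filter(data_collection_period == 1) |>
  pull(func_score_at_year_1)

func_histogram <- ggplot(data.frame(func = func), aes(x = func)) +
  geom_histogram(binwidth = 0.25, alpha = 0.8, fill = "steelblue", color = "white") +
  labs(
    x = "Functional independence score at Year 1",
    y = "Number of participants",
    title = "Distribution of Year 1 Functional Independence Scores"
  ) +
  scale_x_continuous(breaks = seq(-6, 2, by = 1), limits = c(-6, 2)) +
  theme_minimal() +
  customization

# Save the plot to the plots directory (plots_dir is assumed to exist)
ggsave(file.path(plots_dir, "func_score_at_year_1_histogram.png"),
       plot = func_histogram, width = 7, height = 5, dpi = 300)
```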
What the Code Does
Defines a custom theme to be applied to the histogram plot for stylistic purposes.
Creates a histogram of the distribution of the func variable using ggplot2.
The geom_histogram() function creates the histogram, binwidth specifies the width of the bins, alpha adjusts the transparency, and fill and color set the colors.
The labs() function adds labels for the x-axis, y-axis, and title of the plot.
scale_x_continuous() is used to define the scale of the x-axis, setting breaks and limits so that the plot displays a specific range and intervals.
theme_minimal() applies a minimal theme to the plot for a clean look.
Finally, customization applies the custom theme to the plot.
Saves the histogram plot to the plots_dir directory.
This histogram visually confirms the left skewness of our func_score_at_year_1 variable, reinforcing the need for transformation.

Step 2: Calculating the Quintile Breakpoints
Now, let's calculate the cut-off points that will divide our participants into five equal groups:
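A compact sketch of the calculation, mirroring the steps described below:

```r
library(dplyr)

# Quintile breakpoints based on Year 1 functional independence scores
quintile_breaks <- merged_data |>
  filter(data_collection_period == 1) |>
  pull(func_score_at_year_1) |>
  quantile(probs = seq(0, 1, by = 0.20), na.rm = TRUE) |>
  unique()
```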
Explanation
Focus on Year 1: We filter our merged_data to include only Year 1 observations because these scores will define our quintile boundaries.
Extract Scores: The pull() function extracts the func_score_at_year_1 values as a vector.
Calculate Quantiles: The quantile() function is the workhorse here. We provide it with our vector of scores and a sequence of probabilities (probs = seq(0, 1, by = 0.20)), representing the 0th, 20th, 40th, 60th, 80th, and 100th percentiles. These percentiles will serve as our quintile breakpoints. The na.rm = TRUE argument ensures that missing values are ignored.
Ensure Uniqueness: We use unique() to remove any duplicate breakpoints, which can sometimes occur due to tied values or a narrow range of scores.
Step 3: Assigning Participants to Quintiles
With our breakpoints defined, we can now assign each participant to their corresponding quintile:
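Here's a sketch of the assignment and imputation; the "downup" fill direction is an assumption, chosen so the Year 1 quintile reaches every one of a participant's records:

```r
library(dplyr)
library(tidyr)

merged_data <- merged_data |>
  group_by(id) |>
  mutate(
    # Assign quintiles only on Year 1 records with a non-missing score
    func_score_at_year_1_q5 = if_else(
      data_collection_period == 1 & !is.na(func_score_at_year_1),
      cut(func_score_at_year_1,
          breaks = quintile_breaks,
          include.lowest = TRUE,
          labels = FALSE),
      NA_integer_
    )
  ) |>
  # Carry the quintile assignment to every record for the participant
  fill(func_score_at_year_1_q5, .direction = "downup") |>
  ungroup()
```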
Explanation
Group By Participants: We group the data by id to ensure that quintile assignment and imputation are done within each participant's records.
Create func_score_at_year_1_q5: This new variable will store the quintile assignments (1 through 5). We use if_else to apply the cut() function only to Year 1 observations with non-missing func_score_at_year_1 values. The cut() function assigns each participant to a quintile based on the calculated quintile_breaks. include.lowest = TRUE ensures that the participant with the absolute lowest score is included in the first quintile, and labels = FALSE assigns numeric labels (1-5) instead of text labels.
Impute Quintiles: We use fill() to propagate the quintile assignment across all observations for each participant. This ensures that records other than the Year 1 assessment still carry the participant's quintile assignment.
Ungroup: We ungroup the data for further processing.
The Power of Quintiles: A Transformed Variable Ready for Modeling
By transforming our skewed continuous variable into quintiles, we've created a new variable, func_score_at_year_1_q5, that is:
More Robust to Skewness and Outliers: Quintiles are less sensitive to extreme values, providing a more stable representation of functional independence.
Easier to Interpret: We can now examine how mortality risk changes across distinct categories of functional independence, making our results more accessible and clinically relevant.
Suitable for Model Assumptions: The ordinal nature of quintiles is generally more compatible with the assumptions of Cox regression and other survival models than a highly skewed continuous variable.
Looking Ahead: Completing the Data Preparation Puzzle
Our data is now taking shape, but our preprocessing journey isn't over yet. In the following sections, we'll:
Update variable labels to ensure that our dataset is well-documented and easy to understand.
Address other potentially skewed variables and handle any remaining categorical recoding.
Define our crucial time_to_event and event indicator variables—the final ingredients for our Cox regression models.
By preparing our data, we're setting the stage for an insightful survival analysis that can shed light on the important relationship between depression, functional independence, and mortality after TBI.
1.8 Optimizing Factor Variables
Introduction
We're making excellent progress in preparing our data for survival analysis! Now, we'll focus on refining our factor (categorical) variables. This critical step involves two main parts:
Updating Variable Labels: Ensuring our labels are clear, descriptive, and consistent.
Processing Factor Levels: Strategically recoding, collapsing, and reordering factor levels to optimize them for Cox regression modeling.
These refinements are essential for both interpretability and the statistical validity of our analysis. In the context of our study—examining the impact of depression one year post-TBI on five-year mortality—these steps ensure that our results are both meaningful and reliable.
Why These Refinements Matter: The Key to Meaningful Models
Think of this stage as polishing the lenses of our analytical microscope. We're fine-tuning our variables to ensure that we can see the relationships in our data with maximum clarity. Here's why these steps are so important:
Enhanced Interpretability: Clear and descriptive labels make our results easier to understand, both for us as researchers and for anyone who reads our work.
Consistency Across the Dataset: Harmonizing coding schemes across different data collection periods ensures that our variables are consistently defined throughout the dataset.
Optimized for Cox Regression: Cox models have specific requirements for categorical variables. Properly defining reference levels and grouping categories strategically improves model convergence, enhances statistical power, and facilitates meaningful comparisons.
Step 1: Ensuring Clarity with Updated Variable Labels
First, we need to make sure that our factor variables have meaningful labels. We'll use our custom update_labels_with_sjlabelled function, which leverages the power of the sjlabelled package to automate this process.
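The helper itself isn't reproduced here, but a rough, hypothetical sketch based on the description below might look like this (the real implementation and the structure of mapping_lists may differ):

```r
library(sjlabelled)

# Hypothetical sketch of the custom helper described in this section
update_labels_with_sjlabelled <- function(data, mapping_lists) {
  for (var_name in names(mapping_lists)) {
    if (is.factor(data[[var_name]])) {
      original_labels <- get_labels(data[[var_name]])   # labels stored on the variable
      current_levels  <- levels(data[[var_name]])       # levels that survived cleaning/merging
      kept_labels     <- original_labels[original_labels %in% current_levels]
      data[[var_name]] <- set_labels(data[[var_name]], labels = kept_labels)
    }
  }
  data
}

merged_data <- update_labels_with_sjlabelled(merged_data, mapping_lists)
```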
What It Does
The function takes our dataset (data) and a list of mappings (mapping_lists) as input.
It iterates through each variable specified in the mappings.
For factor variables, it retrieves original labels, filters them to match the current factor levels, and then reapplies these updated labels to the variable.
Why It's Important
Maintains Consistency: This automated process ensures that our labels are always in sync with the underlying data, even after data cleaning or merging.
Reduces Manual Error: Automating the process minimizes the risk of errors that can occur with manual label updates.
Step 2: Optimizing Factor Variables: Recoding, Collapsing, and Releveling
Now, let's optimize our factor variables for Cox regression. This involves strategically recoding, collapsing, and reordering their levels.
Here's how we do it in R, using the powerful forcats package:
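Here's a hedged sketch of this step. The variables and reference levels come from the description below, but the specific source codes and employment groupings are illustrative; cause_of_injury is sketched separately in the example that follows:

```r
library(dplyr)
library(forcats)

merged_data <- merged_data |>
  mutate(
    # Replace numeric codes with descriptive labels
    status_at_followup = fct_recode(status_at_followup, "Followed" = "1"),

    # Collapse employment into fewer categories (groupings shown here are illustrative)
    employment_at_injury = fct_collapse(
      employment_at_injury,
      "Employed"          = c("Full-Time", "Part-Time"),
      "Full-Time Student" = c("Full-Time Student"),
      other_level         = "Other"
    ),

    # Set reference levels for Cox regression
    sex = fct_relevel(sex, "Male"),
    problematic_substance_use_at_injury =
      fct_relevel(problematic_substance_use_at_injury, "No"),
    employment_at_injury = fct_relevel(employment_at_injury, "Full-Time Student")
  )
```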
Key Techniques
fct_recode: This function from forcats allows us to rename factor levels. We use it to replace numeric codes with descriptive labels (e.g., "1" becomes "Followed" for status_at_followup).
fct_collapse: This function lets us group multiple levels into a single category. For example, we collapse different types of vehicular injuries into a broader "Vehicular" category for cause_of_injury. This simplifies the variable and can increase statistical power. We also use it to collapse employment_at_injury into fewer categories.
fct_relevel: This function is crucial for Cox regression. It allows us to specify the reference level for a factor variable. The reference level serves as the baseline for comparison when interpreting the hazard ratios in our model. For instance, we set "Male" as the reference level for sex, "Vehicular" as the reference level for cause_of_injury, "No" as the reference level for problematic_substance_use_at_injury, and "Full-Time Student" as the reference level for employment_at_injury.
Example: cause_of_injury
Let's take a closer look at how we transformed cause_of_injury:
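Here's a sketch of the three-step pipeline; the numeric source codes and exact category memberships are illustrative stand-ins for the TBIMS coding scheme:

```r
library(dplyr)
library(forcats)

merged_data <- merged_data |>
  mutate(
    # Step 1: descriptive labels for the numeric codes (codes shown are illustrative)
    cause_of_injury = fct_recode(
      cause_of_injury,
      "Motor vehicle" = "1",
      "Motorcycle"    = "2",
      "Fall"          = "3",
      "Assault"       = "4"
    ),
    # Step 2: collapse related causes into broader groups
    cause_of_injury = fct_collapse(
      cause_of_injury,
      "Vehicular" = c("Motor vehicle", "Motorcycle"),
      "Falls"     = c("Fall"),
      "Violence"  = c("Assault"),
      other_level = "Other"
    ),
    # Step 3: make "Vehicular" the reference level for the Cox model
    cause_of_injury = fct_relevel(cause_of_injury, "Vehicular")
  )
```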
Recode: We initially used fct_recode to give descriptive labels to the numeric codes.
Collapse: We then used fct_collapse to group related causes into broader categories: "Vehicular," "Falls," "Violence," and "Other."
Relevel: Finally, we used fct_relevel to set "Vehicular" as the reference level. This means that our Cox model will estimate the hazard ratios for "Falls," "Violence," and "Other" relative to "Vehicular" causes of injury.
Step 3: Final Touches - Dropping Unused Levels
After recoding and collapsing, some factor levels may no longer be present in our data. We use the droplevels function to tidy up our dataset:
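A single line covers it, since droplevels() applied to a data frame drops unused levels from every factor column:

```r
# Drop factor levels that no longer appear in the data after recoding and collapsing
merged_data <- droplevels(merged_data)
```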
Why It's Important
Data Cleanliness: Removing unused levels keeps our dataset tidy and prevents potential issues in some statistical procedures that are sensitive to empty levels.
The Importance of Thoughtful Factor Handling
These steps might seem detailed, but they are important for ensuring that our Cox regression models are both statistically sound and interpretable:
Meaningful Comparisons: By carefully choosing reference levels, we ensure that our model results provide meaningful comparisons between different categories.
Improved Model Performance: Collapsing categories can improve model stability and statistical power, especially when some categories have very few observations.
Actionable Insights: Clear labels and well-defined categories make it easier to translate our statistical findings into actionable insights that can inform interventions and improve outcomes for individuals with TBI.
Looking Ahead: Building the Foundation for Survival Analysis
With our factor variables carefully prepared, we're now ready to move on to the next critical steps in our data preparation journey:
Carrying Forward Baseline Variables: We'll ensure that baseline information is consistently represented across all time points for each participant.
Defining Event Times and Censoring Indicators: We'll create the essential time_to_event and censoring variables that form the core of our survival models.
Logging and Validating Transformations: We'll document all of our data transformations to ensure reproducibility and transparency.
By mastering these data preparation techniques, we're laying the groundwork for a powerful survival analysis that can contribute to a deeper understanding of the factors influencing long-term outcomes after TBI.
1.9 Carrying Forward Baseline Variables
Introduction
We're nearing the end of our data preparation journey, and we've reached a critical step: ensuring that each participant's baseline information is correctly represented across all of their records. This is essential because, in survival analysis, we often select a single "representative" record for each participant (typically their last observed record) to calculate their time_to_event. By carrying forward baseline variables, we guarantee that this critical information is available, regardless of which record is ultimately chosen.
In this section, we'll focus on two key tasks:
Propagating Baseline Variables: Using the Last Observation Carried Forward (LOCF) method to fill in baseline information across all subsequent observations for each participant.
Maintaining Factor Consistency: Ensuring that our factor variables retain their correct levels and labels after the imputation process.
Why Impute Baseline Variables to Subsequent Observations?
You might be wondering, "Why go through the trouble of carrying baseline information forward? Isn't it enough to just have it in the first record?" Here's why this step is so important:
Flexibility in Defining Time-to-Event: In survival analysis, a participant's "time zero" (the starting point for their observation period) isn't always their first observation. Often, it's defined by a specific event, like their Year 1 follow-up. By imputing baseline data to all records, we ensure that we can define "time zero" flexibly, without losing crucial information.
Avoiding Prediction Errors: When we ultimately select a single record per participant for our Cox regression, imputing baseline information to all records eliminates the guesswork. We don't have to predict in advance which record will be selected; we know that the necessary baseline data will be present regardless.
Consistent Modeling: This approach ensures that all records are complete and ready for downstream analysis, regardless of which observation is used to represent a participant in the final model.
Step 1: Carrying Baseline Variables Forward with LOCF
Let's see how we implement the Last Observation Carried Forward (LOCF) method in R.
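Here's a sketch of the workflow described in the conceptual breakdown below; the contents of baseline_vars are illustrative rather than the study's full list:

```r
library(dplyr)
library(tidyr)
library(sjlabelled)

# Baseline variables to carry forward (illustrative subset)
baseline_vars <- c("sex", "cause_of_injury", "employment_at_injury",
                   "problematic_substance_use_at_injury")

# 1. Preserve factor levels and value labels so we can restore them after LOCF
original_factor_info <- list()
for (var in baseline_vars) {
  if (is.factor(merged_data[[var]])) {
    original_factor_info[[var]] <- list(
      levels = levels(merged_data[[var]]),
      labels = get_labels(merged_data[[var]])
    )
    # 2. Temporarily convert factors to character before filling
    merged_data[[var]] <- as.character(merged_data[[var]])
  }
}

# 3. Carry baseline values forward within each participant
merged_data <- merged_data |>
  group_by(id) |>
  fill(!!!syms(baseline_vars), .direction = "down") |>
  ungroup()

# 4. Restore the factor structure and value labels
for (var in names(original_factor_info)) {
  merged_data[[var]] <- factor(merged_data[[var]],
                               levels = original_factor_info[[var]]$levels)
  merged_data[[var]] <- set_labels(merged_data[[var]],
                                   labels = original_factor_info[[var]]$labels)
}
```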
Conceptual Breakdown
Define Baseline Variables: We create a list called baseline_vars that contains the names of all variables collected at baseline that we want to carry forward.
Preserve Factor Information: Before applying LOCF, we store the original factor levels and labels for each of these variables. This is crucial because we'll need to restore them later. We use a loop to iterate through all of the variables in baseline_vars, storing the factor levels using levels() and the factor labels using get_labels(). This information is stored in the original_factor_info list.
Prepare for LOCF: We temporarily convert all factor variables in our baseline_vars list to character variables. This is necessary because the fill() function, which we'll use for LOCF, doesn't work directly with factors.
Perform LOCF with fill(): We group our data by id to ensure that LOCF is applied within each participant's records. We then use the fill() function from the tidyr package to propagate the baseline values downward (.direction = "down") within each group. The !!!syms(baseline_vars) part expands our list of variable names into individual arguments for fill().
Restore Factor Structure: After applying LOCF, we loop through the variables again. This time, we use the information stored in original_factor_info to convert the variables back to factors using factor() and reapply the original labels using set_labels().
Why This Matters
Ensures Baseline Data Availability: LOCF ensures that every record for a participant has their baseline information, even if it was only collected once at the beginning of the study.
Facilitates Time-to-Event Calculations: By having baseline data on every record, we can accurately calculate time_to_event from any chosen starting point, regardless of which record is ultimately used in the survival model.
Step 2: Finalizing the Dataset - Selection, Arrangement, and Saving
With our baseline variables propagated, we're ready to organize our dataset for the final stages of data preprocessing.
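Here's a sketch of this final tidying step; the selected columns, output path, and file format are assumptions rather than the post's exact choices (data_dir is presumed to be defined earlier):

```r
library(dplyr)

analytic_data <- merged_data |>
  # Keep only the variables needed for the survival analysis (illustrative selection)
  select(
    id, data_collection_period, date_of_year_1_followup,
    sex, cause_of_injury, employment_at_injury,
    problematic_substance_use_at_injury,
    func_score_at_year_1_q5, status_at_followup
  ) |>
  # Sort each participant's records chronologically
  arrange(id, data_collection_period)

# Save the processed dataset for the next stage of the analysis
saveRDS(analytic_data, file.path(data_dir, "tbims_processed.rds"))
```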
What We're Doing
Select Relevant Variables: We use select() to keep only the variables that are essential for our survival analysis, decluttering our dataset.
Reorder Columns: We rearrange the columns in a logical order, making it easier to inspect the data and understand the relationships between variables.
Sort Records: We use arrange() to sort the data by participant ID (id) and data collection period (data_collection_period), ensuring that each participant's records are in chronological order.
Looking Ahead: From Prepared Data to Survival Insights
By carrying forward baseline variables and ensuring their integrity, we've created a dataset that's nearly ready for survival analysis. Every record is now complete and consistent, providing a solid foundation for calculating our crucial time_to_event variables.
In the next steps, we'll:
Define Time-to-Event and Censoring: We'll create the core variables for our survival models, using the information we've so carefully prepared.
Explore Key Covariates: We'll further refine our categorical and continuous variables, preparing them for inclusion in our Cox regression models.
Document Our Transformations: We'll log every step of our data preparation process to ensure reproducibility and transparency.
This thorough preprocessing ensures that our study—examining the impact of Year 1 depression on mortality after TBI—rests on a reliable foundation. We're now poised to transform this meticulously prepared dataset into actionable insights that can contribute to improved care and outcomes for individuals with TBI.
Conclusion
We're near the end of our data preprocessing journey—a journey that has transformed our raw TBIMS data into a carefully prepared dataset. We've taken a complex collection of records and shaped it into a powerful resource for investigating the crucial link between depression one year post-TBI and five-year all-cause mortality.
This process hasn't just been about cleaning and organizing; it has been about building a solid foundation for robust survival analysis. Every decision, every transformation, every imputation was guided by our ultimate goal: to extract meaningful, reliable, and actionable insights that can improve the lives of individuals with TBI.
A Recap of Our Accomplishments: Transforming Data into Knowledge
Section 1.6 Data Handling and Imputation:
We tackled the challenge of missing data head-on, focusing on the critical Year 1 follow-up interview date. By strategically imputing this variable, we anchored each participant's timeline, ensuring a consistent starting point for our survival analysis.
This imputation was essential for preserving our valuable sample size and minimizing potential biases that could have arisen from excluding participants with missing data.
Section 1.7 Transforming Continuous Variables into Quintiles:
We transformed the skewed distribution of Year 1 functional independence scores into quintiles. This not only made the variable more suitable for our models but also enhanced the interpretability of our results.
By creating these five distinct groups, we can now examine how mortality risk changes across different levels of functional independence, providing clinically relevant insights.
Section 1.8 Optimizing Factor Variables:
We meticulously refined our factor variables through recoding, collapsing, and reordering levels. This process ensured that our categorical data are both meaningful and statistically sound.
By strategically choosing reference levels, we set the stage for insightful comparisons of hazard ratios in our Cox regression models.
Section 1.9 Carrying Forward Baseline Variables:
We used the Last Observation Carried Forward (LOCF) method to propagate baseline variables across all of a participant's records.
This crucial step guarantees that every observation is complete, regardless of which one is ultimately selected for our time-to-event calculations, and allows us to flexibly define our "time zero."
Looking Ahead: The Next Steps in Our Survival Analysis Journey
Our data are now primed and ready for the exciting next stage: building our survival models! In the next installment, we'll:
Define Time-to-Event and Censoring Variables: We'll create the core components of our survival analysis, using the carefully prepared data we've assembled.
Explore Key Covariates: We'll delve deeper into the relationships between our predictor variables and survival outcomes.
Build and Interpret Cox Regression Models: We'll finally bring our data to life, constructing and interpreting survival models that can reveal the sophisticated network of factors influencing mortality after TBI.