Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 1)

January 6, 2025


Introduction

Welcome to the first installment in our hands-on series about survival analysis! In this series, we'll equip you with the practical skills needed to prepare your data, build robust models, and extract meaningful insights from complex datasets.

Our focus will be on a critical question in healthcare: How do depression levels one year after a traumatic brain injury (TBI) influence all-cause mortality within the subsequent five years? By harnessing the capabilities of R, a leading statistical programming language, we aim to analyze real-world data and uncover insights that could ultimately lead to improved interventions and patient outcomes.

This introductory post is dedicated to the crucial, often under-appreciated, phase of data preprocessing. Just like an architect lays the groundwork for a magnificent building, we need to carefully prepare our data before constructing our survival models. This involves a series of essential steps to transform raw, often messy data into a clean, structured, and analysis-ready format.

Here's what you'll gain from this post:

1.1 Initial Setup and Library Loading

Learn how to efficiently set up your R environment and load the necessary packages for a smooth workflow.

1.2 Data Import

Master the techniques for importing your datasets into R, understanding their structure, and resolving initial compatibility challenges.

1.3 Data Cleaning

Discover how to identify and handle missing, inconsistent, or erroneous values, ensuring your dataset is of the highest quality.

1.4 Data Merging and Enrichment

Learn how to integrate baseline and follow-up datasets, append new variables, and resolve data redundancies for a more comprehensive dataset.

1.5 Data Transformation and Recoding

Explore how to create derived variables, standardize data formats, and prepare categorical variables for optimal use in your survival models.

Why This Matters: The Foundation of Reliable Insights

While data preprocessing might not be the most glamorous part of data analysis, it's arguably the most important. A rigorous and well-documented preprocessing workflow offers several key benefits:

  • Time Savings and Reduced Stress: By automating tasks and proactively addressing potential issues, we streamline the entire analysis process, saving valuable time and minimizing frustration down the line.

  • Guaranteed Reproducibility: A transparent and well-documented process ensures that our work can be easily understood, replicated, and validated by others—a cornerstone of scientific rigor.

  • Enhanced Data Quality: Cleaning and transforming data ensures that our survival models are built upon a foundation of accurate and reliable inputs, leading to more trustworthy results.

Throughout this post, we will provide step-by-step R code examples accompanied by clear, jargon-free explanations. Whether you are a seasoned data analyst or just beginning your journey into the world of survival analysis, you will find practical techniques and insights that you can apply to your own projects.

1.1 Initial Setup and Library Loading

Introduction

This script establishes the foundational environment for the analysis by loading essential R libraries, setting up a structured directory system for data management, and defining the study timeline. These steps ensure a reproducible and organized workflow.

Step 1: Equipping Ourselves - Loading Essential Libraries

First, we need to gather our tools. We'll be using a curated selection of R packages, each playing a specific role in our data preprocessing procedures. Here's how we'll bring them on board:

# Load the pacman package (install if necessary)
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

# Install and load prerequisite libraries
pacman::p_load(haven, here, lubridate, labelled, sjlabelled, tidyverse)

Let's break down what's happening:
  1. The pacman Advantage: pacman is our secret weapon for streamlined package management. It's like a conductor for an orchestra, ensuring that all of our packages work harmoniously. The if (!requireNamespace("pacman", quietly = TRUE)) check installs pacman only if you don't already have it. From there, pacman::p_load() installs any of the listed packages that are missing and then loads them all in a single call.

  2. Our Arsenal of Libraries:

  • haven: Our bridge to the data world. It allows us to read data from various formats, including SPSS .sav files, which is how the TBIMS data are stored.

  • here: The pathfinder. It helps us create clean, standardized file paths that work consistently across different computers, making collaboration and reproducibility a breeze.

  • lubridate: The time traveler. It makes working with dates and times in R incredibly intuitive.

  • labelled and sjlabelled: The label guardians. They ensure that valuable information encoded in variable labels isn't lost during data cleaning.

  • tidyverse: The data wrangling dream team. This collection of packages (including dplyr, ggplot2, and others) gives us superpowers for manipulating, transforming, and visualizing data.

Pro Tip: Using pacman is a game-changer, especially when collaborating or switching between computers. It gracefully handles missing packages, eliminating the dreaded "package not found" errors that can derail your analysis.

Step 2: Building Our Home Base - Creating a Project Directory

Now that we have our tools, let's create a well-organized home for our project. We'll set up a directory structure to keep our files tidy and our analysis on track:

# Create the 'logs' subdirectory if not already accessible
log_dir <- here("Logs")
if (!dir.exists(log_dir)) {
  dir.create(log_dir)
}

# Create the 'data/raw' subdirectory if not already accessible
data_raw_dir <- here("Data", "Raw")
if (!dir.exists(data_raw_dir)) {
  dir.create(data_raw_dir, recursive = TRUE)
}

# Create the 'data/processed' subdirectory if not already accessible
data_processed_dir <- here("Data", "Processed")
if (!dir.exists(data_processed_dir)) {
  dir.create(data_processed_dir, recursive = TRUE)
}

# Create the 'Output/Plots' subdirectory if not already accessible
plots_dir <- here("Output", "Plots")
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir, recursive = TRUE)
}

Here's the rationale:
  1. Why These Folders?

    • Logs: Our meticulous record-keeper. This folder will store detailed information about our data processing steps, ensuring transparency and making it easy to retrace our steps if needed.

    • Data/Raw: The vault. Here, we'll keep pristine, untouched copies of our original datasets. This is crucial for maintaining data integrity.

    • Data/Processed: The workshop. This is where we'll store our cleaned, transformed, and analysis-ready datasets.

    • Output/Plots: The gallery. This directory will store our visual output (i.e., plots).

  2. here() and dir.create() in Action:

    • The here() function, from the here package, automatically determines the root directory of your project, regardless of where you run the code. This makes your file paths portable and reliable.

    • dir.create() creates the directories. The recursive = TRUE argument is a handy feature that allows us to create nested directories (like Data/Raw and Data/Processed) in a single command, even if the parent directory (Data) doesn't exist yet.

    • The if (!dir.exists(…)) check means each directory is created only when it doesn't already exist, so re-running the script won't trigger "already exists" warnings. (A compact alternative is sketched just below.)
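
If you prefer an even more compact setup, the same four directories can be created in a single loop. Here's a minimal sketch of that alternative (same folders, same here() calls, just less repetition; the list names only document intent):

# Sketch: create all project directories in one pass
project_dirs <- list(
  log_dir            = here("Logs"),
  data_raw_dir       = here("Data", "Raw"),
  data_processed_dir = here("Data", "Processed"),
  plots_dir          = here("Output", "Plots")
)
for (dir in project_dirs) {
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
}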

Step 3: Defining Our Time Window - Setting Study Dates

Finally, let's define the crucial time parameters for our study:

# Define the study period start and end dates
study_entry_period_start_date <- as.Date("2006-10-01")
study_entry_period_end_date <- as.Date("2012-10-01")

What's the significance?
  1. The Study Window: These dates define the eligibility period for our study participants. Only individuals enrolled within this time frame will be included in our analysis.

  2. Date Handling Best Practices: Using as.Date() ensures that R understands these values as dates, not just text. This is essential for accurate date-based calculations, filtering, and merging operations later on.

Pro Tip: Defining your study parameters upfront and storing them as variables is a great habit. It promotes consistency throughout your code and makes it easy to adjust the parameters if needed.
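
To see why storing these as Date objects (rather than text) pays off, here's a tiny, self-contained sketch of the kind of eligibility filter we'll apply later. The toy tibble and the date_of_injury column name are purely illustrative; in the real workflow the relevant dates come from the TBIMS forms:

# Sketch: using the study window to filter a toy dataset (illustrative column names)
toy_enrollment <- tibble(
  id = 1:3,
  date_of_injury = as.Date(c("2005-06-15", "2008-03-01", "2013-01-20"))
)
toy_enrollment |>
  filter(date_of_injury >= study_entry_period_start_date,
         date_of_injury <= study_entry_period_end_date)
#> Only the participant injured within the 2006-2012 window remains.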

Conclusion

Congratulations! You've successfully laid the groundwork for your survival analysis project. We've:

  • Installed and loaded essential R libraries.

  • Created a well-structured project directory.

  • Defined key study parameters.

This might seem like a small step, but it's a giant leap toward a robust, reproducible, and insightful analysis.

In the next sections, we'll take the plunge into the data! We'll learn how to import our raw datasets, tackle the challenges of missing data, and transform our variables into a format suitable for survival analysis.

1.2 Data Import

Introduction

We've set the stage, and now it's time to bring our data into the spotlight! In the world of data analysis, the import process is like the grand opening act: it's our first real interaction with the data, and it needs to be handled with precision and care. It's not just about loading files into our R environment; it's about ensuring the raw data's integrity is preserved, gracefully handling any unexpected hiccups, and setting the stage for all the transformations that follow.

For our survival analysis journey—remember, we're exploring how depression one year post-TBI impacts all-cause mortality within five years—we'll be using a robust and flexible approach to import data.

A Closer Look at the Code: Importing with Precision and Care

Let's break down the code into manageable chunks and understand the "why" behind each step.

  1. Defining a Versatile Import Function: Our Data's Gateway

At the heart of our import process is the import_data function. Think of it as a skilled translator, capable of understanding different data languages (file formats) and converting them into a format that R can work with.

# Function to import data from a file
import_data <- function(file_path, file_type = c("sav", "csv")) {
  file_type <- match.arg(file_type)  # resolve the default vector to a single choice
  tryCatch({
    if (file_type == "sav") {
      read_sav(file_path)            # haven reads SPSS .sav files
    } else if (file_type == "csv") {
      read.csv(file_path)            # base R handles .csv files
    } else {
      stop("Unsupported file type. Please specify 'sav' or 'csv'.")
    }
  }, error = function(e) {
    cat("Error importing file:", file_path, "\nError message:", e$message, "\n")
    return(NULL)                     # return NULL so the script can continue and the error can be reviewed
  })
}

Why This Matters:
  • Flexibility: This function is built to handle both .sav (SPSS) and .csv files. This is crucial because real-world data often come in various formats. Our function ensures that we're prepared for different data sources.

  • Error Handling: The tryCatch block is our safety net. It gracefully catches any errors during the import process, logs them for us to review, and prevents the entire script from crashing. This makes debugging much easier.

  2. Specifying File Paths: Guiding R to Our Data

We use the here package to construct our file paths dynamically. This ensures that our code works seamlessly across different computer environments and operating systems.

# Specify the file paths to the datasets
tbims_form1_path <- here("Data", "Raw", "TBIMS_2023Q2_SPSS", "TBIMSForm1_20230712.sav")
tbims_form2_path <- here("Data", "Raw", "TBIMS_2023Q2_SPSS", "TBIMSForm2_20230726.sav")
function_scores_path <- here("Data", "Raw", "function_factorscore20240131.csv")

Why This Matters:
  • Reproducibility: Using relative paths (thanks to here) means that our script can find the data files regardless of the user's specific working directory setup. This is essential for reproducible research.

  • Clarity: We use descriptive variable names (tbims_form1_path, etc.) to clearly indicate which file is being referenced. This makes our code easier to understand and maintain.

  3. Importing the Data: The Moment of Truth

Now, we use our import_data function to load the key datasets:

tbims_form1_data <- import_data(tbims_form1_path, file_type = "sav")
tbims_form2_data <- import_data(tbims_form2_path, file_type = "sav")
function_factor_scores <- import_data(function_scores_path, file_type = "csv")

Why This Matters:
  • Scalability: If we need to import more files in the future, we can easily do so by adding a few more lines of code, thanks to our flexible import_data function.

  • Traceability: Each dataset is assigned to a specific variable, making it easy to track where each piece of data came from.

  4. Preserving Variable Labels: Keeping the Context

Before we start transforming our data, we save the original variable labels. These labels are like the metadata that describe the meaning of each variable.

tbims_form1_labels <- tbims_form1_data |> var_label()
tbims_form2_labels <- tbims_form2_data |> var_label()

Why This Matters:
  • Context Retention: Labels provide crucial context, especially when dealing with large datasets with many variables. They help us understand what each variable represents, which is vital for accurate interpretation.
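
A quick way to confirm the labels were captured, and to look one up when you need context, is a couple of one-liners like the sketch below (the "AGE" lookup assumes that raw SPSS variable name, which appears in the mappings later in this post):

# Sketch: inspect the saved labels
length(tbims_form1_labels)       # how many variables carry labels
tbims_form1_labels[["AGE"]]      # look up the label for the raw AGE variable
# Labels can be re-attached later with labelled::var_label() once variable names line up.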

Pro Tips for a Smooth Data Import
  • Standardize Your Files: Consistent file naming conventions make your life much easier when importing multiple files.

  • Double-Check Your Imports: Always take a moment to verify that your data have been imported correctly. Check the number of rows and columns, variable types, and a few sample rows to ensure everything looks as expected.

  • Document Everything: Clearly note any assumptions you're making about the data or any issues you encounter during the import process.
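
Putting the second Pro Tip into practice, here's a minimal sketch of the kind of post-import checks worth running (it uses only the objects created above):

# Sketch: quick post-import checks
if (is.null(tbims_form1_data)) stop("Baseline data failed to import - check the file path.")
dim(tbims_form1_data)            # number of rows and columns
glimpse(tbims_form2_data)        # variable names and types at a glance
head(function_factor_scores)     # first few rows of the functional score data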

Conclusion

With our data successfully imported and variable labels preserved, we've built a clean and organized foundation for the next stages of our analysis. We're now ready to roll up our sleeves and dive into data cleaning, where we'll tackle missing values, refine variable formats, and prepare our data for the exciting world of survival modeling.

1.3 Data Cleaning

Introduction

We've imported our raw data, but the journey to insightful analysis has just begun. Before we can build our survival models, we need to transform our raw data into a clean, reliable, and analysis-ready format. This is where the crucial step of data cleaning comes into play.

Think of this process as preparing ingredients for a gourmet meal. Raw data often arrive with inconsistencies, errors, and formatting quirks—much like unwashed vegetables or unmeasured spices. Our goal is to clean and prepare each data element, ensuring that our final "dish"—our survival analysis—is both delicious and accurate.

In this section, we'll dive deep into the specifics of data cleaning, focusing on how we handle the unique challenges of the longitudinal TBIMS dataset.

Step 1: Defining Data Cleaning Functions

Raw data rarely speak the language of statistical models. Datasets often contain placeholder codes for missing values, variables stored in the wrong format, and other inconsistencies that can trip up our analysis. To tackle these issues systematically, we'll define three powerful cleaning functions:

  1. handle_date_conversion: Our Date Harmonizer

Dates are the lifeblood of survival analysis. They allow us to calculate time_to_event, the core of our investigation. The handle_date_conversion function ensures that all date variables are consistently formatted and that any invalid date codes are correctly identified as missing.

handle_date_conversion <- function(x, na_codes) {
  # Ensure the variable is stored as a Date object
  if (!inherits(x, "Date")) {
    x <- as.Date(x)
  }
  # Convert the invalid-date placeholder codes to Date objects as well
  na_codes <- lapply(na_codes, function(code) as.Date(code, format = "%Y-%m-%d"))
  # Replace any value matching a placeholder code with NA
  for (code in na_codes) {
    x[x == code] <- NA
  }

  return(x)
}

Purpose
  • Converts variables to R's standard Date format.

  • Replaces specified invalid date codes (e.g., 9999-09-09, often used as placeholders) with NA, R's standard for missing data.

Why It's Necessary
  • Ensures accurate time_to_event calculations, which are fundamental to survival analysis.

  • Prevents errors that could arise from trying to perform calculations on invalid date formats.

How It Works
  • Checks if the input variable x is already a Date object. If not, it converts it using as.Date().

  • Converts the user-provided na_codes (invalid date codes) to Date objects as well.

  • Iterates through the na_codes and replaces any matching values in x with NA.
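
Here's a quick, self-contained sanity check of how handle_date_conversion() behaves on a toy vector:

# Sketch: handle_date_conversion() on a toy vector
dates <- as.Date(c("2007-05-01", "9999-09-09", "2010-11-30"))
handle_date_conversion(dates, na_codes = list("9999-09-09"))
#> [1] "2007-05-01" NA           "2010-11-30"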

  2. replace_na: The Missing Value Master

Missing data is a common challenge in real-world datasets. The replace_na function is our all-purpose tool for handling missing values and ensuring that variables are stored in the correct format.

replace_na <- function(x, na_codes, to_class = NULL) {
  if (!is.null(to_class)) {
    if (to_class == "factor") {
      # Convert to character first so the NA codes can be matched, then re-factor
      x <- as.character(x)
      x[x %in% na_codes] <- NA
      x <- factor(x, exclude = NA)
    } else if (to_class == "numeric") {
      x <- as.numeric(x)
      x[x %in% na_codes] <- NA
    } else if (to_class == "Date") {
      # Dates get their own handler so placeholder dates are matched correctly
      x <- handle_date_conversion(x, na_codes)
    } else if (to_class == "character") {
      x <- as.character(x)
      x[x %in% na_codes] <- NA
    }
  } else {
    # No conversion requested: just replace the missing value codes
    x[x %in% na_codes] <- NA
  }
  return(x)
}

Purpose
  • Replaces non-standard missing value codes (e.g., 999, 88, Refused) with NA.

  • Converts variables to the correct data type (e.g., numeric, factor, Date, character).

Why It's Necessary
  • Ensures that statistical models handle missing data correctly. Most R functions are designed to work seamlessly with NA.

  • Prevents errors that can occur when models encounter unexpected values or data types.

How It Works
  1. Checks if a target data type (to_class) is specified.

  2. Based on to_class, it converts the variable x accordingly.

  3. Replaces any values in x that match the provided na_codes with NA.

  4. If no target data type is specified, it will still replace the user-specified na_codes with NA.
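
To make this concrete, here's a small sketch of replace_na() at work on a toy vector of SPSS-style codes. (Note that this custom replace_na() masks tidyr's function of the same name, which is attached via the tidyverse.)

# Sketch: replace_na() on a toy vector with 88/99 missing value codes
status_codes <- c(1, 2, 99, 1, 88)
replace_na(status_codes, na_codes = c(99, 88), to_class = "factor")
#> [1] 1    2    <NA> 1    <NA>
#> Levels: 1 2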

  3. clean_and_convert: The Cleaning Powerhouse

The clean_and_convert function is the conductor of our cleaning orchestra. It orchestrates the application of handle_date_conversion and replace_na to multiple variables, guided by a set of instructions we'll call "variable mappings."

clean_and_convert <- function(data, mapping_list) {
  for (var in names(mapping_list)) {
    # Pull the cleaning rules for this variable from the mapping list
    na_values <- mapping_list[[var]]$na_values
    original_name <- mapping_list[[var]]$original_name
    to_class <- mapping_list[[var]]$to_class
    # Rename, replace missing value codes, convert the type, and strip haven labels
    data <- data |> rename(!!var := all_of(original_name))
    data <- data |> mutate(!!var := replace_na(!!sym(var), na_values, to_class))
    data <- data |> mutate(!!var := haven::zap_labels(!!sym(var)))
  }
  return(data)
}

Purpose
  • Automates the cleaning process for multiple variables, applying the appropriate cleaning rules to each.

  • Renames the variables according to the user-specified mapping list.

  • Removes variable labels from variables imported using haven.

Why It's Necessary
  • Ensures consistency and efficiency when cleaning large datasets with many variables.

  • Reduces the risk of errors that can occur when manually cleaning each variable.

How It Works
  1. Iterates through a list of variables and their corresponding cleaning rules (the mapping_list).

  2. For each variable:

    • Retrieves the list of na_values (values to be treated as missing).

    • Retrieves the original_name of the variable.

    • Retrieves the desired data type to_class.

    • Renames the variable using rename().

    • Applies the replace_na function to handle missing values and convert the variable to the correct type.

    • Uses haven::zap_labels to remove any variable labels.
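
Before pointing clean_and_convert() at the real TBIMS files, it can help to see it run end-to-end on a tiny, made-up data frame. This is only a sketch; the toy column names mirror the mappings described in the next step:

# Sketch: clean_and_convert() on a toy data frame
toy_raw <- tibble(Mod1id = c(101, 102), SexF = c(1, 99), AGE = c(34, 9999))
toy_mappings <- list(
  id = list(original_name = "Mod1id", to_class = "numeric"),
  sex = list(original_name = "SexF", na_values = 99, to_class = "factor"),
  age_at_injury = list(original_name = "AGE", na_values = 9999, to_class = "numeric")
)
clean_and_convert(toy_raw, toy_mappings)
#> Mod1id becomes the numeric id, SexF's 99 becomes NA in the new sex factor,
#> and AGE's 9999 becomes NA in age_at_injury.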

Step 2: Defining Variable Mappings

Think of variable mappings as the detailed blueprints that guide our cleaning process. They provide specific instructions for how each variable should be handled, ensuring consistency and accuracy.

Why It Matters
  • Data-Specific Rules: Datasets often have unique quirks. Mappings allow us to tailor our cleaning process to the specific characteristics of our TBIMS data.

  • Transparency and Reproducibility: Mappings provide a clear record of our cleaning decisions, making our analysis transparent and reproducible.

Examples:
  • Baseline Data Mappings: These mappings specify how to handle variables from the baseline assessment. For example:

baseline_name_and_na_mappings <- list(
  id = list(original_name = "Mod1id", to_class = "numeric"),
  sex = list(original_name = "SexF", na_values = 99, to_class = "factor"),
  age_at_injury = list(original_name = "AGE", na_values = 9999, to_class = "numeric"),
  date_of_birth = list(original_name = "Birth", na_values = as.Date("9999-09-09"), to_class = "Date"),
  # ... (More mappings for other baseline variables) ...
)

Explanation:
  • id: The participant ID. It's originally named Mod1id and should be treated as numeric.

  • sex: The participant's sex. It's originally named SexF, has a missing value code of 99, and should be converted to a factor (a categorical variable).

  • age_at_injury: The participant's age at the time of injury. It's originally named AGE, has a missing value code of 9999, and should be treated as numeric.

  • date_of_birth: The date of birth of the participant. It's originally named Birth, has a missing value code of 9999-09-09, and should be treated as a date.

  • Follow-Up Data Mappings: We create similar mappings for variables collected during follow-up assessments:

followup_name_and_na_mappings <- list(
  id = list(original_name = "Mod1id", to_class = "numeric"),
  date_of_followup = list(original_name = "Followup", na_values = as.Date(c("4444-04-04", "5555-05-05")), to_class = "Date"),
  # ... (More mappings for other follow-up variables) ...
)

Explanation:
  • id: The participant ID. It's originally named Mod1id and should be treated as numeric.

  • date_of_followup: The date of each follow-up interview. It's originally named Followup, has missing value codes of 4444-04-04 and 5555-05-05, and should be treated as a date.

Step 3: Transforming Raw Data into Clean Data

We've imported our raw data and defined our cleaning tools. Now it's time to transform those raw datasets into clean, analysis-ready formats. Remember, directly imported datasets are rarely ready for prime time. They need to be carefully inspected, cleaned, and often restructured before we can extract meaningful insights.

This step is particularly crucial for our longitudinal TBIMS dataset, which tracks participants over time. We need to ensure that variables are consistently defined across different time points and that our data structure is suitable for survival analysis.

Adding Identifiers: Setting the Stage for Longitudinal Analysis

One of the first things we'll do is add a data_collection_period variable to our baseline data:

# Add data collection period to baseline data
tbims_form1_data <- tbims_form1_data |>
  mutate(data_collection_period = 0)

What It Does
  • This simple line of code adds a new variable called data_collection_period to our tbims_form1_data (baseline) dataset and assigns it a value of 0. This "0" acts as a flag, clearly identifying these records as belonging to the baseline assessment.

Why It's Necessary
  • Longitudinal Data Management: Our data are in "long format," meaning that each participant has multiple rows corresponding to different time points. The data_collection_period variable is essential for distinguishing between these different observations for the same individual. Think of it as a timestamp for each assessment.

  • Ensuring Accurate Merging: Later, when we merge our baseline and follow-up datasets, this variable will be crucial for ensuring that records are correctly matched. Without it, we risk creating a jumbled mess, with baseline data incorrectly linked to follow-up data.
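
If you want reassurance that the flag landed where expected, a one-line check does it (a quick, optional sketch):

# Sketch: confirm every baseline record carries the new flag
tbims_form1_data |> count(data_collection_period)
#> A single row with data_collection_period = 0 and n equal to the number of baseline records.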

Applying the Cleaning Power: Integrating Data Cleaning

Now, let's unleash our cleaning functions on the raw datasets, guided by the mappings we defined earlier:

# Clean and transform the baseline and follow-up datasets
clean_tbims_form1_data <- clean_and_convert(tbims_form1_data, baseline_name_and_na_mappings)
clean_tbims_form2_data <- clean_and_convert(tbims_form2_data, followup_name_and_na_mappings)

What Happens Here
  • The clean_and_convert function swoops in, systematically cleaning each variable in both the baseline (tbims_form1_data) and follow-up (tbims_form2_data) datasets.

  • For each variable, it:

    1. Renames it according to our mappings (e.g., AGE becomes age_at_injury).

    2. Replaces any non-standard missing value codes (like 9999) with NA, R's standard for missing data.

    3. Converts the variable to the correct data type (e.g., numeric, factor, or date).

Example: Transforming AGE to age_at_injury

Let's revisit how this works for the AGE variable in our baseline data:

  1. clean_and_convert consults the baseline_name_and_na_mappings.

  2. It finds that AGE should be renamed to age_at_injury.

  3. It identifies 9999 as the missing value code and numeric as the desired data type.

  4. It calls replace_na to perform the transformation, resulting in a clean age_at_injury variable.
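
A quick spot-check (a sketch using the cleaned object created above) confirms the transformation did what we expect:

# Sketch: verify that 9999 codes are gone from age_at_injury
summary(clean_tbims_form1_data$age_at_injury)                    # NA count replaces the old 9999s
sum(clean_tbims_form1_data$age_at_injury == 9999, na.rm = TRUE)  # should be 0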

Why It Matters

These cleaning and transformation steps are not just about tidying up. They are essential for building a robust foundation for our survival analysis.

  1. Data Integrity: By standardizing variable names and formats, we ensure consistency across our datasets, preventing errors that could arise from mismatched variables during merging or analysis. The data_collection_period variable is particularly important for maintaining the integrity of our longitudinal data.

  2. Reproducibility: Our modular cleaning functions and detailed mappings make our process transparent and easy to replicate. Others can understand exactly how we transformed the raw data, promoting trust in our findings.

  3. Setting the Stage for Survival Analysis: We're now perfectly positioned to merge our cleaned datasets, align variables across different time points, and ultimately derive the time_to_event variables that are the cornerstone of survival analysis.

Conclusion

By investing time and effort in data cleaning, we're building a solid foundation that will support the more complex survival analysis techniques that we'll explore in subsequent blog posts.

In the next sections, we'll continue our data preprocessing journey, merging our cleaned datasets, creating new derived variables, and applying our study eligibility criteria to define our final analytic sample.

1.4 Data Merging and Enrichment

Introduction

Integrating the baseline and follow-up datasets is a critical step in preparing the TBIMS data for analysis. By merging the cleaned datasets and resolving overlapping variables, we create a comprehensive view of each participant's information across multiple time points. This ensures that all relevant data are readily accessible for analysis and minimizes redundancy within the dataset.

Merging Baseline and Follow-Up Data

The full_join function from the dplyr package merges the cleaned baseline and follow-up datasets. The merge is performed using the unique participant identifier (id) and the data collection period (data_collection_period) as keys. This approach preserves all records from both datasets, ensuring complete participant representation.

# Merge baseline and follow-up datasets
merged_data <- full_join(clean_tbims_form1_data, clean_tbims_form2_data, by = c("id", "data_collection_period"))

Adding Functional Status Scores

To enrich the dataset, Year 1 functional status scores, sourced from the function_factor_scores dataset, are appended. The left_join function ensures that all records in the merged dataset are retained, with scores added where available. For clarity, the new variable is renamed to func_score_at_year_1.

# Append Year 1 functional status scores
merged_data <- merged_data |>
  left_join(function_factor_scores, by = "id") |>
  rename(func_score_at_year_1 = func)

Resolving Redundant Variables

After merging, some variables may be duplicated across the datasets (e.g., date_of_death). To resolve these redundancies, the coalesce function is used. coalesce selects the first non-missing value for each participant across the overlapping variables, creating a single, definitive column.

# Resolve redundancies using coalesce
merged_data <- merged_data |>
  mutate(date_of_death = coalesce(date_of_death.x, date_of_death.y))
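
Once the coalesced column exists, the leftover .x/.y source columns are usually dropped so only the definitive version remains. That clean-up step isn't shown above, but a minimal sketch would look like this:

# Sketch: drop the redundant source columns after coalescing (if they are no longer needed)
merged_data <- merged_data |>
  select(-date_of_death.x, -date_of_death.y)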

Conclusion

The integration process results in a unified dataset that consolidates baseline and follow-up data, resolves data redundancies, and incorporates additional functional status scores. This comprehensive dataset provides a complete and reliable foundation for subsequent analyses, such as evaluating functional outcomes and assessing mortality risk.

1.5 Data Transformation and Recoding

Introduction

We've cleaned our data and brought it together, and before we transform and recode variables, it's worth walking through that merging and enrichment workflow in more detail. In this stage, we combine our separate datasets into a unified whole and enhance the result with additional information, creating a richer, more powerful dataset for our survival analysis.

Remember, our ultimate goal is to investigate how depression one year post-TBI influences all-cause mortality within five years of the initial interview. To do this effectively, we need a dataset that seamlessly integrates information from different time points and sources.

Let's break down the process into three key steps:

Step 1: Unifying the Data - Merging Baseline and Follow-Up Records

Longitudinal studies, like the TBIMS study, involve collecting data from participants at multiple time points. To get a complete picture of each participant's journey, we need to merge these separate records into a single, unified dataset.

Here's how we do it:

# Merge cleaned baseline and follow-up datasets
merged_data <- full_join(clean_tbims_form1_data, clean_tbims_form2_data, by = c("id", "data_collection_period"))

What It Does
  • This command uses the powerful full_join function from the dplyr package to combine our cleaned baseline data (clean_tbims_form1_data) with our cleaned follow-up data (clean_tbims_form2_data).

  • The by = c("id", "data_collection_period") part tells full_join to match records based on both the participant's unique identifier (id) and the data collection period (data_collection_period). This ensures that the correct baseline and follow-up records are linked for each individual.

Why It's Important
  • Creating a Holistic View: Merging these datasets gives us a comprehensive view of each participant's clinical trajectory over time. We can now see their baseline characteristics alongside their follow-up outcomes, all in one place.

  • Foundation for Longitudinal Analysis: This merged dataset is the foundation for our survival analysis. Without it, we'd be analyzing fragmented pieces of information, unable to connect crucial baseline factors to later outcomes.

  • Preserving the Entire Cohort: Using a full_join ensures that we retain all participants, even those who might be missing data in either the baseline or follow-up datasets.
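
A couple of quick checks (a sketch using the objects defined above) can confirm the join behaved as intended:

# Sketch: sanity-check the merge
nrow(clean_tbims_form1_data)      # baseline rows
nrow(merged_data)                 # merged rows (at least as many as either input with full_join)
n_distinct(merged_data$id)        # unique participants retained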

Step 2: Adding Depth - Appending Functional Independence Scores

Our dataset becomes even more valuable when we enrich it with additional relevant information. In this step, we'll add functional independence scores, which provide crucial insights into a participant's recovery progress after TBI.

# Append Year 1 functional independence scores
merged_data <- merged_data |>
  left_join(function_factor_scores, by = "id") |>
  rename(func_score_at_year_1 = func)

What It Does
  • We use a left_join to link the functional independence scores (from the function_factor_scores dataset) to our merged data, matching records based on the participant's id.

  • We then rename the appended variable to func_score_at_year_1 for clarity.

Why It's Important
  • Prognostic Significance: Functional independence is a strong predictor of long-term outcomes after TBI. Including these scores allows us to investigate how functional status relates to survival and how it might interact with other factors, like depression.

  • Stratified Analysis: These scores will enable us to perform stratified analyses, exploring whether the relationship between depression and mortality differs across different levels of functional independence. For example, we might ask: "Does the impact of depression on survival differ between participants with high versus low functional independence at one year post-TBI?"

Pro Tip: We're appending the raw functional independence scores (which range from -5.86 to 1.39) here. This gives us maximum flexibility later on to create different types of derived variables, such as quintiles or categories, based on these scores.
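
As a preview of that flexibility, here's one way the quintiles could eventually be derived. This is only a sketch; the func_quintile_at_year_1 name is illustrative, and the actual recoding happens later in the workflow:

# Sketch: previewing quintile creation with dplyr's ntile()
merged_data |>
  mutate(func_quintile_at_year_1 = ntile(func_score_at_year_1, 5)) |>
  count(func_quintile_at_year_1)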

Step 3: Cleaning House - Resolving Variable Redundancy

When merging datasets, we often encounter variables that are recorded in both datasets, leading to redundancy. For example, both our baseline and follow-up datasets might contain a date_of_death variable. We need to resolve these redundancies to create a clean and consistent dataset.

# Coalesce shared variables
merged_data <- merged_data |>
  mutate(
    date_of_death = coalesce(date_of_death.x, date_of_death.y)
  )

What It Does
  • We use the coalesce function from dplyr to combine the date_of_death.x (from the baseline data) and date_of_death.y (from the follow-up data) into a single date_of_death variable.

  • coalesce picks the first non-missing value for each participant. So, if date_of_death.x is missing and date_of_death.y has a value, it will use the value from date_of_death.y.

Why It's Important
  • Data Cleanliness: Resolving redundancy eliminates duplicate columns, making our dataset cleaner and easier to work with.

  • Preventing Errors: Having multiple versions of the same variable can lead to confusion and potential errors in our analysis. coalesce ensures that we have a single, authoritative value for each variable.

Pro Tip: Make it a habit to use coalesce for any variable that appears in both datasets during a merge. This systematic approach prevents inconsistencies and ensures that no information is accidentally lost.
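
If more than a handful of variables end up duplicated, you might wrap that habit in a small helper. The function below is a hypothetical sketch (not part of the original workflow) that coalesces every .x/.y pair produced by a join; it assumes each pair shares the same data type:

# Sketch: a hypothetical helper that coalesces every .x/.y pair left by a join
coalesce_shared_columns <- function(data) {
  # Base names of variables that appear with both .x and .y suffixes
  shared <- sub("\\.x$", "", grep("\\.x$", names(data), value = TRUE))
  shared <- shared[paste0(shared, ".y") %in% names(data)]
  for (v in shared) {
    data[[v]] <- coalesce(data[[paste0(v, ".x")]], data[[paste0(v, ".y")]])
    data[[paste0(v, ".x")]] <- NULL
    data[[paste0(v, ".y")]] <- NULL
  }
  data
}

# Usage: merged_data <- coalesce_shared_columns(merged_data)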

Conclusion

These three steps—merging, enriching, and resolving redundancies—are essential for transforming our raw data into an analysis-ready resource for survival analysis. By creating a unified, comprehensive, and clean dataset, we've laid a solid foundation for exploring the complex interplay between depression and mortality after TBI.

In the next post, we'll continue our data preprocessing journey by deriving new variables, such as the precise date of the one-year follow-up, and further preparing our data for the exciting world of survival modeling.

Conclusion

This blog post has detailed the systematic preparation of the TBIMS dataset for analysis, transforming raw data into a comprehensive, analysis-ready resource for investigating traumatic brain injury outcomes.

  1. Section 1.1 Initial Setup and Library Loading: Established a streamlined R environment using pacman for library management, a structured directory system, and project-specific settings.

  2. Section 1.2 Data Import: Introduced the custom import_data function for efficient and error-resistant importing of .sav and .csv files.

  3. Section 1.3 Data Cleaning: Employed custom R functions to standardize dates, handle missing values, and clean data frames based on predefined rules.

  4. Section 1.4 Data Merging and Enrichment: Unified baseline and follow-up datasets using dplyr's full_join and left_join, retaining all participant records and enriching the dataset with functional status scores. Redundant variables were resolved using coalesce.

  5. Section 1.5 Data Transformation and Recoding: Transformed variables to enhance interpretability and prepare for statistical modeling, including:

    • Creation and imputation of date_of_year_1_followup for time-to-event analyses.

    • Categorization of func_score_at_year_1 into quintiles for group comparisons.

    • Updating, recoding, and collapsing of factor variables using update_labels_with_sjlabelled and forcats functions.

Outcomes and Implications

The data preparation process detailed in this post has resulted in a robust TBIMS dataset. This resource enables researchers to investigate the long-term outcomes of individuals with moderate-to-severe TBI (msTBI), explore recovery trajectories, and identify predictors of outcomes. The methodology employed underscores the importance of thorough data preparation in ensuring the validity and reliability of subsequent analyses, ultimately contributing to a deeper understanding of TBI recovery and informing clinical practice.
