Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 3)

January 20, 2025

Featured

Research

Tutorials

Introduction

Welcome back to our hands-on series exploring the power of survival analysis! We've covered a lot of ground already—importing, cleaning, and merging our data. Now, we're entering the crucial final stages of data preprocessing, where we transform our refined dataset into a powerhouse of insights, ready to tackle our central research question: How do depression levels one year after a traumatic brain injury (TBI) impact all-cause mortality within the subsequent five years?

This post will guide you through three essential phases of data preprocessing:

  1. Logging and Tracking Changes: Ensuring transparency and reproducibility by documenting every transformation and exclusion.

  2. Applying Eligibility Criteria: Refining our analytic sample to include only those participants who meet our study's requirements and have sufficient data for survival analysis.

  3. Creating Our Key Predictor Variable: Deriving a clinically meaningful measure of depression severity from raw PHQ-9 responses.

These steps might seem technical, but they are the backbone of a robust and reliable analysis. They ensure that our survival models are built on a solid foundation of high-quality data, ultimately leading to more trustworthy and impactful findings.

Why Preprocessing Matters: Building Trust in Our Results

Data preprocessing is more than just a technical necessity; it's the foundation of sound research. A meticulous and well-documented preprocessing workflow provides:

  • Transparency and Reproducibility: By carefully tracking every change we make to our data, we create a transparent and reproducible analysis that others can understand, validate, and build upon. This is the cornerstone of scientific rigor.

  • Data Quality Assurance: Addressing inconsistencies, errors, and missing data ensures that our models are based on accurate and reliable information, leading to more trustworthy results.

  • Meaningful Insights: Transforming raw data into well-defined, clinically relevant variables allows us to extract insights that are both statistically sound and relevant to real-world practice.

1.10 Logging and Tracking

Every time we transform our data—whether it's applying an eligibility criterion, recoding a variable, or imputing missing values—we need to keep a detailed record of the changes. This is where logging comes in. Think of it as our data's audit trail, providing a clear history of how our dataset evolved.

In this section, we introduce two essential logging functions:

  1. log_sample_size(): This function tracks changes in our sample size after each transformation, allowing us to monitor the impact of our preprocessing steps and quickly identify any unexpected data loss.

  2. log_removed_ids(): This function keeps a record of the specific participants excluded at each step and the reason for their exclusion. This tracking is crucial for assessing potential biases and ensuring the transparency of our selection process.

Why This Matters: Logging ensures that our analysis is transparent, reproducible, and accountable. It allows us to retrace our steps, understand the evolution of our dataset, and build trust in our findings.

1.11 Applying Study Eligibility Criteria

With our logging tools in place, we're ready to apply our study's eligibility criteria. These criteria define the specific participants that we want to include in our final analytic sample. We'll apply three criteria sequentially:

  1. Criterion 1: This criterion defines our study's enrollment period and ensures that we have sufficient follow-up data or a recorded event (death) for each participant. It also involves calculating the crucial time_to_event and event_status variables.

  2. Criterion 2: This criterion excludes participants with biologically implausible survival times, such as those who died on the same day as their Year 1 follow-up or those with negative survival times (likely due to data entry errors).

  3. Criterion 3: This criterion ensures that all participants have the necessary date information (either a follow-up date or a date of death) to calculate their survival time.

Why This Matters: Applying these criteria systematically refines our dataset, ensuring that we're focusing on the right participants and that our survival analysis is based on valid and reliable data.

1.12 Deriving Depression Level at Year 1

Finally, we'll create our key predictor variable: Depression Level at Year 1. This variable—derived from the PHQ-9 questionnaire—categorizes participants into "No Depression," "Minor Depression," or "Major Depression" based on their symptom profile at the one-year follow-up.

We'll also create two related variables:

  • positive_symptoms_at_year_1: A count of endorsed depressive symptoms.

  • cardinal_symptoms_at_year_1: A categorical variable indicating the presence or absence of the two cardinal symptoms of depression: anhedonia and depressed mood.

Why This Matters: This clinically meaningful variable will be the mainstay of our investigation into the link between depression and post-TBI mortality.

The Result: A Dataset Primed for Survival Analysis

By logging our changes, applying our eligibility criteria, and deriving our key predictor variable, we've transformed our raw TBIMS data into a refined dataset, nearly ready for survival modeling. We've addressed data quality issues, ensured consistency across time points, and created variables that are both statistically sound and clinically relevant.

Looking Ahead: Imputation, Final Dataset Selection, and Model Building

Our data preparation journey is nearing its end! In the next post, we'll tackle the crucial step of imputing missing values in our Year 1 variables, and then we'll select our final analytic dataset, choosing a single representative record for each participant. Finally, we'll be ready to build our Cox regression models and uncover the relationship between depression and survival after TBI.

1.10 Logging and Tracking

Introduction

As we navigate the intricate process of data preprocessing for survival analysis, it's crucial to keep a record of every step we take. Just as a scientist carefully documents their experiments in a lab notebook, we need to diligently track how our dataset changes with each transformation, inclusion, and exclusion. This is where logging comes in—our essential tool for ensuring transparency, reproducibility, and ultimately, the trustworthiness of our findings.

In this section, we'll focus on two vital aspects of logging:

  1. Tracking Sample Size Changes: Monitoring how our sample size evolves with each preprocessing step.

  2. Identifying Excluded Participants: Keeping a record of exactly which participants are removed and why.

Why Logging Matters: More Than Just Housekeeping

You might be thinking, "Isn't logging just extra work?" Well, it's an investment that pays off significantly in the long run. Here's why:

  • Reproducibility is Paramount: In research, reproducibility is the gold standard. We need to ensure that anyone (including our future selves!) can follow our steps and arrive at the same results. Detailed logging makes this possible.

  • Transparency Builds Trust: Openly documenting our data preprocessing steps allows others to understand, evaluate, and build upon our work. This transparency is crucial for the credibility of our findings.

  • Error Detection and Debugging: Unexpected changes in sample size can be a red flag, signaling a potential error in our code or a misunderstanding of the data. Logging helps us quickly identify and address such issues.

  • Understanding Bias: Systematically tracking who gets excluded (and why) helps us assess and address potential biases that might skew our results.

Tool #1: log_sample_size() - Tracking the Flow of Participants

The log_sample_size() function is our dedicated tool for monitoring changes in our sample size throughout the preprocessing journey.

Here's the code:

log_sample_size <- function(data, criterion_number, log_dir = here::here("Logs"), original_data = NULL) {
  # Ensure the log directory exists
  if (!dir.exists(log_dir)) {
    dir.create(log_dir, recursive = TRUE)
  }

  # Create the log file name
  log_file <- file.path(log_dir, "sample_sizes.log")

  # Calculate key metrics
  unique_ids <- length(unique(data$id))
  total_observations <- nrow(data)
  original_unique_ids <- if (!is.null(original_data)) length(unique(original_data$id)) else NA

  # Format the message
  message <- sprintf(
    "Original Unique IDs: %s\n Unique IDs after Applying Criterion %d: %s\n Total Observations after Applying Criterion %d: %s\n\n",
    format(original_unique_ids, big.mark = ","),
    criterion_number,
    format(unique_ids, big.mark = ","),
    criterion_number,
    format(total_observations, big.mark = ",")
  )

  # Write the message to the log file
  cat(message, file = log_file, append = TRUE)
}
How It Works
  1. Directory Check: It first ensures that a "Logs" directory exists to store our log files. If not, it creates one.

  2. File Setup: It defines the name of the log file (sample_sizes.log).

  3. Metric Calculation: It calculates:

    • unique_ids: The number of unique participant IDs in the current dataset.

    • total_observations: The total number of rows in the current dataset.

    • original_unique_ids: The number of unique participant IDs in the original dataset (if provided). This allows for comparisons across different stages of preprocessing.

  4. Message Formatting: It crafts a clear message summarizing the key metrics, using sprintf() for formatting.

  5. Logging: It appends this message to the sample_sizes.log file, creating a chronological record of sample size changes.

Example Usage:
log_sample_size(data, criterion_number = 1, original_data = merged_data)
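
For a sense of what this produces, the call above appends an entry of the following shape to Logs/sample_sizes.log (the counts here are hypothetical, derived from the sprintf() template):

Original Unique IDs: 12,000
 Unique IDs after Applying Criterion 1: 9,500
 Total Observations after Applying Criterion 1: 18,700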

Benefits
  • Trend Monitoring: We can easily track how our sample size changes as we apply different criteria or transformations.

  • Error Detection: Sudden, unexpected drops in the number of participants or observations can alert us to potential problems in our code.

  • Reproducibility: Provides a clear, step-by-step record of how our dataset evolved, making our analysis transparent and replicable.

Tool #2: log_removed_ids() - Keeping Track of Exclusions

Knowing how many participants are removed is important, but knowing who they are is equally crucial. The log_removed_ids() function helps us track exactly which participant IDs are excluded at each step.

Here's the code:

log_removed_ids <- function(original_data, filtered_data, criterion_number, log_dir = here::here("Logs")) {
  # Ensure the log directory exists
  if (!dir.exists(log_dir)) {
    dir.create(log_dir, recursive = TRUE)
  }

  # Identify removed IDs
  removed_ids <- setdiff(unique(original_data$id), unique(filtered_data$id))

  # Write IDs to a log file
  log_file <- file.path(log_dir, sprintf("criterion_%d_removed_ids.log", criterion_number))
  writeLines(as.character(removed_ids), log_file)
}
How It Works
  1. Directory Check: Similar to log_sample_size(), it ensures that the "Logs" directory exists.

  2. Exclusion Identification: It uses the setdiff() function to compare the unique participant IDs in the original dataset with those in the filtered dataset, identifying the IDs that were removed.

  3. Logging: It saves these removed IDs to a specific log file named according to the criterion that was applied (e.g., criterion_1_removed_ids.log).

Example Usage:
log_removed_ids(original_data = merged_data, filtered_data = data, criterion_number = 1)
Benefits
  • Error Verification: Allows us to quickly check if any important participants were unintentionally excluded.

  • Transparency: Provides a detailed record of exclusions, which is essential for peer review and ensuring the validity of our findings.

  • Reproducibility: Helps others (or ourselves) understand and replicate the exact dataset used in the analysis.
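
If we later want to audit an exclusion step, the per-criterion log can be read straight back into R. A minimal sketch, assuming the default "Logs" directory:

# Inspect the participants removed by Criterion 1
removed_criterion_1 <- readLines(here::here("Logs", "criterion_1_removed_ids.log"))
length(removed_criterion_1)  # how many participants were excluded at this step
head(removed_criterion_1)    # peek at the first few excluded IDs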

Putting It All Together: Integrating Logging into Our Workflow

Here's how we can seamlessly integrate these logging functions into our data preprocessing steps:

# Step 1: Filter data based on some criterion
filtered_data <- original_data |> 
  filter(some_condition)

# Step 2: Log the changes in sample size
log_sample_size(filtered_data, criterion_number = 1, original_data = original_data)

# Step 3: Log the IDs of the removed participants
log_removed_ids(original_data, filtered_data, criterion_number = 1)

By consistently applying these logging functions after each filtering or transformation step, we create a detailed audit trail of our data preprocessing journey.
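
After these steps, the Logs directory should contain the two files created by our logging functions (assuming the default log_dir). A quick way to confirm:

# List the audit-trail files produced so far
list.files(here::here("Logs"))
# Expected, based on the file names defined in the two functions:
# "criterion_1_removed_ids.log" "sample_sizes.log"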

Why This is Crucial for Survival Analysis

In survival analysis, we're dealing with sensitive data and often complex calculations. Our sample size and the characteristics of our participants directly impact the statistical power and generalizability of our findings. Moreover, non-random exclusions can introduce bias, skewing our results.

By implementing robust logging:

  • We ensure reproducibility: Our work can be easily replicated and verified.

  • We mitigate bias: Logs help us identify and address potential patterns of exclusion, leading to more reliable results.

  • We build trust: Comprehensive documentation enhances the credibility of our findings.

Looking Ahead: From Logs to Insights

With our logging tools in place, we're ready to continue refining our dataset. In the next sections, we'll apply our study eligibility criteria, further transform our variables, and ultimately define the time_to_event variables that are the heart of survival analysis.

By combining cautious data preprocessing with transparent logging, we're building a solid foundation for uncovering meaningful insights into the relationship between depression and long-term survival after TBI.

1.11 Applying Study Eligibility Criteria

Introduction

We've cleaned, transformed, and enriched our data. Now it's time to apply our study eligibility criteria. This crucial step ensures that we're focusing on the right participants and that our data are truly ready for survival analysis.

Essentially, we are defining the specific requirements that participants must meet to be included in our final analytical sample. We also need to ensure that the data for each participant are complete and accurate enough to support the calculations required for survival modeling.

In this section, we'll apply three key eligibility criteria, refining our dataset step-by-step. We'll also introduce a helper function to streamline our calculations.

Step 1: Defining Our Time Window - The is_within_study_period Helper Function

Before we apply our criteria, let's define a handy helper function called is_within_study_period. This function will help us determine whether a participant's follow-up data fall within our defined study period.

is_within_study_period <- function(date_of_year_1_followup, date_of_followup) {
  participant_end_date <- date_of_year_1_followup + years(5)
  !is.na(date_of_followup) & date_of_followup >= study_entry_period_start_date & date_of_followup <= participant_end_date
}

What It Does
  • Takes two arguments: date_of_year_1_followup (the participant's Year 1 follow-up date) and date_of_followup (the date of a specific follow-up observation).

  • Calculates the participant's individual observation end date (participant_end_date) by adding five years to their date_of_year_1_followup.

  • Checks if the date_of_followup is not missing (!is.na()) and falls within the study period, considering both the study's overall start date and the participant's individual end date.

Why It's Important
  • Dynamic Time Window: This function allows us to define a specific five-year observation window for each participant, starting from their Year 1 follow-up date.

  • Ensures Data Relevance: It helps us ensure that we're only using follow-up data that is relevant to our research question.
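
One practical note: is_within_study_period reads study_entry_period_start_date from the calling environment rather than taking it as an argument, so that constant (and its end-date counterpart used later) must be defined beforehand. A minimal sketch, using the enrollment window described under Criterion 1 below:

library(lubridate)  # provides years(), used inside the helper

# Study enrollment window: October 1, 2006 through October 1, 2012
study_entry_period_start_date <- as.Date("2006-10-01")
study_entry_period_end_date   <- as.Date("2012-10-01")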

Step 2: Applying Criterion 1 - Defining the Study Population and Calculating Time-to-Event

We've now arrived at the application of our first study eligibility criterion. This is where we start shaping our dataset to include only those participants who meet the specific requirements of our research question. Criterion 1 focuses on the timing of participants' enrollment in the study and ensures that we have the necessary data to calculate their survival times.

Eligibility Criterion 1: Defining the Inclusion Window

Our first criterion is all about timing. To be included in our final analytical sample, participants need to meet the following conditions:

  • Year 1 Follow-Up Within Study Period: Their Year 1 follow-up interview date must fall within our defined study period, which spans from October 1, 2006, to October 1, 2012.

  • Valid Follow-Up or Event: They must have at least one valid follow-up observation or a recorded death (event) within their individual five-year observation window (calculated as five years from their Year 1 follow-up date).

Addressing Special Cases

Real-world data collection is rarely perfect. In longitudinal studies like the TBIMS, it's common for some participants to have incomplete follow-up data. This could be due to various reasons, such as:

  • Loss to Follow-Up: Researchers are unable to contact the participant for subsequent assessments.

  • Refusal to Participate: Participants may choose to withdraw from the study at some point.

  • Incarceration: Participants may become incarcerated, making it difficult or impossible to collect follow-up data.

  • Administrative Barriers: Sometimes, external factors like funding limitations can hinder data collection efforts.

We've defined these situations as "special cases" and created a variable called special_case to flag these observations in our dataset.

Why This Matters

Simply excluding participants with incomplete follow-up could lead to a significant reduction in our sample size and, more importantly, introduce bias into our analysis. For instance, participants who are lost to follow-up might have different characteristics or outcomes than those who remain in the study.

To address this, the apply_criterion_1 function intelligently handles these special cases, retaining their last valid observation under specific conditions. This helps us preserve valuable data while maintaining the integrity of our analysis.

The apply_criterion_1 Function: A Deep Dive

The apply_criterion_1 function is the engine that drives this entire process. It performs several crucial operations:

  1. Filtering Participants: It selects the participants who meet our Year 1 follow-up timing requirement.

  2. Handling Special Cases: It intelligently deals with participants who have incomplete follow-up data due to circumstances like being lost to follow-up or refusing further participation.

  3. Calculating Key Survival Variables: It creates the essential event_status and time_to_event variables, which are the foundation of our survival analysis.

  4. Logging Changes: It meticulously documents the impact of applying this criterion on our dataset, ensuring transparency and reproducibility.

Let's break down the key parts of the function:

  1. Filtering by Date
data <- data |>
  filter(date_of_year_1_followup >= study_entry_period_start_date & date_of_year_1_followup <= study_entry_period_end_date)

This code snippet filters our dataset, retaining only those participants whose date_of_year_1_followup falls within the bounds of our study_entry_period_start_date (October 1, 2006) and study_entry_period_end_date (October 1, 2012). This ensures that we're focusing on the correct cohort.

  2. Handling Special Cases

The function then defines two important flags:

  • valid_followup: This flag indicates whether a follow-up observation or a death date falls within the participant's five-year observation window. This is determined using the is_within_study_period helper function that we defined earlier.

  • special_case: This flag identifies observations where follow-up data might be incomplete due to specific circumstances (e.g., participant refusal, loss to follow-up, administrative limitations).

valid_followup = is_within_study_period(date_of_year_1_followup, date_of_followup) |
                 is_within_study_period(date_of_year_1_followup, date_of_death)

special_case = status_at_followup %in% c("Lost", "Refused", "Incarcerated", "Withdrew", "No Funding")

  3. Determining Observation Inclusion

Next, the function determines which observations to keep based on these flags and the data_collection_period:

include_observation = if_else(
  data_collection_period == 0,
  TRUE,
  valid_followup | (special_case & lead(valid_followup, default = FALSE))
)

Here's the logic:
  • Baseline Data: All baseline observations (where data_collection_period == 0) are always retained.

  • Follow-Up Data: A follow-up observation is retained if:

    • It's a valid_followup (i.e., the follow-up or death date falls within the five-year window), OR

    • It's flagged as a special_case and the next observation for that participant is a valid_followup. This ensures that we keep the last available observation for participants with incomplete data, as long as there's a valid subsequent follow-up.

The data are then filtered using this include_observation variable to only retain the desired observations:

filter(include_observation)

  4. Calculating Key Survival Variables

The function then calculates the crucial variables needed for survival analysis:

  • event_status: A binary indicator (0 or 1) signaling whether the participant experienced the event of interest (death) during their observation period.

event_status = if_else(!is.na(date_of_death) & date_of_death <= participant_end_date, 1, 0)

  • time_to_event: This variable measures the time (in days) from the Year 1 follow-up to either the event (death) or the last valid follow-up (censoring). It uses two helper variables:

    • time_to_censorship: Time to the last follow-up for censored participants.

    • time_to_expiration: Time to death for participants who died.

time_to_censorship = if_else(
  is.na(date_of_death) & is.na(date_of_followup), 0,
  if_else(!is.na(date_of_followup),
    as.numeric(difftime(date_of_followup, date_of_year_1_followup, units = "days")), NA_real_
  )
)

time_to_expiration = if_else(
  is.na(date_of_death) & is.na(date_of_followup), 0,
  if_else(!is.na(date_of_death),
    as.numeric(difftime(date_of_death, date_of_year_1_followup, units = "days")), NA_real_
  )
)

time_to_event = if_else(!is.na(date_of_death), time_to_expiration, time_to_censorship)

  5. Logging for Transparency

Finally, the function uses our logging tools to document the impact of applying Criterion 1:

log_sample_size(data, 1, log_dir = here::here("Logs"), original_data)
log_removed_ids(original_data, data, 1, log_dir = here::here("Logs"))

This ensures that we have a clear record of how our sample size changed and which participants were excluded at this step.

The Complete apply_criterion_1 Function

Here's the full code for the function, incorporating all of the steps described above:

apply_criterion_1 <- function(data, study_entry_period_start_date, study_entry_period_end_date) {
  original_data <- data  # Store the original data for logging removed IDs

  data <- data |>
    filter(date_of_year_1_followup >= study_entry_period_start_date & date_of_year_1_followup <= study_entry_period_end_date) |>
    arrange(id, date_of_followup, date_of_death) |>
    group_by(id) |>
    mutate(
      participant_end_date = date_of_year_1_followup + years(5),
      valid_followup = is_within_study_period(date_of_year_1_followup, date_of_followup) |
                       is_within_study_period(date_of_year_1_followup, date_of_death),
      special_case = status_at_followup %in% c("Lost", "Refused", "Incarcerated", "Withdrew", "No Funding"),
      include_observation = if_else(data_collection_period == 0, TRUE,
                                     valid_followup | (special_case & lead(valid_followup, default = FALSE))),
      calendar_year_of_event = if_else(!is.na(date_of_death), year(date_of_death),
                                      if_else(!is.na(date_of_followup), year(date_of_followup), NA_integer_)),
      calendar_year_of_injury = year(date_of_injury)
    ) |>
    filter(include_observation) |>
    mutate(
      event_status = if_else(!is.na(date_of_death) & date_of_death <= participant_end_date, 1, 0),
      time_to_censorship = if_else(is.na(date_of_death) & is.na(date_of_followup), 0,
                                   as.numeric(difftime(date_of_followup, date_of_year_1_followup, units = "days"))
      ),
      time_to_expiration = as.numeric(difftime(date_of_death, date_of_year_1_followup, units = "days")),
      time_to_event = if_else(!is.na(date_of_death), time_to_expiration, time_to_censorship)
    ) |>
    ungroup() |>
    select(
      id, data_collection_period, status_at_followup, event_status,
      time_to_event, time_to_censorship, time_to_expiration,
      calendar_year_of_injury, calendar_year_of_event, date_of_year_1_followup,
      date_of_followup, date_of_death, participant_end_date, everything()
    ) |>
    select(-valid_followup, -special_case, -include_observation) |>
    arrange(id, data_collection_period)

  log_sample_size(data, 1, log_dir = here::here("Logs"), original_data)
  log_removed_ids(original_data, data, 1, log_dir = here::here("Logs"))

  return(data)
}

Why This Matters

The apply_criterion_1 function is more than just a set of code instructions. It embodies careful consideration of our study's requirements and the nuances of our data. By applying this criterion, we ensure that:

  • Our analysis focuses on the correct participants: Those who had their Year 1 follow-up within the defined study period.

  • We handle incomplete data intelligently: Special cases are managed appropriately, preserving valuable information while maintaining data integrity.

  • We have accurately calculated the essential survival variables: event_status and time_to_event are now defined, forming the foundation for our subsequent analyses.

  • Our process is transparent and reproducible: The logging functions document every step, allowing others to understand and replicate our work.

With the application of Criterion 1, our dataset is taking shape. We're moving closer to building our survival models and uncovering the factors that influence long-term outcomes after TBI. In the next steps, we'll apply the remaining eligibility criteria, further refining our dataset and setting the stage for insightful and impactful analyses.

Step 3: Applying Eligibility Criterion 2 - Excluding Early Mortality and Invalid Survival Times

We're now ready to apply our second eligibility criterion, which focuses on maintaining the integrity of our data by excluding participants with early mortality or invalid survival times. This step is crucial for ensuring that our survival models are based on biologically plausible and methodologically sound data.

Why This Criterion is Essential: Focusing on Meaningful Survival Data

Survival analysis relies on accurate time_to_event calculations. However, two scenarios can create problematic data points:

  1. Early Mortality: Participants whose recorded date of death falls on the same day as their Year 1 follow-up, giving a time_to_event of 0. These cases provide no information about long-term survival, which is the focus of our study. (Deaths on the day of injury can be informative in studies of short-term survival, but our clock starts at the Year 1 follow-up.) Because these participants completed a Year 1 interview, a death recorded on that same date is most likely a data entry error.

  2. Negative Survival Times: These are likely due to data entry errors, such as an incorrect follow-up date or date of death. Negative survival times are logically impossible and must be excluded.

Including these cases in our analysis could distort our survival models and lead to misleading conclusions. Criterion 2 helps us identify and remove these problematic data points.

Implementation: The apply_criterion_2 Function

The apply_criterion_2 function is designed to filter out these invalid observations while preserving valuable baseline information. Let's break down how it works:

apply_criterion_2 <- function(data) {
  original_data <- data  # Store the original data for logging the removed IDs

  data <- data |>
    group_by(id) |>
    mutate(last_observation = row_number() == n()) |>
    filter(time_to_event > 0 | (data_collection_period == 0 & !last_observation)) |>
    ungroup() |>
    select(-last_observation)

  # Log the sample size after applying Criterion 2
  log_sample_size(data, 2, log_dir = here::here("Logs"))

  # Log the removed IDs
  log_removed_ids(original_data, data, 2, log_dir = here::here("Logs"))

  return(data)
}

  1. Store Original Data: As always, we create a copy of the input dataset (original_data) for logging purposes.

  2. Group by ID: We group the data by participant ID using group_by(id) to prepare for identifying the last observation for each participant.

  3. Identify Last Observation:

    • mutate(last_observation = row_number() == n()): This creates a temporary flag, last_observation, which is TRUE if the observation is the last one for that participant (based on the row number within the group) and FALSE otherwise.

  4. Filter Invalid Observations:

    • filter(time_to_event > 0 | (data_collection_period == 0 & !last_observation)): This is the core of Criterion 2. It filters the data based on these conditions:

      • time_to_event > 0: Retains observations where the time_to_event is greater than 0, thus excluding early mortality (where time_to_event is 0) and negative survival times.

      • data_collection_period == 0 & !last_observation: Retains baseline observations (where data_collection_period is 0) provided they are not the participant's last observation. This keeps baseline data even for participants whose follow-up records fail the time_to_event check. We make this exception because our final analytic dataset will contain one record per participant, and we don't want to drop participants solely because of missing or erroneous follow-up data.

  5. Ungroup and Remove Temporary Variable: The data is ungrouped using ungroup() and the temporary last_observation flag is removed.

  6. Log Changes: The log_sample_size and log_removed_ids functions document the impact of applying this criterion.

  7. Return Data: The function returns the modified dataset with invalid cases removed.

Example Walkthrough: Illustrating the Filtering Process

Let's see how apply_criterion_2 works with a simplified example.

Input Dataset
(interactive table not rendered here)

Filtering Steps
  1. Identify Last Observations: The last_observation flag is applied:

    • For id = 1, the second row is the last observation.

    • For id = 2, the second row is the last observation.

    • For id = 3, the first row is the last observation.

    • For id = 4, the first row is the last observation.

  2. Apply Exclusion Criteria:

    • id = 1, row 1: Retained because it's a baseline observation (data_collection_period == 0) and not the last observation.

    • id = 1, row 2: Retained because time_to_event > 0.

    • id = 2, row 1: Excluded because time_to_event = 0, and it is the last observation (redundant).

    • id = 2, row 2: Retained because time_to_event > 0.

    • id = 3, row 1: Excluded because time_to_event = 0, and it is the last observation (redundant).

    • id = 4, row 1: Retained because time_to_event > 0.

  3. Log Changes:

    • Removed IDs:

      • id = 2 (row 1): Removed due to redundancy (time_to_event = 0 and it is the last observation).

      • id = 3 (row 1): Removed due to redundancy (time_to_event = 0 and it is the last observation).

Output Dataset
(interactive table not rendered here)
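
Here is a small, self-contained sketch of the same filtering logic using invented toy values (this is illustrative only, not the example data from the tables above):

library(dplyr)

# Toy data: values are invented purely to illustrate the filter
toy <- tibble::tribble(
  ~id, ~data_collection_period, ~time_to_event,
    1,                       0,            450,  # baseline row, retained
    1,                       1,            450,  # positive survival time, retained
    2,                       1,              0,  # zero survival time, excluded
    3,                       1,            -30   # negative survival time (data error), excluded
)

toy |>
  group_by(id) |>
  mutate(last_observation = row_number() == n()) |>
  filter(time_to_event > 0 | (data_collection_period == 0 & !last_observation)) |>
  ungroup() |>
  select(-last_observation)
# Only participant 1's two rows survive the filter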

Key Takeaways
  • Criterion 2 Safeguards Data Integrity: By excluding participants with invalid survival times, we ensure that our models are based on plausible data.

  • Prioritizes Baseline Data: The function is designed to retain baseline observations whenever possible, preserving valuable information about participants' initial characteristics.

  • Transparency Through Logging: Our logging functions provide a clear record of the changes made to the dataset, ensuring transparency and reproducibility.

Step 4: Applying Eligibility Criterion 3 - Ensuring Sufficient Date Data for Survival Time Calculations

We've arrived at our third and final eligibility criterion: ensuring that all participants in our analytic sample have sufficient date data to calculate their survival times. This is a non-negotiable requirement for survival analysis, as we simply cannot compute time_to_event without accurate and complete date information.

Think of it like this: you can't determine how long a journey took if you don't know when it started or ended. Similarly, we need clearly defined starting and ending points (or censoring dates) to calculate survival times.

Why This Criterion is Crucial: No Dates, No Survival Analysis

The core of survival analysis is understanding the time elapsed between a starting point and an event (or censoring). If participants are missing crucial dates—like their date_of_followup or date_of_death—their time_to_event becomes ambiguous or impossible to determine.

Including such participants in our analysis would:

  • Introduce Bias: Survival models rely on accurate time-to-event data. Missing or undefined survival times could distort the results and lead to misleading conclusions.

  • Compromise Model Validity: Most statistical software packages will either throw errors or quietly drop participants with missing survival times, leading to a loss of data and potentially biased results.

Implementation: The apply_criterion_3 Function

The apply_criterion_3 function is designed to address this issue by retaining only those participants with sufficient date information. Let's break down the code:

apply_criterion_3 <- function(data, tbims_form1_labels, tbims_form2_labels, baseline_name_and_na_mappings, followup_name_and_na_mappings) {
  original_data <- data  # Preserve the original dataset for logging purposes

  # Sort data by ID and relevant dates
  data <- data |>
    arrange(id, date_of_followup, date_of_death) |>
    group_by(id) |>
    mutate(
      is_last_observation = lead(id, default = last(id)) != id,
      last_valid_date = pmax(date_of_followup, date_of_death, na.rm = TRUE)
    ) |>
    # Retain participants with at least one valid date
    filter(any(!is.na(last_valid_date))) |>
    ungroup() |>
    select(-is_last_observation, -last_valid_date)

  # Apply variable labels to renamed variables for readability
  data <- apply_labels_to_renamed_vars(
    data,
    tbims_form1_labels,
    tbims_form2_labels,
    baseline_name_and_na_mappings,
    followup_name_and_na_mappings
  )

  # Log the resulting sample size and IDs of excluded participants
  log_sample_size(data, 3, log_dir = here::here("Logs"))
  log_removed_ids(original_data, data, 3, log_dir = here::here("Logs"))

  return(data)
}

  1. Preserve Original Data: original_data <- data: We store a copy of the original data for logging purposes, allowing us to track any exclusions.

  2. Arrange and Group: arrange(id, date_of_followup, date_of_death) sorts the data by participant ID, follow-up date, and death date. group_by(id) groups the data by participant ID, preparing for calculations within each participant's record set.

  3. Identify Last Observation: We create a temporary variable is_last_observation to indicate if an observation is the last one for a given participant.

  4. Determine Last Valid Date:

    • last_valid_date = pmax(date_of_followup, date_of_death, na.rm = TRUE): This is the core of Criterion 3. For each participant, it finds the most recent valid date between date_of_followup and date_of_death. The pmax() function returns the element-wise maximum of the input vectors, and na.rm = TRUE ensures that missing values are ignored.

  5. Filter Participants:

    • filter(any(!is.na(last_valid_date))): This crucial line filters the data at the participant level. It retains only those participants who have at least one non-missing last_valid_date across any of their observations. In other words, if a participant has a valid date_of_followup or date_of_death (or both) in any of their records, they are kept.

  6. Ungroup and Remove Temporary Variables: The data is ungrouped using ungroup(), and the temporary variables is_last_observation and last_valid_date are removed from the dataset.

  7. Reapply Variable Labels: data <- apply_labels_to_renamed_vars(...): After all transformations, we reapply the original variable labels to ensure that our dataset remains interpretable.

  8. Log Changes: The log_sample_size and log_removed_ids functions document the impact of applying this criterion on our dataset.

  9. Return Data: The function returns the modified dataset.

Example: Illustrating the Filtering Process

Let's consider a simplified example to see how this works in practice.

Input Dataset
(interactive table not rendered here)

Filtering Steps
  1. Calculate last_valid_date:

    • For id = 1, last_valid_date would be 2006-10-10.

    • For id = 2, last_valid_date would be NA (since both dates are missing).

    • For id = 3, last_valid_date would be 2008-03-15.

    • For id = 4, last_valid_date would be 2009-05-20.

  2. Filter Participants:

    • Participant id = 2 would be excluded because they have no valid last_valid_date.

    • All other participants would be retained because they have at least one valid date.

Output Dataset
(interactive table not rendered here)

As you can see, participant 2 is removed because they are missing both the follow-up date and the date of death, making it impossible to calculate their survival time.
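
To make this concrete, here is a small sketch reproducing the same decision with toy data (the dates follow the walkthrough above; whether each date is a follow-up or a death date is assumed for illustration):

library(dplyr)

# Toy data: one row per participant; NA means the date was never recorded
toy_dates <- tibble::tibble(
  id               = 1:4,
  date_of_followup = as.Date(c("2006-10-10", NA, "2008-03-15", NA)),
  date_of_death    = as.Date(c(NA, NA, NA, "2009-05-20"))
)

toy_dates |>
  group_by(id) |>
  mutate(last_valid_date = pmax(date_of_followup, date_of_death, na.rm = TRUE)) |>
  filter(any(!is.na(last_valid_date))) |>
  ungroup() |>
  select(-last_valid_date)
# Participant 2 is dropped: with no follow-up date and no death date,
# there is no way to calculate a survival time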

Applying All Three Criteria: A Refined Dataset

We apply all three eligibility criteria sequentially to progressively refine our dataset:

# Apply Study Eligibility Criterion 1
analytic_data <- apply_criterion_1(merged_data, study_entry_period_start_date, study_entry_period_end_date)

# Apply Study Eligibility Criterion 2
analytic_data <- apply_criterion_2(analytic_data)

# Apply Study Eligibility Criterion 3
analytic_data <- apply_criterion_3(analytic_data, tbims_form1_labels, tbims_form2_labels, baseline_name_and_na_mappings, followup_name_and_na_mappings)

Each step builds upon the previous one, ensuring that our final analytic_data dataset only includes participants who meet all of our eligibility requirements and have sufficient data for survival analysis.
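
Beyond the log files, a few one-liners give a quick sanity check on the refined dataset. A sketch, assuming the column names used throughout this series:

# How many participants and observations remain after all three criteria?
dplyr::n_distinct(analytic_data$id)
nrow(analytic_data)

# How many deaths (event_status = 1) versus censored observations?
table(analytic_data$event_status, useNA = "ifany")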

Key Takeaways: Ensuring Data Quality for Survival Analysis

By applying Criterion 3, we've taken a critical step toward ensuring the quality and integrity of our data:

  • Data Completeness: We've excluded participants with missing or ambiguous date information, preventing potential errors in our survival time calculations.

  • Focus on Valid Cases: Our analysis will now focus on participants for whom we can accurately determine survival times.

  • Transparency and Reproducibility: Our logging functions provide a clear record of the impact of this criterion, ensuring that our data preprocessing steps are transparent and reproducible.

Looking Ahead: From Refined Data to Meaningful Insights

With our eligibility criteria applied and our dataset refined, we're now ready to move on to the final stages of data preparation. In the next sections, we'll explore techniques for imputing missing values in our Year 1 variables, and then we'll create our final analytic dataset, selecting a single representative record for each participant. We'll then be fully equipped to build our Cox regression models and uncover the crucial relationship between depression and survival after TBI!

1.12 Deriving Depression Level at Year 1

Introduction

We've prepared our dataset, and now it's time to create the star of our analysis: Depression Level at Year 1. This variable—derived from participant responses to the Patient Health Questionnaire (PHQ-9) during their first-year follow-up interview—will serve as our primary exposure variable. It's the key predictor that we'll be using in our Cox regression models to investigate the relationship between depression and all-cause mortality within five years of the initial interview.

Why the PHQ-9?

The PHQ-9 is a widely used and validated screening tool for depression. It asks participants to rate the frequency of nine depression symptoms over the past two weeks, using a scale from 0 (Not at All) to 3 (Nearly Every Day).
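
For reference, the four response options can be kept on hand as a small lookup vector. The two middle labels follow the standard PHQ-9 wording, which isn't spelled out in this post, so treat them as an assumption:

# PHQ-9 response scale: frequency ratings (0-3) for each of the nine symptoms
phq9_response_scale <- c(
  "0" = "Not at all",
  "1" = "Several days",             # standard PHQ-9 wording (assumed)
  "2" = "More than half the days",  # standard PHQ-9 wording (assumed)
  "3" = "Nearly every day"
)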

From Raw Scores to Meaningful Categories

Our goal is to transform these raw PHQ-9 responses into a clinically meaningful depression_level_at_year_1 variable. We'll achieve this by creating a function called calculate_depression_level that generates three new variables:

  1. positive_symptoms_at_year_1: This variable simply counts the total number of PHQ-9 symptoms endorsed by each participant at their Year 1 follow-up.

  2. cardinal_symptoms_at_year_1: This categorical variable captures whether a participant endorsed either or both of the two cardinal symptoms of depression: anhedonia (loss of interest or pleasure) and depressed mood.

  3. depression_level_at_year_1: This is our main exposure variable. It classifies participants into three categories based on their Year 1 PHQ-9 responses:

    • No Depression

    • Minor Depression

    • Major Depression

The calculate_depression_level Function: A Deep Dive

Let's dissect the calculate_depression_level function to understand how it works:

calculate_depression_level <- function(data) {
  data <- data |>
    mutate(
      # Initialize new columns with default NA values
      positive_symptoms_at_year_1 = NA_real_,
      cardinal_symptoms_at_year_1 = factor(NA, levels = c("0", "1", "2", "3")),
      depression_level_at_year_1 = factor(NA, levels = c("0", "1", "2"))
    ) |>
    rowwise() |>
    mutate(
      positive_symptoms_at_year_1 = if_else(data_collection_period == 1,
        sum(c_across(starts_with("phq")) >= 1),
        NA_real_
      ),
      cardinal_symptoms_at_year_1 = factor(if_else(data_collection_period == 1,
        case_when(
          phq1 < 1 & phq2 < 1 ~ "0",  # Denied both cardinal symptoms
          phq1 >= 1 & phq2 < 1 ~ "1", # Endorsed anhedonia only
          phq1 < 1 & phq2 >= 1 ~ "2", # Endorsed depressed mood only
          phq1 >= 1 & phq2 >= 1 ~ "3" # Endorsed both cardinal symptoms
        ),
        NA_character_
      ),
      levels = c("0", "1", "2", "3"),
      labels = c("None", "Anhedonia", "Depressed Mood", "Both")
      ),
      depression_level_at_year_1 = factor(if_else(data_collection_period == 1,
        case_when(
          positive_symptoms_at_year_1 <= 1 | (phq1 < 1 & phq2 < 1) ~ "0",  # 'No Depression'
          positive_symptoms_at_year_1 <= 4 & (phq1 >= 1 | phq2 >= 1) ~ "1", # 'Minor Depression'
          positive_symptoms_at_year_1 >= 5 & (phq1 >= 1 | phq2 >= 1) ~ "2"  # 'Major Depression'
        ),
        NA_character_
      ),
      levels = c("0", "1", "2"),
      labels = c("No Depression", "Minor Depression", "Major Depression")
      )
    ) |>
    ungroup()

  return(data)
}

Let's break down the code step-by-step.

  1. Initialization:

    • positive_symptoms_at_year_1 = NA_real_: Creates a new column named positive_symptoms_at_year_1 and initially fills it with NA_real_ (a special type of NA for numeric values).

    • cardinal_symptoms_at_year_1 = factor(NA, levels = c("0", "1", "2", "3")): Creates a new factor variable named cardinal_symptoms_at_year_1 with four possible levels (0, 1, 2, 3) and initially fills it with NA values.

    • depression_level_at_year_1 = factor(NA, levels = c("0", "1", "2")): Creates a new factor variable named depression_level_at_year_1 with three levels (0, 1, 2) and initially fills it with NA values.

  2. rowwise(): This crucial function tells R to perform the subsequent calculations row by row. This is essential because we need to evaluate the PHQ-9 responses for each participant individually.

  3. Calculating positive_symptoms_at_year_1:

    • if_else(data_collection_period == 1, ... , NA_real_): We only calculate this variable for Year 1 data (data_collection_period == 1).

    • sum(c_across(starts_with("phq")) >= 1): This is the core of the calculation.

      • c_across(starts_with("phq")) selects all columns starting with "phq" (i.e., the nine PHQ-9 items).

      • >= 1 checks if each PHQ-9 item is at least 1 (meaning the symptom was present to some degree).

      • sum(…) adds up the number of TRUE values, effectively counting the number of endorsed symptoms.

  4. Determining cardinal_symptoms_at_year_1:

    • if_else(data_collection_period == 1, ... , NA_character_): Again, we only calculate this for Year 1 data.

    • case_when(…): This function allows us to define different conditions and their corresponding outcomes:

      • phq1 < 1 & phq2 < 1 ~ "0": If both phq1 (anhedonia) and phq2 (depressed mood) are less than 1 (not endorsed), assign 0 (None).

      • phq1 >= 1 & phq2 < 1 ~ "1": If phq1 is at least 1 but phq2 is less than 1, assign 1 (Anhedonia only).

      • phq1 < 1 & phq2 >= 1 ~ "2": If phq1 is less than 1 but phq2 is at least 1, assign 2 (Depressed Mood only).

      • phq1 >= 1 & phq2 >= 1 ~ "3": If both phq1 and phq2 are at least 1, assign 3 (Both).

    • levels = c("0", "1", "2", "3"), labels = c("None", "Anhedonia", "Depressed Mood", "Both"): We define the factor levels and their corresponding labels to make the variable more interpretable.

  5. Assigning depression_level_at_year_1:

    • if_else(data_collection_period == 1, ... , NA_character_): This is calculated only for Year 1 data.

    • case_when(…): We use case_when again to define the criteria for each depression level.

      • positive_symptoms_at_year_1 <= 1 | (phq1 < 1 & phq2 < 1) ~ "0": "No Depression" if the participant endorsed 1 or fewer symptoms or denied both cardinal symptoms.

      • positive_symptoms_at_year_1 <= 4 & (phq1 >= 1 | phq2 >= 1) ~ "1": "Minor Depression" if the participant endorsed between 2 and 4 symptoms and at least one of the cardinal symptoms.

      • positive_symptoms_at_year_1 >= 5 & (phq1 >= 1 | phq2 >= 1) ~ "2": "Major Depression" if the participant endorsed 5 or more symptoms and at least one of the cardinal symptoms.

    • levels = c("0", "1", "2"), labels = c("No Depression", "Minor Depression", "Major Depression"): We define the factor levels and labels for clear interpretation.

  6. ungroup(): We remove the rowwise grouping, allowing for further data manipulation.

  7. return(data): The function returns the modified dataset with the three new depression-related variables.

Example: Bringing the Code to Life

Let's see how this function transforms some sample data:

Input Dataset
(interactive table not rendered here)
Output Dataset
(interactive table not rendered here)
Explanation
  • Participant 1: Endorsed both cardinal symptoms but fewer than five symptoms in total, and is classified as having "Minor Depression" (under the DSM-IV-based thresholds, minor depression requires 2-4 symptoms, including at least one cardinal symptom).

  • Participant 2: Endorsed 0 symptoms and is classified as having "No Depression."

  • Participant 3: Endorsed 9 symptoms, including both cardinal symptoms, and is classified as having "Major Depression."
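
Here is a small stand-alone sketch with invented PHQ-9 responses that exercises all three classifications (it assumes dplyr is loaded and calculate_depression_level is defined as above):

# Toy Year 1 responses: values invented for illustration
toy_phq <- tibble::tibble(
  id = 1:3,
  data_collection_period = 1,
  phq1 = c(2, 0, 3), phq2 = c(1, 0, 3), phq3 = c(1, 0, 2),
  phq4 = c(1, 0, 2), phq5 = c(0, 0, 1), phq6 = c(0, 0, 1),
  phq7 = c(0, 0, 2), phq8 = c(0, 0, 1), phq9 = c(0, 0, 1)
)

calculate_depression_level(toy_phq) |>
  dplyr::select(id, positive_symptoms_at_year_1,
                cardinal_symptoms_at_year_1, depression_level_at_year_1)
# id 1: 4 symptoms incl. both cardinal symptoms -> Minor Depression
# id 2: 0 symptoms                              -> No Depression
# id 3: 9 symptoms incl. both cardinal symptoms -> Major Depression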

Key Takeaways: Why This Matters for Our Analysis

This derivation of depression_level_at_year_1 is crucial because:

  • Creates Our Primary Predictor: This variable will be the key exposure in our Cox regression models, allowing us to examine its association with mortality risk.

  • Clinical Relevance: The categories ("No Depression," "Minor Depression," "Major Depression") are based on established clinical criteria, making our findings more meaningful and applicable to real-world settings.

  • Sets the Stage for Imputation: We've now defined our depression variable, and in the next step, we'll address the issue of missing values in this crucial variable.

By transforming raw PHQ-9 responses into a well-defined, clinically relevant depression variable, we're taking a significant step toward building an insightful survival analysis. In the next post, we'll tackle the challenge of missing data in our Year 1 variables, using imputation to ensure that we can make the most of the valuable information in our TBIMS dataset.

Conclusion

We've reached a pivotal point in our exploration of post-TBI survival. This post marked the culmination of our data preprocessing journey, transforming a raw, complex dataset into a refined and reliable foundation for analysis. We are now better equipped to investigate our core research question: How do depression levels one year after a traumatic brain injury (TBI) influence the five-year risk of all-cause mortality?

Our data preparation has ensured that our dataset meets the rigorous demands of survival analysis while preserving valuable information. Each step—from implementing a transparent logging system to deriving clinically meaningful variables—has been crucial for the integrity and trustworthiness of our findings.

A Journey Recapped: Key Accomplishments

Let's reflect on the significant strides we've made in this post:

  • Transparent Logging for Reproducibility: We established a robust logging system to document every data transformation. This ensures complete transparency and reproducibility, providing a clear audit trail that tracks changes in sample size and identifies any excluded participants.

  • Rigorous Eligibility Criteria: We carefully applied three key eligibility criteria to refine our dataset, ensuring that only participants with valid and relevant data were included. This process guaranteed the quality and consistency of our time-to-event calculations and eliminated biologically implausible survival times.

  • Crafting a Powerful Predictor: We derived a clinically meaningful measure of depression severity—our primary predictor—from PHQ-9 responses. This variable will be central to our survival models, allowing us to examine the nuanced impact of depression on mortality.

Why This Matters: The Foundation of Sound Research

These preprocessing steps are the bedrock of robust and reliable research. By addressing missing data, refining eligibility, and creating clinically relevant variables, we have:

  • Elevated Data Quality: We've ensured that our models are built upon a foundation of accurate, consistent, and reliable data, minimizing bias and maximizing validity.

  • Optimized Analytical Power: We've retained as much valuable information as possible while upholding the highest standards of data integrity, allowing for more powerful and nuanced analyses.

  • Paved the Way for Trustworthy Insights: We've established a robust framework that will yield clear, interpretable findings with direct relevance to improving clinical practice and public health outcomes for individuals with TBI.

The Road Ahead: Transforming Data into Insights

Our data preparation has laid a solid foundation for exploring the complex interplay between depression and survival after TBI. We've addressed missing data, crafted clinically relevant variables, and rigorously refined our dataset. Now, we embark on a new phase of our journey, transforming our prepared data into actionable insights, starting with extracting and imputing our Year 1 variables.

What's Next on Our Journey

In the upcoming blog post, we will focus on maximizing the value of our Year 1 data and finalizing our analytic dataset:

  • Unlocking Year 1 Data: We will extract key variables from the Year 1 assessment and strategically impute them across all observations for each participant. This crucial step ensures that every record in our dataset carries essential Year 1 information, regardless of when the observation occurred. This process will enhance data completeness, avoid unnecessary exclusions, maintain data integrity, and facilitate record selection for our final analytic dataset.

  • Creating Robust Mental Health History Variables: We will construct new variables that capture participants' self-reported histories of suicide attempts, mental health treatment, and psychiatric hospitalizations. These variables will provide a more holistic view of participants' mental health status beyond their Year 1 depression scores, allowing us to explore the potential impact of prior mental health challenges on long-term survival.

  • Refining Time and Age Variables: To enhance interpretability and align with standard practices, we will convert our time-to-event variables from days to years. Additionally, we will calculate age-related variables to provide valuable context regarding the timing of events or censoring.

  • Strategically Organizing Our Data: We will carefully reorganize our dataset by grouping related variables and placing key variables in a logical order. This seemingly simple step will significantly improve the readability and usability of our data, streamlining subsequent analyses.

  • Selecting the Representative Record: We will identify and select a single, representative record (the last valid observation) for each participant. This process will transform our longitudinal dataset into a cross-sectional format suitable for many survival analysis techniques, including Cox regression.

  • Finalizing the Analytic Dataset: We will perform a final check of our data, ensuring that all variables are in the correct format and that factor variables have appropriate labels and reference levels. The finalized dataset will be saved in both .rds and .csv formats for easy access and reproducibility.

From Data to Discovery

These steps represent the final bridge between data preparation and insightful analysis. By addressing these details, we ensure that our subsequent analyses are built upon a solid foundation of high-quality, well-structured data.

Once these steps are complete, we will be fully equipped to:

  • Generate Descriptive Statistics: We'll create comprehensive tables summarizing key characteristics of our study population, providing a detailed overview of demographics, injury characteristics, and mental health variables.

  • Visualize Our Data: We'll use informative plots like histograms, box plots, and Kaplan-Meier curves to explore distributions, relationships between variables, and survival patterns.

These descriptive explorations will pave the way for deeper understanding and inform the development of our survival models. We're poised to transform this carefully prepared data into meaningful insights that can contribute to improving the lives of individuals with TBI. The journey continues—stay tuned for the next chapter!
