Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 4)

January 27, 2025

Featured

Research

Tutorials

Introduction

Welcome back to our exploration of survival analysis! We've meticulously cleaned, transformed, and refined our data. Now, we're ready to assemble the final pieces, creating a unified and powerful dataset primed for insightful analysis. In this installment, we'll tackle four crucial preprocessing steps that will transform our raw, fragmented data into a cohesive, analysis-ready format. These steps bring us significantly closer to answering our core research question: How do depression levels one year after traumatic brain injury (TBI) impact long-term survival?

This post will guide you through the following key stages:

1.13 Extracting and Imputing Year 1 Variables

We'll ensure that every participant's record, regardless of when it was collected, includes crucial Year 1 information. This includes depression levels, functional independence scores, and other key covariates. By imputing these variables across all observations, we create a complete and consistent dataset.

1.14 Creating Mental Health History Variables

We'll go beyond simple depression scores by constructing new variables that summarize participants' histories of suicide attempts, mental health treatment, and psychiatric hospitalizations. This provides a richer, more nuanced understanding of their mental health background.

1.15 Converting Time-to-Event Variables and Organizing Our Data

To enhance interpretability, we'll convert our time-to-event variables from days to years and calculate participants' ages at key time points. We'll then strategically reorganize our dataset, placing the most important variables front and center.

1.16 Selecting the Representative Record

Finally, we'll simplify our longitudinal dataset by selecting a single, representative record for each participant—typically their last valid observation. This aligns our data with the standard format for many survival analysis techniques.

Why These Steps Matter: Building a Narrative with Data

Data preprocessing is much more than just cleaning; it's about crafting a coherent narrative from our data. Each of these steps plays a vital role in shaping that narrative:

  • Year 1 Imputation: This ensures data completeness and consistency, allowing us to maximize the use of our valuable data and avoid unnecessary exclusions. It's like filling in missing pieces of a puzzle to reveal the full picture.

  • Mental Health History Variables: These variables add depth and context to our analysis, enabling us to explore the complex interplay between past mental health experiences and long-term survival after TBI. They provide a more holistic view of the individual beyond a single snapshot in time.

  • Time and Age Adjustments: Transforming our time variables into years and calculating age-related metrics makes our results more intuitive and comparable to other studies. It allows us to speak a common language in the field of survival analysis.

  • Representative Record Selection: This critical step streamlines our dataset, ensuring that each participant is represented by a single, meaningful observation that captures their most up-to-date status. It prepares our data for standard survival analysis methods.

A Holistic Approach to Data Preparation

By the end of this post, you'll have a clear understanding of how these four preprocessing steps work together to create a dataset that is not just clean and consistent but also rich in information and perfectly poised for exploration. We'll provide practical R code, clear explanations, and insights into how these choices align with best practices in survival analysis.

Whether you're preparing for your own research or honing your analytical skills, this installment offers the tools and strategies needed to transform complex, multi-faceted data into a powerful resource for generating meaningful and actionable insights. Let's dive in and continue building the foundation for our survival analysis journey!

1.13 Extracting and Imputing Year 1 Variables

Introduction

We're moving into the final stages of our data preprocessing journey! In this section, we tackle a crucial step: extracting key variables from the Year 1 assessment and imputing them across all observations for each participant. This process ensures that every record in our dataset has the essential Year 1 information, regardless of when the observation occurred.

This step is particularly important when working with longitudinal data, where participants might have missing data at various time points. By extracting and imputing Year 1 variables, we create a more complete and consistent dataset, setting the stage for more robust and reliable survival analysis.

Why Extract and Impute? The Importance of Year 1 Data

The Year 1 assessment in our TBIMS dataset provides a wealth of information about participants' status one year after their traumatic brain injury. This includes important variables like:

  • Depression Level: Our primary predictor variable, derived from the PHQ-9.

  • Functional Independence Scores: Measures of a participant's ability to perform daily activities.

  • Other Key Covariates: Variables like substance use and history of suicide attempts.

But here's the challenge: not all participants have complete data for every single Year 1 variable. Simply removing participants with any missing data would lead to unnecessary exclusions, reducing our sample size and potentially introducing bias.

Here's why imputation is crucial in this context:

  1. Data Completeness: Ensures that all observations for a participant carry the same Year 1 data, regardless of whether it was directly measured at that time point.

  2. Avoiding Unnecessary Exclusions: Missing values in key variables can lead to participants being dropped from the analysis. Imputation helps us retain these participants, maximizing our statistical power.

  3. Maintaining Data Integrity: By strategically imputing Year 1 data, we create a more complete and consistent picture of each participant's status at this critical time point.

  4. Facilitating Record Selection: Later in our process, we'll select a single "representative" record for each participant to use in our survival models. Imputing Year 1 data to all records ensures that we don't lose crucial information, regardless of which record is ultimately chosen.

The extract_and_impute_year_1_data Function: Our Imputation Powerhouse

The extract_and_impute_year_1_data function is designed to perform two essential tasks:

  1. Extract Year 1 Values: It identifies and extracts the values of specific variables from the Year 1 assessment records.

  2. Impute Across Observations: It then imputes (fills in) these Year 1 values across all observations for each participant, ensuring that every record has this crucial information.

Here's the code:

# Function to extract and impute Year 1 Data
extract_and_impute_year_1_data <- function(data) {
  # Specify the variables to extract and impute
  vars_to_extract <- c("drs_total_at_followup",
                       "fim_total_at_followup",
                       "gose_total_at_followup",
                       "problematic_substance_use_at_followup",
                       "suicide_attempt_hx_past_year_at_followup")
  year_1_vars_to_impute <- c("cardinal_symptoms_at_year_1",
                             "positive_symptoms_at_year_1",
                             "depression_level_at_year_1")

  # Extract Year 1 values and create new columns
  for (var in vars_to_extract) {
    # Replace '_followup' with '_year_1' in the variable name
    new_var_name <- sub("_followup", "_year_1", var)
    data <- data |>
      mutate(!!new_var_name := if_else(data_collection_period == 1, !!sym(var), NA))
    year_1_vars_to_impute <- c(year_1_vars_to_impute, new_var_name)
  }

  # Impute the values across all observations per participant
  imputed_data <- data |>
    group_by(id) |>
    fill(!!!syms(year_1_vars_to_impute), .direction = "downup")

  return(imputed_data)
}

Let's break down the code step-by-step:

  1. Specify Variables:

    • vars_to_extract: This list defines the variables that we want to extract from the Year 1 assessment records.

    • year_1_vars_to_impute: This list specifies the Year 1 variables that need to be imputed (including those we just extracted and those that were previously calculated, like depression_level_at_year_1).

  2. Extract Year 1 Values:

    • The code then loops through each variable in vars_to_extract.

    • new_var_name <- sub("_followup", "_year_1", var): For each variable, it creates a new variable name by replacing the suffix "_followup" with "_year_1" (e.g., drs_total_at_followup becomes drs_total_at_year_1).

    • data <- data |> mutate(!!new_var_name := if_else(data_collection_period == 1, !!sym(var), NA)): This creates the new Year 1 variable. If the observation is from Year 1 (data_collection_period == 1), it assigns the value from the original variable; otherwise, it assigns NA.

    • year_1_vars_to_impute <- c(year_1_vars_to_impute, new_var_name): The new Year 1 variable name is added to the year_1_vars_to_impute list, ensuring that it will be included in the imputation step.

  3. Impute Year 1 Values:

    • imputed_data <- data |> group_by(id) |> fill(!!!syms(year_1_vars_to_impute), .direction = "downup"): This is where the imputation happens.

      • group_by(id): We group the data by participant ID.

      • fill(!!!syms(year_1_vars_to_impute), .direction = "downup"): The fill() function from tidyr is used to impute the Year 1 variables. It fills missing values both downwards and upwards within each participant's records, using the non-missing Year 1 values.

  4. Return Imputed Data: The function returns the modified dataset (imputed_data) with the extracted and imputed Year 1 variables.

Applying the Function to Our Data

To apply this function to our data, we simply call it:

# Extract and impute the Year 1 variables of interest for the analytic sample
analytic_data_imputed <- extract_and_impute_year_1_data(analytic_data_depression)

This creates a new dataset, analytic_data_imputed, where all observations for each participant now have consistent Year 1 data for the specified variables.

Conceptual Considerations: Making Informed Choices

  1. Imputation Direction: We've used both downward and upward imputation (.direction = "downup"). This ensures that Year 1 data are propagated across all of a participant's records, maximizing data completeness.

  2. Variable Selection: The choice of which variables to extract and impute depends on their relevance to our research question and their importance as predictors or covariates in our survival models.

  3. Transparency: As always, documenting the imputation process is crucial for ensuring reproducibility and allowing others to understand our methodological choices.

Example: Visualizing the Transformation

Let's look at a simplified example to see how the function works:

Input Data: before imputation, Year 1 values (e.g., drs_total_at_year_1) are present only on the record where they were measured; all other records for that participant hold NA.

Output Data After Imputation: every record for a participant now carries the same Year 1 values.

Key Observations

  • New columns are created for Year 1 values (e.g., drs_total_at_year_1, fim_total_at_year_1).

  • Year 1 values are imputed across all observations for each participant. For example, participant 1 now has the same values for drs_total_at_year_1, fim_total_at_year_1, cardinal_symptoms_at_year_1, and depression_level_at_year_1 in both their Year 1 and Year 2 records.
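To make the transformation concrete, here's a runnable toy example (hypothetical ids and DRS values, not TBIMS data) showing how fill() with .direction = "downup" propagates Year 1 values within each participant:

```r
library(dplyr)
library(tidyr)

# Toy data: two participants with two records each. Participant 1's
# Year 1 value sits on their first row; participant 2's sits on their
# second row, so both fill directions get exercised.
toy <- tibble(
  id                     = c(1, 1, 2, 2),
  data_collection_period = c(1, 2, 2, 1),
  drs_total_at_year_1    = c(3, NA, NA, 5)
)

imputed <- toy |>
  group_by(id) |>
  fill(drs_total_at_year_1, .direction = "downup") |>
  ungroup()

imputed$drs_total_at_year_1
# 3 3 5 5: each participant's single Year 1 value now appears on every record
```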

Looking Ahead: Completing the Puzzle

With our Year 1 variables extracted and imputed, our dataset is rapidly taking shape. We're now incredibly close to finalizing our analytic dataset and preparing descriptive tables and plots!

In the next sections, we'll:

  • Create additional derived variables to further enhance our analysis.

  • Address any remaining imputation needs for variables beyond those collected at Year 1.

  • Select a single, representative record for each participant to use in our final analytic dataset.

By extracting and imputing Year 1 data, we've ensured that our survival analysis will be based on a complete, consistent, and information-rich dataset. We're well on our way to uncovering crucial insights into the relationship between depression and long-term survival after TBI.

1.14 Creating Mental Health History Variables

Introduction

As we continue to refine our dataset, we'll now focus on creating new variables that capture participants' self-reported histories of mental health experiences. Specifically, we'll look at suicide attempts, mental health treatment, and psychiatric hospitalizations.

These variables are essential for providing a more complete picture of participants' mental health status beyond their PHQ-9 depression scores at Year 1. They allow us to explore the potential impact of prior mental health challenges on long-term survival after TBI.

We'll use custom R functions to systematically categorize and label these histories, ensuring consistency and clarity in our dataset.

  1. History of Suicide Attempt

Our goal is to create a single, informative variable, suicide_attempt_hx, that summarizes participants' history of suicide attempts. This variable will distinguish between participants who:

  • Denied any history of suicide attempts.

  • Reported a suicide attempt prior to their TBI.

  • Reported a suicide attempt in the first year following their TBI.

The create_suicide_attempt_hx_factor Function

Here's the R function that accomplishes this:

create_suicide_attempt_hx_factor <- function(lifetime, past_year_at_injury, past_year_at_year_1) {
  new_factor <- integer(length = length(lifetime))

  for (i in 1:length(lifetime)) {
    if (is.na(lifetime[i]) || is.na(past_year_at_injury[i]) || is.na(past_year_at_year_1[i])) {
      new_factor[i] <- NA
    } else if (lifetime[i] == "No" && past_year_at_injury[i] == "No" && past_year_at_year_1[i] == "No") {
      new_factor[i] <- 0
    } else if (past_year_at_year_1[i] == "Yes") {
      new_factor[i] <- 2
    } else if (lifetime[i] == "Yes" || past_year_at_injury[i] == "Yes") {
      new_factor[i] <- 1
    } else if (lifetime[i] == "Refused" || past_year_at_injury[i] == "Refused" || past_year_at_year_1[i] == "Refused") {
      new_factor[i] <- NA
    } else {
      new_factor[i] <- NA
    }
  }

  factor(new_factor,
    levels = c(0, 1, 2),
    labels = c(
      "Denied any history of suicide attempt",
      "Suicide attempt history prior to injury",
      "Suicide attempt in the first year post-injury"
    )
  )
}

How It Works

  1. Initialization: new_factor <- integer(length = length(lifetime)): Creates an integer vector called new_factor, with one element per observation, to hold the numeric category codes.

  2. Iterating Through Records: for(i in 1:length(lifetime)): The function then loops through each participant's records, using the lifetime, past_year_at_injury, and past_year_at_year_1 variables (which capture different aspects of suicide attempt history) to determine the appropriate category.

  3. Conditional Logic: The if/else if/else statements within the loop assign a numeric code to new_factor based on the following logic:

    • if(is.na(lifetime[i]) || is.na(past_year_at_injury[i]) || is.na(past_year_at_year_1[i])): If any of the history variables are missing for a participant, assign NA to new_factor.

    • else if(lifetime[i] == "No" && past_year_at_injury[i] == "No" && past_year_at_year_1[i] == "No"): If the participant denied a history of suicide attempts across all three variables, assign 0 to new_factor.

    • else if(past_year_at_year_1[i] == "Yes"): If the participant reported an attempt in the first year post-injury, assign 2 to new_factor.

    • else if(lifetime[i] == "Yes" || past_year_at_injury[i] == "Yes"): If the participant reported an attempt either at any point in their lifetime or specifically in the year before their injury, assign 1 to new_factor.

    • else if(lifetime[i] == "Refused" || past_year_at_injury[i] == "Refused" || past_year_at_year_1[i] == "Refused"): If the participant refused to answer any of the history questions, assign NA to new_factor.

  4. Creating a Factor Variable: factor(new_factor, levels = c(0, 1, 2), labels = c(...)): Finally, the function converts new_factor into a factor variable with descriptive labels:

    1. 0: "Denied any history of suicide attempt"

    2. 1: "Suicide attempt history prior to injury"

    3. 2: "Suicide attempt in the first year post-injury"
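As an aside, the same categorization can be written without an explicit loop using dplyr's vectorized case_when(). This is a sketch of equivalent logic, not the author's code:

```r
library(dplyr)

# Vectorized sketch of the same rules. case_when() evaluates conditions
# in order, so "Yes in the first year post-injury" takes precedence over
# the lifetime/at-injury rule, mirroring the if/else if chain above.
classify_suicide_attempt_hx <- function(lifetime, past_year_at_injury, past_year_at_year_1) {
  code <- case_when(
    is.na(lifetime) | is.na(past_year_at_injury) | is.na(past_year_at_year_1) ~ NA_integer_,
    lifetime == "No" & past_year_at_injury == "No" & past_year_at_year_1 == "No" ~ 0L,
    past_year_at_year_1 == "Yes" ~ 2L,
    lifetime == "Yes" | past_year_at_injury == "Yes" ~ 1L,
    TRUE ~ NA_integer_  # "Refused" and any other value fall through to NA
  )
  factor(code,
         levels = c(0, 1, 2),
         labels = c("Denied any history of suicide attempt",
                    "Suicide attempt history prior to injury",
                    "Suicide attempt in the first year post-injury"))
}

classify_suicide_attempt_hx("Yes", "No", "No")
# Suicide attempt history prior to injury
```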

Applying the Function and Cleaning Up

We apply this function to our dataset like this:

analytic_data_imputed$suicide_attempt_hx <- create_suicide_attempt_hx_factor(
  lifetime = analytic_data_imputed$suicide_attempt_hx_lifetime_at_injury,
  past_year_at_injury = analytic_data_imputed$suicide_attempt_hx_past_year_at_injury,
  past_year_at_year_1 = analytic_data_imputed$suicide_attempt_hx_past_year_at_year_1
)

This creates the new suicide_attempt_hx variable in our analytic_data_imputed dataset.

We then remove any unused factor levels (categories that don't appear in our data) using:

analytic_data_imputed$suicide_attempt_hx <- droplevels(analytic_data_imputed$suicide_attempt_hx)
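droplevels() itself is worth a quick illustration on a toy factor (hypothetical levels):

```r
# droplevels() removes factor levels that no observation actually uses
x <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
levels(x)
# "a" "b" "c"
levels(droplevels(x))
# "a" "b"
```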

  2. History of Mental Health Treatment

Next, we create a variable called mental_health_tx_hx to categorize participants based on their self-reported history of mental health treatment. This variable will have the following categories:

  • No history of mental health treatment.

  • Treatment received before the year preceding their injury.

  • Treatment received within the year preceding their injury.

  • Refused to answer or gave inconsistent reports.

The create_mental_health_tx_factor Function

Here's the R function that creates this variable:

create_mental_health_tx_factor <- function(lifetime, past_year_at_injury) {
  new_factor <- integer(length = length(lifetime))

  for (i in 1:length(lifetime)) {
    if (is.na(lifetime[i]) || is.na(past_year_at_injury[i])) {
      new_factor[i] <- NA
    } else if (lifetime[i] == "No" && past_year_at_injury[i] == "No") {
      new_factor[i] <- 0
    } else if (lifetime[i] == "Yes" && past_year_at_injury[i] == "No") {
      new_factor[i] <- 1
    } else if (lifetime[i] == "Yes" && past_year_at_injury[i] == "Yes") {
      new_factor[i] <- 2
    } else if (lifetime[i] == "Refused" || past_year_at_injury[i] == "Refused") {
      new_factor[i] <- 5
    } else {
      new_factor[i] <- 6
    }
  }

  factor(new_factor,
    levels = c(0, 1, 2, 5, 6),
    labels = c(
      "Denied any history of mental health treatment",
      "Mental health treatment received prior to year preceding injury only",
      "Mental health treatment received within year preceding injury",
      "Participant refused to provide full mental health history",
      "Inconsistent reports of mental health history across time points"
    )
  )
}

How It Works

This function follows a similar logic to the previous one, but with slightly different categories:

  1. Initialization: Creates a numeric vector new_factor to store the results.

  2. Iteration: Loops through each participant's records.

  3. Conditional Logic: Assigns a numeric code to new_factor based on the lifetime and past_year_at_injury variables, which capture different aspects of mental health treatment history.

  4. Factor Creation: Converts new_factor into a factor variable with descriptive labels.

Applying the Function

analytic_data_imputed$mental_health_tx_hx <- create_mental_health_tx_factor(
  lifetime = analytic_data_imputed$mental_health_tx_lifetime_at_injury,
  past_year_at_injury = analytic_data_imputed$mental_health_tx_past_year_at_injury
)

analytic_data_imputed$mental_health_tx_hx <- droplevels(analytic_data_imputed$mental_health_tx_hx)

  3. History of Psychiatric Hospitalization

Finally, we create a variable called psych_hosp_hx to capture participants' self-reported history of psychiatric hospitalizations. The categories are analogous to the mental health treatment variable:

  • No history of psychiatric hospitalization.

  • Hospitalization before the year preceding their injury.

  • Hospitalization within the year preceding their injury.

  • Refused to answer or gave inconsistent reports.

The create_psych_hosp_hx_factor Function

create_psych_hosp_hx_factor <- function(lifetime, past_year_at_injury) {
  new_factor <- integer(length = length(lifetime))

  for (i in 1:length(lifetime)) {
    if (is.na(lifetime[i]) || is.na(past_year_at_injury[i])) {
      new_factor[i] <- NA
    } else if (lifetime[i] == "No" && past_year_at_injury[i] == "No") {
      new_factor[i] <- 0
    } else if (lifetime[i] == "Yes" && past_year_at_injury[i] == "No") {
      new_factor[i] <- 1
    } else if (lifetime[i] == "Yes" && past_year_at_injury[i] == "Yes") {
      new_factor[i] <- 2
    } else if (lifetime[i] == "Refused" || past_year_at_injury[i] == "Refused") {
      new_factor[i] <- 5
    } else {
      new_factor[i] <- 6
    }
  }

  factor(new_factor,
    levels = c(0, 1, 2, 5, 6),
    labels = c(
      "Denied any history of psychiatric hospitalization",
      "Psychiatric hospital admission prior to year preceding index injury only",
      "Psychiatric hospital admission within year preceding index injury",
      "Participant refused to provide full psychiatric hospitalization history",
      "Inconsistent reports of psychiatric hospitalization history across time points"
    )
  )
}

Applying the Function

analytic_data_imputed$psych_hosp_hx <- create_psych_hosp_hx_factor(
  lifetime = analytic_data_imputed$psych_hosp_hx_lifetime_at_injury,
  past_year_at_injury = analytic_data_imputed$psych_hosp_hx_past_year_at_injury
)

analytic_data_imputed$psych_hosp_hx <- droplevels(analytic_data_imputed$psych_hosp_hx)

Why These Variables Matter: A Holistic View of Mental Health

By creating these three comprehensive mental health history variables, we're adding valuable depth to our dataset. We can now explore how these past experiences relate to depression at Year 1 and, ultimately, to long-term survival after TBI.

Key Advantages

  • Standardization: We've created consistent categories across these variables, making it easier to compare and analyze the impact of different mental health histories.

  • Clinical Relevance: The categories align with clinically meaningful distinctions, ensuring that our analysis addresses real-world concerns.

  • Interpretability: The factor levels have clear, descriptive labels, making our results easier to understand and communicate.

  • Analysis-Ready: These variables are now in the correct format for inclusion as covariates in our Cox regression models.

Looking Ahead: Final Preparations and Model Building

We've made significant progress in preparing our data for survival analysis! We've created key variables, handled missing data strategically, and refined our dataset through careful application of eligibility criteria. Now, we're ready to tackle the very last steps of data preprocessing before we can dive into the exciting world of descriptive data exploration and visualization.

1.15 Converting Time-to-Event Variables and Organizing Our Data

Introduction

We're in the home stretch of our data preprocessing journey! In this section, we'll focus on three crucial tasks:

  1. Converting our time-to-event variables from days to years.

  2. Calculating age-related variables.

  3. Reorganizing our dataset for clarity and optimal use in Cox regression models.

These steps might seem like minor details, but they play a vital role in ensuring our survival analysis is both accurate and easy to interpret.

  1. From Days to Years: Making Time-to-Event Variables More Meaningful

Our current time_to_event, time_to_censorship, and time_to_expiration variables are measured in days. While precise, reporting survival times in days can be difficult to grasp intuitively. Converting these variables to years makes our results more interpretable and aligns with common practices in reporting survival outcomes.

Why It Matters

  • Interpretability: It's much easier to understand and compare survival times expressed in years (e.g., "a median survival time of 3.5 years") rather than days (e.g., "a median survival time of 1,278 days").

  • Standard Reporting: Most survival analyses report results in terms of years, making our findings more comparable to existing literature.

Implementation in R

We use the mutate() function to create new variables representing time in years, dividing the day counts by 365.25 to account for leap years:

analytic_data_imputed <- analytic_data_imputed |>
  mutate(
    time_to_event_in_years = time_to_event / 365.25,
    time_to_censorship_in_years = time_to_censorship / 365.25,
    time_to_expiration_in_years = time_to_expiration / 365.25
  )
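A quick sanity check with a toy value (not study data):

```r
# 1,278 days divided by the mean year length of 365.25 days
time_to_event_in_years <- 1278 / 365.25
round(time_to_event_in_years, 2)
# 3.5
```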

  2. Adding Context with Age Metrics

In addition to time-to-event, understanding the age at which events occur can provide valuable context. We'll calculate two new variables:

  • age_at_censorship: The participant's age at their last follow-up (for those who were censored).

  • age_at_expiration: The participant's age at death (for those who experienced the event).

Why It Matters

  • Contextualizing Outcomes: Age can be a significant factor influencing survival. These variables allow us to explore how age relates to the timing of events or censoring.

Implementation in R

We use the interval() and years() functions from the lubridate package to calculate the age at the time of censoring or death:

analytic_data_imputed <- analytic_data_imputed |>
  mutate(
    age_at_censorship = interval(start = date_of_birth, end = date_of_followup) / years(1),
    age_at_expiration = interval(start = date_of_birth, end = date_of_death) / years(1)
  )
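On toy dates (hypothetical, chosen to land on an exact anniversary), the interval()/years() pattern behaves like this:

```r
library(lubridate)

date_of_birth    <- as.Date("1980-06-15")
date_of_followup <- as.Date("2015-06-15")

# interval() captures the elapsed span between two dates; dividing by
# years(1) expresses that span as a (possibly fractional) number of years
age_at_censorship <- interval(start = date_of_birth, end = date_of_followup) / years(1)
age_at_censorship
# 35
```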

  3. Organizing for Clarity: Strategic Column Reordering

Now, let's reorganize our dataset by strategically ordering the columns. This might seem like a purely cosmetic change, but it significantly improves the readability and usability of our data.

Why It Matters

  • Logical Flow: Grouping related variables (e.g., time variables, demographic variables, mental health variables) makes it easier to navigate the dataset and understand the relationships between different pieces of information.

  • Easier Access to Key Variables: Placing the most important variables (like id, event_status, and time_to_event) at the beginning of the dataset makes them readily accessible for analysis.

Implementation in R

We use the select() function to reorder the columns in a way that makes logical sense:

analytic_data_imputed <- analytic_data_imputed |>
  select(
    id,
    status_at_followup,
    event_status,
    data_collection_period,
    time_to_event,
    time_to_censorship,
    time_to_expiration,
    time_to_event_in_years,
    time_to_censorship_in_years,
    time_to_expiration_in_years,
    calendar_year_of_injury,
    calendar_year_of_event,
    date_of_year_1_followup,
    date_of_followup,
    date_of_death,
    date_of_birth,
    date_of_injury,
    cause_of_death_1,
    cause_of_death_2,
    cause_of_death_e,
    sex,
    age_at_injury,
    age_at_censorship,
    age_at_expiration,
    education_level_at_injury,
    employment_at_injury,
    marital_status_at_injury,
    cause_of_injury,
    rehab_payor_primary,
    rehab_payor_primary_type,
    drs_total_at_followup,
    fim_total_at_followup,
    gose_total_at_followup,
    drs_total_at_year_1,
    fim_total_at_year_1,
    gose_total_at_year_1,
    func_score_at_year_1,
    func_score_at_year_1_q5,
    mental_health_tx_lifetime_at_injury,
    mental_health_tx_past_year_at_injury,
    mental_health_tx_hx,
    psych_hosp_hx_lifetime_at_injury,
    psych_hosp_hx_past_year_at_injury,
    psych_hosp_hx,
    problematic_substance_use_at_injury,
    problematic_substance_use_at_year_1,
    problematic_substance_use_at_followup,
    suicide_attempt_hx_lifetime_at_injury,
    suicide_attempt_hx_past_year_at_injury,
    suicide_attempt_hx_past_year_at_year_1,
    suicide_attempt_hx_past_year_at_followup,
    suicide_attempt_hx,
    phq1, phq2, phq3, phq4, phq5, phq6, phq7, phq8, phq9,
    positive_symptoms_at_year_1,
    cardinal_symptoms_at_year_1,
    depression_level_at_year_1
  ) |>
  arrange(id, data_collection_period)

Key Takeaways: Preparing for Powerful Analysis

By converting our time variables to years, calculating age-related metrics, and strategically reorganizing our dataset, we've taken significant steps toward preparing for robust survival modeling. These seemingly minor adjustments contribute to:

  • Enhanced Interpretability: Using years as our time unit and grouping related variables makes our results easier to understand and communicate.

  • Consistency with Best Practices: Aligning our data with standard practices in survival analysis ensures that our findings are comparable to other studies.

  • Streamlined Workflow: A well-organized dataset simplifies subsequent analysis steps, allowing us to focus on model building and interpretation.

Looking Ahead: Selecting the Final Record and Preparing for Exploration

We're almost ready to start exploring our data visually and through descriptive statistics! In the next section, we will select the single, representative record for each participant (typically their last valid observation) that will be used in our analysis. This crucial step will allow us to transition from a longitudinal dataset to a cross-sectional format that is ideal for summarizing and visualizing key characteristics of our study population. We will also perform a final check of our data, ensuring that all variables are in the correct format and that factor variables have appropriate labels and reference levels. Finally, we will save this final analytic dataset in both .rds and .csv formats.

Once these steps are complete, we will be fully equipped to:

  • Generate Descriptive Tables: We'll summarize key demographic, injury-related, and mental health variables, providing a comprehensive overview of our study population.

  • Create Informative Visualizations: We'll use plots like histograms, boxplots, and Kaplan-Meier curves to explore distributions, relationships between variables, and survival patterns.

This descriptive exploration will provide valuable insights into our data and inform the development of our subsequent survival models. We are poised to transform this carefully prepared data into meaningful summaries and visualizations that can contribute to improving the lives of individuals with TBI!

1.16 Selecting the Representative Record

Introduction

We're now in the final stage of preparing our dataset for survival analysis! We've meticulously cleaned, transformed, and enriched our data, and we've applied our study eligibility criteria. Now, it's time to create our final analytic dataset—the one we'll use for exploring our data through descriptive statistics and visualizations, and later, for building our Cox regression models.

This section focuses on three important steps:

  1. Retaining the Last Valid Observation: Selecting a single, representative record for each participant from our longitudinal data.

  2. Ensuring Data Type Consistency: Double-checking that all variables are in the correct format (e.g., numeric, factor, date).

  3. Adjusting Factor Variables: Making sure that our factor variables have appropriate labels and levels for meaningful analysis.

Let's dive into the details of each step.

  1. Selecting the Representative Record: The Last Valid Observation

Since our TBIMS data are longitudinal, each participant has multiple observations collected over time. For many types of survival analysis, including standard Cox regression, we need a dataset where each participant is represented by a single row of data.

Why This Matters

  • Standard Format for Survival Analysis: Most survival analysis methods are designed to work with datasets where each row represents a unique individual and their associated survival time and event status.

  • Reflecting the Most Recent Status: We want our analysis to reflect the most up-to-date information available for each participant. Therefore, we'll typically select their last valid observation within the study period. This record captures their status closest to the event (death) or censoring.

Implementation in R

Here's how we select the last valid observation for each participant:

analytic_data_final <- analytic_data_imputed |>
  arrange(id, time_to_event) |>
  group_by(id) |>
  slice(n()) |>
  ungroup()

What It Does
  1. arrange(id, time_to_event): We sort the data by participant ID (id) and then by time_to_event. This ensures that the last observation for each participant will be the one with the latest (largest) time_to_event.

  2. group_by(id): We group the data by participant ID.

  3. slice(n()): This is the key step. For each participant (within each group), slice(n()) selects only the last row (the row with the maximum time_to_event).

  4. ungroup(): We remove the grouping.

After applying this code, analytic_data_final will contain only one record per participant, representing their most recent status.
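
To see this selection in action, here's a minimal sketch using hypothetical toy data (the tibble below and its values are invented purely for illustration):

```r
library(dplyr)

# Hypothetical toy data: two participants with multiple follow-up records
toy <- tibble(
  id            = c(1, 1, 2, 2, 2),
  time_to_event = c(1.0, 5.2, 2.1, 4.0, 7.5)
)

last_obs <- toy |>
  arrange(id, time_to_event) |>
  group_by(id) |>
  slice(n()) |>
  ungroup()

last_obs
#> One row per participant: time_to_event 5.2 (id 1) and 7.5 (id 2)
```

An equivalent, arguably more explicit alternative in recent versions of dplyr is `slice_max(time_to_event, n = 1, with_ties = FALSE)`, which selects the row with the largest time_to_event without requiring a prior arrange().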

  2. Handling Missing Data: Replacing Empty Strings

A common issue in data cleaning is the presence of empty strings ("") that should actually be treated as missing values (NAs). We will convert these empty strings to NA values:

# Note: apply() returns a character matrix here; we convert back to a data
# frame and restore the correct column types in the next step
analytic_data_final <- apply(analytic_data_final, 2, function(x) ifelse(x == "", NA, x))

Why This Matters

  • Consistent Missing Value Representation: Using NA for all missing values ensures that R handles them correctly in subsequent analyses.

  • Accurate Calculations: Statistical functions and models in R are designed to handle NA values appropriately, but they might produce incorrect results if empty strings are present.
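
As a sketch of a tidyverse alternative that avoids coercing the data frame to a matrix (which apply() does), empty strings in character columns can be converted with na_if(); the tibble below is a hypothetical example:

```r
library(dplyr)

# Hypothetical example data; "" in a character column should be missing
df <- tibble(
  group = c("A", "", "B"),
  score = c(10, 20, 30)
)

# na_if() replaces "" with NA only in character columns,
# leaving numeric columns (and their types) untouched
df_clean <- df |>
  mutate(across(where(is.character), ~ na_if(.x, "")))

df_clean$group
#> [1] "A" NA  "B"
```

Because this approach preserves each column's type, the later type-restoration step would not be needed if it were used instead.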

  3. Ensuring Data Type Consistency: A Crucial Check

Before we proceed, it's essential to double-check that all of our variables have the correct data types. For example, numeric variables should be stored as numeric or integer, categorical variables as factor, and dates as Date.

Why It Matters

  • Correct Calculations: R needs to know the data type of each variable to perform calculations correctly. For instance, you can't calculate the mean of a character variable.

  • Model Compatibility: Statistical models, including Cox regression, expect variables to be in specific formats. Using the wrong data type can lead to errors or incorrect model fitting.

Implementation in R

We'll use a custom function called match_class to ensure that the data types in our analytic_data_final dataset match those in our analytic_data_imputed dataset (which we carefully prepared in earlier steps):

match_class <- function(target_col, reference_col) {
  reference_class <- class(reference_col)

  if ("factor" %in% reference_class) {
    return(as.factor(target_col))
  } else if ("numeric" %in% reference_class) {
    return(as.numeric(target_col))
  } else if ("integer" %in% reference_class) {
    return(as.integer(target_col))
  } else if ("character" %in% reference_class) {
    return(as.character(target_col))
  } else if ("logical" %in% reference_class) {
    return(as.logical(target_col))
  } else if ("Date" %in% reference_class) {
    return(as.Date(target_col))
  } else {
    stop(paste("Unsupported class:", paste(reference_class, collapse = ", ")))
  }
}

# apply() above returned a character matrix, so we convert back to a data
# frame and then restore each column's original data type
analytic_data_final <- as.data.frame(analytic_data_final) |>
  mutate(across(everything(), ~ match_class(.x, analytic_data_imputed[[cur_column()]])))

What It Does
  • The match_class function takes two arguments: a target column (target_col) and a reference column (reference_col).

  • It determines the data type of the reference_col using class().

  • It then converts the target_col to the same data type using the appropriate conversion function (e.g., as.factor(), as.numeric(), as.Date()).

  • The mutate(across(everything(), …)) part applies this match_class function to every column in our analytic_data_final dataset, using the corresponding column in analytic_data_imputed as the reference for the correct data type.
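
One subtlety worth noting: as.factor() derives levels alphabetically from the data itself, so the reference column's level ordering is not carried over. A small sketch shows why the factor-level adjustments in the next step are still necessary:

```r
# as.factor() sorts levels alphabetically rather than preserving any
# intended ordering, so the depression categories come out as:
x <- as.factor(c("Minor Depression", "Major Depression", "No Depression"))
levels(x)
#> [1] "Major Depression" "Minor Depression" "No Depression"
```

This alphabetical ordering ("Major" before "Minor" before "No") is not the clinically meaningful No/Minor/Major ordering we want, which is exactly what the next step corrects.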

  4. Fine-Tuning Factor Variables: Levels and Labels

Finally, we'll make sure our factor variables are properly defined with appropriate levels and labels.

Why It Matters

  • Interpretability: Clear and descriptive factor levels make our results easier to understand.

  • Model Requirements: Cox regression models often require factor variables to have well-defined reference levels for meaningful comparisons.

Implementation in R

We'll use mutate() along with factor() to redefine our factor variables, explicitly specifying their levels:

analytic_data_final <- analytic_data_final |>
  mutate(
    depression_level_at_year_1 = factor(depression_level_at_year_1,
      levels = c(
        "No Depression",
        "Minor Depression",
        "Major Depression"
      )
    ),
    sex = factor(sex, levels = c("Male", "Female")),
    employment_at_injury = factor(employment_at_injury,
      levels = c(
        "Competitively Employed",
        "Unemployed",
        "Student",
        "Retired",
        "Other"
      )
    ),
    marital_status_at_injury = factor(marital_status_at_injury,
      levels = c(
                 "Single",
                 "Married",
                 "Divorced",
                 "Other"
      )
    ),
    rehab_payor_primary = factor(rehab_payor_primary,
                                 levels = c("Private Insurance", "Public Insurance", "Other")
                                 ),
    rehab_payor_primary_type = factor(rehab_payor_primary_type,
                                      levels = c("Non-Medicaid", "Medicaid")
                                      ),
    cause_of_injury = factor(cause_of_injury,
                             levels = c("Vehicular", "Falls", "Violence", "Other")
                             ),
    mental_health_tx_hx = factor(mental_health_tx_hx,
      levels = c(
        "Denied any history of mental health treatment",
        "Mental health treatment received prior to year preceding index injury only",
        "Mental health treatment received within year preceding index injury"
      )
    ),
    problematic_substance_use_at_injury = factor(problematic_substance_use_at_injury,
                                                 levels = c("No", "Yes")
                                                 ),
    suicide_attempt_hx = factor(suicide_attempt_hx,
                               levels = c("Denied any history of suicide attempt",
                                          "Suicide attempt history prior to injury",
                                          "Suicide attempt in the first year post-injury")
                               ),
    psych_hosp_hx = factor(psych_hosp_hx,
                           levels = c("Denied any history of psychiatric hospitalization",
                                      "Psychiatric hospital admission prior to year preceding index injury only",
                                      "Psychiatric hospital admission within year preceding index injury"))
  )

What It Does
  • For each factor variable (e.g., depression_level_at_year_1, sex, employment_at_injury, etc.), we use factor() to explicitly define the order of the levels. The first level becomes the reference category in regression models such as Cox regression, so these orderings determine which group each comparison is made against.
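
A minimal sketch of why the level ordering matters (the vectors below are illustrative, not from our dataset):

```r
# Without explicit levels, factor() sorts alphabetically: Female comes first
sex_default <- factor(c("Female", "Male"))

# With explicit levels, Male is placed first
sex_ordered <- factor(c("Female", "Male"),
                      levels = c("Male", "Female"))

levels(sex_default)  # "Female" "Male"  -> Female would be the reference
levels(sex_ordered)  # "Male" "Female"  -> Male is the reference
```

Base R's relevel(f, ref = "Male") is another way to change the reference level of an existing factor after the fact.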

Conceptual Highlights

  • Last Observation: Selecting the last observation aligns with survival analysis principles, ensuring that the most recent participant status is used.

  • Data Type Matching: Consistent data types are crucial for accurate calculations and model fitting.

  • Factor Level Clarity: Well-defined factor levels enhance the interpretability of our results.

The Finish Line: A Dataset Ready for Exploration and Modeling

With these final steps completed, our analytic_data_final dataset is now fully prepared for exploration through descriptive statistics, visualization, and eventually, Cox regression modeling. We've carefully cleaned, transformed, and organized our data, creating a solid foundation for uncovering meaningful insights into the relationship between depression and survival after TBI.

Now that our dataset is fully prepared, we're ready to dive into the exciting world of data exploration! In the next blog post, we will focus on creating and interpreting descriptive statistics tables. These tables will provide a comprehensive overview of our study population, summarizing key characteristics and laying the groundwork for understanding the relationships between our variables. We'll examine the distribution of demographics, injury characteristics, mental health variables, and our crucial outcome: survival. Stay tuned as we begin to unlock the insights hidden within our carefully prepared data!

Conclusion

We've reached a pivotal moment in our survival analysis journey! We've navigated the complexities of data preprocessing and emerged with more than just a clean dataset; we've crafted a powerful resource, meticulously prepared and poised to reveal important insights. Every step we've taken, from imputing missing data to refining variable definitions, has brought us closer to our ultimate goal: understanding the intricate relationship between depression one year post-TBI and long-term survival.

A Recap of Our Transformation: Shaping Data into Knowledge

Let's take a moment to appreciate the significant transformations we've accomplished:

  1. Unifying Year 1 Data: We extracted and imputed essential Year 1 variables, including depression levels, functional independence scores, and other key covariates. This ensured that every participant's record, regardless of when it was collected, contains this critical information, providing a consistent baseline for our analysis.

  2. Enriching with Mental Health Histories: We created comprehensive mental health history variables, capturing participants' experiences with suicide attempts, mental health treatment, and psychiatric hospitalizations. This added depth and nuance to our dataset, allowing us to explore the impact of these factors on survival.

  3. Optimizing for Clarity and Precision: We converted our time-to-event variables to years and calculated age-related metrics, making our data more interpretable and aligning it with standard practices in survival analysis. We then strategically reorganized our dataset for optimal clarity and ease of use.

  4. Creating a Streamlined Analytic Dataset: We selected a single, representative record for each participant (their last valid observation), simplifying our dataset while preserving the most up-to-date information about their status. This aligns our data with the typical structure needed for many survival analysis techniques.

Why This Matters

These preprocessing steps were not simply about tidying up. They were about transforming raw, fragmented data into a cohesive and meaningful narrative. We've ensured that:

  • Our Data are High-Quality: We've addressed missing data, refined variable definitions, and ensured consistency across time points, leading to more accurate and reliable results.

  • We've Maximized Information: By strategically imputing missing values and carefully selecting representative records, we've retained as much valuable data as possible, increasing the statistical power of our analysis.

  • Our Analysis is Set Up for Success: Our dataset is now structured perfectly for exploring our data through descriptive statistics and visualizations, and eventually, for building robust survival models. We've laid the groundwork for uncovering meaningful patterns and relationships.

Looking Ahead: Exploring Our Data and Unveiling Insights

Our data preprocessing journey has reached its conclusion, and we now have a dataset that is primed for discovery. In the next blog posts, we'll shift our focus to:

  • Descriptive Exploration: We'll use descriptive statistics to paint a comprehensive picture of our study population, summarizing key demographic, injury-related, and mental health characteristics.

  • Visualizing Survival Patterns: We'll create compelling visualizations, including histograms, box plots, and Kaplan-Meier curves, to explore distributions, relationships between variables, and survival patterns over time.

These exploratory steps will not only deepen our understanding of the TBIMS dataset but also inform the development of our subsequent survival models.

We're now ready to transform this data into actionable insights that can contribute to improving the lives of individuals with TBI. The journey continues, and we're excited to share the next phase of discovery with you!
