Traumatic Brain Injury and Depression: A Survival Analysis Study in R (Part 6)
February 10, 2025
Research
Tutorials
Introduction
Welcome back to our journey into the world of survival analysis! We're moving beyond the initial stages of data cleaning and transformation, and into the crucial step of exploring our data through descriptive statistics. This phase is all about building a solid foundation for reproducible analysis, ensuring that our later survival models are built upon a bedrock of high-quality, well-understood data.
In this installment, we continue our exploration of the critical question: How do depression levels one year after a traumatic brain injury (TBI) influence all-cause mortality within the subsequent five years? The insights we uncover have the potential to inform interventions and improve patient care, making this not just an academic exercise but a journey with real-world implications.
This post will guide you through the essential steps required to create a robust and transparent analytical workflow. We'll cover:
3.1 Initial Setup and Library Loading
We'll show you how to create a clean and efficient R environment. This includes loading essential libraries like tidyverse, naniar, and gtsummary, setting up a structured directory system for managing data and outputs, and configuring plot aesthetics. These steps ensure a smooth and reproducible workflow.
3.2 Defining Covariates and Assigning Clear Labels
We'll walk through the process of defining our key variables of interest and assigning them clear, reader-friendly labels. This enhances the interpretability of our data and sets the stage for effective communication of our findings.
3.3 Creating a Complete-Case Sample
We'll create a subset of our data containing only participants with complete data on our key variables. While we acknowledge the limitations of listwise deletion, this complete-case sample provides a transparent and straightforward baseline for our initial analyses and descriptive statistics, allowing us to directly assess the impact of missing data.
3.4 Generating Descriptive Statistics Tables
We'll dive into the art of summarizing our data using the powerful gtsummary package. You'll learn how to create publication-ready tables that showcase the key characteristics of our study population, stratified by depression level at Year 1. These tables will provide crucial insights into the relationships between depression, demographics, injury characteristics, and other clinical factors.
3.5 Interpreting Descriptive Statistics Tables
We'll go beyond the numbers and explore what our descriptive statistics reveal about our data. We'll compare findings between the full analytic sample and the complete-case sample, highlighting potential biases and informing our modeling choices.
Why This Matters: More Than Just Housekeeping
These steps might appear to be mere "data housekeeping," but they are, in fact, the cornerstone of a successful survival analysis. A well-documented workflow offers several crucial benefits:
Enhanced Insights: Properly prepared data allows us to uncover patterns and relationships that might otherwise be obscured by missing values, inconsistencies, or poorly defined variables. These descriptive analyses provide a richer context for understanding our data before we even begin to model it.
Improved Reproducibility: A structured workflow, with clear documentation and well-organized code, makes it easy to replicate and validate our findings. This is essential for scientific rigor and ensures that others can build upon our work.
Streamlined Workflow: By organizing our workspace, defining variables clearly, and automating tasks, we save valuable time and reduce frustration during the later, more complex stages of analysis.
Actionable Results: Clean, well-understood data leads to models and visualizations that are more likely to yield actionable insights, ultimately guiding better decisions in patient care and policy.
Throughout this post, we'll provide detailed R code examples, accompanied by clear explanations of the "why" and "how" behind each step. Whether you're new to survival analysis or a seasoned data analyst, you'll find practical tips, tools, and strategies that you can apply to your own projects.
Let's dive in and build the foundation for an impactful survival analysis that can contribute to improving the lives of individuals with TBI!
3.1 Initial Setup and Library Loading
Introduction
This script establishes the foundational environment for data analysis by loading essential R libraries, setting up a structured directory system for data management, loading preprocessed data, and configuring table and plot aesthetics. These steps ensure a reproducible, organized, and visually consistent workflow.
Step 1: Equipping Ourselves - Loading Essential Libraries
Before we can start exploring our data, we need to ensure that we have the right tools at our disposal. We'll load a curated set of R libraries, each chosen for its specific role in data analysis, visualization, or reporting.
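A minimal sketch of this step (the package list matches the libraries described below; your own projects may need a different set):

```r
# Install pacman if it isn't already available, then use it to
# install (if needed) and load every required package in one call
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}

pacman::p_load(
  extrafont,  # custom fonts for plots
  gt,         # publication-ready tables
  gtsummary,  # descriptive statistics tables
  here,       # reproducible, project-relative file paths
  naniar,     # missing-data analysis and visualization
  scales,     # axis scales and label formatting
  tidyverse   # dplyr, ggplot2, forcats, and friends
)
```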
Let's break down what's happening:
pacman: Our Package Manager: The pacman package simplifies the process of managing R packages. The code first checks if pacman is installed, and if not, it installs it.
Why It Matters: pacman streamlines our workflow by allowing us to install and load multiple packages with a single command (p_load). It also handles situations where a package is already installed, preventing unnecessary re-installations.
Our Arsenal of Libraries:
extrafont: This package allows us to customize our plots with specific fonts, giving our visualizations a polished and professional look.
gt and gtsummary: These packages are our tools for creating beautiful, publication-ready tables. They offer extensive customization options, making it easy to present our descriptive statistics in a clear and informative way.
here: This package is essential for creating reproducible file paths. It automatically detects the project's root directory, making our code portable across different computer environments.
naniar: This package specializes in working with missing data. We'll use it to analyze and visualize missingness patterns in our dataset.
scales: This package provides tools for customizing plot scales and labels, enhancing the clarity and readability of our visualizations.
tidyverse: This is a collection of essential R packages for data science, including dplyr (for data manipulation), ggplot2 (for data visualization), and many others. The tidyverse provides a cohesive and powerful framework for working with data in R.
Pro Tip: Using pacman::p_load is a best practice for managing package dependencies. It ensures that all necessary libraries are installed and loaded efficiently, saving you time and preventing potential errors.
Step 2: Building Our Home Base - Creating a Project Directory
A well-organized project directory is essential for keeping our files in order, ensuring reproducibility, and making collaboration easier. Let's create a clear structure for our project:
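A sketch of the directory setup (the directory names come from the structure described below; the object names are illustrative):

```r
library(here)

# Define project directories relative to the project root
processed_data_dir <- here("Data", "Processed")
tables_dir         <- here("Output", "Tables")
missingness_dir    <- here("Output", "Plots", "Missingness")

# Create each directory only if it doesn't already exist;
# recursive = TRUE also creates any missing parent directories
for (path in c(processed_data_dir, tables_dir, missingness_dir)) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
}
```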
What's happening here?
Defining Directories:
Data/Processed: This directory will house our preprocessed datasets, keeping them separate from the raw data.
Output/Tables: This directory will store our descriptive statistics tables.
Output/Plots/Missingness: This directory will store visualizations related to missing data patterns.
Automating Directory Creation:
here(): This function from the here package dynamically defines file paths relative to the project's root directory, ensuring portability.
dir.create(): This function creates the specified directories. The recursive = TRUE argument ensures that any necessary parent directories are also created. The if (!dir.exists(…)) checks prevent these directories from being recreated if they already exist.
Why It Matters
This structured approach eliminates confusion about file locations, ensures that outputs and intermediate datasets are systematically organized, and promotes reproducibility.
Step 3: Loading Our Preprocessed Data
Now that our environment is set up, let's load the preprocessed dataset that we've prepared in previous steps:
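A sketch of the loading step (the .rds file names are assumed to match the object names described below):

```r
# Load the preprocessed R objects saved in earlier installments
analytic_data_final <- readRDS(
  here("Data", "Processed", "analytic_data_final.rds")
)
na_counts_for_all_proposed_covariates <- readRDS(
  here("Data", "Processed", "na_counts_for_all_proposed_covariates.rds")
)
na_counts_for_select_covariates <- readRDS(
  here("Data", "Processed", "na_counts_for_select_covariates.rds")
)
```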
What's happening here?
readRDS(): This function reads R objects that were previously saved as .rds files. We're loading:
analytic_data_final: Our main dataset, which has undergone cleaning, transformation, and eligibility criteria application.
na_counts_for_all_proposed_covariates: A data frame containing missing value counts for all potential covariates.
na_counts_for_select_covariates: A data frame containing missing value counts for our selected set of covariates.
Why It Matters
These datasets are the result of our careful preprocessing efforts. They are now ready for exploration, visualization, and ultimately, survival modeling.
Using .rds files allows for efficient storage and retrieval of R objects, preserving all data structures, including factor levels, labels, and metadata.
Step 4: Polishing Our Tables and Plots - Configuring Aesthetics
To ensure that our tables and visualizations effectively communicate our findings, let's import some custom fonts:
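A sketch of the font setup (font_import() is shown commented out because it only needs to run once per machine and can take several minutes):

```r
library(extrafont)

# font_import()          # run once to register system fonts with R
loadfonts(quiet = TRUE)  # make imported fonts available this session
```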
What's happening here?
extrafont: This package allows us to use fonts beyond the standard R defaults.
loadfonts(): This function imports fonts installed on your system, making them available for use in R.
Why It Matters
Consistent aesthetics and enhanced readability make our tables and visualizations more professional and impactful.
Pro Tip: If you are sharing code with others, it is best to specify a font that is commonly available across systems.
The Big Picture: A Foundation for Discovery
These initial setup steps might seem like small details, but they are the cornerstone of a successful and reproducible analysis pipeline. By investing in this foundation, we ensure that:
Our workflow is efficient and organized.
Our project is reproducible.
Our data are readily accessible.
Our visualizations are polished and impactful.
Looking Ahead: Exploring and Visualizing Our Data
With our R environment configured and our data loaded, we're now ready to continue the exciting phase of exploratory data analysis. In the next sections, we will:
Prepare covariate sets for generating descriptive statistics tables.
Define preferred variable labels for clarity and consistency.
Generate comprehensive tables summarizing the key characteristics of our study population.
Each step builds upon this foundation, paving the way for survival models that will address our central research question.
3.2 Defining Covariates and Assigning Clear Labels
Introduction
We're now ready to continue our focus on exploratory data analysis, where we'll use descriptive statistics and visualization to delve into the characteristics of our study population and begin to uncover patterns in the data. But before we can start generating insightful tables and plots, we need to make sure our dataset is properly organized and that our variables are clearly defined.
In this section, we'll focus on two essential preparatory tasks:
Defining Our Covariates of Interest: We'll create specific lists of variables that will guide our exploratory analyses and inform our subsequent modeling choices.
Assigning Descriptive Variable Labels: We'll replace cryptic variable names with clear, reader-friendly labels that enhance the interpretability of our results.
Let's dive into how we accomplish these tasks.
Step 1: Defining Our Covariates of Interest
First, we need to explicitly define the variables that we'll be working with. We'll create two lists:
all_proposed_covariates: This is an exhaustive list of all potential predictor variables in our dataset that might be relevant to our research question. It includes a wide range of variables capturing demographic information, injury characteristics, functional status, and mental health history. Think of this as our initial long list of potential players for our analysis.
select_covariates: This is a more curated list, containing a subset of variables that we've deemed particularly important for our core research question or that are most suitable for initial exploration based on careful consideration of previous research and clinical knowledge. This is our starting lineup—the key players that we'll first focus on. It's important to note that this selection isn't set in stone; we refined it after our initial Cox regression analyses, as detailed below.
Here's how we define these lists in our R code:
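A sketch of these definitions (only a handful of variable names are shown; names not quoted elsewhere in this series are placeholders for the full lists used in the project):

```r
# Exhaustive list of candidate predictors (abbreviated here)
all_proposed_covariates <- c(
  "depression_level_at_year_1",
  "calendar_year_of_injury",
  "cause_of_injury",
  "employment_at_injury",
  "psych_hosp_hx",
  "gose_total_at_year_1",
  "event_status",
  "time_to_event_in_years"
  # ...plus the remaining demographic, injury, functional,
  # and mental health variables
)

# Curated subset for the primary analyses
select_covariates <- c(
  "depression_level_at_year_1",
  "gose_total_at_year_1",
  "event_status",
  "time_to_event_in_years"
  # ...plus the other selected covariates
)
```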
What's happening here?
We're creating two character vectors, all_proposed_covariates and select_covariates, that list the names of the variables that we'll be using. select_covariates is a subset of all_proposed_covariates.
Addressing Potential Overfitting
It's important to note that the select_covariates list was refined based on initial model diagnostics and concerns about potential overfitting. Overfitting occurs when a model is too complex relative to the amount of data, leading to poor generalization on new data.
One rule of thumb to mitigate overfitting is to have roughly 10-15 events (in our case, deaths) per predictor variable (or degree of freedom) in the model. Our initial 5-year dataset had approximately 4 events per degree of freedom (113 events and 26 df), falling short of this guideline.
To address this, we carefully considered the variables in our initial model and removed those that were deemed less critical or potentially redundant. This included:
calendar_year_of_injury: This variable might capture time trends that could be confounded with other factors.
psych_hosp_hx: This variable could be correlated with other mental health variables, leading to redundancy.
employment_at_injury
cause_of_injury: This variable, while potentially relevant, had many categories, increasing the degrees of freedom in our model and thus the risk of overfitting for this particular analysis.
By creating a more parsimonious model, we aim to improve its generalizability and robustness.
Why It Matters
Flexibility and Focus: Having both comprehensive and focused lists gives us flexibility. We can use all_proposed_covariates for broad exploratory analyses, generating hypotheses and examining a wide range of potential predictors. We can then use select_covariates for more targeted investigations related to our primary research question and for building our final survival models.
Organization and Clarity: Explicitly defining these lists makes our code more organized and easier to understand. It clearly signals which variables we're considering at each stage of the analysis.
Model Stability: The refined select_covariates list helps us build more stable and reliable survival models by reducing the risk of overfitting.
Step 2: Defining Preferred Variable Labels - Speaking a Common Language
Raw variable names are often cryptic and inconsistent. To make our data more accessible and interpretable, we'll assign clear, descriptive labels to our variables.
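A sketch of the label mapping (abbreviated; the labels shown for variables other than depression_level_at_year_1 are illustrative):

```r
# Named list: original variable names -> reader-friendly labels
var_name_mapping <- list(
  depression_level_at_year_1 = "Depression Level at Year 1",
  calendar_year_of_injury    = "Calendar Year of Injury",
  gose_total_at_year_1       = "GOSE Total Score at Year 1",
  time_to_event_in_years     = "Time to Event (Years)"
  # ...labels for the remaining covariates
)
```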
What's happening here?
var_name_mapping: We create a named list where the names are the original variable names in our dataset, and the values are the new, descriptive labels we want to assign. For example, we're mapping the variable depression_level_at_year_1 to the label "Depression Level at Year 1."
Why It Matters
Readability: Descriptive labels will make our tables, plots, and model outputs much easier to understand, especially for those who are not intimately familiar with the raw dataset.
Consistency: Using these labels ensures that our variables are consistently named throughout our analysis, reducing the risk of confusion.
Pro Tip: When creating labels, aim for a balance between brevity and informativeness. Choose labels that are both concise and easily understandable by a broad audience.
Step 3: Creating Data Frames for Analysis and Visualization
Before we can create our plots and tables, we will create two data frames tailored for these specific tasks:
analytic_data_for_tables_all: This data frame will include all variables in our all_proposed_covariates list, providing a comprehensive dataset for broad exploration.
analytic_data_for_tables_select: This data frame will include only the variables in our select_covariates list, offering a more focused dataset for targeted analyses related to our primary research question.
Here's how we create these data frames using R:
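A sketch of this step, assuming the objects defined above:

```r
library(tidyverse)  # dplyr and forcats

analytic_data_for_tables_all <- analytic_data_final |>
  # Convert NA values in the depression factor to an explicit level
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  ) |>
  select("id", all_of(all_proposed_covariates)) |>
  arrange(id)

analytic_data_for_tables_select <- analytic_data_final |>
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  ) |>
  select("id", all_of(select_covariates)) |>
  arrange(id)
```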
What's happening here?
analytic_data_for_tables_all:
We start with our analytic_data_final dataset (the result of all our previous preprocessing).
mutate(depression_level_at_year_1 = fct_na_value_to_level(depression_level_at_year_1, "Missing")): We take a variable called depression_level_at_year_1 and convert any missing values (NA) within it to a new level specifically labeled "Missing." This is performed using the fct_na_value_to_level function.
select("id", all_of(all_proposed_covariates)): We select only the id column and the columns listed in our all_proposed_covariates variable.
arrange(id): Finally, we sort the data by participant ID.
analytic_data_for_tables_select:
We create a similar data frame, but this time we select only the id column and the columns listed in our select_covariates variable, using select("id", all_of(select_covariates)).
We also use the fct_na_value_to_level function to convert missing values in depression_level_at_year_1 to a "Missing" level.
The data are also sorted by participant ID.
Why It Matters
Targeted Data Frames: We now have two data frames specifically designed for generating descriptive statistics tables. analytic_data_for_tables_all allows for a broad overview of all potential covariates, while analytic_data_for_tables_select focuses on the variables most relevant to our primary research question.
Handling Missing Data in Categorical Variables: By converting missing values in depression_level_at_year_1 to a distinct "Missing" level, we are preparing this variable for inclusion in our descriptive tables. This allows us to represent and analyze missingness within this key variable, rather than simply ignoring it.
Foundation for Exploration: These data frames will be the foundation for creating informative tables that summarize the characteristics of our study population, overall and stratified by depression level.
Conceptual Takeaways: Preparing for Insightful Exploration
These steps—defining our covariates, assigning clear labels, and creating tailored data frames—are essential for setting the stage for a robust and insightful exploratory data analysis.
Here's why this preparation is so critical:
Balancing Breadth and Focus: We've created both comprehensive and focused variable lists, allowing us to explore our data broadly while also maintaining a clear focus on our primary research question.
Model Stability: The refined select_covariates list helps us build more stable and reliable survival models by reducing the risk of overfitting.
Enhanced Communication: Clear and descriptive variable labels ensure that our findings will be accessible and interpretable by a wide audience.
Looking Ahead: Visualizing Missingness and Generating Descriptive Statistics
With our data frames prepared, we're now ready to continue our journey in exploratory data analysis. In the next section, we'll create comprehensive descriptive statistics tables that summarize the characteristics of our study population, overall and stratified by depression level at Year 1.
By combining careful data preparation with insightful visualizations and descriptive summaries, we're setting the stage for building robust survival models and uncovering meaningful insights into the relationship between depression and long-term survival after TBI.
3.3 Creating a Complete-Case Sample
Introduction
Before we dive into generating descriptive statistics and building our survival models, we need to address the issue of missing data. While more sophisticated methods like multiple imputation exist, for this analysis, we will focus on a simpler and more transparent approach: complete-case analysis, also known as listwise deletion.
This means that we'll be creating a subset of our data—a complete-case sample—that includes only those participants who have complete data on our key variables of interest. This sample will be used for generating our descriptive statistics tables and for our Cox regression models, providing a clear picture of the characteristics of participants with complete information.
Why Complete-Case Analysis (Listwise Deletion) in This Context?
While complete-case analysis has limitations (mainly a potential reduction in sample size and potential for bias if data are not missing completely at random), we are choosing this method here for several reasons:
Simplicity and Transparency: Complete-case analysis is straightforward to implement and understand. It involves simply removing any participant with missing values on the variables of interest. This transparency makes our analysis easier to interpret and reproduce. It also allows us to clearly see the impact of missing data on our sample size.
Consistency Across Analyses: By using the same complete-case sample for our descriptive tables and Cox regression models, we ensure that all analyses are based on the same group of participants. This makes our results directly comparable.
Foundation for Comparison: The complete-case analysis serves as an important baseline. While we will not perform multiple imputation in this blog series, we acknowledge that doing so is generally preferred. However, for illustrative purposes, presenting the complete-case analysis first provides a clear and simple starting point for understanding our data.
Important Note: We understand that listwise deletion can potentially reduce statistical power and may introduce bias if the data are not missing completely at random (MCAR). However, for the purposes of this blog series, we are prioritizing simplicity to illustrate the core concepts of survival analysis. A detailed exploration of multiple imputation will be the focus of a later blog series.
Step 1: Excluding Non-Essential Variables - Focusing on Key Predictors
Not all variables are equally important when defining our complete-case sample. For our descriptive statistics and our Cox regression models, we want to focus on the core set of predictor variables. Therefore, we'll exclude variables that are:
Not directly used in our descriptive tables or models: This helps to streamline the process and focus on the variables that matter most for these specific analyses.
Innately Incomplete Due to Study Design: Importantly, we will exclude time_to_censorship_in_years and time_to_expiration_in_years. These variables are, by definition, incomplete. Not all participants will have experienced the event (death), so some will have a time_to_censorship but no time_to_expiration, and vice versa. Instead, we will focus on the completeness of the combined variable, time_to_event_in_years, which captures each participant's time to either censorship or expiration, ensuring greater completeness. We will also exclude age_at_censorship and age_at_expiration for similar reasons.
Here's how we define the variables to exclude:
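A sketch of this definition (all four variable names appear in the discussion above):

```r
# Variables that are incomplete by design: each participant has
# either a censorship time or an expiration time, never both
variables_to_exclude <- c(
  "time_to_censorship_in_years",
  "time_to_expiration_in_years",
  "age_at_censorship",
  "age_at_expiration"
)
```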
We then use the setdiff function to create two new lists of variables that will be used when creating our complete-case datasets:
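A sketch, assuming the lists defined earlier:

```r
# Subtract the excluded variables from each covariate list
variables_for_cc_all    <- setdiff(all_proposed_covariates, variables_to_exclude)
variables_for_cc_select <- setdiff(select_covariates, variables_to_exclude)
```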
What's happening here?
We create a character vector called variables_to_exclude containing the names of variables we want to omit when checking for complete cases.
setdiff is used to create lists of variables for our two complete-case data frames, variables_for_cc_all and variables_for_cc_select, by subtracting the variables_to_exclude from all_proposed_covariates and select_covariates, respectively.
Why It Matters
Focus: By excluding these variables, we're focusing on the completeness of our core predictor variables and our key outcome variable, time_to_event_in_years.
Efficiency: This simplifies the process of identifying complete cases without affecting the integrity of our analysis for these specific tasks.
Step 2: Preparing Complete-Case Data Frames - Creating Subsets for Analysis
Now, we'll create two complete-case data frames based on our previously defined variable lists:
analytic_data_for_cc_all: Contains all variables in all_proposed_covariates (minus the excluded variables).
analytic_data_for_cc_select: Contains only the variables in select_covariates (minus the excluded variables).
What's happening here?
We're using select() to create subsets of our analytic_data_final data frame, including only the relevant variables for each complete-case analysis.
arrange(id) ensures that the data are sorted by participant ID, maintaining consistency across datasets.
Why It Matters
Tailored Datasets: We're creating data frames specifically designed for complete-case analysis, making our workflow more organized and efficient.
Foundation for Comparison: These data frames will serve as the basis for identifying complete cases in the next step.
Step 3: Isolating Complete Cases - The complete.cases Function
Now, we'll use the powerful complete.cases() function to identify and isolate the rows (participants) in our data frames that have no missing values across the selected variables.
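A sketch of this step:

```r
# complete.cases() returns TRUE for rows with no missing values;
# use it to subset each data frame to its complete cases
complete_cases_all <-
  analytic_data_for_cc_all[complete.cases(analytic_data_for_cc_all), ]

complete_cases_select <-
  analytic_data_for_cc_select[complete.cases(analytic_data_for_cc_select), ]
```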
What's happening here?
complete.cases(): This function checks each row of a data frame and returns TRUE if all columns in that row have non-missing values, and FALSE otherwise.
We apply complete.cases() to both analytic_data_for_cc_all and analytic_data_for_cc_select, creating two new data frames, complete_cases_all and complete_cases_select, that contain only the complete cases.
Why It Matters
Identifying Complete Cases: This step efficiently identifies the subset of participants for whom we have complete data on the variables of interest.
Creating Analysis-Ready Datasets: The resulting complete_cases_all and complete_cases_select data frames are now ready for generating descriptive statistics and for use in our Cox regression models.
Step 4: Enhancing the Complete-Case Data for Tables
For our descriptive statistics tables, we want to include some additional variables (like our time-to-event variables) that weren't used in defining complete cases. We'll add these variables back into our complete_cases_all and complete_cases_select datasets.
What's happening here?
We are going to merge our complete-case datasets with a subset of our main dataset (analytic_data_final) that only contains the id column and the variables that we want to add back in. We will then reorder the columns to ensure that the newly added variables are placed in a logical position within the data frame.
Why It Matters
Comprehensive Tables: This step ensures that our descriptive statistics tables include all relevant variables, even those not used in defining complete cases.
Contextual Information: Including variables like time_to_event_in_years in our tables provides important context when describing the characteristics of our complete-case sample.
Here's how we enhance both complete-case data frames:
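A sketch of the enhancement for the broader data frame (the same pattern, applied to complete_cases_select, yields complete_cases_for_tables_select):

```r
complete_cases_for_tables_all <- complete_cases_all |>
  # Re-attach the design-incomplete variables by participant ID
  left_join(
    analytic_data_final |>
      select(
        id,
        time_to_censorship_in_years, time_to_expiration_in_years,
        age_at_censorship, age_at_expiration
      ),
    by = "id"
  ) |>
  # Place the re-added variables after id, event_status, and
  # time_to_event_in_years, keeping the remaining columns in order
  select(
    id, event_status, time_to_event_in_years,
    time_to_censorship_in_years, time_to_expiration_in_years,
    age_at_censorship, age_at_expiration,
    everything()
  )
```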
What's happening here?
Joining Additional Variables:
left_join(…): We use left_join to merge our complete_cases_all (and complete_cases_select) data frame with a subset of analytic_data_final that contains the id column and the variables that we want to add back in (time_to_censorship_in_years, time_to_expiration_in_years, age_at_censorship, age_at_expiration). The merge is performed based on the common id column.
Reordering Columns:
select(…): We carefully reorder the columns to ensure that the newly added variables are placed in a logical position within the data frame. We place them after the first three columns (which are id, event_status, and time_to_event_in_years) and before the rest of the original columns.
Now, both complete_cases_for_tables_all and complete_cases_for_tables_select are ready for generating comprehensive descriptive statistics tables that include all relevant variables for describing our complete-case sample.
Step 5: Saving Complete-Case Data Frames for Future Use
Finally, we'll save our complete-case data frames to both .rds (for use in R) and .csv (for broader accessibility) files.
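A sketch of the save step (file names are illustrative):

```r
# .rds preserves factor levels and labels; .csv is portable
saveRDS(
  complete_cases_for_tables_all,
  here("Data", "Processed", "complete_cases_for_tables_all.rds")
)
write_csv(
  complete_cases_for_tables_all,
  here("Data", "Processed", "complete_cases_for_tables_all.csv")
)

saveRDS(
  complete_cases_for_tables_select,
  here("Data", "Processed", "complete_cases_for_tables_select.rds")
)
write_csv(
  complete_cases_for_tables_select,
  here("Data", "Processed", "complete_cases_for_tables_select.csv")
)
```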
Why It Matters
Reproducibility: Saving these data frames ensures that we can easily recreate our analyses and share our work with others.
Efficiency: We can load these data frames directly in future sessions, avoiding the need to repeat the complete-case selection process.
Key Takeaways: Building a Transparent Analysis
By creating these complete-case samples, we've taken a crucial step toward ensuring the robustness and transparency of our analysis. We've:
Defined Clear Criteria: We've established clear criteria for identifying participants with complete data on our key variables.
Created Tailored Datasets: We've generated data frames specifically designed for complete-case analysis, providing a solid foundation for our descriptive statistics and Cox regression models.
Prioritized Simplicity and Transparency: We've opted for a straightforward approach (listwise deletion) to enhance the interpretability and reproducibility of our findings.
Looking Ahead: Exploring Our Data Through Descriptive Statistics
With our complete-case datasets in hand, we're now ready to generate descriptive statistics tables. In the next section, we'll summarize the key characteristics of our study population, comparing the complete-case sample to the full analytic sample and exploring potential differences between participants with different levels of depression at Year 1. This crucial exploratory step will pave the way for a deeper understanding of our data and inform our subsequent survival modeling.
3.4 Generating Descriptive Statistics Tables
Introduction
We've prepared our data, and now comes the exciting part: exploring its characteristics through descriptive statistics tables! These tables will provide a comprehensive overview of our study population, summarizing key variables and revealing important patterns that will inform our survival models. We'll generate these descriptive statistics tables for both the full analytic sample and the complete-case sample (i.e., participants with no missing data on key variables) to assess the impact of missing data on the characteristics of the sample.
Think of this stage as getting to know the participants in our study. Who are they? What are their demographics, injury characteristics, and mental health histories? How do these characteristics differ across depression levels at Year 1? Descriptive statistics will help us answer these questions.
Why Descriptive Statistics Matter
Descriptive statistics tables are more than just lists of numbers. They provide a crucial foundation for understanding our data by:
Summarizing Key Variables: They provide a snapshot of the distribution of important variables like age, sex, functional independence scores, and mental health history.
Revealing Patterns and Trends: They help us identify potential relationships between variables and highlight differences between groups (e.g., participants with and without depression at Year 1).
Assessing Data Quality: They can reveal potential issues with our data, such as unexpected distributions or high levels of missingness.
Guiding Model Building: The insights gained from descriptive statistics inform the development of our survival models, helping us choose appropriate covariates and model specifications.
Creating Our Descriptive Tables: A Step-by-Step Guide
We'll be using the powerful gtsummary package in R to create our descriptive tables. gtsummary simplifies the process of generating beautiful, publication-ready tables with minimal code.
Step 1: Handling Missing Data in depression_level_at_year_1
Before we generate our tables, we need to make a decision about how to handle missing values in our key stratifying variable, depression_level_at_year_1. For the descriptive tables, we'll treat missing values as a separate category, allowing us to see the characteristics of participants for whom we don't have depression data.
Here is the code we will use to do this:
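A sketch (this repeats the transformation applied in Section 3.2, shown here for the table-ready data frame):

```r
# Convert NA values in the stratifying factor to a "Missing" level
analytic_data_for_tables_all <- analytic_data_for_tables_all |>
  mutate(
    depression_level_at_year_1 =
      fct_na_value_to_level(depression_level_at_year_1, "Missing")
  )
```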
What It Does
The fct_na_value_to_level() function from the forcats package takes any NA values in the depression_level_at_year_1 variable and converts them to a new factor level labeled "Missing."
Why It Matters
This ensures that participants with missing depression data are not excluded from our descriptive tables. Instead, they are included as a distinct category, allowing us to examine their characteristics and compare them to other groups.
Step 2: Generating Descriptive Tables for the Full Analytic Sample
Now, let's create our first descriptive table, summarizing the characteristics of our full analytic sample:
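A sketch of the table pipeline (the statistic and digits specifications, table title, and output file name are illustrative; the overall structure follows the breakdown below):

```r
table_full_sample <- analytic_data_for_tables_all |>
  select(-"id") |>
  tbl_summary(
    by = depression_level_at_year_1,
    type = list(
      calendar_year_of_injury ~ "continuous",
      gose_total_at_year_1    ~ "continuous"
    ),
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(
      all_continuous()  ~ 1,
      all_categorical() ~ c(0, 1)
    ),
    label = var_name_mapping
  ) |>
  add_overall() |>
  bold_labels() |>
  add_p() |>
  as_gt() |>
  tab_header(title = "Table 2-1. Characteristics of the Full Analytic Sample")

gtsave(table_full_sample, filename = here("Output", "Tables", "table_2_1.png"))
```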
What's happening here?
select(-"id")
: We remove theid
column, as it's not needed for our descriptive table.tbl_summary(…)
: This is the core function fromgtsummary
that generates the descriptive table.by = depression_level_at_year_1
: We stratify our table by depression level at Year 1, allowing us to compare characteristics across these groups.type = …
: We specify the data type for certain variables. Here, we indicate thatcalendar_year_of_injury
andgose_total_at_year_1
should be treated as continuous variables.statistic = …
: We define the summary statistics to display. For continuous variables, we use the median and interquartile range (IQR), which are appropriate for non-normally distributed data. For categorical variables, we show frequencies and percentages.digits = …
: We specify the number of decimal places to display for different variable types.label = var_name_mapping
: We use ourvar_name_mapping
list to apply descriptive labels to the variables in the table.add_overall()
: This adds a column summarizing the entire sample, without stratification.bold_labels()
: This bolds the variable labels in the table for better readability.add_p(…)
: This adds p-values for comparisons between depression levels, allowing us to assess the statistical significance of any observed differences.as_gt()
: This converts thegtsummary
table object to agt
table object, which offers more advanced formatting options.tab_header(…)
: This adds a title to our table.gtsave(…)
: This saves the table as a.png
file.
Why It Matters
Comprehensive Overview: This table provides a detailed overview of our full analytic sample, stratified by depression level.
Informs Hypothesis Generation: By examining differences between depression groups, we can start to generate hypotheses about the factors that might influence survival.
Step 3: Generating Descriptive Tables for the Complete-Case Sample
We repeat the process, this time using our complete_cases_for_tables_all dataset to create a table summarizing the characteristics of our complete-case sample:
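A sketch (the pipeline mirrors the one above, pointed at the complete-case data; the type and digits specifications are omitted here for brevity, and the title and file name are illustrative):

```r
table_complete_cases <- complete_cases_for_tables_all |>
  select(-"id") |>
  tbl_summary(
    by = depression_level_at_year_1,
    statistic = list(
      all_continuous()  ~ "{median} ({p25}, {p75})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = var_name_mapping
  ) |>
  add_overall() |>
  bold_labels() |>
  add_p() |>
  as_gt() |>
  tab_header(title = "Table 2-2. Characteristics of the Complete-Case Sample")

gtsave(table_complete_cases, filename = here("Output", "Tables", "table_2_2.png"))
```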
Why It Matters
Assessing the Impact of Missingness: Comparing this table to the one generated from the full sample allows us to assess whether participants with complete data differ systematically from those with missing data. This helps us understand the potential impact of missingness on our findings.
Transparency: Presenting both tables provides a transparent view of our data and the potential limitations of using a complete-case approach.
Comparing Full and Complete Case Samples
By examining both the full analytic sample and the complete-case sample, we can:
Assess Potential Bias: We can observe whether the characteristics of participants with complete data differ systematically from the full dataset. This helps us understand potential biases introduced by listwise deletion.
Ensure Robustness: We can validate that key findings are consistent across both datasets, increasing our confidence in the results.
Pro Tips for Creating Effective Tables
Label Early and Often: Define your variable labels early in the preprocessing process (as we did with var_name_mapping) and apply them consistently throughout your analysis.
Automate Repetitive Tasks: Use functions or loops to streamline table generation, especially when creating multiple tables with similar structures.
Validate Your Tables: Always carefully check the output of your tables to ensure that all variables and categories are displayed as expected and that the numbers make sense.
Looking Ahead: Visualizing Our Data and Building Survival Models
With our descriptive statistics tables in hand, we're well-equipped to interpret the characteristics of our study population and begin formulating hypotheses about the relationship between depression and survival.
In the next blog section, we'll focus on interpreting the descriptive statistics tables that we just created. We'll carefully examine the characteristics of our study population, both overall and stratified by depression level at Year 1. This crucial step will involve:
Comparing Groups: We'll analyze differences in sociodemographics, mental health histories, and functional status between participants with different levels of depression at Year 1.
Assessing the Impact of Missingness: We'll compare the characteristics of the full analytic sample to the complete-case sample, helping us understand the potential impact of missing data on our findings.
Generating Hypotheses: The insights gained from these tables will inform our hypotheses about the relationship between depression and survival, guiding the development of our Cox regression models.
This in-depth exploration of our descriptive statistics will provide a crucial foundation for understanding our data and building robust survival models. We'll be transforming numbers into narratives, setting the stage for uncovering meaningful insights about the factors that influence long-term outcomes after TBI.
3.5 Interpreting Descriptive Statistics Tables
Introduction
We've reached a critical point in our survival analysis journey—interpreting our descriptive statistics tables. These tables, generated from our carefully prepared complete-case sample, provide a wealth of information about the characteristics of our study population. They offer a vital foundation for understanding the relationships between our key variables and for informing the development of our survival models.
In this section, we'll focus on interpreting Table 2-2, which summarizes the characteristics of our complete-case sample, stratified by depression level at Year 1. We'll also compare these findings to Table 2-1, which describes the full analytic sample, to assess the potential impact of missing data.
Table 2-2: A Snapshot of Our Complete-Case Sample

Table 2-2 provides a detailed overview of our complete-case sample (N = 1,549), broken down by depression level at Year 1: "No Depression," "Minor Depression," and "Major Depression." Let's examine some of the key findings:
Sociodemographic Characteristics:
Sex: The majority of participants in the complete-case sample are male (71%), which is consistent with the full analytic sample. However, the proportion of males is slightly lower in the major depression group (67%) compared to the other groups.
Age at Injury: The median age at injury is 36 years in the complete-case sample. The distribution of age at injury appears to vary across depression levels. Participants with minor depression tend to be slightly younger (median 32 years) compared to those with no depression (median 38) or major depression (median 37).
Educational Attainment: The median educational attainment is 12 years (equivalent to a high school diploma) across all depression levels. However, the interquartile range (IQR) suggests slightly less variability in educational attainment among those with major depression.
Medicaid Status: A statistically significant difference exists in Medicaid status across depression levels. Participants with major depression have a higher proportion enrolled in Medicaid (26%) compared to those with no depression (19%) or minor depression (23%).
Clinical and Functional Characteristics:
Mortality: Overall, 9.5% of participants in the complete-case sample died during the study period. There are no statistically significant differences in mortality rates across depression levels, though this could be related to the reduction in sample size with the complete-case sample.
Time to Event/Censorship/Expiration: The median time to event, time to censorship, and time to expiration are similar across all groups in the complete-case sample, as well as to the full analytic sample. However, the median age at expiration is notably lower in the major depression group (58 years) compared to the no depression (74 years) and minor depression (69 years) groups.
Function Factor Score at Year 1: This variable, reflecting functional independence, shows significant differences across depression levels. As expected, participants with major depression have a lower median score (indicating greater functional impairment) compared to those with minor or no depression. This pattern is consistent with what was observed in the full analytic sample.
Mental Health and Substance Use History:
History of Mental Health Treatment: Participants with major depression were more likely to report receiving mental health treatment within the year preceding their injury (18%) compared to those with no depression (7.2%) or minor depression (12%). A similar, though less pronounced, pattern is observed for mental health treatment received prior to the year preceding the injury.
History of Suicide Attempt: Participants with major or minor depression reported significantly higher rates of suicide attempts both prior to their injury and within the first year post-injury compared to those with no depression.
Problematic Substance Use at Injury: Individuals with major or minor depression had higher rates of problematic substance use at injury (61% and 60%, respectively) compared to those with no depression (52%).
Impact of Missingness: Comparing to the Full Analytic Sample (Table 2-1):
Reduced Sample Size: The complete-case sample (N = 1,549) is considerably smaller than the full analytic sample (N = 4,283), primarily due to missing data on mental health-related variables, including depression level at Year 1.
Potential for Bias: While the overall patterns are generally similar between the two samples, there is no longer a statistically significant difference in mortality rates across depression levels in the complete-case sample. Other differences in the magnitude of effects and p-values are also noted. This suggests that the missing data may not be completely random and that listwise deletion could have introduced some bias.
Lower Representation of Certain Groups: For instance, the proportion of the sample with problematic substance use at injury is slightly higher in the complete-case sample. This could indicate underlying differences between those with and without missing data that are important to keep in mind when interpreting the results of models fit to this sample.
Key Insights and Implications
Depression and Functioning: The descriptive statistics confirm the expected association between depression and functional status. Participants with greater depression severity tend to have lower functional independence scores.
Mental Health History: The data highlights the complex interplay between depression and other mental health factors. Individuals with major depression are more likely to have a history of mental health treatment and substance use issues. The association between depression and suicide attempts is also particularly strong.
Potential Confounders: The observed differences in demographic and clinical characteristics across depression levels suggest that these variables might be confounders in the relationship between depression and mortality. We'll need to account for these potential confounders in our survival models.
Limitations of Complete-Case Analysis: The comparison with the full analytic sample underscores the potential limitations of listwise deletion, particularly the reduction in sample size and the possibility of bias.
Moving Forward: Building Our Survival Models
These descriptive statistics tables have provided us with a foundation for understanding our study population and the potential relationships between depression, covariates, and survival. But numbers alone can only tell us so much.
In the next blog post, we'll bring our data to life through univariate and bivariate visualizations. We'll create plots that complement our descriptive tables, allowing us to:
Visualize distributions: We'll use histograms and boxplots to examine the distribution of key variables like age, functional scores, and time-to-event, both overall and stratified by depression level.
Explore relationships between variables: We'll create scatter plots and other visualizations to explore how different variables relate to each other, helping us identify potential confounders and interaction effects. For instance, we might plot the relationship between depression severity and functional independence scores, or between age and time-to-event.
Visualize survival patterns: We will use Kaplan-Meier curves to visualize the survival probabilities over time for different depression groups, providing an intuitive graphical representation of survival differences.
These visualizations will not only enhance our understanding of the data but also help us communicate our findings more effectively. They will also inform the development of our survival models in subsequent posts, providing a bridge between descriptive exploration and formal statistical modeling.
By combining the power of descriptive statistics with insightful visualizations, we're setting the stage for a more nuanced and impactful analysis of the relationship between depression and survival after TBI.
Conclusion
We've reached a pivotal point in our survival analysis journey! We've transitioned from the meticulous work of data cleaning and preparation to the exciting realm of data exploration. By generating and interpreting detailed descriptive statistics tables, we've gained a much richer understanding of our study population and the intricate relationships within our data. This isn't just about summarizing numbers; it's about unveiling the story hidden within our data, a story that will ultimately help us understand the impact of depression on long-term survival after TBI.
Reflecting on Our Progress: Building a Foundation for Discovery
Let's recap the crucial steps that have brought us to this point:
Establishing a Reproducible Workflow: We began by setting up a well-organized R environment, loading essential libraries, and creating a structured directory system. This ensures that our analysis is transparent, efficient, and easily replicable.
Defining and Refining Our Variables: We carefully curated our list of covariates, creating both comprehensive and focused sets for different stages of analysis. We also assigned clear, descriptive labels, ensuring that our data is readily interpretable.
Addressing Missing Data: We created complete-case samples, providing a transparent and straightforward way to explore our data without the complexities of imputation. This allowed us to directly assess the impact of missing data on our sample characteristics.
Generating Insightful Descriptive Tables: Using the gtsummary package, we created detailed tables summarizing the key demographic, injury-related, functional, and mental health characteristics of our study population, stratified by depression level at Year 1. These tables revealed important differences between groups and highlighted potential confounders to consider in our models.
Interpreting the Story in the Numbers: We went beyond simply reporting statistics; we analyzed the patterns in our tables, comparing the full and complete-case samples, and drawing initial insights into the potential relationship between depression and other key variables.
Why This Matters: Transforming Numbers into Narratives
These steps are far more than just technical exercises. They're about transforming raw data into a coherent narrative that we can understand and learn from. We've ensured that:
Our Data is High-Quality and Reliable: By addressing missingness and carefully defining our variables, we've built a foundation of trustworthy data.
We Have a Deeper Understanding of Our Population: Our descriptive tables have given us a nuanced view of the individuals in our study, their characteristics, and their experiences.
We're Ready to Ask More Complex Questions: The insights gained from our descriptive exploration have informed our hypotheses and prepared us for building sophisticated survival models.
Looking Ahead: Visualizing Patterns and Building Models
Our exploratory journey is far from over! In the next blog posts, we'll:
Bring Our Data to Life with Visualizations: We'll create a variety of plots, including histograms, scatter plots, and Kaplan-Meier curves, to visually explore distributions, relationships between variables, and survival patterns. These visualizations will complement our tables and provide a more intuitive understanding of our data.
Construct and Interpret Survival Models: Armed with the insights gained from our descriptive and visual exploration, we'll build Cox proportional hazards models to quantify the impact of depression on long-term survival, while controlling for other important factors.
We're now on the cusp of transforming our carefully prepared data into actionable insights that have the potential to improve clinical practice and enhance the lives of individuals recovering from TBI. The journey continues, and we're excited to share the next chapter with you.