CardioNIR — Predictive Modeling Pipeline

R/tidymodels Framework for Clinical Risk Prediction from NIRS, Biomarker, and Clinical Data

Categories: CardioNIR, Machine Learning, tidymodels, Pipeline, R

Author: CardioNIR Team

Published: March 9, 2026

1 Overview

Work Package 6 (WP6) of the CardioNIR project developed a modular, reproducible R/tidymodels framework for clinical risk prediction integrating NIRS spectral data, molecular biomarkers (proteomic and metabolomic), and clinical variables. The pipeline covers the full analytical workflow from data ingestion through model evaluation, interpretability, and network analysis.

Figure 1. Predictive Modeling Pipeline overview: five-stage workflow from data input through recipe preprocessing, stratified splitting, cross-validated tuning, and model selection. Seven model engines are available. Evaluation spans four dimensions: discrimination, calibration, clinical utility, and interpretability. A network analysis module complements predictive modelling.

2 Pipeline Architecture

The pipeline is built entirely on the tidymodels ecosystem, ensuring a consistent grammar for model specification, preprocessing, tuning, and evaluation.

2.1 Stage 1: Data Input

The framework accepts three data modalities:

  • Clinical variables — demographics, surgical parameters, comorbidities, haemodynamic measures
  • Biomarkers — Olink proteomic NPX values (49 proteins), LC-MS metabolomic features (46 metabolites)
  • NIRS spectra — SNV-transformed spectral features from the 900–1700 nm range
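The SNV (Standard Normal Variate) transform mentioned for the NIRS block can be sketched in base R. This is a minimal illustration rather than the project's own code: the function name `snv()` and the toy matrix are assumptions, and each row is taken to be one spectrum.

```r
# Standard Normal Variate: centre and scale each spectrum (row) by its own
# mean and standard deviation computed across wavelengths.
snv <- function(spectra) {
  spectra <- as.matrix(spectra)
  row_means <- rowMeans(spectra)
  row_sds <- apply(spectra, 1, sd)
  sweep(sweep(spectra, 1, row_means, "-"), 1, row_sds, "/")
}

# Hypothetical toy data: 3 spectra sampled at 5 wavelengths (900 to 1700 nm)
set.seed(1)
toy_spectra <- matrix(
  runif(15, 0.2, 1.2),
  nrow = 3,
  dimnames = list(NULL, paste0("nm_", seq(900, 1700, by = 200)))
)
snv_spectra <- snv(toy_spectra)

# After SNV every spectrum has mean 0 and standard deviation 1
round(rowMeans(snv_spectra), 10)
round(apply(snv_spectra, 1, sd), 10)
```

Because SNV works row by row, it does not learn anything from other samples and can be applied before the train/test split without leaking information.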

Data fusion strategies include early integration (feature concatenation), late integration (model stacking), and multi-block methods.
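As an illustration of early integration, the sketch below concatenates features from the three modalities on a shared patient key. The table names, column names, and the `prefix_block()` helper are hypothetical and chosen only for this example.

```r
# Hypothetical per-modality tables keyed by a shared patient identifier
clinical   <- data.frame(patient_id = c("P1", "P2", "P3"), age = c(61, 70, 55))
biomarkers <- data.frame(patient_id = c("P1", "P2", "P3"), il6_npx = c(3.2, 4.1, 2.8))
nirs       <- data.frame(patient_id = c("P3", "P1", "P2"), nm_900 = c(0.12, -0.40, 0.31))

# Prefix feature columns with their modality so the origin stays traceable
prefix_block <- function(df, prefix, key = "patient_id") {
  other <- setdiff(names(df), key)
  names(df)[match(other, names(df))] <- paste0(prefix, "_", other)
  df
}

blocks <- list(
  prefix_block(clinical, "clin"),
  prefix_block(biomarkers, "bio"),
  prefix_block(nirs, "nirs")
)

# Early integration: inner-join all blocks on the patient key
fused <- Reduce(function(x, y) merge(x, y, by = "patient_id"), blocks)
fused
```

An inner join keeps only patients present in every block; late integration would instead fit one model per block and combine their predictions.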

2.2 Stage 2: Preprocessing Recipe

Preprocessing is defined as a reproducible recipe() with the following steps:

library(tidymodels)

recipe(outcome ~ ., data = training_data) |>
  step_zv(all_predictors()) |>                               # Remove zero-variance features
  step_dummy(all_nominal_predictors()) |>                    # Encode categorical predictors (not the outcome)
  step_normalize(all_numeric_predictors()) |>                # Centre and scale
  step_corr(all_numeric_predictors(), threshold = 0.90) |>   # Optional: remove highly correlated
  step_pca(all_numeric_predictors(), threshold = 0.95)       # Optional: dimensionality reduction

Additional steps for spectral data include step_ns() for basis expansion, step_filter_missing(), and custom spectral derivative steps.
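The custom spectral derivative steps are not shown in this document. As a rough sketch of the underlying operation, a first derivative can be approximated by finite differences between adjacent wavelengths; the function name `first_derivative()` and the toy spectra are assumptions, and wrapping this into a recipes step is left out.

```r
# First derivative of each spectrum (row) by finite differences,
# dividing by the wavelength spacing so units are "per nm".
first_derivative <- function(spectra, wavelengths) {
  spectra <- as.matrix(spectra)
  d_abs <- t(apply(spectra, 1, diff))        # differences between adjacent bands
  sweep(d_abs, 2, diff(wavelengths), "/")    # divide by wavelength step
}

# Hypothetical toy spectra: one linear, one quadratic in wavelength
wl <- c(900, 910, 920, 930)
toy_spectra <- rbind(2 * wl, wl^2)
deriv_spectra <- first_derivative(toy_spectra, wl)
deriv_spectra
```

In practice a smoothing derivative (for example Savitzky-Golay) is often preferred because plain differences amplify noise.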

2.3 Stage 3: Data Splitting

Stratified train/test splitting ensures balanced outcome representation:

data_split <- initial_split(data, prop = 0.75, strata = outcome)
training_data <- training(data_split)   # held-out set: testing(data_split)
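To see what stratification does, the following sketch splits a simulated data set and compares event rates in the two partitions. The data frame `toy` and its 20% event rate are made up for illustration.

```r
library(rsample)

set.seed(42)
# Hypothetical data: 40 events and 160 non-events
toy <- data.frame(
  outcome = factor(rep(c("event", "no_event"), times = c(40, 160))),
  x = rnorm(200)
)

data_split <- initial_split(toy, prop = 0.75, strata = outcome)
training_data <- training(data_split)
testing_data <- testing(data_split)

# Event rates should be close to the overall rate in both partitions
prop.table(table(training_data$outcome))
prop.table(table(testing_data$outcome))
```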

2.4 Stage 4: Cross-Validation and Tuning

Hyperparameter tuning uses 10-fold cross-validation with a grid search of 20 candidate parameter sets:

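folds <- vfold_cv(training_data, v = 10, strata = outcome)
# wf (hypothetical name): a workflow() bundling the recipe with a model specification
tuned <- tune_grid(wf, resamples = folds, grid = 20, metrics = metric_set(roc_auc))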

2.5 Stage 5: Model Selection

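The best model is selected by select_best(metric = "roc_auc") and finalised with finalize_workflow() followed by last_fit(), which refits the model on the full training set and evaluates it once on the held-out test set.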
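Stages 2 to 5 can be strung together as in the sketch below. It assumes the tidymodels and glmnet packages are installed, uses simulated data in place of the project data, and picks an elastic net purely as an example; object names such as `sim`, `rec`, `spec`, and `wf` are illustrative.

```r
library(tidymodels)

set.seed(2026)
# Hypothetical simulated data: binary outcome driven by two of five predictors
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
sim$outcome <- factor(
  ifelse(plogis(1.5 * sim$x1 - sim$x2) > runif(n), "event", "no_event"),
  levels = c("event", "no_event")
)

# Stage 3: stratified split; Stage 4: stratified 10-fold resamples
data_split <- initial_split(sim, prop = 0.75, strata = outcome)
training_data <- training(data_split)
folds <- vfold_cv(training_data, v = 10, strata = outcome)

# Stage 2: preprocessing recipe (reduced to two steps for brevity)
rec <- recipe(outcome ~ ., data = training_data) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Elastic net with both hyperparameters marked for tuning
spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Stage 4: grid search over 20 candidates, scored by ROC AUC
tuned <- tune_grid(wf, resamples = folds, grid = 20,
                   metrics = metric_set(roc_auc))

# Stage 5: pick the best candidate, refit on training data, test once
best <- select_best(tuned, metric = "roc_auc")
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(data_split)

collect_metrics(final_fit)
```

Keeping the recipe inside the workflow means preprocessing is re-estimated within each resample, so the cross-validated AUC is not inflated by information from the assessment folds.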

3 Model Engines

The pipeline supports seven model families, all specified through the tidymodels interface:

Table 1. Model engines available in the CardioNIR pipeline.

| Engine            | Function                  | Type                    | Key Hyperparameters                   |
|-------------------|---------------------------|-------------------------|---------------------------------------|
| Elastic Net       | logistic_reg(glmnet)      | Penalised regression    | penalty, mixture (LASSO/Ridge/EN)     |
| Random Forest     | rand_forest(ranger)       | Ensemble                | mtry, trees, min_n                    |
| XGBoost           | boost_tree(xgboost)       | Gradient boosting       | tree_depth, learn_rate, trees         |
| SVM               | svm_rbf(kernlab)          | Kernel methods          | cost, rbf_sigma                       |
| Neural Net        | mlp(nnet)                 | Single-layer perceptron | hidden_units, penalty, epochs         |
| Bayesian Logistic | logistic_reg(stan)        | Bayesian regression     | Priors, chains, iter                  |
| Decision Tree     | decision_tree(C5.0/rpart) | Rule-based              | cost_complexity, tree_depth           |

All engines follow the same workflow() → tune_grid() → select_best() → last_fit() pattern, enabling systematic comparison across model families.
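Because every engine goes through the same interface, swapping model families amounts to swapping the specification object. The sketch below builds four of the seven specifications; the list name `specs` and the fixed `trees = 500` are assumptions, and fitting would additionally require the engine packages to be installed.

```r
library(parsnip)
library(tune)

# Illustrative specifications; tune() marks hyperparameters for grid search
specs <- list(
  elastic_net = logistic_reg(penalty = tune(), mixture = tune()) |>
    set_engine("glmnet"),
  random_forest = rand_forest(mode = "classification", mtry = tune(),
                              trees = 500, min_n = tune()) |>
    set_engine("ranger"),
  xgboost = boost_tree(mode = "classification", tree_depth = tune(),
                       learn_rate = tune(), trees = tune()) |>
    set_engine("xgboost"),
  svm = svm_rbf(mode = "classification", cost = tune(), rbf_sigma = tune()) |>
    set_engine("kernlab")
)

# Each element can be passed to add_model() in an otherwise unchanged workflow
names(specs)
```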

4 Evaluation Framework

Model evaluation is structured across four complementary dimensions:

4.1 Discrimination

Assessment of how well the model separates outcome classes:

  • roc_auc() and roc_curve(): area under the ROC curve and the curve itself
  • Precision-Recall curves: particularly important for imbalanced outcomes
  • Lift charts: clinical gain over random selection
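To make the AUC concrete, it can be computed by hand as the probability that a randomly chosen event receives a higher predicted risk than a randomly chosen non-event. The base R sketch below uses the rank-sum form of that definition; the function name `auc_rank()` and the six toy predictions are illustrative.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(prob, is_event) {
  r <- rank(prob)                 # average ranks handle ties
  n_pos <- sum(is_event)
  n_neg <- sum(!is_event)
  (sum(r[is_event]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predictions: 3 events, 3 non-events
prob <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_event <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 8 of the 9 event/non-event pairs are ordered correctly, so AUC = 8/9
auc_rank(prob, is_event)
```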

4.2 Calibration

Assessment of whether predicted probabilities match observed frequencies:

  • brier_class() — Brier score for probability accuracy
  • Calibration plots (decile-based) — predicted vs. observed event rates
  • Hosmer-Lemeshow goodness-of-fit
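The first two calibration checks can be sketched in base R as follows. The helper names `brier_score()` and `calibration_table()` are hypothetical, and the simulated predictions are calibrated by construction, so predicted and observed rates should roughly agree in each decile.

```r
# Brier score: mean squared difference between predicted risk and outcome (0/1)
brier_score <- function(prob, is_event) mean((prob - as.numeric(is_event))^2)

# Decile-based calibration table: mean predicted risk vs observed event rate
calibration_table <- function(prob, is_event, n_bins = 10) {
  breaks <- unique(quantile(prob, probs = seq(0, 1, length.out = n_bins + 1)))
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    mean_predicted = as.numeric(tapply(prob, bin, mean)),
    observed_rate  = as.numeric(tapply(is_event, bin, mean)),
    n              = as.numeric(table(bin))
  )
}

set.seed(7)
prob <- runif(500)
is_event <- runif(500) < prob   # hypothetical outcomes, calibrated by design

brier_score(prob, is_event)
cal <- calibration_table(prob, is_event)
cal
```

Plotting `observed_rate` against `mean_predicted` with a 45-degree reference line gives the usual calibration plot.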

4.3 Clinical Utility

Translation of model predictions into clinical decision-making value:

  • dcurves::dca() — Decision Curve Analysis
  • Net Benefit across threshold probabilities
  • Clinical Impact Curves — number of patients classified as high-risk vs. true positives at each threshold
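Net benefit at a threshold probability p_t is commonly defined as TP/n − (FP/n) × p_t / (1 − p_t). The base R sketch below computes it directly for a model and for the "treat all" reference strategy; the function name `net_benefit()` and the toy data are illustrative and independent of the dcurves implementation.

```r
# Net benefit of treating everyone whose predicted risk is at or above threshold
net_benefit <- function(prob, is_event, threshold) {
  n <- length(prob)
  high_risk <- prob >= threshold
  tp <- sum(high_risk & is_event)
  fp <- sum(high_risk & !is_event)
  tp / n - fp / n * threshold / (1 - threshold)
}

# Hypothetical predictions: 3 events, 3 non-events
prob <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_event <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
thresholds <- c(0.1, 0.3, 0.5)

model_nb <- sapply(thresholds, function(t) net_benefit(prob, is_event, t))
# "Treat all" reference: everyone is classified as high-risk
treat_all_nb <- sapply(
  thresholds,
  function(t) net_benefit(rep(1, length(prob)), is_event, t)
)

data.frame(threshold = thresholds, model = model_nb, treat_all = treat_all_nb)
```

A model is clinically useful at a given threshold only if its net benefit exceeds both "treat all" and "treat none" (net benefit 0).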

4.4 Interpretability

Understanding which features drive predictions:

  • vip::vi() — variable importance (permutation-based and model-specific)
  • tidy() — model coefficients for penalised regression
  • SHAP values (via shapviz) — local and global feature contributions
  • Partial dependence plots
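The idea behind permutation-based importance can be shown without any modelling package: shuffle one feature, re-score the model, and record how much the loss worsens. The sketch below uses a plain logistic regression and the Brier score on simulated data; all names are illustrative and this is not the vip implementation.

```r
set.seed(11)
# Hypothetical data: one informative feature, one pure-noise feature
n <- 400
dat <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x_signal))

fit <- glm(y ~ x_signal + x_noise, data = dat, family = binomial())

# Loss used for scoring: Brier score of predicted probabilities
brier <- function(model, data) {
  mean((predict(model, data, type = "response") - data$y)^2)
}

# Importance = average increase in loss after shuffling one feature
permutation_importance <- function(model, data, features, n_repeats = 20) {
  baseline <- brier(model, data)
  sapply(features, function(f) {
    mean(replicate(n_repeats, {
      shuffled <- data
      shuffled[[f]] <- sample(shuffled[[f]])
      brier(model, shuffled) - baseline
    }))
  })
}

importance <- permutation_importance(fit, dat, c("x_signal", "x_noise"))
importance
```

For an honest estimate the shuffling should be done on held-out data rather than on the training set used here for brevity.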

5 Network Analysis Module

Complementing the predictive modelling framework, the pipeline includes a dedicated network analysis module for co-expression analysis of proteomic and metabolomic data:

  • WGCNA — Weighted Gene Co-expression Network Analysis adapted for protein/metabolite data
  • Phase-specific topology — network construction at each surgical phase to capture dynamic rewiring
  • Hub identification — degree centrality, betweenness, hub scores for identifying key molecular players
  • Module-trait correlations — association of co-expression modules with clinical outcomes
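A WGCNA-style weighted network and a simple hub score can be sketched in base R. The example raises absolute correlations to a soft-thresholding power and sums each node's edge weights; the power `beta = 6`, the simulated abundance matrix, and the variable names are assumptions for illustration, and the WGCNA package itself is not used.

```r
set.seed(3)
# Hypothetical abundance matrix: rows are samples, columns are proteins.
# "hub", "m1" and "m2" share a latent signal; "iso1" and "iso2" are unrelated.
n_samples <- 200
latent <- rnorm(n_samples)
expr <- cbind(
  hub  = latent + rnorm(n_samples, sd = 0.3),
  m1   = latent + rnorm(n_samples, sd = 0.8),
  m2   = latent + rnorm(n_samples, sd = 0.8),
  iso1 = rnorm(n_samples),
  iso2 = rnorm(n_samples)
)

# Weighted adjacency: |correlation| raised to a soft-thresholding power
beta <- 6   # assumed value; normally chosen from a scale-free fit
adjacency <- abs(cor(expr))^beta
diag(adjacency) <- 0

# Weighted degree (connectivity) as a simple hub score
connectivity <- sort(rowSums(adjacency), decreasing = TRUE)
connectivity
```

Repeating this per surgical phase and comparing connectivity rankings is one simple way to look at phase-specific rewiring.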

This network module is the analytical engine behind the proteomics and metabolomics results pages.

6 Reusable Modules

The pipeline has been modularised into 22 reusable data analysis modules covering:

  • Preprocessing and quality control
  • Feature selection (stability selection, Boruta, recursive feature elimination)
  • Model specification and tuning
  • Evaluation and reporting
  • Network construction and community detection
  • Visualisation (volcano plots, heatmaps, PCA biplots, calibration plots)

These modules are currently deployed across active projects at the Cardiovascular Research Centre, including CardioNIR, BEACON, CARDIA, and IBERIA.