In this document, we walk through an example application in R using simulated data, covering data generation, data splitting, model fitting, and model evaluation.
For the simulated data, we generate 15 normally distributed predictors, of which 5 are moderately strongly, 5 weakly, and 5 very weakly related to a continuous outcome. The signal-to-noise ratio is set via a target \(R^2\) value. Each group of 5 predictors represents a domain. The goal of the later modeling process is to yield a meta-model that selects only the first and second domains and assigns a coefficient of zero to the third domain.
# Function to generate data
set.seed(123)
generate_example_data <- function(
  n = 500, # sample size
  n_predictors = 15, # number of predictors
  n_strong = 5, # number of strong predictors
  n_weak = 5, # number of weak predictors
  n_very_weak = 5, # number of very weak predictors
  strong_effect = 0.4, # effect size for strong predictors
  weak_effect = 0.1, # effect size for weak predictors
  very_weak_effect = 0.01, # effect size for very weak predictors
  r2 = 0.2 # desired R squared
) {
  # Generate a matrix of standard normal predictors
  X <- matrix(rnorm(n * n_predictors), nrow = n, ncol = n_predictors)
  colnames(X) <- paste0("X", 1:n_predictors)
  # Define a vector with the true coefficients
  beta <- c(
    rep(strong_effect, n_strong),
    rep(weak_effect, n_weak),
    rep(very_weak_effect, n_very_weak)
  )
  # Define the signal component
  eta <- X %*% beta
  var_eta <- var(as.vector(eta))
  # Calculate the noise variance from the target R^2,
  # using R^2 = var(eta) / (var(eta) + noise_var)
  noise_var <- var_eta * (1 - r2) / r2
  noise_sd <- sqrt(noise_var)
  # Generate the outcome variable
  y <- eta + rnorm(n, mean = 0, sd = noise_sd)
  # Return a data frame with outcome and predictors
  data <- data.frame(y = as.numeric(y), X)
  return(data)
}
# Generate example data
data <- generate_example_data(
  n = 1000,
  r2 = 0.25,
  strong_effect = 0.4,
  weak_effect = 0.15,
  very_weak_effect = 0.01
)
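As a quick sanity check (a sketch using the `data` object generated above), the empirical \(R^2\) of a linear model that uses all predictors should be close to the target value of 0.25:

```r
# Fit a linear model with all predictors and inspect the empirical R^2;
# with n = 1000, it should lie close to the target value of 0.25
full_fit <- lm(y ~ ., data = data)
summary(full_fit)$r.squared
```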
In the next step, we divide the data into a training set for model fitting and a test set for model evaluation.
# Number of folds for the train/test split
n_folds <- 10
# Define n from the generated data
n <- nrow(data)
# Create a vector of fold labels (approximately balanced)
fold_labels <- rep(1:n_folds, length.out = n)
# Randomly shuffle the fold assignments
fold_assignments <- sample(fold_labels)
# Randomly choose one fold to act as the test set
test_fold <- sample(1:n_folds, 1)
# Indices for training and test sets
train_indices <- which(fold_assignments != test_fold)
test_indices <- which(fold_assignments == test_fold)
# Subset the data
train <- data[train_indices, ]
test <- data[test_indices, ]
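Before moving on, it can be useful to verify the split (a quick check on the objects created above): the folds should be approximately balanced, and the training and test sets together should contain all observations.

```r
# Fold sizes (approximately n / n_folds observations each)
table(fold_assignments)
# Training and test set sizes should sum to the full sample size
c(train = nrow(train), test = nrow(test), total = nrow(data))
```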
In the next step, we use the MVS() function from the mvs package. To fit the model, we use the training data and provide the domain labels that indicate which predictor belongs to which domain. We also specify that ridge regression is used at the base level and the lasso at the meta level.
# Install and load mvs package
# install.packages("mvs")
library(mvs)
# Create a vector that indicates which predictors belong to which domain
domain_labels <- rep(1:3, each = 5)
# Define predictors and outcome explicitly
x_train <- train[, -1] # Predictors: all columns except the first
y_train <- train[, 1] # Outcome: first column
# Define x_train as matrix object
x_train <- as.matrix(x_train)
# Fit the MVS model
fit <- MVS(
  x = x_train,
  y = y_train,
  views = domain_labels,
  family = "gaussian",
  alphas = c(0, 1), # penalization for base and meta models: 0 = ridge, 1 = lasso
  std.meta = TRUE,
  progress = FALSE # turn off the progress bar
)
The fitted MVS model contains information about both the base models (level 1) and the meta-model (level 2). The base-model information can be accessed with fit$`Level 1`, and the fitted base models themselves with fit$`Level 1`$models. The meta-model information can be accessed analogously with fit$`Level 2`.
To extract coefficients, call coef() on the fitted object and subset the result with $`Level 1` or $`Level 2` to retrieve the coefficients for the corresponding level.
For example, the following code retrieves the coefficients of the base models:
coef(fit)$`Level 1`
Once the model is fitted, we can use the test set to evaluate the base models and the meta-model by computing the prediction error in the form of the mean squared error (MSE). Furthermore, we compute the prediction error of a reference model to compare the MSE values against. The reference model uses only the mean of the outcome in the training set to make predictions.
# Define predictors and outcome variable in the test data
x_test <- test[,-1]
y_test <- test[,1]
# Define x_test as matrix object
x_test <- as.matrix(x_test)
# Indices for each view based on the domain labels
base_indexes <- split(seq_len(ncol(x_test)), domain_labels)
# Base models: MSE per domain
base_MSEs <- list()
for (j in seq_along(base_indexes)) {
  # `Level 1` refers to the base models
  base_pred <- predict(fit$`Level 1`$models[[j]], x_test[, base_indexes[[j]]], s = "lambda.min")
  base_MSEs[[j]] <- mean((y_test - base_pred)^2)
}
# Reference model evaluation: MSE value when using an intercept-only model
reference_model <- lm(y ~ 1, data = data.frame(y = y_train))
reference_pred <- predict(reference_model, newdata = data.frame(y = y_test))
reference_MSE <- mean((y_test - reference_pred)^2)
# Meta-model evaluation: MSE value for predicting the test outcome using the meta-model coefficients
meta_pred <- predict(fit, x_test, std.meta = TRUE)
meta_MSE <- mean((y_test - meta_pred)^2)
To compare models, we aggregate the MSE values in a table and visualize the comparison as a scatter plot.
# Comparison table
mse_table <- data.frame(
Model = c("Base model 1", "Base model 2", "Base model 3", "Reference model", "Meta-model"),
MSE = round(c(unlist(base_MSEs), reference_MSE, meta_MSE), 3)
)
print(mse_table)
## Model MSE
## 1 Base model 1 2.472
## 2 Base model 2 2.757
## 3 Base model 3 2.877
## 4 Reference model 2.877
## 5 Meta-model 2.367
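To put these values in perspective (a small sketch based on the `mse_table` object above), the MSEs can be re-expressed as the proportional reduction in prediction error relative to the intercept-only reference model:

```r
# Proportional reduction in MSE relative to the reference model;
# values near zero indicate no improvement over predicting the mean
reference_mse <- mse_table$MSE[mse_table$Model == "Reference model"]
data.frame(
  Model = mse_table$Model,
  Reduction = round(1 - mse_table$MSE / reference_mse, 3)
)
```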
# Plot
library(ggplot2)
library(RColorBrewer)
# Ensure Model is a factor
mse_table$Model <- factor(mse_table$Model, levels = mse_table$Model)
# Set limits for y axis
max_mse <- max(mse_table$MSE)
min_mse <- min(mse_table$MSE)
upper_limit <- max_mse * 1.20 # 20% higher than max MSE
lower_limit <- min_mse * 0.80 # 20% lower than min MSE
# Scatterplot
ggplot(mse_table, aes(x = Model, y = MSE)) +
geom_point(
fill = "lightblue",
shape = 21, # circle with border
color = "black", # border color
size = 2,
stroke = 0.3 # thickness of the border
) +
geom_text(aes(label = round(MSE, 3)), vjust = -1, size = 3.5) +
scale_y_continuous(limits = c(lower_limit, upper_limit)) +
labs(
title = "Prediction Error (MSE) by Model",
x = "Model",
y = "MSE"
) +
theme_classic(base_size = 13) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
The plot illustrates the performance differences between the models. The first base model achieved the lowest prediction error among the base models, which corresponds to its stronger signal contribution in the data-generating process. The meta-model further improves performance by integrating information across the domains and combining the predictive strength of the selected base models.
Lastly, we are interested in how the meta-model selects among the domains and the relative magnitude of their contributions. For this, we inspect the meta-model at level 2 of the fitted MVS object.
The argument cvlambda determines which of the cross-validated penalty parameters is used: lambda.min corresponds to the penalty value that achieves the lowest cross-validated MSE, whereas lambda.1se corresponds to the largest penalty value within one standard error of lambda.min. The latter is often close in prediction performance but tends to produce sparser models, i.e., a more conservative selection of predictors (or, in this case, domains).
coef(fit, cvlambda = "lambda.min")$`Level 2`
## [[1]]
## 4 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 0.01456725
## V1 1.00732970
## V2 0.95665378
## V3 .
The resulting meta-model successfully selected only the first two domains.
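For comparison, the sparser lambda.1se solution can be inspected in the same way (a sketch; the exact coefficients will depend on the cross-validation run):

```r
# Meta-model coefficients under the more conservative penalty choice
coef(fit, cvlambda = "lambda.1se")$`Level 2`
```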