In this document, we walk through an example application in R using simulated data, covering data generation, data splitting, model fitting, and model evaluation.
For the simulated data, we generate 15 normally distributed predictors, of which 5 are moderately strongly, 5 weakly, and 5 very weakly related to a continuous outcome. The signal-to-noise ratio is set via a target \(R^2\) value. Each group of 5 predictors represents a domain. The goal of the later modeling process is to yield a meta-model that selects only the first and second domains and assigns a coefficient of zero to the third domain.
# Function to generate data
set.seed(123)
generate_example_data <- function(
  n = 500, # sample size
  n_predictors = 15, # number of predictors
  n_strong = 5, # number of strong predictors
  n_weak = 5, # number of weak predictors
  n_very_weak = 5, # number of very weak predictors
  strong_effect = 0.4, # effect size for strong predictors
  weak_effect = 0.1, # effect size for weak predictors
  very_weak_effect = 0.01, # effect size for very weak predictors
  r2 = 0.2 # desired R squared
) {
  # Generate a matrix of standard normal predictors
  X <- matrix(rnorm(n * n_predictors), nrow = n, ncol = n_predictors)
  colnames(X) <- paste0("X", 1:n_predictors)
  # Define a vector with the true coefficients
  beta <- c(
    rep(strong_effect, n_strong),
    rep(weak_effect, n_weak),
    rep(very_weak_effect, n_very_weak)
  )
  # Define the signal component
  eta <- X %*% beta
  var_eta <- var(as.vector(eta))
  # Calculate the noise variance from the target R^2,
  # using R^2 = var(eta) / (var(eta) + noise_var)
  noise_var <- var_eta * (1 - r2) / r2
  noise_sd <- sqrt(noise_var)
  # Generate the outcome variable
  y <- eta + rnorm(n, mean = 0, sd = noise_sd)
  # Return a data frame with outcome and predictors
  data <- data.frame(y = as.numeric(y), X)
  return(data)
}
# Generate example data
data <- generate_example_data(
  n = 1000,
  r2 = 0.25,
  strong_effect = 0.4,
  weak_effect = 0.15,
  very_weak_effect = 0.01
)
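As a quick sanity check (a sketch using the `data` object generated above), the empirical \(R^2\) of a linear model that uses all predictors should be close to the target value of 0.25:

```r
# Fit a linear model with all predictors and inspect the empirical R^2;
# with n = 1000, it should lie close to the target value of 0.25
full_fit <- lm(y ~ ., data = data)
summary(full_fit)$r.squared
```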
In the next step, we divide the data into a training set for model fitting and a test set for model evaluation.
# Number of folds for the train/test split
n_folds <- 10
# Define n from the generated data
n <- nrow(data)
# Create a vector of fold labels (approximately balanced)
fold_labels <- rep(1:n_folds, length.out = n)
# Randomly shuffle the fold assignments
fold_assignments <- sample(fold_labels)
# Randomly choose one fold to act as the test set
test_fold <- sample(1:n_folds, 1)
# Indices for training and test sets
train_indices <- which(fold_assignments != test_fold)
test_indices <- which(fold_assignments == test_fold)
# Subset the data
train <- data[train_indices, ]
test <- data[test_indices, ]
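Before moving on, it can be useful to verify the split (a quick check on the objects created above): the folds should be approximately balanced, and the training and test sets together should contain all observations.

```r
# Fold sizes (approximately n / n_folds observations each)
table(fold_assignments)
# Training and test set sizes should sum to the full sample size
c(train = nrow(train), test = nrow(test), total = nrow(data))
```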
In the next step, we use the MVS() function from the mvs package. To fit the model, we use the training data and provide the domain labels that indicate which predictor belongs to which domain. We also specify that ridge regression is used at the base level and the lasso at the meta level.
# Install and load mvs package
# install.packages("mvs")
library(mvs)
# Create a vector that indicates which predictors belong to which domain
domain_labels <- rep(1:3, each = 5)
# Define predictors and outcome explicitly
x_train <- train[, -1] # Predictors: all columns except the first
y_train <- train[, 1] # Outcome: first column
# Define x_train as matrix object
x_train <- as.matrix(x_train)
# Fit the MVS model
fit <- MVS(
  x = x_train,
  y = y_train,
  views = domain_labels,
  family = "gaussian",
  alphas = c(0, 1), # penalization for base and meta models: 0 = ridge, 1 = lasso
  std.meta = TRUE,
  progress = FALSE # turn off the progress bar
)
The fitted MVS model contains information about both the base models (level 1) and the meta-model (level 2). The base-model information can be accessed with fit$`Level 1`, and the fitted base models themselves with fit$`Level 1`$models. The meta-model information can be accessed analogously with fit$`Level 2`.
To extract coefficients, call coef() on the fitted object and subset the result with $`Level 1` or $`Level 2` to retrieve the coefficients for the corresponding level.
For example, the following code retrieves the coefficients of the base models:
coef(fit)$`Level 1`
Once the model is fitted, we can use the test set to evaluate the base models and the meta-model by computing the prediction error in the form of the mean squared error (MSE). Furthermore, we compute the prediction error of a reference model to compare the MSE values against. The reference model uses only the mean of the outcome in the training set to make predictions.
# Define predictors and outcome variable in the test data
x_test <- test[,-1]
y_test <- test[,1]
# Define x_test as matrix object
x_test <- as.matrix(x_test)
# Indices for each view based on the domain labels
base_indexes <- split(seq_len(ncol(x_test)), domain_labels)
# Base models: MSE per domain
base_MSEs <- list()
for (j in seq_along(base_indexes)) {
  # `Level 1` refers to the base models
  base_pred <- predict(fit$`Level 1`$models[[j]], x_test[, base_indexes[[j]]], s = "lambda.min")
  base_MSEs[[j]] <- mean((y_test - base_pred)^2)
}
# Reference model evaluation: MSE value when using an intercept-only model
reference_model <- lm(y ~ 1, data = data.frame(y = y_train))
reference_pred <- predict(reference_model, newdata = data.frame(y = y_test))
reference_MSE <- mean((y_test - reference_pred)^2)
# Meta-model evaluation: MSE value for predicting the test outcome using the meta-model coefficients
meta_pred <- predict(fit, x_test, std.meta = TRUE)
meta_MSE <- mean((y_test - meta_pred)^2)
To compare models, we aggregate the MSE values in a table and visualize the comparison as a scatter plot.
# Comparison table
mse_table <- data.frame(
Model = c("Base model 1", "Base model 2", "Base model 3", "Reference model", "Meta-model"),
MSE = round(c(unlist(base_MSEs), reference_MSE, meta_MSE), 3)
)
print(mse_table)
## Model MSE
## 1 Base model 1 2.472
## 2 Base model 2 2.757
## 3 Base model 3 2.877
## 4 Reference model 2.877
## 5 Meta-model 2.367
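To put these values in perspective (a small sketch based on the `mse_table` object above), the MSEs can be re-expressed as the proportional reduction in prediction error relative to the intercept-only reference model:

```r
# Proportional reduction in MSE relative to the reference model;
# values near zero indicate no improvement over predicting the mean
reference_mse <- mse_table$MSE[mse_table$Model == "Reference model"]
data.frame(
  Model = mse_table$Model,
  Reduction = round(1 - mse_table$MSE / reference_mse, 3)
)
```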
# Plot
library(ggplot2)
library(RColorBrewer)
# Ensure Model is a factor
mse_table$Model <- factor(mse_table$Model, levels = mse_table$Model)
# Set limits for y axis
max_mse <- max(mse_table$MSE)
min_mse <- min(mse_table$MSE)
upper_limit <- max_mse * 1.20 # 20% higher than max MSE
lower_limit <- min_mse * 0.80 # 20% lower than min MSE
# Scatterplot
ggplot(mse_table, aes(x = Model, y = MSE)) +
geom_point(
fill = "lightblue",
shape = 21, # circle with border
color = "black", # border color
size = 2,
stroke = 0.3 # thickness of the border
) +
geom_text(aes(label = round(MSE, 3)), vjust = -1, size = 3.5) +
scale_y_continuous(limits = c(lower_limit, upper_limit)) +
labs(
title = "Prediction Error (MSE) by Model",
x = "Model",
y = "MSE"
) +
theme_classic(base_size = 13) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
The plot illustrates the performance differences between the models. The first base model achieved the lowest prediction error among the base models, which corresponds to its stronger signal contribution in the data-generating process. The meta-model further improves performance by integrating information across the domains and combining the predictive strength of the selected base models.
Lastly, we are interested in how the meta-model selects among the domains and the relative magnitude of their contributions. For this, we inspect the meta-model at level 2 of the fitted MVS object.
The argument cvlambda determines which of the cross-validated penalty parameters is used: lambda.min corresponds to the penalty value that achieves the lowest cross-validated MSE, whereas lambda.1se corresponds to the largest penalty value within one standard error of lambda.min. The latter is often close in prediction performance but tends to produce sparser models, i.e., a more conservative selection of predictors (or, in this case, domains).
coef(fit, cvlambda = "lambda.min")$`Level 2`
## [[1]]
## 4 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 0.01456725
## V1 1.00732970
## V2 0.95665378
## V3 .
The resulting meta-model successfully selected only the first two domains.
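For comparison, the sparser lambda.1se solution can be inspected in the same way (a sketch; the exact coefficients will depend on the cross-validation run):

```r
# Meta-model coefficients under the more conservative penalty choice
coef(fit, cvlambda = "lambda.1se")$`Level 2`
```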