---
title: "Vignette 3: Alternatives approaches"
output:
rmarkdown::html_vignette:
css: github-markdown.css
toc: true
number_sections: true
vignette: >
%\VignetteIndexEntry{Vignette 3: Alternatives approaches}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
In this vignette we consider a simple scenario with a continuous dependent variable and a set of continuous predictors. First, we load the required packages and store the **example dataset** `GSPCRexdata` (see the helpfile for details `?GSPCRexdata`) in two separate objects:
```r
# Load R packages
library(gspcr) # this package!
library(superpc) # alternative comparison package
library(patchwork) # combining ggplots
# Load data
X <- GSPCRexdata$X$cont
y <- GSPCRexdata$y$cont
# Define names of measures to compare
fit_measure_vec <- c("LRT", "PR2", "MSE", "F", "AIC", "BIC")
```
# Compare results with `superpc`
This package extends the `superpc` R package by introducing more fit measures and threshold types, which allows for the consideration of a wider range of variable types.
Here we want to show that the **results** obtained by `superpc` can be **replicated** by `gspcr` when using the same fit measure and threshold type.
To use `superpc` we need to prepare the data in a different format.
```r
# Prepare data for superpc
data.train <- list(
x = t(as.matrix(scale(X))),
y = y,
featurenames = colnames(X)
)
```
We can then train the model with the `superpc::superpc.train()` function and observe the solution paths.
```r
# Train the model (computes the scores for each feature)
train.obj <- superpc.train(
data = data.train,
type = "regression"
)
# Cross-validate the model
cv.obj <- superpc.cv(
fit = train.obj,
data = data.train,
min.features = 1,
max.features = nrow(data.train$x),
n.fold = 10,
n.threshold = 20,
n.components = 5
)
# Cross-validation solution paths
cv.obj_plot <- superpc.plotcv(cv.obj)
```
Finally, we can specify and train `gspcr` in the same way and compare the cross-validation solution paths.
```r
# Train gspcr with the same specification as superpc
gspcr_superpc <- cv_gspcr(
dv = y,
ivs = X,
fit_measure = "F",
thrs = "normalized",
nthrs = 20,
npcs_range = 1:5,
K = 10,
min_features = 1,
max_features = ncol(X)
)
# Create plot the cross-validation curves
plot(
gspcr_superpc,
errorBars = TRUE,
discretize = FALSE,
)
```
We can also see that the thresholds values computed are exactly the same:
```r
# Report the threshold values
data.frame(
superpc = round(cv.obj$thresholds, 3),
gpscr = round(gspcr_superpc$thr, 3),
diff = round(cv.obj$thresholds - gspcr_superpc$thr, 3)
)
```
```
superpc gpscr diff
1 0.051 0.051 0
2 0.398 0.398 0
3 0.745 0.745 0
4 1.093 1.093 0
5 1.440 1.440 0
6 1.787 1.787 0
7 2.135 2.135 0
8 2.482 2.482 0
9 2.829 2.829 0
10 3.177 3.177 0
11 3.524 3.524 0
12 3.871 3.871 0
13 4.219 4.219 0
14 4.566 4.566 0
15 4.913 4.913 0
16 5.260 5.260 0
17 5.608 5.608 0
18 5.955 5.955 0
19 6.302 6.302 0
20 6.650 6.650 0
```
# Is K-fold cross-validation working?
In this section, I want to showcase the effectiveness of using cross-validation to **select the number of PCs** for GSPCR.
We can estimate the same solution paths for GSPCR with and without using K-fold cross-validation by changing the number of folds used.
If we set `K = 1` in the `cv_gspcr()` call, the data is assigned a single fold, and the fit measures are evaluated on the same data the model is trained on.
First, let's estimate the solution paths with K-fold cross-validation.
To keep the comparison simple, we specify two options for the number of PCs (5 and 20), but any other set of values could be used.
We train the model with the six available fit measures stored in the object `fit_measure_vec`.
```r
# Train the GSPCR model with two number of PCs options
out_fit_meas_cv <- lapply(fit_measure_vec, function(i) {
cv_gspcr(
dv = y,
ivs = X,
K = 10,
npcs_range = c(5, 20),
fit_measure = i,
thrs = "normalized"
)
})
# Plot solution paths
plots <- lapply(seq_along(fit_measure_vec), function(i) {
# Reverse y?
rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])
# Make plots
plot(
x = out_fit_meas_cv[[i]],
y = fit_measure_vec[[i]],
labels = TRUE,
y_reverse = rev,
errorBars = FALSE,
discretize = FALSE,
print = FALSE
)
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
```
Then, we repeat the same procedure but we set `K = 1`, to avoid K-fold cross-validation.
```r
# Train the GSPCR model with many number of components
out_fit_meas_no_CV <- lapply(fit_measure_vec, function(i) {
cv_gspcr(
dv = y,
ivs = X,
K = 1,
npcs_range = c(5, 20),
fit_measure = i,
thrs = "normalized"
)
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
# Reverse y?
rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])
# Make plots
plot(
x = out_fit_meas_no_CV[[i]],
y = fit_measure_vec[[i]],
labels = TRUE,
y_reverse = rev,
errorBars = TRUE,
discretize = FALSE,
print = FALSE
)
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
```
You can already see from the plots that when using LRT, MSE, and PR2 as fit measures, without cross-validation we would end up selecting the highest number of PCs provided (20 in this case).
However, when using K-fold cross-validation, the solution paths would lead us to choose 5 PCs instead.
We can also look at the solutions tables to confirm our read of the plots.
The solution we would have found using K-fold CV is:
```r
# Standard solutions
res_CV <- sapply(
1:length(out_fit_meas_cv),
function(meth) {
as.numeric(out_fit_meas_cv[[meth]]$sol_table["standard", ])
}
)
# Give meaningful names
dimnames(res_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)
# Print rounded results
round(t(res_CV), 3)
```
```
thr_value thr_number Q
LRT 1.517 3 5
PR2 1.517 3 5
MSE 1.517 3 5
F 1.517 3 5
AIC 1.517 3 5
BIC 0.784 2 5
```
The solutions we would have obtained without using K-fold CV are:
```r
# Standard solutions
res_no_CV <- sapply(
1:length(out_fit_meas_no_CV),
function(meth) {
as.numeric(out_fit_meas_no_CV[[meth]]$sol_table["standard", ])
}
)
# Give meaningful names
dimnames(res_no_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)
# Print rounded results
round(t(res_no_CV), 3)
```
```
thr_value thr_number Q
LRT 1.517 3 20
PR2 1.517 3 20
MSE 1.517 3 20
F 1.517 3 5
AIC 1.517 3 5
BIC 1.517 3 5
```
As you can see, using CV we find the same solution no matter what the outcome measure, while without using CV, only the AIC, BIC, and F are able to select a low number of PCs.
# 1SE solutions
The results for all fit measures except `F` struggle with accounting for measure complexity. The use of a simple 1-standard-error rule helps obviate this problem.
First, fit the models with many possible number of components:
```r
# Train the GSPCR model with many number of components
out_fit_meas <- lapply(fit_measure_vec, function(i) {
cv_gspcr(
dv = y,
ivs = X,
fam = "gaussian",
nthrs = 10,
npcs_range = 1:10,
K = 10,
fit_measure = i,
thrs = "normalized",
min_features = 1,
max_features = ncol(X),
oneSE = TRUE
)
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
# Reverse y?
rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])
# Make plots
plot(
x = out_fit_meas[[i]],
y = fit_measure_vec[[i]],
labels = TRUE,
y_reverse = rev,
errorBars = TRUE,
discretize = FALSE,
print = FALSE
)
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
```
Then, extract the solutions obtained by each:
```r
# Standard solutions
res <- sapply(
1:length(out_fit_meas),
function(meth) {
as.numeric(out_fit_meas[[meth]]$sol_table["standard", ])
}
)
# Give meaningful names
dimnames(res) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)
# Print rounded results
round(t(res), 3)
```
```
thr_value thr_number Q
LRT 0.784 2 4
PR2 0.784 2 4
MSE 1.517 3 5
F 2.250 4 1
AIC 1.517 3 3
BIC 2.250 4 1
```
Finally, you can check which solutions would be chosen by using the 1-standard-error rule:
```r
# 1se solutions
res_1se <- sapply(
1:length(out_fit_meas),
function(meth) {
as.numeric(out_fit_meas[[meth]]$sol_table["oneSE", ])
}
)
# Give meaningful names
dimnames(res_1se) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)
# Print rounded results
round(t(res_1se), 3)
```
```
thr_value thr_number Q
LRT 1.517 3 3
PR2 1.517 3 3
MSE 1.517 3 3
F 1.517 3 1
AIC 2.250 4 1
BIC 2.250 4 2
```
# Alternatives to CV
To speed up the model-fitting process, it can be a good idea to find model-building strategies that are less time-consuming than CV.
You can use the BIC fit measure without CV to select the appropriate threshold value and the number of components.
To do so, you can specify the number of folds to 1 and the fit measure to BIC or AIC.
```r
# Define vector of measures to be used
fit_measure_vec <- c("LRT", "AIC", "BIC")
# Train the GSPCR model with the different values
out_fit_meas <- lapply(fit_measure_vec, function(i) {
cv_gspcr(
dv = y,
ivs = X,
fam = "gaussian",
nthrs = 10,
npcs_range = c(1, 2, 5, 20),
K = 1,
fit_measure = i,
thrs = "normalized",
min_features = 1,
max_features = ncol(X),
oneSE = TRUE
)
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
# Reverse y?
rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])
# Make plots
plot(
x = out_fit_meas[[i]],
y = fit_measure_vec[[i]],
labels = TRUE,
y_reverse = rev,
errorBars = FALSE,
discretize = FALSE,
print = FALSE
)
})
# Patchwork ggplots
plots[[1]] + plots[[2]] + plots[[3]]
```
You can also look at the solutions:
```r
# Put solutions together
rbind(
LRT = out_fit_meas[[1]]$sol_table["standard", ],
AIC = out_fit_meas[[2]]$sol_table["standard", ],
BIC = out_fit_meas[[3]]$sol_table["standard", ]
)
```
```
thr_value thr_number Q
LRT 1.517276 3 20
AIC 1.517276 3 5
BIC 1.517276 3 5
```