--- title: "Vignette 3: Alternatives approaches" output: rmarkdown::html_vignette: css: github-markdown.css toc: true number_sections: true vignette: > %\VignetteIndexEntry{Vignette 3: Alternatives approaches} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In this vignette we consider a simple scenario with a continuous dependent variable and a set of continuous predictors. First, we load the required packages and store the **example dataset** `GSPCRexdata` (see the helpfile for details `?GSPCRexdata`) in two separate objects: ```r # Load R packages library(gspcr) # this package! library(superpc) # alternative comparison package library(patchwork) # combining ggplots # Load data X <- GSPCRexdata$X$cont y <- GSPCRexdata$y$cont # Define names of measures to compare fit_measure_vec <- c("LRT", "PR2", "MSE", "F", "AIC", "BIC") ``` # Compare results with `superpc` This package extends the `superpc` R package by introducing more fit measures and threshold types, which allows for the consideration of a wider range of variable types. Here we want to show that the **results** obtained by `superpc` can be **replicated** by `gspcr` when using the same fit measure and threshold type. To use `superpc` we need to prepare the data in a different format. ```r # Prepare data for superpc data.train <- list( x = t(as.matrix(scale(X))), y = y, featurenames = colnames(X) ) ``` We can then train the model with the `superpc::superpc.train()` function and observe the solution paths. ```r # Train the model (computes the scores for each feature) train.obj <- superpc.train( data = data.train, type = "regression" ) # Cross-validate the model cv.obj <- superpc.cv( fit = train.obj, data = data.train, min.features = 1, max.features = nrow(data.train$x), n.fold = 10, n.threshold = 20, n.components = 5 ) # Cross-validation solution paths cv.obj_plot <- superpc.plotcv(cv.obj) ```

Figure 1: Solution paths obtained with `superpc`.

Finally, we can specify and train `gspcr` in the same way and compare the cross-validation solution paths. ```r # Train gspcr with the same specification as superpc gspcr_superpc <- cv_gspcr( dv = y, ivs = X, fit_measure = "F", thrs = "normalized", nthrs = 20, npcs_range = 1:5, K = 10, min_features = 1, max_features = ncol(X) ) # Create plot the cross-validation curves plot( gspcr_superpc, errorBars = TRUE, discretize = FALSE, ) ```

Figure 2: Solution paths obtained with `gspcr`.

We can also see that the thresholds values computed are exactly the same: ```r # Report the threshold values data.frame( superpc = round(cv.obj$thresholds, 3), gpscr = round(gspcr_superpc$thr, 3), diff = round(cv.obj$thresholds - gspcr_superpc$thr, 3) ) ``` ``` superpc gpscr diff 1 0.051 0.051 0 2 0.398 0.398 0 3 0.745 0.745 0 4 1.093 1.093 0 5 1.440 1.440 0 6 1.787 1.787 0 7 2.135 2.135 0 8 2.482 2.482 0 9 2.829 2.829 0 10 3.177 3.177 0 11 3.524 3.524 0 12 3.871 3.871 0 13 4.219 4.219 0 14 4.566 4.566 0 15 4.913 4.913 0 16 5.260 5.260 0 17 5.608 5.608 0 18 5.955 5.955 0 19 6.302 6.302 0 20 6.650 6.650 0 ``` # Is K-fold cross-validation working? In this section, I want to showcase the effectiveness of using cross-validation to **select the number of PCs** for GSPCR. We can estimate the same solution paths for GSPCR with and without using K-fold cross-validation by changing the number of folds used. If we set `K = 1` in the `cv_gspcr()` call, the data is assigned a single fold, and the fit measures are evaluated on the same data the model is trained on. First, let's estimate the solution paths with K-fold cross-validation. To keep the comparison simple, we specify two options for the number of PCs (5 and 20), but any other set of values could be used. We train the model with the six available fit measures stored in the object `fit_measure_vec`. ```r # Train the GSPCR model with two number of PCs options out_fit_meas_cv <- lapply(fit_measure_vec, function(i) { cv_gspcr( dv = y, ivs = X, K = 10, npcs_range = c(5, 20), fit_measure = i, thrs = "normalized" ) }) # Plot solution paths plots <- lapply(seq_along(fit_measure_vec), function(i) { # Reverse y? rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i]) # Make plots plot( x = out_fit_meas_cv[[i]], y = fit_measure_vec[[i]], labels = TRUE, y_reverse = rev, errorBars = FALSE, discretize = FALSE, print = FALSE ) }) # Patchwork ggplots (plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]]) ```

Figure 3: Solution paths for different fit measures using K-fold CV.

Then, we repeat the same procedure but we set `K = 1`, to avoid K-fold cross-validation. ```r # Train the GSPCR model with many number of components out_fit_meas_no_CV <- lapply(fit_measure_vec, function(i) { cv_gspcr( dv = y, ivs = X, K = 1, npcs_range = c(5, 20), fit_measure = i, thrs = "normalized" ) }) # Plot them plots <- lapply(seq_along(fit_measure_vec), function(i) { # Reverse y? rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i]) # Make plots plot( x = out_fit_meas_no_CV[[i]], y = fit_measure_vec[[i]], labels = TRUE, y_reverse = rev, errorBars = TRUE, discretize = FALSE, print = FALSE ) }) # Patchwork ggplots (plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]]) ```

Figure 4: Solution paths for different fit measures without using K-fold CV.

You can already see from the plots that when using LRT, MSE, and PR2 as fit measures, without cross-validation we would end up selecting the highest number of PCs provided (20 in this case). However, when using K-fold cross-validation, the solution paths would lead us to choose 5 PCs instead. We can also look at the solutions tables to confirm our read of the plots. The solution we would have found using K-fold CV is: ```r # Standard solutions res_CV <- sapply( 1:length(out_fit_meas_cv), function(meth) { as.numeric(out_fit_meas_cv[[meth]]$sol_table["standard", ]) } ) # Give meaningful names dimnames(res_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec) # Print rounded results round(t(res_CV), 3) ``` ``` thr_value thr_number Q LRT 1.517 3 5 PR2 1.517 3 5 MSE 1.517 3 5 F 1.517 3 5 AIC 1.517 3 5 BIC 0.784 2 5 ``` The solutions we would have obtained without using K-fold CV are: ```r # Standard solutions res_no_CV <- sapply( 1:length(out_fit_meas_no_CV), function(meth) { as.numeric(out_fit_meas_no_CV[[meth]]$sol_table["standard", ]) } ) # Give meaningful names dimnames(res_no_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec) # Print rounded results round(t(res_no_CV), 3) ``` ``` thr_value thr_number Q LRT 1.517 3 20 PR2 1.517 3 20 MSE 1.517 3 20 F 1.517 3 5 AIC 1.517 3 5 BIC 1.517 3 5 ``` As you can see, using CV we find the same solution no matter what the outcome measure, while without using CV, only the AIC, BIC, and F are able to select a low number of PCs. # 1SE solutions The results for all fit measures except `F` struggle with accounting for measure complexity. The use of a simple 1-standard-error rule helps obviate this problem. First, fit the models with many possible number of components: ```r # Train the GSPCR model with many number of components out_fit_meas <- lapply(fit_measure_vec, function(i) { cv_gspcr( dv = y, ivs = X, fam = "gaussian", nthrs = 10, npcs_range = 1:10, K = 10, fit_measure = i, thrs = "normalized", min_features = 1, max_features = ncol(X), oneSE = TRUE ) }) # Plot them plots <- lapply(seq_along(fit_measure_vec), function(i) { # Reverse y? rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i]) # Make plots plot( x = out_fit_meas[[i]], y = fit_measure_vec[[i]], labels = TRUE, y_reverse = rev, errorBars = TRUE, discretize = FALSE, print = FALSE ) }) # Patchwork ggplots (plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]]) ```

Figure 5: Solution paths for different fit measures when using the 1-standard-error rule.

Then, extract the solutions obtained by each: ```r # Standard solutions res <- sapply( 1:length(out_fit_meas), function(meth) { as.numeric(out_fit_meas[[meth]]$sol_table["standard", ]) } ) # Give meaningful names dimnames(res) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec) # Print rounded results round(t(res), 3) ``` ``` thr_value thr_number Q LRT 0.784 2 4 PR2 0.784 2 4 MSE 1.517 3 5 F 2.250 4 1 AIC 1.517 3 3 BIC 2.250 4 1 ``` Finally, you can check which solutions would be chosen by using the 1-standard-error rule: ```r # 1se solutions res_1se <- sapply( 1:length(out_fit_meas), function(meth) { as.numeric(out_fit_meas[[meth]]$sol_table["oneSE", ]) } ) # Give meaningful names dimnames(res_1se) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec) # Print rounded results round(t(res_1se), 3) ``` ``` thr_value thr_number Q LRT 1.517 3 3 PR2 1.517 3 3 MSE 1.517 3 3 F 1.517 3 1 AIC 2.250 4 1 BIC 2.250 4 2 ``` # Alternatives to CV To speed up the model-fitting process, it can be a good idea to find model-building strategies that are less time-consuming than CV. You can use the BIC fit measure without CV to select the appropriate threshold value and the number of components. To do so, you can specify the number of folds to 1 and the fit measure to BIC or AIC. ```r # Define vector of measures to be used fit_measure_vec <- c("LRT", "AIC", "BIC") # Train the GSPCR model with the different values out_fit_meas <- lapply(fit_measure_vec, function(i) { cv_gspcr( dv = y, ivs = X, fam = "gaussian", nthrs = 10, npcs_range = c(1, 2, 5, 20), K = 1, fit_measure = i, thrs = "normalized", min_features = 1, max_features = ncol(X), oneSE = TRUE ) }) # Plot them plots <- lapply(seq_along(fit_measure_vec), function(i) { # Reverse y? rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i]) # Make plots plot( x = out_fit_meas[[i]], y = fit_measure_vec[[i]], labels = TRUE, y_reverse = rev, errorBars = FALSE, discretize = FALSE, print = FALSE ) }) # Patchwork ggplots plots[[1]] + plots[[2]] + plots[[3]] ```

Figure 6: Solution paths obtained using fit measures that do not need K-fold CV.

You can also look at the solutions: ```r # Put solutions together rbind( LRT = out_fit_meas[[1]]$sol_table["standard", ], AIC = out_fit_meas[[2]]$sol_table["standard", ], BIC = out_fit_meas[[3]]$sol_table["standard", ] ) ``` ``` thr_value thr_number Q LRT 1.517276 3 20 AIC 1.517276 3 5 BIC 1.517276 3 5 ```