Title: | Projection Predictive Feature Selection |
---|---|
Description: | Performs projection predictive feature selection for generalized linear models (Piironen, Paasiniemi, and Vehtari, 2020, <doi:10.1214/20-EJS1711>) with or without multilevel or additive terms (Catalina, Bürkner, and Vehtari, 2022, <https://proceedings.mlr.press/v151/catalina22a.html>), for some ordinal and nominal regression models (Weber and Vehtari, 2023, <arXiv:2301.01660>), and for many other regression models (using the latent projection by Catalina, Bürkner, and Vehtari, 2021, <arXiv:2109.04702>, which can also be applied to most of the former models). The package is compatible with the 'rstanarm' and 'brms' packages, but other reference models can also be used. See the vignettes and the documentation for more information and examples. |
Authors: | Juho Piironen [aut], Markus Paasiniemi [aut], Alejandro Catalina [aut], Frank Weber [cre, aut], Aki Vehtari [aut], Jonah Gabry [ctb], Marco Colombo [ctb], Paul-Christian Bürkner [ctb], Hamada S. Badr [ctb], Brian Sullivan [ctb], Sölvi Rögnvaldsson [ctb], The LME4 Authors [cph] (see file 'LICENSE' for details), Yann McLatchie [ctb], Juho Timonen [ctb] |
Maintainer: | Frank Weber <fweber144@protonmail.com> |
License: | GPL-3 | file LICENSE |
Version: | 2.5.0.9000 |
Built: | 2023-05-22 14:10:28 UTC |
Source: | https://github.com/stan-dev/projpred |
The R package projpred performs projection predictive variable (or "feature")
selection for various regression models. We recommend reading the README file
(available with enhanced formatting online) and the main vignette (topic =
"projpred", also available online) before continuing here.
Throughout the whole package documentation, we use the term "submodel" for
all kinds of candidate models onto which the reference model is projected.
For custom reference models, the candidate models don't need to be actual
submodels of the reference model, but in any case (even for custom
reference models), the candidate models are always actual submodels of the
full formula
used by the search procedure. In this regard, it is correct
to speak of submodels, even in case of a custom reference model.
The following model type abbreviations will be used at multiple places throughout the documentation: GLM (generalized linear model), GLMM (generalized linear multilevel—or "mixed"—model), GAM (generalized additive model), and GAMM (generalized additive multilevel—or "mixed"—model). Note that the term "generalized" includes the Gaussian family as well.
For the projection of the reference model onto a submodel, projpred currently relies on the following functions (in other words, these are the workhorse functions used by the default divergence minimizers):
Submodel without multilevel or additive terms:
- For the traditional (or latent) projection (or the augmented-data projection
  in case of the binomial() or brms::bernoulli() family): an internal C++
  function which basically serves the same purpose as lm() for the gaussian()
  family and glm() for all other families.
- For the augmented-data projection: MASS::polr() for the brms::cumulative()
  family or rstanarm::stan_polr() fits, nnet::multinom() for the
  brms::categorical() family.

Submodel with multilevel but no additive terms:
- For the traditional (or latent) projection (or the augmented-data projection
  in case of the binomial() or brms::bernoulli() family): lme4::lmer() for the
  gaussian() family, lme4::glmer() for all other families.
- For the augmented-data projection: ordinal::clmm() for the brms::cumulative()
  family, mclogit::mblogit() for the brms::categorical() family.

Submodel without multilevel but with additive terms: mgcv::gam().

Submodel with multilevel and additive terms: gamm4::gamm4().
Setting the global option projpred.extra_verbose to TRUE will print out which
submodel projpred is currently projecting onto as well as (if method =
"forward" and verbose = TRUE in varsel() or cv_varsel()) which submodel has
been selected at those steps of the forward search for which a percentage (of
the maximum submodel size that the search is run up to) is printed. In
general, however, we cannot recommend setting this global option to TRUE for
cv_varsel() with validate_search = TRUE (simply due to the amount of
information that will be printed, but also due to the progress bar, which
will no longer work as intended).
The projection of the reference model onto a submodel can be run on multiple
CPU cores in parallel (across the projected draws). This is powered by the
foreach package, so any parallel (or sequential) backend compatible with
foreach can be used, e.g., the backends from packages doParallel, doMPI, or
doFuture. Using the global option projpred.prll_prj_trigger, the number of
projected draws below which no parallelization is applied (even if a parallel
backend is registered) can be modified. Such a "trigger" threshold exists
because of the computational overhead of parallelization, which makes
parallelization useful only for a sufficiently large number of projected
draws. By default, parallelization is turned off, which can also be achieved
by supplying Inf (or NULL) to option projpred.prll_prj_trigger. Note that we
cannot recommend parallelizing the projection on Windows because in our
experience, the parallelization overhead is larger there, causing a parallel
run to take longer than a sequential run. Also note that the parallelization
works well for GLMs, but for all other models, the fitted model objects are
quite big, which—when running in parallel—may lead to excessive memory usage
and, in turn, may crash the R session. Thus, we currently cannot recommend
the parallelization for models other than GLMs.
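As a minimal sketch of enabling the parallelization (assuming the doParallel
package is installed; the threshold of 50 projected draws is an arbitrary
illustrative value):

library(doParallel)
registerDoParallel(cores = 2)
# Apply the parallelization only if at least 50 draws are projected
# (illustrative threshold; see the text above for the meaning of this option):
options(projpred.prll_prj_trigger = 50)
# ... then run project(), varsel(), or cv_varsel() as usual ...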
In case of multilevel models, projpred offers two global options for
"integrating out" group-level effects: projpred.mlvl_pred_new and
projpred.mlvl_proj_ref_new. When setting projpred.mlvl_pred_new to TRUE
(default is FALSE), then at prediction time, projpred will treat group levels
existing in the training data as new group levels, implying that their
group-level effects are drawn randomly from a (multivariate) Gaussian
distribution. This concerns both the reference model and the (i.e., any)
submodel. Furthermore, setting projpred.mlvl_pred_new to TRUE causes
as.matrix.projection() to omit the projected group-level effects (for the
group levels from the original dataset). When setting
projpred.mlvl_proj_ref_new to TRUE (default is FALSE), then at projection
time, the reference model's fitted values (that the submodels fit to) will be
computed by treating the group levels from the original dataset as new group
levels, implying that their group-level effects will be drawn randomly from a
(multivariate) Gaussian distribution (as long as the reference model is a
multilevel model, which—for custom reference models—does not need to be the
case). This also affects the latent response values for a latent projection
correspondingly. Setting projpred.mlvl_pred_new to TRUE makes sense, e.g.,
when the prediction task is such that any group level will be treated as a
new one. Typically, setting projpred.mlvl_proj_ref_new to TRUE only makes
sense when projpred.mlvl_pred_new is already set to TRUE. In that case, the
default of FALSE for projpred.mlvl_proj_ref_new ensures that at projection
time, the submodels fit to the best possible fitted values from the reference
model, and setting projpred.mlvl_proj_ref_new to TRUE would make sense if the
group-level effects should be integrated out completely.
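For illustration, a minimal sketch of setting these two global options
(option names as described above):

# Treat all group levels as new group levels at prediction time:
options(projpred.mlvl_pred_new = TRUE)
# Additionally integrate out the group-level effects at projection time:
options(projpred.mlvl_proj_ref_new = TRUE)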
Main functions:

- init_refmodel(), get_refmodel(): For setting up an object containing
  information about the reference model, the submodels, and how the projection
  should be carried out. Explicit calls to init_refmodel() and get_refmodel()
  are only rarely needed.
- varsel(), cv_varsel(): For running the search part and the evaluation part
  for a projection predictive variable selection, possibly with
  cross-validation (CV).
- summary.vsel(), print.vsel(), plot.vsel(), suggest_size.vsel(), ranking(),
  cv_proportions(), plot.cv_proportions(): For post-processing the results
  from varsel() and cv_varsel().
- project(): For projecting the reference model onto submodel(s). Typically,
  this follows the variable selection, but it can also be applied directly
  (without a variable selection).
- as.matrix.projection(): For extracting projected parameter draws.
- proj_linpred(), proj_predict(): For making predictions from a submodel
  (after projecting the reference model onto it).
Maintainer: Frank Weber fweber144@protonmail.com
Authors:
Juho Piironen juho.t.piironen@gmail.com
Markus Paasiniemi
Alejandro Catalina alecatfel@gmail.com
Aki Vehtari
Other contributors:
Jonah Gabry [contributor]
Marco Colombo [contributor]
Paul-Christian Bürkner [contributor]
Hamada S. Badr [contributor]
Brian Sullivan [contributor]
Sölvi Rögnvaldsson [contributor]
The LME4 Authors (see file 'LICENSE' for details) [copyright holder]
Yann McLatchie [contributor]
Juho Timonen [contributor]
Useful links:
Report bugs at https://github.com/stan-dev/projpred/issues/
This is the as.matrix() method for projection objects (returned by project(),
possibly as elements of a list). It extracts the projected parameter draws
and returns them as a matrix.
## S3 method for class 'projection'
as.matrix(x, nm_scheme = "auto", ...)
x |
An object of class |
nm_scheme |
The naming scheme for the columns of the output matrix.
Either |
... |
Currently ignored. |
In case of the augmented-data projection for a multilevel submodel of a
brms::categorical() reference model, the multilevel parameters (and therefore
also their names) slightly differ from those in the brms reference model fit
(see section "Augmented-data projection" in extend_family()'s documentation).

An S_prj x P matrix of projected draws, with S_prj denoting the number of
projected draws and P the number of parameters.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Projection onto an arbitrary combination of predictor terms (with a small
# value for `nclusters`, but only for the sake of speed in this example;
# this is not recommended in general):
prj <- project(fit, solution_terms = c("X1", "X3", "X5"), nclusters = 10,
seed = 9182)
prjmat <- as.matrix(prj)
### For further post-processing (e.g., via packages `bayesplot` and
### `posterior`), we will here ignore the fact that clustering was used
### (due to argument `nclusters` above). CAUTION: Ignoring the clustering
### is not recommended and only shown here for demonstrative purposes. A
### better solution for the clustering case is explained below.
# If the `bayesplot` package is installed, the output from
# as.matrix.projection() can be used there. For example:
if (requireNamespace("bayesplot", quietly = TRUE)) {
print(bayesplot::mcmc_intervals(prjmat))
}
# If the `posterior` package is installed, the output from
# as.matrix.projection() can be used there. For example:
if (requireNamespace("posterior", quietly = TRUE)) {
prjdrws <- posterior::as_draws_matrix(prjmat)
print(posterior::summarize_draws(
prjdrws,
"median", "mad", function(x) quantile(x, probs = c(0.025, 0.975))
))
}
### Better solution for post-processing clustered draws (e.g., via
### `bayesplot` or `posterior`): Don't ignore the fact that clustering was
### used. Instead, resample the clusters according to their weights (e.g.,
### via posterior::resample_draws()). However, this requires access to the
### cluster weights which is not implemented in `projpred` yet. This
### example will be extended as soon as those weights are accessible.
}
This is the function which has to be supplied to extend_family()'s argument
augdat_ilink in case of the augmented-data projection for the binomial()
family.
augdat_ilink_binom(eta_arr, link = "logit")
eta_arr |
An array as described in section "Augmented-data projection"
of |
link |
The same as argument |
An array as described in section "Augmented-data projection" of
extend_family()'s documentation.
This is the function which has to be supplied to extend_family()'s argument
augdat_link in case of the augmented-data projection for the binomial()
family.
augdat_link_binom(prb_arr, link = "logit")
prb_arr |
An array as described in section "Augmented-data projection"
of |
link |
The same as argument |
An array as described in section "Augmented-data projection" of
extend_family()'s documentation.
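A minimal sketch of how these two helpers might be supplied to
extend_family() (the response category labels c("0", "1") are an illustrative
assumption):

fam_augbinom <- extend_family(
  binomial(),
  augdat_y_unqs = c("0", "1"),      # assumed category labels
  augdat_link = augdat_link_binom,
  augdat_ilink = augdat_ilink_binom
)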
Sometimes there can be terms in a formula that refer to a matrix instead of a single predictor. This function breaks up the matrix term into individual predictors to handle separately, as that is probably the intention of the user.
break_up_matrix_term(formula, data)
formula |
A |
data |
The original |
A list containing the expanded formula and the expanded data.frame.
This function aggregates parameter draws that have been clustered into S_cl
clusters by averaging across the draws that belong to the same cluster. This
averaging can be done in a weighted fashion.
cl_agg(
draws,
cl = seq_len(nrow(draws)),
wdraws = rep(1, nrow(draws)),
eps_wdraws = 0
)
draws |
An |
cl |
A numeric vector of length |
wdraws |
A numeric vector of length |
eps_wdraws |
A positive numeric value (typically small) which will be
used to improve numerical stability: The weights of the draws within each
cluster are multiplied by |
An S_cl x P matrix of aggregated parameter draws (with S_cl denoting the
number of clusters and P the number of parameters).
set.seed(323)
S <- 100L
P <- 3L
draws <- matrix(rnorm(S * P), nrow = S, ncol = P)
# Clustering example:
S_cl <- 10L
cl_draws <- sample.int(S_cl, size = S, replace = TRUE)
draws_cl <- cl_agg(draws, cl = cl_draws)
# Clustering example with nonconstant `wdraws`:
w_draws <- rgamma(S, shape = 4)
draws_cl <- cl_agg(draws, cl = cl_draws, wdraws = w_draws)
# Thinning example (implying constant `wdraws`):
S_th <- 50L
idxs_thin <- round(seq(1, S, length.out = S_th))
th_draws <- rep(NA, S)
th_draws[idxs_thin] <- seq_len(S_th)
draws_th <- cl_agg(draws, cl = th_draws)
Calculates the ranking proportions from the fold-wise predictor rankings in a
cross-validation (CV) with fold-wise searches. For a given predictor x and a
given submodel size j, the ranking proportion is the proportion of CV folds
which have predictor x at position j of their predictor ranking. While these
ranking proportions are helpful for investigating variability in the
predictor ranking, they can also be cumulated across submodel sizes. The
cumulated ranking proportions are more helpful when it comes to model
selection.
cv_proportions(object, ...)
## S3 method for class 'ranking'
cv_proportions(object, cumulate = FALSE, ...)
## S3 method for class 'vsel'
cv_proportions(object, ...)
object |
For |
... |
For |
cumulate |
A single logical value indicating whether the ranking
proportions should be cumulated across increasing submodel sizes ( |
A numeric matrix containing the ranking proportions. This matrix has
nterms_max rows and nterms_max columns, with nterms_max as specified in the
(possibly implicit) ranking() call. The rows correspond to the submodel sizes
and the columns to the predictor terms (sorted according to the full-data
predictor ranking). If cumulate is FALSE, then the returned matrix is of
class cv_proportions. If cumulate is TRUE, then the returned matrix is of
classes cv_proportions_cumul and cv_proportions (in this order).

Note that if cumulate is FALSE, then the values in the returned matrix only
need to sum to 1 (column-wise and row-wise) if nterms_max (see above) is
equal to the full model size. Likewise, if cumulate is TRUE, then the value 1
only needs to occur in each column of the returned matrix if nterms_max is
equal to the full model size.
The cv_proportions() function is only applicable if the ranking object
includes fold-wise predictor rankings (i.e., if it is based on a vsel object
created by cv_varsel() with validate_search = TRUE). If the ranking object
contains only a full-data predictor ranking (i.e., if it is based on a vsel
object created by varsel() or by cv_varsel(), but the latter with
validate_search = FALSE), then an error is thrown because in that case, there
are no fold-wise predictor rankings from which to calculate ranking
proportions.
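A minimal sketch (assuming a vsel object `cvvs` created by cv_varsel() with
validate_search = TRUE, as in the example referenced below):

rk <- ranking(cvvs)
pr_rk <- cv_proportions(rk)                       # fold-wise ranking proportions
pr_rk_cum <- cv_proportions(rk, cumulate = TRUE)  # cumulated ranking proportions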
# For an example, see `?plot.cv_proportions`.
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the solution path, i.e., the
best submodel for each submodel size (number of predictor terms). The
evaluation part determines the predictive performance of the submodels along
the solution path. In contrast to varsel(), cv_varsel() performs a
cross-validation (CV) by running the search part with the training data of
each CV fold separately (an exception is explained in section "Note" below)
and running the evaluation part on the corresponding test set of each CV
fold.
cv_varsel(object, ...)
## Default S3 method:
cv_varsel(object, ...)
## S3 method for class 'refmodel'
cv_varsel(
object,
method = NULL,
cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold",
ndraws = NULL,
nclusters = 20,
ndraws_pred = 400,
nclusters_pred = NULL,
refit_prj = !inherits(object, "datafit"),
nterms_max = NULL,
penalty = NULL,
verbose = TRUE,
nloo = NULL,
K = if (!inherits(object, "datafit")) 5 else 10,
lambda_min_ratio = 1e-05,
nlambda = 150,
thresh = 1e-06,
regul = 1e-04,
validate_search = TRUE,
seed = NA,
search_terms = NULL,
...
)
object |
An object of class |
... |
Arguments passed to |
method |
The method for the search part. Possible options are |
cv_method |
The CV method, either |
ndraws |
Number of posterior draws used in the search part. Ignored if
|
nclusters |
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of |
ndraws_pred |
Only relevant if |
nclusters_pred |
Only relevant if |
refit_prj |
A single logical value indicating whether to fit the
submodels along the solution path again ( |
nterms_max |
Maximum submodel size (number of predictor terms) up to
which the search is continued. If |
penalty |
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of |
verbose |
A single logical value indicating whether to print out additional information during the computations. |
nloo |
Caution: Still experimental. Only relevant if |
K |
Only relevant if |
lambda_min_ratio |
Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large submodels are explored. No need to change this unless the program gives a warning about this. |
nlambda |
Only relevant for L1 search. Number of values in the lambda grid for L1-penalized search. No need to change this unless the program gives a warning about this. |
thresh |
Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this. |
regul |
A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually there is no need for regularization, but sometimes we need to add some regularization to avoid numerical problems. |
validate_search |
Only relevant if |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
search_terms |
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( |
Arguments ndraws, nclusters, nclusters_pred, and ndraws_pred are
automatically truncated at the number of posterior draws in the reference
model (which is 1 for datafits). Using fewer draws or clusters in ndraws,
nclusters, nclusters_pred, or ndraws_pred than posterior draws in the
reference model may result in slightly inaccurate projection performance.
Increasing these arguments affects the computation time linearly.

For argument method, there are some restrictions: For a reference model with
multilevel or additive formula terms or a reference model set up for the
augmented-data projection, only the forward search is available. Furthermore,
argument search_terms requires a forward search to take effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select interaction terms before the corresponding main terms are selected. If this is undesired, choose the forward search instead.
The elements of the search_terms character vector don't need to be individual
predictor terms. Instead, they can be building blocks consisting of several
predictor terms connected by the + symbol. To understand how these building
blocks work, it is important to know how projpred's forward search works: It
starts with an empty vector chosen which will later contain the already
selected predictor terms. Then, the search iterates over model sizes j = 1,
2, .... The candidate models at model size j are constructed from those
elements of search_terms which yield model size j when combined with the
chosen predictor terms. Note that sometimes, there may be no candidate models
for model size j. Also note that internally, search_terms is expanded to
include the intercept ("1"), so the first step of the search (model size 1)
always consists of the intercept-only model as the only candidate.
As a search_terms example, consider a reference model with formula y ~ x1 +
x2 + x3. Then, to ensure that x1 is always included in the candidate models,
specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3"). This
search would start with y ~ 1 as the only candidate at model size 1. At model
size 2, y ~ x1 would be the only candidate. At model size 3, y ~ x1 + x2 and
y ~ x1 + x3 would be the two candidates. At the last model size of 4,
y ~ x1 + x2 + x3 would be the only candidate. As another example, to exclude
x1 from the search, specify search_terms = c("x2", "x3", "x2 + x3").
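A minimal sketch of such a call (a reference model fit `fit` with predictors
x1, x2, and x3 is assumed):

cvvs <- cv_varsel(
  fit,
  method = "forward",
  search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
)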
An object of class vsel. The elements of this object are not meant to be
accessed directly but instead via helper functions (see the main vignette and
projpred-package).

If validate_search is FALSE, the search is not included in the CV so that
only a single full-data search is run.

For PSIS-LOO CV, projpred calls loo::psis() with r_eff = NA. This is only a
problem if there was extreme autocorrelation between the MCMC iterations when
the reference model was built. In those cases, however, the reference model
should not have been used anyway, so we don't expect projpred's r_eff = NA to
be a problem.
Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. "Bayesian Leave-One-Out Cross-Validation for Large Data." In Proceedings of the 36th International Conference on Machine Learning, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov, 97:4244–53. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/magnusson19a.html.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413–32. doi:10.1007/s11222-016-9696-4.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2022. "Pareto Smoothed Importance Sampling." arXiv. doi:10.48550/arXiv.1507.02646.
# Note: The code from this example is not executed when called via example().
# To execute it, you have to copy and paste it manually to the console.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876
)
# Run cv_varsel() (with small values for `K`, `nterms_max`, `nclusters`,
# and `nclusters_pred`, but only for the sake of speed in this example;
# this is not recommended in general):
cvvs <- cv_varsel(fit, cv_method = "kfold", K = 2, nterms_max = 3,
nclusters = 5, nclusters_pred = 10, seed = 5555)
# Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
# and `?ranking` for possible post-processing functions.
}
These are helper functions to create cross-validation (CV) folds, i.e., to
split up the indices from 1 to n into K subsets ("folds") for K-fold CV.
These functions are potentially useful when creating the cvfits and cvfun
arguments for init_refmodel(). Function cvfolds() is deprecated; please use
cv_folds() instead (apart from the name, they are the same). The return
values of cv_folds() and cv_ids() differ; see below for details.
cv_folds(n, K, seed = NA)
cvfolds(n, K, seed = NA)
cv_ids(n, K, out = c("foldwise", "indices"), seed = NA)
n |
Number of observations. |
K |
Number of folds. Must be at least 2 and not exceed |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
out |
Format of the output, either |
cv_folds() returns a vector of length n such that each element is an integer
between 1 and K denoting which fold the corresponding data point belongs to.
The return value of cv_ids() depends on the out argument. If out =
"foldwise", the return value is a list with K elements, each being a list
with elements tr and ts giving the training and test indices, respectively,
for the corresponding fold. If out = "indices", the return value is a list
with elements tr and ts, each being a list with K elements giving the
training and test indices, respectively, for each fold.
n <- 100
set.seed(1234)
y <- rnorm(n)
cv <- cv_ids(n, K = 5)
# Mean within the test set of each fold:
cvmeans <- sapply(cv, function(fold) mean(y[fold$ts]))
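# Similarly, a minimal sketch for cv_folds() (continuing the example above):
folds <- cv_folds(n, K = 5, seed = 1234)
table(folds)                         # number of observations per fold
cvmeans2 <- tapply(y, folds, mean)   # mean within each fold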
Binomial toy example

df_binom

A simulated classification dataset containing 100 observations.

- y: response, 0 or 1.
- x: predictors, 30 in total.

Source: https://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData
Gaussian toy example

df_gaussian

A simulated regression dataset containing 100 observations.

- y: response, real-valued.
- x: predictors, 20 in total. Mean and SD are approximately 0 and 1,
  respectively.

Source: https://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData
This function adds some internally required elements to an object of class
family (see, e.g., family()). It is called internally by init_refmodel(), so
you will rarely need to call it yourself.
extend_family(
family,
latent = FALSE,
latent_y_unqs = NULL,
latent_ilink = NULL,
latent_ll_oscale = NULL,
latent_ppd_oscale = NULL,
augdat_y_unqs = NULL,
augdat_link = NULL,
augdat_ilink = NULL,
augdat_args_link = list(),
augdat_args_ilink = list(),
...
)
family |
An object of class |
latent |
A single logical value indicating whether to use the latent
projection ( |
latent_y_unqs |
Only relevant for a latent projection where the original
response space has finite support (i.e., the original response values may
be regarded as categories), in which case this needs to be the character
vector of unique response values (which will be assigned to |
latent_ilink |
Only relevant for the latent projection, in which case
this needs to be the inverse-link function. If the original response family
was the |
latent_ll_oscale |
Only relevant for the latent projection, in which
case this needs to be the function computing response-scale (not
latent-scale) log-likelihood values. If |
latent_ppd_oscale |
Only relevant for the latent projection, in which
case this needs to be the function sampling response values given latent
predictors that have been transformed to response scale using
|
augdat_y_unqs |
Only relevant for augmented-data projection, in which
case this needs to be the character vector of unique response values (which
will be assigned to |
augdat_link |
Only relevant for augmented-data projection, in which case
this needs to be the link function. Use |
augdat_ilink |
Only relevant for augmented-data projection, in which
case this needs to be the inverse-link function. Use |
augdat_args_link |
Only relevant for augmented-data projection, in which
case this may be a named |
augdat_args_ilink |
Only relevant for augmented-data projection, in
which case this may be a named |
... |
Ignored (exists only to swallow up further arguments which might be passed to this function). |
In the following, S_ref, S_prj, N, C_cat, and C_lat from help topic
refmodel-init-get are used (roughly: the number of posterior draws in the
reference model, the number of projected draws, the number of observations,
and the numbers of response categories on the original and on the
latent/augmented scale, respectively). Note that N does not necessarily
denote the number of original observations; it can also refer to new
observations. Furthermore, let S denote either S_ref or S_prj, and C either
C_cat or C_lat, whichever is appropriate in the context where it is used.

The family object extended in the way needed by projpred.
As their first input, the functions supplied to arguments augdat_link and
augdat_ilink have to accept:

- For augdat_link: an array containing the probabilities for the response
  categories. The order of the response categories is the same as in
  family$cats (see argument augdat_y_unqs).
- For augdat_ilink: an array containing the linear predictors.

The return value of these functions needs to be:

- For augdat_link: an array containing the linear predictors.
- For augdat_ilink: an array containing the probabilities for the response
  categories. The order of the response categories has to be the same as in
  family$cats (see argument augdat_y_unqs).
For the augmented-data projection, the response vector resulting from
extract_model_data (see init_refmodel()) is coerced to a factor (using
as.factor()) at multiple places throughout this package. Inside of
init_refmodel(), the levels of this factor have to be identical to
family$cats (after applying extend_family() inside of init_refmodel()).
Everywhere else, these levels have to be a subset of <refmodel>$family$cats
(where <refmodel> is an object resulting from init_refmodel()). See argument
augdat_y_unqs for how to control family$cats.
For ordinal brms families, be aware that the submodels (onto which the
reference model is projected) currently have the following restrictions:

- The discrimination parameter disc is not supported (i.e., it is a constant
  with value 1).
- The thresholds are "flexible" (see brms::brmsfamily()).
- The thresholds do not vary across the levels of a factor-like variable (see
  argument gr of brms::resp_thres()).
- The "probit_approx" link is replaced by "probit".

For the brms::categorical() family, be aware that:

- For multilevel submodels, the group-level effects are allowed to be
  correlated between different response categories.
- For multilevel submodels, mclogit versions < 0.9.4 may throw the error
  'a' (<number> x 1) must be square. Updating mclogit to a version >= 0.9.4
  should fix this.
The function supplied to argument latent_ilink needs to have the prototype

  latent_ilink(lpreds, cl_ref, wdraws_ref = rep(1, length(cl_ref)))

where:

- lpreds accepts an S x N matrix containing the linear predictors.
- cl_ref accepts a numeric vector of length S_ref, containing projpred's
  internal cluster indices for these draws.
- wdraws_ref accepts a numeric vector of length S_ref, containing weights for
  these draws. These weights should be treated as not being normalized (i.e.,
  they don't necessarily sum to 1).

The return value of latent_ilink needs to contain the linear predictors
transformed to the original response space, with the following structure:

- If is.null(family$cats) (after taking latent_y_unqs into account): an
  S x N matrix.
- If !is.null(family$cats) (after taking latent_y_unqs into account): an
  S x N x C array. In that case, latent_ilink needs to return probabilities
  (for the response categories given in family$cats, after taking
  latent_y_unqs into account).
The function supplied to argument latent_ll_oscale needs to have the prototype

  latent_ll_oscale(ilpreds, y_oscale, wobs = rep(1, length(y_oscale)), cl_ref,
                   wdraws_ref = rep(1, length(cl_ref)))

where:

- ilpreds accepts the return value from latent_ilink.
- y_oscale accepts a vector of length N containing response values on the
  original response scale.
- wobs accepts a numeric vector of length N containing observation weights.
- cl_ref accepts the same input as argument cl_ref of latent_ilink.
- wdraws_ref accepts the same input as argument wdraws_ref of latent_ilink.

The return value of latent_ll_oscale needs to be an S x N matrix containing
the response-scale (not latent-scale) log-likelihood values for the N
observations from its inputs.
The function supplied to argument latent_ppd_oscale needs to have the prototype

  latent_ppd_oscale(ilpreds_resamp, wobs, cl_ref,
                    wdraws_ref = rep(1, length(cl_ref)), idxs_prjdraws)

where:

- ilpreds_resamp accepts the return value from latent_ilink, but possibly with
  resampled (clustered) draws (see argument nresample_clusters of
  proj_predict()).
- wobs accepts a numeric vector of length N containing observation weights.
- cl_ref accepts the same input as argument cl_ref of latent_ilink.
- wdraws_ref accepts the same input as argument wdraws_ref of latent_ilink.
- idxs_prjdraws accepts a numeric vector of length dim(ilpreds_resamp)[1]
  containing the resampled indices of the projected draws (i.e., these indices
  are values from the set 1, ..., dim(ilpreds)[1], where ilpreds denotes the
  return value of latent_ilink).

The return value of latent_ppd_oscale needs to be a matrix with
dim(ilpreds_resamp)[1] rows and N columns, containing the response-scale (not
latent-scale) draws from the posterior(-projection) predictive distributions
for the N observations from its inputs.
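As a purely illustrative, hedged sketch following the prototypes above
(hypothetical helpers for a latent projection with a Poisson response and log
link; whether such custom helpers are needed at all depends on the reference
model):

latent_ilink_pois <- function(lpreds, cl_ref,
                              wdraws_ref = rep(1, length(cl_ref))) {
  exp(lpreds)  # S x N matrix of response-scale means
}

latent_ll_oscale_pois <- function(ilpreds, y_oscale,
                                  wobs = rep(1, length(y_oscale)),
                                  cl_ref,
                                  wdraws_ref = rep(1, length(cl_ref))) {
  # S x N matrix of response-scale log-likelihood values (weighted by wobs):
  ll <- dpois(rep(y_oscale, each = nrow(ilpreds)), lambda = ilpreds, log = TRUE)
  ll <- matrix(ll, nrow = nrow(ilpreds), ncol = ncol(ilpreds))
  sweep(ll, 2, wobs, "*")
}

latent_ppd_oscale_pois <- function(ilpreds_resamp, wobs, cl_ref,
                                   wdraws_ref = rep(1, length(cl_ref)),
                                   idxs_prjdraws) {
  # Response-scale predictive draws, same shape as ilpreds_resamp:
  ppd <- rpois(length(ilpreds_resamp), lambda = ilpreds_resamp)
  matrix(ppd, nrow = nrow(ilpreds_resamp), ncol = ncol(ilpreds_resamp))
}

# These could then be passed to extend_family(), e.g.:
# fam_lat <- extend_family(poisson(), latent = TRUE,
#                          latent_ilink = latent_ilink_pois,
#                          latent_ll_oscale = latent_ll_oscale_pois,
#                          latent_ppd_oscale = latent_ppd_oscale_pois)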
If the bodies of these three functions involve parameter draws from the
reference model which have not been projected (e.g., for latent_ilink, the
thresholds in an ordinal model), cl_agg() is provided as a helper function
for aggregating these reference model draws in the same way as the draws have
been aggregated for the first argument of these functions (e.g., lpreds in
case of latent_ilink).

In fact, the weights passed to argument wdraws_ref are nonconstant only in
case of cv_varsel() with cv_method = "LOO" and validate_search = TRUE. In
that case, the weights passed to this argument are the PSIS-LOO CV weights
for one observation. Note that although argument wdraws_ref has the suffix
_ref, wdraws_ref does not necessarily obtain weights for the initial
reference model's posterior draws: In case of cv_varsel() with cv_method =
"kfold", these weights may refer to one of the reference model re-fits (but
in that case, they are constant anyway).
If family$cats is not NULL (after taking latent_y_unqs into account), then
the response vector resulting from extract_model_data (see init_refmodel())
is coerced to a factor (using as.factor()) at multiple places throughout this
package. Inside of init_refmodel(), the levels of this factor have to be
identical to family$cats (after applying extend_family() inside of
init_refmodel()). Everywhere else, these levels have to be a subset of
<refmodel>$family$cats (where <refmodel> is an object resulting from
init_refmodel()).
Family objects not in the set of default family
objects.
Student_t(link = "identity", nu = 3)
link |
Name of the link function. In contrast to the default |
nu |
Degrees of freedom for the Student- |
A family object analogous to those described in family.

Support for the Student_t() family is still experimental.
The mesquite bushes yields dataset from Gelman and Hill (2006) (http://www.stat.columbia.edu/~gelman/arm/).
mesquite
The response variable is the total weight (in grams) of photosynthetic
material as derived from actual harvesting of the bush. The predictor
variables are:

- diameter of the canopy (the leafy area of the bush) in meters, measured
  along the longer axis of the bush
- canopy diameter measured along the shorter axis
- height of the canopy
- total height of the bush
- plant unit density (# of primary stems per plant unit)
- group of measurements (0 for the first group, 1 for the second group)

Source: http://www.stat.columbia.edu/~gelman/arm/examples/mesquite/mesquite.dat
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511790942.
Plots the ranking proportions (see cv_proportions()
) from the fold-wise
predictor rankings in a cross-validation with fold-wise searches. This is a
visualization of the transposed matrix returned by cv_proportions()
. The
proportions printed as text inside of the colored tiles are rounded to whole
percentage points (the plotted proportions themselves are not rounded).
## S3 method for class 'cv_proportions'
plot(x, text_angle = NULL, ...)
## S3 method for class 'ranking'
plot(x, ...)
x |
For |
text_angle |
Passed to argument |
... |
For |
A ggplot2 plotting object (of class gg
and ggplot
).
Idea and original code by Aki Vehtari. Slight modifications of the original code by Frank Weber, Yann McLatchie, and Sölvi Rögnvaldsson. Final implementation in projpred by Frank Weber.
# Note: The code from this example is not executed when called via example().
# To execute it, you have to copy and paste it manually to the console.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876
)
# Run cv_varsel() (with small values for `K`, `nterms_max`, `nclusters`,
# and `nclusters_pred`, but only for the sake of speed in this example;
# this is not recommended in general):
cvvs <- cv_varsel(fit, cv_method = "kfold", K = 2, nterms_max = 3,
nclusters = 5, nclusters_pred = 10, seed = 5555)
# Extract predictor rankings:
rk <- ranking(cvvs)
# Compute ranking proportions:
pr_rk <- cv_proportions(rk)
# Visualize the ranking proportions:
gg_pr_rk <- plot(pr_rk)
print(gg_pr_rk)
# Since the object returned by plot.cv_proportions() is a standard ggplot2
# plotting object, you can modify the plot easily, e.g., to remove the
# legend:
print(gg_pr_rk + theme(legend.position = "none"))
}
This is the plot()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). It visualizes the predictive performance of the reference
model (possibly also that of some other "baseline" model) and that of the
submodels along the full-data predictor ranking. Basic information about the
(CV) variability in the ranking of the predictors is included as well (if
available; inferred from cv_proportions()
). For a tabular representation,
see summary.vsel()
.
## S3 method for class 'vsel'
plot(
x,
nterms_max = NULL,
stats = "elpd",
deltas = FALSE,
alpha = 2 * pnorm(-1),
baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best",
thres_elpd = NA,
resp_oscale = TRUE,
ranking_nterms_max = NULL,
ranking_abbreviate = FALSE,
ranking_abbreviate_args = list(),
ranking_repel = NULL,
ranking_repel_args = list(),
cumulate = FALSE,
text_angle = NULL,
...
)
x |
An object of class |
nterms_max |
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using |
stats |
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
cross-validation, these are all observations because they are partitioned
into multiple test sets; in case of
|
deltas |
If |
alpha |
A number determining the (nominal) coverage |
baseline |
For |
thres_elpd |
Only relevant if |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( |
ranking_nterms_max |
Maximum submodel size (number of predictor terms)
for which the predictor names and the corresponding ranking proportions are
added on the x-axis. Using |
ranking_abbreviate |
A single logical value indicating whether the
predictor names in the full-data predictor ranking should be abbreviated by
|
ranking_abbreviate_args |
A |
ranking_repel |
Either |
ranking_repel_args |
A |
cumulate |
Passed to argument |
text_angle |
Passed to argument |
... |
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument |
The stats options "mse" and "rmse" are only available for:

- the traditional projection,
- the latent projection with resp_oscale = FALSE,
- the latent projection with resp_oscale = TRUE in combination with
  <refmodel>$family$cats being NULL.

The stats option "acc" (= "pctcorr") is only available for:

- the binomial() family in case of the traditional projection,
- all families in case of the augmented-data projection,
- the binomial() family (on the original response scale) in case of the
  latent projection with resp_oscale = TRUE in combination with
  <refmodel>$family$cats being NULL,
- all families (on the original response scale) in case of the latent
  projection with resp_oscale = TRUE in combination with
  <refmodel>$family$cats being not NULL.

The stats option "auc" is only available for:

- the binomial() family in case of the traditional projection,
- the binomial() family (on the original response scale) in case of the
  latent projection with resp_oscale = TRUE in combination with
  <refmodel>$family$cats being NULL.
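For example, a minimal sketch of requesting several of these statistics at
once (a vsel object `vs` is assumed; the availability of each statistic
depends on the conditions above):

print(plot(vs, stats = c("elpd", "rmse"), deltas = TRUE))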
A ggplot2 plotting object (of class gg and ggplot). If ranking_abbreviate is
TRUE, the output of abbreviate() is stored in an attribute called
projpred_ranking_abbreviated (to allow the abbreviations to be easily mapped
back to the original predictor names).

As long as the reference model's performance is computable, it is always
shown in the plot as a dashed red horizontal line. If baseline = "best", the
baseline model's performance is shown as a dotted black horizontal line. If
!is.na(thres_elpd) and any(stats %in% c("elpd", "mlpd")), the value supplied
to thres_elpd (which is automatically adapted internally in case of the MLPD
or deltas = FALSE) is shown as a dot-dashed gray horizontal line for the
reference model and, if baseline = "best", as a long-dashed green horizontal
line for the baseline model.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation and with small values for
# `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
seed = 5555)
print(plot(vs))
}
After the projection of the reference model onto a submodel, the linear
predictors (for the original or a new dataset) based on that submodel can be
calculated by proj_linpred(). These linear predictors can also be transformed
to response scale and averaged across the projected parameter draws.
Furthermore, proj_linpred() returns the corresponding log predictive density
values if the (original or new) dataset contains response values. The
proj_predict() function draws from the predictive distributions (there is one
such distribution for each observation from the original or new dataset) of
the submodel that the reference model has been projected onto. If the
projection has not been performed yet, both functions call project()
internally to perform the projection. Both functions can also handle multiple
submodels at once (for objects of class vsel or objects returned by a
project() call to an object of class vsel; see project()).
proj_linpred(
object,
newdata = NULL,
offsetnew = NULL,
weightsnew = NULL,
filter_nterms = NULL,
transform = FALSE,
integrated = FALSE,
.seed = NA,
...
)
proj_predict(
object,
newdata = NULL,
offsetnew = NULL,
weightsnew = NULL,
filter_nterms = NULL,
nresample_clusters = 1000,
.seed = NA,
resp_oscale = TRUE,
...
)
object |
An object returned by |
newdata |
Passed to argument |
offsetnew |
Passed to argument |
weightsnew |
Passed to argument |
filter_nterms |
Only applies if |
transform |
For |
integrated |
For |
.seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
... |
Arguments passed to |
nresample_clusters |
For |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to draw from the posterior-projection predictive
distributions on the original response scale ( |
Currently, proj_predict() ignores observation weights that are not equal to
1. A corresponding warning is thrown if this is the case.

In case of the latent projection and transform = FALSE:

- Output element pred contains the linear predictors without any
  modifications that may be due to the original response distribution (e.g.,
  for a brms::cumulative() model, the ordered thresholds are not taken into
  account).
- Output element lpd contains the latent log predictive density values, i.e.,
  those corresponding to the latent Gaussian distribution. If newdata is not
  NULL, this requires the latent response values to be supplied in a column
  called .<response_name> of newdata, where <response_name> needs to be
  replaced by the name of the original response variable (if <response_name>
  contained parentheses, these have been stripped off by init_refmodel(); see
  the left-hand side of formula(<refmodel>)). For technical reasons, the
  existence of column <response_name> in newdata is another requirement (even
  though .<response_name> is actually used).
In the following, S_prj, N, C_cat, and C_lat from help topic
refmodel-init-get are used. (For proj_linpred() with integrated = TRUE, we
have S_prj = 1.) Furthermore, let C denote either C_cat (if transform = TRUE)
or C_lat (if transform = FALSE). Then, if the prediction is done for one
submodel only (i.e., length(nterms) == 1 || !is.null(solution_terms) in the
call to project()):

proj_linpred() returns a list with the following elements:

- Element pred contains the actual predictions, i.e., the linear predictors,
  possibly transformed to response scale (depending on argument transform).
- Element lpd is non-NULL only if newdata is NULL or if newdata contains
  response values in the corresponding column. In that case, it contains the
  log predictive density values (conditional on each of the projected
  parameter draws if integrated = FALSE and averaged across the projected
  parameter draws if integrated = TRUE).
In case of (i) the traditional projection, (ii) the latent projection with
transform = FALSE, or (iii) the latent projection with transform = TRUE and
<refmodel>$family$cats (where <refmodel> is an object resulting from
init_refmodel(); see also extend_family()'s argument latent_y_unqs) being
NULL, both elements are S_prj x N matrices. In case of (i) the augmented-data
projection or (ii) the latent projection with transform = TRUE and
<refmodel>$family$cats being not NULL, pred is an S_prj x N x C array and lpd
is an S_prj x N matrix.
proj_predict() returns an S_prj x N matrix of predictions, where S_prj
denotes nresample_clusters in case of clustered projection. In case of (i)
the augmented-data projection or (ii) the latent projection with resp_oscale
= TRUE and <refmodel>$family$cats being not NULL, this matrix has an
attribute called cats (the character vector of response categories) and the
values of the matrix are the predicted indices of the response categories
(these indices refer to the order of the response categories from attribute
cats).

If the prediction is done for more than one submodel, the output from above
is returned for each submodel, giving a named list with one element for each
submodel (the names of this list being the numbers of solution terms of the
submodels when counting the intercept, too).
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Projection onto an arbitrary combination of predictor terms (with a small
# value for `nclusters`, but only for the sake of speed in this example;
# this is not recommended in general):
prj <- project(fit, solution_terms = c("X1", "X3", "X5"), nclusters = 10,
seed = 9182)
# Predictions (at the training points) from the submodel onto which the
# reference model was projected:
prjl <- proj_linpred(prj)
prjp <- proj_predict(prj, .seed = 7364)
}
This is the predict() method for refmodel objects (returned by get_refmodel()
or init_refmodel()). It offers three types of output, which are all based on
the reference model and new (or old) observations: either the linear
predictor on link scale, the linear predictor transformed to response scale,
or the log posterior predictive density.
## S3 method for class 'refmodel'
predict(
object,
newdata = NULL,
ynew = NULL,
offsetnew = NULL,
weightsnew = NULL,
type = "response",
...
)
object |
An object of class |
newdata |
Passed to argument |
ynew |
If not |
offsetnew |
Passed to argument |
weightsnew |
Passed to argument |
type |
Usually only relevant if |
... |
Currently ignored. |
Argument weightsnew is only relevant if !is.null(ynew).

In case of a multilevel reference model, group-level effects for new group
levels are drawn randomly from a (multivariate) Gaussian distribution. When
setting projpred.mlvl_pred_new to TRUE, all group levels from newdata (even
those that already exist in the original dataset) are treated as new group
levels (if is.null(newdata), all group levels from the original dataset are
considered as new group levels in that case).
In the following, N, C_cat, and C_lat from help topic refmodel-init-get are
used. Furthermore, let C denote either C_cat (if type = "response") or C_lat
(if type = "link"). Then, if is.null(ynew), the returned object contains the
reference model's predictions (with the scale depending on argument type) as:

- a length-N vector in case of (i) the traditional projection, (ii) the
  latent projection with type = "link", or (iii) the latent projection with
  type = "response" and object$family$cats being NULL;
- an N x C matrix in case of (i) the augmented-data projection or (ii) the
  latent projection with type = "response" and object$family$cats being not
  NULL.

If !is.null(ynew), the returned object is a length-N vector of log posterior
predictive densities evaluated at ynew.
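A minimal sketch (a reference model fit `fit` and a hypothetical new data
frame dat_new with the required columns are assumed):

refm <- get_refmodel(fit)
mu_new <- predict(refm, newdata = dat_new, type = "response")   # response-scale predictions
lppd_new <- predict(refm, newdata = dat_new, ynew = dat_new$y)  # log predictive densities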
For a projection object (returned by project(), possibly as elements of a
list), this function extracts the combination of predictor terms onto which
the projection was performed.
predictor_terms(object, ...)
## S3 method for class 'projection'
predictor_terms(object, ...)
object |
An object of class |
... |
Currently ignored. |
A character vector of predictor terms.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Projection onto an arbitrary combination of predictor terms (with a small
# value for `nclusters`, but only for the sake of speed in this example;
# this is not recommended in general):
prj <- project(fit, solution_terms = c("X1", "X3", "X5"), nclusters = 10,
seed = 9182)
print(predictor_terms(prj)) # gives `c("X1", "X3", "X5")`
}
This is the print() method for vsel objects (returned by varsel() or
cv_varsel()). It displays a summary of a varsel() or cv_varsel() run by first
calling summary.vsel() and then print.vselsummary().
## S3 method for class 'vsel'
print(x, ...)
x |
An object of class |
... |
Arguments passed to |
The output of summary.vsel()
(invisible).
This is the print() method for summary objects created by summary.vsel(). It
displays a summary of the results from a varsel() or cv_varsel() run.
## S3 method for class 'vselsummary'
print(x, ...)
x |
An object of class |
... |
Arguments passed to |
In the table printed at the bottom, column solution_terms contains the
full-data predictor ranking and column cv_proportions_diag contains the main
diagonal of the matrix returned by cv_proportions() (with cumulate as set in
the summary.vsel() call that created x).

The output of summary.vsel() (invisible).
Project the posterior of the reference model onto the parameter space of a single submodel consisting of a specific combination of predictor terms or (after variable selection) onto the parameter space of a single or multiple submodels of specific sizes.
project(
object,
nterms = NULL,
solution_terms = NULL,
refit_prj = TRUE,
ndraws = 400,
nclusters = NULL,
seed = NA,
regul = 1e-04,
...
)
object |
An object which can be used as input to |
nterms |
Only relevant if |
solution_terms |
If not |
refit_prj |
A single logical value indicating whether to fit the
submodels (again) ( |
ndraws |
Only relevant if |
nclusters |
Only relevant if |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
regul |
A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually there is no need for regularization, but sometimes we need to add some regularization to avoid numerical problems. |
... |
Arguments passed to |
Arguments ndraws and nclusters are automatically truncated at the number of
posterior draws in the reference model (which is 1 for datafits). Using fewer
draws or clusters in ndraws or nclusters than posterior draws in the
reference model may result in slightly inaccurate projection performance.
Increasing these arguments affects the computation time linearly.

Note that if project() is applied to output from cv_varsel(), then refit_prj
= FALSE will take the results from the full-data search.
If the projection is performed onto a single submodel (i.e., length(nterms)
== 1 || !is.null(solution_terms)), an object of class projection which is a
list containing the following elements:

- dis: Projected draws for the dispersion parameter.
- ce: The cross-entropy part of the Kullback-Leibler (KL) divergence from the
  reference model to the submodel. For some families, this is not the actual
  cross-entropy, but a reduced one where terms which would cancel out when
  calculating the KL divergence have been dropped. In case of the Gaussian
  family, that reduced cross-entropy is further modified, yielding merely a
  proxy.
- wdraws_prj: Weights for the projected draws.
- solution_terms: A character vector of the submodel's predictor terms.
- outdmin: A list containing the submodel fits (one fit per projected draw).
  This is the same as the return value of the div_minimizer function (see
  init_refmodel()), except if project() was used with an object of class vsel
  based on an L1 search as well as with refit_prj = FALSE, in which case this
  is the output from an internal L1-penalized divergence minimizer.
- cl_ref: A numeric vector of length equal to the number of posterior draws
  in the reference model, containing the cluster indices of these draws.
- wdraws_ref: A numeric vector of length equal to the number of posterior
  draws in the reference model, giving the weights of these draws. These
  weights should be treated as not being normalized (i.e., they don't
  necessarily sum to 1).
- p_type: A single logical value indicating whether the reference model's
  posterior draws have been clustered for the projection (TRUE) or not
  (FALSE).
- refmodel: The reference model object.

If the projection is performed onto more than one submodel, the output from
above is returned for each submodel, giving a list with one element for each
submodel.

The elements of an object of class projection are not meant to be accessed
directly but instead via helper functions (see the main vignette and
projpred-package). An exception is element wdraws_prj, which is currently
needed to weight quantities derived from the projected draws in case of
clustered projection, e.g., after applying as.matrix.projection() (which
throws a warning in case of clustered projection to make users aware of this
problem).
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation and with small values for
# `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
seed = 5555)
# Projection onto the best submodel with 2 predictor terms (with a small
# value for `nclusters`, but only for the sake of speed in this example;
# this is not recommended in general):
prj_from_vs <- project(vs, nterms = 2, nclusters = 10, seed = 9182)
# Projection onto an arbitrary combination of predictor terms (with a small
# value for `nclusters`, but only for the sake of speed in this example;
# this is not recommended in general):
prj <- project(fit, solution_terms = c("X1", "X3", "X5"), nclusters = 10,
seed = 9182)
}
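As a minimal sketch of how element wdraws_prj can be used after a clustered projection (building on the prj object from the example above and assuming that the rows of the matrix returned by as.matrix.projection() correspond to the projected draws in the same order as wdraws_prj):
if (requireNamespace("rstanarm", quietly = TRUE)) {
  # `prj` used a clustered projection (`nclusters = 10`), so as.matrix() warns
  # that the projected draws carry weights:
  prj_mat <- as.matrix(prj)
  wdraws <- prj$wdraws_prj
  wdraws <- wdraws / sum(wdraws)  # normalize the cluster weights
  # Weighted posterior mean of the projected parameters:
  print(colSums(prj_mat * wdraws))
}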
Extracts the predictor ranking(s) from an object of class vsel
(returned
by varsel()
or cv_varsel()
). A predictor ranking is simply a character
vector of predictor terms ranked by predictive relevance (with the most
relevant term first). In any case, objects of class vsel
contain the
predictor ranking based on the full-data search. If an object of class
vsel
is based on a cross-validation (CV) with fold-wise searches (i.e., if
it was created by cv_varsel()
with validate_search = TRUE
), then it also
contains fold-wise predictor rankings.
ranking(object, ...)
## S3 method for class 'vsel'
ranking(object, nterms_max = NULL, ...)
object |
The object from which to retrieve the predictor ranking(s). Possible classes may be inferred from the names of the corresponding methods (see also the description). |
... |
Currently ignored. |
nterms_max |
Maximum submodel size (number of predictor terms) for the
predictor ranking(s), i.e., the submodel size at which to cut off the
predictor ranking(s). Using |
An object of class ranking
which is a list
with the following
elements:
fulldata
: The predictor ranking from the full-data search.
foldwise
: The predictor rankings from the fold-wise
searches in the form of a character matrix (only available if object
is
based on a CV with fold-wise searches, otherwise element foldwise
is
NULL
). The rows of this matrix correspond to the CV folds and the columns
to the submodel sizes. Each row contains the predictor ranking from the
search of that CV fold.
# For an example, see `?plot.cv_proportions`.
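As a small additional sketch (assuming the vs object created in the project() example above):
if (requireNamespace("rstanarm", quietly = TRUE)) {
  rk <- ranking(vs)
  print(rk$fulldata)  # full-data predictor ranking
  print(rk$foldwise)  # NULL here because `vs` is not based on fold-wise searches
}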
Function get_refmodel()
is a generic function whose methods usually call
init_refmodel()
which is the underlying workhorse (and may also be used
directly without a call to get_refmodel()
).
Both get_refmodel() and init_refmodel() create an object containing
information needed for the projection predictive variable selection, namely
about the reference model, the submodels, and how the projection should be
carried out. For the sake of simplicity, the documentation may refer to the
resulting object also as "reference model" or "reference model object", even
though it also contains information about the submodels and the projection.
A "typical" reference model object is created by get_refmodel.stanreg()
and
brms::get_refmodel.brmsfit()
, either implicitly by a call to a top-level
function such as project()
, varsel()
, and cv_varsel()
or explicitly by
a call to get_refmodel()
. All non-"typical" reference model objects will be
called "custom" reference model objects.
Some arguments are for K-fold cross-validation (K-fold CV) only; see cv_varsel() for the use of K-fold CV in projpred.
get_refmodel(object, ...)
## S3 method for class 'refmodel'
get_refmodel(object, ...)
## S3 method for class 'vsel'
get_refmodel(object, ...)
## Default S3 method:
get_refmodel(object, formula, family = NULL, ...)
## S3 method for class 'stanreg'
get_refmodel(object, latent = FALSE, dis = NULL, ...)
init_refmodel(
object,
data,
formula,
family,
ref_predfun = NULL,
div_minimizer = NULL,
proj_predfun = NULL,
extract_model_data,
cvfun = NULL,
cvfits = NULL,
dis = NULL,
cvrefbuilder = NULL,
...
)
object |
For |
... |
For |
formula |
The full formula to use for the search procedure. For custom
reference models, this does not necessarily coincide with the reference
model's formula. For general information about formulas in R, see
|
family |
An object of class |
latent |
A single logical value indicating whether to use the latent
projection ( |
dis |
A vector of posterior draws for the reference model's dispersion
parameter or—more precisely—the posterior values for the reference
model's parameter-conditional predictive variance (assuming that this
variance is the same for all observations). May be |
data |
A |
ref_predfun |
Prediction function for the linear predictor of the
reference model, including offsets (if existing). See also section
"Arguments |
div_minimizer |
A function for minimizing the Kullback-Leibler (KL)
divergence from the reference model to a submodel (i.e., for performing the
projection of the reference model onto a submodel). The output of
|
proj_predfun |
Prediction function for the linear predictor of a
submodel onto which the reference model is projected. See also section
"Arguments |
extract_model_data |
A function for fetching some variables (response,
observation weights, offsets) from the original dataset (supplied to
argument |
cvfun |
For |
cvfits |
For |
cvrefbuilder |
For |
An object that can be passed to all the functions that take the
reference model fit as the first argument, such as varsel()
,
cv_varsel()
, project()
, proj_linpred()
, and proj_predict()
.
Usually, the returned object is of class refmodel
. However, if object
is NULL
, the returned object is of class datafit
as well as of class
refmodel
(with datafit
being first). Objects of class datafit
are
handled differently at several places throughout this package.
The elements of the returned object are not meant to be accessed directly
but instead via downstream functions (see the functions mentioned above as
well as predict.refmodel()
).
Although bad practice (in general), a reference model lacking an intercept can be used within projpred. However, it will always be projected onto submodels which include an intercept. The reason is that even if the true intercept in the reference model is zero, this does not need to hold for the submodels.
In multilevel (group-level) terms, function calls on the right-hand side of
the |
character (e.g., (1 | gr(group_variable))
, which is possible in
brms) are currently not allowed in projpred.
For additive models (still an experimental feature), only mgcv::s()
and
mgcv::t2()
are currently supported as smooth terms. Furthermore, these need
to be called without any arguments apart from the predictor names (symbols).
For example, for smoothing the effect of a predictor x
, only s(x)
or
t2(x)
are allowed. As another example, for smoothing the joint effect of
two predictors x
and z
, only s(x, z)
or t2(x, z)
are allowed (and
analogously for higher-order joint effects, e.g., of three predictors). Note
that all smooth terms need to be included in formula
(there is no random
argument as in rstanarm::stan_gamm4()
, for example).
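For illustration (with hypothetical predictors x and z; the formulas are only constructed here, not fitted):
f_ok  <- y ~ s(x) + t2(x, z)  # allowed: smooth terms containing predictor names only
f_bad <- y ~ s(x, k = 5)      # not allowed: extra argument `k` inside s()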
ref_predfun
, proj_predfun
, and div_minimizer
Arguments ref_predfun, proj_predfun, and div_minimizer may be NULL for using an internal default (see projpred-package for the functions used by the default divergence minimizers). Otherwise, let N denote the number of observations (in case of CV, these may be reduced to each fold), S_ref the number of posterior draws for the reference model's parameters, and S_prj the number of draws for the parameters of a submodel that the reference model has been projected onto (short: the number of projected draws). For the augmented-data projection, let C_cat denote the number of response categories, C_lat the number of latent response categories (which typically equals C_cat), and define N_augcat := N * C_cat as well as N_auglat := N * C_lat. Then the functions supplied to these arguments need to have the following prototypes:
ref_predfun
: ref_predfun(fit, newdata = NULL)
where:
fit
accepts the reference model fit as given in argument object
(but possibly re-fitted to a subset of the observations, as done in
K-fold CV).
newdata
accepts either NULL
(for using the original dataset,
typically stored in fit
) or data for new observations (at least in the
form of a data.frame
).
proj_predfun
: proj_predfun(fits, newdata)
where:
fits
accepts a list
of length S_prj containing this number of submodel fits. This
list
is the same as that
returned by project()
in its output element outdmin
(which in turn is
the same as the return value of div_minimizer
, except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
div_minimizer
does not need to have a specific prototype, but it needs to
be able to be called with the following arguments:
formula
accepts either a standard formula with a single response (if S_prj = 1 or in case of the augmented-data projection) or a formula with S_prj response variables cbind()-ed on the left-hand side, in which case the projection has to be performed for each of the response variables separately.
data
accepts a data.frame
to be used for the projection. In case of the traditional or the latent projection, this dataset has N rows. In case of the augmented-data projection, this dataset has N_auglat rows.
family
accepts an object of class family
.
weights
accepts either observation weights (at least in the form of a
numeric vector) or NULL
(for using a vector of ones as weights).
projpred_var
accepts an N x S_prj matrix of predictive variances (necessary for projpred's internal GLM fitter) in case of the traditional or the latent projection and an N_auglat x S_prj matrix (containing only NAs) in case of the augmented-data projection.
projpred_regul
accepts a single numeric value as supplied to argument
regul
of project()
, for example.
projpred_ws_aug
accepts an N x S_prj matrix of expected values for the response in case of the traditional or the latent projection and an N_augcat x S_prj matrix of probabilities for the response categories in case of the augmented-data projection.
...
accepts further arguments specified by the user.
The return value of these functions needs to be:
ref_predfun
: for the traditional or the latent projection, an N x S_ref matrix; for the augmented-data projection, an S_ref x N x C_lat array (the only exception is the augmented-data projection for the binomial() family, in which case ref_predfun needs to return an N x S_ref matrix just like for the traditional projection because the array is constructed by an internal wrapper function).
proj_predfun
: for the traditional or the latent projection, an N x S_prj matrix; for the augmented-data projection, an N x C_cat x S_prj array.
div_minimizer
: a list of length S_prj containing this number of submodel fits.
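As an illustration (not the internal default, and assuming a Gaussian "stanreg" reference model without offsets), a custom ref_predfun could be sketched as follows:
ref_predfun_manual <- function(fit, newdata = NULL) {
  # rstanarm::posterior_linpred() returns an S_ref x N matrix (draws x
  # observations), so transpose it to the N x S_ref layout expected here:
  t(rstanarm::posterior_linpred(fit, newdata = newdata))
}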
extract_model_data
The function supplied to argument extract_model_data
needs to have the
prototype
extract_model_data(object, newdata, wrhs = NULL, orhs = NULL, extract_y = TRUE)
where:
object
accepts the reference model fit as given in argument object
(but
possibly re-fitted to a subset of the observations, as done in K-fold
CV).
newdata
accepts either NULL
(for using the original dataset, typically
stored in object
) or data for new observations (at least in the form of a
data.frame
).
wrhs
accepts at least either NULL
(for using a vector of ones) or a
right-hand side formula consisting only of the variable in newdata
containing the weights.
orhs
accepts at least either NULL
(for using a vector of zeros) or a
right-hand side formula consisting only of the variable in newdata
containing the offsets.
extract_y
accepts a single logical value indicating whether output element y (see below) shall be NULL (FALSE) or not (TRUE).
The return value of extract_model_data
needs to be a list
with elements
y
, weights
, and offset
, each being a numeric vector containing the data
for the response, the observation weights, and the offsets, respectively. An
exception is that y
may also be NULL
(depending on argument extract_y
),
a non-numeric vector, or a factor
.
The weights and offsets returned by extract_model_data
will be assumed to
hold for the reference model as well as for the submodels.
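For illustration, a hand-written extract_model_data() for a hypothetical dataset dat with response y and neither observation weights nor offsets (so that arguments wrhs and orhs can be ignored) could be sketched as follows; in practice, the internal helper used in the example further below is usually more convenient:
extract_model_data_manual <- function(object, newdata = NULL, wrhs = NULL,
                                      orhs = NULL, extract_y = TRUE) {
  # This sketch assumes `wrhs` and `orhs` always stay NULL (no weights, no offsets):
  if (is.null(newdata)) {
    newdata <- dat  # `dat` is the hypothetical original dataset
  }
  list(
    y = if (extract_y) newdata$y else NULL,
    weights = rep(1, nrow(newdata)),  # vector of ones: no observation weights
    offset = rep(0, nrow(newdata))    # vector of zeros: no offsets
  )
}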
If a custom reference model for an augmented-data projection is needed, see
also extend_family()
.
For the augmented-data projection, the response vector resulting from
extract_model_data
is internally coerced to a factor
(using
as.factor()
). The levels of this factor
have to be identical to
family$cats
(after applying extend_family()
internally; see
extend_family()
's argument augdat_y_unqs
).
Note that response-specific offsets (i.e., one length-N offset vector
per response category) are not supported by projpred yet. So far, only
offsets which are the same across all response categories are supported. This
is why in case of the
brms::categorical()
family, offsets are currently not
supported at all.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the augmented-data projection.
If a custom reference model for a latent projection is needed, see also
extend_family()
.
For the latent projection, family$cats
(after applying extend_family()
internally; see extend_family()
's argument latent_y_unqs
) currently must
not be NULL
if the original (i.e., non-latent) response is a factor
.
Conversely, if family$cats
(after applying extend_family()
) is
non-NULL
, the response vector resulting from extract_model_data
is
internally coerced to a factor
(using as.factor()
). The levels of this
factor
have to be identical to that non-NULL
element family$cats
.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the latent projection.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Define the reference model explicitly:
ref <- get_refmodel(fit)
print(class(ref)) # gives `"refmodel"`
# Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for
# possible post-processing functions. Most of the post-processing functions
# call get_refmodel() internally at the beginning, so you will rarely need
# to call get_refmodel() yourself.
# A custom reference model which may be used in a variable selection where
# the candidate predictors are not a subset of those used for the reference
# model's predictions:
ref_cust <- init_refmodel(
fit,
data = dat_gauss,
formula = y ~ X6 + X7,
family = gaussian(),
extract_model_data = function(object, newdata = NULL, wrhs = NULL,
orhs = NULL, extract_y = TRUE) {
if (!extract_y) {
resp_form <- NULL
} else {
resp_form <- ~ y
}
if (is.null(newdata)) {
newdata <- dat_gauss
}
args <- projpred:::nlist(object, newdata, wrhs, orhs, resp_form)
return(projpred::do_call(projpred:::.extract_model_data, args))
},
cvfun = function(folds) {
kfold(
fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1
)$fits[, "fit"]
},
dis = as.matrix(fit)[, "sigma"]
)
# Now, the post-processing functions mentioned above (for example,
# varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
}
The solution_terms.vsel()
method retrieves the solution path from a
full-data search (vsel
objects are returned by varsel()
or
cv_varsel()
). The solution_terms.projection()
method retrieves the
predictor combination onto which a projection was performed (projection
objects are returned by project()
, possibly as elements of a list
). Both
methods (and hence also the solution_terms()
generic) are deprecated and
will be removed in a future release. Please use ranking()
instead of
solution_terms.vsel()
(ranking()
's output element fulldata
contains the
full-data predictor ranking that is extracted by solution_terms.vsel()
;
ranking()
's output element foldwise
contains the fold-wise predictor
rankings—if available—which were previously not accessible via a built-in
function) and predictor_terms()
instead of solution_terms.projection()
.
solution_terms(object, ...)
## S3 method for class 'vsel'
solution_terms(object, ...)
## S3 method for class 'projection'
solution_terms(object, ...)
object |
The object from which to retrieve the predictor terms. Possible classes may be inferred from the names of the corresponding methods (see also the description). |
... |
Currently ignored. |
A character vector of predictor terms.
This function can suggest an appropriate submodel size based on a decision
rule described in section "Details" below. Note that this decision is quite
heuristic and should be interpreted with caution. It is recommended to
examine the results via plot.vsel()
and/or summary.vsel()
and to make the
final decision based on what is most appropriate for the problem at hand.
suggest_size(object, ...)
## S3 method for class 'vsel'
suggest_size(
object,
stat = "elpd",
pct = 0,
type = "upper",
thres_elpd = NA,
warnings = TRUE,
...
)
object |
An object of class |
... |
Arguments passed to |
stat |
Performance statistic (i.e., utility or loss) used for the
decision. See argument |
pct |
A number giving the proportion (not percents) of the relative null model utility one is willing to sacrifice. See section "Details" below for more information. |
type |
Either |
thres_elpd |
Only relevant if |
warnings |
Mainly for internal use. A single logical value indicating
whether to throw warnings if automatic suggestion fails. Usually there is
no reason to set this to |
In general (beware of special extensions below), the suggested model size is the smallest model size j for which either the lower or upper bound (depending on argument type) of the normal-approximation (or bootstrap; see argument stat) confidence interval (with nominal coverage 1 - alpha; see argument alpha of summary.vsel()) for U_j - U_base (with U_j denoting the j-th submodel's true utility and U_base denoting the baseline model's true utility) falls above (or is equal to) pct * (u_0 - u_base), where u_0 denotes the null model's estimated utility and u_base the baseline model's estimated utility. The baseline model is either the reference model or the best submodel found (see argument baseline of summary.vsel()).
If !is.na(thres_elpd) and stat = "elpd", the decision rule above is extended: The suggested model size is then the smallest model size j fulfilling the rule above or u_j > thres_elpd (with u_j denoting the j-th submodel's estimated utility). Correspondingly, in case of stat = "mlpd" (and !is.na(thres_elpd)), the suggested model size is the smallest model size j fulfilling the rule above or u_j > thres_elpd / N, with N denoting the number of observations.
For example (disregarding the special extensions in case of !is.na(thres_elpd) with stat = "elpd" or stat = "mlpd"), alpha = 2 * pnorm(-1), pct = 0, and type = "upper" means that we select the smallest model size for which the upper bound of the 1 - 2 * pnorm(-1) (approximately 68.3%) confidence interval for U_j - U_base exceeds (or is equal to) zero, that is (if stat is a performance statistic for which the normal approximation is used, not the bootstrap), for which the submodel's utility estimate is at most one standard error smaller than the baseline model's utility estimate (with that standard error referring to the utility difference).
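As a toy numerical illustration of this default rule (hypothetical numbers, not output of any projpred function):
diff_est <- c(-10.2, -3.1, -0.8, 0.1)  # estimated ELPD differences to the baseline, sizes 0 to 3
diff_se <- c(2.5, 1.4, 1.1, 0.9)       # standard errors of these differences
# With alpha = 2 * pnorm(-1), the upper bound is simply the estimate plus one SE:
upper <- diff_est + qnorm(1 - 2 * pnorm(-1) / 2) * diff_se
which(upper >= 0)[1] - 1  # smallest size with upper bound >= 0, here: 2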
Apart from the two summary.vsel()
arguments mentioned above (alpha
and
baseline
), resp_oscale
is another important summary.vsel()
argument
that may be passed via ...
.
A single numeric value, giving the suggested submodel size (or NA
if the suggestion failed).
The intercept is not counted by suggest_size()
, so a suggested size of
zero stands for the intercept-only model.
Loss statistics like the root mean squared error (RMSE) and the mean
squared error (MSE) are converted to utilities by multiplying them by -1
,
so a call such as suggest_size(object, stat = "rmse", type = "upper")
finds the smallest model size whose upper confidence interval bound for the
negative RMSE or MSE exceeds the cutoff (or, equivalently, has the lower
confidence interval bound for the RMSE or MSE below the cutoff). This is
done to make the interpretation of argument type
the same regardless of
argument stat
.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation and with small values for
# `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
seed = 5555)
print(suggest_size(vs))
}
This is the summary()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). Apart from some general information about the varsel()
or
cv_varsel()
run, it shows the full-data predictor ranking, basic
information about the (CV) variability in the ranking of the predictors (if
available; inferred from cv_proportions()
), and estimates for
user-specified predictive performance statistics. For a graphical
representation, see plot.vsel()
.
## S3 method for class 'vsel'
summary(
object,
nterms_max = NULL,
stats = "elpd",
type = c("mean", "se", "diff", "diff.se"),
deltas = FALSE,
alpha = 2 * pnorm(-1),
baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best",
resp_oscale = TRUE,
cumulate = FALSE,
...
)
object |
An object of class |
nterms_max |
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using |
stats |
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
cross-validation, these are all observations because they are partitioned
into multiple test sets; in case of
|
type |
One or more items from |
deltas |
If |
alpha |
A number determining the (nominal) coverage |
baseline |
For |
resp_oscale |
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( |
cumulate |
Passed to argument |
... |
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument |
The stats
options "mse"
and "rmse"
are only available for:
the traditional projection,
the latent projection with resp_oscale = FALSE
,
the latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
The stats
option "acc"
(= "pctcorr"
) is only available for:
the binomial()
family in case of the traditional projection,
all families in case of the augmented-data projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
,
all families (on the original response scale) in case of the latent
projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being not NULL
.
The stats
option "auc"
is only available for:
the binomial()
family in case of the traditional projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
An object of class vselsummary
.
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation and with small values for
# `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
seed = 5555)
print(summary(vs), digits = 1)
}
Run the search part and the evaluation part for a projection predictive variable selection. The search part determines the solution path, i.e., the best submodel for each submodel size (number of predictor terms). The evaluation part determines the predictive performance of the submodels along the solution path.
varsel(object, ...)
## Default S3 method:
varsel(object, ...)
## S3 method for class 'refmodel'
varsel(
object,
d_test = NULL,
method = NULL,
ndraws = NULL,
nclusters = 20,
ndraws_pred = 400,
nclusters_pred = NULL,
refit_prj = !inherits(object, "datafit"),
nterms_max = NULL,
verbose = TRUE,
lambda_min_ratio = 1e-05,
nlambda = 150,
thresh = 1e-06,
regul = 1e-04,
penalty = NULL,
search_terms = NULL,
seed = NA,
...
)
object |
An object of class |
... |
Arguments passed to |
d_test |
A |
method |
The method for the search part. Possible options are |
ndraws |
Number of posterior draws used in the search part. Ignored if
|
nclusters |
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of |
ndraws_pred |
Only relevant if |
nclusters_pred |
Only relevant if |
refit_prj |
A single logical value indicating whether to fit the
submodels along the solution path again ( |
nterms_max |
Maximum submodel size (number of predictor terms) up to
which the search is continued. If |
verbose |
A single logical value indicating whether to print out additional information during the computations. |
lambda_min_ratio |
Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large submodels are explored. No need to change this unless the program gives a warning about this. |
nlambda |
Only relevant for L1 search. Number of values in the lambda grid for L1-penalized search. No need to change this unless the program gives a warning about this. |
thresh |
Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this. |
regul |
A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually there is no need for regularization, but sometimes we need to add some regularization to avoid numerical problems. |
penalty |
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of |
search_terms |
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( |
seed |
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument |
Arguments ndraws
, nclusters
, nclusters_pred
, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1
for datafit
s). Using fewer draws or clusters in ndraws, nclusters, nclusters_pred, or ndraws_pred than posterior draws in the reference model may result in slightly inaccurate projection performance. Increasing these arguments affects the computation time linearly.
For argument method
, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmented-data projection, only the forward search is available.
Furthermore, argument search_terms
requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select interaction terms before the corresponding main terms are selected. If this is undesired, choose the forward search instead.
The elements of the search_terms
character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the +
symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen
which will
later contain already selected predictor terms. Then, the search iterates
over increasing model sizes j. The candidate models at model size j are constructed from those elements from search_terms which yield model size j when combined with the chosen predictor terms. Note that sometimes, there may be no candidate models for model size j. Also note that internally, search_terms is expanded to include the intercept ("1"), so the first step of the search (model size 1) always consists of the intercept-only model as the only candidate.
As a search_terms
example, consider a reference model with formula y ~ x1 + x2 + x3
. Then, to ensure that x1
is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
. This search would start with y ~ 1
as the only
candidate at model size 1. At model size 2, y ~ x1
would be the only
candidate. At model size 3, y ~ x1 + x2
and y ~ x1 + x3
would be the
two candidates. At the last model size of 4, y ~ x1 + x2 + x3
would be
the only candidate. As another example, to exclude x1
from the search,
specify search_terms = c("x2", "x3", "x2 + x3")
.
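As a hedged sketch of the first example, adapted to the example reference model fit with predictors X1, ..., X5 used elsewhere in this documentation (note the explicit method = "forward", which search_terms requires to take effect):
if (requireNamespace("rstanarm", quietly = TRUE)) {
  vs_forced <- varsel(
    fit, method = "forward", nterms_max = 3, nclusters = 5,
    nclusters_pred = 10, seed = 5555,
    search_terms = c("X1", "X1 + X2", "X1 + X3", "X1 + X4", "X1 + X5")
  )
  print(ranking(vs_forced)$fulldata)  # should start with "X1" by construction
}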
An object of class vsel
. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpred-package).
d_test
If not NULL
, then d_test
needs to be a list
with the following
elements:
data
: a data.frame
containing the predictor variables for the test set.
offset
: a numeric vector containing the offset values for the test set
(if there is no offset, use a vector of zeros).
weights
: a numeric vector containing the observation weights for the test
set (if there are no observation weights, use a vector of ones).
y
: a vector or a factor
containing the response values for the test
set. In case of the latent projection, this has to be a vector containing the
latent response values, but it can also be a vector full of NA
s if
latent-scale post-processing is not needed.
y_oscale
: Only needs to be provided in case of the latent projection
where this needs to be a vector or a factor
containing the original
(i.e., non-latent) response values for the test set.
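As a structural sketch of such a d_test list (using the first 20 rows of the example data dat_gauss from the examples as a hypothetical test set; the reference model in the example below is fitted to all observations, so this only illustrates the required format):
if (requireNamespace("rstanarm", quietly = TRUE)) {
  idx_test <- 1:20
  d_test_manual <- list(
    data = dat_gauss[idx_test, -1, drop = FALSE],  # predictor columns only
    offset = rep(0, length(idx_test)),             # no offsets
    weights = rep(1, length(idx_test)),            # no observation weights
    y = dat_gauss$y[idx_test]
  )
  # This list could then be passed to varsel() via `d_test = d_test_manual`.
}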
if (requireNamespace("rstanarm", quietly = TRUE)) {
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation and with small values for
# `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
seed = 5555)
# Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
# and `?ranking` for possible post-processing functions.
}