Title: | Covariance Regression with Random Forests |
---|---|
Description: | Covariance Regression with Random Forests (CovRegRF) is a random forest method for estimating the covariance matrix of a multivariate response given a set of covariates. Random forest trees are built with a new splitting rule designed to maximize the distance between the sample covariance matrix estimates of the child nodes. The method is described in Alakus et al. (2023) <doi:10.1186/s12859-023-05377-y>. 'CovRegRF' uses the 'randomForestSRC' package (Ishwaran and Kogalur, 2022) <https://cran.r-project.org/package=randomForestSRC>, frozen at version 3.1.0. The proposed splitting rule is applied through the custom splitting rule feature of 'randomForestSRC'. The 'randomForestSRC' package implements 'OpenMP' by default, contingent upon the support provided by the target architecture and operating system. In this package, the 'LAPACK' and 'BLAS' libraries are used for matrix decompositions. |
Authors: | Cansu Alakus [aut, cre], Denis Larocque [aut], Aurelie Labbe [aut], Hemant Ishwaran [ctb] (Author of included 'randomForestSRC' codes), Udaya B. Kogalur [ctb] (Author of included 'randomForestSRC' codes), Intel Corporation [cph] (Copyright holder of included LAPACKE codes), Keita Teranishi [ctb] (Author of included cblas_dgemm.c codes) |
Maintainer: | Cansu Alakus <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0.0 |
Built: | 2024-11-04 03:33:33 UTC |
Source: | https://github.com/calakus/covregrf |
Covariance Regression with Random Forests (CovRegRF) is a random forest method for estimating the covariance matrix of a multivariate response given a set of covariates. Random forest trees are built with a new splitting rule designed to maximize the distance between the sample covariance matrix estimates of the child nodes. The method is described in Alakus et al. (2023). CovRegRF uses the 'randomForestSRC' package (Ishwaran and Kogalur, 2022), frozen at version 3.1.0, and applies the proposed splitting rule through its custom splitting rule feature.
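To make the idea of the splitting rule concrete, the following base R sketch scores candidate splits of a single covariate by the distance between the sample covariance matrices of the two child nodes. It is an illustration only, not the package's implementation: the distance (sum of absolute differences over the upper triangle, diagonal included), the sqrt(nL * nR) weighting, and the helper names cov_distance and split_score are assumptions made for this example.

```r
# Toy illustration of a covariance-based split criterion.
# ASSUMPTION: the distance is the sum of absolute differences over the upper
# triangle (diagonal included), weighted by sqrt(nL * nR). The helper names
# are hypothetical and not part of CovRegRF.
cov_distance <- function(S1, S2) {
  idx <- upper.tri(S1, diag = TRUE)
  sum(abs(S1[idx] - S2[idx]))
}

split_score <- function(Y, x, threshold) {
  left <- x <= threshold
  nL <- sum(left)
  nR <- sum(!left)
  # Each child needs at least two observations for a sample covariance matrix
  if (nL < 2 || nR < 2) return(NA_real_)
  sqrt(nL * nR) * cov_distance(cov(Y[left, , drop = FALSE]),
                               cov(Y[!left, , drop = FALSE]))
}

set.seed(1)
n <- 200
x <- rnorm(n)
# The variance of y1 depends on the sign of x; y2 is unrelated to x
Y <- cbind(y1 = rnorm(n, sd = ifelse(x <= 0, 1, 3)),
           y2 = rnorm(n))

# Score candidate thresholds at the deciles of x and keep the best one
candidates <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
scores <- vapply(candidates, function(thr) split_score(Y, x, thr), numeric(1))
best_threshold <- candidates[which.max(scores)]
```

A tree grown this way prefers splits that separate observations with different covariance structure, rather than different means.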
covregrf
predict.covregrf
significance.test
vimp.covregrf
plot.vimp.covregrf
print.covregrf
Alakus, C., Larocque, D., and Labbe, A. (2023). Covariance regression with random forests. BMC Bioinformatics 24, 258.
Ishwaran, H. and Kogalur, U. B. (2022). Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). R package version 3.1.0, https://cran.r-project.org/package=randomForestSRC.
Estimates the covariance matrix of a multivariate response given a set of covariates using a random forest framework.
covregrf(
  formula,
  data,
  params.rfsrc = list(ntree = 1000, mtry = ceiling(px/3), nsplit = max(round(n/50), 10)),
  nodesize.set = round(0.5^(1:100) * sampsize)[round(0.5^(1:100) * sampsize) > py],
  importance = FALSE
)
formula |
Object of class formula describing the model to fit. |
data |
The multivariate data set containing the covariates and the response variables. |
params.rfsrc |
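List of parameters that should be passed to randomForestSRC. The defaults are ntree = 1000, mtry = ceiling(px/3) and nsplit = max(round(n/50), 10). |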
nodesize.set |
The set of nodesize levels for tuning. |
importance |
Should variable importance of covariates be assessed? The default is FALSE. |
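The default nodesize.set expression in the usage can be evaluated by hand. The sketch below wraps it in a small helper; default_nodesize_set is a hypothetical name, and sampsize is assumed to be round(0.632 * n), as written in the significance.test() default.

```r
# Reproduce the default nodesize.set expression from the usage.
# ASSUMPTION: sampsize = round(0.632 * n), as in the significance.test() default.
# default_nodesize_set is a hypothetical helper, not a package function.
default_nodesize_set <- function(n, py) {
  sampsize <- round(0.632 * n)
  levels <- round(0.5^(1:100) * sampsize)
  # Keep only the levels larger than the number of response variables
  levels[levels > py]
}

# n = 200 observations and py = 3 responses
default_nodesize_set(n = 200, py = 3)
# 63 32 16 8 4
```

The candidate levels repeatedly halve the sub-sample size and stop before they reach the number of response variables.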
An object of class (covregrf, grow), which is a list with the following components:
predicted.oob |
OOB predicted covariance matrices for training observations. |
importance |
Variable importance measures (VIMP) for covariates. |
best.nodesize |
Best nodesize value selected with the proposed tuning method. |
params.rfsrc |
List of parameters that were used to fit the random forest with randomForestSRC. |
n |
Sample size of the data. |
xvar.names |
A character vector of the covariate names. |
yvar.names |
A character vector of the response variable names. |
xvar |
Data frame of covariates. |
yvar |
Data frame of responses. |
rf.grow |
Fitted random forest object. This object is used for prediction with training or new data. |
For mean regression problems, random forests search for the optimal level of the nodesize parameter by using out-of-bag (OOB) prediction errors, computed as the difference between the true responses and the OOB predictions. The nodesize value with the smallest OOB prediction error is chosen. However, the covariance regression problem is unsupervised by nature, so we tune the nodesize parameter with a heuristic method based on the OOB covariance matrix estimates. The general idea of the proposed tuning method is to find the nodesize level where the OOB covariance matrix predictions converge. The steps are as follows. First, we train separate random forests for a set of nodesize values. Second, we compute the OOB covariance matrix estimates for each random forest. Next, we compute the mean absolute difference (MAD) between the upper triangular OOB covariance matrix estimates of two consecutive nodesize levels over all observations. Finally, we take the pair of nodesize levels having the smallest MAD. Of these two nodesize levels, we select the smaller one, since deeper trees are generally desired in random forests.
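The selection step can be sketched in base R on hypothetical inputs. The code below is not the package's code: select_nodesize and upper_vec are illustrative names, the input is assumed to be a named list with one entry per nodesize level (ordered from largest to smallest), and the upper triangle is assumed to include the diagonal.

```r
# Toy sketch of the nodesize selection step.
# ASSUMPTIONS: oob_by_nodesize is a named list (names are nodesize levels,
# ordered from largest to smallest); each entry is a list holding one OOB
# covariance matrix per observation. The upper triangle includes the diagonal.
upper_vec <- function(S) S[upper.tri(S, diag = TRUE)]

select_nodesize <- function(oob_by_nodesize) {
  nodesizes <- as.numeric(names(oob_by_nodesize))
  # One row per observation, one column per upper-triangular element
  flat <- lapply(oob_by_nodesize, function(mats) t(sapply(mats, upper_vec)))
  # MAD between consecutive nodesize levels, averaged over all
  # observations and matrix elements
  mad <- vapply(seq_len(length(flat) - 1),
                function(k) mean(abs(flat[[k]] - flat[[k + 1]])),
                numeric(1))
  k_best <- which.min(mad)
  # Of the two levels in the best pair, keep the smaller nodesize
  list(mad = mad,
       best.nodesize = min(nodesizes[k_best], nodesizes[k_best + 1]))
}

# Hypothetical estimates for 3 observations at nodesize levels 32, 16 and 8
make_mats <- function(scale) lapply(1:3, function(i) scale * diag(2))
oob <- list("32" = make_mats(1.0), "16" = make_mats(1.5), "8" = make_mats(1.6))
res <- select_nodesize(oob)
res$best.nodesize
```

In this toy input the estimates change least between levels 16 and 8, so the smaller level of that pair, 8, is selected.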
predict.covregrf
significance.test
vimp.covregrf
print.covregrf
## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50), importance = TRUE)

## get the OOB predictions
pred.oob <- covregrf.obj$predicted.oob

## predict with new test data
pred.obj <- predict(covregrf.obj, newdata = testdata)
pred <- pred.obj$predicted

## get the variable importance measures
vimp <- covregrf.obj$importance
A generated data set containing two multivariate data sets, X and Y, which represent the set of covariates and responses, respectively. The covariance matrix of Y has a compound symmetry structure with heterogeneous variances. Both variances and correlations are functions of the covariates. X variables are generated from the standard normal distribution. The correlations are generated with a logit model and the variances are functions of these generated correlations. The sample size is 200. There are 3 covariates and 3 response variables. x1 and x2 are the important variables for the varying covariance matrix of Y. x3 is the noise variable.
data
A list with two elements, X and Y. Each element has 200 rows. X has 3 columns and Y has 3 columns.
## load generated example data
data(data, package = "CovRegRF")
Plots variable importance measures (VIMP) for covariates for training data.
## S3 method for class 'covregrf'
plot.vimp(x, sort = TRUE, ndisp = NULL, ...)
x |
An object of class (covregrf, grow) or (covregrf, vimp). |
sort |
Should the covariates be sorted according to their variable importance measures in the plot? The default is TRUE. |
ndisp |
Number of covariates to display in the plot. The default is NULL. |
... |
Optional arguments to be passed to other methods. |
Invisibly, the variable importance measures that were plotted.
## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50), importance = TRUE)

## plot vimp
plot.vimp(covregrf.obj)
Obtain predicted covariance matrices using a covregrf forest for training or new data.
## S3 method for class 'covregrf'
predict(object, newdata, ...)
object |
An object of class (covregrf, grow). |
newdata |
Test data of the set of covariates. A data.frame with numeric values and factors. If missing, the out-of-bag predictions in object are returned. |
... |
Optional arguments to be passed to other methods. |
An object of class (covregrf, predict), which is a list with the following components:
predicted |
Predicted covariance matrices for test data. If newdata is missing, the OOB predictions for the training observations. |
bop |
Bag of Observations for Prediction. An |
n |
Sample size of the test data. |
xvar.names |
A character vector of the covariate names. |
yvar.names |
A character vector of the response variable names. |
covregrf
vimp.covregrf
print.covregrf
## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50))

## predict without new data (OOB predictions will be returned)
pred.obj <- predict(covregrf.obj)
pred.oob <- pred.obj$predicted

## predict with new test data
pred.obj2 <- predict(covregrf.obj, newdata = testdata)
pred <- pred.obj2$predicted
Print summary output of a CovRegRF analysis. This is the default print method for the package.
## S3 method for class 'covregrf'
print(x, ...)
x |
An object of class (covregrf, grow) or (covregrf, predict). |
... |
Optional arguments to be passed to other methods. |
Returns a character string summarizing the CovRegRF analysis.
## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50))

## print the grow object
print(covregrf.obj)

## predict with new test data
pred.obj <- predict(covregrf.obj, newdata = testdata)

## print the predict object
print(pred.obj)
This function runs a permutation test to evaluate the effect of a subset of covariates on the covariance matrix estimates. Returns an estimated p-value.
significance.test(
  formula,
  data,
  params.rfsrc = list(ntree = 1000, mtry = ceiling(px/3), nsplit = max(round(n/50), 10)),
  nodesize.set = round(0.5^(1:100) * round(0.632 * n))[round(0.5^(1:100) * round(0.632 * n)) > py],
  nperm = 500,
  test.vars = NULL
)
formula |
Object of class formula describing the model to fit. |
data |
The multivariate data set containing the covariates and the response variables. |
params.rfsrc |
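List of parameters that should be passed to randomForestSRC. The defaults are ntree = 1000, mtry = ceiling(px/3) and nsplit = max(round(n/50), 10). |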
nodesize.set |
The set of nodesize levels for tuning. |
nperm |
Number of permutations. The default is 500. |
test.vars |
Subset of covariates whose effect on the covariance matrix estimates will be evaluated. A character vector defining the names of the covariates. The default is NULL. |
An object of class (covregrf, significancetest), which is a list with the following components:
pvalue |
Estimated *p*-value, see below for details. |
best.nodesize |
Best nodesize value selected with the proposed tuning method using all covariates, including test.vars. |
best.nodesize.control |
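Best nodesize value selected with the proposed tuning method using only the set of controlling covariates. |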
test.vars |
Covariates whose effect on the covariance matrix estimates is evaluated. |
control.vars |
Controlling set of covariates. |
predicted.oob |
OOB predicted covariance matrices for training observations using all covariates including the test.vars. |
predicted.perm |
Predicted covariance matrices for the permutations using all covariates including the test.vars. |
predicted.oob.control |
OOB predicted covariance matrices for training
observations using only the set of controlling covariates. If
|
predicted.perm.control |
Predicted covariance matrices for the
permutations using only the set of controlling covariates. If
|
We perform a hypothesis test to evaluate the effect of a subset of covariates on the covariance matrix estimates, while controlling for the rest of the covariates. Define the conditional covariance matrix of $Y$ given all $X$ variables as $\Sigma_{X}$, and the conditional covariance matrix of $Y$ given only the set of controlling $X$ variables as $\Sigma_{X^c}$. If a subset of covariates has an effect on the covariance matrix estimates obtained with the proposed method, then $\Sigma_{X}$ should be significantly different from $\Sigma_{X^c}$.

We conduct a permutation test for the null hypothesis

$$H_0 : \Sigma_{X} = \Sigma_{X^c}.$$

We estimate a $p$-value with the permutation test. If the $p$-value is less than the pre-specified significance level $\alpha$, we reject the null hypothesis.

Testing the global effect of the covariates on the conditional covariance estimates is a particular case of the proposed significance test. Define the unconditional covariance matrix estimate of $Y$ as $\Sigma_{root}$, which is computed as the sample covariance matrix of $Y$, and the conditional covariance matrix of $Y$ given $X$ as $\Sigma_{X}$, which is obtained with covregrf(). If there is a global effect of $X$ on the covariance matrix estimates, then $\Sigma_{X}$ should be significantly different from $\Sigma_{root}$.

The null hypothesis for this particular case is

$$H_0 : \Sigma_{X} = \Sigma_{root}.$$
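A usage sketch for significance.test() follows, built from the usage signature and the example data set documented in this manual. It is a hedged illustration: the small ntree and nperm values are arbitrary choices to shorten the run time, and the tested covariate is taken to be the last column of X (described as the noise variable x3, assuming the columns are ordered x1, x2, x3). The block runs only if the CovRegRF package is installed.

```r
# Hedged usage sketch; runs only if the CovRegRF package is installed.
if (requireNamespace("CovRegRF", quietly = TRUE)) {
  library(CovRegRF)

  ## load generated example data
  data(data, package = "CovRegRF")
  xvar.names <- colnames(data$X)
  yvar.names <- colnames(data$Y)
  data1 <- data.frame(data$X, data$Y)

  ## formula object
  formula <- as.formula(paste(paste(yvar.names, collapse = "+"), ".", sep = " ~ "))

  ## test the effect of the last covariate while controlling for the others
  ## (ntree and nperm are kept small only to shorten the run time)
  set.seed(2345)
  sig.obj <- significance.test(formula, data1,
                               params.rfsrc = list(ntree = 50),
                               nperm = 10,
                               test.vars = xvar.names[length(xvar.names)])

  ## estimated p-value
  sig.obj$pvalue
}
```

With so few permutations the estimated p-value is coarse; the documented default of nperm = 500 gives a finer resolution at a higher computational cost.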
covregrf
predict.covregrf
print.covregrf
Calculates variable importance measures (VIMP) for covariates for training data.
## S3 method for class 'covregrf'
vimp(object, ...)
object |
An object of class (covregrf, grow). |
... |
Optional arguments to be passed to other methods. |
An object of class (covregrf, vimp), which is a list with the following component:
importance |
Variable importance measures (VIMP) for covariates. |
## load generated example data
data(data, package = "CovRegRF")
xvar.names <- colnames(data$X)
yvar.names <- colnames(data$Y)
data1 <- data.frame(data$X, data$Y)

## define train/test split
set.seed(2345)
smp <- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
traindata <- data1[smp,,drop=FALSE]
testdata <- data1[-smp, xvar.names, drop=FALSE]

## formula object
formula <- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))

## train covregrf
covregrf.obj <- covregrf(formula, traindata, params.rfsrc = list(ntree = 50), importance = TRUE)

## get the variable importance measures
vimp <- covregrf.obj$importance
vimp2 <- vimp(covregrf.obj)$importance