| Title: | Fast Calculation of Feature Contributions in Boosting Trees |
|---|---|
| Description: | Computes feature-specific R-squared (R2) contributions for boosting tree models using a Shapley-value-based decomposition of the total R-squared in polynomial time. Supports models fitted with 'XGBoost', 'LightGBM', and 'CatBoost', with optimized backend-specific implementations and cached tree summaries suitable for large-scale problems. Multiple visualization tools are included for interpreting and communicating feature contributions. The methodology is described in Jiang, Zhang, and Zhang (2025) <doi:10.48550/arXiv.2407.03515>. Optional 'CatBoost' support uses the R package 'catboost', which is not distributed on CRAN; installation instructions and released binaries are provided by the CatBoost project at <https://catboost.ai/docs/en/concepts/r-installation>. |
| Authors: | Steven He [aut], Zhongli Jiang [aut, cre], Dabao Zhang [aut] |
| Maintainer: | Zhongli Jiang <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.0.1 |
| Built: | 2026-05-13 14:38:03 UTC |
| Source: | https://github.com/catstats/q-shap_r |
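
To make the idea of a Shapley-value-based decomposition of R-squared concrete, here is a minimal base-R sketch on a three-feature linear model. It is not the package's algorithm: it refits a least-squares model for every feature subset (exponential cost), whereas the description above refers to a polynomial-time method for tree ensembles. The simulated data and the helper `r2_subset()` are hypothetical and exist only for illustration.

```r
# Brute-force Shapley decomposition of R-squared for a small linear model.
# Illustrative only: exponential in the number of features.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), nrow = n,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- X[, "x1"] - X[, "x2"] + rnorm(n, sd = 0.5)
p <- ncol(X)

# Value function (hypothetical helper): R-squared of a least-squares fit
# that uses only the columns in `cols`; the empty set has R-squared 0.
r2_subset <- function(cols) {
  if (length(cols) == 0) return(0)
  fit <- lm.fit(cbind(1, X[, cols, drop = FALSE]), y)
  1 - sum(fit$residuals^2) / sum((y - mean(y))^2)
}

phi <- setNames(numeric(p), colnames(X))
for (j in seq_len(p)) {
  others <- setdiff(seq_len(p), j)
  # Enumerate every subset S of the other features with a bitmask.
  for (mask in 0:(2^length(others) - 1)) {
    S <- others[bitwAnd(mask, 2^(seq_along(others) - 1)) > 0]
    # Shapley weight |S|! (p - |S| - 1)! / p!
    w <- factorial(length(S)) * factorial(p - length(S) - 1) / factorial(p)
    phi[j] <- phi[j] + w * (r2_subset(c(S, j)) - r2_subset(S))
  }
}

total_r2 <- r2_subset(seq_len(p))
print(round(phi, 3))
print(round(total_r2, 3))
```

The feature-level values in `phi` add up to `total_r2`, which is the property that lets a total R-squared be reported as a sum of feature-specific contributions.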
Coercion method to data.frame for qshap_result

    ## S3 method for class 'qshap_result'
    as.data.frame(x, row.names = NULL, optional = FALSE, ...)

| Argument | Description |
|---|---|
| x | A qshap_result object |
| row.names | Not used |
| optional | Not used |
| ... | Additional arguments (currently unused) |

A data.frame with columns feature (character) and
rsq (numeric), sorted by rsq in decreasing order.
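
The shape of that return value can be sketched in base R. The numbers and feature names below are made up, and the code only mimics the documented layout (a character `feature` column and a numeric `rsq` column sorted in decreasing order); it does not call the package.

```r
# Hypothetical feature-level R-squared values (not output from a real model).
rsq_values    <- c(0.05, 0.40, 0.30)
feature_names <- c("x3", "x1", "x2")

# Build the two documented columns, then sort by rsq in decreasing order.
df <- data.frame(feature = feature_names, rsq = rsq_values,
                 stringsAsFactors = FALSE)
df <- df[order(df$rsq, decreasing = TRUE), ]
rownames(df) <- NULL
print(df)
```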
Creates an explainer object for computing feature-specific Shapley values from a trained tree ensemble model. Supports XGBoost, LightGBM, and CatBoost models.

    gazer(model, max_depth = NULL, base_score = NULL, ...)

| Argument | Description |
|---|---|
| model | A model object of class |
| max_depth | Maximum depth of trees, extracted from |
| base_score | Base score for predictions, extracted from |
| ... | Additional arguments, for future use |

An object of class qshap_tree_explainer containing the model information and
preprocessed tree structures for fast Shapley value computation.

    library(xgboost)
    set.seed(42)
    n <- 100
    p <- 100
    X <- matrix(rnorm(n * p), nrow = n, ncol = p)
    y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.2)
    model <- xgboost(X, y, nrounds = 15, max_depth = 2, verbose = 0)
    explainer <- gazer(model)

This is a convenience alias for qshap_loss() that provides a shorter
function name for calculating feature-specific loss contributions.

    loss(explainer, x, y, y_mean_ori = NULL)

| Argument | Description |
|---|---|
| explainer | A qshap_tree_explainer object created by |
| x | Feature matrix or data frame |
| y | Response vector |
| y_mean_ori | Optional pre-computed mean of y (for efficiency) |

A matrix of loss contributions with dimensions (n_samples, n_features)

    library(xgboost)
    set.seed(42)
    n <- 100
    p <- 100
    X <- matrix(rnorm(n * p), nrow = n, ncol = p)
    y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.2)
    model <- xgboost(X, y, nrounds = 15, max_depth = 2, verbose = 0)
    explainer <- gazer(model)
    loss_matrix <- loss(explainer, X, y)
    dim(loss_matrix)

This S3 method enables 'plot(x, ...)' where 'x' is a 'qshap_rsq' object. It dispatches to the visualization functions in 'vis'.

    ## S3 method for class 'qshap_rsq'
    plot(
      x,
      y = NULL,
      type = c("rsq", "elbow", "cumu", "gcorr", "hist", "density", "loss"),
      ...
    )

| Argument | Description |
|---|---|
| x | A 'qshap_rsq' object. |
| y | Not used. |
| type | Plot type: one of "rsq", "elbow", "cumu", "gcorr", "hist", "density", or "loss". |
| ... | Passed to the underlying visualization function. |

A ggplot2 object (invisibly).
Print method for qshap_result

    ## S3 method for class 'qshap_result'
    print(x, n = 10, ...)

| Argument | Description |
|---|---|
| x | A qshap_result object |
| n | Integer number of top features to display (default: 10) |
| ... | Additional arguments (currently unused) |

The input x is returned invisibly. Called primarily for its
side effect of printing a summary of the qshap_result object to the
console.
Print method for qshap_tree_explainer

    ## S3 method for class 'qshap_tree_explainer'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A qshap_tree_explainer object |
| ... | Additional arguments (currently unused) |

The input x is returned invisibly. Called primarily for its
side effect of printing a summary of the qshap_tree_explainer object
to the console.
Print method for simple_tree

    ## S3 method for class 'simple_tree'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A simple_tree object |
| ... | Additional arguments (currently unused) |

The input x is returned invisibly. Called primarily for its
side effect of printing a summary of the simple_tree object to the
console.
Print method for tree_summary

    ## S3 method for class 'tree_summary'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A tree_summary object |
| ... | Additional arguments (currently unused) |

The input x is returned invisibly. Called primarily for its
side effect of printing a summary of the tree_summary object to the
console.
This is a convenience alias for rsq() that provides a shorter
function name for calculating feature-specific R-squared values.

    qshap(
      explainer,
      x,
      y,
      feature_names = NULL,
      local = FALSE,
      nsample = NULL,
      sd_out = TRUE,
      ci_out = TRUE,
      level = 0.95,
      nfrac = NULL,
      random_state = 42,
      ncore = 1L
    )

| Argument | Description |
|---|---|
| explainer | A qshap_tree_explainer object created by |
| x | Feature matrix or data frame with n samples and p features |
| y | Response vector of length n |
| feature_names | Character vector of feature names. If NULL, uses column names from x. |
| local | Logical; if TRUE, returns both R-squared values and loss matrix |
| nsample | Optional integer; number of samples to use (random subsample if less than nrow(x)) |
| sd_out | Logical; if TRUE, returns standard deviations of R-squared estimates |
| ci_out | Logical; if TRUE, returns Wald-style confidence intervals for each feature's R-squared (normal approximation using sd_rsq) |
| level | Confidence level for the intervals (default 0.95) |
| nfrac | Optional numeric in (0,1); fraction of samples to use (alternative to nsample) |
| random_state | Integer seed for reproducible sampling |
| ncore | Number of cores for parallel processing. Use -1 for all available cores, or a positive integer. Default is 1 (no parallelization) |

A qshap_result object; see rsq for details.

    library(xgboost)
    set.seed(42)
    n <- 100
    p <- 100
    X <- matrix(rnorm(n * p), nrow = n, ncol = p)
    y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.2)
    model <- xgboost(X, y, nrounds = 15, max_depth = 2, verbose = 0)
    explainer <- gazer(model)
    phi_rsq <- qshap(explainer, X, y)
    print(phi_rsq)

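
The Wald-style interval mentioned for `ci_out` and `level` can be sketched in base R as follows. This is a generic normal-approximation interval, assuming that `sd_rsq` plays the role of a standard error for each estimate; the input numbers are hypothetical and the package's exact computation may differ.

```r
# Hypothetical estimates and their standard deviations (not real output).
rsq_hat <- c(x1 = 0.40, x2 = 0.30, x3 = 0.05)
sd_rsq  <- c(x1 = 0.04, x2 = 0.05, x3 = 0.03)
level   <- 0.95

# Two-sided normal quantile for the requested confidence level.
z <- qnorm(1 - (1 - level) / 2)

# Wald interval: estimate plus or minus z times the standard deviation.
ci <- cbind(lower = rsq_hat - z * sd_rsq,
            upper = rsq_hat + z * sd_rsq)
print(round(ci, 3))
```

Nothing in this sketch keeps the bounds inside [0, 1], so a lower bound can fall below zero for weak features.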
User-friendly constructor for qshap_result

    qshap_result(
      rsq,
      feature_names = NULL,
      total_rsq = NULL,
      n_samples = NULL,
      n_features = NULL,
      loss = NULL
    )

| Argument | Description |
|---|---|
| rsq | Numeric vector of feature-specific R-squared values |
| feature_names | Character vector of feature names (optional) |
| total_rsq | Numeric total R-squared (sum of feature-specific values) |
| n_samples | Integer number of samples used |
| n_features | Integer number of features |
| loss | Optional loss matrix (n_samples x n_features) |

A validated qshap_result object
Computes feature-specific R-squared values using Q-SHAP decomposition,
returning a qshap_result object with better formatting and additional metadata.
The qshap_result object includes feature names, total R², sample counts,
and provides enhanced print(), summary(), and as.data.frame()
methods for easier analysis.

    rsq(
      explainer,
      x,
      y,
      feature_names = NULL,
      local = FALSE,
      nsample = NULL,
      sd_out = TRUE,
      ci_out = TRUE,
      level = 0.95,
      nfrac = NULL,
      random_state = 42,
      ncore = 1L
    )

| Argument | Description |
|---|---|
| explainer | A qshap_tree_explainer object created by |
| x | Feature matrix or data frame with n samples and p features |
| y | Response vector of length n |
| feature_names | Character vector of feature names. If NULL, uses column names from x. |
| local | Logical; if TRUE, returns both R-squared values and loss matrix |
| nsample | Optional integer; number of samples to use (random subsample if less than nrow(x)) |
| sd_out | Logical; if TRUE, returns standard deviations of R-squared estimates |
| ci_out | Logical; if TRUE, returns Wald-style confidence intervals for each feature's R-squared (normal approximation using sd_rsq) |
| level | Confidence level for the intervals (default 0.95) |
| nfrac | Optional numeric in (0,1); fraction of samples to use (alternative to nsample) |
| random_state | Integer seed for reproducible sampling |
| ncore | Number of cores for parallel processing. Use -1 for all available cores, or a positive integer. Default is 1 (no parallelization) |

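
The `ncore` convention (a positive integer, or -1 for all available cores) could be resolved along the following lines. `resolve_ncore()` is a hypothetical helper written for this sketch, and capping the request at the detected core count is an assumption rather than documented package behavior.

```r
# Hypothetical helper: turn an `ncore` argument into a usable worker count.
resolve_ncore <- function(ncore = 1L) {
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L   # detectCores() can return NA
  if (ncore == -1L) return(as.integer(available))
  stopifnot(ncore >= 1)
  # Assumption: never request more workers than the machine reports.
  as.integer(min(ncore, available))
}

resolve_ncore(1L)    # single core, no parallelization
resolve_ncore(-1L)   # all cores reported by the machine
```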
This function provides a user-friendly interface for Q-SHAP R² computation:
- Automatically extracts feature names from the input data
- Returns a structured object with metadata
- Provides enhanced printing with top features displayed by default
- Includes a comprehensive summary() method
- Can be easily converted to a data frame with as.data.frame()
A qshap_result object containing:
- rsq: Numeric vector of feature-specific R² values
- feature_names: Character vector of feature names
- total_rsq: Total R² (sum of feature-specific values)
- n_samples: Number of samples
- n_features: Number of features
- loss: Loss matrix (if local = TRUE)

    library(xgboost)
    set.seed(42)
    n <- 100
    p <- 100
    X <- matrix(rnorm(n * p), nrow = n, ncol = p)
    y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.2)
    model <- xgboost(X, y, nrounds = 15, max_depth = 2, verbose = 0)
    explainer <- gazer(model)
    result <- rsq(explainer, X, y)
    print(result)

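
The interplay of `nsample`, `nfrac`, and `random_state` described in the arguments can be pictured with a small base-R sketch. `pick_rows()` is a hypothetical helper; the precedence of `nfrac` over `nsample` and the rounding rule are assumptions made for illustration, not the package's documented behavior.

```r
# Hypothetical helper: choose which rows of x to evaluate.
pick_rows <- function(n, nsample = NULL, nfrac = NULL, random_state = 42) {
  # Assumption: a fraction, when given, is converted to a row count.
  if (!is.null(nfrac)) nsample <- max(1L, floor(nfrac * n))
  # No subsampling requested, or request covers all rows: use everything.
  if (is.null(nsample) || nsample >= n) return(seq_len(n))
  set.seed(random_state)          # a fixed seed makes the draw reproducible
  sort(sample.int(n, nsample))    # sample row indices without replacement
}

idx_a <- pick_rows(100, nsample = 20)
idx_b <- pick_rows(100, nsample = 20)
identical(idx_a, idx_b)               # same seed, same rows
length(pick_rows(100, nfrac = 0.25))  # a quarter of the rows
```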
Summary method for qshap_result

    ## S3 method for class 'qshap_result'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | A qshap_result object |
| ... | Additional arguments (currently unused) |

The input object is returned invisibly. Called primarily for
its side effect of printing a detailed summary of the qshap_result
object to the console.
Provides a summary of the qshap_rsq object, showing the top features by R-squared contribution.

    ## S3 method for class 'qshap_rsq'
    summary(object, n = 10, ...)

| Argument | Description |
|---|---|
| object | A |
| n | Integer number of top features to display (default: 10) |
| ... | Additional arguments (currently unused) |

The input object is returned invisibly. Called primarily for
its side effect of printing a summary of the qshap_rsq object to
the console.
Provides detailed summary information about the explainer.

    ## S3 method for class 'qshap_tree_explainer'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | A qshap_tree_explainer object |
| ... | Additional arguments (currently unused) |

The input object is returned invisibly. Called primarily for
its side effect of printing a detailed summary of the
qshap_tree_explainer object to the console.