Monday, January 7, 2013

Partial Dependency Plots and GBM


My favorite "go-to-first" modeling algorithm for a classification or regression task is Gradient Boosted Regression Trees (Friedman, 2001 and 2002), especially as implemented in the GBM package in R. Gradient boosted regression was one of the focuses of my master's thesis (along with random forests) and has gotten steadily more attention as data mining contests have been won, at least in part, by employing this method. Some of the reasons I pick Gradient Boosted Regression Trees as the best off-the-shelf predictive modeling algorithm available are the following (a small example fit is sketched after the list): 
  • High predictive accuracy across many domains and target data types
    • Ability to specify various loss functions (Gaussian, Bernoulli, Poisson, etc.) as well as to run survival analysis via Cox proportional hazards, quantile regression, etc.
  • Handles mixed data types (continuous and nominal)
  • Seamlessly deals with missing values
  • Contains out-of-bag (OOB) estimates of error
  • Contains variable importance measures
  • Contains variable interaction measures / detection
  • Allows estimates of the marginal effect of one or more predictors via Partial Dependency Plots.
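
To make the list concrete, here is a minimal sketch of what a gbm fit looks like. The data frame `train`, the binary target `y`, the predictor names, and all tuning values are hypothetical placeholders, not recommendations.

```r
# Minimal gbm fit on a hypothetical data set 'train' with binary target 'y'.
library(gbm)

set.seed(42)
fit <- gbm(y ~ .,                       # model formula
           data = train,                # mixed continuous/nominal predictors are fine
           distribution = "bernoulli",  # loss function; "gaussian", "poisson", "coxph", etc. also available
           n.trees = 2000,              # number of boosting iterations
           interaction.depth = 3,       # depth of each tree
           shrinkage = 0.01,            # learning rate
           bag.fraction = 0.5,          # subsampling enables the out-of-bag estimates
           cv.folds = 5)

best.iter <- gbm.perf(fit, method = "OOB")  # out-of-bag estimate of the best iteration
summary(fit, n.trees = best.iter)           # relative variable importance
interact.gbm(fit, data = train,             # H-statistic for a candidate interaction
             i.var = c("x1", "x2"), n.trees = best.iter)
```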

That last point is a particularly nice feature coded into the GBM package: it gives the analyst the ability to produce univariate and bivariate partial dependency plots. These plots enable the researcher to understand the effect of a predictor variable (or of an interaction between two predictors) on the target outcome, given the other predictors in the model (partialling them out by averaging over their effects). 
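
For instance, with the hypothetical fit above, the plot method in the gbm package produces these plots directly; i.var can name a single predictor or a pair (the names "x1" and "x2" are placeholders).

```r
# Univariate partial dependence of the target on "x1"
plot(fit, i.var = "x1", n.trees = best.iter)

# Bivariate partial dependence on the pair ("x1", "x2")
plot(fit, i.var = c("x1", "x2"), n.trees = best.iter)
```

By default the vertical axis is on the scale of the model's link function (e.g., log odds for a Bernoulli loss) rather than the response scale.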

The technique is not unique to Gradient Boosted Regression Trees; in fact, it is a general method for understanding any black-box modeling algorithm. When we use ensembles, for instance, this tends to be the only way to understand how changing the value of a predictor, say a binary predictor from 0 to 1 or a continuous predictor across its observed range, affects the outcome, given the model (i.e., accounting for the impact of the other predictors in the model).

The idea of these “marginal” plots is to display the modeled outcome over the range of a given predictor. Hastie et al. (2009) describe the general technique as follows: consider the full model function $f$ as depending on a small subset of predictors we are interested in, $X_{s}$ (in practice typically one or two predictors), and on the remaining predictors in the model, $X_{c}$. The full model function is then written as $f(X)=f(X_{s},X_{c})$, where $X=X_{s} \cup X_{c}$. A partial dependence plot displays the marginal expected value of $f$ as a function of $X_{s}$, obtained by averaging over the values of $X_{c}$. Practically, the partial dependence at a given value (or vector of values) $x_{s}$ is estimated by the average shown below, with the subset predictors fixed at $x_{s}$ and the values of $X_{c}$ left as they occur in the data set.
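
In that notation, with training observations indexed by $i = 1, \dots, N$ and $x_{c,i}$ denoting the observed values of $X_{c}$ for the $i$-th case, the estimated partial dependence function is

$$\bar{f}_{s}(x_{s}) = \frac{1}{N}\sum_{i=1}^{N} f(x_{s},\, x_{c,i}).$$

A partial dependence plot simply traces this average over a grid of values of $x_{s}$.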

A brute force method would be to take the training data, plug in each pattern of $X_{s}$ that is of interest in turn (replacing the observed values of those predictors while leaving $X_{c}$ untouched), score the resulting data set with the model, and record the average predicted value for each distinct value of $X_{s}$.
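
A sketch of that brute force calculation, written against a generic predict() interface, could look like the following; the function name, the model object, and the column name in the example are placeholders.

```r
# Brute-force partial dependence for a single predictor.
# 'fit' is any fitted model with a predict() method, 'data' is the training
# data frame, and 'var' names the predictor of interest.
brute_force_pdp <- function(fit, data, var,
                            grid = sort(unique(data[[var]])), ...) {
  sapply(grid, function(v) {
    tmp <- data
    tmp[[var]] <- v                          # fix X_s at one value for every row
    mean(predict(fit, newdata = tmp, ...))   # average the predictions over X_c
  })
}

# For the gbm fit above, n.trees is passed through to predict.gbm:
# pd <- brute_force_pdp(fit, train, "x1", n.trees = best.iter)
# plot(sort(unique(train[["x1"]])), pd, type = "l")
```

This gets expensive for large data sets or fine grids; the GBM package avoids the cost by exploiting the tree structure when it computes its partial dependence plots, which is one reason the built-in plots are so convenient.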
