% Introduction to Data Analysis and Machine Learning in Physics: \ 4. Decision Trees
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Exercises
* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values
## Decision trees
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}
\begin{center}
Leaf nodes classify events as either signal or background
\end{center}
## Decision trees: Rectangular volumes in feature space
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}
* Easy to interpret and visualize: the space of feature vectors is split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?
## Finding optimal cuts
Separation between signal and background is often measured with the Gini index (or Gini impurity):
$$ G = p (1-p) $$
Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
\vfill
\textcolor{gray}{The usefulness of weights will become apparent soon.}
\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
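
This figure of merit is straightforward to evaluate in code. The following sketch (not part of the original slides; labels $\pm 1$, event weights, and non-empty child nodes are assumptions for illustration) computes the weighted Gini impurity and the improvement $\Delta$ for a single candidate cut:

```python
import numpy as np

def gini(W_sig, W_bkg):
    # Gini impurity G = p (1 - p) with purity p from summed event weights
    p = W_sig / (W_sig + W_bkg)
    return p * (1 - p)

def split_improvement(x, y, w, cut):
    # Delta = W_A G_A - W_B G_B - W_C G_C for a cut on feature x
    # y: true labels (+1 = signal, -1 = background), w: event weights
    def node(mask):
        W_sig = w[mask & (y > 0)].sum()
        W_bkg = w[mask & (y < 0)].sum()
        return W_sig + W_bkg, gini(W_sig, W_bkg)

    W_A, G_A = node(np.ones(len(x), dtype=bool))  # parent node A
    W_B, G_B = node(x < cut)                      # child node B
    W_C, G_C = node(x >= cut)                     # child node C
    return W_A * G_A - W_B * G_B - W_C * G_C
```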
## Gini impurity and other purity measures
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}
## Decision tree pruning
::: columns
:::: {.column width=50%}
When to stop growing a tree?
* When all nodes are essentially pure?
* Well, that's overfitting!
\vspace{3ex}
Pruning
* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::
## Single decision trees: Pros and cons
\textcolor{green}{Pros:}
* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs
\vfill
\textcolor{red}{Cons:}
* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?
## Ensemble methods: Combine weak learners
::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample the training data (with replacement) and train a separate model on each of the derived training sets
    * Classify examples by majority vote, or take the average output of the individual trees as the model output
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by the previous model
    * Take a weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::
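
The bagging average can be written down directly. A minimal sketch (not from the slides; `X_train`, `y_train`, `X_test` as NumPy arrays and the tree settings are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_trees = 100
rng = np.random.default_rng(seed=1)
trees = []
for _ in range(n_trees):
    # bootstrap sample: draw n events with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X_train[idx], y_train[idx]))

# model output y(x) = 1/N_trees * sum_i y_i(x), here using signal probabilities
y_bagged = np.mean([t.predict_proba(X_test)[:, 1] for t in trees], axis=0)
```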
## Random forests
* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select a random subset of examples for each tree
\vfill
* Train each tree using only a random subset of the features at each split
    * This reduces the correlation between different trees
    * This makes the decision more robust to missing data
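
In scikit-learn the per-split feature subsampling is controlled by the `max_features` parameter. A minimal sketch (training data assumed to exist; the parameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged trees; each split considers only sqrt(n_features) randomly chosen features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", n_jobs=-1)
forest.fit(X_train, y_train)
y_score = forest.predict_proba(X_test)[:, 1]
```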
## Boosted decision trees: Idea
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}
## AdaBoost (short for Adaptive Boosting)
Initial training sample
\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}
with equal weights normalized as
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$
Train first classifier $f_1$:
\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}
## AdaBoost: Updating event weights
Define training sample $k+1$ from training sample $k$ by updating the weights:
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize
The weight is increased if the event was misclassified by the previous classifier
$\to$ "The next classifier should pay more attention to misclassified events"
\vfill
At each step the classifier $f_k$ minimizes the error rate:
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
## AdaBoost: Assigning the classifier score
Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
\vfill
Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
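
This scheme is available in scikit-learn as `AdaBoostClassifier`. A minimal sketch (data assumed to exist; depending on the scikit-learn version the keyword for the weak learner is `estimator` or `base_estimator`):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# boost shallow trees ("decision stumps") as weak learners f_k
bdt = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200)
bdt.fit(X_train, y_train)

# decision_function returns the (normalized) weighted sum of the weak learners
score = bdt.decision_function(X_test)
```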
## Gradient boosting
Basic idea:
* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on
\vfill
In slightly more detail:
* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$
\color{black}
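
The residual-fitting idea can be written out in a few lines. A sketch with regression trees (not from the slides; training data, tree depth, learning rate, and number of iterations are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

eta, n_iter = 0.1, 100                        # learning rate, number of iterations
F = np.full(len(y_train), y_train.mean())     # F_0: constant initial prediction
trees = []
for m in range(n_iter):
    # fit h_m to the residuals y_i - F_m(x_i), then update F_{m+1} = F_m + eta * h_m
    h = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train - F)
    trees.append(h)
    F += eta * h.predict(X_train)

# prediction for new data: F_0 + eta * sum_m h_m(x)
y_pred = y_train.mean() + eta * np.sum([h.predict(X_test) for h in trees], axis=0)
```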
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize
\vfill
Superconductivity data set:
Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize
\vfill
From the abstract:
We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.
\vfill
\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)
::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

XGBreg = xgb.sklearn.XGBRegressor()
XGBreg.fit(X_train, y_train)
y_pred = XGBreg.predict(X_test)

rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```
\textcolor{gray}{This gives:}
`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::
## Exercise 1: Compare different decision tree classifiers
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\normalsize
\vspace{5ex}
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
\vspace{2ex}
Is there a classifier that clearly performs best?
## Exercise 2: Apply XGBoost classifier to MAGIC data set
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize
\footnotesize
```python
# train an XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize
\small
a) Plot the predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see the [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html) and the sketch below). Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize
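
For parts b) and c), the relevant XGBoost plotting calls look roughly as follows (a sketch; it assumes the classifier defined above has been fit to the MAGIC training data):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# b) feature importance; compare importance_type = "weight", "gain", and "cover"
xgb.plot_importance(XGBclassifier, importance_type="gain")

# c) draw tree number 10 of the ensemble (requires the graphviz package)
xgb.plot_tree(XGBclassifier, num_trees=10)
plt.show()
```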
## Exercise 3: Feature importance
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize
\vspace{3ex}
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
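
One possible structure for this study (a sketch; it assumes the MAGIC features as a pandas DataFrame `X`, labels `y`, and illustrative hyperparameters):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

for feature in X.columns:
    X_red = X.drop(columns=feature)   # drop one feature at a time
    X_train, X_test, y_train, y_test = train_test_split(X_red, y, test_size=0.5,
                                                        random_state=1)
    clf = xgb.XGBClassifier(n_estimators=100).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"without {feature}: AUC = {auc:.4f}")
```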
## Exercise 4: Interpret a classifier with SHAP values
SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept from cooperative game theory, named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
\vfill
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.
a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?
c) Do the same for the superconductivity data set. What are the three most important features?
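
A minimal sketch of the SHAP calls (it assumes the fitted XGBoost classifier and the test features from exercise 2):

```python
import shap

# TreeExplainer is the fast explainer for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(XGBclassifier)
shap_values = explainer.shap_values(X_test)

# summary plot: features ranked by their mean absolute SHAP value
shap.summary_plot(shap_values, X_test)
```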