% Introduction to Data Analysis and Machine Learning in Physics: \ 4. Decision Trees
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023

## Exercises

* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values

## Decision trees

\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}

\begin{center}
Leaf nodes classify events as either signal or background
\end{center}

## Decision trees: Rectangular volumes in feature space

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}

* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?

## Finding optimal cuts

Separation between signal and background is often measured with the Gini index (or Gini impurity):

$$ G = p (1-p) $$

Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$

\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}

\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$

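A minimal NumPy sketch of this splitting criterion on toy data (the function names `gini` and `split_gain` and the toy sample are illustrative, not taken from the lecture code):

\footnotesize
```python
import numpy as np

def gini(w_signal, w_background):
    """Gini impurity G = p(1-p) of a node with weighted signal/background events."""
    w_s, w_b = np.sum(w_signal), np.sum(w_background)
    if w_s + w_b == 0:
        return 0.0
    p = w_s / (w_s + w_b)          # purity
    return p * (1 - p)

def split_gain(x, w, is_signal, cut):
    """Improvement Delta = W_A G_A - W_B G_B - W_C G_C for a cut on feature x."""
    def node(mask):
        return np.sum(w[mask]), gini(w[mask & is_signal], w[mask & ~is_signal])
    W_A, G_A = node(np.ones_like(is_signal))
    W_B, G_B = node(x < cut)
    W_C, G_C = node(x >= cut)
    return W_A * G_A - W_B * G_B - W_C * G_C

# toy example: one feature, unit weights, signal shifted relative to background
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
is_signal = np.concatenate([np.zeros(500, bool), np.ones(500, bool)])
w = np.ones_like(x)
best_cut = max(np.linspace(-2, 4, 61), key=lambda c: split_gain(x, w, is_signal, c))
print(f"best cut at x = {best_cut:.2f}")
```
\normalsize
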
## Gini impurity and other purity measures

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}

## Decision tree pruning

::: columns
:::: {.column width=50%}

When to stop growing a tree?

* When all nodes are essentially pure?
* Well, that's overfitting!

\vspace{3ex}

Pruning

* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves

::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::

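In scikit-learn this idea is available as cost-complexity pruning via the `ccp_alpha` parameter of `DecisionTreeClassifier`; a minimal sketch on toy data (the value of `ccp_alpha` is arbitrary):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fully grown tree vs. tree pruned with cost-complexity parameter ccp_alpha
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_train, y_train)

for name, clf in [("full tree", full), ("pruned tree", pruned)]:
    print(f"{name}: {clf.get_n_leaves()} leaves, "
          f"train accuracy {clf.score(X_train, y_train):.3f}, "
          f"test accuracy {clf.score(X_test, y_test):.3f}")
```
\normalsize
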
## Single decision trees: Pros and cons

\textcolor{green}{Pros:}

* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs

\vfill

\textcolor{red}{Cons:}

* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?

## Ensemble methods: Combine weak learners

::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample training data (with replacement) and train a separate model on each of the derived training sets
    * Classify example with majority vote, or compute average output from each tree as model output

::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by previous model
    * Take weighted average to classify examples

::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::

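A minimal scikit-learn sketch of both strategies on toy data (classifier settings are arbitrary):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# bagging: many deep trees, each trained on a bootstrap sample, outputs averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# boosting: weighted sequence of shallow trees (decision stumps)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=100, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```
\normalsize
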
## Random forests

* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select random example subset
\vfill
* Train a tree, but only use random subset of features at each split
    * this reduces the correlation between different trees
    * makes the decision more robust to missing data

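A minimal sketch with scikit-learn's `RandomForestClassifier`, whose `max_features` argument controls the random feature subset considered at each split (toy data, illustrative settings):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 bagged trees; each split considers only a random subset of sqrt(n_features) features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")
```
\normalsize
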
## Boosted decision trees: Idea

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}

## AdaBoost (short for Adaptive Boosting)

Initial training sample

\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}

with equal weights normalized as

$$ \sum_{i=1}^n w_i^{(1)} = 1 $$

Train first classifier $f_1$:

\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}

## AdaBoost: Updating event weights

Define training sample $k+1$ from training sample $k$ by updating the weights:

$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$

\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize

The weight is increased if the event was misclassified by the previous classifier

$\to$ "Next classifier should pay more attention to misclassified events"

\vfill
At each step the classifier $f_k$ minimizes the error rate:

$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$

## AdaBoost: Assigning the classifier score

Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$

\vfill

Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$

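A compact from-scratch sketch of these update rules, using decision stumps as weak learners (a toy illustration on synthetic data, not a production implementation):

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data with labels mapped to +1 / -1
X, y01 = make_classification(n_samples=1000, n_features=5, random_state=0)
y = 2 * y01 - 1

n, K = len(y), 50
w = np.ones(n) / n                                # w_i^(1): equal, normalized weights
stumps, alphas = [], []

for k in range(K):
    stump = DecisionTreeClassifier(max_depth=1)   # weak learner f_k
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)                       # f_k(x_i) in {+1, -1}
    eps = np.sum(w * (pred != y))                 # weighted error rate epsilon_k
    eps = np.clip(eps, 1e-10, 1 - 1e-10)          # guard against division by zero
    alpha = np.log((1 - eps) / eps)               # classifier score alpha_k
    w = w * np.exp(-alpha * pred * y / 2)         # weight update
    w /= w.sum()                                  # normalization Z_k
    stumps.append(stump)
    alphas.append(alpha)

# combined classifier f(x) = sum_k alpha_k f_k(x)
f = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(f"training accuracy: {np.mean(np.sign(f) == y):.3f}")
```
\normalsize
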
## Gradient boosting

Basic idea:

* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on

\vfill

In slightly more detail:

* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$

\color{black}

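A minimal sketch of this residual-fitting loop for a toy regression problem, with small regression trees as the $h_m$ (purely illustrative):

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data: y = sin(x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(500, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.1, size=500)

F = np.full_like(y, y.mean())   # F_0: constant initial model
trees = []

for m in range(100):
    residuals = y - F                              # y_i - F_m(x_i)
    h = DecisionTreeRegressor(max_depth=2).fit(x, residuals)
    F = F + h.predict(x)                           # F_{m+1} = F_m + h_m
    trees.append(h)

print(f"RMS of training residuals: {np.sqrt(np.mean((y - F)**2)):.3f}")
```
\normalsize

In practice each $h_m$ is scaled by a learning rate (shrinkage) to reduce overfitting, which is what XGBoost and scikit-learn's `GradientBoostingClassifier` do.
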
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize

\vfill

Superconductivity data set:

Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize

\vfill

From the abstract:

We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.

\vfill

\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize

## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)

::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_test, y_test: superconductivity data prepared in the notebook
XGBreg = xgb.sklearn.XGBRegressor()

XGBreg.fit(X_train, y_train)

y_pred = XGBreg.predict(X_test)

rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```

\textcolor{gray}{This gives:}

`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::

## Exercise 1: Compare different decision tree classifiers

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\normalsize

\vspace{5ex}

Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline

\vspace{2ex}

Is there a classifier that clearly performs best?

## Exercise 2: Apply XGBoost classifier to MAGIC data set

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize

\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize

\small
a) Plot predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three performance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize

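As a hint for b) and c), the XGBoost plotting API provides `plot_importance` and `plot_tree`; a sketch of how they might be called once `XGBclassifier` has been fit to the training data:

\footnotesize
```python
import matplotlib.pyplot as plt
import xgboost as xgb

# assumes XGBclassifier.fit(X_train, y_train) has been called

# b) feature importance for the three importance measures
for measure in ("weight", "gain", "cover"):
    xgb.plot_importance(XGBclassifier, importance_type=measure, title=measure)
plt.show()

# c) draw tree number 10 of the ensemble (needs the graphviz package)
xgb.plot_tree(XGBclassifier, num_trees=10)
plt.show()
```
\normalsize
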
## Exercise 3: Feature importance

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize

\vspace{3ex}

Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.

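A possible skeleton for this study, assuming the MAGIC features and labels are available as pandas DataFrames/Series `X_train`, `X_test`, `y_train`, `y_test` together with a list `feature_names` (placeholder names, to be adapted to the notebook):

\footnotesize
```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# X_train, X_test (DataFrames), y_train, y_test, feature_names: from the notebook
auc_scores = {}
for feature in feature_names:
    cols = [f for f in feature_names if f != feature]          # drop one feature
    clf = xgb.XGBClassifier(n_estimators=100)
    clf.fit(X_train[cols], y_train)
    y_score = clf.predict_proba(X_test[cols])[:, 1]
    auc_scores[feature] = roc_auc_score(y_test, y_score)

# the larger the AUC drop when a feature is removed, the more important the feature
for feature, auc in sorted(auc_scores.items(), key=lambda kv: kv[1]):
    print(f"without {feature}: AUC = {auc:.4f}")
```
\normalsize
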
## Exercise 4: Interpret a classifier with SHAP values

SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept used in cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.

\vfill

Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.

a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)

b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?

c) Do the same for the superconductivity data set. What are the three most important features?

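A minimal sketch for b), assuming the fitted XGBoost classifier from exercise 2 is called `XGBclassifier` and the test features are in a DataFrame `X_test`:

\footnotesize
```python
import shap

# TreeExplainer is SHAP's fast explainer for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(XGBclassifier)
shap_values = explainer.shap_values(X_test)

# summary plot: features ranked by mean absolute SHAP value
shap.summary_plot(shap_values, X_test)
```
\normalsize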