% Introduction to Data Analysis and Machine Learning in Physics: \ 4. Decision Trees
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Exercises
* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values
## Decision trees
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}
\begin{center}
Leaf nodes classify events as either signal or background
\end{center}
## Decision trees: Rectangular volumes in feature space
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}
* Easy to interpret and visualize: the space of feature vectors is split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?
## Finding optimal cuts
Separation between signal and background is often measured with the Gini index (or Gini impurity):
$$ G = p (1-p) $$
Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
\vfill
\textcolor{gray}{The usefulness of weights will become apparent soon.}
\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
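
This figure of merit is straightforward to evaluate in code. The following sketch (not part of the original slides; labels $\pm 1$, event weights, and non-empty child nodes are assumptions for illustration) computes the weighted Gini impurity and the improvement $\Delta$ for a single candidate cut:

```python
import numpy as np

def gini(W_sig, W_bkg):
    # Gini impurity G = p (1 - p) with purity p from summed event weights
    p = W_sig / (W_sig + W_bkg)
    return p * (1 - p)

def split_improvement(x, y, w, cut):
    # Delta = W_A G_A - W_B G_B - W_C G_C for a cut on feature x
    # y: true labels (+1 = signal, -1 = background), w: event weights
    def node(mask):
        W_sig = w[mask & (y > 0)].sum()
        W_bkg = w[mask & (y < 0)].sum()
        return W_sig + W_bkg, gini(W_sig, W_bkg)

    W_A, G_A = node(np.ones(len(x), dtype=bool))  # parent node A
    W_B, G_B = node(x < cut)                      # child node B
    W_C, G_C = node(x >= cut)                     # child node C
    return W_A * G_A - W_B * G_B - W_C * G_C
```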
## Gini impurity and other purity measures
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}
## Decision tree pruning
::: columns
:::: {.column width=50%}
When to stop growing a tree?
* When all nodes are essentially pure?
* Well, that's overfitting!
\vspace{3ex}
Pruning
* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::
## Single decision trees: Pros and cons
\textcolor{green}{Pros:}
* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs
\vfill
\textcolor{red}{Cons:}
* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?
## Ensemble methods: Combine weak learners
::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample the training data (with replacement) and train a separate model on each of the derived training sets
    * Classify examples by majority vote, or take the average output of the individual trees as the model output
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by the previous model
    * Take a weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::
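
The bagging average can be written down directly. A minimal sketch (not from the slides; `X_train`, `y_train`, `X_test` as NumPy arrays and the tree settings are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_trees = 100
rng = np.random.default_rng(seed=1)
trees = []
for _ in range(n_trees):
    # bootstrap sample: draw n events with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X_train[idx], y_train[idx]))

# model output y(x) = 1/N_trees * sum_i y_i(x), here using signal probabilities
y_bagged = np.mean([t.predict_proba(X_test)[:, 1] for t in trees], axis=0)
```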
## Random forests
* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select a random subset of examples for each tree
\vfill
* Train each tree using only a random subset of the features at each split
    * This reduces the correlation between different trees
    * This makes the decision more robust to missing data
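
In scikit-learn the per-split feature subsampling is controlled by the `max_features` parameter. A minimal sketch (training data assumed to exist; the parameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged trees; each split considers only sqrt(n_features) randomly chosen features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", n_jobs=-1)
forest.fit(X_train, y_train)
y_score = forest.predict_proba(X_test)[:, 1]
```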
## Boosted decision trees: Idea
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}
## AdaBoost (short for Adaptive Boosting)
Initial training sample
\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}
with equal weights normalized as
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$
Train first classifier $f_1$:
\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}
## AdaBoost: Updating event weights
Define training sample $k+1$ from training sample $k$ by updating the weights:
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize
The weight is increased if the event was misclassified by the previous classifier
$\to$ "The next classifier should pay more attention to misclassified events"
\vfill
At each step the classifier $f_k$ minimizes the error rate:
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
## AdaBoost: Assigning the classifier score
Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
\vfill
Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
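
This scheme is available in scikit-learn as `AdaBoostClassifier`. A minimal sketch (data assumed to exist; depending on the scikit-learn version the keyword for the weak learner is `estimator` or `base_estimator`):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# boost shallow trees ("decision stumps") as weak learners f_k
bdt = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=200)
bdt.fit(X_train, y_train)

# decision_function returns the (normalized) weighted sum of the weak learners
score = bdt.decision_function(X_test)
```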
## Gradient boosting
Basic idea:
* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on
\vfill
In slightly more detail:
* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$
\color{black}
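
The residual-fitting idea can be written out in a few lines. A sketch with regression trees (not from the slides; training data, tree depth, learning rate, and number of iterations are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

eta, n_iter = 0.1, 100                        # learning rate, number of iterations
F = np.full(len(y_train), y_train.mean())     # F_0: constant initial prediction
trees = []
for m in range(n_iter):
    # fit h_m to the residuals y_i - F_m(x_i), then update F_{m+1} = F_m + eta * h_m
    h = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train - F)
    trees.append(h)
    F += eta * h.predict(X_train)

# prediction for new data: F_0 + eta * sum_m h_m(x)
y_pred = y_train.mean() + eta * np.sum([h.predict(X_test) for h in trees], axis=0)
```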
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize
\vfill
Superconductivity data set:
Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize
\vfill
From the abstract:
We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.
\vfill
\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)
::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

XGBreg = xgb.sklearn.XGBRegressor()
XGBreg.fit(X_train, y_train)
y_pred = XGBreg.predict(X_test)

rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```
\textcolor{gray}{This gives:}
`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::
## Exercise 1: Compare different decision tree classifiers
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\normalsize
\vspace{5ex}
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
\vspace{2ex}
Is there a classifier that clearly performs best?
## Exercise 2: Apply XGBoost classifier to MAGIC data set
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize
\footnotesize
```python
# train an XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize
\small
a) Plot the predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see the [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html) and the sketch below). Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize
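
For parts b) and c), the relevant XGBoost plotting calls look roughly as follows (a sketch; it assumes the classifier defined above has been fit to the MAGIC training data):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# b) feature importance; compare importance_type = "weight", "gain", and "cover"
xgb.plot_importance(XGBclassifier, importance_type="gain")

# c) draw tree number 10 of the ensemble (requires the graphviz package)
xgb.plot_tree(XGBclassifier, num_trees=10)
plt.show()
```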
## Exercise 3: Feature importance
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize
\vspace{3ex}
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
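
One possible structure for this study (a sketch; it assumes the MAGIC features as a pandas DataFrame `X`, labels `y`, and illustrative hyperparameters):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

for feature in X.columns:
    X_red = X.drop(columns=feature)   # drop one feature at a time
    X_train, X_test, y_train, y_test = train_test_split(X_red, y, test_size=0.5,
                                                        random_state=1)
    clf = xgb.XGBClassifier(n_estimators=100).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"without {feature}: AUC = {auc:.4f}")
```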
## Exercise 4: Interpret a classifier with SHAP values
SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept from cooperative game theory, named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
\vfill
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.
a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?
c) Do the same for the superconductivity data set. What are the three most important features?
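
A minimal sketch of the SHAP calls (it assumes the fitted XGBoost classifier and the test features from exercise 2):

```python
import shap

# TreeExplainer is the fast explainer for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(XGBclassifier)
shap_values = explainer.shap_values(X_test)

# summary plot: features ranked by their mean absolute SHAP value
shap.summary_plot(shap_values, X_test)
```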