---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 4. Decision Trees
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Exercises

* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values
## Decision trees

\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}

\begin{center}
Leaf nodes classify events as either signal or background
\end{center}

## Decision trees: Rectangular volumes in feature space

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}

* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?
## Finding optimal cuts

Separation between signal and background is often measured with the Gini index (or Gini impurity):
$$ G = p (1-p) $$
Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}
\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
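\vfill
A minimal sketch of how a candidate cut could be scored with this criterion (illustrative code, not from a specific library; the optimal cut on a feature is the value that maximizes the returned gain):

\footnotesize
```python
import numpy as np

def gini(w_sig, w_bkg):
    """Gini impurity G = p(1-p) with purity p computed from event weights."""
    W = w_sig.sum() + w_bkg.sum()
    if W == 0:
        return 0.0
    p = w_sig.sum() / W
    return p * (1 - p)

def split_gain(x, w, is_signal, cut):
    """Improvement Delta = W_A G_A - W_B G_B - W_C G_C for the cut x < cut."""
    def node(mask):
        W = w[mask].sum()
        return W, gini(w[mask & is_signal], w[mask & ~is_signal])
    W_A, G_A = node(np.ones(len(x), dtype=bool))
    W_B, G_B = node(x < cut)
    W_C, G_C = node(x >= cut)
    return W_A * G_A - W_B * G_B - W_C * G_C
```
\normalsize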
## Gini impurity and other purity measures

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}

## Decision tree pruning

::: columns
:::: {.column width=50%}
When to stop growing a tree?

* When all nodes are essentially pure?
* Well, that's overfitting!

\vspace{3ex}
Pruning

* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::
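\vfill
In scikit-learn, a similar effect can be obtained with minimal cost-complexity pruning; a minimal sketch (training/test arrays assumed to exist, the value of `ccp_alpha` is arbitrary):

\footnotesize
```python
from sklearn.tree import DecisionTreeClassifier

# fully grown tree (tends to overfit the training data)
full_tree = DecisionTreeClassifier().fit(X_train, y_train)

# pruned tree: larger ccp_alpha -> more aggressive pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print("full tree:  ", full_tree.score(X_test, y_test))
print("pruned tree:", pruned_tree.score(X_test, y_test))
```
\normalsize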
## Single decision trees: Pros and cons

\textcolor{green}{Pros:}

* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs

\vfill
\textcolor{red}{Cons:}

* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?
## Ensemble methods: Combine weak learners

::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample training data (with replacement) and train a separate model on each of the derived training sets
    * Classify example with majority vote, or compute average output from each tree as model output
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by the previous model
    * Take weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::
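\vfill
Both strategies are available as ready-made ensemble classifiers in scikit-learn; a minimal sketch (training/test arrays assumed, hyperparameters illustrative):

\footnotesize
```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bagging: average over trees trained on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

# boosting: weighted combination of sequentially trained shallow trees (stumps)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)

for clf in (bagging, boosting):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```
\normalsize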
## Random forests

* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select random example subset
\vfill
* Train a tree, but only use random subset of features at each split
    * this reduces the correlation between different trees
    * makes the decision more robust to missing data
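\vfill
A minimal scikit-learn sketch (data arrays assumed); `max_features` controls the size of the random feature subset considered at each split:

\footnotesize
```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees; each split considers only sqrt(n_features) randomly chosen features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", n_jobs=-1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```
\normalsize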
## Boosted decision trees: Idea

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}
## AdaBoost (short for Adaptive Boosting)

Initial training sample

\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}

with equal weights normalized as
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$

Train first classifier $f_1$:

\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}
## AdaBoost: Updating event weights

Define training sample $k+1$ from training sample $k$ by updating the weights:
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize
The weight is increased if the event was misclassified by the previous classifier
$\to$ "Next classifier should pay more attention to misclassified events"
\vfill
At each step the classifier $f_k$ minimizes the error rate:
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
## AdaBoost: Assigning the classifier score

Assign score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
\vfill
Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
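\vfill
A compact sketch of these update rules, using decision stumps from scikit-learn as the weak classifiers $f_k$ (labels $y_i = \pm 1$ and data arrays assumed):

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=100):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # equal initial weights w_i^(1)
    stumps, alphas = [], []
    for k in range(K):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred * y <= 0].sum()           # weighted error rate epsilon_k
        alpha = np.log((1 - eps) / eps)        # classifier score alpha_k
        w = w * np.exp(-alpha * pred * y / 2)  # update event weights
        w /= w.sum()                           # normalization Z_k
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def combined_classifier(stumps, alphas, X):
    # f(x) = sum_k alpha_k f_k(x); the sign gives the predicted class
    return sum(a * s.predict(X) for s, a in zip(stumps, alphas))
```
\normalsize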
## Gradient boosting

Basic idea:

* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on

\vfill
In slightly more detail:

* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$
\color{black}
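\vfill
A minimal sketch of this residual-fitting loop with regression trees (data arrays assumed; libraries such as XGBoost add a learning rate, regularization and second-order information on top of this):

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, max_depth=3):
    F = np.zeros(len(y))          # F_0: start from a zero prediction
    trees = []
    for m in range(n_trees):
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, y - F)           # fit h_m to the residuals y_i - F_m(x_i)
        F = F + h.predict(X)      # F_{m+1}(x) = F_m(x) + h_m(x)
        trees.append(h)
    return trees

def predict(trees, X):
    return sum(t.predict(X) for t in trees)
```
\normalsize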
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)

\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize
\vfill
Superconductivity data set:
Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize
\vfill
From the abstract:

We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.

\vfill
\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)

::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# fit an XGBoost regressor and evaluate it on the test set
XGBreg = xgb.sklearn.XGBRegressor()
XGBreg.fit(X_train, y_train)

y_pred = XGBreg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```
\textcolor{gray}{This gives:}

`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::
## Exercise 1: Compare different decision tree classifiers

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\normalsize
\vspace{5ex}
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
\vspace{2ex}
Is there a classifier that clearly performs best?
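\vspace{2ex}
A possible starting point for the ROC comparison (sketch; default hyperparameters, train/test arrays assumed):

\footnotesize
```python
import matplotlib.pyplot as plt
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import auc, roc_curve

for clf in (AdaBoostClassifier(), RandomForestClassifier(), GradientBoostingClassifier()):
    clf.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, label=f"{type(clf).__name__} (AUC = {auc(fpr, tpr):.3f})")

plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
```
\normalsize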
## Exercise 2: Apply XGBoost classifier to MAGIC data set

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize

\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize

\small
a) Plot predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)) as sketched below. Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize
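\vspace{2ex}
Sketch for the hint in b) (assumes the classifier defined above has been fit on the training data):

\footnotesize
```python
import xgboost as xgb

XGBclassifier.fit(X_train, y_train)

# feature importance according to the different importance measures
for imp in ("weight", "gain", "cover"):
    xgb.plot_importance(XGBclassifier, importance_type=imp, title=imp)
```
\normalsize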
## Exercise 3: Feature importance

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize
\vspace{3ex}
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
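\vspace{2ex}
A possible structure for the feature-drop loop (sketch; assumes the MAGIC features in a pandas DataFrame `X` with numeric labels `y`):

\footnotesize
```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

for feature in X.columns:
    # retrain without this feature and evaluate on the test set
    clf = xgb.XGBClassifier(n_estimators=100)
    clf.fit(X_train.drop(columns=feature), y_train)
    y_score = clf.predict_proba(X_test.drop(columns=feature))[:, 1]
    print(f"without {feature}: AUC = {roc_auc_score(y_test, y_score):.4f}")
```
\normalsize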
## Exercise 4: Interpret a classifier with SHAP values

SHAP (SHapley Additive exPlanations) is a method to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept from cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
\vfill
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.

a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2 (see the sketch below). What are the three most important features?
c) Do the same for the superconductivity data set. What are the three most important features?
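\vspace{2ex}
A minimal sketch of how the summary plot could be produced (assumes the fitted XGBoost classifier `XGBclassifier` and the test features `X_test` from exercise 2):

\footnotesize
```python
import shap

# TreeExplainer computes SHAP values efficiently for tree-based models
explainer = shap.TreeExplainer(XGBclassifier)
shap_values = explainer.shap_values(X_test)

# summary plot: features ordered by overall importance
shap.summary_plot(shap_values, X_test)
```
\normalsize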