Machine Learning course as part of the Studierendentage in the 2023 summer semester

% Introduction to Data Analysis and Machine Learning in Physics: \ 4. Decision Trees
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Decision trees
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}
\begin{center}
Leaf nodes classify events as either signal or background
\end{center}
## Decision trees: Rectangular volumes in feature space
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}
* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?
## Finding optimal cuts
Separation between signal and background is often measured with the Gini index (or Gini impurity):
$$ G = p (1-p) $$
Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}
\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
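## Finding optimal cuts: illustrative code sketch
A minimal sketch (not from the original slides) of how the weighted Gini impurity and the split improvement $\Delta$ defined above could be computed; the function names and weight arrays are illustrative.
\footnotesize
```python
import numpy as np

def gini(w_sig, w_bkg):
    """Weighted Gini impurity G = p(1-p), with purity p computed from event weights."""
    w_total = np.sum(w_sig) + np.sum(w_bkg)
    p = np.sum(w_sig) / w_total
    return p * (1 - p)

def split_improvement(node_A, node_B, node_C):
    """Delta = W_A G_A - W_B G_B - W_C G_C for a split of node A into B and C.

    Each node is a tuple (w_sig, w_bkg) of signal and background event weights."""
    delta = 0.0
    for sign, (w_sig, w_bkg) in zip([+1, -1, -1], [node_A, node_B, node_C]):
        W = np.sum(w_sig) + np.sum(w_bkg)
        delta += sign * W * gini(w_sig, w_bkg)
    return delta
```
\normalsize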
## Gini impurity and other purity measures
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}
## Decision tree pruning
::: columns
:::: {.column width=50%}
When to stop growing a tree?
* When all nodes are essentially pure?
* Well, that's overfitting!
\vspace{3ex}
Pruning
* Cut back a fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves (see the sketch on the following slide)
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::
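## Decision tree pruning: illustrative code sketch
One way to prune in practice is scikit-learn's cost-complexity pruning, shown here as a sketch (not from the original slides); the toy data set and the simple choice of `ccp_alpha` on a held-out set are only for illustration.
\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data, only to make the sketch self-contained
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# candidate pruning strengths from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# pick the alpha that performs best on a held-out set and refit the pruned tree
scores = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train).score(X_val, y_val)
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[np.argmax(scores)]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
```
\normalsize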
## Single decision trees: Pros and cons
\textcolor{green}{Pros:}
* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs
\vfill
\textcolor{red}{Cons:}
* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?
## Ensemble methods: Combine weak learners
::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
  * Sample training data (with replacement) and train a separate model on each of the derived training sets
  * Classify example with majority vote, or compute average output from each tree as model output
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
  * Train $N$ models in sequence, giving more weight to examples not correctly classified by previous model
  * Take weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::
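## Ensemble methods: illustrative code sketch
As an illustration (not part of the original slides), both strategies are available in scikit-learn; `X_train`, `y_train`, `X_test`, `y_test` are assumed to be an existing training/test split.
\footnotesize
```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bagging: trees trained on bootstrap samples of the training data, outputs combined
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

# boosting: shallow trees trained in sequence, misclassified events get larger weights
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)

for model in (bagging, boosting):
    model.fit(X_train, y_train)             # assumed training split
    print(model.score(X_test, y_test))      # accuracy on the assumed test split
```
\normalsize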
## Random forests
* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select random example subset
\vfill
* Train a tree, but only use a random subset of features at each split (see the sketch on the following slide)
  * this reduces the correlation between different trees
  * makes the decision more robust to missing data
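## Random forests: illustrative code sketch
A minimal sketch (assumed training data `X_train`, `y_train`): in scikit-learn's `RandomForestClassifier`, the random feature subset considered at each split is controlled by `max_features`.
\footnotesize
```python
from sklearn.ensemble import RandomForestClassifier

# 300 bagged trees; only sqrt(n_features) randomly chosen features are considered per split
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1)
forest.fit(X_train, y_train)             # X_train, y_train: assumed training data
print(forest.feature_importances_)       # impurity-based feature importances
```
\normalsize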
## Boosted decision trees: Idea
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}
## AdaBoost (short for Adaptive Boosting)
Initial training sample
\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}
with equal weights normalized as
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$
Train first classifier $f_1$:
\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}
## AdaBoost: Updating event weights
Define training sample $k+1$ from training sample $k$ by updating the weights:
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k+1)} = 1$$}
\normalsize
The weight is increased if the event was misclassified by the previous classifier
$\to$ "The next classifier should pay more attention to misclassified events"
\vfill
At each step the classifier $f_k$ minimizes the error rate:
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
## AdaBoost: Assigning the classifier score
Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
\vfill
Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
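## AdaBoost: illustrative code sketch
A compact from-scratch sketch of the update rules from the previous slides (illustrative only, not a reference implementation): decision stumps from scikit-learn play the role of the weak classifiers $f_k$, and `X`, `y` with labels $\pm 1$ are assumed.
\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, K=100):
    """Train K decision stumps with AdaBoost; y must contain labels +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # w_i^(1): equal weights, normalized to 1
    stumps, alphas = [], []
    for k in range(K):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        f_k = stump.predict(X)                     # f_k(x_i) in {+1, -1}
        eps = np.clip(np.sum(w * (y * f_k <= 0)), 1e-10, 1 - 1e-10)  # error rate
        alpha = np.log((1 - eps) / eps)            # classifier score alpha_k
        w = w * np.exp(-alpha * f_k * y / 2)       # weight update
        w /= w.sum()                               # normalization factor Z_k
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Combined classifier f(x) = sum_k alpha_k f_k(x); the sign gives the class."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```
\normalsize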
## Gradient boosting
Basic idea:
* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on
\vfill
In slightly more detail:
* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$ (see the sketch on the following slide)
\color{black}
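## Gradient boosting: illustrative code sketch
A minimal regression sketch of this residual-fitting loop (not from the original slides): shallow scikit-learn trees play the role of the $h_m$, a small learning rate is added as commonly done in practice, and `X`, `y` are assumed.
\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1):
    """Fit each new tree h_m to the residuals y - F_m(x) of the current model."""
    f0 = y.mean()                                   # F_0: constant initial prediction
    F = np.full(len(y), f0)
    trees = []
    for m in range(n_trees):
        residuals = y - F                           # what the current model still gets wrong
        h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F = F + learning_rate * h.predict(X)        # F_{m+1} = F_m + (shrunken) h_m
        trees.append(h)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```
\normalsize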
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize
\vfill
Superconductivity data set:
Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize
\vfill
From the abstract:
We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.
\vfill
\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)
::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
XGBreg = xgb.sklearn.XGBRegressor()
XGBreg.fit(X_train, y_train)
y_pred = XGBreg.predict(X_test)
from sklearn.metrics import mean_squared_error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```
\textcolor{gray}{This gives:}
`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::
## Exercise 1: Compare different decision tree classifiers
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\normalsize
\vspace{5ex}
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
\vspace{2ex}
Is there a classifier that clearly performs best?
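## Exercise 1: a possible starting point
One way to set up the comparison, as a sketch rather than the model solution; `X_train`, `X_test`, `y_train`, `y_test` are assumed to be the prepared heart disease data.
\footnotesize
```python
import matplotlib.pyplot as plt
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import roc_auc_score, roc_curve

classifiers = {
    "AdaBoost": AdaBoostClassifier(),
    "Random forest": RandomForestClassifier(),
    "Gradient boosting": GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                       # assumed training split
    scores = clf.predict_proba(X_test)[:, 1]        # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```
\normalsize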
## Exercise 2: Apply XGBoost classifier to MAGIC data set
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize
\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize
\small
a) Plot the predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize
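## Exercise 2: a possible starting point
For orientation (a sketch under assumptions, not the model solution): the classifier defined above can be trained and the importance measures from part b) inspected with `xgb.plot_importance`; `X_train`, `y_train` denote the prepared MAGIC training data.
\footnotesize
```python
import matplotlib.pyplot as plt
import xgboost as xgb

XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
XGBclassifier.fit(X_train, y_train)      # X_train, y_train: assumed MAGIC training data

# compare the three importance measures mentioned in part b)
for measure in ("weight", "gain", "cover"):
    xgb.plot_importance(XGBclassifier, importance_type=measure, title=measure)
plt.show()
```
\normalsize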