  1. ---
  2. title: |
  3. | Introduction to Data Analysis and Machine Learning in Physics:
  4. | 3. Machine Learning Basics
  5. author: "Martino Borsato, Jörg Marks, Klaus Reygers"
  6. date: "Studierendentage, 11-14 April 2022"
  7. ---
  8. ## Exercises
  9. * Exercise 1: Air shower classification (MAGIC telescope)
  10. * Logistic regression
  11. * [`03_ml_basics_ex01_magic.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
  12. * Exercise 2: Hand-written digit recognition with logistic regression
  13. * Logistic regression
  14. * [`03_ml_basics_ex02_mnist_softmax_regression.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_2_mnist_softmax_regression.ipynb)
  15. * Exercise 3: Data preprocessing
  16. ## What is machine learning? (1)
  17. ![](figures/deepl.png)
  18. ## What is machine learning? (2)
  19. "Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed" -- Wikipedia
  20. \vspace{2ex}
  21. Example: spam detection \hfill
  22. \scriptsize [\textcolor{gray}{J. Mayes, Machine learning 101}](https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/preview?imm_mid=0f9b7e&cmp=em-data-na-na-newsltr_20171213&slide=id.g168a3288f7_0_58)
  23. \normalsize
  24. \begin{center}
  25. \includegraphics[width=0.9\textwidth]{figures/ml_example_spam.png}
  26. \vspace{2ex}
  27. Manual feature engineering vs. automatic feature detection
  28. \end{center}
  29. ## AI, ML, and DL
  30. "AI is the study of how to make computers perform things that, at the moment, people do better."
  31. \tiny \textcolor{gray}{Elaine Rich, Artificial intelligence, McGraw-Hill 1983} \normalsize
  32. \vfill
  33. \tiny \hfill \textcolor{gray}{G. Marcus, E. Davis, Rebooting AI} \normalsize
  34. \begin{figure}
  35. \centering
  36. %![](figures/ai_ml_dl.pdf){width=70%}
  37. \includegraphics[width=0.7\textwidth]{figures/ai_ml_dl.pdf}
  38. \end{figure}
  39. \vfill
  40. "deep" in deep learning: artificial neural nets with many neurons and multiple layers of nonlinear processing units for feature extraction
  41. ## Multivariate analysis: An early example from particle physics
  42. ::: columns
  43. :::: {.column width=55%}
  44. ![](figures/mva.png){width=99%}
  45. ::::
  46. :::: {.column width=45%}
  47. * Signal: $e^+e^- \to W^+W^-$
  48. * often 4 well separated hadron jets
  49. * Background: $e^+e^- \to qqgg$
  50. * 4 less well separated hadron jets
  51. * Input variables based on jet structure, event shape, ... none by itself gives much separation.
  52. ![](figures/mva_nn.png){width=85%}
  53. \tiny \textcolor{gray}{(Garrido, Juste and Martinez, ALEPH 96-144)} \normalsize
  54. ::::
  55. :::
  56. ## Applications of machine learning in physics
  57. * Particle physics: Particle identification / classification
  58. * Astronomy: Galaxy morphology classification
  59. * Chemistry and material science: predict properties of new molecules / materials
  60. * Many-body quantum matter: classification of quantum phases
  61. \vspace{3ex}
  62. \scriptsize [\textcolor{gray}{Machine learning and the physical sciences, arXiv:1903.10563}](https://arxiv.org/abs/1903.10563) \normalsize
  63. ## Some successes and unsolved problems in AI
  64. ::: columns
  65. :::: {.column width=50%}
  66. ![](figures/ai_history.png){width=85%}
  67. \tiny \textcolor{gray}{M. Wooldridge, The road to conscious machines} \normalsize
  68. ::::
  69. :::: {.column width=50%}
  70. Impressive progress in certain fields:
  71. \small
  72. * Image recognition
  73. * Speech recognition
  74. * Recommendation systems
  75. * Automated translation
  76. * Analysis of medical data
  77. \normalsize
  78. \vfill
  79. How can we profit from these developments in physics?
  80. ::::
  81. :::
  82. ## The deep learning hype -- why now?
  83. Artificial neural networks have been around for decades. Why did deep learning take off after 2012?
  84. \vspace{5ex}
  85. * Improved hardware -- graphical processing units [GPUs]
  86. * Large data sets (e.g. images) distributed via the Internet
  87. * Algorithmic advances
  88. ## Different modeling approaches
  89. * Simple mathematical representation like linear regression. Favored by statisticians.
  90. * Complex deterministic models based on scientific understanding of the physical process. Favored by physicists.
  91. * Complex algorithms to make predictions that are derived from a huge number of past examples (“machine learning” as developed in the field of computer science). These are often black boxes.
  92. * Regression models that claim to reach causal conclusions. Used by economists.
  93. \tiny \textcolor{gray}{D. Spiegelhalter, The Art of Statistics – Learning from data} \normalsize
  94. ## Machine learning: The "hello world" problem
  95. ::: columns
  96. :::: {.column width=45%}
  97. Recognition of handwritten digits
  98. * MNIST database (Modified National Institute of Standards and Technology database)
  99. * 60,000 training images and 10,000 testing images labeled with correct answer
  100. * $28 \times 28$ pixels
  101. * Algorithms have reached "near-human performance"
  102. * Smallest error rate (2018): 0.18\%
  103. ::::
  104. :::: {.column width=55%}
  105. ![](figures/mnist.png)
  106. \tiny
  107. [\color{gray}{\texttt{https://en.wikipedia.org/wiki/MNIST\_database}}](https://en.wikipedia.org/wiki/MNIST_database)
  108. \normalsize
  109. ::::
  110. :::
  111. ## Machine learning: Image recognition
  112. ImageNet database
  113. * 14 million images, 22,000 categories
  114. * Since 2010, the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC): 1.4 million images, 1000 categories
  115. * In 2017, 29 of 38 competing teams got less than 5\% wrong
  116. \begin{figure}
  117. \centering
  118. \includegraphics[width=0.8\textwidth]{figures/imagenet.png}
  119. \end{figure}
  120. ## ImageNet: Large Scale Visual Recognition Challenge
  121. \begin{figure}
  122. \centering
  123. \includegraphics[width=0.8\textwidth]{figures/imagenet_challenge.png}
  124. \end{figure}
  125. \vfill
  126. \scriptsize
  127. \textcolor{gray}{O. Russakovsky et al, arXiv:1409.0575}
  128. \normalsize
  129. ## Adversarial attack
  130. \begin{figure}
  131. \centering
  132. \includegraphics[width=\textwidth]{figures/adversarial_attack.png}
  133. \end{figure}
  134. \vspace{3ex}
  135. \scriptsize [\textcolor{gray}{Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, arXiv:1412.6572v1}](https://arxiv.org/abs/1412.6572v1) \normalsize
  136. ## Types of machine learning
  137. ::: columns
  138. :::: {.column width=60%}
  139. Reinforcement learning
  140. \small
  141. * The machine ("the agent") predicts a scalar reward given once in a while
  142. * Weak feedback
  143. \normalsize
  144. ::::
  145. :::: {.column width=35%}
  146. \tiny [\textcolor{gray}{LeCun 2018, Power And Limits of Deep Learning}](https://www.youtube.com/watch?v=0tEhw5t6rhc) \normalsize
  147. ![](figures/videogame.png)
  148. ::::
  149. :::
  150. \vfill
  151. ::: columns
  152. :::: {.column width=60%}
  153. \vspace{1em}
  154. Supervised learning
  155. \small
  156. * The machine predicts a category based on labeled training data
  157. * Medium feedback
  158. \normalsize
  159. ::::
  160. :::: {.column width=35%}
  161. ![](figures/supervised_learning_car_plane.png)
  162. ::::
  163. :::
  164. \vfill
  165. ::: columns
  166. :::: {.column width=60%}
  167. \vspace{1em}
  168. Unsupervised learning
  169. \small
  170. * Describe/find hidden structure from "unlabeled" data
  171. * Cluster data in different sub-groups with similar properties
  172. \normalsize
  173. ::::
  174. :::: {.column width=35%}
  175. ![](figures/anomaly_detection.png)
  176. ::::
  177. :::
  178. ## Books on machine learning (1)
  179. ::: columns
  180. :::: {.column width=85%}
  181. Ian Goodfellow and Yoshua Bengio and Aaron Courville, \textit{Deep Learning}, free online [http://www.deeplearningbook.org/](http://www.deeplearningbook.org/)
  182. \vspace{8ex}
  183. Kevin Murphy, \textit{Probabilistic Machine Learning: An Introduction}, [draft pdf version](https://probml.github.io/pml-book/)
  184. \vspace{7ex}
  185. Aurelien Geron, \textit{Hands-On Machine Learning with Scikit-Learn and TensorFlow}
  186. ::::
  187. :::: {.column width=15%}
  188. ![](figures/deep_learning_book.png){width=65%}
  189. \vspace{3ex}
  190. ![](figures/book-murphy.png){width=65%}
  191. \vspace{3ex}
  192. ![](figures/hands_on_machine_learning.png){width=65%}
  193. ::::
  194. :::
  195. ## Books on machine learning (2)
  196. ::: columns
  197. :::: {.column width=85%}
  198. Francois Chollet, \textit{Deep Learning with Python}
  199. \vspace{10ex}
  200. Martin Erdmann, Jonas Glombitza, Gregor Kasieczka, Uwe Klemradt, \textit{Deep Learning for Physics Research}
  201. ::::
  202. :::: {.column width=15%}
  203. ![](figures/deep_learning_with_python.png){width=65%}
  204. \vspace{3ex}
  205. ![](figures/book_deep_learning_for_physics_research.png){width=65%}
  206. ::::
  207. :::
  208. ## Papers
  209. A high-bias, low-variance introduction to Machine Learning for physicists
  210. [https://arxiv.org/abs/1803.08823](https://arxiv.org/abs/1803.08823)
  211. \vspace{3ex}
  212. Machine learning and the physical sciences
  213. [https://arxiv.org/abs/1903.10563](https://arxiv.org/abs/1903.10563)
  214. ## Supervised learning in a nutshell
  215. * Supervised Machine Learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.
  216. * Each event is characterized by $n$ observables: $\vec x = (x_1, x_2, ..., x_n) \;$ \textcolor{gray}{"feature vector"}
  217. \begin{figure}
  218. \centering
  219. \raisebox{-0.5\height}{\includegraphics[width=0.69\textwidth]{figures/supervised_nutshell.png}}
  220. \raisebox{-0.5\height}{\includegraphics[width=0.30\textwidth]{figures/loss_fct.png}}
  221. \end{figure}
  222. * Design a function $y(\vec x, \vec w)$ with adjustable parameters $\vec w$
  223. * Design a loss function
  224. * Find the best parameters which minimize the loss (see the sketch below)
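A minimal sketch of this workflow on made-up toy data (all names and numbers here are purely illustrative): a linear model $y(x, \vec w)$, a squared-error loss, and a numerical minimization with `scipy.optimize.minimize`.

\footnotesize
```python
import numpy as np
from scipy.optimize import minimize

# toy labeled training data (made up for illustration)
rng = np.random.default_rng(42)
x = np.linspace(0., 5., 50)
t = 1.5 * x + 0.5 + rng.normal(0., 0.5, size=x.size)

# 1) design a function y(x, w) with adjustable parameters w
def y(x, w):
    return w[0] + w[1] * x

# 2) design a loss function (here: mean squared error)
def loss(w):
    return np.mean((y(x, w) - t) ** 2)

# 3) find the parameters that minimize the loss
result = minimize(loss, x0=[0., 0.])
print(result.x)   # should be close to (0.5, 1.5)
```
\normalsize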
  225. ## Supervised learning: classification and regression
  226. The codomain $Y$ of the function $y: X \to Y$ can be a set of labels or classes or a continuous domain, e.g., $\mathbb{R}$
  227. \vfill
  228. * $Y$ = finite set of labels $\quad \to \quad$ \textcolor{red}{classification}
  229. * binary classification: $Y = \{0,1\}$
  230. * multi-class classification: $Y = \{c_1, c_2, ..., c_n\}$
  231. * $Y$ = real numbers $\quad \to \quad$ \textcolor{red}{regression}
  232. \vfill
  233. \textcolor{gray}{"All the impressive achievements of deep learning amount to just curve fitting" \\[0.5cm]}
  234. \footnotesize
  235. \textcolor{gray}{J. Pearl, Turing Award Winner 2011\\}
  236. \tiny
  237. [\color{gray}{To Build Truly Intelligent Machines, Teach Them Cause and Effect, Quantamagazine}](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/)
  238. \normalsize
  239. ## Classification: Learning decision boundaries
  240. \begin{figure}
  241. \centering
  242. \includegraphics{figures/decision_boundaries.png}
  243. \end{figure}
  244. ## Supervised learning: Training, validation, and test sample
  245. * Decision boundary fixed with \textcolor{blue}{training sample}
  246. * Performance on training sample becomes better with more iterations
  247. * Danger of overtraining: Statistical fluctuations of the training sample will be learnt
  248. * \textcolor{blue}{Validation sample} = independent labeled data set not used for training $\rightarrow$ check for overtraining
  249. * Sign of overtraining: performance on validation sample becomes worse $\rightarrow$ Stop training when signs of overtraining are observed (early stopping)
  250. * Performance: apply classifier to independent \textcolor{blue}{test sample}
  251. * Often: test sample = validation sample (only small bias)
  252. ## Supervised learning: Cross validation
  253. Rule of thumb if training data is not expensive:
  254. ::: columns
  255. :::: {.column width=60%}
  256. * Training sample: 50%
  257. * Validation sample: 25%
  258. * Test sample: 25%
  259. \vspace{2ex}
  260. Cross validation (efficient use of scarce training data)
  261. * Split the training sample into $k$ independent subsets $T_k$ of the full sample $T$
  262. * Train on $T \setminus T_k$ resulting in $k$ different classifiers
  263. * For each training event there is one classifier that didn't use this event for training
  264. * Validation results are then combined
  265. ::::
  266. :::: {.column width=40%}
  267. \textcolor{gray}{Often test sample = validation sample (bias is rather small)}
  268. \vspace{10ex}
  269. ![](figures/cross_val.png)
  270. ::::
  271. :::
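A minimal cross-validation sketch with scikit-learn's `cross_val_score` (the iris data and the logistic-regression classifier are placeholders for your own data set and model):

\footnotesize
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# k = 5: train on 4/5 of the sample, validate on the remaining 1/5,
# repeated such that every event is used for validation exactly once
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())
```
\normalsize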
  272. ## Often used loss functions
  273. ::: columns
  274. :::: {.column width=45%}
  275. \textcolor{blue}{Square error loss}:
  276. * often used in regression
  277. ::::
  278. :::: {.column width=55%}
  279. $$ E(y(\vec x, \vec w), t) = (y(\vec x, \vec w) - t)^2 $$
  280. ::::
  281. :::
  282. \vfill
  283. ::: columns
  284. :::: {.column width=45%}
  285. \textcolor{blue}{Cross entropy}:
  286. * $t \in \{0,1\}$
  287. * $y(\vec x, \vec w)$: predicted probability for outcome $t=1$
  288. * often used in classification
  289. ::::
  290. :::: {.column width=55%}
  291. \begin{align*}
  292. E(y(\vec x, \vec w), t) = & - t \log y(\vec x, \vec w) \\ & - (1 - t) \log(1 - y(\vec x, \vec w))
  293. \end{align*}
  294. ::::
  295. :::
  296. ## More on entropy
  297. * Self-information of an event $x$: $I(x) = - \log p(x)$
  298. * in units of **nats** (1 nat = information gained by observing an event of probability $1/e$)
  299. \vfill
  300. * Shannon entropy: $H(P) = - \sum p_i \log p_i$
  301. * Expected amount of information in an event drawn from a distribution $P$
  302. * Measure of the minimum amount of bits needed on average to encode symbols drawn from a distribution
  303. \vfill
  304. * Cross entropy: $H(P,Q) = - E[\log Q] = - \sum p_i \log q_i$
  305. * Can be interpreted as a measure of the amount of bits needed when a wrong distribution Q is assumed while the data actually follows a distribution P
  306. * Measure of the dissimilarity between the distributions P and Q, i.e., a measure of how well the model Q describes the true distribution P (see the numerical example below)
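A small numerical example of these quantities for two arbitrary discrete distributions (natural logarithm, i.e. entropies in nats):

\footnotesize
```python
import numpy as np

p = np.array([0.5, 0.4, 0.1])   # "true" distribution P
q = np.array([0.8, 0.1, 0.1])   # model distribution Q

shannon_entropy = -np.sum(p * np.log(p))           # H(P)
cross_entropy   = -np.sum(p * np.log(q))           # H(P,Q) >= H(P)
kl_divergence   = cross_entropy - shannon_entropy  # "extra" nats from using Q instead of P

print(f"H(P) = {shannon_entropy:.3f} nats, H(P,Q) = {cross_entropy:.3f} nats, "
      f"D_KL(P||Q) = {kl_divergence:.3f} nats")
```
\normalsize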
  307. ## Hypothesis testing
  308. ::: columns
  309. :::: {.column width=55%}
  310. \includegraphics[width=\textwidth]{figures/signal_background_distr.png}
  311. ::::
  312. :::: {.column width=45%}
  313. \vspace{2ex}
  314. test statistic
  315. * a (usually scalar) variable which is a function of the data alone that can be used to test hypotheses
  316. * example: $\chi^2$ w.r.t. a theory curve
  317. ::::
  318. :::
  319. \textcolor{gray}{$\epsilon_\mathrm{B} \equiv \alpha$}: "background efficiency", i.e., prob. to misclassify bckg. as signal
  320. \textcolor{gray}{$\epsilon_\mathrm{S} \equiv 1 - \beta$}: "signal efficiency"
  321. \begin{center}
  322. \begin{tabular}{ l l l}
  323. & $H_0$ is true & $H_0$ is false (i.e., $H_1$ is true)\\
  324. \hline
  325. $H_0$ is rejected & Type I error ($\alpha$) & Correct decision ($1 - \beta$) \\
  326. $H_0$ is not rejected & Correct decision ($1 - \alpha$) & Type II error ($\beta$) \\
  327. \hline
  328. \end{tabular}
  329. \end{center}
  330. ## Neyman-Pearson Lemma
  331. The likelihood ratio
  332. $$ t(\vec x) = \frac{f(\vec x|H_1)}{f(\vec x|H_0)} $$
  333. is an optimal test statistic, i.e., it provides the highest "signal efficiency" $1-\beta$ for a given "background efficiency" $\alpha$. Accept the signal hypothesis $H_1$ (reject $H_0$) if $t(\vec x) > c$.
  334. \vfill
  335. Problem: the underlying pdf's are almost never known explicitly.
  336. \vfill
  337. Two approaches
  338. 1. Estimate signal and background pdf's and construct test statistic based on Neyman-Pearson lemma
  339. 2. Decision boundaries determined directly without approximating the pdf's (linear discriminants, decision trees, neural networks, ...)
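A sketch of approach 1 for a deliberately simple case in which the pdfs are assumed to be known one-dimensional Gaussians (the means, widths, and the cut value $c$ are arbitrary choices for illustration):

\footnotesize
```python
import numpy as np
from scipy.stats import norm

signal_pdf     = norm(loc=1.0, scale=0.5)   # f(x|H1), assumed known
background_pdf = norm(loc=0.0, scale=1.0)   # f(x|H0), assumed known

def t(x):
    """Likelihood ratio t(x) = f(x|H1) / f(x|H0)."""
    return signal_pdf.pdf(x) / background_pdf.pdf(x)

x = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(t(x))          # larger values are more signal-like
print(t(x) > 1.5)    # classify as signal if t(x) > c, here c = 1.5
```
\normalsize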
  340. ## Estimating PDFs from Histograms?
  341. \begin{center}
  342. \includegraphics[width=0.8\textwidth]{figures/pdf_from_2d_histogram.png}
  343. $\color{gray} \text{approximate PDF by} \; N(x,y|S) \; \text{and} \; N(x,y|B)$
  344. \end{center}
  345. $M$ bins per variable in $d$ dimensions: $M^d$ cells $\to$ hard to generate enough training data (often not practical for $d > 1$)
  346. In general in machine learning, problems related to a large number of dimensions of the feature space are referred to as the \textcolor{red}{"curse of dimensionality"}
  347. ## Na$\text{\"i}$ve Bayesian Classifier (also called "Projected Likelihood Classification")
  348. Application of the Neyman-Pearson lemma (ignoring correlations between the $x_i$):
  349. $$ f(x_1, x_2, ..., x_n) \quad \mbox{approximated as} \quad L = f_1(x_1) \cdot f_2(x_2) \cdot ... \cdot f_n(x_n) $$
  350. \begin{align*}
  351. \mbox{where} \quad
  352. f_1(x_1) & = \int \mathrm dx_2 \mathrm dx_3 ... \mathrm dx_n\; f(x_1, x_2, ..., x_n) \\
  353. f_2(x_2) & = \int \mathrm dx_1 \mathrm dx_3 ... \mathrm dx_n\; f(x_1, x_2, ..., x_n) \\
  354. \vdots
  355. \end{align*}
  356. Classification of feature vector $x$:
  357. $$
  358. y(\vec x) = \frac{L_\mathrm{s}(\vec x)}{L_\mathrm{s}(\vec x) + L_\mathrm{b}(\vec x)} = \frac{1}{1 + L_\mathrm{b}(\vec x) / L_\mathrm{s}(\vec x)}
  359. $$
  360. Performance not optimal if true PDF does not factorize
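A minimal sketch with scikit-learn's `GaussianNB`, which implements this factorized ansatz with Gaussian one-dimensional densities per class (the toy data set is just an example):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# toy "signal vs. background" data set with 4 features
X, y = make_classification(n_samples=2000, n_features=4, random_state=0)

nb = GaussianNB()          # assumes f(x1,...,xn) = f1(x1) * ... * fn(xn) per class
nb.fit(X, y)

# y(x) = L_s / (L_s + L_b): predicted probability of the signal class (label 1)
print(nb.predict_proba(X[:5])[:, 1])
```
\normalsize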
  361. ## k-Nearest Neighbor Method (1)
  362. $k$-NN classifier:
  363. * Estimates probability density around the input vector
  364. * $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$
  365. \vspace{2ex}
  366. The algorithm finds the $k$ nearest neighbors:
  367. $$ k = k_s + k_b $$
  368. Probability for the event to be of signal type:
  369. $$ p_s(\vec x) = \frac{k_s(\vec x)}{k_s(\vec x) + k_b(\vec x)} $$
  370. ## k-Nearest Neighbor Method (2)
  371. ::: columns
  372. :::: {.column width=60%}
  373. Simplest choice for distance measure in feature space is the Euclidean distance:
  374. $$ R = |\vec x - \vec y|$$
  375. Better: take correlations between variables into account:
  376. $$R = \sqrt{(\vec{x}-\vec{y})^T \mat{V}^{-1} (\vec{x}-\vec{y})}$$
  377. $$ \mat{V} = \text{covariance matrix}, R = \text{"Mahalanobis distance"}$$
  378. ::::
  379. :::: {.column width=40%}
  380. ![](figures/knn.png)
  381. ::::
  382. :::
  383. \vfill
  384. The $k$-NN classifier performs best when the boundary that separates signal and background events has irregular features that cannot easily be approximated by parametric learning methods.
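A sketch of both distance choices with scikit-learn's `KNeighborsClassifier` on toy data; the Mahalanobis metric is selected by passing the inverse covariance matrix (the exact keyword arguments may depend on the scikit-learn version):

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=4, random_state=0)

# k-NN with the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
print(knn.predict_proba(X[:5])[:, 1])     # p_s(x) = k_s / (k_s + k_b)

# k-NN with the Mahalanobis distance (takes correlations into account)
VI = np.linalg.inv(np.cov(X, rowvar=False))
knn_m = KNeighborsClassifier(n_neighbors=20, algorithm='brute',
                             metric='mahalanobis', metric_params={'VI': VI})
knn_m.fit(X, y)
print(knn_m.predict_proba(X[:5])[:, 1])
```
\normalsize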
  385. ## Fisher Linear Discriminant
  386. Linear discriminant is simple. Can still be optimal if amount of training data is limited.
  387. Ansatz for test statistic: $$ y(\vec x) = \sum_{i=1}^n w_i x_i = \vec w^\intercal \vec x $$
  388. Choose parameters $w_i$ so that separation between signal and background distribution is maximum.
  389. \vfill
  390. Need to define "separation".
  391. ::: columns
  392. :::: {.column width=45%}
  393. \begin{center}
  394. Fisher: maximize $$ J(\vec w) = \frac{(\tau_s - \tau_b)^2}{\Sigma_s^2 + \Sigma_b^2} $$
  395. \end{center}
  396. ::::
  397. :::: {.column width=55%}
  398. ![](figures/fisher.png)
  399. ::::
  400. :::
  401. ## Fisher Linear Discriminant: Determining the Coefficients $w_i$
  402. ::: columns
  403. :::: {.column width=60%}
  404. Coefficients are obtained from: $$ \frac{\partial J}{\partial w_i} = 0 $$
  405. \vspace{2ex}
  406. Linear decision boundaries
  407. \vspace{5ex}
  408. Weight vector $\vec w$ can be interpreted as a direction in feature space onto which the events are projected.
  409. ::::
  410. :::: {.column width=40%}
  411. ![](figures/fisher_linear_decision_boundary.png)
  412. ::::
  413. :::
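Solving $\partial J / \partial w_i = 0$ leads to the standard closed-form solution $\vec w \propto (V_s + V_b)^{-1}(\vec\mu_s - \vec\mu_b)$, with the class means $\vec\mu$ and covariance matrices $V$; a small numpy sketch on toy data:

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=4, random_state=0)
Xs, Xb = X[y == 1], X[y == 0]                 # "signal" and "background" events

mu_s, mu_b = Xs.mean(axis=0), Xb.mean(axis=0)
V_s = np.cov(Xs, rowvar=False)
V_b = np.cov(Xb, rowvar=False)

# Fisher weights: w proportional to (V_s + V_b)^(-1) (mu_s - mu_b)
w = np.linalg.solve(V_s + V_b, mu_s - mu_b)

# project each event onto w -> scalar test statistic y(x) = w^T x
print((Xs @ w).mean(), (Xb @ w).mean())       # projected means of signal and background
```
\normalsize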
  414. ## Linear regression revisited
  415. \vfill
  416. ::: columns
  417. :::: {.column width=50%}
  418. \small \textcolor{gray}{"Galton family heights data": \\ origin of the term "regression"} \normalsize
  419. ![](figures/03_ml_basics_galton_linear_regression_iminuit.pdf)
  420. ::::
  421. :::: {.column width=50%}
  422. * data: $\{x_i,y_i\}$ \
  423. * objective: predict $y = f(x)$
  424. * model: $f(x; \vec \theta) = m x + b, \quad \vec \theta = (m, b)$
  425. * loss function: $J(\theta|x,y) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2$
  426. * model training: optimal parameters $\hat{\vec{\theta}} = \mathrm{arg\,min} \, J(\vec \theta)$
  427. ::::
  428. :::
  429. ## Linear regression
  430. * Data: vectors with $p$ components ("features"): $\vec x = (x_1, ..., x_p)$
  431. * $n$ observations: $\{\vec x_i, y_i\}, \quad i = 1, ..., n$
  432. * Prediction for given vector $x$:
  433. $$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x \quad \text{where } x_0 := 1 $$
  434. * Find the weights that minimize the loss function:
  435. $$\hat{\vec{w}} = \underset{\vec w}{\mathrm{arg\,min}} \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2$$
  436. * In case of linear regression closed-form solution exists:
  437. $$ \hat{\vec{w}} = (\mat{X}^\intercal \mat{X})^{-1} \mat{X}^\intercal \vec y \quad \text{where} \; \mat{X} \in \mathbb{R}^{n \times (p+1)}$$
  438. * $X$ is called the design matrix; row $i$ of $X$ is $\vec x_i$ (with $x_0 = 1$ prepended)
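A numpy sketch of the closed-form solution on made-up data, with the column of ones ($x_0 := 1$) prepended to form the design matrix:

\footnotesize
```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
features = rng.normal(size=(n, p))
y = 3.0 + 1.5 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(0., 0.1, n)

# design matrix: first column of ones for the intercept w_0
X = np.hstack([np.ones((n, 1)), features])

# normal equation: w = (X^T X)^(-1) X^T y  (solve instead of explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # approximately [3.0, 1.5, -2.0]
```
\normalsize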
  439. ## Linear regression with regularization
  440. ::: columns
  441. :::: {.column width=45%}
  442. * Standard loss function
  443. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 $$
  444. * Ridge regression
  445. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \lVert \vec w \rVert_2^2$$
  446. * LASSO regression
  447. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \lVert \vec w \rVert_1 $$
  448. ::::
  449. :::: {.column width=55%}
  450. \vfill
  451. ![](figures/L1vsL2.pdf)
  452. \small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
  453. ::::
  454. :::
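A sketch comparing the three loss functions with scikit-learn on toy data in which only two of ten features carry information (the regularization strengths $\lambda$, called `alpha` in scikit-learn, are arbitrary choices):

\footnotesize
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0., 0.5, 200)

for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X, y)
    # LASSO sets many of the coefficients exactly to zero (sparse solution)
    print(type(model).__name__, np.round(model.coef_, 2))
```
\normalsize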
  455. ## Logistic regression (1)
  456. * Consider binary classification task, e.g., $y_i \in \{0,1\}$
  457. * Objective: Predict probability for outcome $y=1$ given an observation $\vec x$
  458. * Starting with linear "score"
  459. $$ s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x$$
  460. * Define function that translates $s$ into a quantity that has the properties of a probability
  461. $$ \sigma(s) = \frac{1}{1+e^{-s}} $$
  462. * We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.
  463. ## Logistic regression (2)
  464. * Consider feature vector $\vec x$. For a given set of weights $\vec w$ the model predicts
  465. * a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x)$ for outcome $y=1$
  466. * a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x)$ for outcome $y=0$
  467. * The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$ and the observations $y_i$ are fixed).
  468. * Likelihood for the full data sample ($n$ observations)
  469. $$ L(\vec w) = \prod_{i=1}^n L_i(\vec w) = \prod_{i=1}^n \sigma(\vec w^\intercal \vec x_i)^{y_i} \,(1-\sigma(\vec w^\intercal \vec x_i))^{1-y_i} $$
  470. * Maximizing the log-likelihood $\ln L(\vec w)$ corresponds to minimizing the loss function
  471. $$ C(\vec w) = - \ln L(\vec w) = \sum_{i=1}^n - y_i \ln \sigma(\vec w^\intercal \vec x_i) -
  472. (1-y_i) \ln(1-\sigma(\vec w^\intercal \vec x_i))$$
  473. * This is nothing else but the cross-entropy loss function
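A minimal from-scratch sketch of this maximum-likelihood fit: the cross-entropy loss is minimized by plain gradient descent on toy data (learning rate, number of iterations, and the "true" weights $(-1, 2)$ are arbitrary choices):

\footnotesize
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
X = np.hstack([np.ones((500, 1)), x])      # x_0 := 1 for the bias w_0
p_true = 1. / (1. + np.exp(-(2. * x[:, 0] - 1.)))
y = (rng.uniform(size=500) < p_true).astype(float)

def sigma(s):
    return 1. / (1. + np.exp(-s))

w = np.zeros(2)
for _ in range(2000):
    p = sigma(X @ w)                  # predicted probabilities for y = 1
    grad = X.T @ (p - y) / len(y)     # gradient of the mean cross entropy
    w -= 0.5 * grad                   # gradient-descent step
print(w)                              # roughly (-1, 2)
```
\normalsize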
  474. ## scikit-learn
  475. ::: columns
  476. :::: {.column width=70%}
  477. * Free software machine learning library for Python
  478. * Initial release: 2007
  479. * Features various classification, regression and clustering algorithms, including k-nearest neighbors, multi-layer perceptrons, support vector machines, random forests, gradient boosting, and k-means
  480. * Scikit-learn is one of the most popular machine learning libraries on GitHub
  481. * [https://scikit-learn.org/](https://scikit-learn.org/)
  482. ::::
  483. :::: {.column width=30%}
  484. \vspace{7ex}
  485. \begin{figure}
  486. \centering
  487. \includegraphics[width=0.85\textwidth]{figures/scikit-learn.png}
  488. \end{figure}
  489. ::::
  490. :::
  491. ## Example 1 - Probability of passing an exam (logistic regression) (1)
  492. Objective: predict the probability that someone passes an exam based on the number of hours studying
  493. $$ p_\mathrm{pass} = \sigma(s) = \frac{1}{1+e^{-s}}, \quad s = w_1 t + w_0, \quad t = \text{\# hours}$$
  494. ::: columns
  495. :::: {.column width=40%}
  496. * Data set: \
  497. * preparation time $t$ in hours
  498. * passed / not passed (0/1)
  499. * Parameters need to be determined through numerical minimization
  500. * $w_0 = -4.0777$
  501. * $w_1 = 1.5046$
  502. \vspace{1.5ex}
  503. \footnotesize
  504. [\textcolor{gray}{03\_ml\_basics\_logistic\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_logistic_regression.ipynb)
  505. \normalsize
  506. ::::
  507. :::: {.column width=60%}
  508. ![](figures/03_ml_basics_logistic_regression.pdf){width=90%}
  509. ::::
  510. :::
  511. ## Example 1 - Probability of passing an exam (logistic regression) (2)
  512. \footnotesize
  513. \textcolor{gray}{Read data from file:}
  514. ```python
  515. # data: 1. hours studied, 2. passed (0/1)
  516. df = pd.read_csv(filename, engine='python', sep='\s+')
  517. x_tmp = df['hours_studied'].values
  518. x = np.reshape(x_tmp, (-1, 1))
  519. y = df['passed'].values
  520. ```
  521. \vfill
  522. \textcolor{gray}{Fit the data:}
  523. ```python
  524. from sklearn.linear_model import LogisticRegression
  525. clf = LogisticRegression(penalty='none', fit_intercept=True)
  526. clf.fit(x, y);
  527. ```
  528. \vfill
  529. \textcolor{gray}{Calculate predictions:}
  530. ```python
  531. hours_studied_tmp = np.linspace(0., 6., 1000)
  532. hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
  533. y_pred = clf.predict_proba(hours_studied)
  534. ```
  535. \normalsize
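As a cross-check (continuing the code above; the 4-hour value is just an example), the fitted parameters $w_0$ and $w_1$ quoted on the previous slide can be read off the classifier:

\footnotesize
```python
# intercept w0 and slope w1 of the linear score s = w1 * t + w0
print("w0 =", clf.intercept_[0])
print("w1 =", clf.coef_[0][0])

# predicted probability to pass after, e.g., 4 hours of preparation
print(clf.predict_proba([[4.0]])[0][1])
```
\normalsize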
  536. ## Precision and recall
  537. ::: columns
  538. :::: {.column width=50%}
  539. \textcolor{blue}{Precision:}\
  540. Fraction of correctly classified instances among all instances that obtain a certain class label.
  541. $$ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
  542. \begin{center}
  543. \textcolor{gray}{"purity"}
  544. \end{center}
  545. ::::
  546. :::: {.column width=50%}
  547. \textcolor{blue}{Recall:}\
  548. Fraction of positive instances that are correctly classified.
  549. \vspace{2.9ex}
  550. $$ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
  551. \begin{center}
  552. \textcolor{gray}{"efficiency"}
  553. \end{center}
  554. ::::
  555. :::
  556. \vfill
  557. \begin{center}
  558. \textcolor{gray}{TP: true positives, FP: false positives, FN: false negatives}
  559. \end{center}
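A small sketch with made-up labels and predictions, comparing the definitions above with the corresponding scikit-learn functions:

\footnotesize
```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # toy true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))
```
\normalsize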
  560. ## Example 2: Heart disease data set (logistic regression) (1)
  561. \scriptsize
  562. \textcolor{gray}{Read data:}
  563. ```python
  564. filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/data/heart.csv"
  565. df = pd.read_csv(filename)
  566. df
  567. ```
  568. \vfill
  569. ![](figures/heart_table.png){width=70%}
  570. \normalsize
  571. \vspace{1.5ex}
  572. \footnotesize
  573. [\textcolor{gray}{03\_ml\_basics\_log\_regr\_heart\_disease.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)
  574. \normalsize
  575. ## Example 2: Heart disease data set (logistic regression) (2)
  576. \footnotesize
  577. \textcolor{gray}{Define array of labels and feature vectors}
  578. ```python
  579. y = df['target'].values
  580. X = df[[col for col in df.columns if col!="target"]]
  581. ```
  582. \vfill
  583. \textcolor{gray}{Generate training and test data sets}
  584. ```python
  585. from sklearn.model_selection import train_test_split
  586. X_train, X_test, y_train, y_test = train_test_split(
  587.     X, y, test_size=0.5, shuffle=True)
  588. ```
  589. \vfill
  590. \textcolor{gray}{Fit the model}
  591. ```python
  592. from sklearn.linear_model import LogisticRegression
  593. lr = LogisticRegression(penalty='none',
  594.     fit_intercept=True, max_iter=1000, tol=1E-5)
  595. lr.fit(X_train, y_train)
  596. ```
  597. \normalsize
  598. ## Example 2: Heart disease data set (logistic regression) (3)
  599. \footnotesize
  600. \textcolor{gray}{Test predictions on test data set:}
  601. ```python
  602. from sklearn.metrics import classification_report
  603. y_pred_lr = lr.predict(X_test)
  604. print(classification_report(y_test, y_pred_lr))
  605. ```
  606. \vfill
  607. \textcolor{gray}{Output:}
  608. ```
  609.               precision    recall  f1-score   support
  610.            0       0.75      0.86      0.80        63
  611.            1       0.89      0.80      0.84        89
  612.     accuracy                           0.82       152
  613.    macro avg       0.82      0.83      0.82       152
  614. weighted avg       0.83      0.82      0.82       152
  615. ```
  616. ## Example 2: Heart disease data set (logistic regression) (4)
  617. \textcolor{gray}{Compare to another classifier using the \textit{receiver operating characteristic} (ROC) curve}
  618. \vfill
  619. \textcolor{gray}{Let's take the random forest classifier}
  620. \footnotesize
  621. ```python
  622. from sklearn.ensemble import RandomForestClassifier
  623. rf = RandomForestClassifier(max_depth=3)
  624. rf.fit(X_train, y_train)
  625. ```
  626. \normalsize
  627. \vfill
  628. \textcolor{gray}{Use \texttt{roc\_curve} from scikit-learn}
  629. \footnotesize
  630. ```python
  631. from sklearn.metrics import roc_curve
  632. y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities
  633. fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])
  634. y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities
  635. fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])
  636. ```
  637. \normalsize
  638. ## Example 2: Heart disease data set (logistic regression) (5)
  639. ::: columns
  640. :::: {.column width=50%}
  641. \scriptsize
  642. ```python
  643. plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
  644. plt.plot(tpr_rf, 1-fpr_rf, label="random forest")
  645. ```
  646. \vspace{5ex}
  647. \normalsize
  648. \textcolor{gray}{Classifiers can be compared with the \textit{area under curve} (AUC) score.}
  649. \scriptsize
  650. ```python
  651. from sklearn.metrics import roc_auc_score
  652. auc_lr = roc_auc_score(y_test, y_pred_prob_lr[:,1])
  653. auc_rf = roc_auc_score(y_test, y_pred_prob_rf[:,1])
  654. print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")
  655. ```
  656. \vspace{5ex}
  657. \normalsize
  658. \textcolor{gray}{This gives}
  659. \scriptsize
  660. ```
  661. AUC scores: 0.82, 0.83
  662. ```
  663. \normalsize
  664. ::::
  665. :::: {.column width=50%}
  666. \begin{figure}
  667. \centering
  668. \includegraphics[width=0.96\textwidth]{figures/03_ml_basics_log_regr_heart_disease.pdf}
  669. \end{figure}
  670. ::::
  671. :::
  672. ## Multinomial logistic regression: Softmax function
  673. In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
  674. \vfill
  675. Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
  676. \vfill
  677. A probability for class $i$ can be predicted with the softmax function:
  678. $$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
  679. The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.
  680. \vfill
  681. Multinomial logistic regression is also known as softmax regression.
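A small numpy sketch of the softmax function (the scores are arbitrary numbers):

\footnotesize
```python
import numpy as np

def softmax(s):
    # subtract the maximum score for numerical stability (does not change the result)
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # scores s_i for k = 3 classes
probs = softmax(scores)
print(probs, probs.sum())            # class probabilities, summing to 1
```
\normalsize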
  682. ## Example 3: Iris data set (softmax regression) (1)
  683. Iris flower data set
  684. * Introduced in 1936 in a paper by Ronald Fisher
  685. * Task: classify flowers
  686. * Three species: iris setosa, iris virginica and iris versicolor
  687. * Four features: petal width and length, sepal width/length, in centimeters
  688. ::: columns
  689. :::: {.column width=40%}
  690. \begin{figure}
  691. \centering
  692. \includegraphics[width=0.95\textwidth]{figures/iris_dataset.png}
  693. \end{figure}
  694. ::::
  695. :::: {.column width=60%}
  696. \vspace{2ex}
  697. \footnotesize
  698. [\textcolor{gray}{03\_ml\_basics\_iris\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_iris_softmax_regression.ipynb)
  699. \vspace{19ex}
  700. \scriptsize
  701. [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
  702. [https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
  703. \normalsize
  704. ::::
  705. :::
  706. ## Example 3: Iris data set (softmax regression) (2)
  707. \textcolor{gray}{Get data set}
  708. \footnotesize
  709. ```python
  710. # import some data to play with
  711. # columns: Sepal Length, Sepal Width, Petal Length and Petal Width
  712. iris = datasets.load_iris()
  713. X = iris.data
  714. y = iris.target
  715. # split data into training and test data sets
  716. x_train, x_test, y_train, y_test = train_test_split(
  717.     X, y, test_size=0.5, random_state=42)
  718. ```
  719. \normalsize
  720. \vfill
  721. \textcolor{gray}{Softmax regression}
  722. \footnotesize
  723. ```python
  724. from sklearn.linear_model import LogisticRegression
  725. log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
  726. log_reg.fit(x_train, y_train);
  727. ```
  728. \normalsize
  729. ## Example 3 : Iris data set (softmax regression) (3)
  730. ::: columns
  731. :::: {.column width=70%}
  732. \textcolor{gray}{Accuracy and confusion matrix for different classifiers}
  733. \footnotesize
  734. ```python
  735. for clf in [log_reg, kn_neigh, fisher_ld]:
  736.     y_pred = clf.predict(x_test)
  737.     acc = accuracy_score(y_test, y_pred)
  738.     print(type(clf).__name__)
  739.     print(f"accuracy: {acc:0.2f}")
  740.     # confusion matrix:
  741.     # rows: true class, columns: predicted class
  742.     print(confusion_matrix(y_test, y_pred), "\n")
  743. ```
  744. \normalsize
  745. ::::
  746. :::: {.column width=30%}
  747. \footnotesize
  748. ```
  749. LogisticRegression
  750. accuracy: 0.96
  751. [[29 0 0]
  752. [ 0 23 0]
  753. [ 0 3 20]]
  754. KNeighborsClassifier
  755. accuracy: 0.95
  756. [[29 0 0]
  757. [ 0 23 0]
  758. [ 0 4 19]]
  759. LinearDiscriminantAnalysis
  760. accuracy: 0.99
  761. [[29 0 0]
  762. [ 0 23 0]
  763. [ 0 1 22]]
  764. ```
  765. \normalsize
  766. ::::
  767. :::
  768. ## General remarks on multi-variate analyses (MVAs)
  769. * MVA Methods
  770. * More effective than classic cut-based analyses
  771. * Take correlations of input variables into account
  772. \vfill
  773. * Important: find good input variables for MVA methods
  774. * Good separation power between S and B
  775. * No strong correlation among variables
  776. * No correlation with the parameters you try to measure in your signal sample!
  777. \vfill
  778. * Pre-processing
  779. * Apply obvious variable transformations and let MVA method do the rest
  780. * Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$ use $|\cos \theta|$ and not $\cos \theta$ as input variable
  781. * It is generally useful to bring all input variables to a similar numerical range
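A minimal sketch of the last point with scikit-learn's `StandardScaler` (the iris data serve as a stand-in for your own training and test samples):

\footnotesize
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# standardization: subtract the mean and divide by the std. deviation of each feature
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training sample only
X_test_scaled  = scaler.transform(X_test)        # apply the same transformation to the test sample
print(X_train_scaled.mean(axis=0).round(2), X_train_scaled.std(axis=0).round(2))
```
\normalsize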
  782. ## Example of feature transformation
  783. \begin{figure}
  784. \centering
  785. \includegraphics[width=0.95\textwidth]{figures/feature_transformation.png}
  786. \end{figure}
  787. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  788. ::: columns
  789. :::: {.column width=50%}
  790. \small
  791. * Cosmic gamma rays (30 GeV - 30 TeV).
  792. * Cherenkov light from air showers
  793. * Background: air showers caused by hadrons.
  794. \normalsize
  795. \begin{figure}
  796. \centering
  797. \includegraphics[width=0.85\textwidth]{figures/magic_photo_small.png}
  798. \end{figure}
  799. ::::
  800. :::: {.column width=50%}
  801. ![](figures/magic_sketch.png)
  802. ::::
  803. :::
  804. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  805. \begin{figure}
  806. \centering
  807. \includegraphics[width=0.75\textwidth]{figures/magic_shower_em_had_small.png}
  808. \end{figure}
  809. ::: columns
  810. :::: {.column width=50%}
  811. \begin{center}
  812. Gamma shower
  813. \end{center}
  814. ::::
  815. :::: {.column width=50%}
  816. \begin{center}
  817. Hadronic shower
  818. \end{center}
  819. ::::
  820. :::
  821. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  822. \begin{figure}
  823. \centering
  824. \includegraphics[width=0.95\textwidth]{figures/magic_shower_parameters.png}
  825. \end{figure}
  826. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  827. MAGIC data set \
  828. \tiny
  829. [\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope}](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope)
  830. \normalsize
  831. \scriptsize
  832. ```
  833. 1. fLength: continuous # major axis of ellipse [mm]
  834. 2. fWidth: continuous # minor axis of ellipse [mm]
  835. 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
  836. 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
  837. 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
  838. 6. fAsym: continuous # dist. from highest pixel to center, proj. onto major axis [mm]
  839. 7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
  840. 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
  841. 9. fAlpha: continuous # angle of major axis with vector to origin [deg]
  842. 10. fDist: continuous # distance from origin to center of ellipse [mm]
  843. 11. class: g,h # gamma (signal), hadron (background)
  844. g = gamma (signal): 12332
  845. h = hadron (background): 6688
  846. For technical reasons, the number of h events is underestimated.
  847. In the real data, the h class represents the majority of the events.
  848. ```
  849. \normalsize
  850. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  851. \small
  852. [\textcolor{gray}{03\_ml\_basics\_ex\_1\_magic.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
  853. \normalsize
  854. a) Create for each variable a figure with the distributions for gammas and hadrons overlaid.
  855. b) Create training and test data set. The test data should amount to 50% of the total data set.
  856. c) Define the logistic regressor and fit the training data
  857. d) Determine the model accuracy and the AUC score
  858. e) Plot the ROC curve (background rejection vs signal efficiency)
  859. ## Exercise 2: Hand-written digit recognition with logistic regression
  860. \small
  861. [\textcolor{gray}{03\_ml\_basics\_ex\_2\_mnist\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_2_mnist_softmax_regression.ipynb)
  862. \normalsize
  863. a) Define logistic regressor from scikit-learn and fit data
  864. b) Use \texttt{classification\_report} from scikit-learn to determine precision and recall
  865. c) Read in a hand-written digit and classify it. Print the probabilities for each digit. Determine the digit with the highest probability.
  866. d) (Optional) Create your own hand-written digit with a program like GIMP and check what the classifier does
  867. \begin{figure}
  868. \centering
  869. \includegraphics[width=0.85\textwidth]{figures/handwritten_digits.png}
  870. \end{figure}
  871. Hint: You can install required packages on the jupyter hub server like so:
  872. \scriptsize
  873. ```
  874. !pip3 install --user pypng
  875. ```
  876. \normalsize
  877. ## Exercise 3: Data preprocessing
  878. a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
  879. b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case?