% Introduction to Data Analysis and Machine Learning in Physics: \ 3. Machine Learning Basics, Multivariate Analysis
% Jörg Marks, Klaus Reygers
% Studierendentage, 11-14 April 2023

## Multi-variate analyses (MVA)

* General question
\vspace{0.1cm}
There are two categories of distinguishable data, S (signal) and B (background),
described by a set of variables. What are the criteria for separating
the two samples?
* A single criterion is not sufficient to distinguish S and B
* Reduce the variable space to probabilities for S or B
\vspace{0.1cm}
* Classification of measurements using a set of observables $(V_1,V_2,....,V_n)$
* Find optimal separation conditions, taking correlations into account
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/SandBcuts.jpeg}
\end{figure}

## Multi-variate analyses (MVA)

* Regression - in the multidimensional observable space $(V_1,V_2,....,V_n)$
a functional relationship with optimal parameters is determined
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/regression.jpeg}
\end{figure}
* supervised regression: the model is known
* unsupervised regression: the model is unknown
* maximum-likelihood fits are used for the parameter determination

## MVA Classification in N Dimensions

For each event there are N measured variables
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/classificationVar.jpeg}
\end{figure}
::: columns
:::: {.column width=70%}
* Search for a mathematical transformation $F$ of the N-dimensional
input space to a one-dimensional output space $F(\vec V) : \mathbb{R}^N \rightarrow \mathbb{R}$
* A simple cut in $F$ implements a complex cut in the N-dimensional variable space
* Determine $F(\vec V)$ using a model and fit the parameters
::::
::::{.column width=30%}
\includegraphics[]{figures/response.jpeg}
::::
:::

## MVA Classification in N Dimensions

::: columns
:::: {.column width=60%}
* Parameters \newline
Important measures to quantify the separation quality (see the numerical sketch below) \newline \newline
Efficiency: $\epsilon = \frac{N_S (F>F_0)}{N_S}$ \newline
Purity: $\pi = \frac{N_S (F>F_0)}{(N_S + N_B)(F>F_0)}$
::::
::::{.column width=40%}
\includegraphics[]{figures/response.jpeg}
::::
:::
\vspace{0.3cm}
::: columns
:::: {.column width=60%}
* Receiver Operating Characteristic (ROC) \newline
Errors in classification \newline
\includegraphics[width=0.7\textwidth]{figures/error.jpeg}
::::
::::{.column width=40%}
\includegraphics[]{figures/roc.jpeg}
::::
:::
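
A minimal numerical sketch of these definitions, using assumed toy arrays `F_S` and `F_B` of classifier outputs for signal and background (not taken from the lecture notebooks); scanning the cut value $F_0$ yields the ROC curve:

\footnotesize

```python
import numpy as np

# toy classifier outputs for signal and background (assumed for illustration)
rng = np.random.default_rng(seed=1)
F_S = rng.normal(1.0, 1.0, 10000)    # F(V|S)
F_B = rng.normal(-1.0, 1.0, 10000)   # F(V|B)

def efficiency_purity(F0):
    n_s = np.sum(F_S > F0)           # N_S(F > F0): signal events passing the cut
    n_b = np.sum(F_B > F0)           # N_B(F > F0): background events passing the cut
    return n_s / len(F_S), n_s / (n_s + n_b)   # efficiency, purity

# ROC curve: signal efficiency vs. background rejection for a scan of F0
cuts = np.linspace(-4.0, 4.0, 100)
eff_s = np.array([np.mean(F_S > c) for c in cuts])
rej_b = np.array([1.0 - np.mean(F_B > c) for c in cuts])
```
\normalsize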

## MVA Classification in N Dimensions

::: columns
:::: {.column width=60%}
* Interpretation of $F(\vec V)$
* The distributions of \textcolor{blue}{$F(\vec V|S)$} and \textcolor{red}{$F(\vec V|B)$} are interpreted as probability density functions (PDF), \textcolor{blue}{$PDF_S(F)$} and \textcolor{red}{$PDF_B(F)$}
* For a given $F_0$ the signal and background probabilities for given fractions $f_S$ and $f_B$ can be determined \newline
$P ( data = S | F)= \frac {\color {blue} {f_S \cdot PDF_S(F)}} { \color {red} {f_B \cdot PDF_B(F)} + \color {blue} {f_S \cdot PDF_S(F)} }$
::::
::::{.column width=40%}
\includegraphics[]{figures/response.jpeg}
::::
:::
\vspace{0.3cm}
* A cut in the one-dimensional variable $F(\vec V) = F_0$, accepting all events on the right, determines the signal and background efficiencies (background rejection). A systematic variation of $F_0$ gives the ROC curve. \newline
\definecolor{darkgreen}{RGB}{0,125,0}
* \color{darkgreen}{A cut in $F(\vec V)$ corresponds to a complex hyperplane, which cannot necessarily be described by a function.}

## Simple Cuts in Variables

* The simplest classifier for selecting signal events is a set of cuts on all variables that show a separation
* The output is binary and not a probability for $S$ or $B$.
* The cuts are optimized by maximizing the background suppression for given signal efficiencies (see the sketch below)
* Significance $sig = \epsilon_S \cdot N_S / \sqrt{ \epsilon_S \cdot N_S + \epsilon_B( \epsilon_S) \cdot N_B}$
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/cutInVariables.jpeg}
\end{figure}
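
As an illustration (a toy sketch, not from the lecture notebooks), a scan of a rectangular cut in one variable that picks the cut value maximizing the significance defined above, assuming expected yields `N_S` and `N_B`:

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(seed=2)
v_sig = rng.normal(1.0, 1.0, 5000)    # toy signal distribution of one variable
v_bkg = rng.normal(0.0, 1.5, 5000)    # toy background distribution
N_S, N_B = 1000, 10000                # assumed expected signal and background yields

best_cut, best_sig = None, -np.inf
for cut in np.linspace(-2.0, 4.0, 200):
    eps_s = np.mean(v_sig > cut)      # signal efficiency of the cut
    eps_b = np.mean(v_bkg > cut)      # background efficiency of the cut
    denom = np.sqrt(eps_s * N_S + eps_b * N_B)
    if denom == 0.0:
        continue
    sig = eps_s * N_S / denom         # sig = eps_S N_S / sqrt(eps_S N_S + eps_B N_B)
    if sig > best_sig:
        best_cut, best_sig = cut, sig

print(f"best cut: {best_cut:.2f}, significance: {best_sig:.1f}")
```
\normalsize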

## Fisher Discriminant

Idea: find a plane such that the projection of the data onto the plane gives an optimal separation of signal and background
::: columns
:::: {.column width=60%}
* The Fisher discriminant is the linear combination of all input variables
\newline
$F(\vec{V}) = \sum_i w_i \cdot V_i = \vec{w}^T \vec{V}$ \newline
* $\vec w$ defines the orientation of the plane. The coefficients are chosen such that the difference of the expectation values of both classes is large and the variances are small (see the sketch below) \newline
$J( \vec{w} ) = \frac {( F_S - F_B )^2}{ \sigma_S^2 + \sigma_B^2 } = \frac { \vec{w}^T K \vec{w} }{ \vec{w}^T L \vec{w} }$ \newline
with $K$ the between-class matrix built from the difference of the class means and $L$ the sum of the covariance matrices of the two classes
* For the separation a cut value $F_c$ is determined.
::::
::::{.column width=40%}
\includegraphics[]{figures/fisher.jpeg}
::::
:::
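
A minimal numpy sketch (toy 2D samples, assumed for illustration) of the standard closed-form Fisher weights, $\vec w \propto (V_S + V_B)^{-1} (\vec{\mu}_S - \vec{\mu}_B)$, which maximize $J(\vec w)$:

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(seed=3)
# toy 2D samples for signal and background (assumed for illustration)
X_S = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 5000)
X_B = rng.multivariate_normal([-1.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 5000)

mu_S, mu_B = X_S.mean(axis=0), X_B.mean(axis=0)
L = np.cov(X_S, rowvar=False) + np.cov(X_B, rowvar=False)   # sum of class covariances

w = np.linalg.solve(L, mu_S - mu_B)   # Fisher weights: w ~ L^{-1} (mu_S - mu_B)
F_S, F_B = X_S @ w, X_B @ w           # projections F(V) = w^T V for both classes
```
\normalsize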

## k-Nearest Neighbor Method (1)

$k$-NN classifier:

* Estimates the probability density around the input vector
* $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$

\vspace{2ex}
The algorithm finds the $k$ nearest neighbors:
$$ k = k_s + k_b $$
Probability for the event to be of signal type:
$$ p_s(\vec x) = \frac{k_s(\vec x)}{k_s(\vec x) + k_b(\vec x)} $$

## k-Nearest Neighbor Method (2)

::: columns
:::: {.column width=60%}
The simplest choice for the distance measure in feature space is the Euclidean distance:
$$ R = |\vec x - \vec y|$$
Better: take correlations between the variables into account:
$$ R = \sqrt{(\vec{x}-\vec{y})^T V^{-1} (\vec{x}-\vec{y})} $$
$$ V = \text{covariance matrix}, \quad R = \text{Mahalanobis distance} $$
::::
:::: {.column width=40%}
![](figures/knn.png)
::::
:::
\vfill
The $k$-NN classifier performs best when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.
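
A small numpy sketch combining the two slides above: the signal probability $p_s(\vec x) = k_s/(k_s + k_b)$ from the $k$ nearest neighbors, using the Mahalanobis distance (toy training data, assumed for illustration):

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(seed=4)
X_sig = rng.multivariate_normal([1, 1], [[1, 0.6], [0.6, 1]], 2000)    # training signal
X_bkg = rng.multivariate_normal([-1, 0], [[1, 0.6], [0.6, 1]], 2000)   # training background
X_train = np.vstack([X_sig, X_bkg])
y_train = np.hstack([np.ones(len(X_sig)), np.zeros(len(X_bkg))])

V_inv = np.linalg.inv(np.cov(X_train, rowvar=False))    # inverse covariance matrix

def knn_signal_prob(x, k=20):
    d = X_train - x
    R = np.sqrt(np.einsum('ij,jk,ik->i', d, V_inv, d))  # Mahalanobis distance to all training points
    nearest = np.argsort(R)[:k]                         # indices of the k nearest neighbors
    k_s = np.sum(y_train[nearest])                      # number of signal neighbors
    return k_s / k                                      # p_s(x) = k_s / (k_s + k_b)

print(knn_signal_prob(np.array([0.5, 0.5])))
```
\normalsize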

## Linear regression revisited

\vfill
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{"Galton family heights data": \\ origin of the term "regression"} \normalsize
![](figures/03_ml_basics_galton_linear_regression_iminuit.pdf)
::::
:::: {.column width=50%}
* data: $\{x_i,y_i\}$ \
* objective: predict $y = f(x)$
* model: $f(x; \vec \theta) = m x + b, \quad \vec \theta = (m, b)$
* loss function: $J(\vec \theta|x,y) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2$
* model training: optimal parameters $\hat{\vec{\theta}} = \mathrm{arg\,min} \, J(\vec \theta)$
::::
:::

## Linear regression

* Data: vectors with $p$ components ("features"): $\vec x = (x_1, ..., x_p)$
* $n$ observations: $\{\vec x_i, y_i\}, \quad i = 1, ..., n$
* Prediction for a given vector $\vec x$:
$$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x \quad \text{where } x_0 := 1 $$
* Find the weights that minimize the loss function:
$$\hat{\vec{w}} = \underset{\vec w}{\mathrm{arg\,min}} \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2$$
* In the case of linear regression a closed-form solution exists (see the sketch below):
$$ \hat{\vec{w}} = (X^\intercal X)^{-1} X^\intercal \vec y \quad \text{where} \; X \in \mathbb{R}^{n \times (p+1)}$$
* $X$ is called the design matrix; row $i$ of $X$ is $\vec x_i$ (including $x_0 = 1$)
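
A minimal numpy sketch of the closed-form solution via the normal equations (toy data assumed; a column of ones is prepended for $w_0$):

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n, p = 100, 3
X_feat = rng.normal(size=(n, p))                 # n observations with p features
y = 2.0 + X_feat @ np.array([1.0, -0.5, 0.3]) + rng.normal(0.0, 0.1, n)

X = np.hstack([np.ones((n, 1)), X_feat])         # design matrix with x_0 := 1
w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves (X^T X) w = X^T y
print(w_hat)                                     # approximately [2.0, 1.0, -0.5, 0.3]
```
\normalsize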

## Linear regression with regularization

::: columns
:::: {.column width=45%}
* Standard loss function
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 $$
* Ridge regression
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda |\vec w|^2$$
* LASSO regression
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda |\vec w| $$
::::
:::: {.column width=55%}
\vfill
![](figures/L1vsL2.pdf)
\small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
::::
:::
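
For comparison, a short scikit-learn sketch of the three loss functions above with toy data (assumed for illustration; the regularization strength $\lambda$ corresponds to the `alpha` parameter):

\footnotesize

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(seed=6)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)   # only two features matter

ols = LinearRegression().fit(X, y)       # standard least-squares loss
ridge = Ridge(alpha=1.0).fit(X, y)       # adds L2 penalty: lambda |w|^2
lasso = Lasso(alpha=0.1).fit(X, y)       # adds L1 penalty: lambda |w|

print(np.sum(np.abs(lasso.coef_) < 1e-6), "LASSO coefficients are (close to) zero")
```
\normalsize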

## Logistic regression (1)

* Consider a binary classification task, e.g., $y_i \in \{0,1\}$
* Objective: predict the probability for outcome $y=1$ given an observation $\vec x$
* Start with a linear "score"
$$ s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x$$
* Define a function that translates $s$ into a quantity that has the properties of a probability
$$ \sigma(s) = \frac{1}{1+e^{-s}} $$
* We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.

## Logistic regression (2)

* Consider a feature vector $\vec x_i$. For a given set of weights $\vec w$ the model predicts
* a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x_i)$ for outcome $y_i=1$
* a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x_i)$ for outcome $y_i=0$
* The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$; the observations $y_i$ are fixed).
* Likelihood for the full data sample ($n$ observations)
$$ L(\vec w) = \prod_{i=1}^n L_i(\vec w) = \prod_{i=1}^n \sigma(\vec w^\intercal \vec x_i)^{y_i} \,(1-\sigma(\vec w^\intercal \vec x_i))^{1-y_i} $$
* Maximizing the log-likelihood $\ln L(\vec w)$ corresponds to minimizing the loss function
$$ C(\vec w) = - \ln L(\vec w) = \sum_{i=1}^n \left[ - y_i \ln \sigma(\vec w^\intercal \vec x_i) - (1-y_i) \ln(1-\sigma(\vec w^\intercal \vec x_i)) \right]$$
* This is nothing else but the cross-entropy loss function (see the sketch below)
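
A compact numpy sketch of the sigmoid and the cross-entropy loss derived above (the minimization itself is done numerically, e.g. by scikit-learn in the following example):

\footnotesize

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_loss(w, X, y):
    """X: design matrix with a leading column of ones, y: labels in {0, 1}."""
    p = sigmoid(X @ w)            # predicted probabilities sigma(w^T x_i)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```
\normalsize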

## Example 1 - Probability of passing an exam (logistic regression) (1)

Objective: predict the probability that someone passes an exam based on the number of hours spent studying
$$ p_\mathrm{pass} = \sigma(s) = \frac{1}{1+e^{-s}}, \quad s = w_1 t + w_0, \quad t = \text{\# hours}$$
::: columns
:::: {.column width=40%}
* Data set: \
* preparation time $t$ in hours
* passed / not passed (0/1)
* Parameters need to be determined through numerical minimization
* $w_0 = -4.0777$
* $w_1 = 1.5046$
\vspace{1.5ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_logistic\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_logistic_regression.ipynb)
\normalsize
::::
:::: {.column width=60%}
![](figures/03_ml_basics_logistic_regression.pdf){width=90%}
::::
:::

## Example 1 - Probability of passing an exam (logistic regression) (2)

\footnotesize
\textcolor{gray}{Read data from file:}

```python
import numpy as np
import pandas as pd

# data: 1. hours studied, 2. passed (0/1)
df = pd.read_csv(filename, engine='python', sep=r'\s+')
x_tmp = df['hours_studied'].values
x = np.reshape(x_tmp, (-1, 1))
y = df['passed'].values
```
\vfill
\textcolor{gray}{Fit the data:}

```python
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='none', fit_intercept=True)
clf.fit(x, y);
```
\vfill
\textcolor{gray}{Calculate predictions:}

```python
hours_studied_tmp = np.linspace(0., 6., 1000)
hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
y_pred = clf.predict_proba(hours_studied)
```
\normalsize

## Precision and recall

::: columns
:::: {.column width=50%}
\textcolor{blue}{Precision:}\
Fraction of correctly classified instances among all instances that obtain a certain class label.
$$ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
\begin{center}
\textcolor{gray}{"purity"}
\end{center}
::::
:::: {.column width=50%}
\textcolor{blue}{Recall:}\
Fraction of positive instances that are correctly classified.
\vspace{2.9ex}
$$ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
\begin{center}
\textcolor{gray}{"efficiency"}
\end{center}
::::
:::
\vfill
\begin{center}
\textcolor{gray}{TP: true positives, FP: false positives, FN: false negatives}
\end{center}
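
A short scikit-learn sketch of both quantities, using assumed toy arrays of true and predicted class labels:

\footnotesize

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 1, 0, 1]   # assumed toy labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # assumed toy predictions

precision = precision_score(y_true, y_pred)   # TP / (TP + FP), "purity"
recall = recall_score(y_true, y_pred)         # TP / (TP + FN), "efficiency"
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```
\normalsize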

## Example 2: Heart disease data set (logistic regression) (1)

\scriptsize
\textcolor{gray}{Read data:}

```python
import pandas as pd

filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/data/heart.csv"
df = pd.read_csv(filename)
df
```
\vfill
![](figures/heart_table.png){width=70%}
\normalsize
\vspace{1.5ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_log\_regr\_heart\_disease.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)
\normalsize

## Example 2: Heart disease data set (logistic regression) (2)

\footnotesize
\textcolor{gray}{Define the arrays of labels and feature vectors}

```python
y = df['target'].values
X = df[[col for col in df.columns if col!="target"]]
```
\vfill
\textcolor{gray}{Generate training and test data sets}

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
```
\vfill
\textcolor{gray}{Fit the model}

```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='none', fit_intercept=True, max_iter=1000, tol=1E-5)
lr.fit(X_train, y_train)
```
\normalsize

## Example 2: Heart disease data set (logistic regression) (3)

\footnotesize
\textcolor{gray}{Test predictions on test data set:}

```python
from sklearn.metrics import classification_report
y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))
```
\vfill
\textcolor{gray}{Output:}

```
              precision    recall  f1-score   support

           0       0.75      0.86      0.80        63
           1       0.89      0.80      0.84        89

    accuracy                           0.82       152
   macro avg       0.82      0.83      0.82       152
weighted avg       0.83      0.82      0.82       152
```

## Example 2: Heart disease data set (logistic regression) (4)

\textcolor{gray}{Compare to another classifier using the \textit{receiver operating characteristic} (ROC) curve}
\vfill
\textcolor{gray}{Let's take the random forest classifier}
\footnotesize

```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=3)
rf.fit(X_train, y_train)
```
\normalsize
\vfill
\textcolor{gray}{Use \texttt{roc\_curve} from scikit-learn}
\footnotesize

```python
from sklearn.metrics import roc_curve
y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])
y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])
```
\normalsize

## Example 2: Heart disease data set (logistic regression) (5)

::: columns
:::: {.column width=50%}
\scriptsize

```python
import matplotlib.pyplot as plt
plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
plt.plot(tpr_rf, 1-fpr_rf, label="random forest")
```
\vspace{5ex}
\normalsize
\textcolor{gray}{Classifiers can be compared with the \textit{area under curve} (AUC) score.}
\scriptsize

```python
from sklearn.metrics import roc_auc_score
auc_lr = roc_auc_score(y_test, y_pred_prob_lr[:,1])
auc_rf = roc_auc_score(y_test, y_pred_prob_rf[:,1])
print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")
```
\vspace{5ex}
\normalsize
\textcolor{gray}{This gives}
\scriptsize

```
AUC scores: 0.82, 0.83
```
\normalsize
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.96\textwidth]{figures/03_ml_basics_log_regr_heart_disease.pdf}
\end{figure}
::::
:::

## Multinomial logistic regression: Softmax function

In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
\vfill
Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
\vfill
A probability for class $i$ can be predicted with the softmax function:
$$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task (see the sketch below).
\vfill
Multinomial logistic regression is also known as softmax regression.
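
A minimal numpy sketch of the softmax function (subtracting the maximum score is a common trick for numerical stability and does not change the result):

\footnotesize

```python
import numpy as np

def softmax(s):
    """s: array of k scores; returns k probabilities that sum to 1."""
    e = np.exp(s - np.max(s))   # subtract max(s) for numerical stability
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.66, 0.24, 0.10]
```
\normalsize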

## Example 3: Iris data set (softmax regression) (1)

Iris flower data set

* Introduced in 1936 in a paper by Ronald Fisher
* Task: classify flowers
* Three species: iris setosa, iris virginica and iris versicolor
* Four features: petal width and length, sepal width and length, in centimeters

::: columns
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/iris_dataset.png}
\end{figure}
::::
:::: {.column width=60%}
\vspace{2ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_iris\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_iris_softmax_regression.ipynb)
\vspace{19ex}
\scriptsize
[https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
[https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
\normalsize
::::
:::

## Example 3: Iris data set (softmax regression) (2)

\textcolor{gray}{Get the data set}
\footnotesize

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import some data to play with
# columns: Sepal Length, Sepal Width, Petal Length and Petal Width
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split data into training and test data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
```
\normalsize
\vfill
\textcolor{gray}{Softmax regression}
\footnotesize

```python
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
log_reg.fit(x_train, y_train);
```
\normalsize

## Example 3: Iris data set (softmax regression) (3)

::: columns
:::: {.column width=70%}
\textcolor{gray}{Accuracy and confusion matrix for different classifiers}
\footnotesize

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# kn_neigh and fisher_ld: k-nearest-neighbor and linear-discriminant
# classifiers defined earlier in the notebook
for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")
    # confusion matrix:
    # rows: true class, columns: predicted class
    print(confusion_matrix(y_test, y_pred), "\n")
```
\normalsize
::::
:::: {.column width=30%}
\footnotesize

```
LogisticRegression
accuracy: 0.96
[[29  0  0]
 [ 0 23  0]
 [ 0  3 20]]

KNeighborsClassifier
accuracy: 0.95
[[29  0  0]
 [ 0 23  0]
 [ 0  4 19]]

LinearDiscriminantAnalysis
accuracy: 0.99
[[29  0  0]
 [ 0 23  0]
 [ 0  1 22]]
```
\normalsize
::::
:::

## Exercise 1: Classification of air showers measured with the MAGIC telescope

::: columns
:::: {.column width=50%}
\small
* Cosmic gamma rays (30 GeV - 30 TeV)
* Cherenkov light from air showers
* Background: air showers caused by hadrons
\normalsize
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/magic_photo_small.png}
\end{figure}
::::
:::: {.column width=50%}
![](figures/magic_sketch.png)
::::
:::

## Exercise 1: Classification of air showers measured with the MAGIC telescope

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/magic_shower_em_had_small.png}
\end{figure}
::: columns
:::: {.column width=50%}
\begin{center}
Gamma shower
\end{center}
::::
:::: {.column width=50%}
\begin{center}
Hadronic shower
\end{center}
::::
:::

## Exercise 1: Classification of air showers measured with the MAGIC telescope

\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/magic_shower_parameters.png}
\end{figure}

## Exercise 1: Classification of air showers measured with the MAGIC telescope

MAGIC data set \
\tiny
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope}](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope)
\normalsize
\scriptsize

```
 1. fLength:  continuous  # major axis of ellipse [mm]
 2. fWidth:   continuous  # minor axis of ellipse [mm]
 3. fSize:    continuous  # 10-log of sum of content of all pixels [in #phot]
 4. fConc:    continuous  # ratio of sum of two highest pixels over fSize [ratio]
 5. fConc1:   continuous  # ratio of highest pixel over fSize [ratio]
 6. fAsym:    continuous  # distance from highest pixel to center, projected onto major axis [mm]
 7. fM3Long:  continuous  # 3rd root of third moment along major axis [mm]
 8. fM3Trans: continuous  # 3rd root of third moment along minor axis [mm]
 9. fAlpha:   continuous  # angle of major axis with vector to origin [deg]
10. fDist:    continuous  # distance from origin to center of ellipse [mm]
11. class:    g, h        # gamma (signal), hadron (background)

g = gamma (signal):      12332
h = hadron (background):  6688

For technical reasons, the number of h events is underestimated.
In the real data, the h class represents the majority of the events.
```
\normalsize

## Exercise 1: Classification of air showers measured with the MAGIC telescope

\small
[\textcolor{gray}{03\_ml\_basics\_ex\_1\_magic.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
\normalsize

a) Create for each variable a figure with a plot for gammas and hadrons overlaid.
b) Create the training and test data sets. The test data should amount to 50% of the total data set.
c) Define the logistic regressor and fit the training data.
d) Determine the model accuracy and the AUC score.
e) Plot the ROC curve (background rejection vs. signal efficiency).

## General remarks on multi-variate analyses (MVAs)

* MVA methods
* More effective than classic cut-based analyses
* Take correlations of input variables into account
\vfill
* Important: find good input variables for MVA methods
* Good separation power between S and B
* No strong correlation among variables
* No correlation with the parameters you try to measure in your signal sample!
\vfill
* Pre-processing
* Apply obvious variable transformations and let the MVA method do the rest
* Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$, use $|\cos \theta|$ and not $\cos \theta$ as input variable
* It is generally useful to bring all input variables to a similar numerical range (see the sketch below)
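
One common way to do this is standardization, e.g. with scikit-learn (a sketch, assuming training/test arrays `X_train` and `X_test` as in the earlier examples; the scaler must be fit on the training data only):

\footnotesize

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)        # mean and std determined from the training set only
X_train_scaled = scaler.transform(X_train)    # each feature: zero mean, unit variance
X_test_scaled = scaler.transform(X_test)      # apply the same transformation to the test set
```
\normalsize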

## Example of feature transformation

\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/feature_transformation.png}
\end{figure}

## Exercise 2: Data preprocessing

a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case?