% Introduction to Data Analysis and Machine Learning in Physics: \ 3. Machine Learning Basics, Multivariate Analysis
% Martino Borsato, Jörg Marks, Klaus Reygers
% Studierendentage, 11-14 April 2023

## Multi-variate analyses (MVA)

* General Question

  \vspace{0.1cm}
  There are two categories of distinguishable data, S and B, described by a set of variables. What are suitable criteria to separate the two samples?

  * Cuts on single variables are not sufficient to distinguish S and B
  * Reduce the variable space to probabilities for S or B

  \vspace{0.1cm}

* Classification of measurements using a set of observables $(V_1, V_2, \ldots, V_n)$
  * Find optimal separation criteria taking correlations into account

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/SandBcuts.jpeg}
\end{figure}
## Multi-variate analyses (MVA)

* Regression - in the multidimensional observable space $(V_1, V_2, \ldots, V_n)$ a functional relation with optimal parameters is determined

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/regression.jpeg}
\end{figure}

* supervised regression: the model is known
* unsupervised regression: the model is unknown
* Maximum-likelihood fits are used to determine the parameters
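As a minimal illustration (not part of the original slides), the sketch below fits the parameters of an assumed linear model to toy data with `scipy.optimize.curve_fit`; for Gaussian uncertainties the least-squares fit coincides with the maximum-likelihood solution.

```python
# Minimal sketch: supervised regression on toy data (assumed linear model).
# For Gaussian measurement uncertainties, the least-squares fit performed by
# curve_fit is the maximum-likelihood estimate of the parameters.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    """Assumed functional relation between observable x and target y."""
    return a * x + b

rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)   # toy measurements with noise

popt, pcov = curve_fit(model, x, y, sigma=np.full(x.size, 0.5))
perr = np.sqrt(np.diag(pcov))                            # parameter uncertainties
print(f"a = {popt[0]:.2f} +- {perr[0]:.2f}, b = {popt[1]:.2f} +- {perr[1]:.2f}")
```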
## MVA Classification in N Dimensions

For each event there are N measured variables

\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/classificationVar.jpeg}
\end{figure}

:::: columns
:::: {.column width=70%}
* Search for a mathematical transformation $F$ of the N-dimensional input space to a one-dimensional output space, $F(\vec V) : \mathbb{R}^N \rightarrow \mathbb{R}$
* A simple cut in $F$ implements a complex cut in the N-dimensional variable space
* Determine $F(\vec V)$ using a model and fit the parameters
::::
::::{.column width=30%}
\includegraphics[]{figures/response.jpeg}
::::
:::
## MVA Classification in N Dimensions

:::: columns
:::: {.column width=60%}
* Parameters \newline
Important measures to quantify the quality of the classification \newline \newline
Efficiency: $\epsilon = \frac{N_S (F>F_0)}{N_S}$ \newline
Purity: $\pi = \frac{N_S (F>F_0)}{N_S(F>F_0) + N_B(F>F_0)}$
::::
::::{.column width=40%}
\includegraphics[]{figures/response.jpeg}
::::
:::
\vspace{0.3cm}
:::: columns
:::: {.column width=60%}
* Receiver Operating Characteristic (ROC) \newline
Errors in classification \newline
\includegraphics[width=0.7\textwidth]{figures/error.jpeg}
::::
::::{.column width=40%}
\includegraphics[]{figures/roc.jpeg}
::::
:::
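Efficiency, purity and the ROC curve can be computed directly from the classifier output on a labelled test sample. The sketch below is an added illustration using toy Gaussian outputs in place of a real $F(\vec V)$.

```python
# Sketch: efficiency, purity and ROC curve from toy classifier outputs.
# F_sig / F_bkg stand in for F(V) evaluated on signal and background events.
import numpy as np

rng = np.random.default_rng(0)
F_sig = rng.normal(1.0, 1.0, 10000)   # toy F(V) for signal
F_bkg = rng.normal(-1.0, 1.0, 10000)  # toy F(V) for background

def eff_purity(F_sig, F_bkg, F0):
    n_s = np.sum(F_sig > F0)          # signal events passing the cut
    n_b = np.sum(F_bkg > F0)          # background events passing the cut
    eff = n_s / len(F_sig)
    purity = n_s / (n_s + n_b) if (n_s + n_b) > 0 else 0.0
    return eff, purity

# Scan F0 to obtain the ROC curve: signal efficiency vs. background rejection
cuts = np.linspace(-4, 4, 200)
eff_s = np.array([np.mean(F_sig > c) for c in cuts])
eff_b = np.array([np.mean(F_bkg > c) for c in cuts])
bkg_rejection = 1.0 - eff_b

i = np.argmin(np.abs(cuts))           # index of the cut closest to F0 = 0
print("eff, purity at F0=0:", eff_purity(F_sig, F_bkg, 0.0))
print("ROC point at F0=0:", eff_s[i], bkg_rejection[i])
```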
## MVA Classification in N Dimensions

:::: columns
:::: {.column width=60%}
* Interpretation of $F(\vec V)$
  * The distributions of \textcolor{blue}{$F(\vec V|S)$} and \textcolor{red}{$F(\vec V|B)$} are interpreted as probability density functions (PDF), \textcolor{blue}{$PDF_S(F)$} and \textcolor{red}{$PDF_B(F)$}
  * For a given value of $F$ and given signal and background fractions $f_S$ and $f_B$, the probability that an event is of signal type can be determined \newline
$P(\mathrm{data} = S \,|\, F) = \frac{\textcolor{blue}{f_S \cdot PDF_S(F)}}{\textcolor{red}{f_B \cdot PDF_B(F)} + \textcolor{blue}{f_S \cdot PDF_S(F)}}$
::::
::::{.column width=40%}
\includegraphics[]{figures/response.jpeg}
::::
:::
\vspace{0.3cm}
* A cut in the one-dimensional variable $F(\vec V) = F_0$, accepting all events to the right, determines the signal and background efficiency (background rejection). Systematically varying $F_0$ traces out the ROC curve. \newline
\definecolor{darkgreen}{RGB}{0,125,0}
* \textcolor{darkgreen}{A cut in $F(\vec V)$ corresponds to a complex hypersurface in the variable space, which cannot necessarily be described by a function.}
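A minimal sketch of the posterior probability $P(S|F)$, assuming Gaussian shapes for $PDF_S$ and $PDF_B$ purely for illustration; in practice the PDFs are estimated from the classifier output on training data.

```python
# Sketch: posterior signal probability P(S|F) from assumed PDFs.
# The Gaussian shapes and the fractions f_S, f_B are illustrative assumptions.
import numpy as np
from scipy.stats import norm

f_S, f_B = 0.2, 0.8                  # assumed signal and background fractions

def pdf_S(F):
    return norm.pdf(F, loc=1.0, scale=1.0)

def pdf_B(F):
    return norm.pdf(F, loc=-1.0, scale=1.0)

def p_signal(F):
    """P(S|F) = f_S * PDF_S(F) / (f_B * PDF_B(F) + f_S * PDF_S(F))"""
    num = f_S * pdf_S(F)
    return num / (f_B * pdf_B(F) + num)

print(p_signal(np.array([-2.0, 0.0, 2.0])))
```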
## Simple Cuts in Variables

* The simplest classifier for selecting signal events is a set of cuts on all variables that show some separation
* The output is binary, not a probability for $S$ or $B$
* The cuts are optimized by maximizing the background suppression for a given signal efficiency
* Significance: $sig = \epsilon_S \cdot N_S / \sqrt{ \epsilon_S \cdot N_S + \epsilon_B( \epsilon_S) \cdot N_B}$

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/cutInVariables.jpeg}
\end{figure}
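A sketch of such a cut optimization on toy distributions, scanning the cut value and maximizing the significance defined above; the event yields $N_S$, $N_B$ are assumed numbers.

```python
# Sketch: optimize a single cut by maximizing the significance
# sig = eps_S*N_S / sqrt(eps_S*N_S + eps_B*N_B), using toy distributions.
import numpy as np

rng = np.random.default_rng(1)
v_sig = rng.normal(1.0, 1.0, 100000)   # toy variable for signal
v_bkg = rng.normal(-1.0, 1.0, 100000)  # toy variable for background
N_S, N_B = 1000.0, 100000.0            # assumed expected event yields

cuts = np.linspace(-3, 3, 300)
eps_S = np.array([np.mean(v_sig > c) for c in cuts])   # signal efficiency
eps_B = np.array([np.mean(v_bkg > c) for c in cuts])   # background efficiency
sig = eps_S * N_S / np.sqrt(eps_S * N_S + eps_B * N_B)

best = np.nanargmax(sig)
print(f"best cut: {cuts[best]:.2f}, significance: {sig[best]:.1f}")
```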
## Fisher Discriminant

Idea: find a plane such that the projection of the data onto it gives an optimal separation of signal and background

:::: columns
:::: {.column width=60%}
* The Fisher discriminant is a linear combination of all input variables
\newline
$F(\vec{V}) = \sum_i w_i \cdot V_i = \vec{w}^T \vec{V}$ \newline
* $\vec w$ defines the orientation of the plane. The coefficients are chosen such that the difference of the expectation values of the two classes is large while the variance within the classes is small. \newline
$J( \vec{w} ) = \frac {( F_S - F_B )^2}{ \sigma_S^2 + \sigma_B^2 } = \frac { \vec{w}^T K \vec{w} }{ \vec{w}^T L \vec{w} }$ \newline
with the between-class matrix $K = (\vec{\mu}_S - \vec{\mu}_B)(\vec{\mu}_S - \vec{\mu}_B)^T$ built from the class means and $L = \Sigma_S + \Sigma_B$ the sum of the within-class covariance matrices
* For the separation a cut value $F_c$ is determined
::::
::::{.column width=40%}
\includegraphics[]{figures/fisher.jpeg}
::::
:::
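A minimal sketch of the Fisher weights, $\vec w \propto (\Sigma_S + \Sigma_B)^{-1}(\vec\mu_S - \vec\mu_B)$, which maximizes $J(\vec w)$; the 2D Gaussian samples are toy data for illustration.

```python
# Sketch: Fisher discriminant weights w ~ inv(Sigma_S + Sigma_B) (mu_S - mu_B)
# on toy 2D Gaussian signal and background samples.
import numpy as np

rng = np.random.default_rng(2)
X_S = rng.multivariate_normal([1.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], 5000)
X_B = rng.multivariate_normal([-1.0, -0.5], [[1.0, 0.3], [0.3, 1.0]], 5000)

mu_S, mu_B = X_S.mean(axis=0), X_B.mean(axis=0)
L = np.cov(X_S, rowvar=False) + np.cov(X_B, rowvar=False)   # within-class matrix
w = np.linalg.solve(L, mu_S - mu_B)                          # Fisher weights

F_S = X_S @ w   # discriminant values for signal
F_B = X_B @ w   # discriminant values for background
sep = (F_S.mean() - F_B.mean()) / np.sqrt(F_S.var() + F_B.var())
print("w =", w, " separation:", sep)
```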
## k-Nearest Neighbor Method (1)

$k$-NN classifier:

* Estimates the probability density around the input vector
* $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$

\vspace{2ex}
The algorithm finds the $k$ nearest neighbors:
$$ k = k_s + k_b $$

Probability for the event to be of signal type:
$$ p_s(\vec x) = \frac{k_s(\vec x)}{k_s(\vec x) + k_b(\vec x)} $$
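A minimal sketch of this estimate on toy data: count the labelled training events among the $k$ nearest neighbors of $\vec x$.

```python
# Sketch: k-NN signal probability p_s(x) = k_s / (k_s + k_b),
# counting labelled training events among the k nearest neighbors (toy data).
import numpy as np

rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal(1.0, 1.0, (500, 2)),
                     rng.normal(-1.0, 1.0, (500, 2))])
y_train = np.concatenate([np.ones(500), np.zeros(500)])   # 1 = signal, 0 = background

def p_signal_knn(x, X_train, y_train, k=20):
    """Fraction of signal events among the k nearest training events."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

print(p_signal_knn(np.array([0.5, 0.5]), X_train, y_train))
```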
## k-Nearest Neighbor Method (2)

::: columns
:::: {.column width=60%}
Simplest choice for the distance measure in feature space is the Euclidean distance:
$$ R = |\vec x - \vec y|$$

Better: take correlations between the variables into account:
$$ R = \sqrt{(\vec{x}-\vec{y})^T V^{-1} (\vec{x}-\vec{y})} $$
$$ V = \text{covariance matrix}, \quad R = \text{"Mahalanobis distance"} $$
::::
:::: {.column width=40%}
![](figures/knn.png)
::::
:::

\vfill

The $k$-NN classifier has best performance when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.
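A short sketch of the Mahalanobis distance, with the covariance matrix $V$ estimated from a toy training sample.

```python
# Sketch: Mahalanobis distance R = sqrt((x-y)^T V^{-1} (x-y)),
# with V estimated as the covariance matrix of the (toy) training sample.
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], 2000)

V_inv = np.linalg.inv(np.cov(X_train, rowvar=False))   # inverse covariance matrix

def mahalanobis(x, y, V_inv):
    d = x - y
    return np.sqrt(d @ V_inv @ d)

print(mahalanobis(np.array([1.0, 1.0]), np.array([0.0, 0.0]), V_inv))
```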
##

* Determination of the underlying data structure (regression problem)
* MVA methods
  * More effective than classic cut-based analyses
  * Take correlations of the input variables into account

\vfill

* Important: find good input variables for MVA methods
  * Good separation power between S and B
  * No strong correlation among the variables
  * No correlation with the parameters you try to measure in your signal sample!

\vfill

* Pre-processing
  * Apply obvious variable transformations and let the MVA method do the rest
  * Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$, use $|\cos \theta|$ and not $\cos \theta$ as input variable
  * It is generally useful to bring all input variables to a similar numerical range (see the sketch below)
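A minimal pre-processing sketch, added for illustration; the input variables `cos_theta` and `pt` are hypothetical, and scikit-learn's `StandardScaler` is used to bring the inputs to a similar numerical range.

```python
# Sketch: simple pre-processing of MVA input variables.
# 'cos_theta' and 'pt' are hypothetical inputs used only for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
cos_theta = rng.uniform(-1.0, 1.0, 1000)
pt = rng.exponential(20.0, 1000)                 # very different numerical range

# Exploit the symmetry in the polar angle: use |cos(theta)| instead of cos(theta)
X = np.column_stack([np.abs(cos_theta), pt])

# Bring all input variables to a similar range (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```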
## Fisher Discriminant

## Regression

## Logistic regression

## Decision Trees

XGBoost example with the iris dataset
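A minimal sketch of such an example, assuming the `xgboost` and `scikit-learn` packages are available; the hyperparameters are illustrative only.

```python
# Sketch: multi-class classification of the iris dataset with XGBoost.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Gradient-boosted decision trees; hyperparameters chosen for illustration only
clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
```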
MVA stands for "Multivariate Analysis," which is a statistical technique used to analyze data that involves multiple variables. In MVA, the relationships between the variables are studied to identify patterns, trends, and dependencies.

MVA techniques are used in many fields, including finance, engineering, physics, biology, and social sciences. Some common MVA techniques include principal component analysis (PCA), factor analysis, cluster analysis, discriminant analysis, and regression analysis.

PCA, for example, is a commonly used MVA technique that reduces the dimensionality of a data set by identifying the most important variables or components. This can help to simplify data visualization and analysis. Factor analysis, on the other hand, is used to identify underlying factors that contribute to the variation in a set of variables.

Overall, MVA is a powerful tool for analyzing complex data sets and can provide insights into relationships and dependencies that might not be apparent when analyzing individual variables in isolation.