% PhD thesis of Renata Kopečná: Angular analysis of the B+ -> K*+(K+ pi0) mu+ mu- decay with the LHCb experiment
\subsection{Multi-variate analysis selection}\label{sec:sel-TMVA}
After the cut-based preselection, a rather large amount of combinatorial background is still present (see \refFig{ANA-nonPrese}). To reduce the amount of background while maintaining high signal selection efficiency, a multi-variate analysis (MVA) is performed~\cite{ANA-MVA}. Generally, an MVA is a set of statistical methods that examine patterns in multidimensional data.\vspace{\baselineskip}
\begin{figure}[hbt!]\centering
\includegraphics[width=0.58\textwidth]{./AnalysisSelection/NeuralNetwork.png}
\captionof{figure}[Multilayer feedforward backpropagation neural network principles.]{Sketch of multilayer feedforward backpropagation neural network principles. The input layer distributes the input data by weighting them and sending them to the hidden neurons (nodes). The hidden neurons sum the signal from the input neurons and project this sum on an \emph{activation function} $f_h$. The activation function is typically a binary step (threshold) or rectified linear unit function $f(x) = \max(0,x)$. The projected numbers are weighted and sent to the output layer, where they are summed again. There can be an arbitrary number of neurons and hidden layers. } \label{fig:ANA-NeuralNetwork}
\end{figure}
There is a vast range of methods that can be considered a multi-variate analysis; the most commonly used ones are \emph{decision trees} and \emph{multiple regression} methods. In this analysis, the multilayer perceptron is used.
A multilayer perceptron (MLP) is an artificial neural network. Neural networks were proposed as early as 1943~\cite{ANA-MLP}. A simple sketch of the principle is presented in \refFig{ANA-NeuralNetwork}. The network consists of three layers: an input layer, a hidden layer and an output layer. Each layer consists of several (or many) nodes that are interconnected. A node receives a data item (a number) from each of its connections, multiplies it by an associated weight and returns the sum of these products.
This sum is then transformed by an \emph{activation function}. The weights are initially random: during the training process, by examining examples with known input and output, the weights are adjusted so that training data with the same labels consistently yield similar output.
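As a minimal illustration of the weighted-sum-plus-activation step described above (a numpy sketch, not the TMVA implementation; the layer sizes and random weights are arbitrary choices):

```python
import numpy as np

def relu(x):
    # Rectified linear unit activation: f(x) = max(0, x)
    return np.maximum(0.0, x)

def forward(x, W_hidden, b_hidden, W_out, b_out):
    # Hidden neurons: weighted sum of the inputs, projected on the activation
    h = relu(W_hidden @ x + b_hidden)
    # Output layer: weighted sum of the hidden-neuron outputs
    return W_out @ h + b_out

rng = np.random.default_rng(0)
x = rng.normal(size=9)           # one event with 9 input variables
W_h = rng.normal(size=(5, 9))    # 5 hidden neurons (arbitrary)
b_h = np.zeros(5)
W_o = rng.normal(size=(1, 5))
b_o = np.zeros(1)
print(forward(x, W_h, b_h, W_o, b_o))
```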
An MLP is a special kind of neural network: it is a supervised-learning network that uses backpropagation for training. It is used to distinguish data categories that are not linearly separable: in this case, signal and background. Supervised learning means the neural network is trained with a set of input-output pairs (whereas an unsupervised network is trained using only the input data). Backpropagation means the gradient of the loss function with respect to the weights of the network is computed. The loss function represents the discrepancy between the desired output and the output calculated by the neural network. This \emph{error} is then propagated backwards through the network, updating the weights accordingly and quickly reducing the difference between the expected and calculated outputs.
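The gradient-based weight update can be sketched for the simplest possible case, a single sigmoid output neuron with a squared-error loss (an illustrative toy, assuming a fixed learning rate; not the TMVA training procedure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(w, x, y_true, lr=0.1):
    # One gradient-descent step for loss L = 0.5 * (y_pred - y_true)^2
    y_pred = sigmoid(w @ x)
    # Chain rule: dL/dw = (y_pred - y_true) * sigmoid'(z) * x
    grad = (y_pred - y_true) * y_pred * (1.0 - y_pred) * x
    return w - lr * grad  # move against the gradient

w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
for _ in range(200):
    w = backprop_step(w, x, y_true=1.0)
print(sigmoid(w @ x))  # moves toward the target label 1
```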
The MLP tool provided by the Toolkit for Multivariate Data Analysis (TMVA)~\cite{ANA-TMVA} is used. The samples used for training have to be clearly labeled as signal or background and be as close to the real signal and background as possible. Hence, the MLP is trained using \BuToKstmm decay candidates in the simulation sample for signal, requiring the reconstructed \Bu meson mass to be close to the \Bu rest mass ($|m_{\Bu}^{\rm reco} - \mBu| <100\mev$). The background training sample is taken from the recorded data: the \Bu meson upper-mass sideband, requiring the reconstructed \Bu meson mass to be larger than 5700\mev. The requirement of $m_{\Bu}^{\rm reco} > 5700\mev$ ensures that no (partially) reconstructed events enter the background sample. The numbers of available signal and background events are listed in \refTab{TMVAevents}. The MLP is trained separately for \runI and \runII, as the data-taking conditions differed between the Runs.
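The two mass-window requirements defining the training samples amount to simple selections; a numpy sketch with a hypothetical array of reconstructed masses in \mev (the \Bu rest mass is set to its nominal PDG value):

```python
import numpy as np

M_BU = 5279.34  # nominal B+ rest mass in MeV

def signal_window(m_reco, width=100.0):
    # Simulated candidates with |m_reco - m(B+)| < 100 MeV form the signal sample
    return np.abs(m_reco - M_BU) < width

def background_sideband(m_reco, threshold=5700.0):
    # Recorded candidates in the upper-mass sideband form the background sample;
    # m_reco > 5700 MeV excludes (partially) reconstructed decays
    return m_reco > threshold

m = np.array([5250.0, 5300.0, 5500.0, 5750.0, 6000.0])
print(signal_window(m))        # [ True  True False False False]
print(background_sideband(m))  # [False False False  True  True]
```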
\begin{table}[hbt!]
\centering
\begin{tabular}{l|cc}
& \runI & \runII \\ \hline
Signal events & 4531 & 19152\\
Background events & 511 & 1748
\end{tabular}
\captionof{table}[Number of events used for the MLP training.]{Number of events used for the MLP training. \label{tab:TMVAevents}}
\end{table}
The variables that serve as input to the MLP are listed in \refTab{TMVA}. These variables were identified as the ones with the largest discrimination power. The agreement between the simulated and recorded data in the listed variables is crucial, as the MLP could otherwise pick up on differences between the data and the simulation instead of separating background from signal. As mentioned in \refSec{sel-SimulationCorrection}, the \sWeight ed data and weighted simulation distributions of the variables listed in \refTab{TMVA} are carefully checked to be in agreement. The distributions agree very well. Small discrepancies are acceptable, as they are only minor and present in regions where the MLP does not differentiate between signal and background.
\begin{table}[hbt!]
\centering
\begin{tabular}{c}
$\ln{\pt^{\Bu}} $\\
\Bu Cone-$\pt$ asymmetry \\
\Bu \chisqip\\
$\ln(1-\Bu\text{DIRA})$\\
$\ln{\pt^{\Kp}} $\\
$|\eta(\piz)-\eta(\Kp)|$ \\
$CL_\piz$ \\
$\max\left[\ln(\pt^{\g_1}),\ln(\pt^{\g_2})\right]$ \\
$\min\left[\ln{ \mun \chisqip },\ln{ \mup \chisqip }\right]$\\
\end{tabular}\\ \vspace{5pt}
\captionof{table}[List of variables used for the MLP training.]{List of variables used for the MLP training. The confidence level of the neutral pion is a product of photon confidence levels, $CL_\piz = CL_{\gamma_1} CL_{\gamma_2}$. The list is identical in \runI and \runII. \label{tab:TMVA}}
\end{table}
In order for the MLP to select signal over background as efficiently as possible, the input variables should not be correlated with each other in either the signal or the background sample, as correlations lower the separation power of the MLP. The correlations between the input variables for the training signal and background samples are depicted in \refFig{ANA-MLP_corr}.
The TMVA toolkit returns an MLP response value between 0 and 1, where the value represents the probability of an event being a signal event. The optimal cut value is discussed later in \refSec{sel-SignalEstimation}.
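One common way to choose such a working point (shown here only as an illustration; the figure of merit actually used is discussed in the referenced section) is to scan the cut value and maximise the significance $S/\sqrt{S+B}$. A toy sketch with hypothetical beta-distributed response values:

```python
import numpy as np

def best_cut(sig_resp, bkg_resp, cuts):
    # Scan candidate cuts on the MLP response and maximise S / sqrt(S + B)
    fom = []
    for c in cuts:
        s = np.sum(sig_resp > c)  # signal events surviving the cut
        b = np.sum(bkg_resp > c)  # background events surviving the cut
        fom.append(s / np.sqrt(s + b) if s + b > 0 else 0.0)
    return cuts[int(np.argmax(fom))]

rng = np.random.default_rng(1)
sig = rng.beta(5, 1, size=1000)  # toy signal responses peaking near 1
bkg = rng.beta(1, 5, size=1000)  # toy background responses peaking near 0
print(best_cut(sig, bkg, np.linspace(0.0, 0.99, 100)))
```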
\begin{figure}[hbt!]
\centering
\includegraphics[width=0.48\textwidth]{./Data/MVA/Run1/CorrelationS_new.eps}
\includegraphics[width=0.48\textwidth]{./Data/MVA/Run1/CorrelationB_new.eps}
\captionof{figure}[The correlations between the input variables for the MVA training.]{The correlations between the input variables for the MVA training signal (left) and background (right) samples. It is clear there is no significant correlation between the input variables in either the signal or the background sample. } \label{fig:ANA-MLP_corr}
\end{figure}
%%%%% Run I
%--- DataSetFactory : Signal -- training events : 4531 (sum of weights: 4531) - requested were 0 events
%--- DataSetFactory : Signal -- testing events : 4531 (sum of weights: 4586.78) - requested were 0 events
%--- DataSetFactory : Signal -- training and testing events: 9062 (sum of weights: 9117.78)
%--- DataSetFactory : Signal -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.330549
%--- DataSetFactory : Background -- training events : 511 (sum of weights: 511) - requested were 0 events
%--- DataSetFactory : Background -- testing events : 511 (sum of weights: 511) - requested were 0 events
%--- DataSetFactory : Background -- training and testing events: 1022 (sum of weights: 1022)
%--- DataSetFactory : Background -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.0103976
%
%--- MLP : Ranking result (top variable is best ranked)
%--- MLP : ----------------------------------------------
%--- MLP : Rank : Variable : Importance
%--- MLP : ----------------------------------------------
%--- MLP : 1 : gamma_max_log_PT_DTF : 1.282e+01
%--- MLP : 2 : K_plus_PI0_ETA_DTF : 1.066e+01
%--- MLP : 3 : B_plus_log_DIRA : 8.585e+00
%--- MLP : 4 : pi_zero_resolved_CL : 5.243e+00
%--- MLP : 5 : B_plus_NEW_ConePTasym : 4.506e+00
%--- MLP : 6 : B_plus_log_PT_DTF : 4.417e+00
%--- MLP : 7 : min_mumu_IPCHI2_OWNPV : 3.755e+00
%--- MLP : 8 : B_plus_IPCHI2_OWNPV : 3.632e+00
%--- MLP : 9 : K_plus_log_PT_DTF : 2.342e+00
%--- MLP : ----------------------------------------------
%
%--- Factory : Inter-MVA correlation matrix (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.773 +0.608
%--- Factory : BDTG: +0.773 +1.000 +0.833
%--- Factory : MLP: +0.608 +0.833 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Inter-MVA correlation matrix (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.853 +0.852
%--- Factory : BDTG: +0.853 +1.000 +0.814
%--- Factory : MLP: +0.852 +0.814 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.483 +0.385 +0.294
%--- Factory : K_plus_PI0_ETA_DTF: -0.446 -0.432 -0.382
%--- Factory : B_plus_NEW_ConePTasym: +0.408 +0.319 +0.265
%--- Factory : B_plus_log_PT_DTF: +0.363 +0.319 +0.272
%--- Factory : B_plus_IPCHI2_OWNPV: -0.281 -0.267 -0.243
%--- Factory : K_plus_log_PT_DTF: +0.346 +0.279 +0.217
%--- Factory : B_plus_log_DIRA: -0.428 -0.346 -0.274
%--- Factory : pi_zero_resolved_CL: +0.246 +0.214 +0.167
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.552 +0.383 +0.295
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.388 +0.374 +0.335
%--- Factory : K_plus_PI0_ETA_DTF: -0.548 -0.501 -0.705
%--- Factory : B_plus_NEW_ConePTasym: +0.319 +0.306 +0.280
%--- Factory : B_plus_log_PT_DTF: +0.155 +0.171 +0.205
%--- Factory : B_plus_IPCHI2_OWNPV: -0.367 -0.183 -0.309
%--- Factory : K_plus_log_PT_DTF: +0.296 +0.281 +0.323
%--- Factory : B_plus_log_DIRA: -0.204 -0.150 -0.168
%--- Factory : pi_zero_resolved_CL: +0.286 +0.220 +0.199
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.342 +0.346 +0.248
%%%%% Run II
%--- DataSetFactory : Signal -- training events : 19152 (sum of weights: 19152) - requested were 0 events
%--- DataSetFactory : Signal -- testing events : 19152 (sum of weights: 18831.2) - requested were 0 events
%--- DataSetFactory : Signal -- training and testing events: 38304 (sum of weights: 37983.2)
%--- DataSetFactory : Signal -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.336993
%--- DataSetFactory : Background -- training events : 1748 (sum of weights: 1748) - requested were 0 events
%--- DataSetFactory : Background -- testing events : 1748 (sum of weights: 1748) - requested were 0 events
%--- DataSetFactory : Background -- training and testing events: 3496 (sum of weights: 3496)
%--- DataSetFactory : Background -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.0123012
%
%--- MLP : Ranking result (top variable is best ranked)
%--- MLP : ----------------------------------------------
%--- MLP : Rank : Variable : Importance
%--- MLP : ----------------------------------------------
%--- MLP : 1 : K_plus_PI0_ETA_DTF : 2.466e+01
%--- MLP : 2 : B_plus_log_DIRA : 2.394e+01
%--- MLP : 3 : gamma_max_log_PT_DTF : 1.465e+01
%--- MLP : 4 : pi_zero_resolved_CL : 6.758e+00
%--- MLP : 5 : B_plus_log_PT_DTF : 5.412e+00
%--- MLP : 6 : K_plus_log_PT_DTF : 4.794e+00
%--- MLP : 7 : B_plus_IPCHI2_OWNPV : 4.181e+00
%--- MLP : 8 : min_mumu_IPCHI2_OWNPV : 3.781e+00
%--- MLP : 9 : B_plus_NEW_ConePTasym : 2.970e+00
%--- MLP : ----------------------------------------------
%
%--- Factory : Inter-MVA correlation matrix (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.834 +0.606
%--- Factory : BDTG: +0.834 +1.000 +0.786
%--- Factory : MLP: +0.606 +0.786 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Inter-MVA correlation matrix (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.862 +0.848
%--- Factory : BDTG: +0.862 +1.000 +0.787
%--- Factory : MLP: +0.848 +0.787 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.373 +0.366 +0.257
%--- Factory : K_plus_PI0_ETA_DTF: -0.409 -0.441 -0.330
%--- Factory : B_plus_NEW_ConePTasym: +0.367 +0.313 +0.222
%--- Factory : B_plus_log_PT_DTF: +0.263 +0.282 +0.191
%--- Factory : B_plus_IPCHI2_OWNPV: -0.344 -0.304 -0.254
%--- Factory : K_plus_log_PT_DTF: +0.280 +0.312 +0.227
%--- Factory : B_plus_log_DIRA: -0.515 -0.406 -0.269
%--- Factory : pi_zero_resolved_CL: +0.348 +0.248 +0.184
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.557 +0.428 +0.294
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.315 +0.296 +0.199
%--- Factory : K_plus_PI0_ETA_DTF: -0.562 -0.474 -0.646
%--- Factory : B_plus_NEW_ConePTasym: +0.261 +0.228 +0.197
%--- Factory : B_plus_log_PT_DTF: +0.217 +0.202 +0.189
%--- Factory : B_plus_IPCHI2_OWNPV: -0.472 -0.354 -0.410
%--- Factory : K_plus_log_PT_DTF: +0.363 +0.335 +0.387
%--- Factory : B_plus_log_DIRA: -0.287 -0.278 -0.198
%--- Factory : pi_zero_resolved_CL: +0.254 +0.238 +0.229
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.291 +0.292 +0.222
%--- Factory : --------------------------------