\subsection{Multi-variate analysis selection}\label{sec:sel-TMVA}
After the cut-based preselection, a rather large amount of combinatorial background is still present (see \refFig{ANA-nonPrese}). To reduce the amount of background while maintaining high signal selection efficiency, a multi-variate analysis (MVA) is performed~\cite{ANA-MVA}. Generally, an MVA is a set of statistical methods that examine patterns in multidimensional data.\vspace{\baselineskip}
\begin{figure}[hbt!]\centering
\includegraphics[width=0.58\textwidth]{./AnalysisSelection/NeuralNetwork.png}
\captionof{figure}[Multilayer feedforward backpropagation neural network principles.]{Sketch of multilayer feedforward backpropagation neural network principles. The input layer distributes the input data by weighting them and sending them to the hidden neurons (nodes). The hidden neurons sum the signal from the input neurons and apply an \emph{activation function} $f_h$ to this sum. The activation function is typically a binary step (threshold) or a rectified linear unit function $f(x) = \max(0,x)$. The resulting values are weighted and sent to the output layer, where they are summed again. There can be an arbitrary number of neurons and hidden layers. } \label{fig:ANA-NeuralNetwork}
\end{figure}
A vast number of methods can be considered multi-variate analyses; the most commonly used ones are \emph{decision trees} and \emph{multiple regression} methods. In this analysis, a multilayer perceptron is used.
A multilayer perceptron (MLP) is an artificial neural network. Neural networks were proposed as early as 1943~\cite{ANA-MLP}. A simple sketch of the principle of an MLP is presented in \refFig{ANA-NeuralNetwork}. In its simplest form, it consists of three layers: an input layer, a hidden layer and an output layer. Each layer consists of several (or many) interconnected nodes. A node receives a data item (a number) from each of its connections, multiplies it by an associated weight and returns the sum of these products.
This sum is then transformed by an \emph{activation function}. At the start of the training process, the weights are initialized randomly. By examining examples with known input and/or output, the weights are then adjusted such that training data with the same label consistently yield similar output.
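As a minimal illustration of the node computation described above (the notation is introduced here only for this sketch and bias terms are omitted), a network with a single hidden layer maps the input variables $x_i$ to the response $y$ as
\begin{equation}
h_j = f_h\Bigl(\sum_i w_{ij}\, x_i\Bigr), \qquad y = \sum_j v_j\, h_j,
\end{equation}
where $w_{ij}$ is the weight of the connection between input node $i$ and hidden node $j$, $v_j$ is the weight of the connection between hidden node $j$ and the output node, and $f_h$ is the activation function of the hidden layer. An activation function may in general also be applied at the output node; it is left out here for brevity.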
An MLP is a special kind of neural network: it is a supervised-learning network that uses backpropagation for training. It is used to distinguish data categories that are not linearly separable, in this case signal and background. Supervised learning means that the neural network is trained with a set of input-output pairs, whereas an unsupervised network is trained using only the input data. Backpropagation means that the gradient of the loss function with respect to the weights of the network is computed. The loss function represents the discrepancy between the desired output and the output calculated by the neural network. This \emph{error} is then propagated backwards through the network and the weights are updated accordingly, leading to a quick reduction of the difference between the expected and calculated outputs.
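The weight update can be sketched as follows, assuming a squared-error loss and plain gradient descent with a learning rate $\eta$; both are common choices used here for illustration and are not necessarily the exact configuration of this analysis. For $N$ training events with inputs $\vec{x}_a$ and desired outputs $\hat{y}_a$ (for example 1 for signal and 0 for background), the loss function $E$ and the update of the weights read
\begin{equation}
E(\vec{w}) = \frac{1}{2}\sum_{a=1}^{N}\bigl(y(\vec{x}_a;\vec{w}) - \hat{y}_a\bigr)^2, \qquad \vec{w} \;\to\; \vec{w} - \eta\,\nabla_{\vec{w}} E(\vec{w}),
\end{equation}
where $y(\vec{x}_a;\vec{w})$ is the network output for event $a$ and $\vec{w}$ collects all weights of the network. Backpropagation evaluates the gradient $\nabla_{\vec{w}}E$ layer by layer via the chain rule, starting from the output layer.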
The MLP implementation provided by the Toolkit for Multivariate Data Analysis (TMVA)~\cite{ANA-TMVA} is used. The training samples have to be clearly labeled as signal or background and resemble the real signal and background as closely as possible. Hence, the signal training sample consists of \BuToKstmm decay candidates from the simulation sample, with the reconstructed \Bu meson mass required to be close to the \Bu rest mass ($|m_{\Bu}^{\rm reco} - \mBu| <100\mev$). The background training sample is taken from the recorded data, namely from the \Bu meson upper-mass sideband, where the reconstructed \Bu meson mass is required to be larger than 5700\mev. The requirement of $m_{\Bu}^{\rm reco} > 5700\mev$ ensures that no (partially) reconstructed decays are present in the background sample. The numbers of available signal and background events are listed in \refTab{TMVAevents}. The MLP is trained separately for \runI and \runII, as the running conditions differed.
\begin{table}[hbt!]
\centering
\begin{tabular}{l|cc}
& \runI & \runII \\ \hline
Signal events & 4531 & 19152\\
Background events & 511 & 1748
\end{tabular}
\captionof{table}[Number of events used for the MLP training.]{Number of events used for the MLP training. \label{tab:TMVAevents}}
\end{table}
The variables that serve as input to the MLP are listed in \refTab{TMVA}. They were identified as the variables with the largest discrimination power. The agreement between the simulated and recorded data in these variables is crucial, as the MLP could otherwise pick up on differences between data and simulation instead of separating the signal from the background. As mentioned in \refSec{sel-SimulationCorrection}, the \sWeight ed data and weighted simulation distributions of the variables listed in \refTab{TMVA} are carefully checked to be in agreement. The distributions agree very well. The remaining small discrepancies are acceptable, as they are present only in regions where the MLP does not differentiate between signal and background.
\begin{table}[hbt!]
\centering
\begin{tabular}{c}
$\ln{\pt^{\Bu}} $\\
\Bu Cone-$\pt$ asymmetry \\
\Bu \chisqip\\
$\ln(1-\Bu\text{DIRA})$\\
$\ln{\pt^{\Kp}} $\\
$|\eta(\piz)-\eta(\Kp)|$ \\
$CL_\piz$ \\
max$\left[\ln(\pt^{\g_1}),\ln(\pt^{\g_2})\right]$ \\
min$\left[\ln{ \mun \chisqip },\ln{ \mup \chisqip }\right]$\\
\end{tabular}\\ \vspace{5pt}
\captionof{table}[List of variables used for the MLP training.]{List of variables used for the MLP training. The confidence level of the neutral pion is a product of photon confidence levels, $CL_\piz = CL_{\gamma_1} CL_{\gamma_2}$. The list is identical in \runI and \runII. \label{tab:TMVA}}
\end{table}
In order for the MLP to select signal over background as efficiently as possible, the input variables should not be correlated with each other in either the signal or the background sample, as such correlations lower the separation power of the MLP. The correlations between the input variables for the training signal and background samples are depicted in \refFig{ANA-MLP_corr}.
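For reference, and assuming that the quantity shown is the linear (Pearson) correlation coefficient, the correlation between two input variables $x$ and $y$ is
\begin{equation}
\rho(x,y) = \frac{\mathrm{cov}(x,y)}{\sigma_x\,\sigma_y} = \frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}{\sigma_x\,\sigma_y},
\end{equation}
where $\langle\cdot\rangle$ denotes the sample mean and $\sigma_x$, $\sigma_y$ are the standard deviations of the two variables. The coefficient takes values between $-1$ and $+1$ and vanishes for linearly uncorrelated variables; it is insensitive to purely non-linear dependencies.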
The TMVA toolkit returns an MLP response value between 0 and 1, which can be interpreted as the probability of an event being a signal event. The optimal cut value is discussed later in \refSec{sel-SignalEstimation}.
\begin{figure}[hbt!]
\centering
\includegraphics[width=0.48\textwidth]{./Data/MVA/Run1/CorrelationS_new.eps}
\includegraphics[width=0.48\textwidth]{./Data/MVA/Run1/CorrelationB_new.eps}
\captionof{figure}[The correlations between the input variables for the MVA training.]{The correlations between the input variables for the MVA training signal (left) and background samples (right). There is no significant correlation between the input variables in either the signal or the background sample. } \label{fig:ANA-MLP_corr}
\end{figure}
%%%%% Run I
%--- DataSetFactory : Signal -- training events : 4531 (sum of weights: 4531) - requested were 0 events
%--- DataSetFactory : Signal -- testing events : 4531 (sum of weights: 4586.78) - requested were 0 events
%--- DataSetFactory : Signal -- training and testing events: 9062 (sum of weights: 9117.78)
%--- DataSetFactory : Signal -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.330549
%--- DataSetFactory : Background -- training events : 511 (sum of weights: 511) - requested were 0 events
%--- DataSetFactory : Background -- testing events : 511 (sum of weights: 511) - requested were 0 events
%--- DataSetFactory : Background -- training and testing events: 1022 (sum of weights: 1022)
%--- DataSetFactory : Background -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.0103976
%
%--- MLP : Ranking result (top variable is best ranked)
%--- MLP : ----------------------------------------------
%--- MLP : Rank : Variable : Importance
%--- MLP : ----------------------------------------------
%--- MLP : 1 : gamma_max_log_PT_DTF : 1.282e+01
%--- MLP : 2 : K_plus_PI0_ETA_DTF : 1.066e+01
%--- MLP : 3 : B_plus_log_DIRA : 8.585e+00
%--- MLP : 4 : pi_zero_resolved_CL : 5.243e+00
%--- MLP : 5 : B_plus_NEW_ConePTasym : 4.506e+00
%--- MLP : 6 : B_plus_log_PT_DTF : 4.417e+00
%--- MLP : 7 : min_mumu_IPCHI2_OWNPV : 3.755e+00
%--- MLP : 8 : B_plus_IPCHI2_OWNPV : 3.632e+00
%--- MLP : 9 : K_plus_log_PT_DTF : 2.342e+00
%--- MLP : ----------------------------------------------
%
%--- Factory : Inter-MVA correlation matrix (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.773 +0.608
%--- Factory : BDTG: +0.773 +1.000 +0.833
%--- Factory : MLP: +0.608 +0.833 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Inter-MVA correlation matrix (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.853 +0.852
%--- Factory : BDTG: +0.853 +1.000 +0.814
%--- Factory : MLP: +0.852 +0.814 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.483 +0.385 +0.294
%--- Factory : K_plus_PI0_ETA_DTF: -0.446 -0.432 -0.382
%--- Factory : B_plus_NEW_ConePTasym: +0.408 +0.319 +0.265
%--- Factory : B_plus_log_PT_DTF: +0.363 +0.319 +0.272
%--- Factory : B_plus_IPCHI2_OWNPV: -0.281 -0.267 -0.243
%--- Factory : K_plus_log_PT_DTF: +0.346 +0.279 +0.217
%--- Factory : B_plus_log_DIRA: -0.428 -0.346 -0.274
%--- Factory : pi_zero_resolved_CL: +0.246 +0.214 +0.167
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.552 +0.383 +0.295
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.388 +0.374 +0.335
%--- Factory : K_plus_PI0_ETA_DTF: -0.548 -0.501 -0.705
%--- Factory : B_plus_NEW_ConePTasym: +0.319 +0.306 +0.280
%--- Factory : B_plus_log_PT_DTF: +0.155 +0.171 +0.205
%--- Factory : B_plus_IPCHI2_OWNPV: -0.367 -0.183 -0.309
%--- Factory : K_plus_log_PT_DTF: +0.296 +0.281 +0.323
%--- Factory : B_plus_log_DIRA: -0.204 -0.150 -0.168
%--- Factory : pi_zero_resolved_CL: +0.286 +0.220 +0.199
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.342 +0.346 +0.248
%%%%% Run II
%--- DataSetFactory : Signal -- training events : 19152 (sum of weights: 19152) - requested were 0 events
%--- DataSetFactory : Signal -- testing events : 19152 (sum of weights: 18831.2) - requested were 0 events
%--- DataSetFactory : Signal -- training and testing events: 38304 (sum of weights: 37983.2)
%--- DataSetFactory : Signal -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.336993
%--- DataSetFactory : Background -- training events : 1748 (sum of weights: 1748) - requested were 0 events
%--- DataSetFactory : Background -- testing events : 1748 (sum of weights: 1748) - requested were 0 events
%--- DataSetFactory : Background -- training and testing events: 3496 (sum of weights: 3496)
%--- DataSetFactory : Background -- due to the preselection a scaling factor has been applied to the numbers of requested events: 0.0123012
%
%--- MLP : Ranking result (top variable is best ranked)
%--- MLP : ----------------------------------------------
%--- MLP : Rank : Variable : Importance
%--- MLP : ----------------------------------------------
%--- MLP : 1 : K_plus_PI0_ETA_DTF : 2.466e+01
%--- MLP : 2 : B_plus_log_DIRA : 2.394e+01
%--- MLP : 3 : gamma_max_log_PT_DTF : 1.465e+01
%--- MLP : 4 : pi_zero_resolved_CL : 6.758e+00
%--- MLP : 5 : B_plus_log_PT_DTF : 5.412e+00
%--- MLP : 6 : K_plus_log_PT_DTF : 4.794e+00
%--- MLP : 7 : B_plus_IPCHI2_OWNPV : 4.181e+00
%--- MLP : 8 : min_mumu_IPCHI2_OWNPV : 3.781e+00
%--- MLP : 9 : B_plus_NEW_ConePTasym : 2.970e+00
%--- MLP : ----------------------------------------------
%
%--- Factory : Inter-MVA correlation matrix (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.834 +0.606
%--- Factory : BDTG: +0.834 +1.000 +0.786
%--- Factory : MLP: +0.606 +0.786 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Inter-MVA correlation matrix (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : BDT: +1.000 +0.862 +0.848
%--- Factory : BDTG: +0.862 +1.000 +0.787
%--- Factory : MLP: +0.848 +0.787 +1.000
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (signal):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.373 +0.366 +0.257
%--- Factory : K_plus_PI0_ETA_DTF: -0.409 -0.441 -0.330
%--- Factory : B_plus_NEW_ConePTasym: +0.367 +0.313 +0.222
%--- Factory : B_plus_log_PT_DTF: +0.263 +0.282 +0.191
%--- Factory : B_plus_IPCHI2_OWNPV: -0.344 -0.304 -0.254
%--- Factory : K_plus_log_PT_DTF: +0.280 +0.312 +0.227
%--- Factory : B_plus_log_DIRA: -0.515 -0.406 -0.269
%--- Factory : pi_zero_resolved_CL: +0.348 +0.248 +0.184
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.557 +0.428 +0.294
%--- Factory : --------------------------------
%--- Factory :
%--- Factory : Correlations between input variables and MVA response (background):
%--- Factory : --------------------------------
%--- Factory : BDT BDTG MLP
%--- Factory : gamma_max_log_PT_DTF: +0.315 +0.296 +0.199
%--- Factory : K_plus_PI0_ETA_DTF: -0.562 -0.474 -0.646
%--- Factory : B_plus_NEW_ConePTasym: +0.261 +0.228 +0.197
%--- Factory : B_plus_log_PT_DTF: +0.217 +0.202 +0.189
%--- Factory : B_plus_IPCHI2_OWNPV: -0.472 -0.354 -0.410
%--- Factory : K_plus_log_PT_DTF: +0.363 +0.335 +0.387
%--- Factory : B_plus_log_DIRA: -0.287 -0.278 -0.198
%--- Factory : pi_zero_resolved_CL: +0.254 +0.238 +0.229
%--- Factory : min_mumu_IPCHI2_OWNPV: +0.291 +0.292 +0.222
%--- Factory : --------------------------------