\subsection{Correction to the simulation}\label{sec:sel-SimulationCorrection}

The Monte Carlo simulation sample is used to estimate the background contribution in the data and to account for detector acceptance effects. Therefore, the distributions of variables (and the correlations between them) in data and simulation have to agree. Even though there have been many recent improvements in the Monte Carlo simulation methods, the agreement is not perfect. The main difference between the simulation and the data is the event multiplicity: in simulation, the underlying event is under-represented. The simulation is corrected by applying weights to the simulated events so that they match the data. In order to obtain the weights, the simulated events have to pass the same selection as the data sample. On top of this selection, only \emph{true} signal candidates have to be selected: the reconstruction algorithms can reconstruct a track that does not correspond to any simulated particle. Such candidates have to be removed by the so-called \emph{truth-matching}.

\subsubsection[Matching of reconstructed candidates to simulated candidates]{Matching of reconstructed signal candidates to simulated candidates}\label{sec:sel-TruthMatching}

All simulated events undergo the full detector reconstruction, which means the \emph{true} event properties are distorted by the detector effects. To obtain the true variables, dedicated tools are used. In a standard \lhcb analysis, the \texttt{BackgroundCategory} tool is typically used. This tool looks at the true properties of the particles in the decay chain and categorizes the event into groups, such as combinatorial background or events with ghost tracks. However, this tool cannot be used in this work, as the typical background categories do not cover the possible realizations of the \pizTogg decay.

As there is a neutral pion in the decay chain, it is important to make sure the \emph{true} candidates match the signal candidates selected in data. Events where, for instance, a photon converts into an electron-positron pair, or one of the photons in the \pizTogg decay is randomly assigned, are still considered signal, as there is no way to assess the origin of the photon in the data sample. As is clearly visible in \refFig{RndGamma_angles}, there are no structures in the angular distributions of the events with a random photon included. Hence, these events can be considered signal candidates, as they do not distort the angular distributions.

In order to select the \emph{true} signal candidates, an ID-based selection is applied. Each particle type has its own unique ID following the Monte Carlo Particle Numbering Scheme~\cite{PDG}. Each generated particle has its \emph{true} ID and a reconstructed ID, based on the PID response of the \lhcb detector. The ID-based selection is achieved by comparing the \emph{true} ID of each particle, as well as its mother's and grandmother's ID, to the reconstructed hypothesis. This check is applied to the whole decay chain \BuToKstmmFull except for the photons.
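For illustration, a minimal sketch of such an ID-based truth-matching is given below. It assumes the true IDs, mother IDs and grandmother IDs are available as \texttt{numpy} arrays, one entry per reconstructed candidate; the branch names and the exact parentage requirements are illustrative only and do not necessarily match the tuples used in this analysis.

\begin{verbatim}
import numpy as np

# PDG codes from the Monte Carlo Particle Numbering Scheme
PDG = {"B+": 521, "K*+": 323, "K+": 321, "pi0": 111, "mu": 13}

def truth_match(ev):
    """Select true B+ -> K*+(-> K+ pi0(-> gamma gamma)) mu+ mu-
    candidates. `ev` is a dict of numpy arrays holding the true PDG ID
    of every particle in the decay chain, plus its mother and
    grandmother IDs (branch names are illustrative). The photons are
    deliberately not checked, so candidates where one photon is a
    random ECAL hit are kept as signal."""
    muons = (np.abs(ev["mup_TRUEID"]) == PDG["mu"]) \
          & (np.abs(ev["mup_MC_MOTHER_ID"]) == PDG["B+"]) \
          & (np.abs(ev["mum_TRUEID"]) == PDG["mu"]) \
          & (np.abs(ev["mum_MC_MOTHER_ID"]) == PDG["B+"])
    kaon  = (np.abs(ev["Kp_TRUEID"]) == PDG["K+"]) \
          & (np.abs(ev["Kp_MC_MOTHER_ID"]) == PDG["K*+"]) \
          & (np.abs(ev["Kp_MC_GD_MOTHER_ID"]) == PDG["B+"])
    pion  = (np.abs(ev["pi0_TRUEID"]) == PDG["pi0"]) \
          & (np.abs(ev["pi0_MC_MOTHER_ID"]) == PDG["K*+"]) \
          & (np.abs(ev["pi0_MC_GD_MOTHER_ID"]) == PDG["B+"])
    head  = (np.abs(ev["Bp_TRUEID"]) == PDG["B+"]) \
          & (np.abs(ev["Kstp_TRUEID"]) == PDG["K*+"])
    return muons & kaon & pion & head
\end{verbatim}

Leaving the photons out of the check means that candidates with one random photon, shown in \refFig{RndGamma_angles} to be compatible with signal, are retained.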
\begin{figure}[hbt!]
\centering
\includegraphics[width=0.47\textwidth]{./Data/TM/RndGammas/Run_2_KplusPi0Resolved_TMed_costhetak_AllGammaContributions_normalized_fancy.eps}
\includegraphics[width=0.47\textwidth]{./Data/TM/RndGammas/Run_2_KplusPi0Resolved_TMed_costhetal_AllGammaContributions_normalized_fancy.eps}
\includegraphics[width=0.47\textwidth]{./Data/TM/RndGammas/Run_2_KplusPi0Resolved_TMed_phi_AllGammaContributions_normalized_fancy.eps}\\
\captionof{figure}[Distributions of \ctk, \ctl and $\phi$ for events with random photons.]{Normalized \ctk (left), \ctl (middle) and $\phi$ (right) distributions for simulated events where both photons come from the \BuToKstmm, \KstToKpPi, \pizTogg decay chain, or where one photon is a random hit in the \ecal reconstructed as a photon. Black squares denote all events passing the \emph{true} ID requirements, excluding the photons' parent IDs. Red stars are events where both photons originate from the \BuToKstmm decay; blue circles are events where one photon is \emph{true} and one is random. At the bottom of the figures, the ratio of the number of normalized events with only \emph{true} photons over the number of normalized events with one \emph{true} and one random photon is shown. The ratio is consistent with one.}
\label{fig:RndGamma_angles}
\end{figure}

\subsubsection{Reweighting and the \sPlot technique}\label{sec:sel-sWeight}

To account for the simulation imperfections listed above, a correction has to be applied. Very good agreement between the data and the simulation is achieved when the Monte Carlo simulation is weighted in $\pt^{\Bu}$ and \texttt{nLongTracks}, which represents the number of tracks traversing the \velo, the \ttracker and the T-stations. This number is strongly correlated with the overall event multiplicity. The weighting is performed as two independent one-dimensional weightings, as there is no correlation between $\pt^{\Bu}$ and \texttt{nLongTracks}, as can be seen in \refFig{ANA-pt_long_corr}; a sketch of the procedure is given after the figure.

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.62\textwidth]{./Data/weightPlots/2018/2018_KplusPi0Resolved_nLongTracks_B_plus_PT_DTF_Correlation_.eps}
\captionof{figure}[Correlation between $\pt^{\Bu}$ and the number of long tracks.]{Correlation between $\pt^{\Bu}$ and the number of long tracks in the 2018 data sample. The correlation coefficient is $\simeq0$, confirming the variables are not correlated. \vspace{0.5\baselineskip}}
\label{fig:ANA-pt_long_corr}
\end{figure}
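A minimal sketch of the two independent one-dimensional weightings is shown below, assuming the variables are available as \texttt{numpy} arrays in dict-like \texttt{data} and \texttt{mc} samples, and that the data already carry the \sWeight s introduced below. The histogram-ratio approach, the binning and the branch names are illustrative assumptions, not necessarily the exact procedure used in this analysis.

\begin{verbatim}
import numpy as np

def make_weighter(mc_vals, data_vals, data_sweights, bins=50):
    """Return a function assigning each simulated event the ratio of
    the normalized sWeighted-data histogram over the simulation
    histogram in one variable."""
    h_data, edges = np.histogram(data_vals, bins=bins,
                                 weights=data_sweights, density=True)
    h_mc, _ = np.histogram(mc_vals, bins=edges, density=True)
    # Empty simulation bins get weight 1 instead of dividing by zero
    ratio = np.divide(h_data, h_mc, out=np.ones_like(h_data),
                      where=h_mc > 0)

    def weight(vals):
        idx = np.clip(np.digitize(vals, edges) - 1, 0, len(ratio) - 1)
        return ratio[idx]

    return weight

# Two independent one-dimensional weightings, multiplied per event
# (justified by the absence of correlation between the variables):
w_pt   = make_weighter(mc["B_plus_PT"], data["B_plus_PT"],
                       data["sWeight"])
w_mult = make_weighter(mc["nLongTracks"], data["nLongTracks"],
                       data["sWeight"])
mc_weights = w_pt(mc["B_plus_PT"]) * w_mult(mc["nLongTracks"])
\end{verbatim}

Taking the product of the two one-dimensional weights is only valid because $\pt^{\Bu}$ and \texttt{nLongTracks} are uncorrelated, as shown in \refFig{ANA-pt_long_corr}.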
The weights cannot be calculated directly using the data sample: the simulation sample consists only of signal candidates, while in the data sample background is also present. Therefore, the data sample has to be weighted to mimic the signal as closely as possible. This is done using the \sPlot technique~\cite{sPLOT1,sPLOT2}, which unfolds the signal decay from the background by exploiting likelihood fits. The \sPlot technique is a generalization of \emph{sideband subtraction}: it provides a weight for every data point in such a way that the weighted distribution reproduces the background-subtracted distribution.
%
A \emph{discriminating} variable, typically an invariant mass, is chosen. This variable needs to be uncorrelated with the \emph{control} variable: the variable whose behavior \sPlot infers from the \emph{discriminating} variable. In the \emph{discriminating} variable, a sideband region is selected where there is no or only a negligible amount of signal. The distribution of the \emph{control} variable is then determined from the background and extrapolated into the signal region of the \emph{discriminating} variable. The scaled distribution of the \emph{control} variable in the sideband is then subtracted from the distribution in the signal region, resulting in a pure-signal distribution of the \emph{control} variable. Mathematically, this can be expressed using the numbers of signal $N_s$ and background $N_b$ events with probability density functions $s(d,c)$ and $b(d,c)$, respectively:\vspace{-0.25\baselineskip}
%
\begin{equation}\label{eq:sPlot_1}
N_s s(d,c) + N_b b(d,c) = (N_s+N_b) f(d,c)\,,
\end{equation}
%
where $d$ is the \emph{discriminating} variable and $c$ is the \emph{control} variable. $f(d,c)$ is the Probability Density Function (PDF) of the combined distribution of signal and background. As the \emph{control} and \emph{discriminating} variables are uncorrelated, one can factorize their PDFs as\vspace{-0.25\baselineskip}
%
\begin{align}\begin{split}\label{eq:sPlot_2}
s(d,c) &= s(d)\,s(c)\,,\\
b(d,c) &= b(d)\,b(c)\,.
\end{split}\end{align}
%
The goal is to obtain a weight function $w(d)$ fulfilling\vspace{-0.25\baselineskip}
\begin{equation}\label{eq:sPlot_3}
N_s\,s(c) = (N_s+N_b) \int{f(d,c)w(d)\,\deriv{d}} = N_s s(c) \int{s(d)w(d)\,\deriv{d}} + N_b b(c) \int{b(d)w(d)\,\deriv{d}} \,.
\end{equation}
%
Therefore, the function $w(d)$ is chosen such that\vspace{-0.25\baselineskip}
%
\begin{align}\begin{split}\label{eq:sPlot_4}
\int{s(d)w(d)\,\deriv{d}} &= 1\,,\\
\int{b(d)w(d)\,\deriv{d}} &= 0\,.
\end{split}\end{align}
%
To have the smallest statistical uncertainty on the weights, the variance of the weights, given by \refEq{sPlot_5}, has to be minimized:\vspace{-0.25\baselineskip}
\begin{equation}\label{eq:sPlot_5}
\int{ f(d,c) w(d)^2\,\deriv{c}\,\deriv{d}} \,.
\end{equation}
These three conditions ensure a unique determination of the function $w(d)$. This allows for calculating a weight for any event with property $d$, resulting in a signal-only distribution of the \emph{control} variable. These weights are called \sWeight s.

\sWeight ed \emph{data} events in the resonant \qsq region are then used to obtain the weights that correct the simulated sample. The data sample is dominated by the resonances (\refSec{sel-Charmonium}). Hence, the \BuToKstJpsi simulation sample is used to obtain the weights needed to correct the simulation distributions. The agreement between the \sWeight ed data and the weighted simulation is crucial for the next step, the multivariate analysis. The distributions used for the multivariate analysis are carefully validated, see \refApp{CompareVariables}, where the comparison of the \sWeight ed \emph{data} and the weighted simulation for each data-taking year is given. The distributions of the \sWeight ed \emph{data} and the weighted simulation agree very well.
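For completeness, a minimal sketch of the \sWeight\ computation is given below. It assumes that the normalized signal and background PDFs of the \emph{discriminating} variable and the fitted yields $N_s$ and $N_b$ are already known from a maximum-likelihood fit, and it implements the standard closed-form solution of the above conditions (see Refs.~\cite{sPLOT1,sPLOT2}); the function names are illustrative.

\begin{verbatim}
import numpy as np

def signal_sweights(d, pdf_s, pdf_b, n_s, n_b):
    """Per-event signal sWeights from a fit in the discriminating
    variable d. pdf_s and pdf_b are the normalized signal and
    background PDFs (callables), n_s and n_b the fitted yields."""
    f = np.vstack([pdf_s(d), pdf_b(d)])  # species PDFs, shape (2, N)
    yields = np.array([n_s, n_b])
    denom = yields @ f                   # sum_k N_k f_k(d_e), shape (N,)
    # Covariance matrix of the yields:
    # (V^-1)_ij = sum_e f_i(d_e) f_j(d_e) / denom(d_e)^2
    vinv = (f[:, None, :] * f[None, :, :] / denom**2).sum(axis=2)
    v = np.linalg.inv(vinv)
    # w_s(d_e) = sum_j V_sj f_j(d_e) / denom(d_e),
    # which fulfills the normalization conditions above
    return (v[0] @ f) / denom
\end{verbatim}

By construction, the signal \sWeight s sum to the fitted signal yield $N_s$, which provides a simple cross-check of the implementation.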