% Introduction to Data Analysis and Machine Learning in Physics: \
  3. Machine Learning Basics, Multivariate Analysis
% Jörg Marks, Klaus Reygers
% Studierendentage, 11-14 April 2023
|
|
|
|
|
|
## Multi-variate analyses (MVA)
|
|
|
|
* General Question
|
|
\vspace{0.1cm}
|
|
|
|
There are two categories of distinguishable data, S and B,
described by a set of variables. What are suitable criteria to
separate the two samples?
|
|
|
|
* A single criterion is not sufficient to distinguish S and B
|
|
* Reduction of the variable space to probabilities for S or B
|
|
\vspace{0.1cm}
|
|
|
|
* Classification of measurements using a set of observables $(V_1,V_2,....,V_n)$
|
|
* find optimal separation conditions considering correlations
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{figures/SandBcuts.jpeg}
|
|
\end{figure}
|
|
|
|
## Multi-variate analyses (MVA)
|
|
|
|
* Regression: in the multidimensional observable space $(V_1,V_2,....,V_n)$
a functional relation with optimal parameters is determined
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{figures/regression.jpeg}
|
|
\end{figure}
|
|
|
|
* supervised regression: model is known
|
|
|
|
* unsupervised regression: model is unknown
|
|
|
|
* maximum likelihood fits are used for the parameter determination
|
|
|
|
|
|
## MVA Classification in N Dimensions
|
|
|
|
For each event there are N measured variables
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.9\textwidth]{figures/classificationVar.jpeg}
|
|
\end{figure}
|
|
|
|
:::: columns
|
|
:::: {.column width=70%}
|
|
* Search for a mathematical transformation F of the N dimensional
|
|
input space to a one dimensional output space $F(\vec V) : \mathbb{R}^N \rightarrow \mathbb{R}$
|
|
|
|
* A simple cut in F implements a complex cut in the N dimensional variable space
|
|
|
|
* Determine $F(\vec V)$ using a model and fit the parameters
|
|
::::
|
|
::::{.column width=30%}
|
|
\includegraphics[]{figures/response.jpeg}
|
|
::::
|
|
:::
|
|
|
|
## MVA Classification in N Dimensions
|
|
|
|
:::: columns
|
|
:::: {.column width=60%}
|
|
* Parameters \newline
Important measures to quantify the separation quality \newline \newline
Efficiency: $\epsilon = \frac{N_S (F>F_0)}{N_S}$ \newline
Purity: $\pi = \frac{N_S (F>F_0)}{N_S(F>F_0) + N_B(F>F_0)}$ \newline
(see the sketch below)
|
|
::::
|
|
::::{.column width=40%}
|
|
\includegraphics[]{figures/response.jpeg}
|
|
::::
|
|
:::
|
|
\vspace{0.3cm}
|
|
:::: columns
|
|
:::: {.column width=60%}
|
|
* Receiver Operating Characteristic (ROC) \newline
Errors in classification \newline
|
|
\includegraphics[width=0.7\textwidth]{figures/error.jpeg}
|
|
::::
|
|
::::{.column width=40%}
|
|
\includegraphics[]{figures/roc.jpeg}
|
|
::::
|
|
:::
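
The sketch below is not part of the original lecture code: the toy score samples `F_sig`, `F_bkg` and the cut value `F0` are assumptions used to illustrate how efficiency, purity and the points of a ROC curve can be computed for a cut $F > F_0$.

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(42)
# toy classifier outputs (assumed): background around F = 0, signal around F = 1
F_bkg = rng.normal(0.0, 0.5, 10000)
F_sig = rng.normal(1.0, 0.5, 10000)

F0 = 0.5                                         # cut value
n_sig_pass = np.sum(F_sig > F0)
n_bkg_pass = np.sum(F_bkg > F0)
efficiency = n_sig_pass / len(F_sig)             # N_S(F > F0) / N_S
purity = n_sig_pass / (n_sig_pass + n_bkg_pass)  # N_S(F > F0) / (N_S(F > F0) + N_B(F > F0))
print(f"efficiency = {efficiency:.3f}, purity = {purity:.3f}")

# scanning the cut value gives the points of the ROC curve
cuts = np.linspace(-2, 3, 100)
sig_eff = [np.mean(F_sig > c) for c in cuts]
bkg_rej = [1 - np.mean(F_bkg > c) for c in cuts]
```

\normalsize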
|
|
|
|
## MVA Classification in N Dimensions
|
|
|
|
|
|
:::: columns
|
|
:::: {.column width=60%}
|
|
* Interpretation of $F(\vec V)$
|
|
|
|
* The distributions of \textcolor{blue}{$F(\vec V|S)$} and \textcolor{red}{$F(\vec V|B)$} are interpreted as probability density functions (PDF), \textcolor{blue}{$PDF_S(F)$} and \textcolor{red}{$PDF_B(F)$}
|
|
|
|
* For a given classifier output $F$ and given signal and background fractions
$f_S$ and $f_B$ the signal probability can be determined (see the sketch below) \newline
$P(data = S\,|\,F)= \frac{\textcolor{blue}{f_S \cdot PDF_S(F)}}{\textcolor{red}{f_B \cdot PDF_B(F)} + \textcolor{blue}{f_S \cdot PDF_S(F)}}$
|
|
::::
|
|
::::{.column width=40%}
|
|
\includegraphics[]{figures/response.jpeg}
|
|
::::
|
|
:::
|
|
\vspace{0.3cm}
|
|
* A cut in the one-dimensional variable $F(\vec V) = F_0$, accepting all events above the cut, determines the signal efficiency and the background efficiency (background rejection). A systematic variation of the cut value $F_0$ gives the ROC curve. \newline
|
|
|
|
\definecolor{darkgreen}{RGB}{0,125,0}
|
|
* \color{darkgreen}{A cut in $F(\vec V)$ corresponds to a complex hypersurface in the variable space, which cannot necessarily be described by a function.}
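
A minimal numpy sketch of the signal probability $P(S|F)$, assuming Gaussian toy PDFs and assumed signal and background fractions (`gauss`, `f_s`, `f_b` are illustrative names, not part of the lecture code):

\footnotesize

```python
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian PDF, used here as an assumed toy model for PDF_S and PDF_B."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

f_s, f_b = 0.2, 0.8   # assumed signal and background fractions

def p_signal(F):
    """P(S|F) = f_S PDF_S(F) / (f_S PDF_S(F) + f_B PDF_B(F))"""
    num = f_s * gauss(F, 1.0, 0.5)
    return num / (num + f_b * gauss(F, 0.0, 0.5))

print(p_signal(np.array([0.0, 0.5, 1.0])))
```

\normalsize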
|
|
|
|
## Simple Cuts in Variables
|
|
|
|
* The simplest classifier to select signal events is a set of cuts on all variables that show a separation
|
|
|
|
* The output is binary and not a probability for $S$ or $B$.
|
|
|
|
* The cuts are optimized by maximizing the background suppression for a given signal efficiency.

* Significance $sig = \epsilon_S \cdot N_S / \sqrt{ \epsilon_S \cdot N_S + \epsilon_B( \epsilon_S) N_B}$ (see the sketch below)
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{figures/cutInVariables.jpeg}
|
|
\end{figure}
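
A minimal sketch of such a cut optimization, assuming toy samples of a single discriminating variable and assumed expected yields `N_S`, `N_B` (all names are illustrative, not from the lecture code):

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(1)
# toy samples of one discriminating variable (assumed for illustration)
x_sig = rng.normal(1.0, 0.5, 5000)
x_bkg = rng.normal(0.0, 0.8, 50000)
N_S, N_B = 1000.0, 100000.0   # assumed expected signal and background yields

def significance(cut):
    eps_s = np.mean(x_sig > cut)   # signal efficiency of the cut
    eps_b = np.mean(x_bkg > cut)   # background efficiency of the cut
    return eps_s * N_S / np.sqrt(eps_s * N_S + eps_b * N_B + 1e-12)

cuts = np.linspace(-1.0, 3.0, 200)
best = cuts[np.argmax([significance(c) for c in cuts])]
print(f"cut maximizing the significance: x > {best:.2f}")
```

\normalsize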
|
|
|
|
|
|
## Fisher Discriminant
|
|
|
|
Idea: find a plane such that the projection of the data onto it gives an optimal separation of signal and background
|
|
|
|
:::: columns
|
|
:::: {.column width=60%}
|
|
* The Fisher discriminant is a linear combination of all input variables
|
|
\newline
|
|
$F(\vec{V}) = \sum_i w_i \cdot V_i = \vec{w}^T \vec{V}$ \newline
|
|
|
|
* $\vec w$ defines the orientation of the plane. The coefficients are chosen such that the difference of the expectation values of the two classes is large and the variances are small (see the sketch below). \newline
$J( \vec{w} ) = \frac {( F_S - F_B )^2}{ \sigma_S^2 + \sigma_B^2 } = \frac { \vec{w}^T K \vec{w} }{ \vec{w}^T L \vec{w} }$ \newline
with $K = (\vec{\mu}_S - \vec{\mu}_B)(\vec{\mu}_S - \vec{\mu}_B)^T$ the between-class scatter matrix and $L$ the sum of the covariance matrices of the two classes (within-class scatter)
|
|
|
|
* For the separation a value $F_c$ is determined.
|
|
::::
|
|
::::{.column width=40%}
|
|
\includegraphics[]{figures/fisher.jpeg}
|
|
::::
|
|
:::
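
A minimal numpy sketch of the Fisher discriminant on assumed toy Gaussian samples; the weights are computed as $\vec w \propto L^{-1}(\vec{\mu}_S - \vec{\mu}_B)$ with $L$ the within-class covariance (illustrative code, not from the lecture):

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 2D samples (assumed): correlated Gaussians for signal and background
V_sig = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 5000)
V_bkg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 5000)

# Fisher weights: w proportional to L^{-1} (mu_S - mu_B),
# with L the sum of the class covariance matrices (within-class scatter)
L = np.cov(V_sig, rowvar=False) + np.cov(V_bkg, rowvar=False)
w = np.linalg.solve(L, V_sig.mean(axis=0) - V_bkg.mean(axis=0))

F_sig = V_sig @ w   # projected signal events
F_bkg = V_bkg @ w   # projected background events
print("separation J(w):",
      (F_sig.mean() - F_bkg.mean())**2 / (F_sig.var() + F_bkg.var()))
```

\normalsize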
|
|
|
|
|
|
## k-Nearest Neighbor Method (1)
|
|
|
|
$k$-NN classifier:
|
|
|
|
* Estimates probability density around the input vector
|
|
* $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$
|
|
|
|
\vspace{2ex}
|
|
|
|
The algorithm finds the $k$ nearest neighbors:

$$ k = k_s + k_b $$
|
|
|
|
Probability for the event to be of signal type:
|
|
|
|
$$ p_s(\vec x) = \frac{k_s(\vec x)}{k_s(\vec x) + k_b(\vec x)} $$
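
A minimal sketch of this signal probability on assumed toy 2D Gaussian samples (the helper `knn_signal_probability` is illustrative, not part of the lecture code):

\footnotesize

```python
import numpy as np

def knn_signal_probability(x, X_sig, X_bkg, k=10):
    """Fraction of signal events among the k nearest training events around x."""
    X = np.vstack([X_sig, X_bkg])
    labels = np.concatenate([np.ones(len(X_sig)), np.zeros(len(X_bkg))])
    dist = np.linalg.norm(X - x, axis=1)   # Euclidean distance to all training events
    nearest = np.argsort(dist)[:k]         # indices of the k nearest neighbors
    k_s = labels[nearest].sum()            # number of signal neighbors
    return k_s / k                         # p_s(x) = k_s / (k_s + k_b)

# toy example (assumed): 2D Gaussian signal and background samples
rng = np.random.default_rng(3)
X_sig = rng.normal(1.0, 1.0, size=(500, 2))
X_bkg = rng.normal(-1.0, 1.0, size=(500, 2))
print(knn_signal_probability(np.array([0.5, 0.5]), X_sig, X_bkg, k=20))
```

\normalsize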
|
|
|
|
## k-Nearest Neighbor Method (2)
|
|
|
|
::: columns
|
|
:::: {.column width=60%}
|
|
Simplest choice for distance measure in feature space is the Euclidean distance:
|
|
$$ R = |\vec x - \vec y|$$
|
|
|
|
Better: take correlations between variables into account:
|
|
|
|
$$ R = \sqrt{(\vec{x}-\vec{y})^T V^{-1} (\vec{x}-\vec{y})} $$
where $V$ is the covariance matrix and $R$ is known as the "Mahalanobis distance" (see the sketch below).
|
|
|
|
|
|
::::
|
|
:::: {.column width=40%}
|
|
![](figures/knn.png)
|
|
::::
|
|
:::
|
|
|
|
\vfill
|
|
|
|
The $k$-NN classifier has best performance when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.
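
A minimal sketch of the Mahalanobis distance, with the covariance matrix estimated from an assumed toy training sample (illustrative code, not from the lecture):

\footnotesize

```python
import numpy as np

def mahalanobis(x, y, V):
    """Mahalanobis distance between x and y for covariance matrix V."""
    d = x - y
    return np.sqrt(d @ np.linalg.solve(V, d))

# toy example (assumed): estimate V from a correlated training sample
rng = np.random.default_rng(7)
X_train = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], 1000)
V = np.cov(X_train, rowvar=False)
print(mahalanobis(np.array([1.0, 0.5]), np.array([0.0, 0.0]), V))
```

\normalsize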
|
|
|
|
|
|
|
|
## Linear regression revisited
|
|
|
|
\vfill
|
|
|
|
::: columns
|
|
:::: {.column width=50%}
|
|
\small \textcolor{gray}{"Galton family heights data": \\ origin of the term "regression"} \normalsize
|
|
![](figures/03_ml_basics_galton_linear_regression_iminuit.pdf)
|
|
|
|
::::
|
|
:::: {.column width=50%}
|
|
|
|
* data: $\{x_i,y_i\}$ \
|
|
* objective: predict $y = f(x)$
|
|
* model: $f(x; \vec \theta) = m x + b, \quad \vec \theta = (m, b)$
|
|
* loss function: $J(\vec \theta) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i; \vec \theta))^2$
|
|
* model training: optimal parameters $\hat{\vec{\theta}} = \mathrm{arg\,min} \, J(\vec \theta)$
|
|
|
|
::::
|
|
:::
|
|
|
|
## Linear regression
|
|
|
|
* Data: vectors with $p$ components ("features"): $\vec x = (x_1, ..., x_p)$
|
|
* $n$ observations: $\{\vec x_i, y_i\}, \quad i = 1, ..., n$
|
|
* Prediction for given vector $x$:
|
|
$$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x \quad \text{where } x_0 := 1 $$
|
|
|
|
* Find the weights that minimize the loss function:

$$\hat{\vec{w}} = \underset{\vec w}{\mathrm{arg\,min}} \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2$$
|
|
|
|
* In case of linear regression closed-form solution exists:
|
|
$$ \hat{\vec{w}} = (X^\intercal X)^{-1} X^\intercal \vec y \quad \text{where} \; X \in \mathbb{R}^{n \times p}$$
|
|
|
|
* $X$ is called the design matrix; row $i$ of $X$ is $\vec x_i$ (see the sketch below)
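
A minimal numpy sketch of the closed-form solution on assumed toy data (illustrative, not part of the lecture code):

\footnotesize

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data (assumed): y = 2 + 3 x + noise
x = rng.uniform(0, 1, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 100)

X = np.column_stack([np.ones_like(x), x])   # design matrix with x_0 = 1
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form solution (normal equations)
print(w_hat)                                # approximately [2, 3]
```

\normalsize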
|
|
|
|
## Linear regression with regularization
|
|
|
|
::: columns
|
|
:::: {.column width=45%}
|
|
* Standard loss function
|
|
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 $$
|
|
|
|
* Ridge regression
|
|
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda |\vec w|^2$$
|
|
|
|
* LASSO regression
|
|
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda |\vec w| $$
|
|
|
|
::::
|
|
:::: {.column width=55%}
|
|
|
|
\vfill
|
|
|
|
![](figures/L1vsL2.pdf)
|
|
\small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
|
|
::::
|
|
:::
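
A minimal scikit-learn sketch comparing ordinary, ridge and LASSO regression on assumed toy data with only two relevant features (illustrative, not part of the lecture code); the LASSO coefficients of the irrelevant features are typically driven to exactly zero:

\footnotesize

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
# toy data (assumed): only features 0 and 1 carry information
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 200)

for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```

\normalsize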
|
|
|
|
## Logistic regression (1)
|
|
|
|
* Consider binary classification task, e.g., $y_i \in \{0,1\}$
|
|
* Objective: Predict probability for outcome $y=1$ given an observation $\vec x$
|
|
* Starting with linear "score"
|
|
$$ s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x$$
|
|
* Define function that translates $s$ into a quantity that has the properties of a probability
|
|
$$ \sigma(s) = \frac{1}{1+e^{-s}} $$
|
|
* We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.
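
A minimal sketch of the linear score and the logistic function for an assumed weight and feature vector (illustrative, not part of the lecture code):

\footnotesize

```python
import numpy as np

def sigmoid(s):
    """Logistic function mapping the linear score s to a probability."""
    return 1.0 / (1.0 + np.exp(-s))

# assumed example weights and feature vector (x_0 = 1 for the intercept)
w = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 2.0, -1.0])
s = w @ x
print(f"score s = {s:.2f}, predicted P(y=1|x) = {sigmoid(s):.2f}")
```

\normalsize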
|
|
|
|
## Logistic regression (2)
|
|
|
|
* Consider feature vector $\vec x$. For a given set of weights $\vec w$ the model predicts
|
|
* a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x)$ for outcome $y=1$
|
|
* a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x)$ for outcome $y=0$
|
|
* The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$ and the observations $y_i$ are fixed).
|
|
* Likelihood for the full data sample ($n$ observations)
|
|
$$ L(\vec w) = \prod_{i=1}^n L_i(\vec w) = \prod_{i=1}^n \sigma(\vec w^\intercal \vec x_i)^{y_i} \,(1-\sigma(\vec w^\intercal \vec x_i))^{1-y_i} $$
|
|
* Maximizing the log-likelihood $\ln L(\vec w)$ corresponds to minimizing the loss function
|
|
$$ C(\vec w) = - \ln L(\vec w) = \sum_{i=1}^n \left[ - y_i \ln \sigma(\vec w^\intercal \vec x_i) -
(1-y_i) \ln(1-\sigma(\vec w^\intercal \vec x_i)) \right]$$
|
|
* This is just the (binary) cross-entropy loss function (see the sketch below)
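
A minimal sketch that minimizes this cross-entropy loss by plain gradient descent on assumed toy data (illustrative; the example notebooks instead rely on scikit-learn):

\footnotesize

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy 1D data (assumed), X with a column of ones for the intercept
rng = np.random.default_rng(0)
t = rng.uniform(0, 6, 200)
y = (rng.uniform(size=200) < sigmoid(1.5 * t - 4.0)).astype(float)
X = np.column_stack([np.ones_like(t), t])

# simple gradient descent on the cross-entropy loss
w = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w) - y)   # gradient of -ln L(w)
    w -= 0.001 * grad
# w should end up roughly near the assumed (w0, w1) = (-4.0, 1.5)
print("fitted weights:", w, " loss:", cross_entropy(w, X, y))
```

\normalsize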
|
|
|
|
|
|
|
|
## Example 1 - Probability of passing an exam (logistic regression) (1)
|
|
|
|
Objective: predict the probability that someone passes an exam based on the number of hours studying
|
|
|
|
$$ p_\mathrm{pass} = \sigma(s) = \frac{1}{1+e^{-s}}, \quad s = w_1 t + w_0, \quad t = \text{\# hours}$$
|
|
|
|
::: columns
|
|
:::: {.column width=40%}
|
|
* Data set: \
|
|
* preparation time $t$ in hours
* passed / not passed (0/1)
|
|
* Parameters need to be determined through numerical minimization
|
|
* $w_0 = -4.0777$
|
|
* $w_1 = 1.5046$
|
|
|
|
|
|
\vspace{1.5ex}
|
|
\footnotesize
|
|
[\textcolor{gray}{03\_ml\_basics\_logistic\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_logistic_regression.ipynb)
|
|
\normalsize
|
|
::::
|
|
:::: {.column width=60%}
|
|
![](figures/03_ml_basics_logistic_regression.pdf){width=90%}
|
|
::::
|
|
:::
|
|
|
|
## Example 1 - Probability of passing an exam (logistic regression) (2)
|
|
|
|
\footnotesize
|
|
\textcolor{gray}{Read data from file:}
|
|
```python
import numpy as np
import pandas as pd

# data: 1. hours studied, 2. passed (0/1)
# filename: path to the data file (defined earlier in the notebook)
df = pd.read_csv(filename, engine='python', sep='\s+')
x_tmp = df['hours_studied'].values
x = np.reshape(x_tmp, (-1, 1))
y = df['passed'].values
```
|
|
\vfill
|
|
\textcolor{gray}{Fit the data:}
|
|
```python
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='none', fit_intercept=True)
clf.fit(x, y);
```
|
|
\vfill
|
|
\textcolor{gray}{Calculate predictions:}
|
|
```python
hours_studied_tmp = np.linspace(0., 6., 1000)
hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
y_pred = clf.predict_proba(hours_studied)
```
|
|
\normalsize
|
|
|
|
|
|
## Precision and recall
|
|
|
|
::: columns
|
|
:::: {.column width=50%}
|
|
\textcolor{blue}{Precision:}\
|
|
Fraction of correctly classified instances among all instances that obtain a certain class label.
|
|
|
|
$$ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
|
|
|
|
\begin{center}
|
|
\textcolor{gray}{"purity"}
|
|
\end{center}
|
|
|
|
::::
|
|
:::: {.column width=50%}
|
|
\textcolor{blue}{Recall:}\
|
|
Fraction of positive instances that are correctly classified.
|
|
\vspace{2.9ex}
|
|
|
|
$$ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
|
|
|
|
\begin{center}
|
|
\textcolor{gray}{"efficiency"}
|
|
\end{center}
|
|
|
|
::::
|
|
:::
|
|
\vfill
|
|
\begin{center}
|
|
\textcolor{gray}{TP: true positives, FP: false positives, FN: false negatives}
|
|
\end{center}
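
A minimal scikit-learn sketch with assumed toy labels, comparing the definitions above with `precision_score` and `recall_score`:

\footnotesize

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix

# assumed toy labels and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
```

\normalsize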
|
|
|
|
|
|
|
|
## Example 2: Heart disease data set (logistic regression) (1)
|
|
|
|
\scriptsize
|
|
\textcolor{gray}{Read data:}
|
|
```python
import pandas as pd

filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/data/heart.csv"
df = pd.read_csv(filename)
df
```
|
|
\vfill
|
|
![](figures/heart_table.png){width=70%}
|
|
\normalsize
|
|
\vspace{1.5ex}
|
|
\footnotesize
|
|
[\textcolor{gray}{03\_ml\_basics\_log\_regr\_heart\_disease.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)
|
|
\normalsize
|
|
|
|
## Example 2: Heart disease data set (logistic regression) (2)
|
|
\footnotesize
|
|
|
|
\textcolor{gray}{Define array of labels and feature vectors}
|
|
```python
y = df['target'].values
X = df[[col for col in df.columns if col!="target"]]
```
|
|
\vfill
|
|
\textcolor{gray}{Generate training and test data sets}
|
|
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
```
|
|
\vfill
|
|
\textcolor{gray}{Fit the model}
|
|
```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='none', fit_intercept=True, max_iter=1000, tol=1E-5)
lr.fit(X_train, y_train)
```
|
|
\normalsize
|
|
|
|
## Example 2: Heart disease data set (logistic regression) (3)
|
|
\footnotesize
|
|
\textcolor{gray}{Test predictions on test data set:}
|
|
```python
from sklearn.metrics import classification_report
y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))
```
|
|
\vfill
|
|
\textcolor{gray}{Output:}
|
|
```
              precision    recall  f1-score   support

           0       0.75      0.86      0.80        63
           1       0.89      0.80      0.84        89

    accuracy                           0.82       152
   macro avg       0.82      0.83      0.82       152
weighted avg       0.83      0.82      0.82       152
```
|
|
|
|
## Example 2: Heart disease data set (logistic regression) (4)
|
|
|
|
\textcolor{gray}{Compare to another classifier using the \textit{receiver operating characteristic} (ROC) curve}
|
|
\vfill
|
|
\textcolor{gray}{Let's take the random forest classifier}
|
|
\footnotesize
|
|
```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=3)
rf.fit(X_train, y_train)
```
|
|
\normalsize
|
|
\vfill
|
|
\textcolor{gray}{Use \texttt{roc\_curve} from scikit-learn}
|
|
\footnotesize
|
|
```python
from sklearn.metrics import roc_curve

y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])

y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])
```
|
|
\normalsize
|
|
|
|
## Example 2: Heart disease data set (logistic regression) (5)
|
|
::: columns
|
|
:::: {.column width=50%}
|
|
\scriptsize
|
|
```python
import matplotlib.pyplot as plt

plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
plt.plot(tpr_rf, 1-fpr_rf, label="random forest")
```
|
|
\vspace{5ex}
|
|
|
|
\normalsize
|
|
\textcolor{gray}{Classifiers can be compared with the \textit{area under curve} (AUC) score.}
|
|
\scriptsize
|
|
```python
from sklearn.metrics import roc_auc_score
y_pred_rf = rf.predict(X_test)
auc_lr = roc_auc_score(y_test, y_pred_lr)
auc_rf = roc_auc_score(y_test, y_pred_rf)
print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")
```
|
|
\vspace{5ex}
|
|
\normalsize
|
|
\textcolor{gray}{This gives}
|
|
\scriptsize
|
|
```
AUC scores: 0.82, 0.83
```
|
|
\normalsize
|
|
|
|
::::
|
|
:::: {.column width=50%}
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.96\textwidth]{figures/03_ml_basics_log_regr_heart_disease.pdf}
|
|
\end{figure}
|
|
::::
|
|
:::
|
|
|
|
## Multinomial logistic regression: Softmax function
|
|
|
|
In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
|
|
\vfill
|
|
Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
|
|
\vfill
|
|
A probability for class $i$ can be predicted with the softmax function:
|
|
$$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
|
|
The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task (see the sketch below).
|
|
\vfill
|
|
Multinomial logistic regression is also known as softmax regression.
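
A minimal numpy sketch of the softmax function for an assumed score vector (illustrative, not part of the lecture code):

\footnotesize

```python
import numpy as np

def softmax(s):
    """Softmax of a score vector s; subtracting max(s) improves numerical stability."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # assumed scores for k = 3 classes
p = softmax(scores)
print(p, p.sum())                    # class probabilities, summing to 1
```

\normalsize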
|
|
|
|
## Example 3: Iris data set (softmax regression) (1)
|
|
|
|
Iris flower data set
|
|
|
|
* Introduced in 1936 in a paper by Ronald Fisher
|
|
* Task: classify flowers
|
|
* Three species: iris setosa, iris virginica and iris versicolor
|
|
* Four features: petal length and width, sepal length and width (in centimeters)
|
|
|
|
::: columns
|
|
:::: {.column width=40%}
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.95\textwidth]{figures/iris_dataset.png}
|
|
\end{figure}
|
|
::::
|
|
:::: {.column width=60%}
|
|
|
|
\vspace{2ex}
|
|
|
|
\footnotesize
|
|
[\textcolor{gray}{03\_ml\_basics\_iris\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_iris_softmax_regression.ipynb)
|
|
|
|
\vspace{19ex}
|
|
|
|
\scriptsize
|
|
[https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
|
|
|
|
[https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
|
|
\normalsize
|
|
::::
|
|
:::
|
|
|
|
## Example 3: Iris data set (softmax regression) (2)
|
|
|
|
\textcolor{gray}{Get data set}
|
|
\footnotesize
|
|
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import some data to play with
# columns: Sepal Length, Sepal Width, Petal Length and Petal Width
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split data into training and test data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
```
|
|
\normalsize
|
|
\vfill
|
|
|
|
\textcolor{gray}{Softmax regression}
|
|
\footnotesize
|
|
```python
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
log_reg.fit(x_train, y_train);
```
|
|
\normalsize
|
|
|
|
## Example 3 : Iris data set (softmax regression) (3)
|
|
|
|
::: columns
|
|
:::: {.column width=70%}
|
|
\textcolor{gray}{Accuracy and confusion matrix for different classifiers}
|
|
\footnotesize
|
|
```python
from sklearn.metrics import accuracy_score, confusion_matrix

# kn_neigh (KNeighborsClassifier) and fisher_ld (LinearDiscriminantAnalysis)
# are defined and fitted earlier in the notebook
for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")

    # confusion matrix:
    # rows: true class, columns: predicted class
    print(confusion_matrix(y_test, y_pred),"\n")
```
|
|
\normalsize
|
|
::::
|
|
:::: {.column width=30%}
|
|
|
|
\footnotesize
|
|
```
LogisticRegression
accuracy: 0.96
[[29  0  0]
 [ 0 23  0]
 [ 0  3 20]]

KNeighborsClassifier
accuracy: 0.95
[[29  0  0]
 [ 0 23  0]
 [ 0  4 19]]

LinearDiscriminantAnalysis
accuracy: 0.99
[[29  0  0]
 [ 0 23  0]
 [ 0  1 22]]
```
|
|
\normalsize
|
|
::::
|
|
:::
|
|
|
|
## Exercise 1: Classification of air showers measured with the MAGIC telescope
|
|
|
|
::: columns
|
|
:::: {.column width=50%}
|
|
|
|
\small
|
|
* Cosmic gamma rays (30 GeV - 30 TeV).
|
|
* Cherenkov light from air showers
|
|
* Background: air showers caused by hadrons.
|
|
\normalsize
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.85\textwidth]{figures/magic_photo_small.png}
|
|
\end{figure}
|
|
::::
|
|
:::: {.column width=50%}
|
|
![](figures/magic_sketch.png)
|
|
::::
|
|
:::
|
|
|
|
## Exercise 1: Classification of air showers measured with the MAGIC telescope
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.75\textwidth]{figures/magic_shower_em_had_small.png}
|
|
\end{figure}
|
|
::: columns
|
|
:::: {.column width=50%}
|
|
\begin{center}
|
|
Gamma shower
|
|
\end{center}
|
|
::::
|
|
:::: {.column width=50%}
|
|
\begin{center}
|
|
Hadronic shower
|
|
\end{center}
|
|
::::
|
|
:::
|
|
|
|
## Exercise 1: Classification of air showers measured with the MAGIC telescope
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.95\textwidth]{figures/magic_shower_parameters.png}
|
|
\end{figure}
|
|
|
|
## Exercise 1: Classification of air showers measured with the MAGIC telescope
|
|
MAGIC data set \
|
|
\tiny
|
|
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope}](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope)
|
|
\normalsize
|
|
|
|
\scriptsize
|
|
```
1. fLength: continuous # major axis of ellipse [mm]
2. fWidth: continuous # minor axis of ellipse [mm]
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]
7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous # angle of major axis with vector to origin [deg]
10. fDist: continuous # distance from origin to center of ellipse [mm]
11. class: g,h # gamma (signal), hadron (background)

g = gamma (signal): 12332
h = hadron (background): 6688

For technical reasons, the number of h events is underestimated.
In the real data, the h class represents the majority of the events.
```
|
|
\normalsize
|
|
|
|
## Exercise 1: Classification of air showers measured with the MAGIC telescope
|
|
|
|
\small
|
|
[\textcolor{gray}{03\_ml\_basics\_ex\_1\_magic.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
|
|
\normalsize
|
|
|
|
a) Create for each variable a figure with a plot for gammas and hadrons overlaid.
|
|
b) Create training and test data set. The test data should amount to 50% of the total data set.
|
|
c) Define the logistic regressor and fit the training data
|
|
d) Determine the model accuracy and the AUC score
|
|
e) Plot the ROC curve (background rejection vs signal efficiency)
|
|
|
|
|
|
## General remarks on multi-variate analyses (MVAs)
|
|
|
|
* MVA Methods
|
|
* More effective than classic cut-based analyses
|
|
* Take correlations of input variables into account
|
|
\vfill
|
|
* Important: find good input variables for MVA methods
|
|
* Good separation power between S and B
|
|
* No strong correlation among variables
|
|
* No correlation with the parameters you try to measure in your signal sample!
|
|
\vfill
|
|
* Pre-processing
|
|
* Apply obvious variable transformations and let MVA method do the rest
|
|
* Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$ use $|\cos \theta|$ and not $\cos \theta$ as input variable
|
|
* It is generally useful to bring all input variables to a similar numerical range
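
A minimal scikit-learn sketch of such a rescaling with `StandardScaler` on an assumed toy feature matrix (illustrative, not part of the lecture code):

\footnotesize

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# assumed toy feature matrix with very different numerical ranges
X = np.array([[1000.0, 0.01],
              [2000.0, 0.02],
              [1500.0, 0.03]])

scaler = StandardScaler().fit(X)   # learn mean and standard deviation per feature
X_scaled = scaler.transform(X)     # zero mean, unit variance for each column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

\normalsize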
|
|
|
|
## Example of feature transformation
|
|
|
|
\begin{figure}
|
|
\centering
|
|
\includegraphics[width=0.95\textwidth]{figures/feature_transformation.png}
|
|
\end{figure}
|
|
|
|
|
|
|
|
## Exercise 2: Data preprocessing
|
|
|
|
a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
|
|
|
|
b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case?
|