  1. ---
  2. title: |
  3. | Introduction to Data Analysis and Machine Learning in Physics:
  4. | 3. Machine Learning Basics
  5. author: "Martino Borsato, Jörg Marks, Klaus Reygers"
  6. date: "Studierendentage, 11-14 April 2022"
  7. ---
  8. ## Exercises
  9. * Exercise 1: Air shower classification (MAGIC telescope)
  10. * Logistic regression
  11. * [`03_ml_basics_ex01_magic.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
  12. * Exercise 2: Hand-written digit recognition with logistic regression
  13. * Logistic regression
  14. * [`03_ml_basics_ex02_mnist_softmax_regression.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_2_mnist_softmax_regression.ipynb)
  15. * Exercise 3: Data preprocessing
  16. ## What is machine learning? (1)
  17. ![](figures/deepl.png)
  18. ## What is machine learning? (2)
  19. "Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed" -- Wikipedia
  20. \vspace{2ex}
  21. Example: spam detection \hfill
  22. \scriptsize [\textcolor{gray}{J. Mayes, Machine learning 101}](https://docs.google.com/presentation/d/1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k/preview?imm_mid=0f9b7e&cmp=em-data-na-na-newsltr_20171213&slide=id.g168a3288f7_0_58)
  23. \normalsize
  24. \begin{center}
  25. \includegraphics[width=0.9\textwidth]{figures/ml_example_spam.png}
  26. \vspace{2ex}
  27. Manual feature engineering vs. automatic feature detection
  28. \end{center}
  29. ## AI, ML, and DL
  30. "AI is the study of how to make computers perform things that, at the moment, people do better."
  31. \tiny \textcolor{gray}{Elaine Rich, Artificial intelligence, McGraw-Hill 1983} \normalsize
  32. \vfill
  33. \tiny \hfill \textcolor{gray}{G. Marcus, E. Davis, Rebooting AI} \normalsize
  34. \begin{figure}
  35. \centering
  36. %![](figures/ai_ml_dl.pdf){width=70%}
  37. \includegraphics[width=0.7\textwidth]{figures/ai_ml_dl.pdf}
  38. \end{figure}
  39. \vfill
  40. "deep" in deep learning: artificial neural nets with many neurons and multiple layers of nonlinear processing units for feature extraction
  41. ## Multivariate analysis: An early example from particle physics
  42. ::: columns
  43. :::: {.column width=55%}
  44. ![](figures/mva.png){width=99%}
  45. ::::
  46. :::: {.column width=45%}
  47. * Signal: $e^+e^- \to W^+W^-$
  48. * often 4 well separated hadron jets
  49. * Background: $e^+e^- \to qqgg$
  50. * 4 less well separated hadron jets
  51. * Input variables based on jet structure, event shape, ... none by itself gives much separation.
  52. ![](figures/mva_nn.png){width=85%}
  53. \tiny \textcolor{gray}{(Garrido, Juste and Martinez, ALEPH 96-144)} \normalsize
  54. ::::
  55. :::
  56. ## Applications of machine learning in physics
  57. * Particle physics: Particle identification / classification
  58. * Astronomy: Galaxy morphology classification
  59. * Chemistry and material science: predict properties of new molecules / materials
  60. * Many-body quantum matter: classification of quantum phases
  61. \vspace{3ex}
  62. \scriptsize [\textcolor{gray}{Machine learning and the physical sciences, arXiv:1903.10563}](https://arxiv.org/abs/1903.10563) \normalsize
  63. ## Some successes and unsolved problems in AI
  64. ::: columns
  65. :::: {.column width=50%}
  66. ![](figures/ai_history.png){width=85%}
  67. \tiny \textcolor{gray}{M. Wooldridge, The road to conscious machines} \normalsize
  68. ::::
  69. :::: {.column width=50%}
  70. Impressive progress in certain fields:
  71. \small
  72. * Image recognition
  73. * Speech recognition
  74. * Recommendation systems
  75. * Automated translation
  76. * Analysis of medical data
  77. \normalsize
  78. \vfill
  79. How can we profit from these developments in physics?
  80. ::::
  81. :::
  82. ## The deep learning hype -- why now?
  83. Artificial neural networks have been around for decades. Why did deep learning take off after 2012?
  84. \vspace{5ex}
  85. * Improved hardware -- graphical processing units [GPUs]
  86. * Large data sets (e.g. images) distributed via the Internet
  87. * Algorithmic advances
  88. ## Different modeling approaches
  89. * Simple mathematical representation like linear regression. Favored by statisticians.
  90. * Complex deterministic models based on scientific understanding of the physical process. Favored by physicists.
  91. * Complex algorithms to make predictions that are derived from a huge number of past examples (“machine learning” as developed in the field of computer science). These are often black boxes.
  92. * Regression models that claim to reach causal conclusions. Used by economists.
  93. \tiny \textcolor{gray}{D. Spiegelhalter, The Art of Statistics – Learning from data} \normalsize
  94. ## Machine learning: The "hello world" problem
  95. ::: columns
  96. :::: {.column width=45%}
  97. Recognition of handwritten digits
  98. * MNIST database (Modified National Institute of Standards and Technology database)
  99. * 60,000 training images and 10,000 testing images labeled with correct answer
  100. * $28 \times 28$ pixels
  101. * Algorithms have reached "near-human performance"
  102. * Smallest error rate (2018): 0.18\%
  103. ::::
  104. :::: {.column width=55%}
  105. ![](figures/mnist.png)
  106. \tiny
  107. [\color{gray}{\texttt{https://en.wikipedia.org/wiki/MNIST\_database}}](https://en.wikipedia.org/wiki/MNIST_database)
  108. \normalsize
  109. ::::
  110. :::
  111. ## Machine learning: Image recognition
  112. ImageNet database
  113. * 14 million images, 22,000 categories
  114. * Since 2010, the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC): 1.4 million images, 1000 categories
  115. * In 2017, 29 of 38 competing teams got less than 5\% wrong
  116. \begin{figure}
  117. \centering
  118. \includegraphics[width=0.8\textwidth]{figures/imagenet.png}
  119. \end{figure}
  120. ## ImageNet: Large Scale Visual Recognition Challenge
  121. \begin{figure}
  122. \centering
  123. \includegraphics[width=0.8\textwidth]{figures/imagenet_challenge.png}
  124. \end{figure}
  125. \vfill
  126. \scriptsize
  127. \textcolor{gray}{O. Russakovsky et al, arXiv:1409.0575}
  128. \normalsize
  129. ## Adversarial attack
  130. \begin{figure}
  131. \centering
  132. \includegraphics[width=\textwidth]{figures/adversarial_attack.png}
  133. \end{figure}
  134. \vspace{3ex}
  135. \scriptsize [\textcolor{gray}{Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, arXiv:1412.6572v1}](https://arxiv.org/abs/1412.6572v1) \normalsize
  136. ## Types of machine learning
  137. ::: columns
  138. :::: {.column width=60%}
  139. Reinforcement learning
  140. \small
  141. * The machine ("the agent") predicts a scalar reward given once in a while
  142. * Weak feedback
  143. \normalsize
  144. ::::
  145. :::: {.column width=35%}
  146. \tiny [\textcolor{gray}{LeCun 2018, Power And Limits of Deep Learning}](https://www.youtube.com/watch?v=0tEhw5t6rhc) \normalsize
  147. ![](figures/videogame.png)
  148. ::::
  149. :::
  150. \vfill
  151. ::: columns
  152. :::: {.column width=60%}
  153. \vspace{1em}
  154. Supervised learning
  155. \small
  156. * The machine predicts a category based on labeled training data
  157. * Medium feedback
  158. \normalsize
  159. ::::
  160. :::: {.column width=35%}
  161. ![](figures/supervised_learning_car_plane.png)
  162. ::::
  163. :::
  164. \vfill
  165. ::: columns
  166. :::: {.column width=60%}
  167. \vspace{1em}
  168. Unsupervised learning
  169. \small
  170. * Describe/find hidden structure from "unlabeled" data
  171. * Cluster data in different sub-groups with similar properties
  172. \normalsize
  173. ::::
  174. :::: {.column width=35%}
  175. ![](figures/anomaly_detection.png)
  176. ::::
  177. :::
  178. ## Books on machine learning (1)
  179. ::: columns
  180. :::: {.column width=85%}
  181. Ian Goodfellow and Yoshua Bengio and Aaron Courville, \textit{Deep Learning}, free online [http://www.deeplearningbook.org/](http://www.deeplearningbook.org/)
  182. \vspace{8ex}
  183. Kevin Murphy, \textit{Probabilistic Machine Learning: An Introduction}, [draft pdf version](https://probml.github.io/pml-book/)
  184. \vspace{7ex}
  185. Aurelien Geron, \textit{Hands-On Machine Learning with Scikit-Learn and TensorFlow}
  186. ::::
  187. :::: {.column width=15%}
  188. ![](figures/deep_learning_book.png){width=65%}
  189. \vspace{3ex}
  190. ![](figures/book-murphy.png){width=65%}
  191. \vspace{3ex}
  192. ![](figures/hands_on_machine_learning.png){width=65%}
  193. ::::
  194. :::
  195. ## Books on machine learning (2)
  196. ::: columns
  197. :::: {.column width=85%}
  198. Francois Chollet, \textit{Deep Learning with Python}
  199. \vspace{10ex}
  200. Martin Erdmann, Jonas Glombitza, Gregor Kasieczka, Uwe Klemradt, \textit{Deep Learning for Physics Research}
  201. ::::
  202. :::: {.column width=15%}
  203. ![](figures/deep_learning_with_python.png){width=65%}
  204. \vspace{3ex}
  205. ![](figures/book_deep_learning_for_physics_research.png){width=65%}
  206. ::::
  207. :::
  208. ## Papers
  209. A high-bias, low-variance introduction to Machine Learning for physicists
  210. [https://arxiv.org/abs/1803.08823](https://arxiv.org/abs/1803.08823)
  211. \vspace{3ex}
  212. Machine learning and the physical sciences
  213. [https://arxiv.org/abs/1903.10563](https://arxiv.org/abs/1903.10563)
  214. ## Supervised learning in a nutshell
  215. * Supervised Machine Learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.
  216. * Each event is characterized by $n$ observables: $\vec x = (x_1, x_2, ..., x_n) \;$ \textcolor{gray}{"feature vector"}
  217. \begin{figure}
  218. \centering
  219. \raisebox{-0.5\height}{\includegraphics[width=0.69\textwidth]{figures/supervised_nutshell.png}}
  220. \raisebox{-0.5\height}{\includegraphics[width=0.30\textwidth]{figures/loss_fct.png}}
  221. \end{figure}
  222. * Design a function $y(\vec x, \vec w)$ with adjustable parameters $\vec w$
  223. * Design a loss function
  224. * Find the best parameters which minimize the loss (see the sketch below)
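A minimal sketch of this workflow on made-up toy data (all names and numbers here are purely illustrative): a linear model $y(x, \vec w)$, a squared-error loss, and a numerical minimization with `scipy.optimize.minimize`.

\footnotesize
```python
import numpy as np
from scipy.optimize import minimize

# toy labeled training data (made up for illustration)
rng = np.random.default_rng(42)
x = np.linspace(0., 5., 50)
t = 1.5 * x + 0.5 + rng.normal(0., 0.5, size=x.size)

# 1) design a function y(x, w) with adjustable parameters w
def y(x, w):
    return w[0] + w[1] * x

# 2) design a loss function (here: mean squared error)
def loss(w):
    return np.mean((y(x, w) - t) ** 2)

# 3) find the parameters that minimize the loss
result = minimize(loss, x0=[0., 0.])
print(result.x)   # should be close to (0.5, 1.5)
```
\normalsize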
  225. ## Supervised learning: classification and regression
  226. The codomain $Y$ of the function $y: X \to Y$ can be a set of labels or classes or a continuous domain, e.g., $\mathbb{R}$
  227. \vfill
  228. * $Y$ = finite set of labels $\quad \to \quad$ \textcolor{red}{classification}
  229. * binary classification: $Y = \{0,1\}$
  230. * multi-class classification: $Y = \{c_1, c_2, ..., c_n\}$
  231. * $Y$ = real numbers $\quad \to \quad$ \textcolor{red}{regression}
  232. \vfill
  233. \textcolor{gray}{"All the impressive achievements of deep learning amount to just curve fitting" \\[0.5cm]}
  234. \footnotesize
  235. \textcolor{gray}{J. Pearl, Turing Award Winner 2011\\}
  236. \tiny
  237. [\color{gray}{To Build Truly Intelligent Machines, Teach Them Cause and Effect, Quantamagazine}](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/)
  238. \normalsize
  239. ## Classification: Learning decision boundaries
  240. \begin{figure}
  241. \centering
  242. \includegraphics{figures/decision_boundaries.png}
  243. \end{figure}
  244. ## Supervised learning: Training, validation, and test sample
  245. * Decision boundary fixed with \textcolor{blue}{training sample}
  246. * Performance on training sample becomes better with more iterations
  247. * Danger of overtraining: Statistical fluctuations of the training sample will be learnt
  248. * \textcolor{blue}{Validation sample} = independent labeled data set not used for training $\rightarrow$ check for overtraining
  249. * Sign of overtraining: performance on validation sample becomes worse $\rightarrow$ Stop training when signs of overtraining are observed (early stopping)
  250. * Performance: apply classifier to independent \textcolor{blue}{test sample}
  251. * Often: test sample = validation sample (only small bias)
  252. ## Supervised learning: Cross validation
  253. Rule of thumb if training data is not expensive:
  254. ::: columns
  255. :::: {.column width=60%}
  256. * Training sample: 50%
  257. * Validation sample: 25%
  258. * Test sample: 25%
  259. \vspace{2ex}
  260. Cross validation (efficient use of scarce training data)
  261. * Split the training sample into $k$ independent subsets $T_k$ of the full sample $T$
  262. * Train on $T \setminus T_k$ resulting in $k$ different classifiers
  263. * For each training event there is one classifier that didn't use this event for training
  264. * Validation results are then combined
  265. ::::
  266. :::: {.column width=40%}
  267. \textcolor{gray}{Often test sample = validation sample (bias is rather small)}
  268. \vspace{10ex}
  269. ![](figures/cross_val.png)
  270. ::::
  271. :::
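A minimal cross-validation sketch with scikit-learn's `cross_val_score` (the iris data and the logistic-regression classifier are placeholders for your own data set and model):

\footnotesize
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# k = 5: train on 4/5 of the sample, validate on the remaining 1/5,
# repeated such that every event is used for validation exactly once
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())
```
\normalsize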
  272. ## Often used loss functions
  273. ::: columns
  274. :::: {.column width=45%}
  275. \textcolor{blue}{Square error loss}:
  276. * often used in regression
  277. ::::
  278. :::: {.column width=55%}
  279. $$ E(y(\vec x, \vec w), t) = (y(\vec x, \vec w) - t)^2 $$
  280. ::::
  281. :::
  282. \vfill
  283. ::: columns
  284. :::: {.column width=45%}
  285. \textcolor{blue}{Cross entropy}:
  286. * $t \in \{0,1\}$
  287. * $y(\vec x, \vec w)$: predicted probability for outcome $t=1$
  288. * often used in classification
  289. ::::
  290. :::: {.column width=55%}
  291. \begin{align*}
  292. E(y(\vec x, \vec w), t) = & - t \log y(\vec x, \vec w) \\ & - (1 - t) \log(1 - y(\vec x, \vec w))
  293. \end{align*}
  294. ::::
  295. :::
  296. ## More on entropy
  297. * Self-information of an event $x$: $I(x) = - \log p(x)$
  298. * in units of **nats** (1 nat = information gained by observing an event of probability $1/e$)
  299. \vfill
  300. * Shannon entropy: $H(P) = - \sum p_i \log p_i$
  301. * Expected amount of information in an event drawn from a distribution $P$
  302. * Measure of the minimum amount of bits needed on average to encode symbols drawn from a distribution
  303. \vfill
  304. * Cross entropy: $H(P,Q) = - E[\log Q] = - \sum p_i \log q_i$
  305. * Can be interpreted as a measure of the amount of bits needed when a wrong distribution Q is assumed while the data actually follows a distribution P
  306. * Measure of the dissimilarity between the distributions P and Q, i.e., a measure of how well the model Q describes the true distribution P (see the numerical example below)
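A small numerical example of these quantities for two arbitrary discrete distributions (natural logarithm, i.e. entropies in nats):

\footnotesize
```python
import numpy as np

p = np.array([0.5, 0.4, 0.1])   # "true" distribution P
q = np.array([0.8, 0.1, 0.1])   # model distribution Q

shannon_entropy = -np.sum(p * np.log(p))           # H(P)
cross_entropy   = -np.sum(p * np.log(q))           # H(P,Q) >= H(P)
kl_divergence   = cross_entropy - shannon_entropy  # "extra" nats from using Q instead of P

print(f"H(P) = {shannon_entropy:.3f} nats, H(P,Q) = {cross_entropy:.3f} nats, "
      f"D_KL(P||Q) = {kl_divergence:.3f} nats")
```
\normalsize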
  307. ## Hypothesis testing
  308. ::: columns
  309. :::: {.column width=55%}
  310. \includegraphics[width=\textwidth]{figures/signal_background_distr.png}
  311. ::::
  312. :::: {.column width=45%}
  313. \vspace{2ex}
  314. test statistic
  315. * a (usually scalar) variable which is a function of the data alone that can be used to test hypotheses
  316. * example: $\chi^2$ w.r.t. a theory curve
  317. ::::
  318. :::
  319. \textcolor{gray}{$\epsilon_\mathrm{B} \equiv \alpha$}: "background efficiency", i.e., prob. to misclassify bckg. as signal
  320. \textcolor{gray}{$\epsilon_\mathrm{S} \equiv 1 - \beta$}: "signal efficiency"
  321. \begin{center}
  322. \begin{tabular}{ l l l}
  323. & $H_0$ is true & $H_0$ is false (i.e., $H_1$ is true)\\
  324. \hline
  325. $H_0$ is rejected & Type I error ($\alpha$) & Correct decision ($1 - \beta$) \\
  326. $H_0$ is not rejected & Correct decision ($1 - \alpha$) & Type II error ($\beta$) \\
  327. \hline
  328. \end{tabular}
  329. \end{center}
  330. ## Neyman-Pearson Lemma
  331. The likelihood ratio
  332. $$ t(\vec x) = \frac{f(\vec x|H_1)}{f(\vec x|H_0)} $$
  333. is an optimal test statistic, i.e., it provides the highest "signal efficiency" $1-\beta$ for a given "background efficiency" $\alpha$. Accept the signal hypothesis $H_1$ (reject $H_0$) if $t(\vec x) > c$.
  334. \vfill
  335. Problem: the underlying pdf's are almost never known explicitly.
  336. \vfill
  337. Two approaches
  338. 1. Estimate signal and background pdf's and construct test statistic based on Neyman-Pearson lemma
  339. 2. Decision boundaries determined directly without approximating the pdf's (linear discriminants, decision trees, neural networks, ...)
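A sketch of approach 1 for a deliberately simple case in which the pdfs are assumed to be known one-dimensional Gaussians (the means, widths, and the cut value $c$ are arbitrary choices for illustration):

\footnotesize
```python
import numpy as np
from scipy.stats import norm

signal_pdf     = norm(loc=1.0, scale=0.5)   # f(x|H1), assumed known
background_pdf = norm(loc=0.0, scale=1.0)   # f(x|H0), assumed known

def t(x):
    """Likelihood ratio t(x) = f(x|H1) / f(x|H0)."""
    return signal_pdf.pdf(x) / background_pdf.pdf(x)

x = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(t(x))          # larger values are more signal-like
print(t(x) > 1.5)    # classify as signal if t(x) > c, here c = 1.5
```
\normalsize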
  340. ## Estimating PDFs from Histograms?
  341. \begin{center}
  342. \includegraphics[width=0.8\textwidth]{figures/pdf_from_2d_histogram.png}
  343. $\color{gray} \text{approximate PDF by} \; N(x,y|S) \; \text{and} \; N(x,y|B)$
  344. \end{center}
  345. $M$ bins per variable in $d$ dimensions: $M^d$ cells $\to$ hard to generate enough training data (often not practical for $d > 1$)
  346. In general in machine learning, problems related to a large number of dimensions of the feature space are referred to as the \textcolor{red}{"curse of dimensionality"}
  347. ## Na$\text{\"i}$ve Bayesian Classifier (also called "Projected Likelihood Classification")
  348. Application of the Neyman-Pearson lemma (ignoring correlations between the $x_i$):
  349. $$ f(x_1, x_2, ..., x_n) \quad \mbox{approximated as} \quad L = f_1(x_1) \cdot f_2(x_2) \cdot ... \cdot f_n(x_n) $$
  350. \begin{align*}
  351. \mbox{where} \quad
  352. f_1(x_1) & = \int \mathrm dx_2 \mathrm dx_3 ... \mathrm dx_n\; f(x_1, x_2, ..., x_n) \\
  353. f_2(x_2) & = \int \mathrm dx_1 \mathrm dx_3 ... \mathrm dx_n\; f(x_1, x_2, ..., x_n) \\
  354. \vdots
  355. \end{align*}
  356. Classification of feature vector $x$:
  357. $$
  358. y(\vec x) = \frac{L_\mathrm{s}(\vec x)}{L_\mathrm{s}(\vec x) + L_\mathrm{b}(\vec x)} = \frac{1}{1 + L_\mathrm{b}(\vec x) / L_\mathrm{s}(\vec x)}
  359. $$
  360. Performance not optimal if true PDF does not factorize
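A minimal sketch with scikit-learn's `GaussianNB`, which implements this factorized ansatz with Gaussian one-dimensional densities per class (the toy data set is just an example):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# toy "signal vs. background" data set with 4 features
X, y = make_classification(n_samples=2000, n_features=4, random_state=0)

nb = GaussianNB()          # assumes f(x1,...,xn) = f1(x1) * ... * fn(xn) per class
nb.fit(X, y)

# y(x) = L_s / (L_s + L_b): predicted probability of the signal class (label 1)
print(nb.predict_proba(X[:5])[:, 1])
```
\normalsize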
  361. ## k-Nearest Neighbor Method (1)
  362. $k$-NN classifier:
  363. * Estimates probability density around the input vector
  364. * $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$
  365. \vspace{2ex}
  366. The algorithm finds the $k$ nearest neighbors:
  367. $$ k = k_s + k_b $$
  368. Probability for the event to be of signal type:
  369. $$ p_s(\vec x) = \frac{k_s(\vec x)}{k_s(\vec x) + k_b(\vec x)} $$
  370. ## k-Nearest Neighbor Method (2)
  371. ::: columns
  372. :::: {.column width=60%}
  373. Simplest choice for distance measure in feature space is the Euclidean distance:
  374. $$ R = |\vec x - \vec y|$$
  375. Better: take correlations between variables into account:
  376. $$R = \sqrt{(\vec{x}-\vec{y})^T \mat{V}^{-1} (\vec{x}-\vec{y})}$$
  377. $$ \mat{V} = \text{covariance matrix}, R = \text{"Mahalanobis distance"}$$
  378. ::::
  379. :::: {.column width=40%}
  380. ![](figures/knn.png)
  381. ::::
  382. :::
  383. \vfill
  384. The $k$-NN classifier performs best when the boundary that separates signal and background events has irregular features that cannot easily be approximated by parametric learning methods.
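A sketch of both distance choices with scikit-learn's `KNeighborsClassifier` on toy data; the Mahalanobis metric is selected by passing the inverse covariance matrix (the exact keyword arguments may depend on the scikit-learn version):

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=4, random_state=0)

# k-NN with the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X, y)
print(knn.predict_proba(X[:5])[:, 1])     # p_s(x) = k_s / (k_s + k_b)

# k-NN with the Mahalanobis distance (takes correlations into account)
VI = np.linalg.inv(np.cov(X, rowvar=False))
knn_m = KNeighborsClassifier(n_neighbors=20, algorithm='brute',
                             metric='mahalanobis', metric_params={'VI': VI})
knn_m.fit(X, y)
print(knn_m.predict_proba(X[:5])[:, 1])
```
\normalsize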
  385. ## Fisher Linear Discriminant
  386. Linear discriminant is simple. Can still be optimal if amount of training data is limited.
  387. Ansatz for test statistic: $$ y(\vec x) = \sum_{i=1}^n w_i x_i = \vec w^\intercal \vec x $$
  388. Choose parameters $w_i$ so that separation between signal and background distribution is maximum.
  389. \vfill
  390. Need to define "separation".
  391. ::: columns
  392. :::: {.column width=45%}
  393. \begin{center}
  394. Fisher: maximize $$ J(\vec w) = \frac{(\tau_s - \tau_b)^2}{\Sigma_s^2 + \Sigma_b^2} $$
  395. \end{center}
  396. ::::
  397. :::: {.column width=55%}
  398. ![](figures/fisher.png)
  399. ::::
  400. :::
  401. ## Fisher Linear Discriminant: Determining the Coefficients $w_i$
  402. ::: columns
  403. :::: {.column width=60%}
  404. Coefficients are obtained from: $$ \frac{\partial J}{\partial w_i} = 0 $$
  405. \vspace{2ex}
  406. Linear decision boundaries
  407. \vspace{5ex}
  408. Weight vector $\vec w$ can be interpreted as a direction in feature space onto which the events are projected.
  409. ::::
  410. :::: {.column width=40%}
  411. ![](figures/fisher_linear_decision_boundary.png)
  412. ::::
  413. :::
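Solving $\partial J / \partial w_i = 0$ leads to the standard closed-form solution $\vec w \propto (V_s + V_b)^{-1}(\vec\mu_s - \vec\mu_b)$, with the class means $\vec\mu$ and covariance matrices $V$; a small numpy sketch on toy data:

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=4, random_state=0)
Xs, Xb = X[y == 1], X[y == 0]                 # "signal" and "background" events

mu_s, mu_b = Xs.mean(axis=0), Xb.mean(axis=0)
V_s = np.cov(Xs, rowvar=False)
V_b = np.cov(Xb, rowvar=False)

# Fisher weights: w proportional to (V_s + V_b)^(-1) (mu_s - mu_b)
w = np.linalg.solve(V_s + V_b, mu_s - mu_b)

# project each event onto w -> scalar test statistic y(x) = w^T x
print((Xs @ w).mean(), (Xb @ w).mean())       # projected means of signal and background
```
\normalsize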
  414. ## Linear regression revisited
  415. \vfill
  416. ::: columns
  417. :::: {.column width=50%}
  418. \small \textcolor{gray}{"Galton family heights data": \\ origin of the term "regression"} \normalsize
  419. ![](figures/03_ml_basics_galton_linear_regression_iminuit.pdf)
  420. ::::
  421. :::: {.column width=50%}
  422. * data: $\{x_i,y_i\}$ \
  423. * objective: predict $y = f(x)$
  424. * model: $f(x; \vec \theta) = m x + b, \quad \vec \theta = (m, b)$
  425. * loss function: $J(\theta|x,y) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2$
  426. * model training: optimal parameters $\hat{\vec{\theta}} = \mathrm{arg\,min} \, J(\vec \theta)$
  427. ::::
  428. :::
  429. ## Linear regression
  430. * Data: vectors with $p$ components ("features"): $\vec x = (x_1, ..., x_p)$
  431. * $n$ observations: $\{\vec x_i, y_i\}, \quad i = 1, ..., n$
  432. * Prediction for given vector $x$:
  433. $$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x \quad \text{where } x_0 := 1 $$
  434. * Find the weights that minimize the loss function:
  435. $$\hat{\vec{w}} = \underset{\vec w}{\mathrm{arg\,min}} \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2$$
  436. * In case of linear regression closed-form solution exists:
  437. $$ \hat{\vec{w}} = (\mat{X}^\intercal \mat{X})^{-1} \mat{X}^\intercal \vec y \quad \text{where} \; \mat{X} \in \mathbb{R}^{n \times (p+1)}$$
  438. * $X$ is called the design matrix; row $i$ of $X$ is $\vec x_i$ (with $x_0 = 1$ prepended)
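A numpy sketch of the closed-form solution on made-up data, with the column of ones ($x_0 := 1$) prepended to form the design matrix:

\footnotesize
```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
features = rng.normal(size=(n, p))
y = 3.0 + 1.5 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(0., 0.1, n)

# design matrix: first column of ones for the intercept w_0
X = np.hstack([np.ones((n, 1)), features])

# normal equation: w = (X^T X)^(-1) X^T y  (solve instead of explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # approximately [3.0, 1.5, -2.0]
```
\normalsize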
  439. ## Linear regression with regularization
  440. ::: columns
  441. :::: {.column width=45%}
  442. * Standard loss function
  443. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 $$
  444. * Ridge regression
  445. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \lVert \vec w \rVert_2^2$$
  446. * LASSO regression
  447. $$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \lVert \vec w \rVert_1 $$
  448. ::::
  449. :::: {.column width=55%}
  450. \vfill
  451. ![](figures/L1vsL2.pdf)
  452. \small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
  453. ::::
  454. :::
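A sketch comparing the three loss functions with scikit-learn on toy data in which only two of ten features carry information (the regularization strengths $\lambda$, called `alpha` in scikit-learn, are arbitrary choices):

\footnotesize
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0., 0.5, 200)

for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X, y)
    # LASSO sets many of the coefficients exactly to zero (sparse solution)
    print(type(model).__name__, np.round(model.coef_, 2))
```
\normalsize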
  455. ## Logistic regression (1)
  456. * Consider binary classification task, e.g., $y_i \in \{0,1\}$
  457. * Objective: Predict probability for outcome $y=1$ given an observation $\vec x$
  458. * Starting with linear "score"
  459. $$ s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x$$
  460. * Define function that translates $s$ into a quantity that has the properties of a probability
  461. $$ \sigma(s) = \frac{1}{1+e^{-s}} $$
  462. * We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.
  463. ## Logistic regression (2)
  464. * Consider feature vector $\vec x$. For a given set of weights $\vec w$ the model predicts
  465. * a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x)$ for outcome $y=1$
  466. * a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x)$ for outcome $y=0$
  467. * The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$ and the observations $y_i$ are fixed).
  468. * Likelihood for the full data sample ($n$ observations)
  469. $$ L(\vec w) = \prod_{i=1}^n L_i(\vec w) = \prod_{i=1}^n \sigma(\vec w^\intercal \vec x_i)^{y_i} \,(1-\sigma(\vec w^\intercal \vec x_i))^{1-y_i} $$
  470. * Maximizing the log-likelihood $\ln L(\vec w)$ corresponds to minimizing the loss function
  471. $$ C(\vec w) = - \ln L(\vec w) = \sum_{i=1}^n - y_i \ln \sigma(\vec w^\intercal \vec x_i) -
  472. (1-y_i) \ln(1-\sigma(\vec w^\intercal \vec x_i))$$
  473. * This is nothing else but the cross-entropy loss function
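A minimal from-scratch sketch of this maximum-likelihood fit: the cross-entropy loss is minimized by plain gradient descent on toy data (learning rate, number of iterations, and the "true" weights $(-1, 2)$ are arbitrary choices):

\footnotesize
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
X = np.hstack([np.ones((500, 1)), x])      # x_0 := 1 for the bias w_0
p_true = 1. / (1. + np.exp(-(2. * x[:, 0] - 1.)))
y = (rng.uniform(size=500) < p_true).astype(float)

def sigma(s):
    return 1. / (1. + np.exp(-s))

w = np.zeros(2)
for _ in range(2000):
    p = sigma(X @ w)                  # predicted probabilities for y = 1
    grad = X.T @ (p - y) / len(y)     # gradient of the mean cross entropy
    w -= 0.5 * grad                   # gradient-descent step
print(w)                              # roughly (-1, 2)
```
\normalsize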
  474. ## scikit-learn
  475. ::: columns
  476. :::: {.column width=70%}
  477. * Free software machine learning library for Python
  478. * Initial release: 2007
  479. * Features various classification, regression and clustering algorithms, including k-nearest neighbors, multi-layer perceptrons, support vector machines, random forests, gradient boosting, and k-means
  480. * Scikit-learn is one of the most popular machine learning libraries on GitHub
  481. * [https://scikit-learn.org/](https://scikit-learn.org/)
  482. ::::
  483. :::: {.column width=30%}
  484. \vspace{7ex}
  485. \begin{figure}
  486. \centering
  487. \includegraphics[width=0.85\textwidth]{figures/scikit-learn.png}
  488. \end{figure}
  489. ::::
  490. :::
  491. ## Example 1 - Probability of passing an exam (logistic regression) (1)
  492. Objective: predict the probability that someone passes an exam based on the number of hours studying
  493. $$ p_\mathrm{pass} = \sigma(s) = \frac{1}{1+e^{-s}}, \quad s = w_1 t + w_0, \quad t = \text{\# hours}$$
  494. ::: columns
  495. :::: {.column width=40%}
  496. * Data set: \
  497. * preparation time $t$ in hours
  498. * passed / not passed (0/1)
  499. * Parameters need to be determined through numerical minimization
  500. * $w_0 = -4.0777$
  501. * $w_1 = 1.5046$
  502. \vspace{1.5ex}
  503. \footnotesize
  504. [\textcolor{gray}{03\_ml\_basics\_logistic\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_logistic_regression.ipynb)
  505. \normalsize
  506. ::::
  507. :::: {.column width=60%}
  508. ![](figures/03_ml_basics_logistic_regression.pdf){width=90%}
  509. ::::
  510. :::
  511. ## Example 1 - Probability of passing an exam (logistic regression) (2)
  512. \footnotesize
  513. \textcolor{gray}{Read data from file:}
  514. ```python
  515. # data: 1. hours studied, 2. passed (0/1)
  516. df = pd.read_csv(filename, engine='python', sep='\s+')
  517. x_tmp = df['hours_studied'].values
  518. x = np.reshape(x_tmp, (-1, 1))
  519. y = df['passed'].values
  520. ```
  521. \vfill
  522. \textcolor{gray}{Fit the data:}
  523. ```python
  524. from sklearn.linear_model import LogisticRegression
  525. clf = LogisticRegression(penalty='none', fit_intercept=True)
  526. clf.fit(x, y);
  527. ```
  528. \vfill
  529. \textcolor{gray}{Calculate predictions:}
  530. ```python
  531. hours_studied_tmp = np.linspace(0., 6., 1000)
  532. hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
  533. y_pred = clf.predict_proba(hours_studied)
  534. ```
  535. \normalsize
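As a cross-check (continuing the code above; the 4-hour value is just an example), the fitted parameters $w_0$ and $w_1$ quoted on the previous slide can be read off the classifier:

\footnotesize
```python
# intercept w0 and slope w1 of the linear score s = w1 * t + w0
print("w0 =", clf.intercept_[0])
print("w1 =", clf.coef_[0][0])

# predicted probability to pass after, e.g., 4 hours of preparation
print(clf.predict_proba([[4.0]])[0][1])
```
\normalsize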
  536. ## Precision and recall
  537. ::: columns
  538. :::: {.column width=50%}
  539. \textcolor{blue}{Precision:}\
  540. Fraction of correctly classified instances among all instances that obtain a certain class label.
  541. $$ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
  542. \begin{center}
  543. \textcolor{gray}{"purity"}
  544. \end{center}
  545. ::::
  546. :::: {.column width=50%}
  547. \textcolor{blue}{Recall:}\
  548. Fraction of positive instances that are correctly classified.
  549. \vspace{2.9ex}
  550. $$ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
  551. \begin{center}
  552. \textcolor{gray}{"efficiency"}
  553. \end{center}
  554. ::::
  555. :::
  556. \vfill
  557. \begin{center}
  558. \textcolor{gray}{TP: true positives, FP: false positives, FN: false negatives}
  559. \end{center}
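A small sketch with made-up labels and predictions, comparing the definitions above with the corresponding scikit-learn functions:

\footnotesize
```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # toy true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))
```
\normalsize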
  560. ## Example 2: Heart disease data set (logistic regression) (1)
  561. \scriptsize
  562. \textcolor{gray}{Read data:}
  563. ```python
  564. filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/data/heart.csv"
  565. df = pd.read_csv(filename)
  566. df
  567. ```
  568. \vfill
  569. ![](figures/heart_table.png){width=70%}
  570. \normalsize
  571. \vspace{1.5ex}
  572. \footnotesize
  573. [\textcolor{gray}{03\_ml\_basics\_log\_regr\_heart\_disease.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)
  574. \normalsize
  575. ## Example 2: Heart disease data set (logistic regression) (2)
  576. \footnotesize
  577. \textcolor{gray}{Define array of labels and feature vectors}
  578. ```python
  579. y = df['target'].values
  580. X = df[[col for col in df.columns if col!="target"]]
  581. ```
  582. \vfill
  583. \textcolor{gray}{Generate training and test data sets}
  584. ```python
  585. from sklearn.model_selection import train_test_split
  586. X_train, X_test, y_train, y_test = train_test_split(
  587.     X, y, test_size=0.5, shuffle=True)
  588. ```
  589. \vfill
  590. \textcolor{gray}{Fit the model}
  591. ```python
  592. from sklearn.linear_model import LogisticRegression
  593. lr = LogisticRegression(penalty='none',
  594.     fit_intercept=True, max_iter=1000, tol=1E-5)
  595. lr.fit(X_train, y_train)
  596. ```
  597. \normalsize
  598. ## Example 2: Heart disease data set (logistic regression) (3)
  599. \footnotesize
  600. \textcolor{gray}{Test predictions on test data set:}
  601. ```python
  602. from sklearn.metrics import classification_report
  603. y_pred_lr = lr.predict(X_test)
  604. print(classification_report(y_test, y_pred_lr))
  605. ```
  606. \vfill
  607. \textcolor{gray}{Output:}
  608. ```
  609.               precision    recall  f1-score   support
  610.            0       0.75      0.86      0.80        63
  611.            1       0.89      0.80      0.84        89
  612.     accuracy                           0.82       152
  613.    macro avg       0.82      0.83      0.82       152
  614. weighted avg       0.83      0.82      0.82       152
  615. ```
  616. ## Example 2: Heart disease data set (logistic regression) (4)
  617. \textcolor{gray}{Compare to another classifier using the \textit{receiver operating characteristic} (ROC) curve}
  618. \vfill
  619. \textcolor{gray}{Let's take the random forest classifier}
  620. \footnotesize
  621. ```python
  622. from sklearn.ensemble import RandomForestClassifier
  623. rf = RandomForestClassifier(max_depth=3)
  624. rf.fit(X_train, y_train)
  625. ```
  626. \normalsize
  627. \vfill
  628. \textcolor{gray}{Use \texttt{roc\_curve} from scikit-learn}
  629. \footnotesize
  630. ```python
  631. from sklearn.metrics import roc_curve
  632. y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities
  633. fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])
  634. y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities
  635. fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])
  636. ```
  637. \normalsize
  638. ## Example 2: Heart disease data set (logistic regression) (5)
  639. ::: columns
  640. :::: {.column width=50%}
  641. \scriptsize
  642. ```python
  643. plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
  644. plt.plot(tpr_rf, 1-fpr_rf, label="random forest")
  645. ```
  646. \vspace{5ex}
  647. \normalsize
  648. \textcolor{gray}{Classifiers can be compared with the \textit{area under curve} (AUC) score.}
  649. \scriptsize
  650. ```python
  651. from sklearn.metrics import roc_auc_score
  652. auc_lr = roc_auc_score(y_test, y_pred_prob_lr[:,1])
  653. auc_rf = roc_auc_score(y_test, y_pred_prob_rf[:,1])
  654. print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")
  655. ```
  656. \vspace{5ex}
  657. \normalsize
  658. \textcolor{gray}{This gives}
  659. \scriptsize
  660. ```
  661. AUC scores: 0.82, 0.83
  662. ```
  663. \normalsize
  664. ::::
  665. :::: {.column width=50%}
  666. \begin{figure}
  667. \centering
  668. \includegraphics[width=0.96\textwidth]{figures/03_ml_basics_log_regr_heart_disease.pdf}
  669. \end{figure}
  670. ::::
  671. :::
  672. ## Multinomial logistic regression: Softmax function
  673. In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
  674. \vfill
  675. Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
  676. \vfill
  677. A probability for class $i$ can be predicted with the softmax function:
  678. $$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
  679. The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.
  680. \vfill
  681. Multinomial logistic regression is also known as softmax regression.
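A small numpy sketch of the softmax function (the scores are arbitrary numbers):

\footnotesize
```python
import numpy as np

def softmax(s):
    # subtract the maximum score for numerical stability (does not change the result)
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # scores s_i for k = 3 classes
probs = softmax(scores)
print(probs, probs.sum())            # class probabilities, summing to 1
```
\normalsize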
  682. ## Example 3: Iris data set (softmax regression) (1)
  683. Iris flower data set
  684. * Introduced in 1936 in a paper by Ronald Fisher
  685. * Task: classify flowers
  686. * Three species: iris setosa, iris virginica and iris versicolor
  687. * Four features: petal width and length, sepal width/length, in centimeters
  688. ::: columns
  689. :::: {.column width=40%}
  690. \begin{figure}
  691. \centering
  692. \includegraphics[width=0.95\textwidth]{figures/iris_dataset.png}
  693. \end{figure}
  694. ::::
  695. :::: {.column width=60%}
  696. \vspace{2ex}
  697. \footnotesize
  698. [\textcolor{gray}{03\_ml\_basics\_iris\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_iris_softmax_regression.ipynb)
  699. \vspace{19ex}
  700. \scriptsize
  701. [https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
  702. [https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
  703. \normalsize
  704. ::::
  705. :::
  706. ## Example 3: Iris data set (softmax regression) (2)
  707. \textcolor{gray}{Get data set}
  708. \footnotesize
  709. ```python
  710. # import some data to play with
  711. # columns: Sepal Length, Sepal Width, Petal Length and Petal Width
  712. iris = datasets.load_iris()
  713. X = iris.data
  714. y = iris.target
  715. # split data into training and test data sets
  716. x_train, x_test, y_train, y_test = train_test_split(
  717.     X, y, test_size=0.5, random_state=42)
  718. ```
  719. \normalsize
  720. \vfill
  721. \textcolor{gray}{Softmax regression}
  722. \footnotesize
  723. ```python
  724. from sklearn.linear_model import LogisticRegression
  725. log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
  726. log_reg.fit(x_train, y_train);
  727. ```
  728. \normalsize
  729. ## Example 3 : Iris data set (softmax regression) (3)
  730. ::: columns
  731. :::: {.column width=70%}
  732. \textcolor{gray}{Accuracy and confusion matrix for different classifiers}
  733. \footnotesize
  734. ```python
  735. for clf in [log_reg, kn_neigh, fisher_ld]:
  736.     y_pred = clf.predict(x_test)
  737.     acc = accuracy_score(y_test, y_pred)
  738.     print(type(clf).__name__)
  739.     print(f"accuracy: {acc:0.2f}")
  740.     # confusion matrix:
  741.     # rows: true class, columns: predicted class
  742.     print(confusion_matrix(y_test, y_pred), "\n")
  743. ```
  744. \normalsize
  745. ::::
  746. :::: {.column width=30%}
  747. \footnotesize
  748. ```
  749. LogisticRegression
  750. accuracy: 0.96
  751. [[29 0 0]
  752. [ 0 23 0]
  753. [ 0 3 20]]
  754. KNeighborsClassifier
  755. accuracy: 0.95
  756. [[29 0 0]
  757. [ 0 23 0]
  758. [ 0 4 19]]
  759. LinearDiscriminantAnalysis
  760. accuracy: 0.99
  761. [[29 0 0]
  762. [ 0 23 0]
  763. [ 0 1 22]]
  764. ```
  765. \normalsize
  766. ::::
  767. :::
  768. ## General remarks on multi-variate analyses (MVAs)
  769. * MVA Methods
  770. * More effective than classic cut-based analyses
  771. * Take correlations of input variables into account
  772. \vfill
  773. * Important: find good input variables for MVA methods
  774. * Good separation power between S and B
  775. * No strong correlation among variables
  776. * No correlation with the parameters you try to measure in your signal sample!
  777. \vfill
  778. * Pre-processing
  779. * Apply obvious variable transformations and let MVA method do the rest
  780. * Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$ use $|\cos \theta|$ and not $\cos \theta$ as input variable
  781. * It is generally useful to bring all input variables to a similar numerical range
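A minimal sketch of the last point with scikit-learn's `StandardScaler` (the iris data serve as a stand-in for your own training and test samples):

\footnotesize
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# standardization: subtract the mean and divide by the std. deviation of each feature
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training sample only
X_test_scaled  = scaler.transform(X_test)        # apply the same transformation to the test sample
print(X_train_scaled.mean(axis=0).round(2), X_train_scaled.std(axis=0).round(2))
```
\normalsize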
  782. ## Example of feature transformation
  783. \begin{figure}
  784. \centering
  785. \includegraphics[width=0.95\textwidth]{figures/feature_transformation.png}
  786. \end{figure}
  787. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  788. ::: columns
  789. :::: {.column width=50%}
  790. \small
  791. * Cosmic gamma rays (30 GeV - 30 TeV).
  792. * Cherenkov light from air showers
  793. * Background: air showers caused by hadrons.
  794. \normalsize
  795. \begin{figure}
  796. \centering
  797. \includegraphics[width=0.85\textwidth]{figures/magic_photo_small.png}
  798. \end{figure}
  799. ::::
  800. :::: {.column width=50%}
  801. ![](figures/magic_sketch.png)
  802. ::::
  803. :::
  804. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  805. \begin{figure}
  806. \centering
  807. \includegraphics[width=0.75\textwidth]{figures/magic_shower_em_had_small.png}
  808. \end{figure}
  809. ::: columns
  810. :::: {.column width=50%}
  811. \begin{center}
  812. Gamma shower
  813. \end{center}
  814. ::::
  815. :::: {.column width=50%}
  816. \begin{center}
  817. Hadronic shower
  818. \end{center}
  819. ::::
  820. :::
  821. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  822. \begin{figure}
  823. \centering
  824. \includegraphics[width=0.95\textwidth]{figures/magic_shower_parameters.png}
  825. \end{figure}
  826. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  827. MAGIC data set \
  828. \tiny
  829. [\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope}](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope)
  830. \normalsize
  831. \scriptsize
  832. ```
  833. 1. fLength: continuous # major axis of ellipse [mm]
  834. 2. fWidth: continuous # minor axis of ellipse [mm]
  835. 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
  836. 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
  837. 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
  838. 6. fAsym: continuous # dist. from highest pixel to center, proj. onto major axis [mm]
  839. 7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
  840. 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
  841. 9. fAlpha: continuous # angle of major axis with vector to origin [deg]
  842. 10. fDist: continuous # distance from origin to center of ellipse [mm]
  843. 11. class: g,h # gamma (signal), hadron (background)
  844. g = gamma (signal): 12332
  845. h = hadron (background): 6688
  846. For technical reasons, the number of h events is underestimated.
  847. In the real data, the h class represents the majority of the events.
  848. ```
  849. \normalsize
  850. ## Exercise 1: Classification of air showers measured with the MAGIC telescope
  851. \small
  852. [\textcolor{gray}{03\_ml\_basics\_ex\_1\_magic.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
  853. \normalsize
  854. a) Create for each variable a figure with the distributions for gammas and hadrons overlaid.
  855. b) Create training and test data set. The test data should amount to 50% of the total data set.
  856. c) Define the logistic regressor and fit the training data
  857. d) Determine the model accuracy and the AUC score
  858. e) Plot the ROC curve (background rejection vs signal efficiency)
  859. ## Exercise 2: Hand-written digit recognition with logistic regression
  860. \small
  861. [\textcolor{gray}{03\_ml\_basics\_ex\_2\_mnist\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/03_ml_basics_ex_2_mnist_softmax_regression.ipynb)
  862. \normalsize
  863. a) Define logistic regressor from scikit-learn and fit data
  864. b) Use \texttt{classification\_report} from scikit-learn to determine precision and recall
  865. c) Read in a hand-written digit and classify it. Print the probabilities for each digit. Determine the digit with the highest probability.
  866. d) (Optional) Create your own hand-written digit with a program like GIMP and check what the classifier does
  867. \begin{figure}
  868. \centering
  869. \includegraphics[width=0.85\textwidth]{figures/handwritten_digits.png}
  870. \end{figure}
  871. Hint: You can install required packages on the jupyter hub server like so:
  872. \scriptsize
  873. ```
  874. !pip3 install --user pypng
  875. ```
  876. \normalsize
  877. ## Exercise 3: Data preprocessing
  878. a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
  879. b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case?