* Chemistry and material science: predict properties of new molecules / materials
* Many-body quantum matter: classification of quantum phases
\vspace{3ex}
\scriptsize [\textcolor{gray}{Machine learning and the physical sciences, arXiv:1903.10563}](https://arxiv.org/abs/1903.10563) \normalsize
## Some successes and unsolved problems in AI
::: columns
:::: {.column width=50%}
![](figures/ai_history.png){width=85%}
\tiny \textcolor{gray}{M. Wooldridge, The Road to Conscious Machines} \normalsize
::::
:::: {.column width=50%}
Impressive progress in certain fields:
\small
* Image recognition
* Speech recognition
* Recommendation systems
* Automated translation
* Analysis of medical data
\normalsize
\vfill
How can we profit from these developments in physics?
::::
:::
## The deep learning hype -- why now?
Artificial neural networks have been around for decades. Why did deep learning take off after 2012?
\vspace{5ex}
* Improved hardware -- graphics processing units (GPUs)
* Large data sets (e.g. images) distributed via the Internet
* Algorithmic advances
## Different modeling approaches
* Simple mathematical representation like linear regression. Favored by statisticians.
* Complex deterministic models based on scientific understanding of the physical process. Favored by physicists.
* Complex algorithms to make predictions that are derived from a huge number of past examples (“machine learning” as developed in the field of computer science). These are often black boxes.
* Regression models that claim to reach causal conclusions. Used by economists.
\tiny \textcolor{gray}{D. Spiegelhalter, The Art of Statistics – Learning from data} \normalsize
## Machine learning: The "hello world" problem
::: columns
:::: {.column width=45%}
Recognition of handwritten digits
* MNIST database (Modified National Institute of Standards and Technology database)
* 60,000 training images and 10,000 testing images labeled with correct answer
* 28 × 28 pixels per image
* Algorithms have reached "near-human performance"
\scriptsize [\textcolor{gray}{Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy, arXiv:1412.6572v1}](https://arxiv.org/abs/1412.6572v1) \normalsize
::::
:::
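For orientation (this snippet is my addition, not part of the original slides), MNIST can be downloaded and split in the conventional way with scikit-learn's `fetch_openml`; a minimal sketch:
\footnotesize
```python
from sklearn.datasets import fetch_openml

# download MNIST from OpenML: 70,000 images, each flattened to 784 = 28 x 28 pixel values
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target              # y contains the digit labels as strings

# conventional split: first 60,000 images for training, last 10,000 for testing
x_train, x_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```
\normalsize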
## Types of machine learning
::: columns
:::: {.column width=60%}
Reinforcement learning
\small
* The machine ("the agent") predicts a scalar reward given once in a while
* Weak feedback
\normalsize
::::
:::: {.column width=35%}
\tiny [\textcolor{gray}{LeCun 2018, Power And Limits of Deep Learning}](https://www.youtube.com/watch?v=0tEhw5t6rhc) \normalsize
![](figures/videogame.png)
::::
:::
\vfill
::: columns
:::: {.column width=60%}
\vspace{1em}
Supervised learning
\small
* The machine predicts a category based on labeled training data
* Medium feedback
\normalsize
::::
:::: {.column width=35%}
![](figures/supervised_learning_car_plane.png)
::::
:::
\vfill
::: columns
:::: {.column width=60%}
\vspace{1em}
Unsupervised learning
\small
* Describe/find hidden structure in "unlabeled" data
* Cluster data in different sub-groups with similar properties
\normalsize
::::
:::: {.column width=35%}
![](figures/anomaly_detection.png)
::::
:::
## Books on machine learning (1)
::: columns
:::: {.column width=85%}
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, \textit{Deep Learning}, free online: [http://www.deeplearningbook.org/](http://www.deeplearningbook.org/)
\vspace{8ex}
Kevin Murphy, \textit{Probabilistic Machine Learning: An Introduction}, [draft pdf version](https://probml.github.io/pml-book/)
\vspace{7ex}
Aurélien Géron, \textit{Hands-On Machine Learning with Scikit-Learn and TensorFlow}
::::
:::
* Supervised Machine Learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.
* Each event is characterized by $n$ observables: $\vec x = (x_1, x_2, ..., x_n) \;$ \textcolor{gray}{"feature vector"}
* Each event carries a label $y$ from a set $Y$
    * $Y$ = finite set of class labels $\quad \to \quad$ \textcolor{red}{classification}
    * $Y$ = real numbers $\quad \to \quad$ \textcolor{red}{regression}
\vfill
\textcolor{gray}{"All the impressive achievements of deep learning amount to just curve fitting" \\[0.5cm]}
\footnotesize
\textcolor{gray}{J. Pearl, Turing Award Winner 2011\\}
\tiny
[\color{gray}{To Build Truly Intelligent Machines, Teach Them Cause and Effect, Quantamagazine}](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/)
\normalsize
## Classification: Learning decision boundaries
\begin{figure}
\centering
\includegraphics{figures/decision_boundaries.png}
\end{figure}
## Supervised learning: Training, validation, and test sample
* Decision boundary fixed with \textcolor{blue}{training sample}
* Performance on training sample becomes better with more iterations
* Danger of overtraining: Statistical fluctuations of the training sample will be learnt
* \textcolor{blue}{Validation sample} = independent labeled data set not used for training $\rightarrow$ check for overtraining
* Sign of overtraining: performance on validation sample becomes worse $\rightarrow$ Stop training when signs of overtraining are observed (early stopping)
* Performance: apply classifier to independent \textcolor{blue}{test sample}
* Often: test sample = validation sample (only small bias)
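As an illustration (my own sketch, not from the slides), a three-way split into training, validation, and test samples can be made with scikit-learn's `train_test_split`, here for some feature matrix `X` and label array `y`:
\footnotesize
```python
from sklearn.model_selection import train_test_split

# first split off an independent test sample ...
x_tmp, x_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# ... then split the remainder into training and validation samples
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=1/3, random_state=42)
# resulting fractions: 50% training, 25% validation, 25% test
```
\normalsize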
## Supervised learning: Cross validation
Rule of thumb if training data are plentiful:
::: columns
:::: {.column width=60%}
* Training sample: 50%
* Validation sample: 25%
* Test sample: 25%
\vspace{2ex}
Cross validation (efficient use of scarce training data; see the sketch below)
* Split the full sample $T$ into $k$ independent subsets $T_k$
* Train on $T \setminus T_k$ resulting in $k$ different classifiers
* For each training event there is one classifier that didn't use this event for training
* Validation results are then combined
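A minimal sketch (my addition) of $k$-fold cross validation with scikit-learn; the classifier and the data arrays `X`, `y` are placeholders:
\footnotesize
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
# k = 5: train on 4/5 of the sample, validate on the remaining 1/5, repeat 5 times
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +- {scores.std():.3f}")
```
\normalsize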
::::
:::: {.column width=40%}
\textcolor{gray}{Often test sample = validation sample (bias is rather small)}
::::
:::
## Neyman-Pearson lemma
The Neyman-Pearson lemma states that the likelihood ratio
$$ t(\vec x) = \frac{p(\vec x|S)}{p(\vec x|B)} $$
is an optimal test statistic, i.e., it provides the highest "signal efficiency" $1-\beta$ for a given "background efficiency" $\alpha$. Accept the hypothesis if $t(\vec x) > c$.
\vfill
Problem: the underlying pdf's are almost never known explicitly.
\vfill
Two approaches
1. Estimate signal and background pdf's and construct test statistic based on Neyman-Pearson lemma
2. Decision boundaries determined directly without approximating the pdf's (linear discriminants, decision trees, neural networks, ...)
\textcolor{gray}{Approach 1, binned version: approximate the pdfs by histograms of the training data, $N(x,y|S)$ and $N(x,y|B)$}
$M$ bins per variable in $d$ dimensions: $M^d$ cells $\to$ hard to generate enough training data (often not practical for $d > 1$)
In general in machine learning, problems related to a large number of dimensions of the feature space are referred to as the \textcolor{red}{"curse of dimensionality"}
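To make the binned version of approach 1 concrete, here is a small sketch (my own illustration; the 2D training arrays `x_s, y_s` for signal and `x_b, y_b` for background are assumed to exist):
\footnotesize
```python
import numpy as np

edges = np.linspace(-5.0, 5.0, 26)           # M = 25 bins per variable -> 25^2 cells in 2D

# histogrammed training data: estimates of N(x, y | S) and N(x, y | B)
N_S, _, _ = np.histogram2d(x_s, y_s, bins=[edges, edges], density=True)
N_B, _, _ = np.histogram2d(x_b, y_b, bins=[edges, edges], density=True)

def likelihood_ratio(x, y, eps=1e-9):
    """Binned estimate of p(x, y | S) / p(x, y | B) for a point inside the histogram range."""
    i = np.digitize(x, edges) - 1
    j = np.digitize(y, edges) - 1
    return N_S[i, j] / (N_B[i, j] + eps)
```
\normalsize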
## Na$\text{\"i}$ve Bayesian Classifier (also called "Projected Likelihood Classification")
Application of the Neyman-Pearson lemma (ignoring correlations between the $x_i$):
$$ t(\vec x) = \prod_{i=1}^{n} \frac{p_i(x_i|S)}{p_i(x_i|B)} \qquad \text{with the marginal pdfs } p_i(x_i) $$
Performance is not optimal if the true pdf does not factorize into the product of the marginal pdfs.
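In practice the one-dimensional pdfs can be built from histograms of each feature, or one can use scikit-learn's `GaussianNB`, which assumes Gaussian marginal pdfs; a hedged sketch (the data array names are assumptions):
\footnotesize
```python
from sklearn.naive_bayes import GaussianNB

# naive Bayes with Gaussian 1D pdfs per feature; correlations between features are ignored
nb = GaussianNB()
nb.fit(x_train, y_train)
p_signal = nb.predict_proba(x_test)[:, 1]    # P(S | x), assuming class 1 = signal
```
\normalsize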
## k-Nearest Neighbor Method (1)
::: columns
:::: {.column width=60%}
$k$-NN classifier:
* Estimates probability density around the input vector
* $p(\vec x|S)$ and $p(\vec x|B)$ are approximated by the number of signal and background events in the training sample that lie in a small volume around the point $\vec x$
$$ R^2 = (\vec x_1 - \vec x_2)^\intercal\, V^{-1} (\vec x_1 - \vec x_2), \qquad V = \text{covariance matrix}, \; R = \text{Mahalanobis distance} $$
::::
:::: {.column width=40%}
![](figures/knn.png)
::::
:::
\vfill
The $k$-NN classifier has best performance when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.
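A minimal scikit-learn sketch (my addition; training/test array names as before):
\footnotesize
```python
from sklearn.neighbors import KNeighborsClassifier

# classify each point according to its k = 10 nearest neighbors in the training sample
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
p = knn.predict_proba(x_test)   # fraction of the k neighbors belonging to each class
```
\normalsize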
## Fisher Linear Discriminant
A linear discriminant is simple. It can still be optimal if the amount of training data is limited.
Ansatz for test statistic: $$ y(\vec x) = \sum_{i=1}^n w_i x_i = \vec w^\intercal \vec x $$
Choose parameters $w_i$ so that separation between signal and background distribution is maximum.
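Scikit-learn's `LinearDiscriminantAnalysis` provides such a Fisher-type linear discriminant; a short sketch (my addition, with assumed training/test arrays):
\footnotesize
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

fisher_ld = LinearDiscriminantAnalysis()
fisher_ld.fit(x_train, y_train)
print(fisher_ld.coef_)             # learned weight vector(s) w
y_pred = fisher_ld.predict(x_test)
```
\normalsize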
\small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
## Logistic regression (1)
* Define a function that translates the score $s = \vec w^\intercal \vec x$ into a quantity that has the properties of a probability:
$$ \sigma(s) = \frac{1}{1+e^{-s}} $$
* We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.
## Logistic regression (2)
* Consider feature vector $\vec x$. For a given set of weights $\vec w$ the model predicts
* a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x)$ for outcome $y=1$
* a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x)$ for outcome $y=0$
* The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$ and the observations $y_i$ are fixed).
* Likelihood for the full data sample ($n$ observations):
$$ L(\vec w) = \prod_{i=1}^{n} p(y_i|\vec w) = \prod_{i=1}^{n} \sigma(\vec w^\intercal \vec x_i)^{y_i} \left(1 - \sigma(\vec w^\intercal \vec x_i)\right)^{1-y_i} $$
* Maximizing $L(\vec w)$ is equivalent to minimizing the negative log-likelihood:
$$ -\ln L(\vec w) = -\sum_{i=1}^{n} \left[ y_i \ln \sigma(\vec w^\intercal \vec x_i) + (1-y_i) \ln\left(1 - \sigma(\vec w^\intercal \vec x_i)\right) \right] $$
* This is nothing else but the cross-entropy loss function
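A small numpy sketch of the sigmoid and the resulting cross-entropy loss (my own illustration of the formulas above):
\footnotesize
```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_loss(w, X, y, eps=1e-12):
    """Negative log-likelihood of logistic regression; X: (n, d), y: labels in {0, 1}."""
    p = sigmoid(X @ w)                                   # predicted probability for y = 1
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```
\normalsize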
## scikit-learn
::: columns
:::: {.column width=70%}
* Free software machine learning library for Python
* Initial release: 2007
* Features various classification, regression, and clustering algorithms, including k-nearest neighbors, multi-layer perceptrons, support vector machines, random forests, gradient boosting, and k-means
* Scikit-learn is one of the most popular machine learning libraries on GitHub
::::
:::
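All estimators share the same `fit` / `predict` / `score` interface; a self-contained example (the choice of toy dataset and classifier is mine):
\footnotesize
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)              # train ...
print(clf.score(x_test, y_test))       # ... and evaluate the accuracy on the test sample
```
\normalsize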
## Multinomial logistic regression: Softmax function
In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
\vfill
Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
\vfill
A probability for class $i$ can be predicted with the softmax function:
$$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.
\vfill
Multinomial logistic regression is also known as softmax regression.
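For illustration (not part of the slides), the softmax function in numpy, with the usual shift by the maximum score for numerical stability:
\footnotesize
```python
import numpy as np

def softmax(s):
    """Softmax of a score vector s = (s_1, ..., s_k)."""
    e = np.exp(s - np.max(s))       # subtracting max(s) avoids overflow, result is unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> approx. [0.66 0.24 0.10], sums to 1
```
\normalsize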
## Example 3: Iris data set (softmax regression) (1)
Iris flower data set
* Introduced in 1936 in a paper by Ronald Fisher
* Task: classify flowers
* Three species: iris setosa, iris virginica and iris versicolor
* Four features: petal length and width, sepal length and width (in centimeters)
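The classifiers compared on the following slides (`log_reg`, `kn_neigh`, `fisher_ld`) are set up in the accompanying notebook; a hedged sketch of how this might look (the exact settings and random seed in the notebook may differ):
\footnotesize
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

log_reg = LogisticRegression(max_iter=1000).fit(x_train, y_train)   # softmax regression
kn_neigh = KNeighborsClassifier().fit(x_train, y_train)
fisher_ld = LinearDiscriminantAnalysis().fit(x_train, y_train)
```
\normalsize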
## Example 3 : Iris data set (softmax regression) (3)
::: columns
:::: {.column width=70%}
\textcolor{gray}{Accuracy and confusion matrix for different classifiers}
\footnotesize
```python
from sklearn.metrics import accuracy_score, confusion_matrix

for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")
    # confusion matrix:
    # rows: true class, columns: predicted class
    print(confusion_matrix(y_test, y_pred), "\n")
```
\normalsize
::::
:::: {.column width=30%}
\footnotesize
```
LogisticRegression
accuracy: 0.96
[[29 0 0]
[ 0 23 0]
[ 0 3 20]]
KNeighborsClassifier
accuracy: 0.95
[[29 0 0]
[ 0 23 0]
[ 0 4 19]]
LinearDiscriminantAnalysis
accuracy: 0.99
[[29 0 0]
[ 0 23 0]
[ 0 1 22]]
```
\normalsize
::::
:::
## General remarks on multi-variate analyses (MVAs)
* MVA Methods
* More effective than classic cut-based analyses
* Take correlations of input variables into account
\vfill
* Important: find good input variables for MVA methods
* Good separation power between S and B
* No strong correlation among variables
* No correlation with the parameters you try to measure in your signal sample!
\vfill
* Pre-processing
* Apply obvious variable transformations and let MVA method do the rest
* Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$ use $|\cos \theta|$ and not $\cos \theta$ as input variable
* It is generally useful to bring all input variables to a similar numerical range
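The last point is typically done with `sklearn.preprocessing.StandardScaler`; a sketch I added (anticipating Exercise 3 below; the array names are assumptions):
\footnotesize
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# determine mean and standard deviation on the training sample only ...
x_train_scaled = scaler.fit_transform(x_train)
# ... and apply the same transformation to the test sample
x_test_scaled = scaler.transform(x_test)
```
\normalsize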
Hint: You can install required packages on the jupyter hub server like so:
\scriptsize
```
!pip3 install --user pypng
```
\normalsize
## Exercise 3: Data preprocessing
a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case?