initial set of slides corresponding to 2022 version

Klaus Reygers 2023-03-10 14:19:06 +01:00
parent 702e034438
commit 46fae93569
104 changed files with 3988 additions and 0 deletions

1
slides/.gitignore vendored Normal file

@ -0,0 +1 @@
.DS_Store

10
slides/Makefile Normal file

@ -0,0 +1,10 @@
# make creates pdf files of all newly edited .md files
SRCS := $(wildcard *.md)
PDF := $(SRCS:%.md=%.pdf)
OPT := --pdf-engine=xelatex --variable mainfont="Helvetica" --variable sansfont="Helvetica" -t beamer -s -fmarkdown-implicit_figures --template=template.beamer --highlight-style=kate
all: ${PDF}
%.pdf: %.md
	pandoc $(OPT) --output=$@ $<

2
slides/README.md Normal file

@ -0,0 +1,2 @@
Pandoc slides example following the style of [Stefan Wunsch's CERN IML workshop presentation](https://github.com/stwunsch/iml_keras_workshop) on [keras](https://keras.io/) (see slides folder)

6
slides/copy_slides.sh Executable file

@ -0,0 +1,6 @@
# slides (do chgrp machlearn <file> later)
# scp CIPpoolAccess.PDF reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 03_ml_basics.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 04_decision_trees.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
scp 05_neural_networks.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/

347
slides/decision_trees.md Normal file

@ -0,0 +1,347 @@
---
title: |
| Introduction to Data Analysis and Machine Learning in Physics:
| 4. Decision Trees
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---
## Exercises
* Exercise 1: Compare different decision tree classifiers
* [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
* [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values
## Decision trees
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}
\begin{center}
Leaf nodes classify events as either signal or background
\end{center}
## Decision trees: Rectangular volumes in feature space
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}
* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?
## Finding optimal cuts
Separation between signal and background is often measured with the Gini index (or Gini impurity):
$$ G = p (1-p) $$
Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}
\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
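The split criterion above can be sketched in a few lines of plain Python. The function names `gini` and `split_gain` and the toy events are illustrative assumptions, not part of the lecture code.

\footnotesize
```python
def gini(events):
    """Return (W, G) for a list of (is_signal, weight): summed weight and G = p(1 - p)."""
    w_total = sum(w for _, w in events)
    w_signal = sum(w for is_signal, w in events if is_signal)
    p = w_signal / w_total  # purity
    return w_total, p * (1.0 - p)

def split_gain(parent, left, right):
    """Delta = W_A G_A - W_B G_B - W_C G_C for a split of A into B and C."""
    w_a, g_a = gini(parent)
    w_b, g_b = gini(left)
    w_c, g_c = gini(right)
    return w_a * g_a - w_b * g_b - w_c * g_c

# toy sample of four unit-weight events: (is_signal, weight)
A = [(True, 1.0), (True, 1.0), (False, 1.0), (False, 1.0)]
perfect = split_gain(A, A[:2], A[2:])                # both children are pure
useless = split_gain(A, [A[0], A[2]], [A[1], A[3]])  # children as mixed as the parent
print(perfect, useless)
```
\normalsize

A tree-growing algorithm would scan candidate cuts and keep the one with the largest $\Delta$.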
## Gini impurity and other purity measures
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}
## Decision tree pruning
::: columns
:::: {.column width=50%}
When to stop growing a tree?
* When all nodes are essentially pure?
* Well, that's overfitting!
\vspace{3ex}
Pruning
* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::
## Single decision trees: Pros and cons
\textcolor{green}{Pros:}
* Requires little data preparation (unlike neural networks)
* Can use continuous and categorical inputs
\vfill
\textcolor{red}{Cons:}
* Danger of overfitting training data
* Sensitive to fluctuations in the training data
* Hard to find global optimum
* When to stop splitting?
## Ensemble methods: Combine weak learners
::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
* Sample training data (with replacement) and train a separate model on each of the derived training sets
* Classify example with majority vote, or compute average output from each tree as model output
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
* Train $N$ models in sequence, giving more weight to examples not correctly classified by previous model
* Take weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::
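The two combination rules above differ only in the weights. A minimal sketch with made-up tree outputs and made-up scores $\alpha_i$ (all names are illustrative assumptions):

\footnotesize
```python
def bagging_output(tree_outputs):
    """Plain average of the individual tree outputs y_i(x)."""
    return sum(tree_outputs) / len(tree_outputs)

def boosting_output(tree_outputs, alphas):
    """Average of the tree outputs weighted with the scores alpha_i."""
    return sum(a * y for a, y in zip(alphas, tree_outputs)) / sum(alphas)

outputs = [1.0, -1.0, 1.0]   # three trees vote signal, background, signal
alphas = [0.5, 2.0, 0.5]     # the second tree carries the largest weight
bagged = bagging_output(outputs)
boosted = boosting_output(outputs, alphas)
print(bagged, boosted)
```
\normalsize

With these numbers the unweighted average is positive while the weighted average is negative, i.e. a single tree with a large score can outvote the majority.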
## Random forests
* "One of the most widely used and versatile algorithms in data science and machine learning"
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select random example subset
\vfill
* Train a tree, but only use random subset of features at each split
* this reduces the correlation between different trees
* makes the decision more robust to missing data
## Boosted decision trees: Idea
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}
## AdaBoost (short for Adaptive Boosting)
Initial training sample
\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$ & event weights
\end{tabular}
\end{center}
with equal weights normalized as
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$
Train first classifier $f_1$:
\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}
## AdaBoost: Updating event weights
Define training sample $k+1$ from training sample $k$ by updating weights:
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k+1)} = 1$$}
\normalsize
Weight is increased if event was misclassified by the previous classifier
$\to$ "Next classifier should pay more attention to misclassified events"
\vfill
At each step the classifier $f_k$ minimizes error rate:
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
## AdaBoost: Assigning the classifier score
Assign score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
\vfill
Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
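The AdaBoost steps (initial weights, error rate $\varepsilon_k$, score $\alpha_k$, weight update, combined classifier) can be sketched in plain Python. As an assumption for illustration, the weak learners are one-dimensional decision stumps returning $\pm 1$; the function names and the toy data are hypothetical.

\footnotesize
```python
import math

def best_stump(x, y, w):
    """Return (error, threshold, sign) of the stump with the smallest weighted error.
    The stump is f(x) = sign if x > threshold else -sign."""
    best = None
    for thr in sorted(set(x)):
        for sign in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if yi * (sign if xi > thr else -sign) <= 0)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(x, y, n_rounds):
    n = len(x)
    w = [1.0 / n] * n                            # equal weights, sum = 1
    ensemble = []                                # list of (alpha_k, threshold, sign)
    for _ in range(n_rounds):
        eps, thr, sign = best_stump(x, y, w)
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = math.log((1 - eps) / eps)
        f = [sign if xi > thr else -sign for xi in x]
        w = [wi * math.exp(-alpha * fi * yi / 2) for wi, fi, yi in zip(w, f, y)]
        z = sum(w)                               # normalization factor Z_k
        w = [wi / z for wi in w]
        ensemble.append((alpha, thr, sign))
    return ensemble

def predict(ensemble, xi):
    """Combined classifier f(x) = sum_k alpha_k f_k(x); f > 0 means signal."""
    return sum(a * (s if xi > t else -s) for a, t, s in ensemble)

# toy data: signal sits between two background regions,
# so no single stump can classify all events correctly
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1, -1, -1, +1, +1, +1, -1, -1, -1]
model = adaboost(x, y, n_rounds=3)
n_wrong = sum(1 for xi, yi in zip(x, y) if yi * predict(model, xi) <= 0)
print(n_wrong)
```
\normalsize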
## Gradient boosting
Basic idea:
* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on
\vfill
In slightly more detail:
* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$
\color{black}
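The residual-fitting loop can be sketched as follows. As assumptions for illustration, the weak learner $h_m$ is a one-dimensional regression stump, the starting model is $F_0 = 0$, and no learning rate is applied; all names are hypothetical.

\footnotesize
```python
def fit_stump(x, r):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(x))[:-1]:              # keep both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    return best[1:]

def gradient_boost(x, y, n_trees):
    F = [0.0] * len(x)                           # F_0 = 0 (assumption)
    trees, mse_history = [], []
    for _ in range(n_trees):
        residuals = [yi - Fi for yi, Fi in zip(y, F)]
        thr, ml, mr = fit_stump(x, residuals)    # h_m fitted to (x_i, y_i - F_m(x_i))
        F = [Fi + (ml if xi <= thr else mr) for xi, Fi in zip(x, F)]
        trees.append((thr, ml, mr))
        mse_history.append(sum((yi - Fi) ** 2 for yi, Fi in zip(y, F)) / len(y))
    return trees, mse_history

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]
trees, mse_history = gradient_boost(x, y, n_trees=5)
print(mse_history)
```
\normalsize

The mean squared error on the training data does not increase from one iteration to the next, because each stump is a least-squares fit to the current residuals.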
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize
\vfill
Superconductivity data set:
Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize
\vfill
From the abstract:
We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor's chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model's predictive accuracy.
\vfill
\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)
::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_test, y_test are assumed to be defined earlier (see notebook)
XGBreg = xgb.sklearn.XGBRegressor()
XGBreg.fit(X_train, y_train)
y_pred = XGBreg.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```
\textcolor{gray}{This gives:}
`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::
## Exercise 1: Compare different decision tree classifiers
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
\vspace{5ex}
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
\vspace{2ex}
Is there a classifier that clearly performs best?
## Exercise 2: Apply XGBoost classifier to MAGIC data set
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize
\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize
\small
a) Plot predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize
## Exercise 3: Feature importance
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize
\vspace{3ex}
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
## Exercise 4: Interpret a classifier with SHAP values
SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept used in cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
\vfill
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.
a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?
c) Do the same for the superconductivity data set. What are the three most important features?

Binary files not shown. Named files (size where given):
slides/figures/L1vsL2.pdf
slides/figures/ai_ml_dl.pdf
slides/figures/ann.png (186 KiB)
slides/figures/bdt.png (214 KiB)
slides/figures/cnn.png (114 KiB)
slides/figures/deepl.png (511 KiB)
slides/figures/dnn.png (519 KiB)
slides/figures/dropout.png (424 KiB)
slides/figures/fisher.png (44 KiB)
slides/figures/gan.png (116 KiB)
slides/figures/imagenet.png (1.5 MiB)
slides/figures/keras.png (2.5 KiB)
slides/figures/knn.png (116 KiB)
slides/figures/loss_fct.png (40 KiB)
slides/figures/mlp.png (204 KiB)
slides/figures/mnist.png (216 KiB)
slides/figures/mva.png (788 KiB)
slides/figures/mva_nn.png (135 KiB)
slides/figures/neuron.png (306 KiB)
slides/figures/relu.png (91 KiB)
slides/figures/sigmoid.png (104 KiB)
slides/figures/xor.png (44 KiB)
Further binary files are listed without a file name.

563
slides/fit_intro.md Normal file

@ -0,0 +1,563 @@
---
title: |
| Introduction to Data Analysis and Machine Learning in Physics:
| 2. Data modeling and fitting
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---
## Data modeling and fitting - introduction
Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to draw inferences about the
underlying processes.
* There are 2 approaches to statistical data modeling
* Hypothesis testing: is our data compatible with a certain model?
* Determination of model parameters: use the data to determine the parameters
of a (theoretical) model
* For the determination of model parameters
* Analysis of data distributions $\rightarrow$ mean, variance,
median, FWHM, .... \newline
allows for an approximate determination of model parameters
* Data fitting with the least squares method $\rightarrow$ an iterative
process which minimizes the deviation of a model described by parameters
from data. This determines the optimal values and uncertainties
of the parameters.
* Maximum likelihood fitting $\rightarrow$ find the set of model parameters
which most likely describes the data by maximizing the likelihood
function.
The parameter determination by minimization is an integral part of machine
learning approaches, where a system learns patterns and predicts
related ones. This is the focus in the upcoming days.
## Data modeling and fitting - introduction
Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to draw inferences about the
underlying processes.
* There are 2 approaches to statistical data modeling
* Hypothesis testing: is our data compatible with a certain model?
* Determination of model parameters: use the data to determine the parameters
of a (theoretical) model
* For the determination of model parameters
* Analysis of data distributions $\rightarrow$ mean, variance,
median, FWHM, .... \newline
allows for an approximate determination of model parameters
\setbeamertemplate{itemize subitem}{\color{red}\tiny$\blacksquare$}
* \textcolor{blue}{Data fitting with the least squares method
$\rightarrow$ an iterative
process which minimizes the deviation of a model described by parameters
from data. This determines the optimal values and uncertainties
of the parameters.}
\setbeamertemplate{itemize subitem}{\color{blue}\tiny$\blacktriangleright$}
* Maximum likelihood fitting $\rightarrow$ find the set of model parameters
which most likely describes the data by maximizing the likelihood
function.
The parameter determination by minimization is an integral part of machine
learning approaches, where a system learns patterns and predicts
related ones. This is the focus in the upcoming days.
## Least Square (LS) Method (1)
The method determines the \textcolor{blue}{optimal parameters of functions
fitted to Gaussian distributed measurements}.
Let's consider a sample of $n$ measurements $y_{i}$ and a parametrized
description of the measurement $\eta_{i} = f(x_{i} | \theta)$
with a parameter set $\theta = \theta_{1}, \theta_{2} ,.... \theta_{k}$,
independent values $x_{i}$ and measurement errors $\sigma_{i}$.
The parameter set should be determined such that
\begin{equation*}
\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} = \sum \limits_{i=1}^{n} \frac{(y_i- f(x_i|\theta))^2}{\sigma_i^2} \longrightarrow \, \mathrm{minimal} }
\end{equation*}
In case of correlated measurements the covariance matrix of the $y_{i}$ has to
be taken into account. This is accomplished by defining a weight matrix from
the inverse of the covariance matrix of the input data. A decorrelation of the input data
should be considered.
\vspace{0.2cm}
For Gaussian errors, the minimum value of $S$ follows a $\chi^{2}$-distribution with $(n-k)$ degrees of freedom.
## Least Square (LS) Method (2)
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Example LS-method
\vspace{0.2cm}
Often the fit function $f(x | \theta)$ is linear in
$\theta = \theta_{1}, \theta_{2} ,.... \theta_{k}$
\vspace{0.2cm}
$f(x | \theta) = \theta_{1} f_{1}(x) + .... + \theta_{k} f_{k}(x)$
\vspace{0.2cm}
If the model is a straight line and our parameters are $\theta_{1}$ and
$\theta_{2}$ $(f_{1}(x) = 1,$ $f_{2}(x) = x)$ we have
$f(x | \theta) = \theta_{1} + \theta_{2} x$
\vspace{0.2cm}
The LS equation is
\vspace{0.2cm}
$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum
\limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} - x_{i}
\theta_{2})^2}{\sigma_i^2 }}$ \hspace{0.4cm} and with
\vspace{0.2cm}
$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{-2
(y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{-2 x_i (y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$
\vspace{0.2cm}
the parameters $\theta_{1}$ and $\theta_{2}$ can be determined.
\vspace{0.2cm}
\textcolor{olive}{In case of linear fit functions solutions can be found by matrix inversion}
\vfill
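For the straight line, the two conditions $\partial S / \partial \theta_1 = 0$ and $\partial S / \partial \theta_2 = 0$ form a $2 \times 2$ linear system that can be solved by hand. A minimal sketch in plain Python (the function name `fit_line` and the toy data are illustrative assumptions):

\footnotesize
```python
def fit_line(x, y, sigma):
    """Weighted LS fit of f(x|theta) = theta1 + theta2 * x.
    Returns theta1, theta2 and the 2x2 covariance matrix of the parameters."""
    w = [1.0 / s ** 2 for s in sigma]
    s1 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s1 * sxx - sx ** 2
    theta1 = (sxx * sy - sx * sxy) / det
    theta2 = (s1 * sxy - sx * sy) / det
    # inverse of the matrix [[s1, sx], [sx, sxx]]
    cov = [[sxx / det, -sx / det], [-sx / det, s1 / det]]
    return theta1, theta2, cov

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]      # toy data lying exactly on y = 1 + 2x
sigma = [0.5] * 4
theta1, theta2, cov = fit_line(x, y, sigma)
print(theta1, theta2, cov)
```
\normalsize

The off-diagonal element of `cov` is negative here: for positive $x_i$, intercept and slope are anti-correlated.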
## Least Square (LS) Method (3)
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Use of a nonlinear fit function $f(x | \theta)$ like \hspace{0.4cm}
$f(x | \theta) = \theta_{1} \cdot e^{-\theta_{2} x}$
\vspace{0.2cm}
results in the LS equation
\vspace{0.2cm}
$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum \limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} \cdot e^{-\theta_{2} x_{i}})^2}{\sigma_i^2 }}$ \hspace{0.4cm}
\vspace{0.2cm}
which we have to minimize
\vspace{0.2cm}
$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{ 2 e^{-2 \theta_2 x_i} ( \theta_1 - y_i e^{\theta_2 x_i} )} {\sigma_i^2 } = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{ 2 \theta_1 x_i e^{-2 \theta_2 x_i} (y_i e^{\theta_2 x_i} - \theta_1)} {\sigma_i^2 } = 0$
\vspace{0.4cm}
In a nonlinear system, the LS ansatz leads to derivatives which are
functions of the independent variable and the parameters $\color{red}\rightarrow$ \textcolor{olive}{no closed-form solutions}
\vspace{0.4cm}
In general, the gradient equations have no closed-form solutions.
Several approaches, partly based on approximations, find a minimum
numerically: the Gauss-Newton algorithm, the Levenberg-Marquardt
algorithm, gradient descent methods and direct search methods.
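As an illustration of such a numerical method, here is a minimal sketch of undamped Gauss-Newton iterations for $f(x | \theta) = \theta_{1} \cdot e^{-\theta_{2} x}$. The function name, the starting values and the noise-free toy data are assumptions; a production minimizer adds step-size control and convergence checks.

\footnotesize
```python
import math

def gauss_newton_exp(x, y, sigma, theta1, theta2, n_iter=20):
    """Undamped Gauss-Newton iterations for f(x|theta) = theta1 * exp(-theta2 * x)."""
    for _ in range(n_iter):
        a11 = a12 = a22 = b1 = b2 = 0.0
        for xi, yi, si in zip(x, y, sigma):
            e = math.exp(-theta2 * xi)
            j1 = e                      # df/dtheta1
            j2 = -theta1 * xi * e       # df/dtheta2
            r = yi - theta1 * e         # residual y_i - f(x_i|theta)
            w = 1.0 / si ** 2
            a11 += w * j1 * j1
            a12 += w * j1 * j2
            a22 += w * j2 * j2
            b1 += w * j1 * r
            b2 += w * j2 * r
        # solve the linearized 2x2 system (J^T W J) delta = J^T W r
        det = a11 * a22 - a12 * a12
        theta1 += (a22 * b1 - a12 * b2) / det
        theta2 += (a11 * b2 - a12 * b1) / det
    return theta1, theta2

x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [2.0 * math.exp(-0.7 * xi) for xi in x]   # noise-free toy data, theta = (2, 0.7)
sigma = [0.1] * len(x)
t1, t2 = gauss_newton_exp(x, y, sigma, theta1=1.5, theta2=0.5)
print(t1, t2)
```
\normalsize

Each iteration replaces the nonlinear problem by a linear LS problem around the current parameter values, which is why a reasonable starting point matters.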
## Minuit - a program package for minimization (1)
In general, data fitting and also the training of machine learning algorithms lead
to a function minimization problem. Between
1975 and 1980, F. James (CERN) developed
a FORTRAN-based package, [\textcolor{violet}{MINUIT}](http://seal.web.cern.ch/seal/documents/minuit/mntutorial.pdf), which is a framework to handle
multiparameter minimization and compute the best-fit parameter values and
uncertainties, including correlations between the parameters.
\vspace{0.2cm}
The user provides a minimization function
$F(X,P)$ with the parameter space $P=(p_1,....p_k)$ and
variable space $X$ (also multi-dimensional). There is an interface via
functions which influence
the minimization process. MINUIT provides
[\textcolor{violet}{error calculations}](http://seal.web.cern.ch/seal/documents/minuit/mnerror.pdf) including correlations for the parameter space by evaluating the shape of the function in some neighbourhood of the minimum.
\vspace{0.2cm}
The package
now has a new object-oriented implementation, the [\textcolor{violet}{Minuit2 library}](https://root.cern.ch/doc/master/Minuit2Page.html), written
in C++.
\vspace{0.2cm}
During the minimization $F(X,P)$ is evaluated for various $X$. For the
choice of $P=(p_1,....p_k)$ different methods are used.
## Minuit - a program package for minimization (2)
\vspace{0.4cm}
\textcolor{olive}{SEEK}: Search for the minimum with Monte Carlo methods, mostly used at the start
of the minimization with unknown starting values. It is not a converging
algorithm.
\vspace{0.2cm}
\textcolor{olive}{SIMPLX}:
Uses the simplex method of Nelder and Mead. Function values are compared
in the parameter space. Via step size control the minimum is approached.
Parameter errors are only approximate, no covariance matrix is calculated.
\vspace{0.2cm}
<!---
A simplex is the smallest n dimensional figure with n+1 corners. By reflecting
one point in the hyperplane of the other point and adopts itself to the
function plane.
-->
\textcolor{olive}{MIGRAD}:
Uses an algorithm of R. Fletcher, which takes the function and the gradient
to approach the minimum with a variable metric method. An error matrix and
correlation coefficients are available.
\vspace{0.2cm}
\textcolor{olive}{HESSE}:
Calculates the Hessian matrix of second derivatives and determines the
covariance matrix.
\vspace{0.2cm}
\textcolor{olive}{MINOS}:
Calculates (asymmetric) errors using likelihood profiles.
The algorithm for finding the positive and negative MINOS errors for parameter
$n$ consists of varying parameter $n$, each time minimizing $F(X,P)$ with respect to
all the others.
\vspace{0.2cm}
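The idea behind HESSE can be sketched for two parameters: estimate the matrix of second derivatives $H$ at the minimum numerically and, for a $\chi^2$-like function where one standard deviation corresponds to $\Delta F = 1$, take the covariance matrix as $V = 2 H^{-1}$. This is a simplified illustration with hypothetical function names, not the actual MINUIT implementation.

\footnotesize
```python
def hessian_2d(f, p, h=1e-3):
    """Central finite-difference Hessian of f at p = (p0, p1)."""
    p0, p1 = p
    f00 = f(p0, p1)
    h00 = (f(p0 + h, p1) - 2 * f00 + f(p0 - h, p1)) / h ** 2
    h11 = (f(p0, p1 + h) - 2 * f00 + f(p0, p1 - h)) / h ** 2
    h01 = (f(p0 + h, p1 + h) - f(p0 + h, p1 - h)
           - f(p0 - h, p1 + h) + f(p0 - h, p1 - h)) / (4 * h ** 2)
    return [[h00, h01], [h01, h11]]

def covariance_from_chi2(f, p_min):
    """Covariance V = 2 H^{-1} of a chi-square-like function at its minimum."""
    (a, b), (_, d) = hessian_2d(f, p_min)
    det = a * d - b * b
    return [[2 * d / det, -2 * b / det], [-2 * b / det, 2 * a / det]]

# toy chi-square with known errors: sigma_a = 0.5, sigma_b = 2.0, no correlation
def chi2(a, b):
    return ((a - 1.0) / 0.5) ** 2 + ((b - 2.0) / 2.0) ** 2

cov = covariance_from_chi2(chi2, (1.0, 2.0))
sigma_a, sigma_b = cov[0][0] ** 0.5, cov[1][1] ** 0.5
print(sigma_a, sigma_b)
```
\normalsize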
## Minuit - a program package for minimization (3)
\vspace{0.4cm}
Fit process with the minuit package
\vspace{0.2cm}
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* The individual steps described above can be called several times and in different order during the minimization process.
* Each of the parameters $p_i$ of $P=(p_1,....p_k)$ can be set constant and
released during the minimization steps.
* Problems are expected in models with strong correlation between
parameters $\rightarrow$ change model to uncorrelated definitions
* Local minima, edges/steps or undefined ranges in $F(X,P)$ are problematic
$\rightarrow$ simplify your model
\vspace{3cm}
## Minuit2 - The iminuit package
\vspace{0.4cm}
[\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) is
a Jupyter-friendly Python interface for the Minuit2 C++ library.
\vspace{0.2cm}
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* The class `iminuit.Minuit` instantiates the minuit object. The minimizer
function is given as argument. Basic steering of the fit
like setting start parameters, error definition and print level is also
done here.
\footnotesize
```python
from iminuit import Minuit
def fcn(x, y, z):  # definition of the minimizer function
    return (x - 2) ** 2 + (y - x) ** 2 + (z - 4) ** 2
m = Minuit(fcn, x=0, y=0, z=0, errordef=1 , print_level=1)
```
\normalsize
* Several methods determine the interaction with the fitting process, calls
to `migrad` , `hesse` or printing of parameters and errors
\footnotesize
```python
......
m.migrad() # run optimiser
print(m.values , m.errors) # print results
m.hesse() # run covariance estimator
```
\normalsize
## Minuit2 - iminuit example
\vspace{0.2cm}
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* The function `fcn` describes the model with parameters to be determined by
data. `fcn` is minimal when the model parameters agree best with data.
`fcn` has positional arguments, one for each fit parameter. `iminuit`
example fit:
[\textcolor{violet}{02\_fit\_exp\_fit\_iMinuit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_exp_fit_iMinuit.py)
\footnotesize
```python
......
x = np.array([....],dtype='d') # measurements x
y = np.array([....],dtype='d') # measurements y
dy = np.array([....],dtype='d') # error in y
def xp(a, b, c):
    return a * np.exp(b*x) + c
# least-squares function = sum of data residuals squared
def fcn(a, b, c):
    return np.sum((y - xp(a, b, c)) ** 2 / dy ** 2)
# limit the range of b and fix parameter c
m = Minuit(fcn,a=1,b=-0.7,c=1,limit_b=(-1,0.1),fix_c=True)
m.migrad() # run minimizer
m.fixed["c"] = False # release parameter c
m.migrad() # rerun minimizer
```
\normalsize
* Might be useful to fix parameters or limit the range for some applications
## Minuit2 - iminuit (3)
\vspace{0.2cm}
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Results and control information of the fit can be printed and accessed
in the program.
\footnotesize
```python
......
m = Minuit(fcn,....,print_level=1) # set flag in the initializer
m.migrad() # run minimizer
a_fit = m.values['a'] # get parameter value a
a_fit_error = m.errors['a'] # get parameter error of a
print (m.values,m.errors) # print results
```
\normalsize
* After running Hesse, covariance and correlation information of the
fit is available
\footnotesize
```python
......
m.hesse() # run covariance estimator
m.matrix() # get covariance matrix
m.matrix(correlation=True) # get full correlation matrix
cov = m.np_matrix() # save matrix to numpy
cor = m.np_matrix(correlation=True)
print(cor[0, 1]) # print correlation between parameter 1 and 2
```
\normalsize
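
The relation between the two matrices can be sketched with NumPy alone. The covariance matrix below is a hypothetical example, not the output of a real fit:

\footnotesize
```python
import numpy as np

# hypothetical 2x2 covariance matrix of two fit parameters
cov = np.array([[0.04, 0.01],
                [0.01, 0.25]])
sigma = np.sqrt(np.diag(cov))        # parameter errors: square roots of the diagonal
cor = cov / np.outer(sigma, sigma)   # cor_ij = cov_ij / (sigma_i * sigma_j)
print(sigma)                         # errors of the two parameters
print(cor[0, 1])                     # correlation coefficient of the two parameters
```
\normalsize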
## Minuit2 - iminuit (4)
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Minos provides asymmetric uncertainty intervals and parameter contours by
scanning one parameter and minimizing the function with respect to all other
parameters for each scan point. Results are displayed with `matplotlib`.
\footnotesize
```python
......
m.minos()
print (m.get_merrors()['a'])
m.draw_mnprofile('b')
m.draw_mncontour('a', 'b', nsigma=4)
```
::: columns
:::: {.column width=40%}
![](figures/iminuit_minos_scan-1.png)
::::
:::: {.column width=40%}
![](figures/iminuit_minos_scan-2.png)
::::
:::
## Exercise 3
Plot the following data with matplotlib as in the iminuit example:
\footnotesize
```
x: 0.2,0.4,0.6,0.8,1.,1.2,1.4,1.6,1.8,2.,2.2,2.4,2.6,2.8,3.,3.2,
3.4,3.6, 3.8,4.
y: 0.04,0.021,0.035,0.03,0.029,0.019,0.024,0.018,0.019,0.022,0.02,
0.025,0.018,0.024,0.019,0.021,0.03,0.019,0.03,0.024
dy: 1.792,1.695,1.541,1.514,1.427,1.399,1.388,1.270,1.262,1.228,1.189,
1.182,1.121,1.129,1.124,1.089,1.092,1.084,1.058,1.057
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}
* In the example iminuit fit `02_fit_exp_fit_iMinuit.ipynb`, replace the
exponential function with a 3rd order polynomial and perform the fit
* Compare the correlation of the parameters of the exponential and
the polynomial fit
* What defines the fit quality? Give an estimate.
\small
Solution: [\textcolor{violet}{02\_fit\_ex\_3\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_3_sol.py) \normalsize
## Exercise 4
Plot the following data with matplotlib:
\footnotesize
```
x: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
dx: 0.1,0.1,0.5,0.1,0.5,0.1,0.5,0.1,0.5,0.1
y: 1.1,2.3,2.7,3.2,3.1,2.4,1.7,1.5,1.5,1.7
dy: 0.15,0.22,0.29,0.39,0.31,0.21,0.13,0.15,0.19,0.13
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}
* Perform a fit with iminuit. Which model do you use?
* Plot the resulting fit function in the graph with the data
* Print the covariance matrix. Can we improve the errors?
* Can you draw a contour plot of 2 of the fit parameters?
\small
Solution: [\textcolor{violet}{02\_fit\_ex\_4\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_4_sol.py) \normalsize
## PyROOT
[\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) is the python binding for the C++ data analysis toolkit [\textcolor{violet}{ROOT}](https://root.cern/) developed with and for the LHC community. You can access the full
ROOT functionality from Python while
benefiting from the performance of the ROOT C++ libraries. The PyROOT bindings
are automatic and dynamic and are able to interoperate with widely-used Python
data-science libraries such as `NumPy`, `pandas`, `SciPy`, `scikit-learn` and `tensorflow`.
* ROOT/PyROOT can be installed easily within anaconda3 (ROOT version 6.22.02
or later) or is available in the
[\textcolor{violet}{CIP jupyter2 Hub}](https://jupyter2.kip.uni-heidelberg.de/)
* Tools for statistical analysis, a math library with optimized algorithms,
multivariate analysis, visualization and simulation of data.
* Storing data including objects and classes with compression in files is a
very powerful aspect for any data analysis project
* Within PyROOT Minuit2 can be accessed easily either with predefined functions
or your own function definition
* For advanced statistical analyses and data modeling likelihood fitting with
the packages **RooFit** and **RooStats** is available.
##
* Example: read the invariant mass measurements of a $D^0$ from a text file
and determine $\mu$ and $\sigma$ \hspace{1.0cm} \small
[\textcolor{violet}{02\_fit\_histFit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_histFit.py)
\normalsize
\footnotesize
```python
import numpy as np
import math
from ROOT import TCanvas, TFile, TH1D, TF1, TMinuit, TFitResult
data = np.genfromtxt('D0Mass.txt', dtype='d') # read data from text file
c = TCanvas('c','D0 Mass',200,10,700,500) # instantiate output canvas
d0 = TH1D('d0','D0 Mass',200,1700.,2000.) # instantiate histogram
for x in data : # fill data into histogram d0
d0.Fill(x)
def pyf_tf1_params(x, p): # define fit function
return p[0] * math.exp (-0.5 * ((x[0] - p[1])**2 / p[2]**2))
func = TF1("func",pyf_tf1_params,1840.,1880.,3)
# func = TF1("func",'gaus',1840.,1880.) # use predefined function
func.SetParameters(500.,1860.,5.5) # set start parameters
myfit = d0.Fit(func,"S") # fit function to the histogram data
print ("Fit results: mean=",myfit.Parameter(1)," +/- ",myfit.ParError(1)) # p[1] is the mean
c.Draw() # draw canvas
myfile = TFile('myOutFile.root','RECREATE') # Open a ROOT file for output
c.Write() # Write canvas
d0.Write() # Write histogram
myfile.Close() # close file
```
\normalsize
##
* Fit Options
\vspace{0.1cm}
::: columns
:::: {.column width=2%}
::::
:::: {.column width=98%}
![](figures/rootOptions.png)
::::
:::
## Exercise 5
Read text file [\textcolor{violet}{FitTestData.txt}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/FitTestData.txt) and draw a histogram using PyROOT.
\setbeamertemplate{itemize item}{\color{red}$\square$}
* Determine the mean and sigma of the signal distribution. Which function do
you use for fitting?
* The option S fills the result object.
* Try to improve the errors of the fit values with minos using the option E
and also try the option M to scan for a new minimum, option V provides more
output.
* Fit the background outside the signal region; use the option R+ to add the
function to your fit
\small
Solution: [\textcolor{violet}{02\_fit\_ex\_5\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_5_sol.py) \normalsize
## iPython Examples for Fitting
The different python packages are used in
\textcolor{blue}{example iPython notebooks}
to demonstrate the fitting of a third order polynomial to the same data
available as numpy arrays.
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* LSQ fit of a polynomial to data using Minuit2 with
\textcolor{blue}{iminuit} and \textcolor{blue}{matplotlib} plot:
\small
[\textcolor{violet}{02\_fit\_iminuitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_iminuitFit.ipynb)
\normalsize
* Graph fitting with \textcolor{blue}{pyROOT} with options using a python
function including confidence level plot:
\small
[\textcolor{violet}{02\_fit\_fitGraph.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_fitGraph.ipynb)
\normalsize
* Graph fitting with \textcolor{blue}{numpy} and confidence level
plotting with \textcolor{blue}{matplotlib}:
\small
[\textcolor{violet}{02\_fit\_numpyFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_numpyFit.ipynb)
\normalsize
* Graph fitting with a polynomial fit of \textcolor{blue}{scikit-learn} and
plotting with \textcolor{blue}{matplotlib}:
\normalsize
\small
[\textcolor{violet}{02\_fit\_scikitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_scikitFit.ipynb)
\normalsize

830
slides/intro_python.md Normal file

@ -0,0 +1,830 @@
---
title: |
| Introduction to Data Analysis and Machine Learning in Physics:
| 1. Introduction to python
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---
## Outline of the $1^{st}$ day
* Technical instructions for your interactions with the CIP pool, like
* using the jupyter hub
* using python locally in your own Linux environment (anaconda)
* access the CIP pool from your own Windows or Linux system
* transfer data from and to the CIP pool
Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.pdf)\normalsize
* Summary of NumPy
* Plotting with matplotlib
* Input / output of data
* Summary of pandas
* Fitting with iminuit and pyROOT
## A glimpse into python classes
The following python libraries are important for data analysis and machine
learning and will be used during the course
* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
multi-dimensional arrays and matrices, along with high-level
mathematical functions to operate on these arrays
* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library
* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
mathematical algorithms for minimization, regression,
Fourier transformation, linear algebra and image processing
* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
python wrapper to the data fitting toolkit
[\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
developed at CERN by F. James in the 1970s
* [\textcolor{violet}{pyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
ROOT used at the LHC
[\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
python, which makes extensive use of NumPy for high-performance
linear algebra algorithms
## NumPy
\textcolor{blue}{NumPy} (Numerical Python) is an open source Python library,
which contains multidimensional array and matrix data structures and methods
to efficiently operate on these. The core object is
a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
with arrays and matrices} due to the extensive usage of compiled code.
* It is heavily used in numerous scientific python packages
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
creates a new array
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* For a summary see, e.g. &nbsp;&nbsp;
\small
[\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize
\vfill
::: columns
:::: {.column width=30%}
::::
:::
::: columns
:::: {.column width=35%}
`c = []`
`for i in range(len(a)):`
&nbsp;&nbsp;&nbsp; `c.append(a[i]*b[i])`
::::
:::: {.column width=35%}
with NumPy
`c = a * b`
::::
:::
<!---
It seem we need to indent by hand.
I don't manage to align under the bullet text
If we do it with column the vertical space is with code sections not good
If we do it without code section the vertical space is ok, but there is no
code high lightning.
See the different versions of the same page in the following
-->
## NumPy - array basics
* numpy arrays build a grid of \textcolor{blue}{same type} values, which are indexed.
The *rank* is the dimension of the array.
There are methods to create and preset arrays.
\footnotesize
```python
import numpy as np
myA = np.array([2, 5 , 11])            # create rank 1 array (vector like)
type(myA)                              # <class 'numpy.ndarray'>
myA.shape                              # (3,)
print(myA[2])                          # 11, access 3rd element
myA[0] = 12                            # set 1st element to 12
myB = np.array([[1,5],[7,9]])          # create rank 2 array
myB.shape                              # (2,2)
print(myB[0,0],myB[0,1],myB[1,1])      # 1 5 9
myC = np.arange(6)                     # create rank 1 set to 0 - 5
myC.reshape(2,3)                       # returns array of shape (2,3), myC is unchanged
zero = np.zeros((2,5)) # 2 rows, 5 columns, set to 0
one = np.ones((2,2)) # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5) # 2 rows, 2 columns, set to 5
e = np.eye(2) # create 2x2 identity matrix
```
\normalsize
## NumPy - array indexing (1)
* select slices of a numpy array
\footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],               # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                         # subarray of rows 0, 1 and columns 1, 2
# b is array([[2, 3],
#             [6, 7]])
```
\normalsize
* a slice of an array is a view into the same data, *modifying* it changes the original array!
\footnotesize
```python
b[0, 0] = 77 # b[0,0] and a[0,1] are 77
r1_row = a[1, :] # get 2nd row -> rank 1
r1_row.shape # (4,)
r2_row = a[1:2, :] # get 2nd row -> rank 2
r2_row.shape # (1,4)
a=np.array([[1,2],[3,4],[5,6]]) # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]] # d contains [1 4 6]
e=a[[1, 2], [1, 1]] # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,0]]) # address elements explicitly
```
\normalsize
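
Whether an indexing operation returns a view or a copy can be checked with `np.shares_memory`. The following is a small sketch with an arbitrary example array:

\footnotesize
```python
import numpy as np

a = np.arange(12).reshape(3, 4)
view = a[:2, 1:3]            # basic slicing returns a view
copy = a[:2, 1:3].copy()     # explicit copy owns its data
fancy = a[[0, 1], [1, 2]]    # integer array indexing returns a copy

view[0, 0] = 77              # also changes a[0, 1]
copy[0, 1] = -1              # a is not affected
print(a[0, 1], a[0, 2])
print(np.shares_memory(a, view), np.shares_memory(a, copy),
      np.shares_memory(a, fancy))
```
\normalsize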
## NumPy - array indexing (2)
* integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements
\footnotesize
```python
a = np.array([[1,2,3,4],
[5,6,7,8], # 3 rows 4 columns array
[9,10,11,12]])
p_a = np.array([0,2,0]) # Create an array of indices
s = a[np.arange(3), p_a] # number the rows, p_a points to cols
print (s) # s contains [1 7 9]
a[np.arange(3),p_a] += 10 # add 10 to corresponding elements
x=np.array([[8,2],[7,4]]) # create 2x2 array
mask = (x > 5) # mask : array of booleans
# [[True False]
# [True False]]
print(x[mask]) # select elements, prints [8 7]
```
\normalsize
* data type in numpy - create according to input numbers or set explicitly
\footnotesize
```python
x = np.array([1.1, 2.1]) # create float array
print(x.dtype) # print float64
y=np.array([1.1,2.9],dtype=np.int64) # create integer array [1 2], values are truncated
```
\normalsize
## NumPy - functions
* math functions operate elementwise either as operator overload or as methods
\footnotesize
```python
x=np.array([[1,2],[3,4]],dtype=np.float64) # define 2x2 float array
y=np.array([[3,1],[5,1]],dtype=np.float64) # define 2x2 float array
s = x + y # elementwise sum
s = np.add(x,y)
s = np.subtract(x,y)
s = np.multiply(x,y) # no matrix multiplication!
s = np.divide(x,y)
s = np.sqrt(x)                              # also np.exp(x), np.sin(x), ...
m = x @ y                                   # matrix product, same as np.dot(x, y)
np.sum(x, axis=0)                           # sum of each column
np.sum(x, axis=1)                           # sum of each row
xT = x.T                                    # transpose of x
x = np.linspace(0,2*np.pi,100)              # get equally spaced points in x
r = np.random.default_rng(seed=42)          # create random number generator
b = r.random((2,3))                         # random 2x3 matrix
```
\normalsize
##
* broadcasting in numpy
\vspace{0.4cm}
The term broadcasting describes how numpy treats arrays with different
shapes during arithmetic operations
* add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
$[b,b,b]$
\vspace{0.2cm}
* add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
$\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element wise
\vspace{0.2cm}
* add 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ 1D array is broadcast
across each row of the 2D array $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added element wise
\vspace{0.2cm}
Arithmetic operations can only be performed when, for each dimension, the
sizes of the arrays are equal or one of them is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details
\footnotesize
```python
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]]) # x has shape (2, 3)
v = np.array([1,2,3]) # v has shape (3,)
x + v # [[2 4 6]
# [5 7 9]]
```
\normalsize
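
The compatibility rule can be sketched with a vector whose length matches the number of rows instead of the number of columns. The arrays are arbitrary examples:

\footnotesize
```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
w = np.array([10, 20])                 # shape (2,)

col = x + w[:, np.newaxis]             # (2, 3) + (2, 1): w is added to each column
try:
    x + w                              # (2, 3) + (2,): trailing sizes 3 and 2 differ
    compatible = True
except ValueError:
    compatible = False
print(col)
print(compatible)
```
\normalsize

Adding the axis with `np.newaxis` turns `w` into a column of shape (2, 1), so its size-1 dimension can be expanded along the columns of `x`.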
## Plot data
A popular library to present data is the `pyplot` module of `matplotlib`.
* Drawing a function in one plot
\footnotesize
::: columns
:::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range
# show the plot
plt.show()
```
::::
:::: {.column width=40%}
![](figures/matplotlib_Figure_1.png)
::::
:::
\normalsize
##
* Drawing subplots in one canvas
\footnotesize
::: columns
:::: {.column width=35%}
```python
...
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create subplot and plot first one
plt.subplot(1,2,1)
# plot first one
plt.title('exp(x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create subplot and plot second one
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine'])
# show the plot
plt.show()
```
::::
:::: {.column width=40%}
\vspace{3cm}
![](figures/matplotlib_Figure_2.png)
::::
:::
\normalsize
## Image data
The `image` module of the `matplotlib` library can be used to load an image
into a numpy array and to render it.
* There are 3 common formats for the numpy array
* (M, N) scalar data used for greyscale images
* (M, N, 3) for RGB images (each pixel has an array with RGB color attached)
* (M, N, 4) for RGBA images (each pixel has an array with RGB color
and transparency attached)
The function `imread` loads the image into an `ndarray`, which can be
manipulated.
The function `imshow` renders the image data.
\vspace {2cm}
##
* Drawing pixel data and images
\footnotesize
::: columns
:::: {.column width=50%}
```python
....
# create data array with pixel position and RGB color code
width, height = 400, 400
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[175:225, 175:225] = [255, 0, 0]
x = np.random.randint(0,width-1,100)
y = np.random.randint(0,height-1,100)
data[x,y]= [0,255,0] # random green pixel
plt.imshow(data)
plt.show()
....
import matplotlib.image as mpimg
#read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0] # grab slice 0 of the colors
plt.imshow(mod_pic) # use default color code also
plt.colorbar() # try cmap='hot'
plt.show()
```
::::
:::: {.column width=25%}
![](figures/matplotlib_Figure_3.png)
\vspace{1cm}
![](figures/matplotlib_Figure_4.png)
::::
:::
\normalsize
## Input / output
For the analysis of measured data efficient input \/ output plays an
important role. In numpy, `ndarray`s can be saved to and read from files.
`load()` and `save()` functions handle numpy binary files (.npy extension)
which contain data, shape, dtype and other information required to
reconstruct the `ndarray` from the disk file.
\footnotesize
```python
r = np.random.default_rng() # instantiate random number generator
a = r.random((4,3)) # random 4x3 array
np.save('myBinary.npy', a) # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b to one uncompressed .npz archive
......
b = np.load('myBinary.npy') # read content of myBinary.npy into b
```
\normalsize
The storage and retrieval of array data in text file format is done
with `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.
\footnotesize
```python
x = np.array([1,2,3,4,5,6,7]) # create ndarray
np.savetxt('myText.txt',x,fmt='%d') # write array x to text file myText.txt
.....
y = np.loadtxt('myText.txt',dtype=int) # read content of myText.txt in y
```
\normalsize
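
Several arrays stored in one `.npz` archive are read back by the keyword names used when saving. The sketch below uses `np.savez_compressed`, the compressed variant of `np.savez`, and writes to a temporary directory. The file name and array contents are only illustrative:

\footnotesize
```python
import os
import tempfile
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(12)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myComp.npz")
    np.savez_compressed(path, a=a, b=b)   # compressed variant of np.savez
    with np.load(path) as data:           # archive object, used like a dictionary
        names = sorted(data.files)        # keyword names used when saving
        a_back = data["a"]
        b_back = data["b"]
print(names, a_back.dtype, a_back.shape)
```
\normalsize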
## Exercise 1
i) Display a numpy array as figure of a blue cross. The size should be 200
by 200 pixels. Use as array format (M, N, 3), where the first 2 indices specify
the pixel position and the last one the RGB color components from 0 to 255.
- Draw in addition a red square of arbitrary position into the figure.
- Draw a circle in the center of the figure. Try to create a mask which
selects the inner part of the circle using the indexing.
\small
[Solution: 01_intro_ex_1a_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1a_sol.py) \normalsize
ii) Read data which contains pixels from the binary file horse.py into a
numpy array. Display the data and the following transformations in 4
subplots: scaling and translation, compression in x and y, rotation
and mirroring.
\small
[Solution: 01_intro_ex_1b_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1b_sol.py) \normalsize
## Pandas
[\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in Python for
\textcolor{blue}{data manipulation and analysis}.
\vspace{0.4cm}
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Offers data structures and operations for manipulating numerical tables with
integrated indexing
* Imports data from various file formats, e.g. comma-separated values, JSON,
SQL or Excel
* Tools for reading and writing data structures, allows analyzing, filtering,
splitting, merging and joining
* Built on top of `NumPy`
* Visualize the data with `matplotlib`
* Most machine learning tools support `pandas` $\rightarrow$
it is widely used to preprocess data sets for machine learning
## Pandas micro introduction
Goal: Exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* `Series` -> vector (list) of data elements of arbitrary type
* `DataFrame` -> tabular arrangement of data elements with an arbitrary
type per column
Both allow cleaning data by removing `empty` or `nan` data entries
\footnotesize
```python
import numpy as np
import pandas as pd # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8]) # create a Series of float64
r = pd.Series(np.random.randn(4)) # Series of random numbers float64
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print (df) # print the DataFrame
#                    A         B         C         D
# 2013-01-01  1.618395  1.210263 -1.276586 -0.775545
# 2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
# 2013-01-03 -0.359081  0.296019  1.541571  0.235337
new_s = s.dropna() # return a new Series without NaN entries
```
\normalsize
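
A short sketch of the cleaning step for the `Series` defined above: missing entries can be counted, dropped or replaced. Replacing by the mean is only one possible choice, used here for illustration:

\footnotesize
```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
n_missing = int(s.isna().sum())      # number of NaN entries
dropped = s.dropna()                 # new Series, NaN entries removed
filled = s.fillna(s.mean())          # new Series, NaN replaced by the mean of the rest
print(n_missing, len(dropped), filled[3])
```
\normalsize

Both `dropna` and `fillna` return new objects, the original `s` keeps its `NaN` entry.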
##
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* pandas data can be saved in different file formats (CSV, JSON, html, XML,
Excel, OpenDocument, HDF5 format, .....). `NaN` entries are kept
in the output file.
* csv file
\footnotesize
```python
df.to_csv("myFile.csv") # Write the DataFrame df to a csv file
```
\normalsize
* HDF5 output
\footnotesize
```python
df.to_hdf("myFile.h5",key='df',mode='w') # Write the DataFrame df to HDF5
s.to_hdf("myFile.h5", key='s',mode='a')
```
\normalsize
* Writing to an excel file
\footnotesize
```python
df.to_excel("myFile.xlsx", sheet_name="Sheet1")
```
\normalsize
* Deleting a data file in python
\footnotesize
```python
import os
os.remove('myFile.h5')
```
\normalsize
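
A CSV round trip can be sketched without touching the disk by writing into an in-memory text buffer. The `io.StringIO` buffer stands in for a file here, and the DataFrame contents are arbitrary:

\footnotesize
```python
import io
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), index=dates, columns=list("ABCD"))
df.loc[dates[1], "B"] = np.nan               # one missing entry

buffer = io.StringIO()                       # stands in for a file on disk
df.to_csv(buffer)                            # NaN is written as an empty field
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(df_back)
```
\normalsize

The options `index_col=0` and `parse_dates=True` restore the date index, which would otherwise be read as an ordinary text column.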
##
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* read in data from various formats
* csv file
\footnotesize
```python
.......
df = pd.read_csv('heart.csv') # read csv data table
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 303 entries, 0 to 302
# Data columns (total 14 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 age 303 non-null int64
# 1 sex 303 non-null int64
# 2 cp 303 non-null int64
print(df.head(5)) # prints the first 5 rows of the data table
print(df.describe()) # shows a quick statistic summary of your data
```
\normalsize
* Reading an excel file
\footnotesize
```python
df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
```
\normalsize
\textcolor{olive}{There are many options specifying details for IO.}
##
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Various functions exist to select and view data from pandas objects
* Display column and index
\footnotesize
```python
df.index # show datetime index of df
# DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
#               dtype='datetime64[ns]',freq='D')
df.columns # show columns info
# Index(['A', 'B', 'C', 'D'], dtype='object')
```
\normalsize
* `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data
\footnotesize
```python
df.to_numpy() # one dtype for the entire array, not per column!
# [[-0.62660101 -0.67330526 0.23269168 -0.67403546]
#  [-0.53033339 0.32872063 -0.09893568 0.44814084]
#  [-0.60289996 -0.22352548 -0.43393248 0.47531456]]
```
\normalsize
Does not include the index or column labels in the output
* more on viewing
\footnotesize
```python
df.T # transpose the DataFrame df
df.sort_values(by="B") # Sorting by values of a column of df
df.sort_index(axis=0,ascending=False) # Sorting by index descending values
df.sort_index(axis=1,ascending=False) # Display columns in inverse order
```
\normalsize
##
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions
* get a named column as a Series
\footnotesize
```python
df["A"] # selects column A from df, similar to df.A
df.iloc[:, 0:1] # slices column A as a DataFrame, same as df.loc[:, ["A"]]
```
\normalsize
* select rows of a DataFrame
\footnotesize
```python
df[0:2] # selects row 0 and 1 from df
df["20130102":"20130103"] # slice by index labels, endpoints are included!
df.iloc[2] # select a row by its integer position
df.iloc[1:3, :] # selects row 1 and 2 from df
```
\normalsize
* select by label
\footnotesize
```python
df.loc["20130102":"20130103",["C","D"]] # selects row 1 and 2 and only C and D
df.loc[dates[0], "A"] # selects a single value (scalar)
```
\normalsize
* select by lists of integer position (as in `NumPy`)
\footnotesize
```python
df.iloc[[0, 2], [1, 3]] # select 1st and 3rd row and col B and D
df.iloc[1, 1] # get a value explicitly
```
\normalsize
* select according to expressions
\footnotesize
```python
df.query('B<C') # select rows where B < C
df1=df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
```
\normalsize
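
The difference between label-based and position-based slicing, and between keeping and reducing dimensions, can be sketched as follows. The DataFrame is an arbitrary example with the same index and columns as above:

\footnotesize
```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130101":"20130102", ["A", "B"]]  # label slice: endpoint included
by_pos = df.iloc[0:2, 0:2]                            # position slice: endpoint excluded
col_series = df["A"]                                  # one column name -> Series
col_frame = df[["A"]]                                 # list of names -> DataFrame
print(by_label.equals(by_pos))
print(type(col_series).__name__, type(col_frame).__name__)
```
\normalsize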
##
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Selecting data of pandas objects continued
* Boolean indexing
\footnotesize
```python
df[df["A"] > 0] # select the rows of df where column A is > 0
df[df > 0] # keep values > 0 in the entire DataFrame, others become NaN
```
\normalsize
more complex example
\footnotesize
```python
df2 = df.copy() # copy df
df2["E"] = ["eight","one","four"] # add column E
df2[df2["E"].isin(["two", "four"])] # test if elements "two" and "four" are
# contained in Series column E
```
\normalsize
* Operations (in general exclude missing data)
\footnotesize
```python
df2 = df.copy() # numeric copy without the string column E
df2[df2 > 0] = -df2 # All elements > 0 change sign
df.mean(0) # get column wise mean (numbers=axis)
df.mean(1) # get row wise mean
df.std(0) # standard deviation according to axis
df.cumsum() # cumulative sum of each column
df.apply(np.sin) # apply function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply lambda function column wise
df + 10 # add scalar 10
df - [1, 2, 10 , 100] # subtract values of each column
df.corr() # Compute pairwise correlation of columns
```
\normalsize
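
How missing data are excluded can be sketched with a small DataFrame containing one `NaN`. The values are chosen only for illustration:

\footnotesize
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [2.0, 4.0, 6.0]})
mean_skip = df.mean(axis=0)                  # NaN ignored: A -> (1 + 3) / 2
mean_strict = df.mean(axis=0, skipna=False)  # NaN propagates into the result
n_valid = df.count()                         # number of non-NaN entries per column
print(mean_skip["A"], mean_strict["A"], n_valid["A"])
```
\normalsize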
## Pandas - plotting data
[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are only 2 examples
* Plot random data in histogram and scatter plot
\footnotesize
```python
# create DataFrame with random normal distributed data
df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
df = df + [1, 3, 8 , 10] # shift mean to 1, 3, 8 , 10
plt.figure()
df.plot.hist(bins=20) # histogram all 4 columns
g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
```
\normalsize
::: columns
:::: {.column width=35%}
![](figures/pandas_histogramm.png)
::::
:::: {.column width=35%}
![](figures/pandas_scatterplot.png)
::::
:::
## Pandas - plotting data
The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame of variable counts on the inputs.
\footnotesize
```python
df = pd.DataFrame( # create DataFrame of 2 categories
{"sex": np.array([0,0,0,0,1,1,1,1,0,0,0]),
"heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
} ) # closing bracket goes on next line
pd.crosstab(df.sex,df.heart) # create cross table of possibilities
pd.crosstab(df.sex,df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
\normalsize
::: columns
:::: {.column width=42%}
![](figures/pandas_crosstabplot.png)
::::
:::
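
The counts behind the bar plot can be inspected directly. As a small sketch with the same two columns, `normalize="index"` converts the counts of each row into fractions:

\footnotesize
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"sex": np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]),
     "heart": np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1])})
counts = pd.crosstab(df.sex, df.heart)                        # absolute counts
fractions = pd.crosstab(df.sex, df.heart, normalize="index")  # each row sums to 1
print(counts)
print(fractions)
```
\normalsize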
## Exercise 2
Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/heart.csv) into a DataFrame.
[\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)
\setbeamertemplate{itemize item}{\color{red}$\square$}
* Which columns do we have?
* Print the first 3 rows
* Print the statistics summary and the correlations
* Print mean values for each column with and without disease
* Select the data according to `sex` and `target` (heart disease 0=no 1=yes).
* Plot the `age` distribution of male and female in one histogram
* Plot the heart disease distribution according to chest pain type `cp`
* Plot `thalach` according to `target` in one histogram
* Plot `sex` and `target` in a histogram figure
* Correlate `age` and `max heart rate` according to `target`
* Correlate `age` and `cholesterol` according to `target`
\small
[Solution: 01_intro_ex_2_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_2_sol.py) \normalsize

Some files were not shown because too many files have changed in this diff