initial set of slides corresponding to 2022 version
1
slides/.gitignore
vendored
Normal file
@ -0,0 +1 @@
|
||||
.DS_Store
|
10
slides/Makefile
Normal file
@ -0,0 +1,10 @@
|
||||
# make creates pdf files of all newly edited .md files
|
||||
|
||||
SRCS := $(wildcard *.md)
|
||||
PDF := $(SRCS:%.md=%.pdf)
|
||||
|
||||
OPT := --pdf-engine=xelatex --variable mainfont="Helvetica" --variable sansfont="Helvetica" -t beamer -s -fmarkdown-implicit_figures --template=template.beamer --highlight-style=kate
|
||||
all: ${PDF}
|
||||
|
||||
%.pdf: %.md
|
||||
pandoc $(OPT) --output=$@ $<
|
2
slides/README.md
Normal file
@ -0,0 +1,2 @@
|
||||
Pandoc slides example following the style of [Stefan Wunsch's CERN IML workshop presentation](https://github.com/stwunsch/iml_keras_workshop) on [keras](https://keras.io/) (see slides folder)
|
||||
|
6
slides/copy_slides.sh
Executable file
@ -0,0 +1,6 @@
|
||||
#!/bin/bash
# slides (do chgrp machlearn <file> later)
|
||||
# scp CIPpoolAccess.PDF reygers@rho0:public_html/lectures/2021/ml/transparencies/
|
||||
# scp 03_ml_basics.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
|
||||
# scp 04_decision_trees.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
|
||||
scp 05_neural_networks.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
|
||||
|
347
slides/decision_trees.md
Normal file
@ -0,0 +1,347 @@
|
||||
---
|
||||
title: |
|
||||
| Introduction to Data Analysis and Machine Learning in Physics:
|
||||
| 4. Decision Trees
|
||||
|
||||
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
|
||||
date: "Studierendentage, 11-14 April 2022"
|
||||
---
|
||||
## Exercises
|
||||
|
||||
* Exercise 1: Compare different decision tree classifiers
|
||||
* [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
|
||||
* Exercise 2: Apply XGBoost classifier to MAGIC data set
|
||||
* [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
|
||||
* Exercise 3: Feature importance
|
||||
* Exercise 4: Interpret a classifier with SHAP values
|
||||
|
||||
## Decision trees
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
|
||||
\end{figure}
|
||||
|
||||
\begin{center}
|
||||
Leaf nodes classify events as either signal or background
|
||||
\end{center}
|
||||
|
||||
## Decision trees: Rectangular volumes in feature space
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
|
||||
\end{figure}
|
||||
|
||||
* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
|
||||
* How to build a decision tree in an optimal way?
|
||||
|
||||
## Finding optimal cuts
|
||||
|
||||
Separation between signal and background is often measured with the Gini index (or Gini impurity):
|
||||
|
||||
$$ G = p (1-p) $$
|
||||
|
||||
Here $p$ is the purity:
|
||||
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$
|
||||
|
||||
\vfill
|
||||
\textcolor{gray}{Usefulness of weights will become apparent soon.}
|
||||
|
||||
\vfill
|
||||
Improvement in signal/background separation after splitting a set A into two sets B and C:
|
||||
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$
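A minimal sketch of these definitions in plain Python (the weighted event counts are made up for illustration):

\footnotesize
```python
def gini(w_sig, w_bkg):
    """Gini impurity G = p(1-p) with purity p from weighted event sums."""
    p = w_sig / (w_sig + w_bkg)
    return p * (1 - p)

# hypothetical weighted signal/background sums in node A and its children B, C
wA = (60.0, 40.0); wB = (50.0, 10.0); wC = (10.0, 30.0)
W = lambda w: w[0] + w[1]                       # W_X = sum of weights in X
delta = W(wA)*gini(*wA) - W(wB)*gini(*wB) - W(wC)*gini(*wC)
print(f"improvement of the split: {delta:.1f}")
```
\normalsize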
|
||||
|
||||
## Gini impurity and other purity measures
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
|
||||
\end{figure}
|
||||
|
||||
|
||||
## Decision tree pruning
|
||||
|
||||
::: columns
|
||||
:::: {.column width=50%}
|
||||
|
||||
When to stop growing a tree?
|
||||
|
||||
* When all nodes are essentially pure?
|
||||
* Well, that's overfitting!
|
||||
|
||||
\vspace{3ex}
|
||||
|
||||
Pruning
|
||||
|
||||
* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves
|
||||
|
||||
::::
|
||||
:::: {.column width=50%}
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
|
||||
\end{figure}
|
||||
::::
|
||||
:::
|
||||
|
||||
## Single decision trees: Pros and cons
|
||||
|
||||
\textcolor{green}{Pros:}
|
||||
|
||||
* Requires little data preparation (unlike neural networks)
|
||||
* Can use continuous and categorical inputs
|
||||
|
||||
\vfill
|
||||
|
||||
\textcolor{red}{Cons:}
|
||||
|
||||
* Danger of overfitting training data
|
||||
* Sensitive to fluctuations in the training data
|
||||
* Hard to find global optimum
|
||||
* When to stop splitting?
|
||||
|
||||
## Ensemble methods: Combine weak learners
|
||||
|
||||
::: columns
|
||||
:::: {.column width=70%}
|
||||
* Bootstrap Aggregating (Bagging)
|
||||
* Sample training data (with replacement) and train a separate model on each of the derived training sets
|
||||
* Classify examples by majority vote, or take the average output of the trees as the model output (see sketch below)
|
||||
|
||||
::::
|
||||
:::: {.column width=30%}
|
||||
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
|
||||
::::
|
||||
:::
|
||||
\vfill
|
||||
::: columns
|
||||
:::: {.column width=70%}
|
||||
* Boosting
|
||||
* Train $N$ models in sequence, giving more weight to examples not correctly classified by previous model
|
||||
* Take weighted average to classify examples
|
||||
|
||||
::::
|
||||
:::: {.column width=30%}
|
||||
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
|
||||
::::
|
||||
:::
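As an illustration of the bagging scheme above (not part of the original slides), scikit-learn's `BaggingClassifier` implements exactly this recipe; the toy data and hyperparameters are arbitrary choices:

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)  # toy data
# each tree is trained on a bootstrap sample drawn with replacement
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=100, random_state=42)
bag.fit(X, y)
print(bag.predict(X[:5]))  # majority vote of the 100 trees
```
\normalsize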
|
||||
|
||||
## Random forests
|
||||
|
||||
* "One of the most widely used and versatile algorithms in data science and machine learning"
|
||||
\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
|
||||
\vfill
|
||||
* Use bagging to select random example subset
|
||||
\vfill
|
||||
* Train a tree, but only use a random subset of features at each split
|
||||
* This reduces the correlation between different trees
|
||||
* Makes the decision more robust to missing data (see the sketch below)
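A minimal scikit-learn sketch of the two ingredients above (toy data, illustrative settings):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
# bagging of examples plus a random feature subset (max_features) per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=1)
forest.fit(X, y)
print(forest.score(X, y))   # accuracy on the training sample
```
\normalsize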
|
||||
|
||||
## Boosted decision trees: Idea
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
|
||||
\end{figure}
|
||||
|
||||
## AdaBoost (short for Adaptive Boosting)
|
||||
|
||||
Initial training sample
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{l l}
|
||||
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
|
||||
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
|
||||
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
with equal weights normalized as
|
||||
|
||||
$$ \sum_{i=1}^n w_i^{(1)} = 1 $$
|
||||
|
||||
Train first classifier $f_1$:
|
||||
|
||||
\begin{center}
|
||||
\begin{tabular}{l l}
|
||||
$f_1(\vec x_i) > 0$ & classify as signal \\
|
||||
$f_1(\vec x_i) < 0$ & classify as background
|
||||
\end{tabular}
|
||||
\end{center}
|
||||
|
||||
## AdaBoost: Updating event weights
|
||||
|
||||
Define training sample $k+1$ from training sample $k$ by updating weights:
|
||||
|
||||
$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$
|
||||
|
||||
\footnotesize
|
||||
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k+1)} = 1$$}
|
||||
\normalsize
|
||||
|
||||
The weight is increased if the event was misclassified by the previous classifier
|
||||
|
||||
$\to$ "Next classifier should pay more attention to misclassified events"
|
||||
|
||||
|
||||
\vfill
|
||||
At each step the classifier $f_k$ minimizes the error rate:
|
||||
|
||||
$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
|
||||
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$
|
||||
|
||||
## AdaBoost: Assigning the classifier score
|
||||
|
||||
Assign score to each classifier according to its error rate:
|
||||
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$
|
||||
|
||||
\vfill
|
||||
|
||||
Combined classifier (weighted average):
|
||||
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$
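A minimal numpy sketch of the AdaBoost recipe above, using decision stumps from scikit-learn as weak learners (an illustrative choice, not prescribed by the slides):

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=10):
    """Expects labels y in {+1, -1}; returns weak learners f_k and scores alpha_k."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # equal initial weights
    learners, alphas = [], []
    for k in range(K):
        f = DecisionTreeClassifier(max_depth=1)   # decision stump
        f.fit(X, y, sample_weight=w)
        pred = f.predict(X)
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)  # error rate
        alpha = np.log((1 - eps) / eps)           # classifier score
        w = w * np.exp(-alpha * pred * y / 2)     # boost misclassified events
        w /= w.sum()                              # normalization Z_k
        learners.append(f); alphas.append(alpha)
    return learners, alphas

def classify(X, learners, alphas):
    """Sign of f(x) = sum_k alpha_k f_k(x); dividing by sum_k alpha_k
    (weighted average as on the slide) would not change the sign."""
    return np.sign(sum(a * f.predict(X) for f, a in zip(learners, alphas)))
```
\normalsize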
|
||||
|
||||
|
||||
|
||||
## Gradient boosting
|
||||
|
||||
Basic idea:
|
||||
|
||||
* Train a first decision tree
|
||||
* Then train a second one on the residual errors made by the first tree
|
||||
* And so on
|
||||
|
||||
\vfill
|
||||
|
||||
In slightly more detail (see the sketch below):
|
||||
|
||||
* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
|
||||
* Model prediction at iteration $m$: $F_m(\vec x_i)$
|
||||
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
|
||||
* Find $h_m(\vec x)$ by fitting it to
|
||||
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$
|
||||
|
||||
\color{black}
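A minimal sketch of this residual-fitting loop with scikit-learn regression trees (squared-error loss assumed; tree depth and iteration count are arbitrary):

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_iter=100):
    """Fit each new tree h_m to the residuals y - F_m(x)."""
    F = np.zeros(len(y))                  # start with F_0 = 0
    trees = []
    for m in range(n_iter):
        h = DecisionTreeRegressor(max_depth=2)
        h.fit(X, y - F)                   # fit tree to current residuals
        F += h.predict(X)                 # F_{m+1}(x) = F_m(x) + h_m(x)
        trees.append(h)
    return trees

def predict(X, trees):
    return np.sum([h.predict(X) for h in trees], axis=0)
```
\normalsize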
|
||||
|
||||
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
|
||||
\small
|
||||
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
|
||||
\normalsize
|
||||
|
||||
\vfill
|
||||
|
||||
Superconductivity data set:
|
||||
|
||||
Predict the critical temperature based on 81 material features.
|
||||
\footnotesize
|
||||
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
|
||||
\normalsize
|
||||
|
||||
\vfill
|
||||
|
||||
From the abstract:
|
||||
|
||||
|
||||
We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.
|
||||
|
||||
\vfill
|
||||
|
||||
\tiny
|
||||
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
|
||||
\normalsize
|
||||
|
||||
|
||||
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)
|
||||
|
||||
::: columns
|
||||
:::: {.column width=60%}
|
||||
\footnotesize
|
||||
```python
|
||||
import xgboost as xgb
|
||||
|
||||
XGBreg = xgb.sklearn.XGBRegressor()
|
||||
|
||||
XGBreg.fit(X_train, y_train)
|
||||
|
||||
y_pred = XGBreg.predict(X_test)
|
||||
|
||||
import numpy as np
from sklearn.metrics import mean_squared_error
|
||||
rms = np.sqrt(mean_squared_error(y_test, y_pred))
|
||||
print(f"root mean square error {rms:.2f}")
|
||||
```
|
||||
|
||||
\textcolor{gray}{This gives:}
|
||||
|
||||
`root mean square error 9.68`
|
||||
::::
|
||||
:::: {.column width=40%}
|
||||
\vspace{6ex}
|
||||
![](figures/critical_temperature.pdf)
|
||||
::::
|
||||
:::
|
||||
|
||||
## Exercise 1: Compare different decision tree classifiers
|
||||
|
||||
\small
|
||||
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
|
||||
|
||||
\vspace{5ex}
|
||||
|
||||
Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline
|
||||
|
||||
\vspace{2ex}
|
||||
|
||||
Is there a classifier that clearly performs best?
|
||||
|
||||
|
||||
## Exercise 2: Apply XGBoost classifier to MAGIC data set
|
||||
|
||||
\small
|
||||
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
|
||||
\normalsize
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
# train XGBoost boosted decision tree
|
||||
import xgboost as xgb
|
||||
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
|
||||
```
|
||||
\normalsize
|
||||
|
||||
\small
|
||||
a) Plot predicted probabilities for the test sample for signal and background events (\texttt{plt.hist}; see the sketch below)
|
||||
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
|
||||
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three importance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
|
||||
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
|
||||
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
|
||||
\normalsize
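A possible starting point for part a), assuming the usual split `X_train, X_test, y_train, y_test` (names are illustrative):

\footnotesize
```python
import matplotlib.pyplot as plt
XGBclassifier.fit(X_train, y_train)
prob = XGBclassifier.predict_proba(X_test)[:, 1]       # P(signal)
plt.hist(prob[y_test == 1], bins=50, alpha=0.5, label="signal")
plt.hist(prob[y_test == 0], bins=50, alpha=0.5, label="background")
plt.xlabel("predicted probability")
plt.legend()
```
\normalsize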
|
||||
|
||||
|
||||
## Exercise 3: Feature importance
|
||||
|
||||
\small
|
||||
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
|
||||
\normalsize
|
||||
|
||||
\vspace{3ex}
|
||||
|
||||
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
|
||||
|
||||
|
||||
## Exercise 4: Interpret a classifier with SHAP values
|
||||
|
||||
SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept from cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
|
||||
|
||||
\vfill
|
||||
|
||||
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance (a minimal usage sketch follows below).
|
||||
|
||||
a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
|
||||
|
||||
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?
|
||||
|
||||
c) Do the same for the superconductivity data set. What are the three most important features?
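A minimal sketch of the SHAP workflow for a tree-based model, assuming a trained classifier `model` and a feature matrix `X` (both names illustrative):

\footnotesize
```python
import shap

explainer = shap.TreeExplainer(model)    # fast, exact for tree ensembles
shap_values = explainer.shap_values(X)   # one value per event and feature
shap.summary_plot(shap_values, X)        # global view of feature importance
```
\normalsize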
|
||||
|
||||
|
||||
|
||||
|
||||
|
BIN
slides/figures/03_ml_basics_galton_linear_regression_iminuit.pdf
Normal file
BIN
slides/figures/03_ml_basics_log_regr_heart_disease.pdf
Normal file
BIN
slides/figures/03_ml_basics_logistic_regression.pdf
Normal file
BIN
slides/figures/L1vsL2.pdf
Normal file
BIN
slides/figures/activation_functions.png
Normal file
After Width: | Height: | Size: 249 KiB |
BIN
slides/figures/adversarial_attack.png
Normal file
After Width: | Height: | Size: 2.3 MiB |
BIN
slides/figures/ai_history.png
Normal file
After Width: | Height: | Size: 386 KiB |
BIN
slides/figures/ai_ml_dl.pdf
Normal file
BIN
slides/figures/ann.png
Normal file
After Width: | Height: | Size: 186 KiB |
BIN
slides/figures/anomaly_detection.png
Normal file
After Width: | Height: | Size: 123 KiB |
BIN
slides/figures/autoencoder_example.pdf
Normal file
BIN
slides/figures/bdt.png
Normal file
After Width: | Height: | Size: 214 KiB |
BIN
slides/figures/book-murphy.png
Normal file
After Width: | Height: | Size: 328 KiB |
BIN
slides/figures/book_deep_learning_for_physics_research.png
Normal file
After Width: | Height: | Size: 117 KiB |
BIN
slides/figures/boston_house_prices.pdf
Normal file
BIN
slides/figures/cnn.png
Normal file
After Width: | Height: | Size: 114 KiB |
BIN
slides/figures/cnn_conv_layer.png
Normal file
After Width: | Height: | Size: 34 KiB |
BIN
slides/figures/cnn_fully_connected.png
Normal file
After Width: | Height: | Size: 163 KiB |
BIN
slides/figures/cnn_pooling.png
Normal file
After Width: | Height: | Size: 132 KiB |
BIN
slides/figures/cnn_sliding_filter.png
Normal file
After Width: | Height: | Size: 112 KiB |
BIN
slides/figures/critical_temperature.pdf
Normal file
BIN
slides/figures/cross_val.png
Normal file
After Width: | Height: | Size: 53 KiB |
BIN
slides/figures/decision_boundaries.png
Normal file
After Width: | Height: | Size: 161 KiB |
BIN
slides/figures/decision_trees_feature_space.png
Normal file
After Width: | Height: | Size: 135 KiB |
BIN
slides/figures/deep_learning_book.png
Normal file
After Width: | Height: | Size: 158 KiB |
BIN
slides/figures/deep_learning_with_python.png
Normal file
After Width: | Height: | Size: 116 KiB |
BIN
slides/figures/deepl.png
Normal file
After Width: | Height: | Size: 511 KiB |
BIN
slides/figures/dnn.png
Normal file
After Width: | Height: | Size: 519 KiB |
BIN
slides/figures/dropout.png
Normal file
After Width: | Height: | Size: 424 KiB |
BIN
slides/figures/example_overtraining.png
Normal file
After Width: | Height: | Size: 229 KiB |
BIN
slides/figures/feature_transformation.png
Normal file
After Width: | Height: | Size: 394 KiB |
BIN
slides/figures/fisher.png
Normal file
After Width: | Height: | Size: 44 KiB |
BIN
slides/figures/fisher_linear_decision_boundary.png
Normal file
After Width: | Height: | Size: 139 KiB |
BIN
slides/figures/gan.png
Normal file
After Width: | Height: | Size: 116 KiB |
BIN
slides/figures/gradient_descent.png
Normal file
After Width: | Height: | Size: 226 KiB |
BIN
slides/figures/gradient_descent_cmp.png
Normal file
After Width: | Height: | Size: 216 KiB |
BIN
slides/figures/hands_on_machine_learning.png
Normal file
After Width: | Height: | Size: 71 KiB |
BIN
slides/figures/handwritten_digits.png
Normal file
After Width: | Height: | Size: 34 KiB |
BIN
slides/figures/heart_table.png
Normal file
After Width: | Height: | Size: 65 KiB |
BIN
slides/figures/imagenet.png
Normal file
After Width: | Height: | Size: 1.5 MiB |
BIN
slides/figures/imagenet_challenge.png
Normal file
After Width: | Height: | Size: 422 KiB |
BIN
slides/figures/iminuit_minos_scan-1.png
Normal file
After Width: | Height: | Size: 25 KiB |
BIN
slides/figures/iminuit_minos_scan-2.png
Normal file
After Width: | Height: | Size: 49 KiB |
BIN
slides/figures/iris_dataset.png
Normal file
After Width: | Height: | Size: 659 KiB |
BIN
slides/figures/keras.png
Normal file
After Width: | Height: | Size: 2.5 KiB |
BIN
slides/figures/knn.png
Normal file
After Width: | Height: | Size: 116 KiB |
BIN
slides/figures/logistic_fct.png
Normal file
After Width: | Height: | Size: 63 KiB |
BIN
slides/figures/loss_fct.png
Normal file
After Width: | Height: | Size: 40 KiB |
BIN
slides/figures/magic_photo.png
Normal file
After Width: | Height: | Size: 6.0 MiB |
BIN
slides/figures/magic_photo_small.png
Normal file
After Width: | Height: | Size: 1.3 MiB |
BIN
slides/figures/magic_shower_em_had.png
Normal file
After Width: | Height: | Size: 3.9 MiB |
BIN
slides/figures/magic_shower_em_had_small.png
Normal file
After Width: | Height: | Size: 667 KiB |
BIN
slides/figures/magic_shower_parameters.png
Normal file
After Width: | Height: | Size: 1.1 MiB |
BIN
slides/figures/magic_sketch.png
Normal file
After Width: | Height: | Size: 1.1 MiB |
BIN
slides/figures/matplotlib_Figure_1.png
Normal file
After Width: | Height: | Size: 41 KiB |
BIN
slides/figures/matplotlib_Figure_2.png
Normal file
After Width: | Height: | Size: 98 KiB |
BIN
slides/figures/matplotlib_Figure_3.png
Normal file
After Width: | Height: | Size: 11 KiB |
BIN
slides/figures/matplotlib_Figure_4.png
Normal file
After Width: | Height: | Size: 543 KiB |
BIN
slides/figures/mini_boone_decisions_tree.png
Normal file
After Width: | Height: | Size: 248 KiB |
BIN
slides/figures/ml_example_spam.png
Normal file
After Width: | Height: | Size: 1.1 MiB |
BIN
slides/figures/mlp.png
Normal file
After Width: | Height: | Size: 204 KiB |
BIN
slides/figures/mnist.png
Normal file
After Width: | Height: | Size: 216 KiB |
BIN
slides/figures/monitoring_overtraining.png
Normal file
After Width: | Height: | Size: 137 KiB |
BIN
slides/figures/mva.png
Normal file
After Width: | Height: | Size: 788 KiB |
BIN
slides/figures/mva_nn.png
Normal file
After Width: | Height: | Size: 135 KiB |
BIN
slides/figures/neuron.png
Normal file
After Width: | Height: | Size: 306 KiB |
BIN
slides/figures/nn_decision_boundary.png
Normal file
After Width: | Height: | Size: 907 KiB |
BIN
slides/figures/pandas_crosstabplot.png
Normal file
After Width: | Height: | Size: 15 KiB |
BIN
slides/figures/pandas_histogramm.png
Normal file
After Width: | Height: | Size: 12 KiB |
BIN
slides/figures/pandas_scatterplot.png
Normal file
After Width: | Height: | Size: 47 KiB |
BIN
slides/figures/pdf_from_2d_histogram.png
Normal file
After Width: | Height: | Size: 170 KiB |
BIN
slides/figures/perceptron_photo.png
Normal file
After Width: | Height: | Size: 695 KiB |
BIN
slides/figures/perceptron_retina.png
Normal file
After Width: | Height: | Size: 79 KiB |
BIN
slides/figures/perceptron_weighted_sum.png
Normal file
After Width: | Height: | Size: 69 KiB |
BIN
slides/figures/perceptron_with_threshold.png
Normal file
After Width: | Height: | Size: 34 KiB |
BIN
slides/figures/regularization.png
Normal file
After Width: | Height: | Size: 344 KiB |
BIN
slides/figures/relu.png
Normal file
After Width: | Height: | Size: 91 KiB |
BIN
slides/figures/rootOptions.png
Normal file
After Width: | Height: | Size: 278 KiB |
BIN
slides/figures/scikit-learn.png
Normal file
After Width: | Height: | Size: 18 KiB |
BIN
slides/figures/sigmoid.png
Normal file
After Width: | Height: | Size: 104 KiB |
BIN
slides/figures/signal_background_distr.png
Normal file
After Width: | Height: | Size: 144 KiB |
BIN
slides/figures/signal_purity.png
Normal file
After Width: | Height: | Size: 107 KiB |
BIN
slides/figures/stochastic_gradient_descent.png
Normal file
After Width: | Height: | Size: 131 KiB |
BIN
slides/figures/supervised_learning_car_plane.png
Normal file
After Width: | Height: | Size: 651 KiB |
BIN
slides/figures/supervised_nutshell.png
Normal file
After Width: | Height: | Size: 76 KiB |
BIN
slides/figures/tensorflow.png
Normal file
After Width: | Height: | Size: 14 KiB |
BIN
slides/figures/tf_playground.png
Normal file
After Width: | Height: | Size: 645 KiB |
BIN
slides/figures/tree_pruning_slides.png
Normal file
After Width: | Height: | Size: 102 KiB |
BIN
slides/figures/underfitting_overfitting.pdf
Normal file
BIN
slides/figures/underfitting_overfitting_001.png
Normal file
After Width: | Height: | Size: 50 KiB |
BIN
slides/figures/videogame.png
Normal file
After Width: | Height: | Size: 95 KiB |
BIN
slides/figures/xor.png
Normal file
After Width: | Height: | Size: 44 KiB |
BIN
slides/figures/xor_like_data.pdf
Normal file
563
slides/fit_intro.md
Normal file
@ -0,0 +1,563 @@
|
||||
---
|
||||
title: |
|
||||
| Introduction to Data Analysis and Machine Learning in Physics:
|
||||
| 2. Data modeling and fitting
|
||||
|
||||
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
|
||||
date: "Studierendentage, 11-14 April 2022"
|
||||
---
|
||||
|
||||
## Data modeling and fitting - introduction
|
||||
|
||||
Data analysis is a process of understanding and modeling measured
|
||||
data. The goal is to find patterns and to obtain inferences about the
|
||||
underlying structure.
|
||||
|
||||
* There are two approaches to statistical data modeling
|
||||
* Hypothesis testing: is our data compatible with a certain model?
|
||||
* Determination of model parameters: use the data to determine the parameters
|
||||
of a (theoretical) model
|
||||
|
||||
* For the determination of model parameters
|
||||
* Analysis of data distributions $\rightarrow$ mean, variance,
|
||||
median, FWHM, .... \newline
|
||||
allows for an approximate determination of model parameters
|
||||
|
||||
* Data fitting with the least square method $\rightarrow$ an iterative
|
||||
process which minimizes the deviation of a model described by parameters
|
||||
from data. This determines the optimal values and uncertainties
|
||||
of the parameters.
|
||||
|
||||
* Maximum likelihood fitting $\rightarrow$ find a set of model parameters
|
||||
which most likely describe the data by maximizing the likelihood
|
||||
function.
|
||||
|
||||
The parameter determination by minimization is an integral part of machine
|
||||
learning approaches, where a system learns patterns and predicts
|
||||
related ones. This is the focus in the upcoming days.
|
||||
|
||||
## Data modeling and fitting - introduction
|
||||
|
||||
Data analysis is a process of understanding and modeling measured
|
||||
data. The goal is to find patterns and to obtain inferences about the
|
||||
underlying structure.
|
||||
|
||||
* There are two approaches to statistical data modeling
|
||||
* Hypothesis testing: is our data compatible with a certain model?
|
||||
* Determination of model parameters: use the data to determine the parameters
|
||||
of a (theoretical) model
|
||||
|
||||
* For the determination of model parameters
|
||||
* Analysis of data distributions $\rightarrow$ mean, variance,
|
||||
median, FWHM, .... \newline
|
||||
allows for an approximate determination of model parameters
|
||||
|
||||
\setbeamertemplate{itemize subitem}{\color{red}\tiny$\blacksquare$}
|
||||
* \textcolor{blue}{Data fitting with the least square method
|
||||
$\rightarrow$ an iterative
|
||||
process which minimizes the deviation of a model described by parameters
|
||||
from data. This determines the optimal values and uncertainties
|
||||
of the parameters.}
|
||||
|
||||
\setbeamertemplate{itemize subitem}{\color{blue}\tiny$\blacktriangleright$}
|
||||
* Maximum likelihood fitting $\rightarrow$ find a set of model parameters
|
||||
which most likely describe the data by maximizing the likelihood
|
||||
function.
|
||||
|
||||
The parameter determination by minimization is an integral part of machine
|
||||
learning approaches, where a system learns patterns and predicts
|
||||
related ones. This is the focus in the upcoming days.
|
||||
|
||||
|
||||
|
||||
## Least Square (LS) Method (1)
|
||||
|
||||
The method determines the \textcolor{blue}{optimal parameters of functions
|
||||
fitted to Gaussian distributed measurements}.
|
||||
|
||||
Let's consider a sample of $n$ measurements $y_{i}$ and a parametrized
|
||||
description of the measurement $\eta_{i} = f(x_{i} | \theta)$
|
||||
with a parameter set $\theta = \theta_{1}, \theta_{2} ,.... \theta_{k}$,
|
||||
independent values $x_{i}$ and measurement errors $\sigma_{i}$.
|
||||
|
||||
The parameter set should be determined such that
|
||||
\begin{equation*}
|
||||
\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} = \sum \limits_{i=1}^{n} \frac{(y_i- f(x_i|\theta))^2}{\sigma_i^2} \longrightarrow \, \text{minimal}}
|
||||
\end{equation*}
|
||||
In case of correlated measurements the covariance matrix of the $y_{i}$ has to
|
||||
be taken into account. This is accomplished by defining a weight matrix from
|
||||
the covariance matrix of the input data. A decorrelation of the input data
|
||||
should be considered.
|
||||
\vspace{0.2cm}
|
||||
|
||||
$S$ follows a $\chi^{2}$-distribution with $(n-k)$ degrees of freedom.
|
||||
|
||||
## Least Square (LS) Method (2)
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
* Example LS-method
|
||||
\vspace{0.2cm}
|
||||
|
||||
Often the fit function $f(x, \theta)$ is linear in
|
||||
$\theta = \theta_{1}, \theta_{2} ,.... \theta_{k}$
|
||||
\vspace{0.2cm}
|
||||
|
||||
$f(x | \theta) = \theta_{1} f_{1}(x) + .... + \theta_{k} f_{k}(x)$
|
||||
\vspace{0.2cm}
|
||||
|
||||
If the model is a straight line and our parameters are $\theta_{1}$ and
|
||||
$\theta_{2}$ $(f_{1}(x) = 1,$ $f_{2}(x) = x)$ we have
|
||||
$f(x | \theta) = \theta_{1} + \theta_{2} x$
|
||||
\vspace{0.2cm}
|
||||
|
||||
The LS equation is
|
||||
\vspace{0.2cm}
|
||||
|
||||
$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum
|
||||
\limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} - x_{i}
|
||||
\theta_{2})^2}{\sigma_i^2 }}$ \hspace{0.4cm} and with
|
||||
\vspace{0.2cm}
|
||||
|
||||
$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{-2
|
||||
(y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$ \hspace{0.4cm} and \hspace{0.4cm}
|
||||
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{-2 x_i (y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$
|
||||
\vspace{0.2cm}
|
||||
|
||||
the parameters $\theta_{1}$ and $\theta_{2}$ can be determined.
|
||||
|
||||
\vspace{0.2cm}
|
||||
\textcolor{olive}{In case of linear fit functions, solutions can be found by matrix inversion, as sketched below}
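A minimal numpy sketch of the straight-line case (the data arrays are placeholders):

\footnotesize
```python
import numpy as np

x = np.array([1., 2., 3., 4.])            # placeholder data
y = np.array([1.1, 2.1, 2.9, 4.2])
sigma = np.array([0.1, 0.1, 0.2, 0.1])

A = np.vstack([np.ones_like(x), x]).T     # design matrix: f_1(x)=1, f_2(x)=x
W = np.diag(1.0 / sigma**2)               # weight matrix from the errors
cov = np.linalg.inv(A.T @ W @ A)          # covariance of the parameters
theta = cov @ (A.T @ W @ y)               # solves (A^T W A) theta = A^T W y
print(theta, np.sqrt(np.diag(cov)))       # theta_1, theta_2 and their errors
```
\normalsize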
|
||||
|
||||
\vfill
|
||||
|
||||
## Least Square (LS) Method (3)
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* Use of a nonlinear fit function $f(x, \theta)$ like \hspace{0.4cm}
|
||||
$f(x | \theta) = \theta_{1} \cdot e^{-\theta_{2} x}$
|
||||
\vspace{0.2cm}
|
||||
|
||||
results in the LS equation
|
||||
\vspace{0.2cm}
|
||||
|
||||
$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum \limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} \cdot e^{-\theta_{2} x_{i}})^2}{\sigma_i^2 }}$ \hspace{0.4cm}
|
||||
\vspace{0.2cm}
|
||||
|
||||
which we have to minimize
|
||||
\vspace{0.2cm}
|
||||
|
||||
$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{ 2 e^{-2 \theta_2 x_i} ( \theta_1 - y_i e^{\theta_2 x_i} )} {\sigma_i^2 } = 0$ \hspace{0.4cm} and \hspace{0.4cm}
|
||||
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{ 2 \theta_1 x_i e^{-2 \theta_2 x_i} (y_i e^{\theta_2 x_i} - \theta_1)} {\sigma_i^2 } = 0$
|
||||
|
||||
\vspace{0.4cm}
|
||||
|
||||
In a nonlinear system, the LS Ansatz leads to derivatives which are
|
||||
functions of the independent variable and the parameters $\color{red}\rightarrow$ \textcolor{olive}{no closed-form solutions}
|
||||
\vspace{0.4cm}
|
||||
|
||||
In general, we have gradient equations which don't have closed-form solutions.
|
||||
There are several methods which, together with numerical
|
||||
approximations, allow one to find a minimum: the Gauss–Newton algorithm, the
|
||||
Levenberg–Marquardt algorithm, gradient descent methods and also direct
|
||||
search methods.
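In practice such nonlinear least-squares problems are solved numerically, e.g. with `scipy.optimize.curve_fit`, which by default uses the Levenberg–Marquardt method mentioned above (a sketch with made-up data):

\footnotesize
```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, theta1, theta2):
    return theta1 * np.exp(-theta2 * x)

x = np.linspace(0.1, 4.0, 20)                        # made-up data
y = f(x, 2.0, 0.8) + np.random.normal(0., 0.05, 20)
sigma = np.full(20, 0.05)

popt, pcov = curve_fit(f, x, y, p0=[1.0, 1.0], sigma=sigma,
                       absolute_sigma=True)
print(popt, np.sqrt(np.diag(pcov)))                  # parameters and errors
```
\normalsize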
|
||||
|
||||
## Minuit - a program package for minimization (1)
|
||||
|
||||
In general, data fitting and also the training of machine learning algorithms lead
|
||||
to a function minimization problem. In the years
|
||||
1975-1980 F. James (CERN) developed
|
||||
a FORTRAN-based package, [\textcolor{violet}{MINUIT}](http://seal.web.cern.ch/seal/documents/minuit/mntutorial.pdf), which is a framework to handle
|
||||
multiparameter minimization and compute the best-fit parameter values and
|
||||
uncertainties, including correlations between the parameters.
|
||||
\vspace{0.2cm}
|
||||
|
||||
The user provides a minimization function
|
||||
$F(X,P)$ with the parameter space $P=(p_1,....p_k)$ and
|
||||
variable space $X$ (also multi-dimensional). There is an interface via
|
||||
functions which influence
|
||||
the minimization process. MINUIT provides
|
||||
[\textcolor{violet}{error calculations}](http://seal.web.cern.ch/seal/documents/minuit/mnerror.pdf) including correlations for the parameter space by evaluating the shape of the function in some neighbourhood of the minimum.
|
||||
\vspace{0.2cm}
|
||||
|
||||
The package
|
||||
now has an object-oriented implementation, the [\textcolor{violet}{Minuit2 library}](https://root.cern.ch/doc/master/Minuit2Page.html), written
|
||||
in C++.
|
||||
\vspace{0.2cm}
|
||||
|
||||
During the minimization, $F(X,P)$ is evaluated for various parameter sets $P$. For the
|
||||
choice of $P=(p_1,....p_k)$ different methods are used
|
||||
|
||||
## Minuit - a program package for minimization (2)
|
||||
|
||||
\vspace{0.4cm}
|
||||
\textcolor{olive}{SEEK}: Search for the minimum with Monte Carlo methods, mostly used at the start
|
||||
of the minimization with unknown starting values. It is not a converging
|
||||
algorithm.
|
||||
\vspace{0.2cm}
|
||||
|
||||
\textcolor{olive}{SIMPLX}:
|
||||
Uses the simplex method of Nelder and Mead. Function values are compared
|
||||
in the parameter space. Via step size control the minimum is approached.
|
||||
Parameter errors are only approximate, no covariance matrix is calculated.
|
||||
\vspace{0.2cm}
|
||||
|
||||
<!---
|
||||
A simplex is the smallest n-dimensional figure with n+1 corners. By reflecting
|
||||
one point in the hyperplane of the other points, the simplex adapts itself to the
|
||||
function surface.
|
||||
-->
|
||||
|
||||
\textcolor{olive}{MIGRAD}:
|
||||
Uses an algorithm of R. Fletcher, which takes the function and the gradient
|
||||
to approach the minimum with a variable metric method. An error matrix and
|
||||
correlation coefficients are available
|
||||
\vspace{0.2cm}
|
||||
|
||||
\textcolor{olive}{HESSE}:
|
||||
Calculates the Hessian matrix of second derivatives and determines the
|
||||
covariance matrix.
|
||||
\vspace{0.2cm}
|
||||
|
||||
\textcolor{olive}{MINOS}:
|
||||
Calculates (asymmetric) errors using likelihood profiles.
|
||||
The algorithm for finding the positive and negative MINOS errors for parameter
|
||||
$n$ consists of varying parameter $n$, each time minimizing $F(X,P)$ with respect to
|
||||
all the others.
|
||||
\vspace{0.2cm}
|
||||
|
||||
## Minuit - a program package for minimization (3)
|
||||
|
||||
\vspace{0.4cm}
|
||||
|
||||
Fit process with the MINUIT package
|
||||
\vspace{0.2cm}
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* The individual steps described above can be called several times and in a different order during the minimization process.
|
||||
|
||||
* Each of the parameters $p_i$ of $P=(p_1,....p_k)$ can be set constant and
|
||||
released during the minimization steps.
|
||||
|
||||
* Problems are expected in models with strong correlation between
|
||||
parameters $\rightarrow$ change model to uncorrelated definitions
|
||||
|
||||
* Local minima, edges/steps or undefined ranges in $F(X,P)$ are problematic
|
||||
$\rightarrow$ simplify your model
|
||||
|
||||
\vspace{3cm}
|
||||
|
||||
|
||||
## Minuit2 - The iminuit package
|
||||
|
||||
\vspace{0.4cm}
|
||||
|
||||
[\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) is
|
||||
a Jupyter-friendly Python interface for the Minuit2 C++ library.
|
||||
\vspace{0.2cm}
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* The class `iminuit.Minuit` instantiates the Minuit object. The minimizer
|
||||
function is given as argument. Basic steering of the fit
|
||||
like setting start parameters, error definition and print level is also
|
||||
done here.
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
from iminuit import Minuit
|
||||
def fcn(x, y, z): # definition of the minimizer function
|
||||
return (x - 2) ** 2 + (y - x) ** 2 + (z - 4) ** 2
|
||||
m = Minuit(fcn, x=0, y=0, z=0, errordef=1 , print_level=1)
|
||||
```
|
||||
\normalsize
|
||||
|
||||
* Several methods determine the interaction with the fitting process, calls
|
||||
to `migrad` , `hesse` or printing of parameters and errors
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
......
|
||||
m.migrad() # run optimiser
|
||||
print(m.values , m.errors) # print results
|
||||
m.hesse() # run covariance estimator
|
||||
```
|
||||
\normalsize
|
||||
|
||||
## Minuit2 - iminuit example
|
||||
|
||||
\vspace{0.2cm}
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* The function `fcn` describes the model with parameters to be determined by
|
||||
data. `fcn` is minimal when the model parameters agree best with the data.
|
||||
`fcn` has positional arguments, one for each fit parameter. `iminuit`
|
||||
example fit:
|
||||
|
||||
[\textcolor{violet}{02\_fit\_exp\_fit\_iMinuit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_exp_fit_iMinuit.py)
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
......
|
||||
x = np.array([....],dtype='d') # measurements x
|
||||
y = np.array([....],dtype='d') # measurements y
|
||||
dy = np.array([....],dtype='d') # error in y
|
||||
def xp(a, b , c):
|
||||
return a * np.exp(b*x) + c
|
||||
# least-squares function = sum of data residuals squared
|
||||
def fcn(a,b,c):
|
||||
return np.sum((y - xp(a,b,c)) ** 2 / dy ** 2)
|
||||
# limit the range of b and fix parameter c
|
||||
m = Minuit(fcn,a=1,b=-0.7,c=1)
m.limits["b"] = (-1,0.1)
m.fixed["c"] = True
|
||||
m.migrad() # run minimizer
|
||||
m.fixed["c"] = False # release parameter c
|
||||
m.migrad() # rerun minimizer
|
||||
```
|
||||
\normalsize
|
||||
|
||||
* Might be useful to fix parameters or limit the range for some applications
|
||||
|
||||
## Minuit2 - iminuit (3)
|
||||
|
||||
\vspace{0.2cm}
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* Results and control information of the fit can be printed and accessed
|
||||
in the program.
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
......
|
||||
m = Minuit(fcn,....,print_level=1) # set flag in the initializer
|
||||
m.migrad() # run minimizer
|
||||
a_fit = m.values['a'] # get parameter value a
|
||||
a_fit_error = m.errors['a'] # get parameter error of a
|
||||
print (m.values,m.errors) # print results
|
||||
```
|
||||
\normalsize
|
||||
|
||||
* After processing Hesse, covariance and correlation information of the
|
||||
fit is available
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
......
|
||||
m.hesse() # run covariance estimator
|
||||
m.matrix() # get covariance matrix
|
||||
m.matrix(correlation=True) # get full correlation matrix
|
||||
cov = m.np_matrix() # save matrix to numpy
|
||||
cor = m.np_matrix(correlation=True)
|
||||
print(cor[0, 1]) # print correlation between parameter 1 and 2
|
||||
```
|
||||
\normalsize
|
||||
|
||||
## Minuit2 - iminuit (4)
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* Minos provides asymmetric uncertainty intervals and parameter contours by
|
||||
scanning one parameter and minimizing the function with respect to all other
|
||||
parameters for each scan point. Results are displayed with `matplotlib`.
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
......
|
||||
m.minos()
|
||||
print (m.get_merrors()['a'])
|
||||
m.draw_mnprofile('b')
|
||||
m.draw_mncontour('a', 'b', nsigma=4)
|
||||
```
|
||||
::: columns
|
||||
:::: {.column width=40%}
|
||||
![](figures/iminuit_minos_scan-1.png)
|
||||
::::
|
||||
:::: {.column width=40%}
|
||||
![](figures/iminuit_minos_scan-2.png)
|
||||
::::
|
||||
:::
|
||||
|
||||
## Exercise 3
|
||||
|
||||
Plot the following data with matplotlib as in the iminuit example:
|
||||
|
||||
\footnotesize
|
||||
```
|
||||
x: 0.2,0.4,0.6,0.8,1.,1.2,1.4,1.6,1.8,2.,2.2,2.4,2.6,2.8,3.,3.2,
|
||||
3.4,3.6, 3.8,4.
|
||||
y: 0.04,0.021,0.035,0.03,0.029,0.019,0.024,0.018,0.019,0.022,0.02,
|
||||
0.025,0.018,0.024,0.019,0.021,0.03,0.019,0.03,0.024
|
||||
dy: 1.792,1.695,1.541,1.514,1.427,1.399,1.388,1.270,1.262,1.228,1.189,
|
||||
1.182,1.121,1.129,1.124,1.089,1.092,1.084,1.058,1.057
|
||||
```
|
||||
\normalsize
|
||||
\setbeamertemplate{itemize item}{\color{red}$\square$}
|
||||
|
||||
* In the example iminuit fit `02_fit_exp_fit_iMinuit.ipynb`, replace the
|
||||
exponential function by a 3rd-order polynomial and perform the fit
|
||||
|
||||
* Compare the correlation of the parameters of the exponential and
|
||||
the polynomial fit
|
||||
|
||||
* What defines the fit quality? Give an estimate
|
||||
|
||||
\small
|
||||
Solution: [\textcolor{violet}{02\_fit\_ex\_3\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_3_sol.py) \normalsize
|
||||
|
||||
## Exercise 4
|
||||
|
||||
Plot the following data with matplotlib:
|
||||
|
||||
\footnotesize
|
||||
```
|
||||
x: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
|
||||
dx: 0.1,0.1,0.5,0.1,0.5,0.1,0.5,0.1,0.5,0.1
|
||||
y: 1.1,2.3,2.7,3.2,3.1,2.4,1.7,1.5,1.5,1.7
|
||||
dy: 0.15,0.22,0.29,0.39,0.31,0.21,0.13,0.15,0.19,0.13
|
||||
```
|
||||
\normalsize
|
||||
\setbeamertemplate{itemize item}{\color{red}$\square$}
|
||||
|
||||
* Perform a fit with iminuit. Which model do you use?
|
||||
|
||||
* Plot the resulting fit function in the graph with the data
|
||||
|
||||
* Print the covariance matrix. Can we improve the errors?
|
||||
|
||||
* Can you draw a contour plot of two of the fit parameters?
|
||||
|
||||
\small
|
||||
Solution: [\textcolor{violet}{02\_fit\_ex\_4\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_4_sol.py) \normalsize
|
||||
|
||||
|
||||
## PyROOT
|
||||
|
||||
[\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) is the python binding for the C++ data analysis toolkit [\textcolor{violet}{ROOT}](https://root.cern/) developed with and for the LHC community. You can access the full
|
||||
ROOT functionality from Python while
|
||||
benefiting from the performance of the ROOT C++ libraries. The PyROOT bindings
|
||||
are automatic and dynamic and are able to interoperate with widely-used Python
|
||||
data-science libraries such as `NumPy`, `pandas`, `SciPy`, `scikit-learn` and `TensorFlow`.
|
||||
|
||||
* ROOT/PyROOT can be installed easily within anaconda3 (ROOT version 6.22.02
|
||||
or later ) or is available in the
|
||||
[\textcolor{violet}{CIP jupyter2 Hub}](https://jupyter2.kip.uni-heidelberg.de/)
|
||||
|
||||
* Tools for statistical analysis, a math library with optimized algorithms,
|
||||
multivariate analysis, visualization and simulation of data.
|
||||
|
||||
* Storing data including objects and classes with compression in files is a
|
||||
very powerful feature for any data analysis project
|
||||
|
||||
* Within PyROOT Minuit2 can be accessed easily either with predefined functions
|
||||
or your own function definition
|
||||
|
||||
* For advanced statistical analyses and data modeling likelihood fitting with
|
||||
the packages **rooFit** and **rooStats** is available.
|
||||
|
||||
|
||||
##
|
||||
|
||||
* Example reading the invariant mass measurements of a $D^0$ from a text file
|
||||
and determining $\mu$ and $\sigma$ \hspace{1.0cm} \small
|
||||
[\textcolor{violet}{02\_fit\_histFit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_histFit.py)
|
||||
\normalsize
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
import numpy as np
|
||||
import math
|
||||
from ROOT import TCanvas, TFile, TH1D, TF1, TMinuit, TFitResult
|
||||
data = np.genfromtxt('D0Mass.txt', dtype='d') # read data from text file
|
||||
c = TCanvas('c','D0 Mass',200,10,700,500)   # instantiate output canvas
|
||||
d0 = TH1D('d0','D0 Mass',200,1700.,2000.)   # instantiate histogram
|
||||
for x in data :                             # fill data into histogram d0
|
||||
d0.Fill(x)
|
||||
def pyf_tf1_params(x, p): # define fit function
|
||||
return p[0] * math.exp (-0.5 * ((x[0] - p[1])**2 / p[2]**2))
|
||||
func = TF1("func",pyf_tf1_params,1840.,1880.,3)
|
||||
# func = TF1("func",'gaus',1840.,1880.) # use predefined function
|
||||
func.SetParameters(500.,1860.,5.5) # set start parameters
|
||||
myfit = d0.Fit(func,"S")   # fit function to the histogram data
|
||||
print ("Fit results: mean=",myfit.Parameter(0)," +/- ",myfit.ParError(0))
|
||||
c.Draw() # draw canvas
|
||||
myfile = TFile('myOutFile.root','RECREATE') # Open a ROOT file for output
|
||||
c.Write() # Write canvas
|
||||
d0.Write() # Write histogram
|
||||
myfile.Close() # close file
|
||||
```
|
||||
\normalsize
|
||||
|
||||
|
||||
##
|
||||
|
||||
* Fit Options
|
||||
\vspace{0.1cm}
|
||||
|
||||
::: columns
|
||||
:::: {.column width=2%}
|
||||
::::
|
||||
:::: {.column width=98%}
|
||||
![](figures/rootOptions.png)
|
||||
::::
|
||||
:::
|
||||
|
||||
## Exercise 5
|
||||
|
||||
Read text file [\textcolor{violet}{FitTestData.txt}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/FitTestData.txt) and draw a histogram using PyROOT.
|
||||
\setbeamertemplate{itemize item}{\color{red}$\square$}
|
||||
|
||||
* Determine the mean and sigma of the signal distribution. Which function do
|
||||
you use for fitting?
|
||||
|
||||
* The option S fills the result object.
|
||||
|
||||
* Try to improve the errors of the fit values with MINOS using the option E
|
||||
and also try the option M to scan for a new minimum; option V provides more
|
||||
output.
|
||||
|
||||
* Fit the background outside the signal region; use the option R+ to add the
|
||||
function to your fit
|
||||
|
||||
\small
|
||||
Solution: [\textcolor{violet}{02\_fit\_ex\_5\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_5_sol.py) \normalsize
|
||||
|
||||
|
||||
## iPython Examples for Fitting
|
||||
|
||||
The different python packages are used in
|
||||
\textcolor{blue}{example iPython notebooks}
|
||||
to demonstrate the fitting of a third order polynomial to the same data
|
||||
available as numpy arrays.
|
||||
|
||||
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
|
||||
|
||||
* LSQ fit of a polynomial to data using Minuit2 with
|
||||
\textcolor{blue}{iminuit} and \textcolor{blue}{matplotlib} plot:
|
||||
|
||||
\small
|
||||
[\textcolor{violet}{02\_fit\_iminuitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_iminuitFit.ipynb)
|
||||
\normalsize
|
||||
|
||||
* Graph fitting with \textcolor{blue}{pyROOT} with options using a python
|
||||
function including confidence level plot:
|
||||
|
||||
\small
|
||||
[\textcolor{violet}{02\_fit\_fitGraph.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_fitGraph.ipynb)
|
||||
\normalsize
|
||||
|
||||
* Graph fitting with \textcolor{blue}{numpy} and confidence level
|
||||
plotting with \textcolor{blue}{matplotlib}:
|
||||
|
||||
\small
|
||||
[\textcolor{violet}{02\_fit\_numpyFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_numpyFit.ipynb)
|
||||
\normalsize
|
||||
|
||||
* Graph fitting with a polynomial fit of \textcolor{blue}{scikit-learn} and
|
||||
plotting with \textcolor{blue}{matplotlib}:
|
||||
|
||||
|
||||
\small
|
||||
[\textcolor{violet}{02\_fit\_scikitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_scikitFit.ipynb)
|
||||
\normalsize
|
830
slides/intro_python.md
Normal file
@ -0,0 +1,830 @@
|
||||
---
|
||||
title: |
|
||||
| Introduction to Data Analysis and Machine Learning in Physics:
|
||||
| 1. Introduction to python
|
||||
|
||||
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
|
||||
date: "Studierendentage, 11-14 April 2022"
|
||||
---
|
||||
|
||||
## Outline of the $1^{st}$ day
|
||||
|
||||
* Technical instructions for your interactions with the CIP pool, like
|
||||
* using the jupyter hub
|
||||
* using python locally in your own Linux environment (Anaconda)
|
||||
* access the CIP pool from your own Windows or Linux system
|
||||
* transfer data from and to the CIP pool
|
||||
|
||||
Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.pdf)\normalsize
|
||||
|
||||
* Summary of NumPy
|
||||
|
||||
* Plotting with matplotlib
|
||||
|
||||
* Input / output of data
|
||||
|
||||
* Summary of pandas
|
||||
|
||||
* Fitting with iminuit and pyROOT
|
||||
|
||||
|
||||
## A glimpse into python classes
|
||||
|
||||
The following Python libraries are important for data analysis and machine
|
||||
learning and will be used during the course:
|
||||
|
||||
* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
|
||||
multi-dimensional arrays and matrices, along with high-level
|
||||
mathematical functions to operate on these arrays
|
||||
|
||||
* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library
|
||||
|
||||
* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
|
||||
mathematical algorithms for minimization, regression,
|
||||
fourier transformation, linear algebra and image processing
|
||||
|
||||
* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
|
||||
python wrapper to the data fitting toolkit
|
||||
[\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
|
||||
developed at CERN by F. James in the 1970s
|
||||
|
||||
* [\textcolor{violet}{pyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
|
||||
ROOT used at the LHC
|
||||
|
||||
* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
|
||||
Python, which makes extensive use of NumPy for high-performance
|
||||
linear algebra algorithms
|
||||
|
||||
## NumPy
|
||||
|
||||
\textcolor{blue}{NumPy} (Numerical Python) is an open source Python library,
|
||||
which contains multidimensional array and matrix data structures and methods
|
||||
to efficiently operate on these. The core object is
|
||||
a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
|
||||
allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
|
||||
with arrays and matrices} due to the extensive usage of compiled code.
|
||||
|
||||
* It is heavily used in numerous scientific python packages
|
||||
|
||||
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
|
||||
leads to recreation
|
||||
|
||||
* Array elements are all required to be of the same data type
|
||||
|
||||
* Facilitates advanced mathematical operations on large datasets
|
||||
|
||||
* See for a summary, e.g.
|
||||
\small
|
||||
[\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize
|
||||
|
||||
\vfill
|
||||
|
||||
::: columns
|
||||
:::: {.column width=30%}
|
||||
|
||||
::::
|
||||
:::
|
||||
|
||||
::: columns
|
||||
:::: {.column width=35%}
|
||||
|
||||
without NumPy

`c = []`
|
||||
|
||||
`for i in range(len(a)):`
|
||||
|
||||
`c.append(a[i]*b[i])`
|
||||
|
||||
::::
|
||||
|
||||
:::: {.column width=35%}
|
||||
|
||||
with NumPy
|
||||
|
||||
`c = a * b`
|
||||
|
||||
::::
|
||||
:::
|
||||
|
||||
<!---
|
||||
It seem we need to indent by hand.
|
||||
I don't manage to align under the bullet text
|
||||
If we do it with column the vertical space is with code sections not good
|
||||
If we do it without code section the vertical space is ok, but there is no
|
||||
code high lightning.
|
||||
See the different versions of the same page in the following
|
||||
-->
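A self-contained version of the comparison above (array sizes are arbitrary):

\footnotesize
```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# element-wise product with a plain python loop
c = []
for i in range(len(a)):
    c.append(a[i] * b[i])

# the same with NumPy: runs in compiled code and is much faster
c = a * b
```
\normalsize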
|
||||
|
||||
## NumPy - array basics
|
||||
|
||||
* numpy arrays build a grid of \textcolor{blue}{same type} values, which are indexed.
|
||||
The *rank* is the dimension of the array.
|
||||
There are methods to create and preset arrays.
|
||||
|
||||
\footnotesize
|
||||
|
||||
```python
|
||||
myA = np.array([2, 5 , 11]) # create rank 1 array (vector like)
|
||||
type(myA)                         # <class 'numpy.ndarray'>
|
||||
myA.shape # (3,)
|
||||
print(myA[2])                     # 11, access the 3rd element
|
||||
myA[0] = 12                       # set the 1st element to 12
|
||||
myB = np.array([[1,5],[7,9]]) # create rank 2 array
|
||||
myB.shape # (2,2)
|
||||
print(myB[0,0],myB[0,1],myB[1,1]) # 1 5 9
|
||||
myC = np.arange(6) # create rank 1 set to 0 - 5
|
||||
myC = myC.reshape(2,3)            # reshape returns a new (2,3) array
|
||||
|
||||
zero = np.zeros((2,5)) # 2 rows, 5 columns, set to 0
|
||||
one = np.ones((2,2)) # 2 rows, 2 columns, set to 1
|
||||
five = np.full((2,2), 5) # 2 rows, 2 columns, set to 5
|
||||
e = np.eye(2) # create 2x2 identity matrix
|
||||
```
|
||||
\normalsize
|
||||
|
||||
|
||||
## NumPy - array indexing (1)
|
||||
|
||||
* select slices of a numpy array
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
a = np.array([[1,2,3,4],
|
||||
[5,6,7,8], # 3 rows 4 columns array
|
||||
[9,10,11,12]])
|
||||
b = a[:2, 1:3] # subarray of 2 rows and
|
||||
array([[2, 3], # column 1 and 2
|
||||
[6, 7]])
|
||||
```
|
||||
\normalsize
|
||||
|
||||
* a slice of an array points into the same data; *modifying* it changes the original array!
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
b[0, 0] = 77 # b[0,0] and a[0,1] are 77
|
||||
|
||||
r1_row = a[1, :] # get 2nd row -> rank 1
|
||||
r1_row.shape # (4,)
|
||||
r2_row = a[1:2, :] # get 2nd row -> rank 2
|
||||
r2_row.shape # (1,4)
|
||||
a=np.array([[1,2],[3,4],[5,6]]) # set a , 3 rows 2 cols
|
||||
d=a[[0, 1, 2], [0, 1, 1]] # d contains [1 4 6]
|
||||
e=a[[1, 2], [1, 1]] # e contains [4 6]
|
||||
np.array([a[0,0],a[1,1],a[2,0]]) # address elements explicitly
|
||||
```
|
||||
\normalsize
|
||||
|
||||
|
||||
## NumPy - array indexing (2)
|
||||
|
||||
|
||||
* integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
a = np.array([[1,2,3,4],
|
||||
[5,6,7,8], # 3 rows 4 columns array
|
||||
[9,10,11,12]])
|
||||
p_a = np.array([0,2,0]) # Create an array of indices
|
||||
s = a[np.arange(3), p_a] # number the rows, p_a points to cols
|
||||
print (s) # s contains [1 7 9]
|
||||
a[np.arange(3),p_a] += 10 # add 10 to corresponding elements
|
||||
x=np.array([[8,2],[7,4]]) # create 2x2 array
|
||||
mask = (x > 5)                   # mask : array of booleans
|
||||
# [[True False]
|
||||
# [True False]]
|
||||
print(x[x>5]) # select elements, prints [8 7]
|
||||
```
|
||||
\normalsize
|
||||
|
||||
* data type in numpy - create according to input numbers or set explicitly
|
||||
|
||||
\footnotesize
|
||||
|
||||
```python
|
||||
x = np.array([1.1, 2.1]) # create float array
|
||||
print(x.dtype) # print float64
|
||||
y=np.array([1.1,2.9],dtype=np.int64) # create int64 array [1 2]
|
||||
```
|
||||
\normalsize
|
||||
|
||||
|
||||
## NumPy - functions
|
||||
|
||||
* math functions operate elementwise either as operator overload or as methods
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
x=np.array([[1,2],[3,4]],dtype=np.float64) # define 2x2 float array
|
||||
y=np.array([[3,1],[5,1]],dtype=np.float64) # define 2x2 float array
|
||||
s = x + y # elementwise sum
|
||||
s = np.add(x,y)
|
||||
s = np.subtract(x,y)
|
||||
s = np.multiply(x,y) # no matrix multiplication!
|
||||
s = np.divide(x,y)
|
||||
s = np.sqrt(x), np.exp(x), ...
|
||||
x @ y , or np.dot(x, y) # matrix product
|
||||
np.sum(x, axis=0) # sum of each column
|
||||
np.sum(x, axis=1) # sum of each row
|
||||
xT = x.T # transpose of x
|
||||
x = np.linspace(0,2*np.pi,100)             # get equally spaced points in x
|
||||
|
||||
r = np.random.default_rng(seed=42) # constructor random number class
|
||||
b = r.random((2,3)) # random 2x3 matrix
|
||||
```
|
||||
\normalsize
|
||||
|
||||
|
||||
|
||||
##
|
||||
|
||||
* broadcasting in numpy
|
||||
\vspace{0.4cm}
|
||||
|
||||
The term broadcasting describes how numpy treats arrays with different
|
||||
shapes during arithmetic operations
|
||||
|
||||
* add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
|
||||
$[b,b,b]$
|
||||
\vspace{0.2cm}
|
||||
|
||||
* add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
|
||||
$\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element wise
|
||||
\vspace{0.2cm}
|
||||
|
||||
* add 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ 1D array is broadcast
|
||||
across each row of the 2D array $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added element wise
|
||||
\vspace{0.2cm}
|
||||
|
||||
Arithmetic operations can only be performed when the shapes of the arrays
|
||||
are equal in each dimension, or one of them has size 1. Look
|
||||
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details
|
||||
|
||||
\footnotesize
|
||||
```python
|
||||
# Add a vector to each row of a matrix
|
||||
x = np.array([[1,2,3], [4,5,6]]) # x has shape (2, 3)
|
||||
v = np.array([1,2,3]) # v has shape (3,)
|
||||
x + v # [[2 4 6]
|
||||
# [5 7 9]]
|
||||
```
|
||||
\normalsize
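
A minimal sketch of the remaining broadcasting cases, including a column vector and an incompatible shape (not part of the original slide):

\footnotesize
```python
import numpy as np
a = np.array([[1,2,3], [4,5,6]])  # shape (2, 3)
a + 10                            # scalar broadcast to every element
col = np.array([[10], [20]])      # shape (2, 1)
a + col                           # [[11 12 13]
                                  #  [24 25 26]]
# a + np.array([1, 2])            # ValueError: shapes (2,3) and (2,) differ
```
\normalsize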

## Plot data

A popular library to present data is the `pyplot` module of `matplotlib`.

* Drawing a function in one plot

\footnotesize
::: columns
:::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range

# show the plot
plt.show()
```
::::
:::: {.column width=40%}
![](figures/matplotlib_Figure_1.png)
::::
:::

\normalsize

##

* Drawing subplots in one canvas

\footnotesize
::: columns
:::: {.column width=35%}
```python
...
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create first subplot and plot into it
plt.subplot(1,2,1)
plt.title('exp(x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create second subplot and plot into it
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine'])
# show the plot
plt.show()
```
::::
:::: {.column width=40%}
\vspace{3cm}
![](figures/matplotlib_Figure_2.png)
::::
:::
\normalsize

## Image data

The `image` module of the `matplotlib` library can be used to load images
into numpy arrays and to render them.

* There are 3 common formats for the numpy array

  * (M, N) scalar data used for greyscale images

  * (M, N, 3) for RGB images (each pixel has an array with the RGB color attached)

  * (M, N, 4) for RGBA images (each pixel has an array with the RGB color
    and transparency attached)

The method `imread` loads the image into an `ndarray`, which can be
manipulated.

The method `imshow` renders the image data.

\vspace{2cm}
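
A minimal sketch of the three array formats (assuming 8-bit color values; not part of the original slide):

\footnotesize
```python
import numpy as np
grey = np.zeros((200, 200), dtype=np.uint8)     # (M, N)    greyscale
rgb  = np.zeros((200, 200, 3), dtype=np.uint8)  # (M, N, 3) RGB
rgba = np.zeros((200, 200, 4), dtype=np.uint8)  # (M, N, 4) RGB + alpha
rgba[..., 3] = 255                              # set pixels fully opaque
```
\normalsize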

##

* Drawing pixel data and images

\footnotesize
::: columns
:::: {.column width=50%}

```python
....
# create data array with pixel position and RGB color code
width, height = 400, 400
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[175:225, 175:225] = [255, 0, 0]
x = np.random.randint(0, width, 100)  # upper bound is exclusive
y = np.random.randint(0, height, 100)
data[y, x] = [0,255,0]                # random green pixels (row=y, col=x)
plt.imshow(data)
plt.show()
....
import matplotlib.image as mpimg
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0]  # grab slice 0 of the colors
plt.imshow(mod_pic)   # use default color code, also
plt.colorbar()        # try cmap='hot'
plt.show()
```
::::
:::: {.column width=25%}
![](figures/matplotlib_Figure_3.png)
\vspace{1cm}
![](figures/matplotlib_Figure_4.png)
::::
:::
\normalsize

## Input / output

For the analysis of measured data efficient input / output plays an
important role. In numpy, `ndarrays` can be saved to and read in from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension)
which contain data, shape, dtype and other information required to
reconstruct the `ndarray` from the disk file.

\footnotesize
```python
r = np.random.default_rng()      # instantiate random number generator
a = r.random((4,3))              # random 4x3 array
np.save('myBinary.npy', a)       # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b to one .npz archive
......
b = np.load('myBinary.npy')      # read content of myBinary.npy into b
```
\normalsize
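
Reading the `.npz` archive back also works with `load()`; a short sketch (not part of the original slide):

\footnotesize
```python
npz = np.load('myComp.npz')  # open the archive
a = npz['a']                 # access the stored arrays by keyword
b = npz['b']
```
\normalsize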

The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.

\footnotesize
```python
x = np.array([1,2,3,4,5,6,7])          # create ndarray
np.savetxt('myText.txt',x,fmt='%d')    # write array x to text file myText.txt
.....
y = np.loadtxt('myText.txt',dtype=int) # read content of myText.txt into y
```
\normalsize
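
A sketch of some of these parameters; the file name `myTable.csv` is just an example (not part of the original slide):

\footnotesize
```python
m = np.arange(6).reshape(2,3)  # 2x3 array [[0 1 2],[3 4 5]]
np.savetxt('myTable.csv', m, fmt='%d', delimiter=',',
           header='col1,col2,col3', footer='end of table')
z = np.loadtxt('myTable.csv', dtype=int, delimiter=',')  # '#' lines are skipped
```
\normalsize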

## Exercise 1

i) Display a numpy array as a figure of a blue cross. The size should be 200
   by 200 pixels. Use the array format (M, N, 3), where the first two indices specify
   the pixel position and the last the RGB color from 0:255.
   - In addition, draw a red square at an arbitrary position in the figure.
   - Draw a circle in the center of the figure. Try to create a mask which
     selects the inner part of the circle using indexing.

\small
[Solution: 01_intro_ex_1a_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1a_sol.py) \normalsize

ii) Read the pixel data from the binary file horse.py into a
    numpy array. Display the data and the following transformations in 4
    subplots: scaling and translation, compression in x and y, rotation
    and mirroring.

\small
[Solution: 01_intro_ex_1b_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1b_sol.py) \normalsize

## Pandas

[\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in Python for
\textcolor{blue}{data manipulation and analysis}.

\vspace{0.4cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Offers data structures and operations for manipulating numerical tables with
  integrated indexing

* Imports data from various file formats, e.g. comma-separated values, JSON,
  SQL or Excel

* Provides tools for reading and writing data structures and allows analyzing, filtering,
  splitting, merging and joining

* Built on top of `NumPy`

* Visualizes data with `matplotlib`

* Most machine learning tools support `pandas` $\rightarrow$
  it is widely used to preprocess data sets for machine learning

## Pandas micro introduction

Goal: exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* `Series` -> vector (list) of data elements of arbitrary type

* `DataFrame` -> tabular arrangement of data elements of column-wise
  arbitrary type

Both allow cleaning data by removing `empty` or `NaN` data entries, as sketched after the example below

\footnotesize
```python
import numpy as np
import pandas as pd                          # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8])       # create a Series of float64
r = pd.Series(np.random.randn(4))            # Series of random numbers float64
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print (df)                                   # print the DataFrame
                   A         B         C         D
2013-01-01  1.618395  1.210263 -1.276586 -0.775545
2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
2013-01-03 -0.359081  0.296019  1.541571  0.235337

new_s = s.dropna()  # return a new Series without the NaN entry
```
\normalsize
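
For a `DataFrame`, `dropna()` works row- or column-wise; a minimal sketch (not part of the original slide):

\footnotesize
```python
df2 = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})
df2.dropna()         # drop rows containing any NaN
df2.dropna(axis=1)   # drop columns containing any NaN
df2.fillna(value=0)  # or replace NaN entries by a value instead
```
\normalsize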

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* pandas data can be saved in different file formats (CSV, JSON, HTML, XML,
  Excel, OpenDocument, HDF5 format, .....). `NaN` entries are kept
  in the output file.

* csv file

\footnotesize
```python
df.to_csv("myFile.csv")  # Write the DataFrame df to a csv file
```
\normalsize

* HDF5 output

\footnotesize
```python
df.to_hdf("myFile.h5",key='df',mode='w')  # Write the DataFrame df to HDF5
s.to_hdf("myFile.h5", key='s',mode='a')
```
\normalsize

* Writing to an Excel file

\footnotesize
```python
df.to_excel("myFile.xlsx", sheet_name="Sheet1")
```
\normalsize

* Deleting a data file in Python

\footnotesize
```python
import os
os.remove('myFile.h5')
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* read in data from various formats

* csv file

\footnotesize
```python
.......
df = pd.read_csv('heart.csv')  # read csv data table
df.info()                      # print a summary of the DataFrame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
print(df.head(5))     # prints the first 5 rows of the data table
print(df.describe())  # shows a quick statistics summary of your data
```
\normalsize

* Reading an Excel file

\footnotesize
```python
df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
```
\normalsize

\textcolor{olive}{There are many options specifying details for IO.}

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Various functions exist to select and view data from pandas objects

* Display index and columns

\footnotesize
```python
df.index                  # show datetime index of df
DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
              dtype='datetime64[ns]',freq='D')
df.columns                # show columns info
Index(['A', 'B', 'C', 'D'], dtype='object')
```
\normalsize

* `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data

\footnotesize
```python
df.to_numpy()             # one dtype for the entire array, not per column!
[[-0.62660101 -0.67330526  0.23269168 -0.67403546]
 [-0.53033339  0.32872063 -0.09893568  0.44814084]
 [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
\normalsize

Does not include the index or column labels in the output
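
With columns of mixed types the common dtype falls back to `object`; a small sketch (not part of the original slide):

\footnotesize
```python
dfm = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
dfm.to_numpy()  # array([[1, 'x'], [2, 'y']], dtype=object)
```
\normalsize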

* more on viewing

\footnotesize
```python
df.T                                  # transpose the DataFrame df
df.sort_values(by="B")                # sorting by the values of a column of df
df.sort_index(axis=0,ascending=False) # sorting by index, descending values
df.sort_index(axis=1,ascending=False) # display columns in inverse order
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions

* get a named column as a Series

\footnotesize
```python
df["A"]          # selects column A from df, similar to df.A
df.iloc[:, 0:1]  # slices column A explicitly from df, like df.loc[:, ["A"]]
```
\normalsize

* select rows of a DataFrame

\footnotesize
```python
df[0:2]                    # selects rows 0 and 1 from df
df["20130102":"20130103"]  # slice by index labels, endpoints are included!
df.iloc[2]                 # select a row by its integer position
df.iloc[1:3, :]            # selects rows 1 and 2 from df
```
\normalsize

* select by label

\footnotesize
```python
df.loc["20130102":"20130103",["C","D"]]  # selects rows 1 and 2 and only C and D
df.loc[dates[0], "A"]                    # selects a single value (scalar)
```
\normalsize

* select by lists of integer positions (as in `NumPy`)

\footnotesize
```python
df.iloc[[0, 2], [1, 3]]  # select rows 0 and 2 and columns B and D
df.iloc[1, 1]            # get a value explicitly
```
\normalsize

* select according to expressions

\footnotesize
```python
df.query('B<C')                    # select rows where B < C
df1=df[(df["B"]==0)&(df["D"]==0)]  # conditions on rows
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects continued

* Boolean indexing

\footnotesize
```python
df[df["A"] > 0]  # select rows where the value of column A is > 0
df[df > 0]       # select values > 0 from the entire DataFrame (others become NaN)
```
\normalsize

A more complex example:

\footnotesize
```python
df2 = df.copy()                      # copy df
df2["E"] = ["eight","one","four"]    # add column E
df2[df2["E"].isin(["two", "four"])]  # select rows where column E contains
                                     # "two" or "four"
```
\normalsize

* Operations (in general exclude missing data)

\footnotesize
```python
df2[df2 > 0] = -df2                   # all elements > 0 change sign
df.mean(0)                            # get column-wise mean (number = axis)
df.mean(1)                            # get row-wise mean
df.std(0)                             # standard deviation according to axis
df.cumsum()                           # cumulative sum of each column
df.apply(np.sin)                      # apply function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply lambda function column-wise
df + 10                               # add scalar 10
df - [1, 2, 10 , 100]                 # subtract values of each column
df.corr()                             # compute pairwise correlation of columns
```
\normalsize
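
"Exclude missing data" means `NaN` entries are skipped in reductions; a small sketch (not part of the original slide):

\footnotesize
```python
dfn = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
dfn.mean()              # 2.0 -> the NaN entry is ignored
dfn.mean(skipna=False)  # NaN -> force inclusion of missing data
```
\normalsize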

## Pandas - plotting data

[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are just two examples

* Plot random data in a histogram and a scatter plot

\footnotesize
```python
# create DataFrame with normally distributed random data
df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
df = df + [1, 3, 8 , 10]  # shift means to 1, 3, 8 , 10
plt.figure()
df.plot.hist(bins=20)     # histogram of all 4 columns
g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
```
\normalsize

::: columns
:::: {.column width=35%}
![](figures/pandas_histogramm.png)
::::
:::: {.column width=35%}
![](figures/pandas_scatterplot.png)
::::
:::

## Pandas - plotting data

The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame of variable counts on the inputs

\footnotesize
```python
df = pd.DataFrame(             # create DataFrame of 2 categories
     {"sex":   np.array([0,0,0,0,1,1,1,1,0,0,0]),
      "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
     } )                       # closing bracket goes on next line
pd.crosstab(df.sex, df.heart)  # create cross table of possibilities
pd.crosstab(df.sex, df.heart).plot(kind="bar",color=['red','blue'])  # plot counts
```
\normalsize
::: columns
:::: {.column width=42%}
![](figures/pandas_crosstabplot.png)
::::
:::
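
`crosstab()` can also normalize the counts instead of returning them raw; a one-line sketch (not part of the original slide):

\footnotesize
```python
pd.crosstab(df.sex, df.heart, normalize='index')  # fractions per row
```
\normalsize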

## Exercise 2

Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/heart.csv) into a DataFrame.
[\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)

\setbeamertemplate{itemize item}{\color{red}$\square$}

* Which columns do we have?

* Print the first 3 rows

* Print the statistics summary and the correlations

* Print the mean values of each column with and without disease

* Select the data according to `sex` and `target` (heart disease 0=no 1=yes).

* Plot the `age` distribution of males and females in one histogram

* Plot the heart disease distribution according to chest pain type `cp`

* Plot `thalach` according to `target` in one histogram

* Plot `sex` and `target` in a histogram figure

* Correlate `age` and `max heart rate` according to `target`

* Correlate `age` and `cholesterol` according to `target`

\small
[Solution: 01_intro_ex_2_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_2_sol.py) \normalsize