 slides/.gitignore                                                  |   1 +
 slides/Makefile                                                    |  10 +
 slides/README.md                                                   |   2 +
 slides/copy_slides.sh                                              |   6 +
 slides/decision_trees.md                                           | 347 +
 slides/figures/03_ml_basics_galton_linear_regression_iminuit.pdf   | Bin
 slides/figures/03_ml_basics_log_regr_heart_disease.pdf             | Bin
 slides/figures/03_ml_basics_logistic_regression.pdf                | Bin
 slides/figures/L1vsL2.pdf                                          | Bin
 slides/figures/activation_functions.png                            | Bin
 slides/figures/adversarial_attack.png                              | Bin
 slides/figures/ai_history.png                                      | Bin
 slides/figures/ai_ml_dl.pdf                                        | Bin
 slides/figures/ann.png                                             | Bin
 slides/figures/anomaly_detection.png                               | Bin
 slides/figures/autoencoder_example.pdf                             | Bin
 slides/figures/bdt.png                                             | Bin
 slides/figures/book-murphy.png                                     | Bin
 slides/figures/book_deep_learning_for_physics_research.png         | Bin
 slides/figures/boston_house_prices.pdf                             | Bin
 slides/figures/cnn.png                                             | Bin
 slides/figures/cnn_conv_layer.png                                  | Bin
 slides/figures/cnn_fully_connected.png                             | Bin
 slides/figures/cnn_pooling.png                                     | Bin
 slides/figures/cnn_sliding_filter.png                              | Bin
 slides/figures/critical_temperature.pdf                            | Bin
 slides/figures/cross_val.png                                       | Bin
 slides/figures/decision_boundaries.png                             | Bin
 slides/figures/decision_trees_feature_space.png                    | Bin
 slides/figures/deep_learning_book.png                              | Bin
 slides/figures/deep_learning_with_python.png                       | Bin
 slides/figures/deepl.png                                           | Bin
 slides/figures/dnn.png                                             | Bin
 slides/figures/dropout.png                                         | Bin
 slides/figures/example_overtraining.png                            | Bin
 slides/figures/feature_transformation.png                          | Bin
 slides/figures/fisher.png                                          | Bin
 slides/figures/fisher_linear_decision_boundary.png                 | Bin
 slides/figures/gan.png                                             | Bin
 slides/figures/gradient_descent.png                                | Bin
 slides/figures/gradient_descent_cmp.png                            | Bin
 slides/figures/hands_on_machine_learning.png                       | Bin
 slides/figures/handwritten_digits.png                              | Bin
 slides/figures/heart_table.png                                     | Bin
 slides/figures/imagenet.png                                        | Bin
 slides/figures/imagenet_challenge.png                              | Bin
 slides/figures/iminuit_minos_scan-1.png                            | Bin
 slides/figures/iminuit_minos_scan-2.png                            | Bin
 slides/figures/iris_dataset.png                                    | Bin
 slides/figures/keras.png                                           | Bin
 slides/figures/knn.png                                             | Bin
 slides/figures/logistic_fct.png                                    | Bin
 slides/figures/loss_fct.png                                        | Bin
 slides/figures/magic_photo.png                                     | Bin
 slides/figures/magic_photo_small.png                               | Bin
 slides/figures/magic_shower_em_had.png                             | Bin
 slides/figures/magic_shower_em_had_small.png                       | Bin
 slides/figures/magic_shower_parameters.png                         | Bin
 slides/figures/magic_sketch.png                                    | Bin
 slides/figures/matplotlib_Figure_1.png                             | Bin
 slides/figures/matplotlib_Figure_2.png                             | Bin
 slides/figures/matplotlib_Figure_3.png                             | Bin
 slides/figures/matplotlib_Figure_4.png                             | Bin
 slides/figures/mini_boone_decisions_tree.png                       | Bin
 slides/figures/ml_example_spam.png                                 | Bin
 slides/figures/mlp.png                                             | Bin
 slides/figures/mnist.png                                           | Bin
 slides/figures/monitoring_overtraining.png                         | Bin
 slides/figures/mva.png                                             | Bin
 slides/figures/mva_nn.png                                          | Bin
 slides/figures/neuron.png                                          | Bin
 slides/figures/nn_decision_boundary.png                            | Bin
 slides/figures/pandas_crosstabplot.png                             | Bin
 slides/figures/pandas_histogramm.png                               | Bin
 slides/figures/pandas_scatterplot.png                              | Bin
 slides/figures/pdf_from_2d_histogram.png                           | Bin
 slides/figures/perceptron_photo.png                                | Bin
 slides/figures/perceptron_retina.png                               | Bin
 slides/figures/perceptron_weighted_sum.png                         | Bin
 slides/figures/perceptron_with_threshold.png                       | Bin
 slides/figures/regularization.png                                  | Bin
 slides/figures/relu.png                                            | Bin
 slides/figures/rootOptions.png                                     | Bin
 slides/figures/scikit-learn.png                                    | Bin
 slides/figures/sigmoid.png                                         | Bin
 slides/figures/signal_background_distr.png                         | Bin
 slides/figures/signal_purity.png                                   | Bin
 slides/figures/stochastic_gradient_descent.png                     | Bin
 slides/figures/supervised_learning_car_plane.png                   | Bin
 slides/figures/supervised_nutshell.png                             | Bin
 slides/figures/tensorflow.png                                      | Bin
 slides/figures/tf_playground.png                                   | Bin
 slides/figures/tree_pruning_slides.png                             | Bin
 slides/figures/underfitting_overfitting.pdf                        | Bin
 slides/figures/underfitting_overfitting_001.png                    | Bin
 slides/figures/videogame.png                                       | Bin
 slides/figures/xor.png                                             | Bin
 slides/figures/xor_like_data.pdf                                   | Bin
 slides/fit_intro.md                                                | 563 +
 slides/intro_python.md                                             | 830 +
slides/.gitignore
@@ -0,0 +1 @@
.DS_Store
slides/Makefile
@@ -0,0 +1,10 @@
# make creates pdf files of all newly edited .md files

SRCS := $(wildcard *.md)
PDF := $(SRCS:%.md=%.pdf)

OPT := --pdf-engine=xelatex --variable mainfont="Helvetica" --variable sansfont="Helvetica" -t beamer -s -fmarkdown-implicit_figures --template=template.beamer --highlight-style=kate

all: ${PDF}

%.pdf: %.md
	pandoc $(OPT) --output=$@ $<
slides/README.md
@@ -0,0 +1,2 @@
Pandoc slides example following the style of [Stefan Wunsch's CERN IML workshop presentation](https://github.com/stwunsch/iml_keras_workshop) on [Keras](https://keras.io/) (see slides folder)
slides/copy_slides.sh
@@ -0,0 +1,6 @@
# slides (do chgrp machlearn <file> later)
# scp CIPpoolAccess.PDF reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 03_ml_basics.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 04_decision_trees.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
scp 05_neural_networks.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
slides/decision_trees.md
@@ -0,0 +1,347 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 4. Decision Trees

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Exercises

* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values

## Decision trees

\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}

\begin{center}
Leaf nodes classify events as either signal or background
\end{center}

## Decision trees: Rectangular volumes in feature space

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}

* Easy to interpret and visualize: space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?

## Finding optimal cuts

Separation between signal and background is often measured with the Gini index (or Gini impurity):

$$ G = p (1-p) $$

Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$

\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}

\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$

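## Gini index: toy example

A minimal numpy sketch of the split criterion defined above, with made-up signal/background weights (not from any real analysis):

\footnotesize
```python
import numpy as np

def gini(w_sig, w_bkg):
    """Gini impurity G = p(1-p) from summed signal/background weights."""
    p = w_sig / (w_sig + w_bkg)          # purity of the node
    return p * (1 - p)

def weight(node):                        # W_X = sum of all weights in node X
    return sum(node)

# toy nodes: (summed signal weights, summed background weights)
A = (80.0, 60.0)                         # parent node
B = (70.0, 10.0)                         # left child after the cut
C = (10.0, 50.0)                         # right child after the cut

delta = weight(A) * gini(*A) - weight(B) * gini(*B) - weight(C) * gini(*C)
print(f"separation gain Delta = {delta:.1f}")   # positive: the cut improves separation
```
\normalsize
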
## Gini impurity and other purity measures

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}

## Decision tree pruning

::: columns
:::: {.column width=50%}

When to stop growing a tree?

* When all nodes are essentially pure?
* Well, that's overfitting!

\vspace{3ex}

Pruning

* Cut back a fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves

::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::

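## Pruning in scikit-learn: a sketch

One concrete realization of pruning is scikit-learn's cost-complexity pruning via the `ccp_alpha` parameter; a minimal sketch on a synthetic data set (purely for illustration):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0)                  # grown until nodes are pure
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # cost-complexity pruned
for tree in (full, pruned):
    tree.fit(X_train, y_train)
    print(tree.get_n_leaves(), tree.score(X_test, y_test))       # far fewer leaves, similar score
```
\normalsize
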
## Single decision trees: Pros and cons

\textcolor{green}{Pros:}

* Require little data preparation (unlike neural networks)
* Can use continuous and categorical inputs

\vfill

\textcolor{red}{Cons:}

* Danger of overfitting the training data
* Sensitive to fluctuations in the training data
* Hard to find the global optimum
* When to stop splitting?

## Ensemble methods: Combine weak learners

::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample training data (with replacement) and train a separate model on each of the derived training sets
    * Classify example with majority vote, or compute average output from each tree as model output (see the sketch on the next slide)
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by the previous model
    * Take a weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::

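## Bagging: a minimal sketch

A bare-bones illustration of bagging with scikit-learn trees on a synthetic data set (in practice one would use `sklearn.ensemble.BaggingClassifier` directly):

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=500, random_state=1)

trees = []
for i in range(10):
    Xb, yb = resample(X, y, random_state=i)    # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))

# average the tree outputs as the ensemble prediction y(x)
y_avg = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```
\normalsize
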
## Random forests

* "One of the most widely used and versatile algorithms in data science and machine learning"
  \tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select a random example subset
\vfill
* Train a tree, but only use a random subset of the features at each split
    * this reduces the correlation between different trees
    * makes the decision more robust to missing data

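## Random forests: a minimal sketch

As an illustration (synthetic data again), scikit-learn's random forest combines bagging with the random feature subset, steered by `max_features`:

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features="sqrt": consider only sqrt(n_features) features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```
\normalsize
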
## Boosted decision trees: Idea

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}

## AdaBoost (short for Adaptive Boosting)

Initial training sample

\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}

with equal weights normalized as

$$ \sum_{i=1}^n w_i^{(1)} = 1 $$

Train first classifier $f_1$:

\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}

## AdaBoost: Updating event weights

Define training sample $k+1$ from training sample $k$ by updating the weights:

$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$

\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize

The weight is increased if the event was misclassified by the previous classifier

$\to$ "The next classifier should pay more attention to misclassified events"

\vfill
At each step the classifier $f_k$ minimizes the error rate:

$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$

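## AdaBoost: one reweighting step in numpy

A toy sketch of a single reweighting step, using the classifier score $\alpha_k$ defined on the next slide (labels and decisions are made up):

\footnotesize
```python
import numpy as np

y   = np.array([+1, +1, -1, -1])     # true labels
f_x = np.array([+1, +1, -1, +1])     # decisions of f_k: last event misclassified
w   = np.full(4, 0.25)               # equal initial weights

eps   = np.sum(w * (y * f_x <= 0))   # weighted error rate (here 0.25)
alpha = np.log((1 - eps) / eps)      # classifier score alpha_k

w_new = w * np.exp(-alpha * f_x * y / 2)
w_new /= w_new.sum()                 # normalization Z_k: weights sum to 1
print(w_new)                         # the misclassified event now carries more weight
```
\normalsize
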
## AdaBoost: Assigning the classifier score

Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$

\vfill

Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$

## Gradient boosting

Basic idea:

* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on

\vfill

In slightly more detail:

* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$

\color{black}

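## Gradient boosting: residual fitting in a few lines

The residual-fitting loop can be written down directly; a minimal sketch with scikit-learn regression trees on toy data:

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)

F = np.zeros(len(y))                                      # start from F_0 = 0
for m in range(5):
    h = DecisionTreeRegressor(max_depth=2).fit(x, y - F)  # fit h_m to the residuals
    F += h.predict(x)                                     # F_{m+1} = F_m + h_m
    print(m, round(np.mean((y - F) ** 2), 4))             # training MSE decreases
```
\normalsize
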
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize

\vfill

Superconductivity data set:

Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize

\vfill

From the abstract:

We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.

\vfill

\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize

## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)

::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb

XGBreg = xgb.sklearn.XGBRegressor()

XGBreg.fit(X_train, y_train)

y_pred = XGBreg.predict(X_test)

from sklearn.metrics import mean_squared_error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```

\textcolor{gray}{This gives:}

`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::

## Exercise 1: Compare different decision tree classifiers

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)

\vspace{5ex}

Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline

\vspace{2ex}

Is there a classifier that clearly performs best?

## Exercise 2: Apply XGBoost classifier to MAGIC data set

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize

\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize

\small
a) Plot the predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use plot_importance from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three performance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize

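## Exercise 2: plotting hint

As a starting point for b), a minimal sketch of the plotting call (assuming the `XGBclassifier` defined above has already been fitted with `XGBclassifier.fit(X_train, y_train)`):

\footnotesize
```python
from xgboost import plot_importance

# importance_type can be "weight", "gain" or "cover"
plot_importance(XGBclassifier, importance_type="gain", max_num_features=10)
```
\normalsize
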
## Exercise 3: Feature importance

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize

\vspace{3ex}

Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.

## Exercise 4: Interpret a classifier with SHAP values

SHAP (SHapley Additive exPlanations) is a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept used in cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.

\vfill

Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.

a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)

b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?

c) Do the same for the superconductivity data set. What are the three most important features?

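## Exercise 4: SHAP starting point

A minimal, self-contained sketch of the SHAP workflow for a tree model (synthetic data, purely for illustration; for the exercise, use the MAGIC classifier instead):

\footnotesize
```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)    # explainer for tree-based models
shap_values = explainer.shap_values(X)   # one value per event and feature
shap.summary_plot(shap_values, X)        # summary plot of feature importance
```
\normalsize
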
slides/fit_intro.md
@@ -0,0 +1,563 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 2. Data modeling and fitting

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Data modeling and fitting - introduction

Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to obtain inferences that reveal
the underlying structure of the data.

* There are two approaches to statistical data modeling
    * Hypothesis testing: is our data compatible with a certain model?
    * Determination of model parameters: use the data to determine the parameters
      of a (theoretical) model

* For the determination of model parameters
    * Analysis of data distributions $\rightarrow$ mean, variance,
      median, FWHM, ... \newline
      allows for an approximate determination of the model parameters

    * Data fitting with the least square method $\rightarrow$ an iterative
      process which minimizes the deviation of a model described by parameters
      from the data. This determines the optimal values and uncertainties
      of the parameters.

    * Maximum likelihood fitting $\rightarrow$ find the set of model parameters
      which most likely describes the data by maximizing the probability
      distributions.

Parameter determination by minimization is an integral part of machine
learning approaches; here a system learns patterns and predicts
related ones. This is the focus of the upcoming days.

## Data modeling and fitting - introduction

Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to obtain inferences that reveal
the underlying structure of the data.

* There are two approaches to statistical data modeling
    * Hypothesis testing: is our data compatible with a certain model?
    * Determination of model parameters: use the data to determine the parameters
      of a (theoretical) model

* For the determination of model parameters
    * Analysis of data distributions $\rightarrow$ mean, variance,
      median, FWHM, ... \newline
      allows for an approximate determination of the model parameters

\setbeamertemplate{itemize subitem}{\color{red}\tiny$\blacksquare$}

    * \textcolor{blue}{Data fitting with the least square method
      $\rightarrow$ an iterative
      process which minimizes the deviation of a model described by parameters
      from the data. This determines the optimal values and uncertainties
      of the parameters.}

\setbeamertemplate{itemize subitem}{\color{blue}\tiny$\blacktriangleright$}

    * Maximum likelihood fitting $\rightarrow$ find the set of model parameters
      which most likely describes the data by maximizing the probability
      distributions.

Parameter determination by minimization is an integral part of machine
learning approaches; here a system learns patterns and predicts
related ones. This is the focus of the upcoming days.

## Least Square (LS) Method (1)

The method determines the \textcolor{blue}{optimal parameters of functions
fitted to Gaussian distributed measurements}.

Let's consider a sample of $n$ measurements $y_{i}$ and a parametrized
description of the measurement $\eta_{i} = f(x_{i} | \theta)$
with a parameter set $\theta = \theta_{1}, \theta_{2}, ..., \theta_{k}$,
independent variables $x_{i}$ and measurement errors $\sigma_{i}$.

The parameter set should be determined such that
\begin{equation*}
\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} = \sum \limits_{i=1}^{n} \frac{(y_i- f(x_i|\theta))^2}{\sigma_i^2} \longrightarrow \, \text{minimal}}
\end{equation*}
In case of correlated measurements the covariance matrix of the $y_{i}$ has to
be taken into account. This is accomplished by defining a weight matrix from
the covariance matrix of the input data. A decorrelation of the input data
should be considered.
\vspace{0.2cm}

$S$ follows a $\chi^{2}$-distribution with $(n-k)$ degrees of freedom.

## Least Square (LS) Method (2)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Example LS-method
\vspace{0.2cm}

Often the fit function $f(x, \theta)$ is linear in
$\theta = \theta_{1}, \theta_{2}, ..., \theta_{k}$
\vspace{0.2cm}

$f(x | \theta) = \theta_{1} f_{1}(x) + .... + \theta_{k} f_{k}(x)$
\vspace{0.2cm}

If the model is a straight line and our parameters are $\theta_{1}$ and
$\theta_{2}$ $(f_{1}(x) = 1,$ $f_{2}(x) = x)$ we have
$f(x | \theta) = \theta_{1} + \theta_{2} x$
\vspace{0.2cm}

The LS equation is
\vspace{0.2cm}

$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum
\limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} - x_{i}
\theta_{2})^2}{\sigma_i^2 }}$ \hspace{0.4cm} and with
\vspace{0.2cm}

$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{-2
(y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{-2 x_i (y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$
\vspace{0.2cm}

the parameters $\theta_{1}$ and $\theta_{2}$ can be determined.

\vspace{0.2cm}
\textcolor{olive}{In case of linear fit functions solutions can be found by matrix inversion (see the sketch on the next slide)}

\vfill

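## Linear fit by matrix inversion: a sketch

A minimal numpy sketch of the matrix-inversion solution for the straight line (toy numbers, uncorrelated measurement errors):

\footnotesize
```python
import numpy as np

x   = np.array([1., 2., 3., 4., 5.])        # independent variable
y   = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # measurements
sig = np.full(5, 0.2)                       # measurement errors

A = np.vstack([np.ones_like(x), x]).T       # design matrix: columns f_1(x)=1, f_2(x)=x
W = np.diag(1.0 / sig**2)                   # weight matrix
cov = np.linalg.inv(A.T @ W @ A)            # covariance matrix of the parameters
theta = cov @ A.T @ W @ y                   # LS estimate (theta_1, theta_2)
print(theta, np.sqrt(np.diag(cov)))         # parameters and their uncertainties
```
\normalsize
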
## Least Square (LS) Method (3)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Use of a nonlinear fit function $f(x, \theta)$ like \hspace{0.4cm}
  $f(x | \theta) = \theta_{1} \cdot e^{-\theta_{2} x}$
\vspace{0.2cm}

results in the LS equation
\vspace{0.2cm}

$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum \limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} \cdot e^{-\theta_{2} x_{i}})^2}{\sigma_i^2 }}$ \hspace{0.4cm}
\vspace{0.2cm}

which we have to minimize
\vspace{0.2cm}

$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{ 2 e^{-2 \theta_2 x_i} ( \theta_1 - y_i e^{\theta_2 x_i} )} {\sigma_i^2 } = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{ 2 \theta_1 x_i e^{-2 \theta_2 x_i} (y_i e^{\theta_2 x_i} - \theta_1)} {\sigma_i^2 } = 0$

\vspace{0.4cm}

In a nonlinear system, the LS Ansatz leads to derivatives which are
functions of the independent variable and the parameters $\color{red}\rightarrow$ \textcolor{olive}{no closed solutions}
\vspace{0.4cm}

In general, we have gradient equations which don't have closed solutions.
There are a couple of methods, including approximations, which together
with numerical methods allow one to find a global minimum: the Gauss–Newton
algorithm, the Levenberg–Marquardt algorithm, gradient descent methods and
also direct search methods.

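## Nonlinear LS fit: numerical minimization

For such nonlinear problems one typically calls a numerical minimizer; a short sketch with `scipy.optimize.curve_fit` on toy data (`curve_fit` uses a Levenberg–Marquardt-type algorithm by default):

\footnotesize
```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, th1, th2):                      # nonlinear model theta_1 * exp(-theta_2 x)
    return th1 * np.exp(-th2 * x)

x   = np.linspace(0.1, 4.0, 20)
y   = f(x, 2.0, 0.8) + np.random.default_rng(1).normal(0., 0.05, 20)
sig = np.full(20, 0.05)

popt, pcov = curve_fit(f, x, y, sigma=sig, absolute_sigma=True, p0=[1., 1.])
print(popt, np.sqrt(np.diag(pcov)))      # fitted parameters and uncertainties
```
\normalsize
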
## Minuit - a program package for minimization (1)

In general, data fitting and also solving machine learning algorithms lead
to a minimization problem of functions. Between 1975 and 1980 F. James (CERN) developed
a FORTRAN-based package, [\textcolor{violet}{MINUIT}](http://seal.web.cern.ch/seal/documents/minuit/mntutorial.pdf), which is a framework to handle
multiparameter minimization and compute the best-fit parameter values and
uncertainties, including correlations between the parameters.
\vspace{0.2cm}

The user provides a minimization function
$F(X,P)$ with the parameter space $P=(p_1,....p_k)$ and
variable space $X$ (also multi-dimensional). There is an interface via
functions which influence the minimization process. MINUIT provides
[\textcolor{violet}{error calculations}](http://seal.web.cern.ch/seal/documents/minuit/mnerror.pdf) including correlations for the parameter space by evaluating the shape of the function in some neighbourhood of the minimum.
\vspace{0.2cm}

The package
now has a newer object-oriented implementation as the [\textcolor{violet}{Minuit2 library}](https://root.cern.ch/doc/master/Minuit2Page.html), written
in C++.
\vspace{0.2cm}

During the minimization $F(X,P)$ is evaluated for various $X$. For the
choice of $P=(p_1,....p_k)$ different methods are used.

## Minuit - a program package for minimization (2)

\vspace{0.4cm}
\textcolor{olive}{SEEK}: Searches for the minimum with Monte Carlo methods, mostly used at the start
of the minimization with unknown starting values. It is not a converging
algorithm.
\vspace{0.2cm}

\textcolor{olive}{SIMPLX}:
Uses the simplex method of Nelder and Mead. Function values are compared
in the parameter space. Via step size control the minimum is approached.
Parameter errors are only approximate; no covariance matrix is calculated.
\vspace{0.2cm}

<!---
A simplex is the smallest n-dimensional figure with n+1 corners. By reflecting
one point in the hyperplane of the other points, the simplex adapts itself to the
function plane.
-->

\textcolor{olive}{MIGRAD}:
Uses an algorithm of R. Fletcher, which takes the function and the gradient
to approach the minimum with a variable metric method. An error matrix and
correlation coefficients are available.
\vspace{0.2cm}

\textcolor{olive}{HESSE}:
Calculates the Hessian matrix of second derivatives and determines the
covariance matrix.
\vspace{0.2cm}

\textcolor{olive}{MINOS}:
Calculates (asymmetric) errors using likelihood profiles.
The algorithm for finding the positive and negative MINOS errors for parameter
$n$ consists of varying $n$, each time minimizing $F(X,P)$ with respect to
all the others.
\vspace{0.2cm}

## Minuit - a program package for minimization (3)

\vspace{0.4cm}

Fit process with the Minuit package
\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The individual steps described above can be called several times and in different order during the minimization process.

* Each of the parameters $p_i$ of $P=(p_1,....p_k)$ can be fixed and
  released during the minimization steps.

* Problems are expected in models with strong correlation between
  parameters $\rightarrow$ change the model to uncorrelated definitions

* Local minima, edges/steps or undefined ranges in $F(X,P)$ are problematic
  $\rightarrow$ simplify your model

\vspace{3cm}

## Minuit2 - The iminuit package

\vspace{0.4cm}

[\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) is
a Jupyter-friendly Python interface for the Minuit2 C++ library.
\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The class `iminuit.Minuit` instantiates the minuit object. The minimizer
  function is given as an argument. Basic steering of the fit,
  like setting start parameters, error definition and print level, is also
  done here.

\footnotesize
```python
from iminuit import Minuit
def fcn(x, y, z):   # definition of the minimizer function
    return (x - 2) ** 2 + (y - x) ** 2 + (z - 4) ** 2
m = Minuit(fcn, x=0, y=0, z=0, errordef=1, print_level=1)
```
\normalsize

* Several methods determine the interaction with the fitting process, e.g. calls
  to `migrad`, `hesse` or printing of parameters and errors

\footnotesize
```python
......
m.migrad()                  # run optimiser
print(m.values, m.errors)   # print results
m.hesse()                   # run covariance estimator
```
\normalsize

## Minuit2 - iminuit example

\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The function `fcn` describes the model with parameters to be determined by
  the data. `fcn` is minimal when the model parameters agree best with the data.
  `fcn` has positional arguments, one for each fit parameter. `iminuit`
  example fit:

  [\textcolor{violet}{02\_fit\_exp\_fit\_iMinuit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_exp_fit_iMinuit.py)

\footnotesize
```python
......
x = np.array([....],dtype='d')   # measurements x
y = np.array([....],dtype='d')   # measurements y
dy = np.array([....],dtype='d')  # error in y
def xp(a, b, c):
    return a * np.exp(b*x) + c
# least-squares function = sum of data residuals squared
def fcn(a,b,c):
    return np.sum((y - xp(a,b,c)) ** 2 / dy ** 2)
# limit the range of b and fix parameter c
m = Minuit(fcn,a=1,b=-0.7,c=1,limit_b=(-1,0.1),fix_c=True)
m.migrad()            # run minimizer
m.fixed["c"] = False  # release parameter c
m.migrad()            # rerun minimizer
```
\normalsize

* It might be useful to fix parameters or limit their range for some applications

## Minuit2 - iminuit (3)

\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Results and control information of the fit can be printed and accessed
  in the program.

\footnotesize
```python
......
m = Minuit(fcn,....,print_level=1)  # set flag in the initializer
m.migrad()                          # run minimizer
a_fit = m.values['a']               # get parameter value a
a_fit_error = m.errors['a']         # get parameter error of a
print(m.values, m.errors)           # print results
```
\normalsize

* After processing Hesse, covariance and correlation information of the
  fit is available

\footnotesize
```python
......
m.hesse()                    # run covariance estimator
m.matrix()                   # get covariance matrix
m.matrix(correlation=True)   # get full correlation matrix
cov = m.np_matrix()          # save matrix to numpy
cor = m.np_matrix(correlation=True)
print(cor[0, 1])             # print correlation between parameter 1 and 2
```
\normalsize

## Minuit2 - iminuit (4)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Minos provides asymmetric uncertainty intervals and parameter contours by
  scanning one parameter and minimizing the function with respect to all other
  parameters for each scan point. Results are displayed with `matplotlib`.

\footnotesize
```python
......
m.minos()
print(m.get_merrors()['a'])
m.draw_mnprofile('b')
m.draw_mncontour('a', 'b', nsigma=4)
```
\normalsize
::: columns
:::: {.column width=40%}
![](figures/iminuit_minos_scan-1.png)
::::
:::: {.column width=40%}
![](figures/iminuit_minos_scan-2.png)
::::
:::

## Exercise 3

Plot the following data with matplotlib as in the iminuit example:

\footnotesize
```
x: 0.2,0.4,0.6,0.8,1.,1.2,1.4,1.6,1.8,2.,2.2,2.4,2.6,2.8,3.,3.2,
   3.4,3.6,3.8,4.
y: 0.04,0.021,0.035,0.03,0.029,0.019,0.024,0.018,0.019,0.022,0.02,
   0.025,0.018,0.024,0.019,0.021,0.03,0.019,0.03,0.024
dy: 1.792,1.695,1.541,1.514,1.427,1.399,1.388,1.270,1.262,1.228,1.189,
    1.182,1.121,1.129,1.124,1.089,1.092,1.084,1.058,1.057
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}

* In the example iminuit fit `02_fit_exp_fit_iMinuit.ipynb`, replace the
  exponential function by a 3rd-order polynomial and perform the fit

* Compare the correlation of the parameters of the exponential and
  the polynomial fit

* What defines the fit quality? Give an estimate.

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_3\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_3_sol.py) \normalsize

## Exercise 4

Plot the following data with matplotlib:

\footnotesize
```
x: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
dx: 0.1,0.1,0.5,0.1,0.5,0.1,0.5,0.1,0.5,0.1
y: 1.1,2.3,2.7,3.2,3.1,2.4,1.7,1.5,1.5,1.7
dy: 0.15,0.22,0.29,0.39,0.31,0.21,0.13,0.15,0.19,0.13
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}

* Perform a fit with iminuit. Which model do you use?

* Plot the resulting fit function in the graph with the data

* Print the covariance matrix. Can we improve the errors?

* Can you draw a contour plot of 2 of the fit parameters?

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_4\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_4_sol.py) \normalsize

## PyROOT

[\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) is the Python binding for the C++ data analysis toolkit [\textcolor{violet}{ROOT}](https://root.cern/), developed with and for the LHC community. You can access the full
ROOT functionality from Python while
benefiting from the performance of the ROOT C++ libraries. The PyROOT bindings
are automatic and dynamic and are able to interoperate with widely-used Python
data-science libraries such as `NumPy`, `pandas`, `SciPy`, `scikit-learn` and `tensorflow`.

* ROOT/PyROOT can be installed easily within anaconda3 (ROOT version 6.22.02
  or later) or is available in the
  [\textcolor{violet}{CIP jupyter2 Hub}](https://jupyter2.kip.uni-heidelberg.de/)

* Tools for statistical analysis, a math library with optimized algorithms,
  multivariate analysis, visualization and simulation of data

* Storing data, including objects and classes, with compression in files is a
  very powerful aspect for any data analysis project

* Within PyROOT, Minuit2 can be accessed easily, either with predefined functions
  or with your own function definition

* For advanced statistical analyses and data modeling, likelihood fitting with
  the packages **RooFit** and **RooStats** is available

##

* Example: read the invariant mass measurements of a $D^0$ from a text file
  and determine $\mu$ and $\sigma$ \hspace{1.0cm} \small
  [\textcolor{violet}{02\_fit\_histFit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_histFit.py)
  \normalsize

\footnotesize
```python
import numpy as np
import math
from ROOT import TCanvas, TFile, TH1D, TF1, TMinuit, TFitResult
data = np.genfromtxt('D0Mass.txt', dtype='d')  # read data from text file
c = TCanvas('c','D0 Mass',200,10,700,500)      # instantiate output canvas
d0 = TH1D('d0','D0 Mass',200,1700.,2000.)      # instantiate histogram
for x in data :                                # fill data into histogram d0
    d0.Fill(x)
def pyf_tf1_params(x, p):                      # define fit function
    return p[0] * math.exp (-0.5 * ((x[0] - p[1])**2 / p[2]**2))
func = TF1("func",pyf_tf1_params,1840.,1880.,3)
# func = TF1("func",'gaus',1840.,1880.)        # use predefined function
func.SetParameters(500.,1860.,5.5)             # set start parameters
myfit = d0.Fit(func,"S")                       # fit function to the histogram data
# parameter 1 is the mean of the Gaussian
print ("Fit results: mean=",myfit.Parameter(1)," +/- ",myfit.ParError(1))
c.Draw()                                       # draw canvas
myfile = TFile('myOutFile.root','RECREATE')    # open a ROOT file for output
c.Write()                                      # write canvas
d0.Write()                                     # write histogram
myfile.Close()                                 # close file
```
\normalsize

##

* Fit Options
\vspace{0.1cm}

::: columns
:::: {.column width=2%}
::::
:::: {.column width=98%}
![](figures/rootOptions.png)
::::
:::

## Exercise 5

Read the text file [\textcolor{violet}{FitTestData.txt}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/FitTestData.txt) and draw a histogram using PyROOT.
\setbeamertemplate{itemize item}{\color{red}$\square$}

* Determine the mean and sigma of the signal distribution. Which function do
  you use for fitting?

* The option S fills the result object.

* Try to improve the errors of the fit values with MINOS using the option E,
  and also try the option M to scan for a new minimum; option V provides more
  output.

* Fit the background outside the signal region; use the option R+ to add the
  function to your fit

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_5\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_5_sol.py) \normalsize

## iPython Examples for Fitting

The different Python packages are used in
\textcolor{blue}{example iPython notebooks}
to demonstrate the fitting of a third order polynomial to the same data,
available as numpy arrays.

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* LSQ fit of a polynomial to data using Minuit2 with
  \textcolor{blue}{iminuit} and a \textcolor{blue}{matplotlib} plot:

  \small
  [\textcolor{violet}{02\_fit\_iminuitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_iminuitFit.ipynb)
  \normalsize

* Graph fitting with \textcolor{blue}{pyROOT} with options, using a Python
  function, including a confidence level plot:

  \small
  [\textcolor{violet}{02\_fit\_fitGraph.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_fitGraph.ipynb)
  \normalsize

* Graph fitting with \textcolor{blue}{numpy} and confidence level
  plotting with \textcolor{blue}{matplotlib}:

  \small
  [\textcolor{violet}{02\_fit\_numpyFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_numpyFit.ipynb)
  \normalsize

* Graph fitting with a polynomial fit of \textcolor{blue}{scikit-learn} and
  plotting with \textcolor{blue}{matplotlib}:

  \small
  [\textcolor{violet}{02\_fit\_scikitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_scikitFit.ipynb)
  \normalsize
slides/intro_python.md
@@ -0,0 +1,830 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 1. Introduction to python

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Outline of the $1^{st}$ day

* Technical instructions for your interactions with the CIP pool, like
    * using the jupyter hub
    * using python locally in your own Linux environment (anaconda)
    * accessing the CIP pool from your own Windows or Linux system
    * transferring data from and to the CIP pool

  Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.pdf)\normalsize

* Summary of NumPy

* Plotting with matplotlib

* Input / output of data

* Summary of pandas

* Fitting with iminuit and pyROOT

## A glimpse into python classes

The following Python libraries are important for data analysis and machine
learning and will be used during the course:

* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
  multi-dimensional arrays and matrices, along with high-level
  mathematical functions to operate on these arrays

* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library

* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
  mathematical algorithms for minimization, regression,
  Fourier transformation, linear algebra and image processing

* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
  python wrapper to the data fitting toolkit
  [\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
  developed at CERN by F. James in the 1970s

* [\textcolor{violet}{pyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
  ROOT used at the LHC

* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
  python, which makes extensive use of NumPy for high-performance
  linear algebra algorithms

## NumPy

\textcolor{blue}{NumPy} (Numerical Python) is an open source Python library,
which contains multidimensional array and matrix data structures and methods
to efficiently operate on these. The core object is
a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
with arrays and matrices} due to the extensive usage of compiled code.

* It is heavily used in numerous scientific python packages
* `ndarray`'s have a fixed size at creation $\rightarrow$ changing the size
  leads to recreation
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* See for a summary, e.g.
  \small [\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize

\vfill

::: columns
:::: {.column width=35%}

with plain Python

`c = []`

`for i in range(len(a)):`

`    c.append(a[i]*b[i])`

::::
:::: {.column width=35%}

with NumPy

`c = a * b`

::::
:::

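## NumPy - why vectorization pays off

A quick (unscientific) timing comparison of the two variants above, on two large made-up arrays:

\footnotesize
```python
import time
import numpy as np

a = np.random.default_rng(0).random(1_000_000)
b = np.random.default_rng(1).random(1_000_000)

t0 = time.perf_counter()
c = [a[i] * b[i] for i in range(len(a))]   # explicit python loop
t1 = time.perf_counter()
c = a * b                                  # vectorized numpy product
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.3f} s, numpy: {t2 - t1:.4f} s")
```
\normalsize
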
<!---
It seems we need to indent by hand.
I don't manage to align under the bullet text.
If we do it with columns the vertical space around code sections is not good.
If we do it without code sections the vertical space is ok, but there is no
code highlighting.
See the different versions of the same page in the following.
-->

## NumPy - array basics

* numpy arrays build a grid of \textcolor{blue}{same type} values, which are indexed.
  The *rank* is the dimension of the array.
  There are methods to create and preset arrays.

\footnotesize

```python
import numpy as np
myA = np.array([2, 5, 11])         # create rank 1 array (vector like)
type(myA)                          # <class 'numpy.ndarray'>
myA.shape                          # (3,)
print(myA[2])                      # 11, access 3rd element
myA[0] = 12                        # set 1st element to 12
myB = np.array([[1,5],[7,9]])      # create rank 2 array
myB.shape                          # (2,2)
print(myB[0,0],myB[0,1],myB[1,1])  # 1 5 9
myC = np.arange(6)                 # create rank 1 array set to 0 - 5
myC.reshape(2,3)                   # returns a (2,3) array; myC itself is unchanged

zero = np.zeros((2,5))             # 2 rows, 5 columns, set to 0
one  = np.ones((2,2))              # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)           # 2 rows, 2 columns, set to 5
e    = np.eye(2)                   # create 2x2 identity matrix
```
\normalsize

## NumPy - array indexing (1)

* select slices of a numpy array

\footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],        # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                  # subarray of 2 rows and
                                # columns 1 and 2: [[2, 3], [6, 7]]
```
\normalsize

* a slice of an array points into the same data; *modifying* it changes the original array!

\footnotesize
```python
b[0, 0] = 77                     # b[0,0] and a[0,1] are 77

r1_row = a[1, :]                 # get 2nd row -> rank 1
r1_row.shape                     # (4,)
r2_row = a[1:2, :]               # get 2nd row -> rank 2
r2_row.shape                     # (1,4)
a=np.array([[1,2],[3,4],[5,6]])  # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]]        # d contains [1 4 6]
e=a[[1, 2], [1, 1]]              # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,0]]) # address elements explicitly
```
\normalsize

## NumPy - array indexing (2)

* integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements

\footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],        # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])         # create an array of indices
s = a[np.arange(3), p_a]        # number the rows, p_a points to cols
print(s)                        # s contains [1 7 9]
a[np.arange(3),p_a] += 10       # add 10 to corresponding elements
x = np.array([[8,2],[7,4]])     # create 2x2 array
sel = (x > 5)                   # sel: array of booleans
                                # [[ True False]
                                #  [ True False]]
print(x[x>5])                   # select elements, prints [8 7]
```
\normalsize

* data type in numpy - created according to the input numbers or set explicitly

\footnotesize

```python
x = np.array([1.1, 2.1])                # create float array
print(x.dtype)                          # print float64
y = np.array([1.1,2.9],dtype=np.int64)  # create int64 array [1 2]
```
\normalsize

## NumPy - functions

* math functions operate elementwise, either as operator overload or as methods

\footnotesize
```python
x = np.array([[1,2],[3,4]],dtype=np.float64)  # define 2x2 float array
y = np.array([[3,1],[5,1]],dtype=np.float64)  # define 2x2 float array
s = x + y                           # elementwise sum
s = np.add(x,y)
s = np.subtract(x,y)
s = np.multiply(x,y)                # elementwise, no matrix multiplication!
s = np.divide(x,y)
s = np.sqrt(x), np.exp(x), ...
x @ y , or np.dot(x, y)             # matrix product
np.sum(x, axis=0)                   # sum of each column
np.sum(x, axis=1)                   # sum of each row
xT = x.T                            # transpose of x
x = np.linspace(0,2*np.pi,100)      # get equally spaced points in x

r = np.random.default_rng(seed=42)  # constructor of the random number class
b = r.random((2,3))                 # random 2x3 matrix
```
\normalsize

##

* broadcasting in numpy
\vspace{0.4cm}

The term broadcasting describes how numpy treats arrays with different
shapes during arithmetic operations

* add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  $[b,b,b]$
\vspace{0.2cm}

* add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  $\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element-wise
\vspace{0.2cm}

* add a 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ the 1D array is broadcast
  across each row of the 2D array, $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$, and added element-wise
\vspace{0.2cm}

Arithmetic operations can only be performed when the shapes of the arrays
are equal in each dimension or one of the sizes is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details

\footnotesize
```python
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])  # x has shape (2, 3)
v = np.array([1,2,3])             # v has shape (3,)
x + v                             # [[2 4 6]
                                  #  [5 7 9]]
```
\normalsize

## Plot data

A popular library to present data is the `pyplot` module of `matplotlib`.

* Drawing a function in one plot

\footnotesize
::: columns
:::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace(0, 10*np.pi, 100)
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range

# show the plot
plt.show()
```
::::
:::: {.column width=40%}
![](figures/matplotlib_Figure_1.png)
::::
:::

\normalsize

## |
|||
* Drawing subplots in one canvas |
|||
|
|||
\footnotesize |
|||
::: columns |
|||
:::: {.column width=35%} |
|||
```python |
|||
... |
|||
g = np.exp(-0.2*x) |
|||
# create figure |
|||
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey') |
|||
plt.suptitle('1 x 2 Plot') |
|||
# create subplot and plot first one |
|||
plt.subplot(1,2,1) |
|||
# plot first one |
|||
plt.title('exp(x)') |
|||
plt.xlabel('x') |
|||
plt.ylabel('g(x)') |
|||
plt.plot(x,g,'blueviolet') |
|||
# create subplot and plot second one |
|||
plt.subplot(1,2,2) |
|||
plt.plot(x,f,'orange') |
|||
plt.plot(x,f*g,'red') |
|||
plt.legend(['sine^2','exp*sine']) |
|||
# show the plot |
|||
plt.show() |
|||
``` |
|||
:::: |
|||
:::: {.column width=40%} |
|||
\vspace{3cm} |
|||
![](figures/matplotlib_Figure_2.png) |
|||
:::: |
|||
::: |
|||
\normalsize |
|||
|
## Image data

The `image` class of the `matplotlib` library can be used to load an image
into a numpy array and to render the image.

* There are 3 common formats for the numpy array

* (M, N) scalar data, used for greyscale images

* (M, N, 3) for RGB images (each pixel has an array with the RGB color attached)

* (M, N, 4) for RGBA images (each pixel has an array with the RGB color
and transparency attached)

The method `imread` loads the image into an `ndarray`, which can then be
manipulated.

The method `imshow` renders the image data

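The RGBA format does not appear in the examples that follow, so here is a
minimal sketch (array values chosen arbitrarily) that builds a small RGBA
image with a transparency channel:

\footnotesize
```python
import numpy as np
import matplotlib.pyplot as plt

# a 20x20 RGBA image: RGB color plus transparency (alpha) per pixel
rgba = np.zeros((20, 20, 4), dtype=np.uint8)
rgba[..., 0] = 255 # red channel fully on
rgba[..., 3] = 128 # alpha channel: half transparent
plt.imshow(rgba)
plt.show()
```
\normalsize
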
##
* Drawing pixel data and images

\footnotesize
::: columns
:::: {.column width=50%}

```python
....
# create data array with pixel position and RGB color code
width, height = 400, 400
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[175:225, 175:225] = [255, 0, 0]
x = np.random.randint(0, width, 100)  # 100 random column indices
y = np.random.randint(0, height, 100) # 100 random row indices
data[y, x] = [0, 255, 0]              # 100 random green pixels
plt.imshow(data)
plt.show()
....
import matplotlib.image as mpimg
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0] # take the red channel (slice 0)
plt.imshow(mod_pic)  # rendered with the default colormap
plt.colorbar()       # try cmap='hot'
plt.show()
```
::::
:::: {.column width=25%}
![](figures/matplotlib_Figure_3.png)
\vspace{1cm}
![](figures/matplotlib_Figure_4.png)
::::
:::
\normalsize

## Input / output

For the analysis of measured data, efficient input / output plays an
important role. In numpy, `ndarrays` can be saved to and read from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension)
which contain the data, shape, dtype and other information required to
reconstruct the `ndarray` from the disk file.

\footnotesize
```python
r = np.random.default_rng()      # instantiate random number generator
a = r.random((4,3))              # random 4x3 array
np.save('myBinary.npy', a)       # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b into a single .npz archive
......
b = np.load('myBinary.npy')      # read content of myBinary.npy into b
```
\normalsize

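Reading the `.npz` archive back gives a dictionary-like object; a short
sketch using the file written above:

\footnotesize
```python
npz = np.load('myComp.npz') # open the archive written with np.savez
a, b = npz['a'], npz['b']   # access the stored arrays by their keywords
print(npz.files)            # list the stored array names: ['a', 'b']
```
\normalsize
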
The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` methods. Parameters controlling the
delimiter, line separator, file header and footer can be specified.

\footnotesize
```python
x = np.array([1,2,3,4,5,6,7])           # create ndarray
np.savetxt('myText.txt', x, fmt='%d')   # write array x to text file myText.txt
.....
y = np.loadtxt('myText.txt', dtype=int) # read content of myText.txt into y
```
\normalsize

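A sketch of these options (the file name `myData.csv` and the column labels
are chosen arbitrarily): a header line and a comma delimiter are set when
writing, and `loadtxt()` skips the `#` comment lines by default:

\footnotesize
```python
data = np.vstack([x, x**2]).T # two columns: x and x^2
np.savetxt('myData.csv', data, fmt='%.3f', delimiter=',',
           header='x,x_squared')            # header is written as a '#' comment
d = np.loadtxt('myData.csv', delimiter=',') # '#' lines are skipped on reading
```
\normalsize
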
## Exercise 1

i) Display a numpy array as a figure of a blue cross. The size should be 200
by 200 pixels. Use the array format (M, N, 3), where the first two indices
specify the pixel position and the last the RGB color from 0 to 255.
   - In addition, draw a red square at an arbitrary position in the figure.
   - Draw a circle in the center of the figure. Try to create a mask which
     selects the inner part of the circle using indexing.

\small
[Solution: 01_intro_ex_1a_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1a_sol.py) \normalsize

ii) Read the pixel data from the binary file horse.npy into a
numpy array. Display the data and the following transformations in 4
subplots: scaling and translation, compression in x and y, rotation
and mirroring.

\small
[Solution: 01_intro_ex_1b_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1b_sol.py) \normalsize

## Pandas

[\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in Python for
\textcolor{blue}{data manipulation and analysis}.

\vspace{0.4cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Offers data structures and operations for manipulating numerical tables with
integrated indexing

* Imports data from various file formats, e.g. comma-separated values, JSON,
SQL or Excel

* Tools for reading and writing data structures allow analyzing, filtering,
splitting, merging and joining

* Built on top of `NumPy`

* Visualizes the data with `matplotlib`

* Most machine learning tools support `pandas` $\rightarrow$
it is widely used to preprocess data sets for machine learning

## Pandas micro introduction

Goal: exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* `Series` -> a vector (list) of data elements of arbitrary type

* `DataFrame` -> a tabular arrangement of data elements whose type may
vary from column to column

Both allow cleaning the data by removing empty or `NaN` entries

\footnotesize
```python
import numpy as np
import pandas as pd # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8]) # create a Series of float64
r = pd.Series(np.random.randn(4))      # Series of random float64 numbers
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print(df) # print the DataFrame
                   A         B         C         D
2013-01-01  1.618395  1.210263 -1.276586 -0.775545
2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
2013-01-03 -0.359081  0.296019  1.541571  0.235337

new_s = s.dropna() # return a new Series without the NaN entries
```
\normalsize

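The same cleaning works for a `DataFrame`; a minimal sketch (values chosen
arbitrarily):

\footnotesize
```python
df2 = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})
df2.dropna()    # drop all rows that contain a NaN entry
df2.fillna(0.0) # or replace the NaN entries by a default value
```
\normalsize
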
##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* pandas data can be saved in different file formats (CSV, JSON, HTML, XML,
Excel, OpenDocument, HDF5, .....). `NaN` entries are kept
in the output file.

* csv file

\footnotesize
```python
df.to_csv("myFile.csv") # write the DataFrame df to a csv file
```
\normalsize

* HDF5 output

\footnotesize
```python
df.to_hdf("myFile.h5",key='df',mode='w') # write the DataFrame df to HDF5
s.to_hdf("myFile.h5", key='s',mode='a')
```
\normalsize

* Writing to an excel file

\footnotesize
```python
df.to_excel("myFile.xlsx", sheet_name="Sheet1")
```
\normalsize

* Deleting a file with data in python

\footnotesize
```python
import os
os.remove('myFile.h5')
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* read in data from various formats

* csv file

\footnotesize

```python
.......
df = pd.read_csv('heart.csv') # read csv data table
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     303 non-null    int64
 1   sex     303 non-null    int64
 2   cp      303 non-null    int64
print(df.head(5))    # prints the first 5 rows of the data table
print(df.describe()) # shows a quick statistic summary of your data
```
\normalsize

* Reading an excel file

\footnotesize
```python
df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
```
\normalsize
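
* Reading back the HDF5 file written above, a sketch using `pd.read_hdf`

\footnotesize
```python
df = pd.read_hdf("myFile.h5", "df") # read the DataFrame stored under key 'df'
```
\normalsize
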
\textcolor{olive}{There are many options for specifying the details of the IO.}

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Various functions exist to select and view data from pandas objects

* Display the index and the columns

\footnotesize

```python
df.index # show the datetime index of df
DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
              dtype='datetime64[ns]',freq='D')
df.columns # show the column info
Index(['A', 'B', 'C', 'D'], dtype='object')
```
\normalsize

* `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data

\footnotesize

```python
df.to_numpy() # one dtype for the entire array, not one per column!
[[-0.62660101 -0.67330526  0.23269168 -0.67403546]
 [-0.53033339  0.32872063 -0.09893568  0.44814084]
 [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
\normalsize

This does not include the index or column labels in the output

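Since a single dtype has to accommodate all columns, a DataFrame with mixed
column types falls back to Python objects; a minimal sketch (values chosen
arbitrarily):

\footnotesize
```python
df2 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}) # int and string columns
df2.to_numpy() # -> array([[1, 'x'], [2, 'y']], dtype=object)
```
\normalsize
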
* more on viewing

\footnotesize

```python
df.T                                  # transpose the DataFrame df
df.sort_values(by="B")                # sort by the values of column B
df.sort_index(axis=0,ascending=False) # sort by index in descending order
df.sort_index(axis=1,ascending=False) # display the columns in reverse order
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions

* get a named column as a Series

\footnotesize

```python
df["A"]         # selects column A from df, similar to df.A
df.iloc[:, 0:1] # slices column A as a DataFrame, like df.loc[:, ["A"]]
```
\normalsize

* select rows of a DataFrame

\footnotesize

```python
df[0:2]                   # selects rows 0 and 1 from df
df["20130102":"20130103"] # select by index labels, both endpoints are included!
df.iloc[2]                # select a row by its integer position
df.iloc[1:3, :]           # selects rows 1 and 2 from df
```
\normalsize

* select by label

\footnotesize

```python
df.loc["20130102":"20130103",["C","D"]] # selects rows 1 and 2, columns C and D
df.loc[dates[0], "A"] # selects a single value (scalar)
```
\normalsize

* select by lists of integer positions (as in `NumPy`)

\footnotesize

```python
df.iloc[[0, 2], [1, 3]] # select rows 0 and 2 and columns B and D
df.iloc[1, 1]           # get a single value explicitly
```
\normalsize

* select according to expressions

\footnotesize

```python
df.query('B<C')                     # select rows where B < C
df1 = df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects continued

* Boolean indexing

\footnotesize

```python
df[df["A"] > 0] # select the rows where the value in column A is > 0
df[df > 0]      # select values > 0 from the entire DataFrame
```
\normalsize

a more complex example

\footnotesize

```python
df2 = df.copy()                     # copy df
df2["E"] = ["eight","one","four"]   # add a column E
df2[df2["E"].isin(["two", "four"])] # select the rows whose entry in
                                    # column E is "two" or "four"
```
\normalsize

* Operations (in general exclude missing data)

\footnotesize

```python
df2[df2 > 0] = -df2  # all elements > 0 change sign
df.mean(0)           # column-wise mean (the number is the axis)
df.mean(1)           # row-wise mean
df.std(0)            # standard deviation along the given axis
df.cumsum()          # cumulative sum of each column
df.apply(np.sin)     # apply a function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply a lambda function column-wise
df + 10              # add the scalar 10 to every element
df - [1, 2, 10, 100] # column-wise subtraction of the values
df.corr()            # compute the pairwise correlation of columns
```
\normalsize

## Pandas - plotting data

[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are just two examples

* Plot random data in a histogram and a scatter plot

\footnotesize
```python
# create a DataFrame with normally distributed random data
df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
df = df + [1, 3, 8, 10] # shift the means to 1, 3, 8, 10
plt.figure()
df.plot.hist(bins=20)   # histogram of all 4 columns
g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
```
\normalsize

::: columns
:::: {.column width=35%}
![](figures/pandas_histogramm.png)
::::
:::: {.column width=35%}
![](figures/pandas_scatterplot.png)
::::
:::

## Pandas - plotting data

The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame of variable counts on the inputs

\footnotesize
```python
df = pd.DataFrame(            # create a DataFrame of 2 categories
    {"sex":   np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    })
pd.crosstab(df.sex, df.heart) # create a cross table of the possibilities
pd.crosstab(df.sex, df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
\normalsize

::: columns
:::: {.column width=42%}
![](figures/pandas_crosstabplot.png)
::::
:::

## Exercise 2

Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/heart.csv) into a DataFrame.
[\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)

\setbeamertemplate{itemize item}{\color{red}$\square$}

* Which columns do we have?

* Print the first 3 rows

* Print the statistics summary and the correlations

* Print the mean values for each column with and without disease

* Select the data according to `sex` and `target` (heart disease 0=no, 1=yes)

* Plot the `age` distribution of male and female in one histogram

* Plot the heart disease distribution according to chest pain type `cp`

* Plot `thalach` according to `target` in one histogram

* Plot `sex` and `target` in a histogram figure

* Correlate `age` and `max heart rate` according to `target`

* Correlate `age` and `cholesterol` according to `target`

\small
[Solution: 01_intro_ex_2_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_2_sol.py) \normalsize