 slides/.gitignore                                                  |   1 +
 slides/Makefile                                                    |  10 +
 slides/README.md                                                   |   2 +
 slides/copy_slides.sh                                              |   6 +
 slides/decision_trees.md                                           | 347 +
 slides/figures/03_ml_basics_galton_linear_regression_iminuit.pdf   | Bin
 slides/figures/03_ml_basics_log_regr_heart_disease.pdf             | Bin
 slides/figures/03_ml_basics_logistic_regression.pdf                | Bin
 slides/figures/L1vsL2.pdf                                          | Bin
 slides/figures/activation_functions.png                            | Bin
 slides/figures/adversarial_attack.png                              | Bin
 slides/figures/ai_history.png                                      | Bin
 slides/figures/ai_ml_dl.pdf                                        | Bin
 slides/figures/ann.png                                             | Bin
 slides/figures/anomaly_detection.png                               | Bin
 slides/figures/autoencoder_example.pdf                             | Bin
 slides/figures/bdt.png                                             | Bin
 slides/figures/book-murphy.png                                     | Bin
 slides/figures/book_deep_learning_for_physics_research.png         | Bin
 slides/figures/boston_house_prices.pdf                             | Bin
 slides/figures/cnn.png                                             | Bin
 slides/figures/cnn_conv_layer.png                                  | Bin
 slides/figures/cnn_fully_connected.png                             | Bin
 slides/figures/cnn_pooling.png                                     | Bin
 slides/figures/cnn_sliding_filter.png                              | Bin
 slides/figures/critical_temperature.pdf                            | Bin
 slides/figures/cross_val.png                                       | Bin
 slides/figures/decision_boundaries.png                             | Bin
 slides/figures/decision_trees_feature_space.png                    | Bin
 slides/figures/deep_learning_book.png                              | Bin
 slides/figures/deep_learning_with_python.png                       | Bin
 slides/figures/deepl.png                                           | Bin
 slides/figures/dnn.png                                             | Bin
 slides/figures/dropout.png                                         | Bin
 slides/figures/example_overtraining.png                            | Bin
 slides/figures/feature_transformation.png                          | Bin
 slides/figures/fisher.png                                          | Bin
 slides/figures/fisher_linear_decision_boundary.png                 | Bin
 slides/figures/gan.png                                             | Bin
 slides/figures/gradient_descent.png                                | Bin
 slides/figures/gradient_descent_cmp.png                            | Bin
 slides/figures/hands_on_machine_learning.png                       | Bin
 slides/figures/handwritten_digits.png                              | Bin
 slides/figures/heart_table.png                                     | Bin
 slides/figures/imagenet.png                                        | Bin
 slides/figures/imagenet_challenge.png                              | Bin
 slides/figures/iminuit_minos_scan-1.png                            | Bin
 slides/figures/iminuit_minos_scan-2.png                            | Bin
 slides/figures/iris_dataset.png                                    | Bin
 slides/figures/keras.png                                           | Bin
 slides/figures/knn.png                                             | Bin
 slides/figures/logistic_fct.png                                    | Bin
 slides/figures/loss_fct.png                                        | Bin
 slides/figures/magic_photo.png                                     | Bin
 slides/figures/magic_photo_small.png                               | Bin
 slides/figures/magic_shower_em_had.png                             | Bin
 slides/figures/magic_shower_em_had_small.png                       | Bin
 slides/figures/magic_shower_parameters.png                         | Bin
 slides/figures/magic_sketch.png                                    | Bin
 slides/figures/matplotlib_Figure_1.png                             | Bin
 slides/figures/matplotlib_Figure_2.png                             | Bin
 slides/figures/matplotlib_Figure_3.png                             | Bin
 slides/figures/matplotlib_Figure_4.png                             | Bin
 slides/figures/mini_boone_decisions_tree.png                       | Bin
 slides/figures/ml_example_spam.png                                 | Bin
 slides/figures/mlp.png                                             | Bin
 slides/figures/mnist.png                                           | Bin
 slides/figures/monitoring_overtraining.png                         | Bin
 slides/figures/mva.png                                             | Bin
 slides/figures/mva_nn.png                                          | Bin
 slides/figures/neuron.png                                          | Bin
 slides/figures/nn_decision_boundary.png                            | Bin
 slides/figures/pandas_crosstabplot.png                             | Bin
 slides/figures/pandas_histogramm.png                               | Bin
 slides/figures/pandas_scatterplot.png                              | Bin
 slides/figures/pdf_from_2d_histogram.png                           | Bin
 slides/figures/perceptron_photo.png                                | Bin
 slides/figures/perceptron_retina.png                               | Bin
 slides/figures/perceptron_weighted_sum.png                         | Bin
 slides/figures/perceptron_with_threshold.png                       | Bin
 slides/figures/regularization.png                                  | Bin
 slides/figures/relu.png                                            | Bin
 slides/figures/rootOptions.png                                     | Bin
 slides/figures/scikit-learn.png                                    | Bin
 slides/figures/sigmoid.png                                         | Bin
 slides/figures/signal_background_distr.png                         | Bin
 slides/figures/signal_purity.png                                   | Bin
 slides/figures/stochastic_gradient_descent.png                     | Bin
 slides/figures/supervised_learning_car_plane.png                   | Bin
 slides/figures/supervised_nutshell.png                             | Bin
 slides/figures/tensorflow.png                                      | Bin
 slides/figures/tf_playground.png                                   | Bin
 slides/figures/tree_pruning_slides.png                             | Bin
 slides/figures/underfitting_overfitting.pdf                        | Bin
 slides/figures/underfitting_overfitting_001.png                    | Bin
 slides/figures/videogame.png                                       | Bin
 slides/figures/xor.png                                             | Bin
 slides/figures/xor_like_data.pdf                                   | Bin
 slides/fit_intro.md                                                | 563 +
 slides/intro_python.md                                             | 830 +
slides/.gitignore
@@ -0,0 +1 @@
.DS_Store
slides/Makefile
@@ -0,0 +1,10 @@
# make creates pdf files of all newly edited .md files

SRCS := $(wildcard *.md)
PDF := $(SRCS:%.md=%.pdf)

OPT := --pdf-engine=xelatex --variable mainfont="Helvetica" --variable sansfont="Helvetica" -t beamer -s -fmarkdown-implicit_figures --template=template.beamer --highlight-style=kate

all: ${PDF}

%.pdf: %.md
	pandoc $(OPT) --output=$@ $<
slides/README.md
@@ -0,0 +1,2 @@
Pandoc slides example following the style of [Stefan Wunsch's CERN IML workshop presentation](https://github.com/stwunsch/iml_keras_workshop) on [Keras](https://keras.io/) (see slides folder)
slides/copy_slides.sh
@@ -0,0 +1,6 @@
# slides (do chgrp machlearn <file> later)
# scp CIPpoolAccess.PDF reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 03_ml_basics.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
# scp 04_decision_trees.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
scp 05_neural_networks.pdf reygers@rho0:public_html/lectures/2021/ml/transparencies/
slides/decision_trees.md
@@ -0,0 +1,347 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 4. Decision Trees

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Exercises

* Exercise 1: Compare different decision tree classifiers
    * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
    * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values

## Decision trees

\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}

\begin{center}
Leaf nodes classify events as either signal or background
\end{center}

## Decision trees: Rectangular volumes in feature space

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}

* Easy to interpret and visualize: space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?

## Finding optimal cuts

Separation between signal and background is often measured with the Gini index (or Gini impurity):

$$ G = p (1-p) $$

Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$

\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}

\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$

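## Gini index: toy example

A minimal numpy sketch of the split criterion defined above, with made-up signal/background weights (not from any real analysis):

\footnotesize
```python
import numpy as np

def gini(w_sig, w_bkg):
    """Gini impurity G = p(1-p) from summed signal/background weights."""
    p = w_sig / (w_sig + w_bkg)          # purity of the node
    return p * (1 - p)

def weight(node):                        # W_X = sum of all weights in node X
    return sum(node)

# toy nodes: (summed signal weights, summed background weights)
A = (80.0, 60.0)                         # parent node
B = (70.0, 10.0)                         # left child after the cut
C = (10.0, 50.0)                         # right child after the cut

delta = weight(A) * gini(*A) - weight(B) * gini(*B) - weight(C) * gini(*C)
print(f"separation gain Delta = {delta:.1f}")   # positive: the cut improves separation
```
\normalsize
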
## Gini impurity and other purity measures

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}

## Decision tree pruning

::: columns
:::: {.column width=50%}

When to stop growing a tree?

* When all nodes are essentially pure?
* Well, that's overfitting!

\vspace{3ex}

Pruning

* Cut back a fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves

::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png}
\end{figure}
::::
:::

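## Pruning in scikit-learn: a sketch

One concrete realization of pruning is scikit-learn's cost-complexity pruning via the `ccp_alpha` parameter; a minimal sketch on a synthetic data set (purely for illustration):

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full   = DecisionTreeClassifier(random_state=0)                  # grown until nodes are pure
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # cost-complexity pruned
for tree in (full, pruned):
    tree.fit(X_train, y_train)
    print(tree.get_n_leaves(), tree.score(X_test, y_test))       # far fewer leaves, similar score
```
\normalsize
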
## Single decision trees: Pros and cons

\textcolor{green}{Pros:}

* Require little data preparation (unlike neural networks)
* Can use continuous and categorical inputs

\vfill

\textcolor{red}{Cons:}

* Danger of overfitting the training data
* Sensitive to fluctuations in the training data
* Hard to find the global optimum
* When to stop splitting?

## Ensemble methods: Combine weak learners

::: columns
:::: {.column width=70%}
* Bootstrap Aggregating (Bagging)
    * Sample training data (with replacement) and train a separate model on each of the derived training sets
    * Classify example with majority vote, or compute average output from each tree as model output (see the sketch on the next slide)
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_\mathrm{trees}} y_i(\vec x) $$
::::
:::
\vfill
::: columns
:::: {.column width=70%}
* Boosting
    * Train $N$ models in sequence, giving more weight to examples not correctly classified by the previous model
    * Take a weighted average to classify examples
::::
:::: {.column width=30%}
$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$
::::
:::

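## Bagging: a minimal sketch

A bare-bones illustration of bagging with scikit-learn trees on a synthetic data set (in practice one would use `sklearn.ensemble.BaggingClassifier` directly):

\footnotesize
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=500, random_state=1)

trees = []
for i in range(10):
    Xb, yb = resample(X, y, random_state=i)    # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))

# average the tree outputs as the ensemble prediction y(x)
y_avg = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```
\normalsize
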
## Random forests

* "One of the most widely used and versatile algorithms in data science and machine learning"
  \tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize
\vfill
* Use bagging to select a random example subset
\vfill
* Train a tree, but only use a random subset of the features at each split
    * this reduces the correlation between different trees
    * makes the decision more robust to missing data

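## Random forests: a minimal sketch

As an illustration (synthetic data again), scikit-learn's random forest combines bagging with the random feature subset, steered by `max_features`:

\footnotesize
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features="sqrt": consider only sqrt(n_features) features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```
\normalsize
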
## Boosted decision trees: Idea

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/bdt.png}
\end{figure}

## AdaBoost (short for Adaptive Boosting)

Initial training sample

\begin{center}
\begin{tabular}{l l}
$\vec x_1, ..., \vec x_n$: & multivariate event data \\
$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\
$w_1^{(1)}, ..., w_n^{(1)}$: & event weights
\end{tabular}
\end{center}

with equal weights normalized as

$$ \sum_{i=1}^n w_i^{(1)} = 1 $$

Train first classifier $f_1$:

\begin{center}
\begin{tabular}{l l}
$f_1(\vec x_i) > 0$ & classify as signal \\
$f_1(\vec x_i) < 0$ & classify as background
\end{tabular}
\end{center}

## AdaBoost: Updating event weights

Define training sample $k+1$ from training sample $k$ by updating the weights:

$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$

\footnotesize
\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$}
\normalsize

The weight is increased if the event was misclassified by the previous classifier

$\to$ "The next classifier should pay more attention to misclassified events"

\vfill
At each step the classifier $f_k$ minimizes the error rate:

$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0),
\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$

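## AdaBoost: one reweighting step in numpy

A toy sketch of a single reweighting step, using the classifier score $\alpha_k$ defined on the next slide (labels and decisions are made up):

\footnotesize
```python
import numpy as np

y   = np.array([+1, +1, -1, -1])     # true labels
f_x = np.array([+1, +1, -1, +1])     # decisions of f_k: last event misclassified
w   = np.full(4, 0.25)               # equal initial weights

eps   = np.sum(w * (y * f_x <= 0))   # weighted error rate (here 0.25)
alpha = np.log((1 - eps) / eps)      # classifier score alpha_k

w_new = w * np.exp(-alpha * f_x * y / 2)
w_new /= w_new.sum()                 # normalization Z_k: weights sum to 1
print(w_new)                         # the misclassified event now carries more weight
```
\normalsize
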
## AdaBoost: Assigning the classifier score

Assign a score to each classifier according to its error rate:
$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$

\vfill

Combined classifier (weighted average):
$$ f(\vec x) = \sum_{k=1}^K \alpha_k f_k(\vec x) $$

## Gradient boosting

Basic idea:

* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on

\vfill

In slightly more detail:

* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$

\color{black}

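## Gradient boosting: residual fitting in a few lines

The residual-fitting loop can be written down directly; a minimal sketch with scikit-learn regression trees on toy data:

\footnotesize
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=(200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)

F = np.zeros(len(y))                                      # start from F_0 = 0
for m in range(5):
    h = DecisionTreeRegressor(max_depth=2).fit(x, y - F)  # fit h_m to the residuals
    F += h.predict(x)                                     # F_{m+1} = F_m + h_m
    print(m, round(np.mean((y - F) ** 2), 4))             # training MSE decreases
```
\normalsize
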
## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize

\vfill

Superconductivity data set:

Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize

\vfill

From the abstract:

We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.

\vfill

\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize

## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)

::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb

XGBreg = xgb.sklearn.XGBRegressor()

XGBreg.fit(X_train, y_train)

y_pred = XGBreg.predict(X_test)

from sklearn.metrics import mean_squared_error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```

\textcolor{gray}{This gives:}

`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::

## Exercise 1: Compare different decision tree classifiers

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)

\vspace{5ex}

Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline

\vspace{2ex}

Is there a classifier that clearly performs best?

## Exercise 2: Apply XGBoost classifier to MAGIC data set

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize

\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize

\small
a) Plot the predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use plot_importance from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three performance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize

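## Exercise 2: plotting hint

As a starting point for b), a minimal sketch of the plotting call (assuming the `XGBclassifier` defined above has already been fitted with `XGBclassifier.fit(X_train, y_train)`):

\footnotesize
```python
from xgboost import plot_importance

# importance_type can be "weight", "gain" or "cover"
plot_importance(XGBclassifier, importance_type="gain", max_num_features=10)
```
\normalsize
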
## Exercise 3: Feature importance

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize

\vspace{3ex}

Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.

## Exercise 4: Interpret a classifier with SHAP values

SHAP (SHapley Additive exPlanations) is a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept used in cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.

\vfill

Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.

a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)

b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?

c) Do the same for the superconductivity data set. What are the three most important features?

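## Exercise 4: SHAP starting point

A minimal, self-contained sketch of the SHAP workflow for a tree model (synthetic data, purely for illustration; for the exercise, use the MAGIC classifier instead):

\footnotesize
```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)    # explainer for tree-based models
shap_values = explainer.shap_values(X)   # one value per event and feature
shap.summary_plot(shap_values, X)        # summary plot of feature importance
```
\normalsize
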
slides/fit_intro.md
@@ -0,0 +1,563 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 2. Data modeling and fitting

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Data modeling and fitting - introduction

Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to obtain inferences that reveal
the underlying structure of the data.

* There are two approaches to statistical data modeling
    * Hypothesis testing: is our data compatible with a certain model?
    * Determination of model parameters: use the data to determine the parameters
      of a (theoretical) model

* For the determination of model parameters
    * Analysis of data distributions $\rightarrow$ mean, variance,
      median, FWHM, ... \newline
      allows for an approximate determination of the model parameters

    * Data fitting with the least square method $\rightarrow$ an iterative
      process which minimizes the deviation of a model described by parameters
      from the data. This determines the optimal values and uncertainties
      of the parameters.

    * Maximum likelihood fitting $\rightarrow$ find the set of model parameters
      which most likely describes the data by maximizing the probability
      distributions.

Parameter determination by minimization is an integral part of machine
learning approaches; here a system learns patterns and predicts
related ones. This is the focus of the upcoming days.

## Data modeling and fitting - introduction

Data analysis is a process of understanding and modeling measured
data. The goal is to find patterns and to obtain inferences that reveal
the underlying structure of the data.

* There are two approaches to statistical data modeling
    * Hypothesis testing: is our data compatible with a certain model?
    * Determination of model parameters: use the data to determine the parameters
      of a (theoretical) model

* For the determination of model parameters
    * Analysis of data distributions $\rightarrow$ mean, variance,
      median, FWHM, ... \newline
      allows for an approximate determination of the model parameters

\setbeamertemplate{itemize subitem}{\color{red}\tiny$\blacksquare$}

    * \textcolor{blue}{Data fitting with the least square method
      $\rightarrow$ an iterative
      process which minimizes the deviation of a model described by parameters
      from the data. This determines the optimal values and uncertainties
      of the parameters.}

\setbeamertemplate{itemize subitem}{\color{blue}\tiny$\blacktriangleright$}

    * Maximum likelihood fitting $\rightarrow$ find the set of model parameters
      which most likely describes the data by maximizing the probability
      distributions.

Parameter determination by minimization is an integral part of machine
learning approaches; here a system learns patterns and predicts
related ones. This is the focus of the upcoming days.

## Least Square (LS) Method (1)

The method determines the \textcolor{blue}{optimal parameters of functions
fitted to Gaussian distributed measurements}.

Let's consider a sample of $n$ measurements $y_{i}$ and a parametrized
description of the measurement $\eta_{i} = f(x_{i} | \theta)$
with a parameter set $\theta = \theta_{1}, \theta_{2}, ..., \theta_{k}$,
independent variables $x_{i}$ and measurement errors $\sigma_{i}$.

The parameter set should be determined such that
\begin{equation*}
\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} = \sum \limits_{i=1}^{n} \frac{(y_i- f(x_i|\theta))^2}{\sigma_i^2} \longrightarrow \, \text{minimal}}
\end{equation*}
In case of correlated measurements the covariance matrix of the $y_{i}$ has to
be taken into account. This is accomplished by defining a weight matrix from
the covariance matrix of the input data. A decorrelation of the input data
should be considered.
\vspace{0.2cm}

$S$ follows a $\chi^{2}$-distribution with $(n-k)$ degrees of freedom.

## Least Square (LS) Method (2)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* Example LS-method
\vspace{0.2cm}

Often the fit function $f(x, \theta)$ is linear in
$\theta = \theta_{1}, \theta_{2}, ..., \theta_{k}$
\vspace{0.2cm}

$f(x | \theta) = \theta_{1} f_{1}(x) + .... + \theta_{k} f_{k}(x)$
\vspace{0.2cm}

If the model is a straight line and our parameters are $\theta_{1}$ and
$\theta_{2}$ $(f_{1}(x) = 1,$ $f_{2}(x) = x)$ we have
$f(x | \theta) = \theta_{1} + \theta_{2} x$
\vspace{0.2cm}

The LS equation is
\vspace{0.2cm}

$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum
\limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} - x_{i}
\theta_{2})^2}{\sigma_i^2 }}$ \hspace{0.4cm} and with
\vspace{0.2cm}

$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{-2
(y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{-2 x_i (y_i - \theta_1 - x_i \theta_2)}{\sigma_i^2} = 0$
\vspace{0.2cm}

the parameters $\theta_{1}$ and $\theta_{2}$ can be determined.

\vspace{0.2cm}
\textcolor{olive}{In case of linear fit functions solutions can be found by matrix inversion (see the sketch on the next slide)}

\vfill

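## Linear fit by matrix inversion: a sketch

A minimal numpy sketch of the matrix-inversion solution for the straight line (toy numbers, uncorrelated measurement errors):

\footnotesize
```python
import numpy as np

x   = np.array([1., 2., 3., 4., 5.])        # independent variable
y   = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # measurements
sig = np.full(5, 0.2)                       # measurement errors

A = np.vstack([np.ones_like(x), x]).T       # design matrix: columns f_1(x)=1, f_2(x)=x
W = np.diag(1.0 / sig**2)                   # weight matrix
cov = np.linalg.inv(A.T @ W @ A)            # covariance matrix of the parameters
theta = cov @ A.T @ W @ y                   # LS estimate (theta_1, theta_2)
print(theta, np.sqrt(np.diag(cov)))         # parameters and their uncertainties
```
\normalsize
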
## Least Square (LS) Method (3)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Use of a nonlinear fit function $f(x, \theta)$ like \hspace{0.4cm}
  $f(x | \theta) = \theta_{1} \cdot e^{-\theta_{2} x}$
\vspace{0.2cm}

results in the LS equation
\vspace{0.2cm}

$\color{blue}{S = \sum \limits_{i=1}^{n} \frac{(y_i-\eta_i)^2}{\sigma_i^2} } \color{black} {= \sum \limits_{i=1}^{n} \frac{(y_{i} - \theta_{1} \cdot e^{-\theta_{2} x_{i}})^2}{\sigma_i^2 }}$ \hspace{0.4cm}
\vspace{0.2cm}

which we have to minimize
\vspace{0.2cm}

$\frac{\partial S}{\partial \theta_1} = \sum\limits_{i=1}^{n} \frac{ 2 e^{-2 \theta_2 x_i} ( \theta_1 - y_i e^{\theta_2 x_i} )} {\sigma_i^2 } = 0$ \hspace{0.4cm} and \hspace{0.4cm}
$\frac{\partial S}{\partial \theta_2} = \sum\limits_{i=1}^{n} \frac{ 2 \theta_1 x_i e^{-2 \theta_2 x_i} (y_i e^{\theta_2 x_i} - \theta_1)} {\sigma_i^2 } = 0$

\vspace{0.4cm}

In a nonlinear system, the LS Ansatz leads to derivatives which are
functions of the independent variable and the parameters $\color{red}\rightarrow$ \textcolor{olive}{no closed solutions}
\vspace{0.4cm}

In general, we have gradient equations which don't have closed solutions.
There are a couple of methods, including approximations, which together
with numerical methods allow one to find a global minimum: the Gauss–Newton
algorithm, the Levenberg–Marquardt algorithm, gradient descent methods and
also direct search methods.

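## Nonlinear LS fit: numerical minimization

For such nonlinear problems one typically calls a numerical minimizer; a short sketch with `scipy.optimize.curve_fit` on toy data (`curve_fit` uses a Levenberg–Marquardt-type algorithm by default):

\footnotesize
```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, th1, th2):                      # nonlinear model theta_1 * exp(-theta_2 x)
    return th1 * np.exp(-th2 * x)

x   = np.linspace(0.1, 4.0, 20)
y   = f(x, 2.0, 0.8) + np.random.default_rng(1).normal(0., 0.05, 20)
sig = np.full(20, 0.05)

popt, pcov = curve_fit(f, x, y, sigma=sig, absolute_sigma=True, p0=[1., 1.])
print(popt, np.sqrt(np.diag(pcov)))      # fitted parameters and uncertainties
```
\normalsize
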
## Minuit - a program package for minimization (1)

In general, data fitting and also solving machine learning algorithms lead
to a minimization problem of functions. Between 1975 and 1980 F. James (CERN) developed
a FORTRAN-based package, [\textcolor{violet}{MINUIT}](http://seal.web.cern.ch/seal/documents/minuit/mntutorial.pdf), which is a framework to handle
multiparameter minimization and compute the best-fit parameter values and
uncertainties, including correlations between the parameters.
\vspace{0.2cm}

The user provides a minimization function
$F(X,P)$ with the parameter space $P=(p_1,....p_k)$ and
variable space $X$ (also multi-dimensional). There is an interface via
functions which influence the minimization process. MINUIT provides
[\textcolor{violet}{error calculations}](http://seal.web.cern.ch/seal/documents/minuit/mnerror.pdf) including correlations for the parameter space by evaluating the shape of the function in some neighbourhood of the minimum.
\vspace{0.2cm}

The package
now has a newer object-oriented implementation as the [\textcolor{violet}{Minuit2 library}](https://root.cern.ch/doc/master/Minuit2Page.html), written
in C++.
\vspace{0.2cm}

During the minimization $F(X,P)$ is evaluated for various $X$. For the
choice of $P=(p_1,....p_k)$ different methods are used.

## Minuit - a program package for minimization (2)

\vspace{0.4cm}
\textcolor{olive}{SEEK}: Searches for the minimum with Monte Carlo methods, mostly used at the start
of the minimization with unknown starting values. It is not a converging
algorithm.
\vspace{0.2cm}

\textcolor{olive}{SIMPLX}:
Uses the simplex method of Nelder and Mead. Function values are compared
in the parameter space. Via step size control the minimum is approached.
Parameter errors are only approximate; no covariance matrix is calculated.
\vspace{0.2cm}

<!---
A simplex is the smallest n-dimensional figure with n+1 corners. By reflecting
one point in the hyperplane of the other points, the simplex adapts itself to the
function plane.
-->

\textcolor{olive}{MIGRAD}:
Uses an algorithm of R. Fletcher, which takes the function and the gradient
to approach the minimum with a variable metric method. An error matrix and
correlation coefficients are available.
\vspace{0.2cm}

\textcolor{olive}{HESSE}:
Calculates the Hessian matrix of second derivatives and determines the
covariance matrix.
\vspace{0.2cm}

\textcolor{olive}{MINOS}:
Calculates (asymmetric) errors using likelihood profiles.
The algorithm for finding the positive and negative MINOS errors for parameter
$n$ consists of varying $n$, each time minimizing $F(X,P)$ with respect to
all the others.
\vspace{0.2cm}

## Minuit - a program package for minimization (3)

\vspace{0.4cm}

Fit process with the Minuit package
\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The individual steps described above can be called several times and in different order during the minimization process.

* Each of the parameters $p_i$ of $P=(p_1,....p_k)$ can be fixed and
  released during the minimization steps.

* Problems are expected in models with strong correlation between
  parameters $\rightarrow$ change the model to uncorrelated definitions

* Local minima, edges/steps or undefined ranges in $F(X,P)$ are problematic
  $\rightarrow$ simplify your model

\vspace{3cm}

## Minuit2 - The iminuit package

\vspace{0.4cm}

[\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) is
a Jupyter-friendly Python interface for the Minuit2 C++ library.
\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The class `iminuit.Minuit` instantiates the minuit object. The minimizer
  function is given as an argument. Basic steering of the fit,
  like setting start parameters, error definition and print level, is also
  done here.

\footnotesize
```python
from iminuit import Minuit
def fcn(x, y, z):   # definition of the minimizer function
    return (x - 2) ** 2 + (y - x) ** 2 + (z - 4) ** 2
m = Minuit(fcn, x=0, y=0, z=0, errordef=1, print_level=1)
```
\normalsize

* Several methods determine the interaction with the fitting process, e.g. calls
  to `migrad`, `hesse` or printing of parameters and errors

\footnotesize
```python
......
m.migrad()                  # run optimiser
print(m.values, m.errors)   # print results
m.hesse()                   # run covariance estimator
```
\normalsize

## Minuit2 - iminuit example

\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* The function `fcn` describes the model with parameters to be determined by
  the data. `fcn` is minimal when the model parameters agree best with the data.
  `fcn` has positional arguments, one for each fit parameter. `iminuit`
  example fit:

  [\textcolor{violet}{02\_fit\_exp\_fit\_iMinuit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_exp_fit_iMinuit.py)

\footnotesize
```python
......
x = np.array([....],dtype='d')   # measurements x
y = np.array([....],dtype='d')   # measurements y
dy = np.array([....],dtype='d')  # error in y
def xp(a, b, c):
    return a * np.exp(b*x) + c
# least-squares function = sum of data residuals squared
def fcn(a,b,c):
    return np.sum((y - xp(a,b,c)) ** 2 / dy ** 2)
# limit the range of b and fix parameter c
m = Minuit(fcn,a=1,b=-0.7,c=1,limit_b=(-1,0.1),fix_c=True)
m.migrad()            # run minimizer
m.fixed["c"] = False  # release parameter c
m.migrad()            # rerun minimizer
```
\normalsize

* It might be useful to fix parameters or limit their range for some applications

## Minuit2 - iminuit (3)

\vspace{0.2cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Results and control information of the fit can be printed and accessed
  in the program.

\footnotesize
```python
......
m = Minuit(fcn,....,print_level=1)  # set flag in the initializer
m.migrad()                          # run minimizer
a_fit = m.values['a']               # get parameter value a
a_fit_error = m.errors['a']         # get parameter error of a
print(m.values, m.errors)           # print results
```
\normalsize

* After processing Hesse, covariance and correlation information of the
  fit is available

\footnotesize
```python
......
m.hesse()                    # run covariance estimator
m.matrix()                   # get covariance matrix
m.matrix(correlation=True)   # get full correlation matrix
cov = m.np_matrix()          # save matrix to numpy
cor = m.np_matrix(correlation=True)
print(cor[0, 1])             # print correlation between parameter 1 and 2
```
\normalsize

## Minuit2 - iminuit (4)

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Minos provides asymmetric uncertainty intervals and parameter contours by
  scanning one parameter and minimizing the function with respect to all other
  parameters for each scan point. Results are displayed with `matplotlib`.

\footnotesize
```python
......
m.minos()
print(m.get_merrors()['a'])
m.draw_mnprofile('b')
m.draw_mncontour('a', 'b', nsigma=4)
```
\normalsize
::: columns
:::: {.column width=40%}
![](figures/iminuit_minos_scan-1.png)
::::
:::: {.column width=40%}
![](figures/iminuit_minos_scan-2.png)
::::
:::

## Exercise 3

Plot the following data with matplotlib as in the iminuit example:

\footnotesize
```
x: 0.2,0.4,0.6,0.8,1.,1.2,1.4,1.6,1.8,2.,2.2,2.4,2.6,2.8,3.,3.2,
   3.4,3.6,3.8,4.
y: 0.04,0.021,0.035,0.03,0.029,0.019,0.024,0.018,0.019,0.022,0.02,
   0.025,0.018,0.024,0.019,0.021,0.03,0.019,0.03,0.024
dy: 1.792,1.695,1.541,1.514,1.427,1.399,1.388,1.270,1.262,1.228,1.189,
    1.182,1.121,1.129,1.124,1.089,1.092,1.084,1.058,1.057
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}

* In the example iminuit fit `02_fit_exp_fit_iMinuit.ipynb`, replace the
  exponential function by a 3rd-order polynomial and perform the fit

* Compare the correlation of the parameters of the exponential and
  the polynomial fit

* What defines the fit quality? Give an estimate.

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_3\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_3_sol.py) \normalsize

## Exercise 4

Plot the following data with matplotlib:

\footnotesize
```
x: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
dx: 0.1,0.1,0.5,0.1,0.5,0.1,0.5,0.1,0.5,0.1
y: 1.1,2.3,2.7,3.2,3.1,2.4,1.7,1.5,1.5,1.7
dy: 0.15,0.22,0.29,0.39,0.31,0.21,0.13,0.15,0.19,0.13
```
\normalsize
\setbeamertemplate{itemize item}{\color{red}$\square$}

* Perform a fit with iminuit. Which model do you use?

* Plot the resulting fit function in the graph with the data

* Print the covariance matrix. Can we improve the errors?

* Can you draw a contour plot of 2 of the fit parameters?

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_4\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_4_sol.py) \normalsize

## PyROOT

[\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) is the Python binding for the C++ data analysis toolkit [\textcolor{violet}{ROOT}](https://root.cern/), developed with and for the LHC community. You can access the full
ROOT functionality from Python while
benefiting from the performance of the ROOT C++ libraries. The PyROOT bindings
are automatic and dynamic and are able to interoperate with widely-used Python
data-science libraries such as `NumPy`, `pandas`, `SciPy`, `scikit-learn` and `tensorflow`.

* ROOT/PyROOT can be installed easily within anaconda3 (ROOT version 6.22.02
  or later) or is available in the
  [\textcolor{violet}{CIP jupyter2 Hub}](https://jupyter2.kip.uni-heidelberg.de/)

* Tools for statistical analysis, a math library with optimized algorithms,
  multivariate analysis, visualization and simulation of data

* Storing data, including objects and classes, with compression in files is a
  very powerful aspect for any data analysis project

* Within PyROOT, Minuit2 can be accessed easily, either with predefined functions
  or with your own function definition

* For advanced statistical analyses and data modeling, likelihood fitting with
  the packages **RooFit** and **RooStats** is available

##

* Example: read the invariant mass measurements of a $D^0$ from a text file
  and determine $\mu$ and $\sigma$ \hspace{1.0cm} \small
  [\textcolor{violet}{02\_fit\_histFit.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_histFit.py)
  \normalsize

\footnotesize
```python
import numpy as np
import math
from ROOT import TCanvas, TFile, TH1D, TF1, TMinuit, TFitResult
data = np.genfromtxt('D0Mass.txt', dtype='d')  # read data from text file
c = TCanvas('c','D0 Mass',200,10,700,500)      # instantiate output canvas
d0 = TH1D('d0','D0 Mass',200,1700.,2000.)      # instantiate histogram
for x in data :                                # fill data into histogram d0
    d0.Fill(x)
def pyf_tf1_params(x, p):                      # define fit function
    return p[0] * math.exp (-0.5 * ((x[0] - p[1])**2 / p[2]**2))
func = TF1("func",pyf_tf1_params,1840.,1880.,3)
# func = TF1("func",'gaus',1840.,1880.)        # use predefined function
func.SetParameters(500.,1860.,5.5)             # set start parameters
myfit = d0.Fit(func,"S")                       # fit function to the histogram data
# parameter 1 is the mean of the Gaussian
print ("Fit results: mean=",myfit.Parameter(1)," +/- ",myfit.ParError(1))
c.Draw()                                       # draw canvas
myfile = TFile('myOutFile.root','RECREATE')    # open a ROOT file for output
c.Write()                                      # write canvas
d0.Write()                                     # write histogram
myfile.Close()                                 # close file
```
\normalsize

##

* Fit Options
\vspace{0.1cm}

::: columns
:::: {.column width=2%}
::::
:::: {.column width=98%}
![](figures/rootOptions.png)
::::
:::

## Exercise 5

Read the text file [\textcolor{violet}{FitTestData.txt}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/FitTestData.txt) and draw a histogram using PyROOT.
\setbeamertemplate{itemize item}{\color{red}$\square$}

* Determine the mean and sigma of the signal distribution. Which function do
  you use for fitting?

* The option S fills the result object.

* Try to improve the errors of the fit values with MINOS using the option E,
  and also try the option M to scan for a new minimum; option V provides more
  output.

* Fit the background outside the signal region; use the option R+ to add the
  function to your fit

\small
Solution: [\textcolor{violet}{02\_fit\_ex\_5\_sol.py}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/02_fit_ex_5_sol.py) \normalsize

## iPython Examples for Fitting

The different Python packages are used in
\textcolor{blue}{example iPython notebooks}
to demonstrate the fitting of a third order polynomial to the same data,
available as numpy arrays.

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* LSQ fit of a polynomial to data using Minuit2 with
  \textcolor{blue}{iminuit} and a \textcolor{blue}{matplotlib} plot:

  \small
  [\textcolor{violet}{02\_fit\_iminuitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_iminuitFit.ipynb)
  \normalsize

* Graph fitting with \textcolor{blue}{pyROOT} with options, using a Python
  function, including a confidence level plot:

  \small
  [\textcolor{violet}{02\_fit\_fitGraph.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_fitGraph.ipynb)
  \normalsize

* Graph fitting with \textcolor{blue}{numpy} and confidence level
  plotting with \textcolor{blue}{matplotlib}:

  \small
  [\textcolor{violet}{02\_fit\_numpyFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_numpyFit.ipynb)
  \normalsize

* Graph fitting with a polynomial fit of \textcolor{blue}{scikit-learn} and
  plotting with \textcolor{blue}{matplotlib}:

  \small
  [\textcolor{violet}{02\_fit\_scikitFit.ipynb}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/examples/02_fit_scikitFit.ipynb)
  \normalsize
slides/intro_python.md
@@ -0,0 +1,830 @@
---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 1. Introduction to python

author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Outline of the $1^{st}$ day

* Technical instructions for your interactions with the CIP pool, like
    * using the jupyter hub
    * using python locally in your own Linux environment (anaconda)
    * accessing the CIP pool from your own Windows or Linux system
    * transferring data from and to the CIP pool

  Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.pdf)\normalsize

* Summary of NumPy

* Plotting with matplotlib

* Input / output of data

* Summary of pandas

* Fitting with iminuit and pyROOT

## A glimpse into python classes

The following Python libraries are important for data analysis and machine
learning and will be used during the course:

* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
  multi-dimensional arrays and matrices, along with high-level
  mathematical functions to operate on these arrays

* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library

* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
  mathematical algorithms for minimization, regression,
  Fourier transformation, linear algebra and image processing

* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
  python wrapper to the data fitting toolkit
  [\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
  developed at CERN by F. James in the 1970s

* [\textcolor{violet}{pyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
  ROOT used at the LHC

* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
  python, which makes extensive use of NumPy for high-performance
  linear algebra algorithms

## NumPy

\textcolor{blue}{NumPy} (Numerical Python) is an open source Python library,
which contains multidimensional array and matrix data structures and methods
to efficiently operate on these. The core object is
a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
with arrays and matrices} due to the extensive usage of compiled code.

* It is heavily used in numerous scientific python packages
* `ndarray`'s have a fixed size at creation $\rightarrow$ changing the size
  leads to recreation
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* See for a summary, e.g.
  \small [\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize

\vfill

::: columns
:::: {.column width=35%}

with plain Python

`c = []`

`for i in range(len(a)):`

`    c.append(a[i]*b[i])`

::::
:::: {.column width=35%}

with NumPy

`c = a * b`

::::
:::

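## NumPy - why vectorization pays off

A quick (unscientific) timing comparison of the two variants above, on two large made-up arrays:

\footnotesize
```python
import time
import numpy as np

a = np.random.default_rng(0).random(1_000_000)
b = np.random.default_rng(1).random(1_000_000)

t0 = time.perf_counter()
c = [a[i] * b[i] for i in range(len(a))]   # explicit python loop
t1 = time.perf_counter()
c = a * b                                  # vectorized numpy product
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.3f} s, numpy: {t2 - t1:.4f} s")
```
\normalsize
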
<!---
It seems we need to indent by hand.
I don't manage to align under the bullet text.
If we do it with columns the vertical space around code sections is not good.
If we do it without code sections the vertical space is ok, but there is no
code highlighting.
See the different versions of the same page in the following.
-->

## NumPy - array basics

* numpy arrays build a grid of \textcolor{blue}{same type} values, which are indexed.
  The *rank* is the dimension of the array.
  There are methods to create and preset arrays.

\footnotesize

```python
import numpy as np
myA = np.array([2, 5, 11])         # create rank 1 array (vector like)
type(myA)                          # <class 'numpy.ndarray'>
myA.shape                          # (3,)
print(myA[2])                      # 11, access 3rd element
myA[0] = 12                        # set 1st element to 12
myB = np.array([[1,5],[7,9]])      # create rank 2 array
myB.shape                          # (2,2)
print(myB[0,0],myB[0,1],myB[1,1])  # 1 5 9
myC = np.arange(6)                 # create rank 1 array set to 0 - 5
myC.reshape(2,3)                   # returns a (2,3) array; myC itself is unchanged

zero = np.zeros((2,5))             # 2 rows, 5 columns, set to 0
one  = np.ones((2,2))              # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)           # 2 rows, 2 columns, set to 5
e    = np.eye(2)                   # create 2x2 identity matrix
```
\normalsize

## NumPy - array indexing (1)

* select slices of a numpy array

\footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],        # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                  # subarray of 2 rows and
                                # columns 1 and 2: [[2, 3], [6, 7]]
```
\normalsize

* a slice of an array points into the same data; *modifying* it changes the original array!

\footnotesize
```python
b[0, 0] = 77                     # b[0,0] and a[0,1] are 77

r1_row = a[1, :]                 # get 2nd row -> rank 1
r1_row.shape                     # (4,)
r2_row = a[1:2, :]               # get 2nd row -> rank 2
r2_row.shape                     # (1,4)
a=np.array([[1,2],[3,4],[5,6]])  # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]]        # d contains [1 4 6]
e=a[[1, 2], [1, 1]]              # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,0]]) # address elements explicitly
```
\normalsize

## NumPy - array indexing (2)

* integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements

\footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],        # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])         # create an array of indices
s = a[np.arange(3), p_a]        # number the rows, p_a points to cols
print(s)                        # s contains [1 7 9]
a[np.arange(3),p_a] += 10       # add 10 to corresponding elements
x = np.array([[8,2],[7,4]])     # create 2x2 array
sel = (x > 5)                   # sel: array of booleans
                                # [[ True False]
                                #  [ True False]]
print(x[x>5])                   # select elements, prints [8 7]
```
\normalsize

* data type in numpy - created according to the input numbers or set explicitly

\footnotesize

```python
x = np.array([1.1, 2.1])                # create float array
print(x.dtype)                          # print float64
y = np.array([1.1,2.9],dtype=np.int64)  # create int64 array [1 2]
```
\normalsize

## NumPy - functions

* math functions operate elementwise, either as operator overload or as methods

\footnotesize
```python
x = np.array([[1,2],[3,4]],dtype=np.float64)  # define 2x2 float array
y = np.array([[3,1],[5,1]],dtype=np.float64)  # define 2x2 float array
s = x + y                           # elementwise sum
s = np.add(x,y)
s = np.subtract(x,y)
s = np.multiply(x,y)                # elementwise, no matrix multiplication!
s = np.divide(x,y)
s = np.sqrt(x), np.exp(x), ...
x @ y , or np.dot(x, y)             # matrix product
np.sum(x, axis=0)                   # sum of each column
np.sum(x, axis=1)                   # sum of each row
xT = x.T                            # transpose of x
x = np.linspace(0,2*np.pi,100)      # get equally spaced points in x

r = np.random.default_rng(seed=42)  # constructor of the random number class
b = r.random((2,3))                 # random 2x3 matrix
```
\normalsize

##

* broadcasting in numpy
\vspace{0.4cm}

The term broadcasting describes how numpy treats arrays with different
shapes during arithmetic operations

* add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  $[b,b,b]$
\vspace{0.2cm}

* add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  $\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element-wise
\vspace{0.2cm}

* add a 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ the 1D array is broadcast
  across each row of the 2D array, $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$, and added element-wise
\vspace{0.2cm}

Arithmetic operations can only be performed when the shapes of the arrays
are equal in each dimension or one of the sizes is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details

\footnotesize
```python
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])  # x has shape (2, 3)
v = np.array([1,2,3])             # v has shape (3,)
x + v                             # [[2 4 6]
                                  #  [5 7 9]]
```
\normalsize

## Plot data

A popular library to present data is the `pyplot` module of `matplotlib`.

* Drawing a function in one plot

\footnotesize
::: columns
:::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace(0, 10*np.pi, 100)
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range

# show the plot
plt.show()
```
::::
:::: {.column width=40%}
![](figures/matplotlib_Figure_1.png)
::::
:::

\normalsize

## |
|||
* Drawing subplots in one canvas |
|||
|
|||
\footnotesize |
|||
::: columns |
|||
:::: {.column width=35%} |
|||
```python |
|||
... |
|||
g = np.exp(-0.2*x) |
|||
# create figure |
|||
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey') |
|||
plt.suptitle('1 x 2 Plot') |
|||
# create subplot and plot first one |
|||
plt.subplot(1,2,1) |
|||
# plot first one |
|||
plt.title('exp(x)') |
|||
plt.xlabel('x') |
|||
plt.ylabel('g(x)') |
|||
plt.plot(x,g,'blueviolet') |
|||
# create subplot and plot second one |
|||
plt.subplot(1,2,2) |
|||
plt.plot(x,f,'orange') |
|||
plt.plot(x,f*g,'red') |
|||
plt.legend(['sine^2','exp*sine']) |
|||
# show the plot |
|||
plt.show() |
|||
``` |
|||
:::: |
|||
:::: {.column width=40%} |
|||
\vspace{3cm} |
|||
![](figures/matplotlib_Figure_2.png) |
|||
:::: |
|||
::: |
|||
\normalsize |
|||
|
## Image data

The `image` class of the `matplotlib` library can be used to load an image
into a numpy array and to render the image.

* There are 3 common formats for the numpy array

* (M, N) scalar data, used for greyscale images

* (M, N, 3) for RGB images (each pixel has an array with the RGB color attached)

* (M, N, 4) for RGBA images (each pixel has an array with the RGB color
and transparency attached)

The method `imread` loads the image into an `ndarray`, which can then be
manipulated.

The method `imshow` renders the image data

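The RGBA format does not appear in the examples that follow, so here is a
minimal sketch (array values chosen arbitrarily) that builds a small RGBA
image with a transparency channel:

\footnotesize
```python
import numpy as np
import matplotlib.pyplot as plt

# a 20x20 RGBA image: RGB color plus transparency (alpha) per pixel
rgba = np.zeros((20, 20, 4), dtype=np.uint8)
rgba[..., 0] = 255 # red channel fully on
rgba[..., 3] = 128 # alpha channel: half transparent
plt.imshow(rgba)
plt.show()
```
\normalsize
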
##
* Drawing pixel data and images

\footnotesize
::: columns
:::: {.column width=50%}

```python
....
# create data array with pixel position and RGB color code
width, height = 400, 400
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[175:225, 175:225] = [255, 0, 0]
x = np.random.randint(0, width, 100)  # 100 random column indices
y = np.random.randint(0, height, 100) # 100 random row indices
data[y, x] = [0, 255, 0]              # 100 random green pixels
plt.imshow(data)
plt.show()
....
import matplotlib.image as mpimg
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0] # take the red channel (slice 0)
plt.imshow(mod_pic)  # rendered with the default colormap
plt.colorbar()       # try cmap='hot'
plt.show()
```
::::
:::: {.column width=25%}
![](figures/matplotlib_Figure_3.png)
\vspace{1cm}
![](figures/matplotlib_Figure_4.png)
::::
:::
\normalsize

## Input / output

For the analysis of measured data, efficient input / output plays an
important role. In numpy, `ndarrays` can be saved to and read from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension)
which contain the data, shape, dtype and other information required to
reconstruct the `ndarray` from the disk file.

\footnotesize
```python
r = np.random.default_rng()      # instantiate random number generator
a = r.random((4,3))              # random 4x3 array
np.save('myBinary.npy', a)       # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b into a single .npz archive
......
b = np.load('myBinary.npy')      # read content of myBinary.npy into b
```
\normalsize

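Reading the `.npz` archive back gives a dictionary-like object; a short
sketch using the file written above:

\footnotesize
```python
npz = np.load('myComp.npz') # open the archive written with np.savez
a, b = npz['a'], npz['b']   # access the stored arrays by their keywords
print(npz.files)            # list the stored array names: ['a', 'b']
```
\normalsize
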
The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` methods. Parameters controlling the
delimiter, line separator, file header and footer can be specified.

\footnotesize
```python
x = np.array([1,2,3,4,5,6,7])           # create ndarray
np.savetxt('myText.txt', x, fmt='%d')   # write array x to text file myText.txt
.....
y = np.loadtxt('myText.txt', dtype=int) # read content of myText.txt into y
```
\normalsize

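A sketch of these options (the file name `myData.csv` and the column labels
are chosen arbitrarily): a header line and a comma delimiter are set when
writing, and `loadtxt()` skips the `#` comment lines by default:

\footnotesize
```python
data = np.vstack([x, x**2]).T # two columns: x and x^2
np.savetxt('myData.csv', data, fmt='%.3f', delimiter=',',
           header='x,x_squared')            # header is written as a '#' comment
d = np.loadtxt('myData.csv', delimiter=',') # '#' lines are skipped on reading
```
\normalsize
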
## Exercise 1

i) Display a numpy array as a figure of a blue cross. The size should be 200
by 200 pixels. Use the array format (M, N, 3), where the first two indices
specify the pixel position and the last the RGB color from 0 to 255.
   - In addition, draw a red square at an arbitrary position in the figure.
   - Draw a circle in the center of the figure. Try to create a mask which
     selects the inner part of the circle using indexing.

\small
[Solution: 01_intro_ex_1a_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1a_sol.py) \normalsize

ii) Read the pixel data from the binary file horse.npy into a
numpy array. Display the data and the following transformations in 4
subplots: scaling and translation, compression in x and y, rotation
and mirroring.

\small
[Solution: 01_intro_ex_1b_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1b_sol.py) \normalsize

## Pandas

[\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in Python for
\textcolor{blue}{data manipulation and analysis}.

\vspace{0.4cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Offers data structures and operations for manipulating numerical tables with
integrated indexing

* Imports data from various file formats, e.g. comma-separated values, JSON,
SQL or Excel

* Tools for reading and writing data structures allow analyzing, filtering,
splitting, merging and joining

* Built on top of `NumPy`

* Visualizes the data with `matplotlib`

* Most machine learning tools support `pandas` $\rightarrow$
it is widely used to preprocess data sets for machine learning

## Pandas micro introduction

Goal: exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* `Series` -> a vector (list) of data elements of arbitrary type

* `DataFrame` -> a tabular arrangement of data elements whose type may
vary from column to column

Both allow cleaning the data by removing empty or `NaN` entries

\footnotesize
```python
import numpy as np
import pandas as pd # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8]) # create a Series of float64
r = pd.Series(np.random.randn(4))      # Series of random float64 numbers
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print(df) # print the DataFrame
                   A         B         C         D
2013-01-01  1.618395  1.210263 -1.276586 -0.775545
2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
2013-01-03 -0.359081  0.296019  1.541571  0.235337

new_s = s.dropna() # return a new Series without the NaN entries
```
\normalsize

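The same cleaning works for a `DataFrame`; a minimal sketch (values chosen
arbitrarily):

\footnotesize
```python
df2 = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})
df2.dropna()    # drop all rows that contain a NaN entry
df2.fillna(0.0) # or replace the NaN entries by a default value
```
\normalsize
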
##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* pandas data can be saved in different file formats (CSV, JSON, HTML, XML,
Excel, OpenDocument, HDF5, .....). `NaN` entries are kept
in the output file.

* csv file

\footnotesize
```python
df.to_csv("myFile.csv") # write the DataFrame df to a csv file
```
\normalsize

* HDF5 output

\footnotesize
```python
df.to_hdf("myFile.h5",key='df',mode='w') # write the DataFrame df to HDF5
s.to_hdf("myFile.h5", key='s',mode='a')
```
\normalsize

* Writing to an excel file

\footnotesize
```python
df.to_excel("myFile.xlsx", sheet_name="Sheet1")
```
\normalsize

* Deleting a file with data in python

\footnotesize
```python
import os
os.remove('myFile.h5')
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* read in data from various formats

* csv file

\footnotesize

```python
.......
df = pd.read_csv('heart.csv') # read csv data table
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     303 non-null    int64
 1   sex     303 non-null    int64
 2   cp      303 non-null    int64
print(df.head(5))    # prints the first 5 rows of the data table
print(df.describe()) # shows a quick statistic summary of your data
```
\normalsize

* Reading an excel file

\footnotesize
```python
df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
```
\normalsize
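
* Reading back the HDF5 file written above, a sketch using `pd.read_hdf`

\footnotesize
```python
df = pd.read_hdf("myFile.h5", "df") # read the DataFrame stored under key 'df'
```
\normalsize
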
\textcolor{olive}{There are many options for specifying the details of the IO.}

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Various functions exist to select and view data from pandas objects

* Display the index and the columns

\footnotesize

```python
df.index # show the datetime index of df
DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
              dtype='datetime64[ns]',freq='D')
df.columns # show the column info
Index(['A', 'B', 'C', 'D'], dtype='object')
```
\normalsize

* `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data

\footnotesize

```python
df.to_numpy() # one dtype for the entire array, not one per column!
[[-0.62660101 -0.67330526  0.23269168 -0.67403546]
 [-0.53033339  0.32872063 -0.09893568  0.44814084]
 [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
\normalsize

This does not include the index or column labels in the output

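Since a single dtype has to accommodate all columns, a DataFrame with mixed
column types falls back to Python objects; a minimal sketch (values chosen
arbitrarily):

\footnotesize
```python
df2 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}) # int and string columns
df2.to_numpy() # -> array([[1, 'x'], [2, 'y']], dtype=object)
```
\normalsize
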
* more on viewing

\footnotesize

```python
df.T                                  # transpose the DataFrame df
df.sort_values(by="B")                # sort by the values of column B
df.sort_index(axis=0,ascending=False) # sort by index in descending order
df.sort_index(axis=1,ascending=False) # display the columns in reverse order
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions

* get a named column as a Series

\footnotesize

```python
df["A"]         # selects column A from df, similar to df.A
df.iloc[:, 0:1] # slices column A as a DataFrame, like df.loc[:, ["A"]]
```
\normalsize

* select rows of a DataFrame

\footnotesize

```python
df[0:2]                   # selects rows 0 and 1 from df
df["20130102":"20130103"] # select by index labels, both endpoints are included!
df.iloc[2]                # select a row by its integer position
df.iloc[1:3, :]           # selects rows 1 and 2 from df
```
\normalsize

* select by label

\footnotesize

```python
df.loc["20130102":"20130103",["C","D"]] # selects rows 1 and 2, columns C and D
df.loc[dates[0], "A"] # selects a single value (scalar)
```
\normalsize

* select by lists of integer positions (as in `NumPy`)

\footnotesize

```python
df.iloc[[0, 2], [1, 3]] # select rows 0 and 2 and columns B and D
df.iloc[1, 1]           # get a single value explicitly
```
\normalsize

* select according to expressions

\footnotesize

```python
df.query('B<C')                     # select rows where B < C
df1 = df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects continued

* Boolean indexing

\footnotesize

```python
df[df["A"] > 0] # select the rows where the value in column A is > 0
df[df > 0]      # select values > 0 from the entire DataFrame
```
\normalsize

a more complex example

\footnotesize

```python
df2 = df.copy()                     # copy df
df2["E"] = ["eight","one","four"]   # add a column E
df2[df2["E"].isin(["two", "four"])] # select the rows whose entry in
                                    # column E is "two" or "four"
```
\normalsize

* Operations (in general exclude missing data)

\footnotesize

```python
df2[df2 > 0] = -df2  # all elements > 0 change sign
df.mean(0)           # column-wise mean (the number is the axis)
df.mean(1)           # row-wise mean
df.std(0)            # standard deviation along the given axis
df.cumsum()          # cumulative sum of each column
df.apply(np.sin)     # apply a function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply a lambda function column-wise
df + 10              # add the scalar 10 to every element
df - [1, 2, 10, 100] # column-wise subtraction of the values
df.corr()            # compute the pairwise correlation of columns
```
\normalsize

## Pandas - plotting data

[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are just two examples

* Plot random data in a histogram and a scatter plot

\footnotesize
```python
# create a DataFrame with normally distributed random data
df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
df = df + [1, 3, 8, 10] # shift the means to 1, 3, 8, 10
plt.figure()
df.plot.hist(bins=20)   # histogram of all 4 columns
g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
```
\normalsize

::: columns
:::: {.column width=35%}
![](figures/pandas_histogramm.png)
::::
:::: {.column width=35%}
![](figures/pandas_scatterplot.png)
::::
:::

## Pandas - plotting data

The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame of variable counts on the inputs

\footnotesize
```python
df = pd.DataFrame(            # create a DataFrame of 2 categories
    {"sex":   np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    })
pd.crosstab(df.sex, df.heart) # create a cross table of the possibilities
pd.crosstab(df.sex, df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
\normalsize

::: columns
:::: {.column width=42%}
![](figures/pandas_crosstabplot.png)
::::
:::

## Exercise 2

Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/heart.csv) into a DataFrame.
[\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)

\setbeamertemplate{itemize item}{\color{red}$\square$}

* Which columns do we have?

* Print the first 3 rows

* Print the statistics summary and the correlations

* Print the mean values for each column with and without disease

* Select the data according to `sex` and `target` (heart disease 0=no, 1=yes)

* Plot the `age` distribution of male and female in one histogram

* Plot the heart disease distribution according to chest pain type `cp`

* Plot `thalach` according to `target` in one histogram

* Plot `sex` and `target` in a histogram figure

* Correlate `age` and `max heart rate` according to `target`

* Correlate `age` and `cholesterol` according to `target`

\small
[Solution: 01_intro_ex_2_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_2_sol.py) \normalsize