From 6979c92d95b1516d8d6a74752ed078c8f7490fcc Mon Sep 17 00:00:00 2001
From: Joerg Marks
Date: Fri, 7 Apr 2023 08:33:58 +0200
Subject: [PATCH] links updated to ml link 2023

---
 slides/04_decision_trees.md  | 343 +++++++++++++++
 slides/05_neural_networks.md | 802 +++++++++++++++++++++++++++++++++++
 2 files changed, 1145 insertions(+)
 create mode 100644 slides/04_decision_trees.md
 create mode 100644 slides/05_neural_networks.md

diff --git a/slides/04_decision_trees.md b/slides/04_decision_trees.md
new file mode 100644
index 0000000..8024fce
--- /dev/null
+++ b/slides/04_decision_trees.md
@@ -0,0 +1,343 @@
% Introduction to Data Analysis and Machine Learning in Physics: \ 4. Decision Trees
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023

## Exercises

* Exercise 1: Compare different decision tree classifiers
  * [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
  * [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values

## Decision trees

\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/mini_boone_decisions_tree.png}
\end{figure}

\begin{center}
Leaf nodes classify events as either signal or background
\end{center}

## Decision trees: Rectangular volumes in feature space

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/decision_trees_feature_space.png}
\end{figure}

* Easy to interpret and visualize: Space of feature vectors split up into rectangular volumes (attributed to either signal or background)
* How to build a decision tree in an optimal way?

## Finding optimal cuts

Separation between signal and background is often measured with the Gini index (or Gini impurity):

$$ G = p (1-p) $$

Here $p$ is the purity:
$$ p = \frac{\sum_\mathrm{signal} w_i}{\sum_\mathrm{signal} w_i + \sum_\mathrm{background} w_i}, \quad w_i = \text{weight of event}\; i$$

\vfill
\textcolor{gray}{Usefulness of weights will become apparent soon.}

\vfill
Improvement in signal/background separation after splitting a set A into two sets B and C:
$$ \Delta = W_A G_A - W_B G_B - W_C G_C \quad \text{where} \quad W_X = \sum_{X} w_i $$

## Gini impurity and other purity measures
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/signal_purity.png}
\end{figure}


## Decision tree pruning

::: columns
:::: {.column width=50%}

When to stop growing a tree?

* When all nodes are essentially pure?
* Well, that's overfitting! 
+ +\vspace{3ex} + +Pruning + +* Cut back fully grown tree to avoid overtraining, i.e., replace nodes and subtrees by leaves + +:::: +:::: {.column width=50%} +\begin{figure} +\centering +\includegraphics[width=0.85\textwidth]{figures/tree_pruning_slides.png} +\end{figure} +:::: +::: + +## Single decision trees: Pros and cons + +\textcolor{green}{Pros:} + +* Requires little data preparation (unlike neural networks) +* Can use continuous and categorical inputs + +\vfill + +\textcolor{red}{Cons:} + +* Danger of overfitting training data +* Sensitive to fluctuations in the training data +* Hard to find global optimum +* When to stop splitting? + +## Ensemble methods: Combine weak learners + +::: columns +:::: {.column width=70%} +* Bootstrap Aggregating (Bagging) + * Sample training data (with replacement) and train a separate model on each of the derived training sets + * Classify example with majority vote, or compute average output from each tree as model output + +:::: +:::: {.column width=30%} +$$ y(\vec x) = \frac{1}{N_\mathrm{trees}} \sum_{i=1}^{N_{trees}} y_i(\vec x) $$ +:::: +::: +\vfill +::: columns +:::: {.column width=70%} +* Boosting + * Train $N$ models in sequence, giving more weight to examples not correctly classified by previous model + * Take weighted average to classify examples + +:::: +:::: {.column width=30%} +$$ y(\vec x) = \frac{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i y_i(\vec x)}{\sum_{i=1}^{N_\mathrm{trees}} \alpha_i} $$ +:::: +::: + +## Random forests + +* "One of the most widely used and versatile algorithms in data science and machine learning" +\tiny \textcolor{gray}{arXiv:1803.08823v3} \normalsize +\vfill +* Use bagging to select random example subset +\vfill +* Train a tree, but only use random subset of features at each split + * this reduces the correlation between different trees + * makes the decision more robust to missing data + +## Boosted decision trees: Idea + +\begin{figure} +\centering +\includegraphics[width=0.75\textwidth]{figures/bdt.png} +\end{figure} + +## AdaBoost (short for Adaptive Boosting) + +Initial training sample + +\begin{center} +\begin{tabular}{l l} +$\vec x_1, ..., \vec x_n$: & multivariate event data \\ +$y_1, ..., y_n$: & true class labels, $+1$ or $-1$ \\ +$w_1^{(1)}, ..., w_n^{(1)}$ & event weights +\end{tabular} +\end{center} + +with equal weights normalized as + +$$ \sum_{i=1}^n w_i^{(1)} = 1 $$ + +Train first classifier $f_1$: + +\begin{center} +\begin{tabular}{l l} +$f_1(\vec x_i) > 0$ & classify as signal \\ +$f_1(\vec x_i) < 0$ & classify as background +\end{tabular} +\end{center} + +## AdaBoost: Updating events weights + +Define training sample $k+1$ from training sample $k$ by updating weights: + +$$ w_i^{(k+1)} = w_i^{(k)} \frac{e^{- \alpha_k f_k(\vec x_i) y_i/2}}{Z_k} $$ + +\footnotesize +\textcolor{gray}{$$ i = \text{event index}, \quad Z_k:\; \text{normalization factor so that } \sum_{i=1}^n w_i^{(k)} = 1$$} +\normalsize + +Weight is increased if event was misclassified by the previous classifier + +$\to$ "Next classifier should pay more attention to misclassified events" + + +\vfill +At each step the classifier $f_k$ minimizes error rate: + +$$ \varepsilon_k = \sum_{i=1}^n w_i^{(k)} I(y_i f_k( \vec x_i) \le 0), +\quad I(X) = 1 \; \text{if} \; X \; \text{is true, 0 otherwise} $$ + +## AdaBoost: Assigning the classifier score + +Assign score to each classifier according to its error rate: +$$ \alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k} $$ + +\vfill + +Combined classifier (weighted average): +$$ f(\vec x) = 
\sum_{k=1}^K \alpha_k f_k(\vec x) $$


## Gradient boosting

Basic idea:

* Train a first decision tree
* Then train a second one on the residual errors made by the first tree
* And so on

\vfill

In slightly more detail:

* \color{gray} Consider labeled training data: $\{\vec x_i, y_i\}$
* Model prediction at iteration $m$: $F_m(\vec x_i)$
* New model: $F_{m+1}(\vec x) = F_m(\vec x) + h_m(\vec x)$
* Find $h_m(\vec x)$ by fitting it to
$\{(\vec x_1, y_1 - F_m(\vec x_1)), \; (\vec x_2, y_2 - F_m(\vec x_2)), \; ... \; (\vec x_n, y_n - F_m(\vec x_n)) \}$

\color{black}

## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (1)
\small
[\textcolor{gray}{04\_decision\_trees\_critical\_temp\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/04_decision_trees_critical_temp_regression.ipynb)
\normalsize

\vfill

Superconductivity data set:

Predict the critical temperature based on 81 material features.
\footnotesize
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data}](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data)
\normalsize

\vfill

From the abstract:

We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor’s chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model’s predictive accuracy.

\vfill

\tiny
[\textcolor{gray}{https://doi.org/10.1016/j.commatsci.2018.07.052}](https://doi.org/10.1016/j.commatsci.2018.07.052)
\normalsize


## Example 1: Predict critical temperature for superconductivity (Regression with XGBoost) (2)

::: columns
:::: {.column width=60%}
\footnotesize
```python
import numpy as np
import xgboost as xgb

XGBreg = xgb.sklearn.XGBRegressor()

XGBreg.fit(X_train, y_train)

y_pred = XGBreg.predict(X_test)

from sklearn.metrics import mean_squared_error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root mean square error {rms:.2f}")
```

\textcolor{gray}{This gives:}

`root mean square error 9.68`
::::
:::: {.column width=40%}
\vspace{6ex}
![](figures/critical_temperature.pdf)
::::
:::

## Exercise 1: Compare different decision tree classifiers

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_1\_compare\_tree\_classifiers.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)

\vspace{5ex}

Compare scikit-learn's `AdaBoostClassifier`, `RandomForestClassifier`, and `GradientBoostingClassifier` by plotting their ROC curves for the heart disease data set. \newline

\vspace{2ex}

Is there a classifier that clearly performs best? 
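\textcolor{gray}{(A possible starting point is sketched on the next slide.)}

## Exercise 1: ROC comparison sketch

A minimal sketch, not the notebook's code: it assumes the heart disease features `X` and labels `y` have already been loaded, and the classifier settings are illustrative defaults.

\footnotesize
```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import roc_curve, roc_auc_score

# split into training and test samples (assumes X, y are already defined)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

classifiers = {"AdaBoost": AdaBoostClassifier(),
               "Random forest": RandomForestClassifier(),
               "Gradient boosting": GradientBoostingClassifier()}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_score = clf.predict_proba(X_test)[:, 1]   # probability of the signal class
    fpr, tpr, _ = roc_curve(y_test, y_score)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, y_score):.3f})")

plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
```
\normalsize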

## Exercise 2: Apply XGBoost classifier to MAGIC data set

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_2\_magic\_xgboost\_and\_random\_forest.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
\normalsize

\footnotesize
```python
# train XGBoost boosted decision tree
import xgboost as xgb
XGBclassifier = xgb.sklearn.XGBClassifier(nthread=-1, seed=1, n_estimators=1000)
```
\normalsize

\small
a) Plot predicted probabilities for the test sample for signal and background events (\texttt{plt.hist})
b) Which is the most important feature for discriminating signal and background according to XGBoost? \
Hint: use `plot_importance` from XGBoost (see [XGBoost plotting API](https://xgboost.readthedocs.io/en/latest/python/python_api.html)). Do you get the same answer for all three performance measures provided by XGBoost (“weight”, “gain”, or “cover”)?
c) Visualize one decision tree from the ensemble (let's say tree number 10). For this you need the graphviz package (`pip3 install graphviz`)
d) Compare the performance of XGBoost with the [**random forest classifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from [**scikit-learn**](https://scikit-learn.org/stable/index.html). Plot signal and background efficiency for both classifiers in one plot. Which classifier performs better?
\normalsize


## Exercise 3: Feature importance

\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize

\vspace{3ex}

Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.


## Exercise 4: Interpret a classifier with SHAP values

SHAP (SHapley Additive exPlanations) values are a means to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept used in cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.

\vfill

Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.

a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)

b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?

c) Do the same for the superconductivity data set. What are the three most important features?


diff --git a/slides/05_neural_networks.md b/slides/05_neural_networks.md
new file mode 100644
index 0000000..a16adda
--- /dev/null
+++ b/slides/05_neural_networks.md
@@ -0,0 +1,802 @@
% Introduction to Data Analysis and Machine Learning in Physics: \ 5. 
Neural networks +% Jörg Marks, \underline{Klaus Reygers} +% Studierendentage, 11-14 April 2023 + +## Exercises + +* Exercise 1: Learn XOR with a MLP + * [`05_neural_networks_ex_1_xor.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb) +* Exercise 2: Visualising decision boundaries of classifiers + * [`05_neural_networks_ex_2_decision_boundaries.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb) +* Exercise 3: Boston house prices (MLP regression) + * [`05_neural_networks_ex_3_boston_house_prices.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb) +* Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras + * [`05_neural_networks_ex_4_mnist_keras_train.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb) +* Exercise 5: Higgs data set + +## Perceptron (1) + +::: columns +:::: {.column width=65%} +\begin{center} +\includegraphics[width=0.40\textwidth]{figures/perceptron_weighted_sum.png} +\vspace{1ex} +\includegraphics[width=0.75\textwidth]{figures/perceptron_retina.png} +\end{center} +:::: +:::: {.column width=35%} +$$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0,\\0 & \text{otherwise}\end{cases}$$ +\begin{center} +\includegraphics[width=0.95\textwidth]{figures/perceptron_photo.png} +\tiny +\textcolor{gray}{Mark 1 Perceptron. Frank Rosenblatt (1961)} +\normalsize +\end{center} +:::: +::: +\footnotesize +\vspace{2ex} +\textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).} +\normalsize + +## Perceptron (2) +::: columns +:::: {.column width=60%} +* McCulloch–Pitts (MCP) neuron (1943) + * First mathematical model of a biological neuron + * Boolean input + * Equal weights for all inputs + * Threshold hardcoded +* Improvements by Rosenblatt + * Different weights for inputs + * Algorithm to update weights and threshold given labeled training data + +\vfill + +Shortcoming of the perceptron: \newline +it cannot learn the XOR function \newline +\tiny \textcolor{gray}{Minsky, Papert, 1969} \normalsize + +:::: +:::: {.column width=40%} +![](figures/perceptron_with_threshold.png){width=80%} +![](figures/xor.png) +\small \textcolor{gray}{XOR: not linearly separable } \normalsize + +:::: +::: + +## The biological inspiration: the neuron + +\begin{figure} +\centering +\includegraphics[width=0.95\textwidth]{figures/neuron.png} +\end{figure} + +## Non-linear transfer / activation function + +Discriminant: $$ y(\vec x) = h\left( w_0 + \sum_{i=1}^n w_i x_i \right) $$ + +Examples for function $h$: \newline +$$ \frac{1}{1+e^{-x}} \; \text{("sigmoid" or "logistic" function)}, \quad \tanh x $$ + +::: columns +:::: {.column width=50%} +\begin{figure} +\centering +\includegraphics[width=0.75\textwidth]{figures/logistic_fct.png} +\end{figure} +:::: +:::: {.column width=50%} +\vspace{3ex} +Non-linear activation function needed in neural networks when feature space is not linearly separable. 
+\newline + +\small +\textcolor{gray}{Neural net with linear activation functions is just a perceptron} +\normalsize +:::: +::: + +## Feedforward neural network with one hidden layer +::: columns +:::: {.column width=60%} +![](figures/mlp.png){width=80%} +:::: +:::: {.column width=40%} +$$ \phi_i(\vec x) = h\left(w_{i0}^{(1)} + \sum_{j=1}^n w_{ij}^{(1)} x_j\right) $$ +\vfill +$$ y(\vec x) = h\left( w_{10}^{(2)} + \sum_{j=1}^m w_{1j}^{(2)} \phi_j(\vec x)\right) $$ +\vfill +\vspace{2ex} +\footnotesize +\textcolor{gray}{superscripts indicates layer number, i.e., $w_{ij}^{(1)}$ refers to the input weights of neuron $i$ in the hidden layer (= layer 1).} +\normalsize + +:::: +::: +\begin{center} +Straightforward to generalize to multiple hidden layers +\end{center} + +## Neural network output and decision boundaries +::: columns +:::: {.column width=75%} +\begin{figure} +\centering +\includegraphics[width=\textwidth]{figures/nn_decision_boundary.png} +\end{figure} +:::: +:::: {.column width=25%} +\vspace{3ex} +\footnotesize +\textcolor{gray}{P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273} +\normalsize +:::: +::: + +## Fun with neural nets in the browser +\begin{figure} +\centering +\includegraphics[width=\textwidth]{figures/tf_playground.png} +\end{figure} +\tiny +[\textcolor{gray}{http://playground.tensorflow.org}](http://playground.tensorflow.org) +\normalsize + +## Backpropagation (1) +Start with an initial guess $\vec w_0$ for the weights an then update weights after each training event: +$$ \vec w^{(\tau+1)} = \vec w^{(\tau)} - \eta \nabla E_a(\vec w^{(\tau)}), \quad \eta = \text{learning rate}$$ + +Gradient descent: +\begin{figure} +\centering +\includegraphics[width=0.46\textwidth]{figures/gradient_descent.png} +\end{figure} + +## Backpropagation (2) +::: columns +:::: {.column width=40%} +\vspace{6ex} +![](figures/mlp.png){width=100%} +:::: +:::: {.column width=60%} +Let's write network output as follows: +\begin{align*} +y(\vec x) &= h(u(\vec x)); \quad u(\vec x) = \sum_{j=0}^m w_{1j}^{(2)} \phi_j(\vec x) \\ +\phi_j(\vec x) &= h\left( \sum_{k=0}^n w_{jk}^{(1)} x_k\right) +\equiv h\left( v_j(\vec x) \right) +\end{align*} + +For $E_a = \frac{1}{2} (y_a - t_a)^2$ one obtains for the weights from hidden layer to output: +\begin{align*} +\frac{\partial E_a}{\partial w_{1j}^{(2)}} &= (y_a -t_a) h'(u(\vec x_a)) \frac{\partial u}{\partial w_{1j}^{(2)}} \\ +&= (y_a -t_a) h'(u(\vec x_a)) \phi_j(\vec x_a) +\end{align*} +:::: +::: +\vspace{2ex} +Further application of the chain rule gives weights from input to hidden layer. 
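\textcolor{gray}{(A small numerical example of one such update is sketched on the next slide.)}

## Backpropagation: NumPy sketch of one update step

A minimal NumPy sketch of a single stochastic gradient-descent step for the one-hidden-layer network above, with sigmoid activation and squared-error loss. Array names, sizes, and the toy event are illustrative only, not taken from the lecture code.

\footnotesize
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy network: n inputs, m hidden neurons, 1 output; index 0 of each weight vector is the bias
rng = np.random.default_rng(1)
n, m, eta = 3, 4, 0.1
W1 = rng.normal(size=(m, n + 1))   # w_jk^(1): input -> hidden
w2 = rng.normal(size=m + 1)        # w_1j^(2): hidden -> output

x_a, t_a = np.array([1.0, 0.5, -0.3]), 1.0      # one training event and its target

# forward pass
x = np.concatenate(([1.0], x_a))                # x_0 = 1 (bias input)
phi = np.concatenate(([1.0], sigmoid(W1 @ x)))  # phi_0 = 1, phi_j = h(v_j)
y = sigmoid(w2 @ phi)                           # network output y = h(u)

# backward pass for E_a = 0.5 * (y - t_a)^2, using h'(z) = h(z) * (1 - h(z))
delta_out = (y - t_a) * y * (1 - y)                        # (y - t) h'(u)
grad_w2 = delta_out * phi                                  # dE_a/dw_1j^(2)
delta_hid = delta_out * w2[1:] * phi[1:] * (1 - phi[1:])   # chain rule through the hidden layer
grad_W1 = np.outer(delta_hid, x)                           # dE_a/dw_jk^(1)

# gradient-descent update with learning rate eta
w2 -= eta * grad_w2
W1 -= eta * grad_W1
```
\normalsize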

## Backpropagation (3)
Backpropagation summary

* Make prediction for a given training instance (forward pass)
* Calculate error (value of loss function)
* Go backwards and determine the contribution of each weight (reverse pass)
* Adjust the weights to reduce the error

\vfill

Practical considerations:

* Nowadays, people implement neural networks with frameworks like Keras or TensorFlow
* No need to implement backpropagation yourself
* TensorFlow efficiently calculates the gradients based on a kind of symbolic differentiation


## More on gradient descent

::: columns
:::: {.column width=60%}
* Stochastic gradient descent
  * just uses one training event at a time
  * fast, but quite irregular approach to the minimum
  * can help escape local minima
  * one can decrease learning rate to settle at the minimum ("simulated annealing")
* Batch gradient descent
  * use entire training sample to calculate gradient of loss function
  * computationally expensive
* Mini-batch gradient descent
  * calculate gradient for a random sub-sample of the training set

::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/stochastic_gradient_descent.png}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/gradient_descent_cmp.png}
\end{figure}
::::
:::

## Universal approximation theorem

::: columns
:::: {.column width=60%}
"A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of $\mathbb{R}^n$."

\vspace{5ex}

One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions

\vspace{5ex}

The theorem does not touch upon the algorithmic learnability of those parameters

::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/ann.png}
\end{figure}
::::
:::

## Deep neural networks
Deep networks: many hidden layers with large number of neurons

::: columns
:::: {.column width=50%}
* Challenges
  * Hard to train ("vanishing gradient problem")
  * Training slow
  * Risk of overtraining
::::
:::: {.column width=50%}
* Big progress in recent years
  * Interest in NN waned before ca. 2006
  * Milestone: paper by G. 
Hinton (2006): "learning for deep belief nets" + * Image recognition, AlphaGo, … + * Soon: self-driving cars, … +:::: +::: +\begin{figure} +\centering +\includegraphics[width=0.5\textwidth]{figures/dnn.png} +\end{figure} + +## Drawbacks of the sigmoid activation function + +::: columns +:::: {.column width=50%} +\includegraphics[width=.75\textwidth]{figures/sigmoid.png} +:::: +:::: {.column width=50%} +$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$ +\vspace{3ex} + +* Saturated neurons “kill” the gradients +* Sigmoid outputs are not zero-centered +* exp() is a bit compute expensive +:::: +::: + +## Activation functions +\begin{figure} +\centering +\includegraphics[width=\textwidth]{figures/activation_functions.png} +\end{figure} + +## ReLU +::: columns +:::: {.column width=50%} +\includegraphics[width=.75\textwidth]{figures/relu.png} +:::: +:::: {.column width=50%} +$$ f(x) = \max(0,x) $$ +\vspace{1ex} + +* Does not saturate (in +region) +* Very computationally efficient +* Converges much faster than sigmoid tanh in practice +* Actually more biologically plausible than sigmoid +* But: gradient vanishes for $x < 0$ + +:::: +::: + + +## Bias-variance tradeoff + +Goal: generalization of training data + +* Simple models (few parameters): danger of bias + * \textcolor{gray}{Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance")} +* Complex models (many parameters): danger of overfitting + * \textcolor{gray}{large variance of decision boundaries for different training samples} + +\begin{figure} +\centering +\includegraphics[width=0.8\textwidth]{figures/underfitting_overfitting.pdf} +\end{figure} + +## Example of overtraining +Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining + +\begin{figure} +\centering +\includegraphics[width=0.9\textwidth]{figures/example_overtraining.png} +\end{figure} + +## Monitoring overtraining +Monitor fraction of misclassified events (or loss function:) +\begin{figure} +\centering +\includegraphics[width=0.8\textwidth]{figures/monitoring_overtraining.png} +\end{figure} + +## Regularization: Avoid overfitting +\scriptsize +[\hfill \textcolor{gray}{http://cs231n.stanford.edu/slides}](http://cs231n.stanford.edu/slides) +\normalsize +\begin{figure} +\centering +\includegraphics[width=0.75\textwidth]{figures/regularization.png} +\end{figure} +\begin{center} +$L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$ +\end{center} + +## Another approach to prevent overfitting: Dropout +* Randomly remove nodes during training +* Avoid co-adaptation of nodes +\begin{figure} +\centering +\includegraphics[width=0.8\textwidth]{figures/dropout.png} +\end{figure} +\scriptsize +\textcolor{gray}{Srivastava et al.,} +[\textcolor{gray}{"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"}](jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf) +\normalsize + + + +## Pros and cons of multi-layer perceptrons + +\textcolor{green}{Pros} + +* Capability to learn non-linear models + +\vspace{3ex} + +\textcolor{red}{Cons} + +* Loss function can have several local minima +* Hyperparameters need to be tuned + * \textcolor{gray}{number of layers, neurons per layer, and training iterations} +* Sensitive to feature scaling + * \textcolor{gray}{preprocessing needed (e.g., scaling of all feature to range [0,1])} + + +## Example 1: Boston house prices (MLP regression) (1) +* Objective: 
predict house prices in Boston suburbs in the mid-1970s +* Boston house data set: 506 instances, 13 features + +\footnotesize +``` + - CRIM per capita crime rate by town + - ZN proportion of residential land zoned for lots over 25,000 sq.ft. + - INDUS proportion of non-retail business acres per town + - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) + - NOX nitric oxides concentration (parts per 10 million) + - RM average number of rooms per dwelling + - AGE proportion of owner-occupied units built prior to 1940 + - DIS weighted distances to five Boston employment centres + - RAD index of accessibility to radial highways + - TAX full-value property-tax rate per $10,000 + - PTRATIO pupil-teacher ratio by town + - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town + - LSTAT % lower status of the population + - MEDV Median value of owner-occupied homes in $1000's +``` + +\footnotesize +[\textcolor{gray}{05\_neural\_networks\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_house_prices.ipynb) + +## Example 1: Boston house prices (MLP regression) (2) +```python +boston = datasets.load_boston() +X = boston.data +y = boston.target + +from sklearn.neural_network import MLPRegressor +mlp = MLPRegressor(hidden_layer_sizes=(100), + activation='logistic', random_state=1, max_iter=5000) +mlp.fit(X_train, y_train) + +y_pred_mlp = mlp.predict(X_test) + +rms = np.sqrt(mean_squared_error(y_test, y_pred_mlp)) +print(f"root mean square error {rms:.2f}") +``` + +## Example 1: Boston house prices (MLP regression) (3) +\begin{center} +\includegraphics[width=0.7\textwidth]{figures/boston_house_prices.pdf} +\end{center} + +## Exercise 1: XOR +\small +[\textcolor{gray}{05\_neural\_networks\_ex\_1\_xor.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb) +\normalsize + +::: columns +:::: {.column width=60%} +a) Define a multi-layer perceptron classifier that learns the XOR problem. +\scriptsize +```python + from sklearn.neural_network import MLPClassifier + + X = [[0, 0], [0, 1], [1, 0], [1, 1]] + y = [0, 1, 1, 0] +``` +\normalsize +b) Define a multi-layer perceptron regressor that fits the depicted 2d data (see notebook). + +c) Plot the mean square error vs. the number of number of training epochs for b). +:::: +:::: {.column width=40%} +\vspace{10ex} +![](figures/xor_like_data.pdf) +:::: +::: + +## Exercise 2: Visualising decision boundaries of classifiers + +\small +[\textcolor{gray}{05\_neural\_networks\_ex\_2\_decision\_boundaries.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb) +\normalsize + +\vspace{5ex} + +Visualize the decision boundaries of a scikit-learn decision tree, a scikit-learn multi-layer perceptron, and XGBoost for different toy data sets. + + +## Exercise 3: Boston house prices (hyperparameter optimization) + +\small +[\textcolor{gray}{05\_neural\_networks\_ex\_3\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb) +\normalsize + +\vspace{5ex} + +a) Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, ...)? Try this first by hand. 
b) Now use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters.

## TensorFlow

::: columns
:::: {.column width=70%}

* Powerful open source library with a focus on deep neural networks
* Performs computations on data flow graphs
* Takes care of computing gradients of the defined functions (\textit{automatic differentiation})
* Computations in parallel on multiple CPUs or GPUs
* Developed by the Google Brain team
* Initial release in 2015
* [https://www.tensorflow.org/](https://www.tensorflow.org/)

::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/tensorflow.png}
\end{center}
::::
:::

## Keras

::: columns
:::: {.column width=70%}

* Open-source library providing high-level building blocks for developing deep-learning models
* Uses TensorFlow as \textit{backend engine} for low-level tensor manipulation (version 2.4)
* Part of TensorFlow core API since TensorFlow 1.4 release
* Over 375,000 individual users as of early-2020
* Primary author: Fran\c{c}ois Chollet (Google engineer)
* [https://keras.io/](https://keras.io/)

::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/keras.png}
\end{center}
::::
:::



## Example 2: Boston house prices with Keras

\small
```python
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(train_data.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])

model.fit(partial_train_data, partial_train_targets,
          epochs=num_epochs, batch_size=1, verbose=0)

# Evaluate the model on the validation data
val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)

```
\normalsize

\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_keras.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_keras.ipynb)

## Convolutional neural networks (CNNs)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/cnn.png}
\end{center}
::: columns
:::: {.column width=80%}
* CNNs emerged from the study of the visual cortex
* Behind many deep learning successes (e.g. in image recognition)
* Partially connected layers
  * \textcolor{gray}{Fully connected layers impractical for large images (too many neurons, overfitting)}
* Key component: Convolutional layers
  * \textcolor{gray}{Set of learnable filters}
  * \textcolor{gray}{Low-level features at the first layers; high-level features at the end}
::::
:::: {.column width=20%}
\small
\textcolor{gray}{Sliding $3 \times3$ filter}
![](figures/cnn_sliding_filter.png)
::::
:::

## Different types of layers in a CNN
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{1. Convolutional layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_conv_layer.png}
::::
:::: {.column width=50%}
\small \textcolor{gray}{3. Fully connected layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_fully_connected.png}
::::
:::

\vspace{3ex}

::: columns
:::: {.column width=60%}
\vfill
\small \textcolor{gray}{2. 
Pooling layers} \newline +\includegraphics[width=\textwidth]{figures/cnn_pooling.png} +:::: +:::: {.column width=40%} +\textcolor{gray}{\footnotesize Afshine Amidi, Shervine Amidi} \ +[\textcolor{gray}{\footnotesize Convolutional Neural Networks cheatsheet}](https://github.com/afshinea/stanford-cs-230-deep-learning/blob/master/en/cheatsheet-convolutional-neural-networks.pdf) +:::: +::: + +## MNIST classification with a CNN in Keras +\footnotesize +```python +from keras.models import Sequential +from keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input, Dropout + +# conv layer with 8 3x3 filters + +model = Sequential( + [ + Input(shape=input_shape), + Conv2D(8, kernel_size=(3, 3), activation="relu"), + MaxPooling2D(pool_size=(2, 2)), + Flatten(), + Dense(16, activation="relu"), + Dense(num_classes, activation="softmax"), + ] +) + +model.summary() +``` +\normalsize + +## Defining the CNN in Keras (2) + +\footnotesize +``` +Model: "sequential_1" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_1 (Conv2D) (None, 26, 26, 8) 80 +_________________________________________________________________ +max_pooling2d_1 (MaxPooling2 (None, 13, 13, 8) 0 +_________________________________________________________________ +flatten_1 (Flatten) (None, 1352) 0 +_________________________________________________________________ +dense_2 (Dense) (None, 16) 21648 +_________________________________________________________________ +dense_3 (Dense) (None, 10) 170 +================================================================= +Total params: 21,898 +Trainable params: 21,898 +Non-trainable params: 0 +``` +\normalsize + +## Model definition +Using Keras, you have to `compile` a model, which means adding the loss function, the optimizer algorithm and validation metrics to your training setup. +\vspace{5ex} + +\footnotesize +```python +model.compile(loss="categorical_crossentropy", + optimizer="adam", + metrics=["accuracy"]) +``` +\normalsize + +## Model training + +\footnotesize +```python +from keras.callbacks import ModelCheckpoint, EarlyStopping + +checkpoint = ModelCheckpoint( + filepath="mnist_keras_model.h5", + save_best_only=True, + verbose=1) +early_stopping = EarlyStopping(patience=2) + +history = model.fit(x_train, y_train, # Training data + batch_size=200, # Batch size + epochs=50, # Maximum number of training epochs + validation_split=0.5, # Use 50% of the train dataset for validation + callbacks=[checkpoint, early_stopping]) # Register callbacks +``` +\normalsize + +## Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras + +\small +[\textcolor{gray}{05\_neural\_networks\_ex\_4\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb) +\normalsize + +\vspace{5ex} + +a) Plot training and validation loss as well as training and validation accuracy as a function of the number of epochs + +b) Determine the accuracy of the fully trained model. + +c) Create a second notebook that reads the trained model (`mnist_keras_model.h5`). Read `your_own_digit.png` and classify it. Create your own $28 \times 28$ pixel digits with a program like gimp and check how the model performs. 
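\textcolor{gray}{(The input pipeline assumed by the training code above is sketched on the next slide.)}

## MNIST input pipeline (sketch)

A minimal sketch of one way to prepare `x_train`, `y_train`, `x_test`, `y_test` for the model shown before, using the standard MNIST loader from `keras.datasets`; the exact preprocessing in the lecture notebook may differ.

\footnotesize
```python
import numpy as np
from keras.datasets import mnist
from keras.utils import to_categorical

num_classes = 10
input_shape = (28, 28, 1)

# load the standard MNIST digits (60k training and 10k test images, 28x28 pixels)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# scale pixel values to [0, 1] and add a channel axis for the Conv2D layer
x_train = np.expand_dims(x_train.astype("float32") / 255.0, -1)
x_test = np.expand_dims(x_test.astype("float32") / 255.0, -1)

# one-hot encode the labels to match the categorical cross-entropy loss
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
```
\normalsize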

## Exercise 5: Higgs data set (1)

Application of deep neural networks for separation of signal and background in an exotic Higgs scenario

\vfill

\small
\color{gray}
In this exercise we want to explore various techniques to optimize the event selection in the search for supersymmetric Higgs bosons at the LHC. In supersymmetry the Higgs sector consists of five Higgs bosons, in contrast to the single Higgs boson in the standard model. Here we deal with a heavy Higgs boson which decays into two W-bosons and a standard Higgs boson ($H^0 \to W^+ W^- h$), which decay further into leptons ($W^\pm \to l^\pm \nu$) and b-quarks ($h\to b \bar{b}$), respectively.

This exercise is based on a [Nature paper](https://www.nature.com/articles/ncomms5308) (Pierre Baldi, Peter Sadowski, Daniel Whiteson) which contains much more information, like general background information, details about the selection variables, and links to large sets of simulated events. You might also use the paper as inspiration for the solution of this exercise.

## Exercise 5: Higgs data set (2)

The two data sets consist of 10k and 100k events, respectively. For each event 29 variables are stored:

\footnotesize
```
 0: classification (1 = signal, 0 = background)
 1 - 21 : low level quantities (var1 - var21)
 22 - 28 : high level quantities (var22 - var28)
```

\normalsize

You can read the data as follows:

\scriptsize
```python
import pandas as pd

#filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_10k.csv"
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_100k.csv"

df = pd.read_csv(filename, engine='python')
```

\normalsize
a) Use a classifier of your choice to separate signal and background events. Determine the accuracy score.
b) Compare the results when using i) the low level quantities and ii) the high level quantities


## Practical advice -- Which algorithm to choose?
\textcolor{gray}{From Kaggle competitions:}

\vspace{3ex}
Structured data: "High level" features that have meaning:

* feature engineering + decision trees
* Random forests
* XGBoost

\vspace{3ex}
Unstructured data: "Low level" features, no individual meaning:

* deep neural networks
* e.g. 
image classification: convolutional NN


## Outlook: Autoencoders

::: columns
:::: {.column width=50%}
* Unsupervised method based on neural networks to learn a representation of the input data
* Autoencoders learn to copy the input to the output layer
  * low dimensional coding of the input in the central layer
* The decoder generates data based on the coding (*generative model*)
* Applications
  * Dimensionality reduction
  * Denoising of data
  * Machine translation
* \textcolor{gray}{\footnotesize A minimal Keras sketch is shown on the last slide}
::::
:::: {.column width=50%}
\vspace{3ex}
\begin{center}
\includegraphics[width=\textwidth]{figures/autoencoder_example.pdf}
\end{center}
::::
:::

## Outlook: Generative adversarial networks (GANs)

\begin{center}
\includegraphics[width=0.65\textwidth]{figures/gan.png}
\end{center}
\scriptsize
[\textcolor{gray}{https://developers.google.com/machine-learning/gan/gan\_structure}](https://developers.google.com/machine-learning/gan/gan_structure)
\normalsize

* Discriminator's classification provides a signal that the generator uses to update its weights
* Application in particle physics: fast detector simulation
* Full GEANT simulation usually very CPU intensive

## The future

"The interesting thing about our intelligence is that we can play Go and then get up from the table and cook a meal, which a machine cannot do."

\vspace{2ex}

\color{gray}
\small
\hfill Bernhard Schölkopf, Max-Planck-Institut für intelligente Systeme ([Interview FAZ](https://www.faz.net/aktuell/wirtschaft/kuenstliche-intelligenz/ki-fachmann-wie-gut-europa-in-der-forschung-aufgestellt-ist-16650700.html))
\normalsize
\color{black}

\vfill

"My view is throw it all away and start again"

\color{gray}
\small
\hfill Geoffrey Hinton (DNN pioneer) on deep neural networks and backpropagation ([Interview, 2017](https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html))
\normalsize
\color{black}
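\vspace{1ex}

## Outlook: Autoencoder in Keras (sketch)

A minimal sketch of a fully connected autoencoder for flattened $28 \times 28$ images (784 inputs, 32-dimensional code); layer sizes, activations, and training settings are illustrative only, not taken from the lecture material.

\footnotesize
```python
from keras import layers, models

encoding_dim = 32   # size of the low-dimensional code in the central layer

# encoder: compress the 784-dimensional input down to the code
inputs = layers.Input(shape=(784,))
encoded = layers.Dense(128, activation="relu")(inputs)
encoded = layers.Dense(encoding_dim, activation="relu")(encoded)

# decoder: reconstruct the input from the code
decoded = layers.Dense(128, activation="relu")(encoded)
decoded = layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# training target = input: the network learns to copy its input through the bottleneck
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256, validation_split=0.2)
```
\normalsize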