
update in slides 04/05

master
Klaus Reygers 2 years ago
parent commit 7d2f42da62
  1. slides/04_decision_trees.md (34 changed lines)
  2. slides/05_neural_networks.md (623 changed lines)

slides/04_decision_trees.md (34 changed lines)

@@ -2,15 +2,6 @@
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Exercises
* Exercise 1: Compare different decision tree classifiers
* [`04_decision_trees_ex_1_compare_tree_classifiers.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_1_compare_tree_classifiers.ipynb)
* Exercise 2: Apply XGBoost classifier to MAGIC data set
* [`04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_2_magic_xgboost_and_random_forest.ipynb)
* Exercise 3: Feature importance
* Exercise 4: Interpret a classifier with SHAP values
## Decision trees
\begin{figure}
@@ -312,31 +303,6 @@ d) Compare the performance of XGBoost with the [**random forest classifier**](ht
\normalsize
## Exercise 3: Feature importance
\small
[\textcolor{gray}{04\_decision\_trees\_ex\_3\_magic\_feature\_importance.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/04_decision_trees_ex_3_magic_feature_importance.ipynb)
\normalsize
\vspace{3ex}
Evaluate the importance of each of the $n$ features in the training of the XGBoost classifier for the MAGIC data set by dropping one of the features. This gives $n$ different classifiers. Compare the performance of these classifiers using the AUC score.
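A minimal sketch of this drop-one-feature scan, assuming NumPy arrays `X_train`, `X_test`, `y_train`, `y_test` (with numeric labels) and a list `feature_names` for the MAGIC data (hypothetical variable names, not from the notebook):

\footnotesize
```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

for i, name in enumerate(feature_names):
    # drop feature i from the training and test set
    X_tr = np.delete(X_train, i, axis=1)
    X_te = np.delete(X_test, i, axis=1)
    clf = XGBClassifier().fit(X_tr, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_te)[:, 1])
    print(f"without {name}: AUC = {auc:.4f}")
```
\normalsize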
## Exercise 4: Interpret a classifier with SHAP values
SHAP (SHapley Additive exPlanations) is a method to explain the output of any machine learning model. [Shapley values](https://en.wikipedia.org/wiki/Shapley_value) are a concept from cooperative game theory. They are named after Lloyd Shapley, who won the Nobel Prize in Economics in 2012.
\vfill
Use the Python library [`SHAP`](https://shap.readthedocs.io/en/latest/index.html) to quantify the feature importance.
a) Study the documentation at [https://shap.readthedocs.io/en/latest/tabular_examples.html](https://shap.readthedocs.io/en/latest/tabular_examples.html)
b) Create a summary plot of the feature importance in the MAGIC data set with `shap.summary_plot` for the XGBoost classifier of exercise 2. What are the three most important features?
c) Do the same for the superconductivity data set. What are the three most important features?
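A possible starting point for b), assuming `model` is the trained XGBoost classifier from exercise 2 and `X_test` and `feature_names` hold the MAGIC features and their names (hypothetical variable names):

\footnotesize
```python
import shap

explainer = shap.TreeExplainer(model)          # tree-based explainer for XGBoost
shap_values = explainer.shap_values(X_test)    # one SHAP value per feature and event
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```
\normalsize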

slides/05_neural_networks.md (623 changed lines)

@@ -1,20 +1,8 @@
% Introduction to Data Analysis and Machine Learning in Physics: \ 5. Neural networks
% Introduction to Data Analysis and Machine Learning in Physics: \ 5. Convolutional Neural Networks and Graph Neural Networks
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Exercises
* Exercise 1: Learn XOR with a MLP
* [`05_neural_networks_ex_1_xor.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
* Exercise 2: Visualising decision boundaries of classifiers
* [`05_neural_networks_ex_2_decision_boundaries.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
* Exercise 3: Boston house prices (MLP regression)
* [`05_neural_networks_ex_3_boston_house_prices.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
* Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
* [`05_neural_networks_ex_4_mnist_keras_train.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
* Exercise 5: Higgs data set
## Perceptron (1)
## Historical perspective: Perceptron (1)
::: columns
:::: {.column width=65%}
@@ -39,7 +27,7 @@ $$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0,\\0 & \te
\textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).}
\normalsize
## Perceptron (2)
## Historical perspective: Perceptron (2)
::: columns
:::: {.column width=60%}
* McCulloch–Pitts (MCP) neuron (1943)
@@ -73,52 +61,6 @@ it cannot learn the XOR function \newline
\includegraphics[width=0.95\textwidth]{figures/neuron.png}
\end{figure}
## Non-linear transfer / activation function
Discriminant: $$ y(\vec x) = h\left( w_0 + \sum_{i=1}^n w_i x_i \right) $$
Examples for function $h$: \newline
$$ \frac{1}{1+e^{-x}} \; \text{("sigmoid" or "logistic" function)}, \quad \tanh x $$
::: columns
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/logistic_fct.png}
\end{figure}
::::
:::: {.column width=50%}
\vspace{3ex}
A non-linear activation function is needed in neural networks when the feature space is not linearly separable.
\newline
\small
\textcolor{gray}{Neural net with linear activation functions is just a perceptron}
\normalsize
::::
:::
## Feedforward neural network with one hidden layer
::: columns
:::: {.column width=60%}
![](figures/mlp.png){width=80%}
::::
:::: {.column width=40%}
$$ \phi_i(\vec x) = h\left(w_{i0}^{(1)} + \sum_{j=1}^n w_{ij}^{(1)} x_j\right) $$
\vfill
$$ y(\vec x) = h\left( w_{10}^{(2)} + \sum_{j=1}^m w_{1j}^{(2)} \phi_j(\vec x)\right) $$
\vfill
\vspace{2ex}
\footnotesize
\textcolor{gray}{The superscript indicates the layer number, i.e., $w_{ij}^{(1)}$ refers to the input weights of neuron $i$ in the hidden layer (= layer 1).}
\normalsize
::::
:::
\begin{center}
Straightforward to generalize to multiple hidden layers
\end{center}
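As an illustration, a minimal NumPy sketch of the forward pass defined by the two equations above (random weights, hypothetical sizes $n=2$ and $m=3$; the bias terms $w_{i0}$ appear as `b1`, `b2`):

\footnotesize
```python
import numpy as np

h = np.tanh                              # activation function

def forward(x, W1, b1, W2, b2):
    """One hidden layer: phi = h(b1 + W1 x), y = h(b2 + W2 phi)."""
    phi = h(b1 + W1 @ x)                 # hidden-layer activations phi_i(x)
    return h(b2 + W2 @ phi)              # network output y(x)

rng = np.random.default_rng(1)
x = np.array([0.5, -1.0])                                # n = 2 inputs
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)     # m = 3 hidden neurons
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
print(forward(x, W1, b1, W2, b2))
```
\normalsize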
## Neural network output and decision boundaries
::: columns
:::: {.column width=75%}
@@ -135,50 +77,7 @@
::::
:::
## Fun with neural nets in the browser
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/tf_playground.png}
\end{figure}
\tiny
[\textcolor{gray}{http://playground.tensorflow.org}](http://playground.tensorflow.org)
\normalsize
## Backpropagation (1)
Start with an initial guess $\vec w_0$ for the weights and then update the weights after each training event:
$$ \vec w^{(\tau+1)} = \vec w^{(\tau)} - \eta \nabla E_a(\vec w^{(\tau)}), \quad \eta = \text{learning rate}$$
Gradient descent:
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{figures/gradient_descent.png}
\end{figure}
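A minimal sketch of this update rule for a simple one-parameter toy loss $E(w) = (w-3)^2$ (a hypothetical example, not from the lecture):

\footnotesize
```python
# gradient descent on E(w) = (w - 3)^2, dE/dw = 2 (w - 3)
w, eta = 0.0, 0.1            # initial guess and learning rate
for step in range(100):
    grad = 2.0 * (w - 3.0)
    w = w - eta * grad       # w^(tau+1) = w^(tau) - eta * grad E
print(w)                     # close to the minimum at w = 3
```
\normalsize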
## Backpropagation (2)
::: columns
:::: {.column width=40%}
\vspace{6ex}
![](figures/mlp.png){width=100%}
::::
:::: {.column width=60%}
Let's write the network output as follows:
\begin{align*}
y(\vec x) &= h(u(\vec x)); \quad u(\vec x) = \sum_{j=0}^m w_{1j}^{(2)} \phi_j(\vec x) \\
\phi_j(\vec x) &= h\left( \sum_{k=0}^n w_{jk}^{(1)} x_k\right)
\equiv h\left( v_j(\vec x) \right)
\end{align*}
For $E_a = \frac{1}{2} (y_a - t_a)^2$ one obtains for the weights from hidden layer to output:
\begin{align*}
\frac{\partial E_a}{\partial w_{1j}^{(2)}} &= (y_a -t_a) h'(u(\vec x_a)) \frac{\partial u}{\partial w_{1j}^{(2)}} \\
&= (y_a -t_a) h'(u(\vec x_a)) \phi_j(\vec x_a)
\end{align*}
::::
:::
\vspace{2ex}
Further application of the chain rule gives the gradients with respect to the weights from the input layer to the hidden layer.
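A minimal NumPy sketch of these gradients for a single training instance (a hypothetical small network with $n=3$ inputs and $m=2$ hidden neurons; biases absorbed via $x_0 = 1$, $\tanh$ as activation):

\footnotesize
```python
import numpy as np

def h(x):  return np.tanh(x)
def hp(x): return 1.0 - np.tanh(x)**2       # derivative h'(x)

x, t = np.array([1.0, 0.5, -1.0]), 0.2      # training instance (x[0] = 1 acts as bias input)
W1 = np.array([[0.1, -0.3,  0.2],
               [0.4,  0.1, -0.2]])          # input -> hidden weights, shape (m, n)
w2 = np.array([0.3, -0.1])                  # hidden -> output weights, shape (m,)

v   = W1 @ x                                # v_j(x)
phi = h(v)                                  # phi_j(x)
u   = w2 @ phi                              # u(x)
y   = h(u)                                  # network output y(x)

dE_dw2 = (y - t) * hp(u) * phi                        # hidden -> output gradients
dE_dW1 = (y - t) * hp(u) * np.outer(w2 * hp(v), x)    # input -> hidden gradients (chain rule)
```
\normalsize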
## Backpropagation (3)
## Recap: Backpropagation
Backpropagation summary
* Make prediction for a given training instance (forward pass)
@@ -192,7 +91,7 @@ Practical considerations:
* Nowadays, people implement neural networks with frameworks like Keras or TensorFlow
* No need to implement backpropagation yourself
* TensorFlow efficiently calculates gradient function based on a kind of symbolic differentiation
* TensorFlow calculates the gradient function efficiently ('autodiff')
## More on gradient descent
@@ -246,46 +145,7 @@ The theorem does not touch upon the algorithmic learnability of those parameters
::::
:::
## Deep neural networks
Deep networks: many hidden layers with a large number of neurons
::: columns
:::: {.column width=50%}
* Challenges
* Hard to train ("vanishing gradient problem")
* Training slow
* Risk of overtraining
::::
:::: {.column width=50%}
* Big progress in recent years
* Interest in NNs waned before ca. 2006
* Milestone: paper by G. Hinton et al. (2006): "A fast learning algorithm for deep belief nets"
* Image recognition, AlphaGo, …
* Soon: self-driving cars, …
::::
:::
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/dnn.png}
\end{figure}
## Drawbacks of the sigmoid activation function
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/sigmoid.png}
::::
:::: {.column width=50%}
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
\vspace{3ex}
* Saturated neurons “kill” the gradients
* Sigmoid outputs are not zero-centered
* exp() is somewhat expensive to compute
::::
:::
## Activation functions
## Recap: Activation functions
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/activation_functions.png}
@@ -324,7 +184,7 @@ Goal: generalization of training data
\includegraphics[width=0.8\textwidth]{figures/underfitting_overfitting.pdf}
\end{figure}
## Example of overtraining
## Recap: Overtraining
Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining
\begin{figure}
@@ -351,202 +211,21 @@ Monitor fraction of misclassified events (or loss function):
$L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$
\end{center}
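As an illustration, adding an $L_2$ penalty to a single layer in Keras (a minimal sketch; the penalty strength 0.01 is an arbitrary choice):

\footnotesize
```python
from tensorflow.keras import layers, regularizers

# adds 0.01 * sum_k W_k^2 of this layer's weights to the training loss
dense = layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(0.01))
```
\normalsize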
## Another approach to prevent overfitting: Dropout
* Randomly remove nodes during training
* Avoid co-adaptation of nodes
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/dropout.png}
\end{figure}
\scriptsize
\textcolor{gray}{Srivastava et al.,}
[\textcolor{gray}{"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"}](jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf)
\normalsize
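A minimal Keras sketch (hypothetical layer sizes; dropout is only active during training):

\footnotesize
```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(20,)),
    layers.Dropout(0.5),       # randomly sets half of the activations to zero while training
    layers.Dense(1, activation='sigmoid'),
])
```
\normalsize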
## Pros and cons of multi-layer perceptrons
\textcolor{green}{Pros}
* Capability to learn non-linear models
\vspace{3ex}
\textcolor{red}{Cons}
* Loss function can have several local minima
* Hyperparameters need to be tuned
* \textcolor{gray}{number of layers, neurons per layer, and training iterations}
* Sensitive to feature scaling
* \textcolor{gray}{preprocessing needed (e.g., scaling of all features to the range [0,1])}
## Example 1: Boston house prices (MLP regression) (1)
* Objective: predict house prices in Boston suburbs in the mid-1970s
* Boston house data set: 506 instances, 13 features
\footnotesize
```
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
```
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_house_prices.ipynb)
## Example 1: Boston house prices (MLP regression) (2)
```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

boston = datasets.load_boston()  # removed in scikit-learn >= 1.2
X = boston.data
y = boston.target

# train/test split (assumed; not shown on the slide)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(100,),
                   activation='logistic', random_state=1, max_iter=5000)
mlp.fit(X_train, y_train)

y_pred_mlp = mlp.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred_mlp))
print(f"root mean square error {rms:.2f}")
```
## Example 1: Boston house prices (MLP regression) (3)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/boston_house_prices.pdf}
\end{center}
## Exercise 1: XOR
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_1\_xor.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
\normalsize
::: columns
:::: {.column width=60%}
a) Define a multi-layer perceptron classifier that learns the XOR problem.
\scriptsize
```python
from sklearn.neural_network import MLPClassifier
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
```
\normalsize
b) Define a multi-layer perceptron regressor that fits the depicted 2d data (see notebook).
c) Plot the mean square error vs. the number of training epochs for b).
::::
:::: {.column width=40%}
\vspace{10ex}
![](figures/xor_like_data.pdf)
::::
:::
## Exercise 2: Visualising decision boundaries of classifiers
## Exercise 1: Hyperparameter optimization
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_2\_decision\_boundaries.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
[\textcolor{gray}{05\_neural\_networks\_ex\_1\_hyperparameter\_optimization.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_hyperparameter_optimization.ipynb)
\normalsize
\vspace{5ex}
Visualize the decision boundaries of a scikit-learn decision tree, a scikit-learn multi-layer perceptron, and XGBoost for different toy data sets.
## Exercise 3: Boston house prices (hyperparameter optimization)
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_3\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
\normalsize
\vspace{5ex}
a) Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, ...)? Try this first by hand.
b) Now use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters.
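A minimal sketch for b), assuming training data `X_train`, `y_train` from the notebook (the parameter grid is only an example):

\footnotesize
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "logistic"],
    "alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(MLPRegressor(max_iter=5000, random_state=1),
                      param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```
\normalsize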
## TensorFlow
::: columns
:::: {.column width=70%}
* Powerful open source library with a focus on deep neural networks
* Performs computations using data flow graphs
* Takes care of computing gradients of the defined functions (\textit{automatic differentiation}; see the sketch below)
* Computations in parallel on multiple CPUs or GPUs
* Developed by the Google Brain team
* Initial release in 2015
* [https://www.tensorflow.org/](https://www.tensorflow.org/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/tensorflow.png}
\end{center}
::::
:::
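A minimal sketch of TensorFlow's automatic differentiation with a gradient tape (toy function, not from the lecture):

\footnotesize
```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2 + 2.0 * x          # any computation recorded on the tape
dy_dx = tape.gradient(y, x)     # automatic differentiation: dy/dx = 2x + 2 = 8
print(float(dy_dx))
```
\normalsize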
## Keras
::: columns
:::: {.column width=70%}
* Open-source library providing high-level building blocks for developing deep-learning models
* Uses TensorFlow as \textit{backend engine} for low-level tensor manipulation (version 2.4)
* Part of TensorFlow core API since TensorFlow 1.4 release
* Over 375,000 individual users as of early 2020
* Primary author: Fran\c{c}ois Chollet (Google engineer)
* [https://keras.io/](https://keras.io/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/keras.png}
\end{center}
::::
:::
## Example 2: Boston house prices with Keras
The multi-layer perceptron did not perform well on the superconductivity dataset. Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, learning rate...)?
\small
```python
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(train_data.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model.fit(partial_train_data, partial_train_targets,
          epochs=num_epochs, batch_size=1, verbose=0)
# Evaluate the model on the validation data
val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
```
\normalsize
\vspace{2ex}
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_keras.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_keras.ipynb)
Use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters.
## Convolutional neural networks (CNNs)
## Convolutional neural networks (CNNs): Overview
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/cnn.png}
\end{center}
@@ -593,14 +272,99 @@ val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
::::
:::
## Convolution
[Convolution](https://en.wikipedia.org/wiki/Convolution) of a function $f$ with a kernel or filter function $g$:
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/convolution.png}
\end{figure}
\vspace{1ex}
Practical example: blurring of an image with a Gaussian filter \newline
\tiny \textcolor{gray}{https://www.cs.cornell.edu/courses/cs6670/2011sp/lectures/lec02\_filter.pdf} \normalsize
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/gaussian_filter.png}
\end{figure}
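As a small illustration of the blurring example, convolving a (hypothetical) grayscale image with a Gaussian kernel using SciPy:

\footnotesize
```python
import numpy as np
from scipy import ndimage

image = np.random.rand(64, 64)                       # stand-in for a grayscale image
blurred = ndimage.gaussian_filter(image, sigma=2.0)  # convolution with a Gaussian kernel
```
\normalsize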
## Filters can detect structures/features in an image (1)
Filters to detect the characteristic features of an "X":
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/features_of_x.png}
\end{figure}
\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
## Filters can detect structures/features in an image (2)
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/filters_to_detect_x_features.png}
\end{figure}
\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
## Filters can detect structures/features in an image (3)
The convolution operation
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/convolutional_operation.png}
\end{figure}
\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
## Filters can detect structures/features in an image (4)
Extracting a feature map
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/sliding_filter.png}
\end{figure}
\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
## CNNs for classification
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/cnns_for_classification.png}
\end{figure}
\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
## AlexNet
"ImageNet Classification with Deep Convolutional Neural Networks", \newline
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/alexnet.png}
\end{figure}
\tiny \textcolor{gray}{figure: https://www.mdpi.com/2072-4292/9/8/848} \normalsize
## Inductive bias of CNNs
Can you summarize the inductive bias that is associated with a convolutional neural network? \newline
\vspace{0.5ex}
ChatGPT Mar 23 Version: \newline
\vspace{0.5ex}
\small "Yes, of course! The inductive bias associated with a convolutional neural network (CNN) is that it assumes that the input data has a grid-like topology, such as images, audio spectrograms or video frames.
CNNs are designed to learn local and translation-invariant features by convolving small filters over the input data. This allows them to identify patterns in the input that are useful for classification or other tasks. The filters can be learned through backpropagation, and the output of each convolutional layer is usually downsampled (e.g., through max pooling) to reduce the dimensionality of the data.
The inductive bias of CNNs allows them to perform well on tasks involving images, audio, and video data, where the local structure of the data is important. However, it may not be as effective for other types of data that do not have a grid-like topology, such as text or graphs. In these cases, other types of neural networks or machine learning algorithms may be more appropriate."
\normalsize
## MNIST classification with a CNN in Keras
\footnotesize
```python
from keras.models import Sequential
from keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input
# conv layer with 8 3x3 filters
model = Sequential(
    [
        Input(shape=input_shape),
@@ -656,7 +420,7 @@ model.compile(loss="categorical_crossentropy",
\footnotesize
```python
from keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
checkpoint = ModelCheckpoint(
    filepath="mnist_keras_model.h5",
@@ -672,10 +436,10 @@ history = model.fit(x_train, y_train, # Training data
```
\normalsize
## Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
## Exercise 2: Training a digit-classification neural network on the MNIST dataset using Keras
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_4\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
[\textcolor{gray}{05\_neural\_networks\_ex\_2\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_mnist_keras_train.ipynb)
\normalsize
\vspace{5ex}
@@ -686,44 +450,175 @@ b) Determine the accuracy of the fully trained model.
c) Create a second notebook that reads the trained model (`mnist_keras_model.h5`). Read `your_own_digit.png` and classify it. Create your own $28 \times 28$ pixel digits with a program like gimp and check how the model performs.
## Exercise 5: Higgs data set (1)
d) Try to improve the performance of the network by increasing the number of filters and by adding a second convolutional layer.
Application of deep neural networks for separation of signal and background in an exotic Higgs scenario
## Graph Neural Networks
\vfill
\tiny \textcolor{gray}{slides on GNNs by Martin Kroesen} \normalsize
\small
\color{gray}
In this exercise we want to explore various techniques to optimize the event selection in the search for supersymmetric Higgs bosons at the LHC. In supersymmetry the Higgs sector consists of five Higgs bosons, in contrast to the single Higgs boson of the Standard Model. Here we deal with a heavy Higgs boson which decays into two W bosons and a standard Higgs boson ($H^0 \to W^+ W^- h$), which decay further into leptons ($W^\pm \to l^\pm \nu$) and b-quarks ($h\to b \bar{b}$), respectively.
::: columns
:::: {.column width=65%}
* Graph Neural Networks (GNNs): Neural Networks that operate on graph structured data
* Graph: consists of nodes that can be connected by edges, edges can be directed or undirected
* no grid structure as given for CNNs
* node features and edge features possible
* relation often represented by adjacency matrix: $A_{ij}=1$ if there is a link between node $i$ and $j$, else 0
* tasks on node level, edge level and graph level
* full lecture: \url{https://web.stanford.edu/class/cs224w/}
::::
:::: {.column width=35%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/graph_example.png}
\normalsize
\end{center}
::::
:::
This exercise is based on a [Nature paper](https://www.nature.com/articles/ncomms5308) (Pierre Baldi, Peter Sadowski, Daniel Whiteson), which contains much more information, such as general background, details about the selection variables, and links to large sets of simulated events. You might also use the paper as inspiration for the solution of this exercise.
## Simple Example: Zachary's karate club
## Exercise 5: Higgs data set (2)
::: columns
:::: {.column width=60%}
* link: \url{https://en.wikipedia.org/wiki/Zachary's_karate_club}
* 34 nodes: each node represents a member of the karate club
* 4 classes: a community each member belongs to
* task: classify the nodes
* many real world problems for GNNs exist, e.g.\ social networks, molecules, recommender systems, particle tracks
::::
:::: {.column width=40%}
\begin{center}
\includegraphics[width=1.\textwidth]{figures/karateclub.png}
\normalsize
\end{center}
::::
:::
The two data sets consist of 10k and 100k events, respectively. For each event, 29 variables are stored:
## From CNN to GNN
\footnotesize
```
0: classification (1 = signal, 0 = background)
1 - 21 : low level quantities (var1 - var21)
22 - 28 : high level quantities (var22 - var28)
```
\begin{center}
\includegraphics[width=0.8\textwidth]{figures/fromCNNtoGNN.png}
\normalsize
\newline
\tiny (from Stanford GNN lecture)
\end{center}
\normalsize
* GNN: Generalization of convolutional neural network
* No grid structure, arbitrary number of neighbors defined by adjacency matrix
* Operations pass information from neighborhood
## Architecture: Graph Convolutional Network
::: columns
:::: {.column width=60%}
* Message passing from connected nodes
* The graph convolution is defined as:
$$ H^{(l+1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$
* The adjacency matrix $A$ including self-connections is given by $\tilde{A}$
* The degree matrix of the corrected adjacency matrix is given by $\tilde{D}_{ii} = \Sigma_j \tilde{A}_{ij}$
* The weights of the given layer are called $W^{(l)}$
* $H^{(l)}$ is the matrix for activations in layer $l$
::::
:::: {.column width=40%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/GCN.png}
\normalsize
\end{center}
\tiny \url{https://arxiv.org/abs/1609.02907}
::::
:::
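A minimal NumPy sketch of this propagation rule on a toy graph (hypothetical sizes; ReLU as $\sigma$):

\footnotesize
```python
import numpy as np

def gcn_layer(H, A, W):
    """H^(l+1) = sigma(D~^-1/2 A~ D~^-1/2 H^(l) W^(l)) with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])                      # add self-connections
    d = A_tilde.sum(axis=1)                               # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)    # toy graph with 3 nodes
H = np.random.rand(3, 2)                                  # 2 features per node
W = np.random.rand(2, 4)                                  # layer weights
H_next = gcn_layer(H, A, W)                               # shape (3, 4)
```
\normalsize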
You can read the data as follows:
\scriptsize
```python
#filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_10k.csv"
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_100k.csv"
## Architecture: Graph Attention Network
::: columns
:::: {.column width=50%}
* Calculate the attention coefficients $e_{ij}$ from the features $\vec{h}$ for each node $i$ with its neighbors $j$
$$ e_{ij} = a\left( W\vec{h}_i, W\vec{h}_j \right)$$
$a$: learnable weight vector
* Normalize attention coefficients
$$ \alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\text{exp}(e_{ij})}{\Sigma_k \text{exp}(e_{ik})} $$
* Calculate node features
$$ \vec{h}^{(l+1)}_i = \sigma \left( \Sigma_j \alpha_{ij} W \vec{h}^{(l)}_j \right)$$
::::
:::: {.column width=50%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/GraphAttention.png}
\normalsize
\end{center}
\tiny \url{https://arxiv.org/abs/1710.10903}
::::
:::
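A simplified NumPy sketch of a single attention head implementing these three steps (toy graph; the LeakyReLU used in the original paper is omitted for brevity):

\footnotesize
```python
import numpy as np

def gat_layer(H, A, W, a):
    Wh = H @ W                                        # transformed features W h_i
    n, out = Wh.shape
    H_new = np.zeros((n, out))
    for i in range(n):
        nbrs = np.flatnonzero(A[i])                   # neighbours j of node i
        e = np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in nbrs])
        alpha = np.exp(e) / np.exp(e).sum()           # softmax over the neighbours
        H_new[i] = np.maximum(0.0, alpha @ Wh[nbrs])  # h_i' = sigma(sum_j alpha_ij W h_j)
    return H_new

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)    # toy graph
H, W = np.random.rand(3, 2), np.random.rand(2, 4)
a = np.random.rand(8)                                      # attention weight vector
H_next = gat_layer(H, A, W, a)
```
\normalsize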
## Example: Identification of inelastic interactions in TRD
::: columns
:::: {.column width=60%}
* Identification of inelastic interactions of light antinuclei
in the Transition Radiation Detector in ALICE
* Thesis: \url{https://www.physi.uni-heidelberg.de/Publications/Bachelor_Thesis_Maximilian_Hammermann.pdf}
* Construct nearest neighbor graph from signals in detector
* Use global pooling for graph classification
::::
:::: {.column width=40%}
Interaction of an antideuteron:
\begin{center}
\includegraphics[width=0.8\textwidth]{figures/antideuteronsgnMax.png}
\normalsize
\end{center}
::::
:::
\begin{center}
\includegraphics[width=0.9\textwidth]{figures/GNN_conf.png}
\normalsize
\end{center}
## Example: Google Maps
* link: \url{https://www.deepmind.com/blog/traffic-prediction-with-advanced-graph-neural-networks}
* GNNs are used for traffic predictions and estimated times of arrival (ETAs)
\begin{center}
\includegraphics[width=0.8\textwidth]{figures/GNNgooglemaps.png}
\normalsize
\end{center}
df = pd.read_csv(filename, engine='python')
```
## Example: AlphaFold
* link: \url{https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology}
* "A folded protein can be thought of as a 'spatial graph', where residues are the nodes and edges connect the residues in close proximity"
\begin{center}
\includegraphics[width=0.9\textwidth]{figures/alphafold.png}
\normalsize
a) Use a classifier of your choice to separate signal and background events. Determine the accuracy score.
b) Compare the results when using i) the low level quantities and ii) the high level quantities.
\end{center}
## Exercise 3: Illustration of Graphs and Graph Neural Networks
On the PyTorch Geometric webpage, you can find official examples for the application of Graph Neural Networks:
https://pytorch-geometric.readthedocs.io/en/latest/get_started/colabs.html
\vspace{3ex}
The first introduction notebook shows the functionality of graphs with the example of the Karate Club. Follow and reproduce the first [\textcolor{green}{notebook}](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing). Study and understand the data format.
\vspace{3ex}
At the end, the separation power of Graph Convolutional Networks (GCNs) is shown via the node embeddings. You can replace the GCN layers with Graph Attention layers and compare the results.
## Exercise 4: Classifying molecules
The PyTorch Geometric webpage also provides an example of the classification of molecules:
https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb?usp=sharing
\vspace{3ex}
Study this notebook and then modify it to apply it to the [PROTEINS dataset](https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/data_cheatsheet.html).
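A possible starting point for loading the PROTEINS data with PyTorch Geometric (the directory name is arbitrary; the `DataLoader` import assumes a recent PyTorch Geometric version):

\footnotesize
```python
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='data/TUDataset', name='PROTEINS')   # graph-classification dataset
print(len(dataset), dataset.num_classes, dataset.num_node_features)
loader = DataLoader(dataset, batch_size=64, shuffle=True)     # mini-batches of graphs
```
\normalsize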
## Practical advice -- Which algorithm to choose?
