% Introduction to Data Analysis and Machine Learning in Physics: \ 5. Convolutional Neural Networks and Graph Neural Networks
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023

## Historical perspective: Perceptron (1)

::: columns
:::: {.column width=65%}
\begin{center}
\includegraphics[width=0.40\textwidth]{figures/perceptron_weighted_sum.png}
\vspace{1ex}
\includegraphics[width=0.75\textwidth]{figures/perceptron_retina.png}
\end{center}
::::
:::: {.column width=35%}
$$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0,\\0 & \text{otherwise}\end{cases}$$
\begin{center}
\includegraphics[width=0.95\textwidth]{figures/perceptron_photo.png}
\tiny
\textcolor{gray}{Mark 1 Perceptron. Frank Rosenblatt (1961)}
\normalsize
\end{center}
::::
:::
\footnotesize
\vspace{2ex}
\textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).}
\normalsize

## Historical perspective: Perceptron (2)
::: columns
:::: {.column width=60%}
* McCulloch–Pitts (MCP) neuron (1943)
	* First mathematical model of a biological neuron
	* Boolean input
	* Equal weights for all inputs
	* Threshold hardcoded
* Improvements by Rosenblatt
	* Different weights for inputs
	* Algorithm to update weights and threshold given labeled training data

\vfill

Shortcoming of the perceptron: \newline
it cannot learn the XOR function \newline
\tiny \textcolor{gray}{Minsky, Papert, 1969} \normalsize

::::
:::: {.column width=40%}
![](figures/perceptron_with_threshold.png){width=80%}
![](figures/xor.png)
\small \textcolor{gray}{XOR: not linearly separable } \normalsize

::::
:::
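
## Sketch: perceptron learning rule and XOR

A minimal NumPy sketch of the weight update described above, assuming the classic rule $w \leftarrow w + \eta\,(t - h(\vec x))\,\vec x$. The function names, the learning rate and the number of epochs are illustrative choices. The same code learns AND but cannot classify all four XOR points correctly.

\footnotesize
```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Rosenblatt-style update: w += lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

w, b = train_perceptron(X, y_and)
acc_and = np.mean(predict(X, w, b) == y_and)
w, b = train_perceptron(X, y_xor)
acc_xor = np.mean(predict(X, w, b) == y_xor)
print(acc_and, acc_xor)
```
\normalsize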

## The biological inspiration: the neuron

\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/neuron.png}
\end{figure}

## Neural network output and decision boundaries
::: columns
:::: {.column width=75%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/nn_decision_boundary.png}
\end{figure}
::::
:::: {.column width=25%}
\vspace{3ex}
\footnotesize
\textcolor{gray}{P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273}
\normalsize
::::
:::

## Recap: Backpropagation
Backpropagation summary

* Make prediction for a given training instance (forward pass)
* Calculate error (value of loss function)
* Go backwards and determine the contribution of each weight (reverse pass)
* Adjust the weights to reduce the error

\vfill

Practical considerations:

* Nowadays, neural networks are implemented with frameworks like Keras or TensorFlow
* No need to implement backpropagation yourself
* TensorFlow calculates the gradients efficiently via automatic differentiation ('autodiff')
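
## Sketch: backpropagation by hand

To make the four steps concrete, here is a hedged NumPy sketch for a tiny network with one hidden layer. The tanh activation, the squared-error loss, the layer sizes and the learning rate are assumptions made for illustration. The hand-derived gradient is cross-checked against a finite difference.

\footnotesize
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # one training instance
y = 1.0                          # its target value
W1 = rng.normal(size=(4, 3))     # hidden-layer weights
w2 = rng.normal(size=4)          # output weights

def loss(W1, w2):
    h = np.tanh(W1 @ x)
    return 0.5 * (w2 @ h - y) ** 2

# Forward pass, keeping the intermediate values
h = np.tanh(W1 @ x)
y_hat = w2 @ h

# Reverse pass: chain rule from the loss back to each weight
d_yhat = y_hat - y               # dL/dy_hat
grad_w2 = d_yhat * h             # dL/dw2
d_z = d_yhat * w2 * (1 - h**2)   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(d_z, x)       # dL/dW1

# Cross-check one entry with a central finite difference
eps = 1e-5
shift = np.zeros_like(W1)
shift[0, 0] = eps
numeric = (loss(W1 + shift, w2) - loss(W1 - shift, w2)) / (2 * eps)

# Adjust the weights to reduce the error
lr = 1e-3
loss_before = loss(W1, w2)
loss_after = loss(W1 - lr * grad_W1, w2 - lr * grad_w2)
```
\normalsize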


## More on gradient descent

::: columns
:::: {.column width=60%}
* Stochastic gradient descent
	* just uses one training event at a time
	* fast, but quite irregular approach to the minimum
	* can help escape local minima
	* one can decrease learning rate to settle at the minimum ("simulated annealing")
* Batch gradient descent
	* use entire training sample to calculate gradient of loss function
	* computationally expensive
* Mini-batch gradient descent
	* calculate gradient for a random sub-sample of the training set

::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/stochastic_gradient_descent.png}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/gradient_descent_cmp.png}
\end{figure}
::::
:::
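
## Sketch: one loop for all three variants

The three variants differ only in how many events enter each gradient estimate. The sketch below assumes a linear model with mean squared error on noise-free toy data; `gradient_descent` and its parameters are illustrative names, not part of any library.

\footnotesize
```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Minimize the mean squared error of a linear model y ~ X @ w."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true                                    # noise-free toy data

w_sgd = gradient_descent(X, y, batch_size=1)      # stochastic
w_mini = gradient_descent(X, y, batch_size=20)    # mini-batch
w_batch = gradient_descent(X, y, batch_size=200)  # batch (entire sample)
```
\normalsize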

## Universal approximation theorem

::: columns
:::: {.column width=60%}
"A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of $\mathbb{R}^n$."

\vspace{5ex}

One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions

\vspace{5ex}

The theorem does not address whether the network parameters (weights) can actually be learned algorithmically

::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/ann.png}
\end{figure}
::::
:::

## Recap: Activation functions
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/activation_functions.png}
\end{figure}

## ReLU
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/relu.png}
::::
:::: {.column width=50%}
$$ f(x) = \max(0,x) $$
\vspace{1ex}

* Does not saturate (in +region)
* Very computationally efficient
* Converges much faster than sigmoid or tanh in practice
* Actually more biologically plausible than sigmoid
* But: gradient vanishes for $x < 0$

::::
:::
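
## Sketch: ReLU and its gradient

A short NumPy sketch of the points above. The sigmoid is included only for comparison, and setting the ReLU derivative to 0 at $x = 0$ is a convention chosen here.

\footnotesize
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 for x < 0 (0 chosen at x = 0 by convention)
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(relu(x))
print(relu_grad(x))      # zero for x < 0, one for x > 0
print(sigmoid_grad(x))   # small at both ends: the sigmoid saturates
```
\normalsize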


## Bias-variance tradeoff

Goal: generalization from the training data to unseen data

* Simple models (few parameters): danger of bias
	* \textcolor{gray}{Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance")}
* Complex models (many parameters): danger of overfitting
	* \textcolor{gray}{large variance of decision boundaries for different training samples}

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/underfitting_overfitting.pdf}
\end{figure}

## Recap: Overtraining
Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining

\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/example_overtraining.png}
\end{figure}

## Monitoring overtraining
Monitor fraction of misclassified events (or loss function):
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/monitoring_overtraining.png}
\end{figure}

## Regularization: Avoid overfitting
\scriptsize
[\hfill \textcolor{gray}{http://cs231n.stanford.edu/slides}](http://cs231n.stanford.edu/slides)
\normalsize
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/regularization.png}
\end{figure}
\begin{center}
$L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$
\end{center}
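
## Sketch: adding the penalty to the loss

The two penalty terms above can be written in a few lines of NumPy. Combining them with the data loss as $L = L_\text{data} + \lambda R(W)$, with a regularization strength $\lambda$ (`lam`), is the usual convention and is assumed here; the weight values are made up.

\footnotesize
```python
import numpy as np

def l1_penalty(W):
    return np.sum(np.abs(W))

def l2_penalty(W):
    return np.sum(W ** 2)

def regularized_loss(data_loss, W, lam, penalty):
    # total loss = data loss + lambda * R(W); lam is a hyperparameter
    return data_loss + lam * penalty(W)

W = np.array([[0.5, -2.0],
              [0.0, 1.0]])
print(l1_penalty(W))   # 3.5
print(l2_penalty(W))   # 5.25
print(regularized_loss(1.0, W, lam=0.1, penalty=l2_penalty))
```
\normalsize

The $L_2$ term penalizes the single large weight ($-2$) much more strongly than the $L_1$ term does.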

## Exercise 1: Hyperparameter optimization

\small
[\textcolor{gray}{05\_neural\_networks\_ex\_1\_hyperparameter\_optimization.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_hyperparameter_optimization.ipynb)
\normalsize

\vspace{5ex}

The multi-layer perceptron did not perform well on the superconductivity dataset. Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, learning rate...)?

\vspace{2ex}

Use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters.

## Convolutional neural networks (CNNs): Overview
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/cnn.png}
\end{center}
::: columns
:::: {.column width=80%}
* CNNs emerged from the study of the visual cortex
* Behind many deep learning successes (e.g. in image recognition)
* Partially connected layers
	* \textcolor{gray}{Fully connected layers impractical for large images (too many neurons, overfitting)}
* Key component: Convolutional layers
	* \textcolor{gray}{Set of learnable filters}
	* \textcolor{gray}{Low-level features at the first layers; high-level features at the end}
::::
:::: {.column width=20%}
\small
\textcolor{gray}{Sliding $3 \times3$ filter}
![](figures/cnn_sliding_filter.png)
::::
:::

## Different types of layers in a CNN
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{1. Convolutional layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_conv_layer.png}
::::
:::: {.column width=50%}
\small \textcolor{gray}{3. Fully connected layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_fully_connected.png}
::::
:::

\vspace{3ex}

::: columns
:::: {.column width=60%}
\vfill
\small \textcolor{gray}{2. Pooling layers} \newline
\includegraphics[width=\textwidth]{figures/cnn_pooling.png}
::::
:::: {.column width=40%}
\textcolor{gray}{\footnotesize Afshine Amidi, Shervine Amidi} \
[\textcolor{gray}{\footnotesize Convolutional Neural Networks cheatsheet}](https://github.com/afshinea/stanford-cs-230-deep-learning/blob/master/en/cheatsheet-convolutional-neural-networks.pdf)
::::
:::

## Convolution
[Convolution](https://en.wikipedia.org/wiki/Convolution) of a function $f$ with a kernel or filter function $g$:
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/convolution.png}
\end{figure}

\vspace{1ex}

Practical example: blurring of an image with a Gaussian filter \newline
\tiny \textcolor{gray}{https://www.cs.cornell.edu/courses/cs6670/2011sp/lectures/lec02\_filter.pdf} \normalsize
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/gaussian_filter.png}
\end{figure}
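
## Convolution: continuous and discrete form

For reference, the standard definitions, written out under the assumption that they match the figure on the previous slide. Here $I$ denotes an image and $K$ a filter kernel; both symbols are chosen for this slide.

$$ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, \mathrm{d}\tau $$

$$ (I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n) $$

Deep learning libraries typically compute the cross-correlation $\sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$ instead, i.e. without flipping the kernel. Since the filter values are learned, this makes no practical difference.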

## Filters can detect structures/features in an image (1)

Filters to detect the "X" structures:
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/features_of_x.png}
\end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

## Filters can detect structures/features in an image (2)

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/filters_to_detect_x_features.png}
\end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

## Filters can detect structures/features in an image (3)
The convolutional operation
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/convolutional_operation.png}
\end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

## Filters can detect structures/features in an image (4)
Extracting a feature map
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/sliding_filter.png}
\end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize
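
## Sketch: sliding filter and max pooling

A plain NumPy sketch of extracting a feature map and pooling it, assuming stride 1 and no padding. The toy image and the edge filter are made up for illustration. Real frameworks use much faster implementations.

\footnotesize
```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    blocks = fmap[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((28, 28))
image[:, 14:] = 1.0                          # dark left half, bright right half
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)    # responds to vertical edges

fmap = conv2d_valid(image, kernel)
pooled = max_pool2d(fmap)
print(fmap.shape, pooled.shape)              # (26, 26) (13, 13)
```
\normalsize

The feature map is non-zero only in the two columns where the filter window straddles the edge.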

## CNNs for classification
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/cnns_for_classification.png}
\end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

## AlexNet
"ImageNet Classification with Deep Convolutional Neural Networks", \newline
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/alexnet.png}
\end{figure}

\tiny \textcolor{gray}{figure: https://www.mdpi.com/2072-4292/9/8/848} \normalsize


## Inductive bias of CNNs

Can you summarize the inductive bias that is associated with a convolutional neural network? \newline

\vspace{0.5ex}
ChatGPT Mar 23 Version: \newline
\vspace{0.5ex}
\small "Yes, of course! The inductive bias associated with a convolutional neural network (CNN) is that it assumes that the input data has a grid-like topology, such as images, audio spectrograms or video frames.

CNNs are designed to learn local and translation-invariant features by convolving small filters over the input data. This allows them to identify patterns in the input that are useful for classification or other tasks. The filters can be learned through backpropagation, and the output of each convolutional layer is usually downsampled (e.g., through max pooling) to reduce the dimensionality of the data.

The inductive bias of CNNs allows them to perform well on tasks involving images, audio, and video data, where the local structure of the data is important. However, it may not be as effective for other types of data that do not have a grid-like topology, such as text or graphs. In these cases, other types of neural networks or machine learning algorithms may be more appropriate."
\normalsize

## MNIST classification with a CNN in Keras
\footnotesize
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input

input_shape = (28, 28, 1)  # MNIST: 28x28 pixels, one grayscale channel
num_classes = 10           # digits 0-9

# one convolutional layer with eight 3x3 filters, followed by max pooling
# and two fully connected layers
model = Sequential(
    [
        Input(shape=input_shape),
        Conv2D(8, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(16, activation="relu"),
        Dense(num_classes, activation="softmax"),
    ]
)

model.summary()
```
\normalsize

## Defining the CNN in Keras (2)

\footnotesize
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 8)         80
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 8)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1352)              0
_________________________________________________________________
dense_2 (Dense)              (None, 16)                21648
_________________________________________________________________
dense_3 (Dense)              (None, 10)                170
=================================================================
Total params: 21,898
Trainable params: 21,898
Non-trainable params: 0
```
\normalsize
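
## Where do the parameter counts come from?

The numbers in the summary can be reproduced by hand. The sketch assumes a $28 \times 28 \times 1$ input (grayscale MNIST), which is consistent with the output shapes shown; the helper functions are illustrative.

\footnotesize
```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # one kernel per (input channel, filter) pair plus one bias per filter
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_in, n_out):
    # full weight matrix plus one bias per output neuron
    return n_in * n_out + n_out

conv = conv2d_params(3, 3, 1, 8)   # 28x28x1 input -> 26x26x8 feature maps
flat = 13 * 13 * 8                 # after 2x2 max pooling and Flatten
dense1 = dense_params(flat, 16)
dense2 = dense_params(16, 10)
print(conv, flat, dense1, dense2, conv + dense1 + dense2)
# 80 1352 21648 170 21898
```
\normalsize

Almost all parameters sit in the first fully connected layer, not in the convolutional layer.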

## Model definition
Using Keras, you have to `compile` a model, which means adding the loss function, the optimizer algorithm and validation metrics to your training setup.
\vspace{5ex}

\footnotesize
```python
model.compile(loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"])
```
\normalsize
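
## Sketch: what the loss and the metric compute

A NumPy sketch of the quantities named in `compile`, assuming one-hot encoded labels. It follows the textbook formulas and is not Keras's internal implementation; the logits are made-up numbers.

\footnotesize
```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    # mean over the batch of -sum_k y_k * log(p_k)
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

def accuracy(y_onehot, probs):
    return np.mean(probs.argmax(axis=1) == y_onehot.argmax(axis=1))

logits = np.array([[2.0, 0.5, 0.1],    # confident and correct
                   [0.2, 0.3, 0.1]])   # nearly undecided, wrong class on top
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])
probs = softmax(logits)
print(categorical_crossentropy(y_onehot, probs), accuracy(y_onehot, probs))
```
\normalsize

With integer labels instead of one-hot vectors, Keras offers `sparse_categorical_crossentropy` as the corresponding loss.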

## Model training

\footnotesize
```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint(
            filepath="mnist_keras_model.h5",
            save_best_only=True,
            verbose=1)
early_stopping = EarlyStopping(patience=2)

history = model.fit(x_train, y_train, # Training data
            batch_size=200, # Batch size
            epochs=50, # Maximum number of training epochs
            validation_split=0.5, # Use 50% of the train dataset for validation
            callbacks=[checkpoint, early_stopping]) # Register callbacks
```
\normalsize
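
## Sketch: the idea behind early stopping

A simplified pure-Python sketch of what `EarlyStopping(patience=2)` does: stop once the validation loss has not improved for `patience` consecutive epochs. The real Keras callback has more options (e.g. `min_delta`, `restore_best_weights`); the loss values below are invented.

\footnotesize
```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch after which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: all epochs were run

val_losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.44]
print(early_stopping_epoch(val_losses))   # 4
```
\normalsize

Training stops after epoch 4 (counting from 0), while the best validation loss was reached in epoch 2. This is why the checkpoint with `save_best_only=True` is useful alongside early stopping.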

## Exercise 2: Training a digit-classification neural network on the MNIST dataset using Keras

\small
[\textcolor{gray}{05\_neural\_networks\_ex\_2\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_mnist_keras_train.ipynb)
\normalsize

\vspace{5ex}

a) Plot training and validation loss as well as training and validation accuracy as a function of the number of epochs

b) Determine the accuracy of the fully trained model.

c) Create a second notebook that reads the trained model (`mnist_keras_model.h5`). Read `your_own_digit.png` and classify it. Create your own $28 \times 28$ pixel digits with a program like GIMP and check how the model performs.

d) Try to improve the performance of the network by increasing the number of filters and by adding a second convolutional layer.

## Graph Neural Networks

\tiny \textcolor{gray}{slides on GNNs by Martin Kroesen} \normalsize

::: columns
:::: {.column width=65%}
* Graph Neural Networks (GNNs): Neural networks that operate on graph-structured data
* Graph: consists of nodes that can be connected by edges; edges can be directed or undirected
* no grid structure as given for CNNs
* node features and edge features possible
* relation often represented by adjacency matrix: $A_{ij}=1$ if there is a link between node $i$ and $j$, else 0
* tasks on node level, edge level and graph level
* full lecture: \url{https://web.stanford.edu/class/cs224w/}
::::
:::: {.column width=35%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/graph_example.png}
\normalsize
\end{center}
::::
:::
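
## Sketch: adjacency matrix from an edge list

A small NumPy sketch of the adjacency matrix defined above. The 4-node edge list is a hypothetical toy graph and `adjacency_matrix` is an illustrative helper, not a library function.

\footnotesize
```python
import numpy as np

def adjacency_matrix(n_nodes, edges, directed=False):
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        if not directed:
            A[j, i] = 1      # undirected: the link counts in both directions
    return A

edges = [(0, 1), (0, 2), (2, 3)]    # hypothetical toy graph
A = adjacency_matrix(4, edges)
degree = A.sum(axis=1)              # number of neighbors per node
print(A)
print(degree)                       # [2 1 2 1]
```
\normalsize

For an undirected graph $A$ is symmetric; for a directed graph it generally is not.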

## Simple Example: Zachary's karate club

::: columns
:::: {.column width=60%}
* link: \url{https://en.wikipedia.org/wiki/Zachary's_karate_club}
* 34 nodes: each node represents a member of the karate club
* 4 classes: the community each member belongs to
* task: classify the nodes
* many real world problems for GNNs exist, e.g.\ social networks, molecules, recommender systems, particle tracks
::::
:::: {.column width=40%}
\begin{center}
\includegraphics[width=1.\textwidth]{figures/karateclub.png}
\normalsize
\end{center}
::::
:::

## From CNN to GNN

\begin{center}
\includegraphics[width=0.8\textwidth]{figures/fromCNNtoGNN.png}
\normalsize
\newline
\tiny (from Stanford GNN lecture)
\end{center}
\normalsize

* GNN: Generalization of convolutional neural networks
* No grid structure, arbitrary number of neighbors defined by adjacency matrix
* Operations pass information from neighborhood

## Architecture: Graph Convolutional Network

::: columns
:::: {.column width=60%}
* Message passing from connected nodes
* The graph convolution is defined as:
$$ H^{(l+1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$
* The adjacency matrix $A$ including self-connections is given by $\tilde{A}$
* The degree matrix of the corrected adjacency matrix is given by $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$
* The weights of the given layer are called $W^{(l)}$
* $H^{(l)}$ is the matrix for activations in layer $l$
::::
:::: {.column width=40%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/GCN.png}
\normalsize
\end{center}
\tiny \url{https://arxiv.org/abs/1609.02907}
::::
:::
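
## Sketch: one GCN layer in NumPy

The propagation rule above translates almost line by line into NumPy. This is a dense-matrix sketch for small graphs; the ReLU activation, the toy graph and the random features and weights are assumptions made for illustration.

\footnotesize
```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_layer(A, H, W, activation=relu):
    """One graph convolution: sigma(D~^-1/2 A~ D~^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    d = A_tilde.sum(axis=1)                     # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return activation(A_hat @ H @ W)

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))     # 4 nodes with 3 input features each
W0 = rng.normal(size=(3, 2))     # maps 3 features to 2
H1 = gcn_layer(A, H0, W0)
print(H1.shape)                  # (4, 2)
```
\normalsize

Each row of `H1` mixes the features of a node with those of its direct neighbors; stacking $k$ layers reaches neighbors up to $k$ steps away.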


## Architecture: Graph Attention Network

::: columns
:::: {.column width=50%}
* Calculate the attention coefficients $e_{ij}$ from the features $\vec{h}$ for each node $i$ with its neighbors $j$
$$ e_{ij} = a\left( W\vec{h}_i, W\vec{h}_j \right)$$
$a$: learnable weight vector
* Normalize attention coefficients over the neighborhood $\mathcal{N}_i$ of node $i$
$$ \alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} $$
* Calculate node features
$$ \vec{h}^{(l+1)}_i = \sigma \left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W \vec{h}^{(l)}_j \right)$$
::::
:::: {.column width=50%}
\begin{center}
\includegraphics[width=1.1\textwidth]{figures/GraphAttention.png}
\normalsize
\end{center}
\tiny \url{https://arxiv.org/abs/1710.10903}
::::
:::
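
## Sketch: single-head graph attention in NumPy

A dense NumPy sketch of the three steps above. It assumes the form $e_{ij} = \text{LeakyReLU}(\vec a^{\,T} [W\vec h_i \,\|\, W\vec h_j])$ from the GAT paper, a single attention head, and tanh as a placeholder for $\sigma$. The graph and all numbers are made up.

\footnotesize
```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a, activation=np.tanh):
    """Single-head graph attention; A must include self-connections."""
    Z = H @ W                                # W h_i for every node
    F = Z.shape[1]
    # a^T [W h_i || W h_j] splits into a part for node i and a part for node j
    e = leaky_relu((Z @ a[:F])[:, None] + (Z @ a[F:])[None, :])
    e = np.where(A > 0, e, -np.inf)          # only neighbors enter the softmax
    e = e - e.max(axis=1, keepdims=True)     # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return activation(alpha @ Z), alpha

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)       # includes self-connections
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)                       # length 2 * F'
H_new, alpha = gat_layer(A, H, W, a)
print(alpha.sum(axis=1))                     # each row sums to 1
```
\normalsize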

## Example: Identification of inelastic interactions in TRD

::: columns
:::: {.column width=60%}
* Identification of inelastic interactions of light antinuclei
in the Transition Radiation Detector in ALICE
* Thesis: \url{https://www.physi.uni-heidelberg.de/Publications/Bachelor_Thesis_Maximilian_Hammermann.pdf}
* Construct nearest neighbor graph from signals in detector
* Use global pooling for graph classification
::::
:::: {.column width=40%}

interaction of antideuteron:

\begin{center}
\includegraphics[width=0.8\textwidth]{figures/antideuteronsgnMax.png}
\normalsize
\end{center}
::::
:::


\begin{center}
\includegraphics[width=0.9\textwidth]{figures/GNN_conf.png}
\normalsize
\end{center}


## Example: Google Maps

* link: \url{https://www.deepmind.com/blog/traffic-prediction-with-advanced-graph-neural-networks}
* GNNs are used for traffic predictions and estimated times of arrival (ETAs)

\begin{center}
\includegraphics[width=0.8\textwidth]{figures/GNNgooglemaps.png}
\normalsize
\end{center}


## Example: AlphaFold
* link: \url{https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology}
* "A folded protein can be thought of as a 'spatial graph', where residues are the nodes and edges connect the residues in close proximity"

\begin{center}
\includegraphics[width=0.9\textwidth]{figures/alphafold.png}
\normalsize
\end{center}

## Exercise 3: Illustration of Graphs and Graph Neural Networks

On the PyTorch Geometric webpage, you can find official examples for the application of Graph Neural Networks:
https://pytorch-geometric.readthedocs.io/en/latest/get_started/colabs.html

\vspace{3ex}

The first introduction notebook shows the functionality of graphs with the example of the Karate Club. Follow and reproduce the first [\textcolor{green}{notebook}](https://colab.research.google.com/drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing). Study and understand the data format.

\vspace{3ex}

At the end, the separation power of Graph Convolutional Networks (GCNs) is shown via the node embeddings. You can replace the GCN layers with Graph Attention layers and compare the results.

## Exercise 4: Classifying molecules

The PyTorch Geometric webpage also provides an example of the classification of molecules:
https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb?usp=sharing

\vspace{3ex}

Study this notebook and then modify it to apply it to the [PROTEINS dataset](https://pytorch-geometric.readthedocs.io/en/latest/cheatsheet/data_cheatsheet.html).


## Practical advice -- Which algorithm to choose?
\textcolor{gray}{From Kaggle competitions:}

\vspace{3ex}
Structured data: "High level" features that have meaning:

* feature engineering + decision trees
* Random forests
* XGBoost

\vspace{3ex}
Unstructured data: "Low level" features, no individual meaning:

* deep neural networks
* e.g. image classification: convolutional NN


## Outlook: Autoencoders

::: columns
:::: {.column width=50%}
* Unsupervised method based on neural networks to learn a representation of the input data
* Autoencoders learn to copy the input to the output layer
	* low dimensional coding of the input in the central layer
* The decoder generates data based on the coding (*generative model*)
* Applications
	* Dimensionality reduction
	* Denoising of data
	* Machine translation
::::
:::: {.column width=50%}
\vspace{3ex}
\begin{center}
\includegraphics[width=\textwidth]{figures/autoencoder_example.pdf}
\end{center}
::::
:::

## Outlook: Generative adversarial networks (GANs)

\begin{center}
\includegraphics[width=0.65\textwidth]{figures/gan.png}
\end{center}
\scriptsize
[\textcolor{gray}{https://developers.google.com/machine-learning/gan/gan\_structure}](https://developers.google.com/machine-learning/gan/gan_structure)
\normalsize

* Discriminator's classification provides a signal that the generator uses to update its weights
* Application in particle physics: fast detector simulation
	* Full GEANT simulation is usually very CPU-intensive

## The future

"What is interesting about our intelligence is that we can play Go and then get up from the table and cook a meal, which a machine cannot do."

\vspace{2ex}

\color{gray}
\small
\hfill Bernhard Schölkopf, Max-Planck-Institut für intelligente Systeme ([Interview FAZ](https://www.faz.net/aktuell/wirtschaft/kuenstliche-intelligenz/ki-fachmann-wie-gut-europa-in-der-forschung-aufgestellt-ist-16650700.html))
\normalsize
\color{black}

\vfill

"My view is throw it all away and start again"

\color{gray}
\small
\hfill Geoffrey Hinton (DNN pioneer) on deep neural networks and backpropagation ([Interview, 2017](https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html))
\normalsize
\color{black}