Machine Learning course as part of the Studierendentage in SS 2023

% Introduction to Data Analysis and Machine Learning in Physics: \ 5. Convolutional Neural Networks and Graph Neural Networks % Jörg Marks, \underline{Klaus Reygers} % Studierendentage, 11-14 April 2023

Historical perspective: Perceptron (1)

::: columns :::: {.column width=65%} \begin{center} \includegraphics[width=0.40\textwidth]{figures/perceptron_weighted_sum.png} \vspace{1ex} \includegraphics[width=0.75\textwidth]{figures/perceptron_retina.png} \end{center} :::: :::: {.column width=35%} $$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0, \\ 0 & \text{otherwise}\end{cases}$$ \begin{center} \includegraphics[width=0.95\textwidth]{figures/perceptron_photo.png} \tiny \textcolor{gray}{Mark 1 Perceptron. Frank Rosenblatt (1961)} \normalsize \end{center} :::: ::: \footnotesize \vspace{2ex} \textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).} \normalsize

Historical perspective: Perceptron (2)

::: columns :::: {.column width=60%}

  • McCulloch–Pitts (MCP) neuron (1943)
    • First mathematical model of a biological neuron
    • Boolean input
    • Equal weights for all inputs
    • Threshold hardcoded
  • Improvements by Rosenblatt
    • Different weights for inputs
    • Algorithm to update weights and threshold given labeled training data

\vfill

Shortcoming of the perceptron: \newline it cannot learn the XOR function \newline \tiny \textcolor{gray}{Minsky, Papert, 1969} \normalsize

:::: :::: {.column width=40%} \small \textcolor{gray}{XOR: not linearly separable} \normalsize

:::: :::

The biological inspiration: the neuron

\begin{figure} \centering \includegraphics[width=0.95\textwidth]{figures/neuron.png} \end{figure}

Neural network output and decision boundaries

::: columns :::: {.column width=75%} \begin{figure} \centering \includegraphics[width=\textwidth]{figures/nn_decision_boundary.png} \end{figure} :::: :::: {.column width=25%} \vspace{3ex} \footnotesize \textcolor{gray}{P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273} \normalsize :::: :::

Recap: Backpropagation

Backpropagation summary

  • Make prediction for a given training instance (forward pass)
  • Calculate error (value of loss function)
  • Go backwards and determine the contribution of each weight (reverse pass)
  • Adjust the weights to reduce the error

\vfill

Practical considerations:

  • Nowadays, one implements neural networks with frameworks like Keras or TensorFlow
  • No need to implement backpropagation yourself
  • TensorFlow calculates the gradients efficiently ('autodiff'), as sketched below
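A minimal sketch of automatic differentiation with TensorFlow's GradientTape (assumption: TensorFlow 2.x; the toy function is chosen just for illustration):

\footnotesize

import tensorflow as tf

# toy example: let autodiff compute dy/dx for y = x^2 + 2x at x = 3
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2 + 2.0 * x
print(tape.gradient(y, x).numpy())  # 8.0, i.e. 2x + 2 at x = 3

\normalsize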

More on gradient descent

::: columns :::: {.column width=60%}

  • Stochastic gradient descent
    • just uses one training event at a time
    • fast, but quite irregular approach to the minimum
    • can help escape local minima
    • one can decrease learning rate to settle at the minimum ("simulated annealing")
  • Batch gradient descent
    • use entire training sample to calculate gradient of loss function
    • computationally expensive
  • Mini-batch gradient descent
    • calculate gradient for a random sub-sample of the training set

:::: :::: {.column width=40%} \begin{figure} \centering \includegraphics[width=0.7\textwidth]{figures/stochastic_gradient_descent.png} \end{figure} \begin{figure} \centering \includegraphics[width=\textwidth]{figures/gradient_descent_cmp.png} \end{figure} :::: :::
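To make the three variants concrete, here is a minimal NumPy sketch of mini-batch gradient descent for a least-squares fit (made-up data; setting the batch size to 1 gives stochastic, to the full sample size batch gradient descent):

\footnotesize

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(2)                    # parameters to be fitted
eta, batch_size = 0.1, 32          # learning rate and mini-batch size
for epoch in range(20):
    idx = rng.permutation(len(X))  # reshuffle the training sample each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of the MSE loss
        w -= eta * grad
print(w)                           # close to [1.5, -2.0]

\normalsize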

Universal approximation theorem

::: columns :::: {.column width=60%} "A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of $\mathbb{R}^n$."

\vspace{5ex}

One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions

\vspace{5ex}

The theorem does not touch upon the algorithmic learnability of those parameters

:::: :::: {.column width=40%} \begin{figure} \centering \includegraphics[width=\textwidth]{figures/ann.png} \end{figure} :::: :::

Recap: Activation functions

\begin{figure} \centering \includegraphics[width=\textwidth]{figures/activation_functions.png} \end{figure}

ReLU

::: columns :::: {.column width=50%} \includegraphics[width=.75\textwidth]{figures/relu.png} :::: :::: {.column width=50%} $$ f(x) = \max(0,x) $$ \vspace{1ex}

  • Does not saturate (in +region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice
  • Actually more biologically plausible than sigmoid
  • But: gradient vanishes for $x < 0$

:::: :::

Bias-variance tradeoff

Goal: generalization beyond the training data

  • Simple models (few parameters): danger of bias
    • \textcolor{gray}{Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance")}
  • Complex models (many parameters): danger of overfitting
    • \textcolor{gray}{large variance of decision boundaries for different training samples}

\begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/underfitting_overfitting.pdf} \end{figure}

Recap: Overtraining

Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining

\begin{figure} \centering \includegraphics[width=0.9\textwidth]{figures/example_overtraining.png} \end{figure}

Monitoring overtraining

Monitor the fraction of misclassified events (or the loss function): \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/monitoring_overtraining.png} \end{figure}

Regularization: Avoid overfitting

\scriptsize \hfill \textcolor{gray}{http://cs231n.stanford.edu/slides} \normalsize \begin{figure} \centering \includegraphics[width=0.75\textwidth]{figures/regularization.png} \end{figure} \begin{center} $L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$ \end{center}
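In Keras, such a penalty can be added per layer; a minimal sketch (the layer size and the regularization strength $\lambda = 10^{-4}$ are arbitrary choices for illustration):

\footnotesize

from tensorflow.keras import layers, regularizers

# dense layer whose weights add lambda * sum(W_k^2) to the loss (L2 regularization)
dense = layers.Dense(16, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))
# L1 regularization instead: kernel_regularizer=regularizers.l1(1e-4)

\normalsize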

Exercise 1: Hyperparameter optimization

\small \textcolor{gray}{05_neural_networks_ex_1_hyperparameter_optimization.ipynb} \normalsize

\vspace{5ex}

The multi-layer perceptron did not perform well on the superconductivity dataset. Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, learning rate...)?

\vspace{2ex}

Use \textcolor{gray}{sklearn.model_selection.GridSearchCV} to find optimal parameters.
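A minimal sketch of such a grid search (here with sklearn's MLPRegressor; it assumes the training features and targets are already available as X_train and y_train, and the parameter values are just examples):

\footnotesize

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPRegressor(max_iter=500), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)

\normalsize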

Convolutional neural networks (CNNs): Overview

\begin{center} \includegraphics[width=0.7\textwidth]{figures/cnn.png} \end{center} ::: columns :::: {.column width=80%}

  • CNNs emerged from the study of the visual cortex
  • Behind many deep learning successes (e.g. in image recognition)
  • Partially connected layers
    • \textcolor{gray}{Fully connected layers impractical for large images (too many neurons, overfitting)}
  • Key component: Convolutional layers
    • \textcolor{gray}{Set of learnable filters}
    • \textcolor{gray}{Low-level features at the first layers; high-level features at the end} :::: :::: {.column width=20%} \small \textcolor{gray}{Sliding $3 \times 3$ filter} :::: :::

Different types of layers in a CNN

::: columns :::: {.column width=50%} \small \textcolor{gray}{1. Convolutional layers} \newline \includegraphics[width=0.9\textwidth]{figures/cnn_conv_layer.png} :::: :::: {.column width=50%} \small \textcolor{gray}{3. Fully connected layers} \newline \includegraphics[width=0.9\textwidth]{figures/cnn_fully_connected.png} :::: :::

\vspace{3ex}

::: columns :::: {.column width=60%} \vfill \small \textcolor{gray}{2. Pooling layers} \newline \includegraphics[width=\textwidth]{figures/cnn_pooling.png} :::: :::: {.column width=40%} \textcolor{gray}{\footnotesize Afshine Amidi, Shervine Amidi}
\textcolor{gray}{\footnotesize Convolutional Neural Networks cheatsheet} :::: :::

Convolution

Convolution of a function $f$ with a kernel or filter function $g$: \begin{figure} \centering \includegraphics[width=0.5\textwidth]{figures/convolution.png} \end{figure}

\vspace{1ex}

Practical example: blurring of an image with a Gaussian filter \newline \tiny \textcolor{gray}{https://www.cs.cornell.edu/courses/cs6670/2011sp/lectures/lec02_filter.pdf} \normalsize \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/gaussian_filter.png} \end{figure}
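A one-line version of this blurring with SciPy (the random array merely stands in for a grayscale image):

\footnotesize

import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(64, 64)               # stand-in for a grayscale image
blurred = gaussian_filter(image, sigma=2.0)  # convolution with a Gaussian kernel

\normalsize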

Filters can detect structures/features in an image (1)

Filters to detect the "X" structures: \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/features_of_x.png} \end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

Filters can detect structures/features in an image (2)

\begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/filters_to_detect_x_features.png} \end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

Filters can detect structures/features in an image (3)

The convolutional operation \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/convolutional_operation.png} \end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

Filters can detect structures/features in an image (4)

Extracting a feature map \begin{figure} \centering \includegraphics[width=0.8\textwidth]{figures/sliding_filter.png} \end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

CNNs for classification

\begin{figure} \centering \includegraphics[width=0.95\textwidth]{figures/cnns_for_classification.png} \end{figure}

\tiny \textcolor{gray}{© MIT 6.S191: Introduction to Deep Learning IntroToDeepLearning.com} \normalsize

AlexNet

"ImageNet Classification with Deep Convolutional Neural Networks", \newline Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012 \begin{figure} \centering \includegraphics[width=0.95\textwidth]{figures/alexnet.png} \end{figure}

\tiny \textcolor{gray}{figure: https://www.mdpi.com/2072-4292/9/8/848} \normalsize

Inductive bias of a CNN

Can you summarize the inductive bias that is associated with a convolutional neural network? \newline

\vspace{0.5ex} ChatGPT Mar 23 Version: \newline \vspace{0.5ex} \small "Yes, of course! The inductive bias associated with a convolutional neural network (CNN) is that it assumes that the input data has a grid-like topology, such as images, audio spectrograms or video frames.

CNNs are designed to learn local and translation-invariant features by convolving small filters over the input data. This allows them to identify patterns in the input that are useful for classification or other tasks. The filters can be learned through backpropagation, and the output of each convolutional layer is usually downsampled (e.g., through max pooling) to reduce the dimensionality of the data.

The inductive bias of CNNs allows them to perform well on tasks involving images, audio, and video data, where the local structure of the data is important. However, it may not be as effective for other types of data that do not have a grid-like topology, such as text or graphs. In these cases, other types of neural networks or machine learning algorithms may be more appropriate." \normalsize

MNIST classification with a CNN in Keras

\footnotesize

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input

# for MNIST: 28x28 grayscale images, 10 classes
input_shape = (28, 28, 1)
num_classes = 10

# conv layer with 8 3x3 filters
model = Sequential(
    [
        Input(shape=input_shape),
        Conv2D(8, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(16, activation="relu"),
        Dense(num_classes, activation="softmax"),
    ]
)

model.summary()

\normalsize

Defining the CNN in Keras (2)

\footnotesize

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 8)         80        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 8)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1352)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                21648     
_________________________________________________________________
dense_3 (Dense)              (None, 10)                170       
=================================================================
Total params: 21,898
Trainable params: 21,898
Non-trainable params: 0

\normalsize

Model definition

Using Keras, you have to compile the model, i.e., add the loss function, the optimizer algorithm, and the validation metrics to your training setup. \vspace{5ex}

\footnotesize

model.compile(loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"])

\normalsize

Model training

\footnotesize

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint(
            filepath="mnist_keras_model.h5",
            save_best_only=True,
            verbose=1)
early_stopping = EarlyStopping(patience=2)

history = model.fit(x_train, y_train, # Training data
            batch_size=200, # Batch size
            epochs=50, # Maximum number of training epochs
            validation_split=0.5, # Use 50% of the train dataset for validation
            callbacks=[checkpoint, early_stopping]) # Register callbacks

\normalsize
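The call to model.fit above assumes preprocessed arrays x_train and y_train; a minimal sketch of how they could be prepared:

\footnotesize

import numpy as np
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32")[..., np.newaxis] / 255.0  # shape (60000, 28, 28, 1)
y_train = to_categorical(y_train, 10)  # one-hot labels for categorical_crossentropy

\normalsize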

Exercise 2: Training a digit-classification neural network on the MNIST dataset using Keras

\small \textcolor{gray}{05_neural_networks_ex_2_mnist_keras_train.ipynb} \normalsize

\vspace{5ex}

a) Plot training and validation loss as well as training and validation accuracy as a function of the number of epochs

b) Determine the accuracy of the fully trained model.

c) Create a second notebook that reads the trained model (mnist_keras_model.h5). Read your_own_digit.png and classify it. Create your own $28 \times 28$ pixel digits with a program like gimp and check how the model performs.

d) Try to improve the performance of the network by increasing the number of filters and by adding a second convolutional layer.
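A possible starting point for part c), assuming a $28 \times 28$ pixel grayscale PNG (sketch only; since MNIST digits are white on black, the image may need to be inverted):

\footnotesize

import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

model = load_model("mnist_keras_model.h5")
img = Image.open("your_own_digit.png").convert("L").resize((28, 28))
x = np.array(img, dtype="float32").reshape(1, 28, 28, 1) / 255.0
print(np.argmax(model.predict(x), axis=1))  # predicted digit

\normalsize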

Graph Neural Networks

\tiny \textcolor{gray}{slides on GNNs by Martin Kroesen} \normalsize

::: columns :::: {.column width=65%}

  • Graph Neural Networks (GNNs): Neural Networks that operate on graph structured data
  • Graph: consists of nodes that can be connected by edges, edges can be directed or undirected
  • no grid structure as given for CNNs
  • node features and edge features possible
  • relation often represented by adjacency matrix: $A_{ij}=1$ if there is a link between node $i$ and $j$, else 0
  • tasks on node level, edge level and graph level
  • full lecture: \url{https://web.stanford.edu/class/cs224w/} :::: :::: {.column width=35%} \begin{center} \includegraphics[width=1.1\textwidth]{figures/graph_example.png} \normalsize \end{center} :::: :::
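A minimal sketch of how such a graph is represented in PyTorch Geometric (a made-up three-node graph; edges are stored as index pairs rather than as a dense adjacency matrix):

\footnotesize

import torch
from torch_geometric.data import Data

# undirected edges 0-1 and 1-2, stored in both directions
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0], [2.0], [3.0]])  # one feature per node
data = Data(x=x, edge_index=edge_index)

\normalsize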

Simple Example: Zachary's karate club

::: columns :::: {.column width=60%}

  • link: \url{https://en.wikipedia.org/wiki/Zachary's_karate_club}
  • 34 nodes: each node represents a member of the karate club
  • 4 classes: a community each member belongs to
  • task: classify the nodes
  • many real world problems for GNNs exist, e.g.\ social networks, molecules, recommender systems, particle tracks :::: :::: {.column width=40%} \begin{center} \includegraphics[width=1.\textwidth]{figures/karateclub.png} \normalsize \end{center} :::: :::

From CNN to GNN

\begin{center} \includegraphics[width=0.8\textwidth]{figures/fromCNNtoGNN.png} \normalsize \newline \tiny (from Stanford GNN lecture) \end{center} \normalsize

  • GNN: Generalization of convolutional neural network
  • No grid structure, arbitrary number of neighbors defined by adjacency matrix
  • Operations pass information from neighborhood

Architecture: Graph Convolutional Network

::: columns :::: {.column width=60%}

  • Message passing from connected nodes
  • The graph convolution is defined as: $$ H^{(l+1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$
  • The adjacency matrix $A$ with added self-connections is given by $\tilde{A} = A + I$
  • The degree matrix of $\tilde{A}$ is given by $\tilde{D}_{ii} = \Sigma_j \tilde{A}_{ij}$
  • The weights of the given layer are called $W^{(l)}$
  • $H^{(l)}$ is the matrix for activations in layer $l$ :::: :::: {.column width=40%} \begin{center} \includegraphics[width=1.1\textwidth]{figures/GCN.png} \normalsize \end{center} \tiny \url{https://arxiv.org/abs/1609.02907} :::: :::
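A plain NumPy sketch of a single propagation step of this formula for a made-up three-node chain graph ($\sigma$ = ReLU):

\footnotesize

import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                  # adjacency matrix of a 3-node chain
A_t = A + np.eye(3)                           # add self-connections
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
H = np.random.rand(3, 4)                      # node features H^(l)
W = np.random.rand(4, 2)                      # layer weights W^(l)
H_next = np.maximum(0, D_inv_sqrt @ A_t @ D_inv_sqrt @ H @ W)  # H^(l+1)

\normalsize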

Architecture: Graph Attention Network

::: columns :::: {.column width=50%}

  • Calculate the attention coefficients $e_{ij}$ from the features $\vec{h}$ for each node $i$ with its neighbors $j$ $$ e_{ij} = a\left( W\vec{h}_i, W\vec{h}_j \right)$$ $a$: learnable weight vector
  • Normalize attention coefficients $$ \alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\text{exp}(e_{ij})}{\Sigma_k \text{exp}(e_{ik})} $$
  • Calculate node features $$ \vec{h}^{(l+1)}_i = \sigma \left( \Sigma_j \alpha_{ij} W \vec{h}^{(l)}_j \right)$$ :::: :::: {.column width=50%} \begin{center} \includegraphics[width=1.1\textwidth]{figures/GraphAttention.png} \normalsize \end{center} \tiny \url{https://arxiv.org/abs/1710.10903} :::: :::
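A NumPy sketch of these three steps for a single node with two neighbors and one attention head (illustration only; the LeakyReLU on the attention logits follows the GAT paper):

\footnotesize

import numpy as np

h = np.random.rand(3, 4)   # features of node i (row 0) and its two neighbors
W = np.random.rand(4, 2)   # shared weight matrix
a = np.random.rand(4)      # attention vector, acts on the concatenation [Wh_i, Wh_j]

Wh = h @ W
e = np.array([a @ np.concatenate([Wh[0], Wh[j]]) for j in range(3)])  # logits e_ij
e = np.where(e > 0, e, 0.2 * e)      # LeakyReLU
alpha = np.exp(e) / np.exp(e).sum()  # softmax over the neighborhood
h_i_new = np.maximum(0, alpha @ Wh)  # sigma = ReLU

\normalsize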

Example: Identification of inelastic interactions in TRD

::: columns :::: {.column width=60%}

Interaction of an antideuteron:

\begin{center} \includegraphics[width=0.8\textwidth]{figures/antideuteronsgnMax.png} \normalsize \end{center} :::: :::

\begin{center} \includegraphics[width=0.9\textwidth]{figures/GNN_conf.png} \normalsize \end{center}

Example: Google Maps

\begin{center} \includegraphics[width=0.8\textwidth]{figures/GNNgooglemaps.png} \normalsize \end{center}

Example: AlphaFold

\begin{center} \includegraphics[width=0.9\textwidth]{figures/alphafold.png} \normalsize \end{center}

Exercise 3: Illustration of Graphs and Graph Neural Networks

On the PyTorch Geometric webpage, you can find official examples for the application of Graph Neural Networks: https://pytorch-geometric.readthedocs.io/en/latest/get_started/colabs.html

\vspace{3ex}

The first introduction notebook shows the functionality of graphs with the example of the Karate Club. Follow and reproduce the first \textcolor{green}{notebook}. Study and understand the data format.

\vspace{3ex}

At the end, the separation power of Graph Convolutional Networks (GCNs) is shown via the node embeddings. You can replace the GCN layers with Graph Attention layers and compare the results (see the sketch below).
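A hedged sketch of that replacement (it assumes the notebook builds its model from GCNConv layers with an output dimension of 4; adjust the sizes to the notebook you actually use):

\footnotesize

from torch_geometric.datasets import KarateClub
from torch_geometric.nn import GATConv

dataset = KarateClub()
conv1 = GATConv(dataset.num_node_features, 4, heads=1)  # drop-in for GCNConv(..., 4)

\normalsize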

Exercise 4: Classifying molecules

The PyTorch Geometric webpage also provides an example of the classification of molecules: https://colab.research.google.com/drive/1I8a0DfQ3fI7Njc62__mVXUlcAleUclnb?usp=sharing

\vspace{3ex}

Study this notebook and then modify it to apply it to the PROTEINS dataset.
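The PROTEINS data are available in PyTorch Geometric via the TUDataset collection; a minimal sketch of loading them (the root directory is an arbitrary choice):

\footnotesize

from torch_geometric.datasets import TUDataset

dataset = TUDataset(root="data/TUDataset", name="PROTEINS")
print(dataset.num_classes, dataset.num_node_features)

\normalsize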

Practical advice -- Which algorithm to choose?

\textcolor{gray}{From Kaggle competitions:}

\vspace{3ex} Structured data: "High level" features that have meaning:

  • feature engineering + decision trees
  • Random forests
  • XGBoost

\vspace{3ex} Unstructured data: "Low level" features, no individual meaning:

  • deep neural networks
  • e.g. image classification: convolutional NN

Outlook: Autoencoders

::: columns :::: {.column width=50%}

  • Unsupervised method based on neural networks to learn a representation of the input data
  • Autoencoders learn to copy the input to the output layer
    • low dimensional coding of the input in the central layer
  • The decoder generates data based on the coding (generative model)
  • Applications
    • Dimensionality reduction
    • Denoising of data
    • Machine translation :::: :::: {.column width=50%} \vspace{3ex} \begin{center} \includegraphics[width=\textwidth]{figures/autoencoder_example.pdf} \end{center} :::: :::
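A minimal dense autoencoder sketch in Keras (for $28 \times 28$ images flattened to 784 inputs; the layer sizes are arbitrary choices for illustration):

\footnotesize

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

autoencoder = Sequential([
    Input(shape=(784,)),
    Dense(32, activation="relu"),
    Dense(2, activation="relu"),       # low-dimensional coding (central layer)
    Dense(32, activation="relu"),
    Dense(784, activation="sigmoid"),  # decoder reconstructs the input
])
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(x, x, ...)           # input and target are the same data

\normalsize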

Outlook: Generative adversarial network (GANs)

\begin{center} \includegraphics[width=0.65\textwidth]{figures/gan.png} \end{center} \scriptsize \textcolor{gray}{https://developers.google.com/machine-learning/gan/gan_structure} \normalsize

  • Discriminator's classification provides a signal that the generator uses to update its weights
  • Application in particle physics: fast detector simulation
  • Full GEANT simulation usually very CPU intensive

The future

"Das Interessante an unserer Intelligenz ist, dass wir Go spielen können und dann vom Tisch aufstehen und Essen machen können, was eine Maschine nicht kann."

\vspace{2ex}

\color{gray} \small \hfill Bernhard Schölkopf, Max-Planck-Institut für intelligente Systeme (Interview FAZ) \normalsize \color{black}

\vfill

"My view is throw it all away and start again"

\color{gray} \small \hfill Geoffrey Hinton (DNN pioneer) on deep neural networks and backpropagation (Interview, 2017) \normalsize \color{black}