Machine Learning course as part of the Studierendentage in SS 2023

% Introduction to Data Analysis and Machine Learning in Physics: \ 5. Neural networks
% Jörg Marks, \underline{Klaus Reygers}
% Studierendentage, 11-14 April 2023
## Exercises
* Exercise 1: Learn XOR with an MLP
    * [`05_neural_networks_ex_1_xor.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
* Exercise 2: Visualising decision boundaries of classifiers
    * [`05_neural_networks_ex_2_decision_boundaries.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
* Exercise 3: Boston house prices (MLP regression)
    * [`05_neural_networks_ex_3_boston_house_prices.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
* Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
    * [`05_neural_networks_ex_4_mnist_keras_train.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
* Exercise 5: Higgs data set
## Perceptron (1)
::: columns
:::: {.column width=65%}
\begin{center}
\includegraphics[width=0.40\textwidth]{figures/perceptron_weighted_sum.png}
\vspace{1ex}
\includegraphics[width=0.75\textwidth]{figures/perceptron_retina.png}
\end{center}
::::
:::: {.column width=35%}
$$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0,\\0 & \text{otherwise}\end{cases}$$
\begin{center}
\includegraphics[width=0.95\textwidth]{figures/perceptron_photo.png}
\tiny
\textcolor{gray}{Mark 1 Perceptron. Frank Rosenblatt (1961)}
\normalsize
\end{center}
::::
:::
\footnotesize
\vspace{2ex}
\textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).}
\normalsize
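The decision rule above can be coded directly. A minimal NumPy sketch (not from the lecture material) of the perceptron output and of Rosenblatt's weight-update rule on a toy AND data set; the learning rate `eta` and the data are illustrative assumptions:
\footnotesize
```python
import numpy as np

def perceptron_output(x, w, b):
    """Perceptron decision rule: 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Rosenblatt's update rule applied to a toy AND data set (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w, b, eta = np.zeros(2), 0.0, 0.1    # initial weights, bias, learning rate

for epoch in range(10):
    for x, target in zip(X, t):
        y = perceptron_output(x, w, b)
        w += eta * (target - y) * x  # update only when the prediction is wrong
        b += eta * (target - y)

print([perceptron_output(x, w, b) for x in X])  # -> [0, 0, 0, 1]
```
\normalsize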
## Perceptron (2)
::: columns
:::: {.column width=60%}
* McCulloch–Pitts (MCP) neuron (1943)
    * First mathematical model of a biological neuron
    * Boolean inputs
    * Equal weights for all inputs
    * Hard-coded threshold
* Improvements by Rosenblatt
    * Different weights for the inputs
    * Algorithm to update weights and threshold given labeled training data
\vfill
Shortcoming of the perceptron: \newline
it cannot learn the XOR function \newline
\tiny \textcolor{gray}{Minsky, Papert, 1969} \normalsize
::::
:::: {.column width=40%}
![](figures/perceptron_with_threshold.png){width=80%}
![](figures/xor.png)
\small \textcolor{gray}{XOR: not linearly separable} \normalsize
::::
:::
## The biological inspiration: the neuron
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/neuron.png}
\end{figure}
## Non-linear transfer / activation function
Discriminant: $$ y(\vec x) = h\left( w_0 + \sum_{i=1}^n w_i x_i \right) $$
Examples for the function $h$: \newline
$$ \frac{1}{1+e^{-x}} \; \text{("sigmoid" or "logistic" function)}, \quad \tanh x $$
::: columns
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/logistic_fct.png}
\end{figure}
::::
:::: {.column width=50%}
\vspace{3ex}
Non-linear activation function needed in neural networks when feature space is not linearly separable.
\newline
\small
\textcolor{gray}{Neural net with linear activation functions is just a perceptron}
\normalsize
::::
:::
## Feedforward neural network with one hidden layer
::: columns
:::: {.column width=60%}
![](figures/mlp.png){width=80%}
::::
:::: {.column width=40%}
$$ \phi_i(\vec x) = h\left(w_{i0}^{(1)} + \sum_{j=1}^n w_{ij}^{(1)} x_j\right) $$
\vfill
$$ y(\vec x) = h\left( w_{10}^{(2)} + \sum_{j=1}^m w_{1j}^{(2)} \phi_j(\vec x)\right) $$
\vfill
\vspace{2ex}
\footnotesize
\textcolor{gray}{The superscript indicates the layer number, i.e., $w_{ij}^{(1)}$ refers to the input weights of neuron $i$ in the hidden layer (= layer 1).}
\normalsize
::::
:::
\begin{center}
Straightforward to generalize to multiple hidden layers
\end{center}
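As a minimal illustration of the two formulas above, the following NumPy sketch (an illustrative assumption, not code from the lecture) evaluates the forward pass of a network with one hidden layer and a sigmoid activation:
\footnotesize
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network.
    W1: (m, n) input->hidden weights, b1: (m,) hidden biases,
    W2: (1, m) hidden->output weights, b2: (1,) output bias."""
    phi = sigmoid(W1 @ x + b1)      # hidden-layer activations phi_i(x)
    return sigmoid(W2 @ phi + b2)   # network output y(x)

# Example with n = 2 inputs and m = 3 hidden neurons (random weights)
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
print(forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```
\normalsize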
## Neural network output and decision boundaries
::: columns
:::: {.column width=75%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/nn_decision_boundary.png}
\end{figure}
::::
:::: {.column width=25%}
\vspace{3ex}
\footnotesize
\textcolor{gray}{P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273}
\normalsize
::::
:::
## Fun with neural nets in the browser
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/tf_playground.png}
\end{figure}
\tiny
[\textcolor{gray}{http://playground.tensorflow.org}](http://playground.tensorflow.org)
\normalsize
## Backpropagation (1)
Start with an initial guess $\vec w_0$ for the weights and then update the weights after each training event:
$$ \vec w^{(\tau+1)} = \vec w^{(\tau)} - \eta \nabla E_a(\vec w^{(\tau)}), \quad \eta = \text{learning rate}$$
Gradient descent:
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{figures/gradient_descent.png}
\end{figure}
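The update rule itself is independent of neural networks. A short sketch of plain gradient descent with a fixed learning rate $\eta$, minimizing a simple quadratic toy loss (an illustrative assumption):
\footnotesize
```python
import numpy as np

def grad_E(w):
    """Gradient of the toy loss E(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)    # initial guess w_0
eta = 0.1          # learning rate
for tau in range(100):
    w = w - eta * grad_E(w)   # w^(tau+1) = w^(tau) - eta * grad E
print(w)           # converges towards the minimum at (3, -1)
```
\normalsize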
## Backpropagation (2)
::: columns
:::: {.column width=40%}
\vspace{6ex}
![](figures/mlp.png){width=100%}
::::
:::: {.column width=60%}
Let's write the network output as follows:
\begin{align*}
y(\vec x) &= h(u(\vec x)); \quad u(\vec x) = \sum_{j=0}^m w_{1j}^{(2)} \phi_j(\vec x) \\
\phi_j(\vec x) &= h\left( \sum_{k=0}^n w_{jk}^{(1)} x_k\right)
\equiv h\left( v_j(\vec x) \right)
\end{align*}
For $E_a = \frac{1}{2} (y_a - t_a)^2$ one obtains for the weights from the hidden layer to the output:
\begin{align*}
\frac{\partial E_a}{\partial w_{1j}^{(2)}} &= (y_a -t_a) h'(u(\vec x_a)) \frac{\partial u}{\partial w_{1j}^{(2)}} \\
&= (y_a -t_a) h'(u(\vec x_a)) \phi_j(\vec x_a)
\end{align*}
::::
:::
\vspace{2ex}
Further application of the chain rule gives the weights from the input to the hidden layer.
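A minimal NumPy sketch of this gradient (illustrative, assuming a sigmoid activation so that $h'(u) = h(u)\,(1-h(u))$), computing $\partial E_a/\partial w_{1j}^{(2)}$ for a single training event:
\footnotesize
```python
import numpy as np

def h(z):                              # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def grad_output_weights(x, t, W1, w2):
    """Gradient of E_a = 0.5*(y_a - t_a)^2 w.r.t. the hidden->output weights.
    W1: (m, n+1) input weights incl. bias column, w2: (m+1,) output weights incl. bias."""
    x = np.append(1.0, x)              # prepend bias input x_0 = 1
    phi = np.append(1.0, h(W1 @ x))    # hidden activations, phi_0 = 1 (bias)
    y = h(w2 @ phi)                    # network output y_a = h(u)
    return (y - t) * y * (1.0 - y) * phi   # (y_a - t_a) * h'(u) * phi_j

# Illustrative call with random weights, n = 2 inputs, m = 3 hidden neurons
rng = np.random.default_rng(0)
print(grad_output_weights(np.array([0.5, -1.0]), 1.0,
                          rng.normal(size=(3, 3)), rng.normal(size=4)))
```
\normalsize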
## Backpropagation (3)
Backpropagation summary
* Make a prediction for a given training instance (forward pass)
* Calculate the error (value of the loss function)
* Go backwards and determine the contribution of each weight (reverse pass)
* Adjust the weights to reduce the error
\vfill
Practical considerations:
* Nowadays, people implement neural networks with frameworks like Keras or TensorFlow
* No need to implement backpropagation yourself
* TensorFlow efficiently calculates the gradients using automatic differentiation
## More on gradient descent
::: columns
:::: {.column width=60%}
* Stochastic gradient descent
    * just uses one training event at a time
    * fast, but quite irregular approach to the minimum
    * can help escape local minima
    * one can decrease the learning rate to settle at the minimum ("simulated annealing")
* Batch gradient descent
    * uses the entire training sample to calculate the gradient of the loss function
    * computationally expensive
* Mini-batch gradient descent
    * calculates the gradient for a random sub-sample of the training set (see the sketch below)
::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/stochastic_gradient_descent.png}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/gradient_descent_cmp.png}
\end{figure}
::::
:::
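A minimal sketch of the mini-batch variant (illustrative; `grad_loss`, the data arrays, and the batch size are assumptions):
\footnotesize
```python
import numpy as np

def minibatch_gradient_descent(w, X, y, grad_loss, eta=0.01,
                               batch_size=32, n_epochs=10):
    """Mini-batch gradient descent: one shuffled pass over (X, y) per epoch."""
    n = len(X)
    for epoch in range(n_epochs):
        idx = np.random.permutation(n)            # shuffle the training set
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - eta * grad_loss(w, X[batch], y[batch])
    return w
```
\normalsize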
## Universal approximation theorem
::: columns
:::: {.column width=60%}
"A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$."
\vspace{5ex}
One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions.
\vspace{5ex}
The theorem does not touch upon the algorithmic learnability of those parameters.
::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/ann.png}
\end{figure}
::::
:::
## Deep neural networks
Deep networks: many hidden layers with a large number of neurons
::: columns
:::: {.column width=50%}
* Challenges
    * Hard to train ("vanishing gradient problem")
    * Slow training
    * Risk of overtraining
::::
:::: {.column width=50%}
* Big progress in recent years
    * Interest in NNs waned before ca. 2006
    * Milestone: paper by G. Hinton (2006): "A fast learning algorithm for deep belief nets"
    * Image recognition, AlphaGo, …
    * Soon: self-driving cars, …
::::
:::
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/dnn.png}
\end{figure}
## Drawbacks of the sigmoid activation function
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/sigmoid.png}
::::
:::: {.column width=50%}
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
\vspace{3ex}
* Saturated neurons "kill" the gradients
* Sigmoid outputs are not zero-centered
* exp() is somewhat expensive to compute
::::
:::
## Activation functions
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/activation_functions.png}
\end{figure}
## ReLU
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/relu.png}
::::
:::: {.column width=50%}
$$ f(x) = \max(0,x) $$
\vspace{1ex}
* Does not saturate (in the positive region)
* Very computationally efficient
* Converges much faster than sigmoid/tanh in practice
* Actually more biologically plausible than the sigmoid
* But: the gradient vanishes for $x < 0$
::::
:::
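For reference, the activation functions discussed above can be written in a few lines of NumPy (a small illustrative sketch, not part of the original slides):
\footnotesize
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 11)
print(sigmoid(x), np.tanh(x), relu(x), sep="\n")
```
\normalsize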
## Bias-variance tradeoff
Goal: generalization beyond the training data
* Simple models (few parameters): danger of bias
    * \textcolor{gray}{Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance")}
* Complex models (many parameters): danger of overfitting
    * \textcolor{gray}{large variance of the decision boundaries for different training samples}
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/underfitting_overfitting.pdf}
\end{figure}
## Example of overtraining
Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/example_overtraining.png}
\end{figure}
## Monitoring overtraining
Monitor the fraction of misclassified events (or the loss function) for the training sample and an independent test sample:
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/monitoring_overtraining.png}
\end{figure}
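A minimal sketch of how such curves can be obtained in practice (assuming a compiled Keras `model` as in the examples later in these slides, and training data `x_train`, `y_train`):
\footnotesize
```python
import matplotlib.pyplot as plt

# validation_split reserves part of the training data for monitoring
history = model.fit(x_train, y_train, epochs=20,
                    validation_split=0.2, verbose=0)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```
\normalsize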
## Regularization: Avoid overfitting
\scriptsize
[\hfill \textcolor{gray}{http://cs231n.stanford.edu/slides}](http://cs231n.stanford.edu/slides)
\normalsize
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/regularization.png}
\end{figure}
\begin{center}
$L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$
\end{center}
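In Keras (introduced later in these slides), such a penalty term can be added per layer; a minimal sketch (the penalty strength 0.01 and the layer size are illustrative assumptions):
\footnotesize
```python
from keras import layers, regularizers

# Dense layer whose weights enter the loss via an L2 penalty R(W) = sum_k W_k^2
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))

# The corresponding L1 penalty R(W) = sum_k |W_k|
dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))
```
\normalsize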
## Another approach to prevent overfitting: Dropout
* Randomly remove nodes during training (see the Keras sketch below)
* Avoids co-adaptation of nodes
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/dropout.png}
\end{figure}
\scriptsize
\textcolor{gray}{Srivastava et al.,}
[\textcolor{gray}{"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"}](jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf)
\normalsize
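In Keras, dropout is itself a layer that randomly sets a fraction of its inputs to zero during training; a minimal sketch (the dropout rate of 0.5 and the layer sizes are illustrative assumptions):
\footnotesize
```python
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(128, activation="relu", input_shape=(20,)))
model.add(layers.Dropout(0.5))   # drop 50% of the activations during training
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation="sigmoid"))
```
\normalsize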
## Pros and cons of multi-layer perceptrons
\textcolor{green}{Pros}
* Capability to learn non-linear models
\vspace{3ex}
\textcolor{red}{Cons}
* Loss function can have several local minima
* Hyperparameters need to be tuned
    * \textcolor{gray}{number of layers, neurons per layer, and training iterations}
* Sensitive to feature scaling
    * \textcolor{gray}{preprocessing needed (e.g., scaling of all features to the range [0,1], see the sketch below)}
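A minimal scikit-learn sketch of such preprocessing (illustrative; `X_train` and `X_test` are assumed to exist as in the examples below):
\footnotesize
```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
\normalsize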
## Example 1: Boston house prices (MLP regression) (1)
* Objective: predict house prices in Boston suburbs in the mid-1970s
* Boston house data set: 506 instances, 13 features
\footnotesize
```
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's
```
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_house_prices.ipynb)
## Example 1: Boston house prices (MLP regression) (2)
```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# load the Boston house-price data (note: removed in scikit-learn >= 1.2)
boston = datasets.load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(100,), activation='logistic',
                   random_state=1, max_iter=5000)
mlp.fit(X_train, y_train)
y_pred_mlp = mlp.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred_mlp))
print(f"root mean square error {rms:.2f}")
```
## Example 1: Boston house prices (MLP regression) (3)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/boston_house_prices.pdf}
\end{center}
## Exercise 1: XOR
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_1\_xor.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
\normalsize
::: columns
:::: {.column width=60%}
a) Define a multi-layer perceptron classifier that learns the XOR problem.
\scriptsize
```python
from sklearn.neural_network import MLPClassifier
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
```
\normalsize
b) Define a multi-layer perceptron regressor that fits the depicted 2d data (see notebook).
c) Plot the mean square error vs. the number of training epochs for b).
::::
:::: {.column width=40%}
\vspace{10ex}
![](figures/xor_like_data.pdf)
::::
:::
## Exercise 2: Visualising decision boundaries of classifiers
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_2\_decision\_boundaries.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
\normalsize
\vspace{5ex}
Visualize the decision boundaries of a scikit-learn decision tree, a scikit-learn multi-layer perceptron, and XGBoost for different toy data sets.
## Exercise 3: Boston house prices (hyperparameter optimization)
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_3\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
\normalsize
\vspace{5ex}
a) Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, ...)? Try this first by hand.
b) Now use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters (a minimal usage sketch follows below).
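As a starting point, a minimal `GridSearchCV` sketch for the MLP regressor (the parameter grid shown here is an illustrative assumption, not the suggested solution; `X_train`, `y_train` as in the example above):
\footnotesize
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "activation": ["relu", "logistic"],
}
search = GridSearchCV(MLPRegressor(max_iter=5000, random_state=1),
                      param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```
\normalsize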
## TensorFlow
::: columns
:::: {.column width=70%}
* Powerful open-source library with a focus on deep neural networks
* Performs computations on data flow graphs
* Takes care of computing gradients of the defined functions (\textit{automatic differentiation}, see the sketch below)
* Computations in parallel on multiple CPUs or GPUs
* Developed by the Google Brain team
* Initial release in 2015
* [https://www.tensorflow.org/](https://www.tensorflow.org/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/tensorflow.png}
\end{center}
::::
:::
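A minimal illustration of TensorFlow's automatic differentiation using `tf.GradientTape` (a small sketch, not from the lecture material):
\footnotesize
```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2 + 2.0 * x        # y = x^2 + 2x

# dy/dx = 2x + 2 = 8 at x = 3
print(tape.gradient(y, x).numpy())
```
\normalsize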
## Keras
::: columns
:::: {.column width=70%}
* Open-source library providing high-level building blocks for developing deep-learning models
* Uses TensorFlow as \textit{backend engine} for low-level tensor manipulation (version 2.4)
* Part of the TensorFlow core API since the TensorFlow 1.4 release
* Over 375,000 individual users as of early 2020
* Primary author: Fran\c{c}ois Chollet (Google engineer)
* [https://keras.io/](https://keras.io/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/keras.png}
\end{center}
::::
:::
## Example 2: Boston house prices with Keras
\small
```python
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(train_data.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model.fit(partial_train_data, partial_train_targets,
          epochs=num_epochs, batch_size=1, verbose=0)

# Evaluate the model on the validation data
val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
```
\normalsize
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_keras.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/05_neural_networks_boston_keras.ipynb)
## Convolutional neural networks (CNNs)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/cnn.png}
\end{center}
::: columns
:::: {.column width=80%}
* CNNs emerged from the study of the visual cortex
* Behind many deep-learning successes (e.g. in image recognition)
* Partially connected layers
    * \textcolor{gray}{Fully connected layers are impractical for large images (too many neurons, overfitting)}
* Key component: convolutional layers
    * \textcolor{gray}{Set of learnable filters}
    * \textcolor{gray}{Low-level features in the first layers; high-level features at the end}
::::
:::: {.column width=20%}
\small
\textcolor{gray}{Sliding $3 \times 3$ filter}
![](figures/cnn_sliding_filter.png)
::::
:::
## Different types of layers in a CNN
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{1. Convolutional layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_conv_layer.png}
::::
:::: {.column width=50%}
\small \textcolor{gray}{3. Fully connected layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_fully_connected.png}
::::
:::
\vspace{3ex}
::: columns
:::: {.column width=60%}
\vfill
\small \textcolor{gray}{2. Pooling layers} \newline
\includegraphics[width=\textwidth]{figures/cnn_pooling.png}
::::
:::: {.column width=40%}
\textcolor{gray}{\footnotesize Afshine Amidi, Shervine Amidi} \
[\textcolor{gray}{\footnotesize Convolutional Neural Networks cheatsheet}](https://github.com/afshinea/stanford-cs-230-deep-learning/blob/master/en/cheatsheet-convolutional-neural-networks.pdf)
::::
:::
## MNIST classification with a CNN in Keras
\footnotesize
```python
from keras.models import Sequential
from keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input, Dropout

# conv layer with 8 3x3 filters
model = Sequential(
    [
        Input(shape=input_shape),
        Conv2D(8, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(16, activation="relu"),
        Dense(num_classes, activation="softmax"),
    ]
)
model.summary()
```
\normalsize
## Defining the CNN in Keras (2)
\footnotesize
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 8)         80
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 8)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1352)              0
_________________________________________________________________
dense_2 (Dense)              (None, 16)                21648
_________________________________________________________________
dense_3 (Dense)              (None, 10)                170
=================================================================
Total params: 21,898
Trainable params: 21,898
Non-trainable params: 0
```
\normalsize
## Model definition
Using Keras, you have to `compile` a model, which means adding the loss function, the optimizer algorithm and validation metrics to your training setup.
\vspace{5ex}
\footnotesize
```python
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```
\normalsize
## Model training
\footnotesize
```python
from keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint(
    filepath="mnist_keras_model.h5",
    save_best_only=True,
    verbose=1)
early_stopping = EarlyStopping(patience=2)

history = model.fit(x_train, y_train,      # Training data
                    batch_size=200,        # Batch size
                    epochs=50,             # Maximum number of training epochs
                    validation_split=0.5,  # Use 50% of the train dataset for validation
                    callbacks=[checkpoint, early_stopping])  # Register callbacks
```
\normalsize
## Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_4\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
\normalsize
\vspace{5ex}
a) Plot the training and validation loss as well as the training and validation accuracy as a function of the number of epochs.
b) Determine the accuracy of the fully trained model.
c) Create a second notebook that reads the trained model (`mnist_keras_model.h5`). Read `your_own_digit.png` and classify it. Create your own $28 \times 28$ pixel digits with a program like gimp and check how the model performs.
## Exercise 5: Higgs data set (1)
Application of deep neural networks for the separation of signal and background in an exotic Higgs scenario
\vfill
\small
\color{gray}
In this exercise we want to explore various techniques to optimize the event selection in the search for supersymmetric Higgs bosons at the LHC. In supersymmetry the Higgs sector consists of five Higgs bosons, in contrast to the single Higgs boson of the standard model. Here we deal with a heavy Higgs boson which decays into two W bosons and a standard Higgs boson ($H^0 \to W^+ W^- h$), which decay further into leptons ($W^\pm \to l^\pm \nu$) and b-quarks ($h\to b \bar{b}$), respectively.
This exercise is based on a [Nature paper](https://www.nature.com/articles/ncomms5308) (Pierre Baldi, Peter Sadowski, Daniel Whiteson) which contains much more information, like general background information, details about the selection variables, and links to large sets of simulated events. You might also use the paper as inspiration for the solution of this exercise.
## Exercise 5: Higgs data set (2)
The two datasets consist of 10k and 100k events, respectively. For each event 29 variables are stored:
\footnotesize
```
0      : classification (1 = signal, 0 = background)
1 - 21 : low-level quantities (var1 - var21)
22 - 28: high-level quantities (var22 - var28)
```
\normalsize
You can read the data as follows:
\scriptsize
```python
import pandas as pd
#filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_10k.csv"
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/HIGGS_100k.csv"
df = pd.read_csv(filename, engine='python')
```
\normalsize
a) Use a classifier of your choice to separate signal and background events. Determine the accuracy score.
b) Compare the results when using i) the low-level quantities and ii) the high-level quantities.
## Practical advice -- Which algorithm to choose?
\textcolor{gray}{From Kaggle competitions:}
\vspace{3ex}
Structured data ("high-level" features that have meaning):
* feature engineering + decision trees
* random forests
* XGBoost
\vspace{3ex}
Unstructured data ("low-level" features, no individual meaning):
* deep neural networks
* e.g. image classification: convolutional NNs
## Outlook: Autoencoders
::: columns
:::: {.column width=50%}
* Unsupervised method based on neural networks to learn a representation of the input data
* Autoencoders learn to copy the input to the output layer (see the Keras sketch below)
    * low-dimensional coding of the input in the central layer
* The decoder generates data based on the coding (*generative model*)
* Applications
    * Dimensionality reduction
    * Denoising of data
    * Machine translation
::::
:::: {.column width=50%}
\vspace{3ex}
\begin{center}
\includegraphics[width=\textwidth]{figures/autoencoder_example.pdf}
\end{center}
::::
:::
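A minimal Keras sketch of such an encoder–decoder architecture (illustrative; the input dimension of 784 corresponds to flattened $28 \times 28$ images, and the coding size of 32 is an assumption):
\footnotesize
```python
from keras import models, layers

autoencoder = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),     # encoder
    layers.Dense(32, activation="relu"),      # low-dimensional coding
    layers.Dense(128, activation="relu"),     # decoder
    layers.Dense(784, activation="sigmoid"),  # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
# training: autoencoder.fit(x_train, x_train, epochs=..., batch_size=...)
```
\normalsize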
## Outlook: Generative adversarial networks (GANs)
\begin{center}
\includegraphics[width=0.65\textwidth]{figures/gan.png}
\end{center}
\scriptsize
[\textcolor{gray}{https://developers.google.com/machine-learning/gan/gan\_structure}](https://developers.google.com/machine-learning/gan/gan_structure)
\normalsize
* The discriminator's classification provides a signal that the generator uses to update its weights
* Application in particle physics: fast detector simulation
    * Full GEANT simulation is usually very CPU intensive
## The future
"The interesting thing about our intelligence is that we can play Go, then get up from the table and make dinner, which a machine cannot do."
\vspace{2ex}
\color{gray}
\small
\hfill Bernhard Schölkopf, Max-Planck-Institut für intelligente Systeme ([interview, FAZ](https://www.faz.net/aktuell/wirtschaft/kuenstliche-intelligenz/ki-fachmann-wie-gut-europa-in-der-forschung-aufgestellt-ist-16650700.html))
\normalsize
\color{black}
\vfill
"My view is throw it all away and start again"
\color{gray}
\small
\hfill Geoffrey Hinton (DNN pioneer) on deep neural networks and backpropagation ([interview, 2017](https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html))
\normalsize
\color{black}