---
title: |
  | Introduction to Data Analysis and Machine Learning in Physics:
  | 5. Neural networks
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---
## Exercises
* Exercise 1: Learn XOR with a MLP
    * [`05_neural_networks_ex_1_xor.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
* Exercise 2: Visualising decision boundaries of classifiers
    * [`05_neural_networks_ex_2_decision_boundaries.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
* Exercise 3: Boston house prices (MLP regression)
    * [`05_neural_networks_ex_3_boston_house_prices.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
* Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
    * [`05_neural_networks_ex_4_mnist_keras_train.ipynb`](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
## Perceptron (1)
::: columns
:::: {.column width=65%}
\begin{center}
\includegraphics[width=0.40\textwidth]{figures/perceptron_weighted_sum.png}
\vspace{1ex}
\includegraphics[width=0.75\textwidth]{figures/perceptron_retina.png}
\end{center}
::::
:::: {.column width=35%}
$$h(\vec x) = \begin{cases}1 & \text{if }\ \vec w \cdot \vec x + b > 0,\\0 & \text{otherwise}\end{cases}$$
\begin{center}
\includegraphics[width=0.95\textwidth]{figures/perceptron_photo.png}
\tiny
\textcolor{gray}{Mark 1 Perceptron. Frank Rosenblatt (1961)}
\normalsize
\end{center}
::::
:::
\footnotesize
\vspace{2ex}
\textcolor{gray}{The perceptron was designed for image recognition. It was first implemented in hardware (400 photocells, weights = potentiometer settings).}
\normalsize
## Perceptron (2)
::: columns
:::: {.column width=60%}
* McCulloch–Pitts (MCP) neuron (1943)
    * First mathematical model of a biological neuron
    * Boolean input
    * Equal weights for all inputs
    * Threshold hardcoded
* Improvements by Rosenblatt
    * Different weights for inputs
    * Algorithm to update weights and threshold given labeled training data (see the sketch after this slide)
\vfill
Shortcoming of the perceptron: \newline
it cannot learn the XOR function \newline
\tiny \textcolor{gray}{Minsky, Papert, 1969} \normalsize
::::
:::: {.column width=40%}
![](figures/perceptron_with_threshold.png){width=80%}
![](figures/xor.png)
\small \textcolor{gray}{XOR: not linearly separable } \normalsize
::::
:::
## The biological inspiration: the neuron
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/neuron.png}
\end{figure}
## Non-linear transfer / activation function
Discriminant: $$ y(\vec x) = h\left( w_0 + \sum_{i=1}^n w_i x_i \right) $$
Examples for function $h$: \newline
$$ \frac{1}{1+e^{-x}} \; \text{("sigmoid" or "logistic" function)}, \quad \tanh x $$
::: columns
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/logistic_fct.png}
\end{figure}
::::
:::: {.column width=50%}
\vspace{3ex}
A non-linear activation function is needed in neural networks when the feature space is not linearly separable.
\newline
\small
\textcolor{gray}{A neural net with only linear activation functions collapses to a single linear model and is thus no more powerful than a perceptron.}
\normalsize
::::
:::
## Feedforward neural network with one hidden layer
::: columns
:::: {.column width=60%}
![](figures/mlp.png){width=80%}
::::
:::: {.column width=40%}
$$ \phi_i(\vec x) = h\left(w_{i0}^{(1)} + \sum_{j=1}^n w_{ij}^{(1)} x_j\right) $$
\vfill
$$ y(\vec x) = h\left( w_{10}^{(2)} + \sum_{j=1}^m w_{1j}^{(2)} \phi_j(\vec x)\right) $$
\vfill
\vspace{2ex}
\footnotesize
\textcolor{gray}{The superscript indicates the layer number, i.e., $w_{ij}^{(1)}$ refers to the input weights of neuron $i$ in the hidden layer (= layer 1).}
\normalsize
::::
:::
\begin{center}
Straightforward to generalize to multiple hidden layers
\end{center}
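A minimal NumPy sketch of this forward pass (the weight matrices below are random placeholders, not trained values; the bias terms are handled by prepending a constant 1 to the inputs):

\footnotesize
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 3, 4                       # input dimension, number of hidden neurons
W1 = rng.normal(size=(m, n + 1))  # hidden-layer weights, column 0 = bias w_{i0}^{(1)}
W2 = rng.normal(size=(1, m + 1))  # output weights, column 0 = bias w_{10}^{(2)}

def forward(x):
    phi = sigmoid(W1 @ np.concatenate(([1.0], x)))      # phi_i(x)
    return sigmoid(W2 @ np.concatenate(([1.0], phi)))   # y(x)

y = forward(np.array([0.5, -1.0, 2.0]))
```
\normalsize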
## Neural network output and decision boundaries
::: columns
:::: {.column width=75%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/nn_decision_boundary.png}
\end{figure}
::::
:::: {.column width=25%}
\vspace{3ex}
\footnotesize
\textcolor{gray}{P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273}
\normalsize
::::
:::
## Fun with neural nets in the browser
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/tf_playground.png}
\end{figure}
\tiny
[\textcolor{gray}{http://playground.tensorflow.org}](http://playground.tensorflow.org)
\normalsize
## Backpropagation (1)
Start with an initial guess $\vec w_0$ for the weights and then update the weights after each training event $a$ (with loss $E_a$):
$$ \vec w^{(\tau+1)} = \vec w^{(\tau)} - \eta \nabla E_a(\vec w^{(\tau)}), \quad \eta = \text{learning rate}$$
Gradient descent:
\begin{figure}
\centering
\includegraphics[width=0.46\textwidth]{figures/gradient_descent.png}
\end{figure}
## Backpropagation (2)
::: columns
:::: {.column width=40%}
\vspace{6ex}
![](figures/mlp.png){width=100%}
::::
:::: {.column width=60%}
Let's write the network output as follows:
\begin{align*}
y(\vec x) &= h(u(\vec x)); \quad u(\vec x) = \sum_{j=0}^m w_{1j}^{(2)} \phi_j(\vec x) \\
\phi_j(\vec x) &= h\left( \sum_{k=0}^n w_{jk}^{(1)} x_k\right)
\equiv h\left( v_j(\vec x) \right)
\end{align*}
For $E_a = \frac{1}{2} (y_a - t_a)^2$ one obtains for the weights from hidden layer to output:
\begin{align*}
\frac{\partial E_a}{\partial w_{1j}^{(2)}} &= (y_a -t_a) h'(u(\vec x_a)) \frac{\partial u}{\partial w_{1j}^{(2)}} \\
&= (y_a -t_a) h'(u(\vec x_a)) \phi_j(\vec x_a)
\end{align*}
::::
:::
\vspace{2ex}
Further application of the chain rule gives the weights from input to hidden layer.
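A minimal NumPy sketch of the gradient just derived, $\partial E_a / \partial w_{1j}^{(2)} = (y_a - t_a)\, h'(u)\, \phi_j(\vec x_a)$, followed by one gradient-descent step (sigmoid $h$, random toy weights; not taken from the lecture code):

\footnotesize
```python
import numpy as np

def h(z):   return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation
def h_p(z): return h(z) * (1.0 - h(z))        # its derivative h'

rng = np.random.default_rng(1)
x, t = np.array([1.0, 0.2, -0.5]), 1.0        # one training event (x_a, t_a)
W1 = rng.normal(size=(4, x.size + 1))         # hidden-layer weights (incl. bias column)
w2 = rng.normal(size=5)                       # output weights (incl. bias w_{10}^{(2)})

# forward pass
v   = W1 @ np.concatenate(([1.0], x))         # v_j(x)
phi = np.concatenate(([1.0], h(v)))           # phi_0 = 1 handles the bias term
u   = w2 @ phi
y   = h(u)

# gradient w.r.t. the hidden-to-output weights and one update step
grad_w2 = (y - t) * h_p(u) * phi
w2 -= 0.1 * grad_w2                           # eta = 0.1
```
\normalsize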
## Backpropagation (3)
Backpropagation summary
* Make a prediction for a given training instance (forward pass)
* Calculate the error (value of the loss function)
* Go backwards and determine the contribution of each weight (reverse pass)
* Adjust the weights to reduce the error
\vfill
Practical considerations:
* Nowadays, one implements neural networks with frameworks like Keras or TensorFlow
    * No need to implement backpropagation yourself
    * TensorFlow efficiently calculates the gradients of the defined functions via automatic differentiation
## More on gradient descent
::: columns
:::: {.column width=60%}
* Stochastic gradient descent
    * uses just one training event at a time
    * fast, but quite irregular approach to the minimum
    * can help escape local minima
    * one can decrease the learning rate over time to settle at the minimum ("simulated annealing")
* Batch gradient descent
    * uses the entire training sample to calculate the gradient of the loss function
    * computationally expensive
* Mini-batch gradient descent
    * calculates the gradient for a random sub-sample of the training set (see the sketch below)
::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{figures/stochastic_gradient_descent.png}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/gradient_descent_cmp.png}
\end{figure}
::::
:::
## Universal approximation theorem
::: columns
:::: {.column width=60%}
"A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of $\mathbb{R}^n$."
\vspace{5ex}
One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions.
\vspace{5ex}
The theorem does not touch upon the algorithmic learnability of the network parameters.
::::
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/ann.png}
\end{figure}
::::
:::
## Deep neural networks
Deep networks: many hidden layers with a large number of neurons
::: columns
:::: {.column width=45%}
* Challenges
    * Hard to train ("vanishing gradient problem")
    * Training slow
    * Risk of overtraining
::::
:::: {.column width=55%}
* Big progress in recent years
    * Interest in NNs waned before ca. 2006
    * Milestone: paper by G. Hinton et al. (2006): "A fast learning algorithm for deep belief nets"
    * Image recognition, AlphaGo, …
    * Soon: self-driving cars, …
::::
:::
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/dnn.png}
\end{figure}
## Drawbacks of the sigmoid activation function
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/sigmoid.png}
::::
:::: {.column width=50%}
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
\vspace{3ex}
* Saturated neurons “kill” the gradients
* Sigmoid outputs are not zero-centered
* exp() is somewhat expensive to compute
::::
:::
## Activation functions
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/activation_functions.png}
\end{figure}
## ReLU
::: columns
:::: {.column width=50%}
\includegraphics[width=.75\textwidth]{figures/relu.png}
::::
:::: {.column width=50%}
$$ f(x) = \max(0,x) $$
\vspace{1ex}
* Does not saturate (in the positive region)
* Very computationally efficient
* Converges much faster than sigmoid/tanh in practice
* Actually more biologically plausible than the sigmoid
* But: gradient vanishes for $x < 0$
::::
:::
## Bias-variance tradeoff (1)
Goal: generalization beyond the training data
* Simple models (few parameters): danger of bias
    * \textcolor{gray}{Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance")}
* Complex models (many parameters): danger of overfitting
    * \textcolor{gray}{large variance of decision boundaries for different training samples}
## Bias-variance tradeoff (2)
\begin{figure}
\centering
\includegraphics[trim=4cm 0cm 4cm 0cm, width=\textwidth]{figures/underfitting_overfitting.pdf}
\end{figure}
## Example of overtraining
Too many neurons/layers make a neural network too flexible \newline $\to$ overtraining
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{figures/example_overtraining.png}
\end{figure}
## Monitoring overtraining
Monitor the fraction of misclassified events (or the value of the loss function):
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/monitoring_overtraining.png}
\end{figure}
## Regularization: Avoid overfitting
\scriptsize
[\hfill \textcolor{gray}{http://cs231n.stanford.edu/slides}](http://cs231n.stanford.edu/slides)
\normalsize
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/regularization.png}
\end{figure}
\begin{center}
$L_1$ regularization: $R(W) = \sum_k |W_k|$, $L_2$ regularization: $R(W) = \sum_k W_k^2$
\end{center}
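As an illustration (not part of the original slides): in Keras an $L_2$ penalty on the weights of a layer can be added via `kernel_regularizer`; the penalty strength 0.01 and the input shape are arbitrary example values.

\footnotesize
```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(13,),
                 kernel_regularizer=regularizers.l2(0.01)),  # adds 0.01 * sum(W_k^2) to the loss
    layers.Dense(1),
])
```
\normalsize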
## Another approach to prevent overfitting: Dropout
* Randomly remove nodes during training
* Avoid co-adaptation of nodes
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/dropout.png}
\end{figure}
\scriptsize
\textcolor{gray}{Srivastava et al.,}
[\textcolor{gray}{"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"}](jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf)
\normalsize
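In Keras, dropout is available as a layer that randomly zeroes a fraction of its inputs during training only; a minimal sketch (the rate 0.5 and the layer sizes are just example values):

\footnotesize
```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),          # randomly drops 50% of the activations during training
    layers.Dense(10, activation="softmax"),
])
```
\normalsize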
## Pros and cons of multi-layer perceptrons
\textcolor{green}{Pros}
* Capability to learn non-linear models
\vspace{3ex}
\textcolor{red}{Cons}
* Loss function can have several local minima
* Hyperparameters need to be tuned
    * \textcolor{gray}{number of layers, neurons per layer, and training iterations}
* Sensitive to feature scaling
    * \textcolor{gray}{preprocessing needed (e.g., scaling of all features to the range [0,1])}
## Example 1: Boston house prices (MLP regression) (1)
* Objective: predict house prices in Boston suburbs in the mid-1970s
* Boston house data set: 506 instances, 13 features
\footnotesize
```
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's
```
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/05_neural_networks_boston_house_prices.ipynb)
## Example 1: Boston house prices (MLP regression) (2)
```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# note: load_boston is deprecated and removed in recent scikit-learn versions
boston = datasets.load_boston()
X = boston.data
y = boston.target
# train/test split (not shown on the original slide)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(100,),
                   activation='logistic', random_state=1, max_iter=5000)
mlp.fit(X_train, y_train)
y_pred_mlp = mlp.predict(X_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred_mlp))
print(f"root mean square error {rms:.2f}")
```
## Example 1: Boston house prices (MLP regression) (3)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/boston_house_prices.pdf}
\end{center}
## Exercise 1: XOR
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_1\_xor.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_1_xor.ipynb)
\normalsize
::: columns
:::: {.column width=60%}
a) Define a multi-layer perceptron classifier that learns the XOR problem.
\scriptsize
```python
from sklearn.neural_network import MLPClassifier
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
```
\normalsize
b) Define a multi-layer perceptron regressor that fits the depicted 2d data (see notebook).
c) Plot the mean squared error vs. the number of training epochs for b).
::::
:::: {.column width=40%}
\vspace{10ex}
![](figures/xor_like_data.pdf)
::::
:::
## Exercise 2: Visualising decision boundaries of classifiers
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_2\_decision\_boundaries.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_2_decision_boundaries.ipynb)
\normalsize
\vspace{5ex}
Visualize the decision boundaries of a scikit-learn decision tree, a scikit-learn multi-layer perceptron, and XGBoost for different toy data sets.
## Exercise 3: Boston house prices (hyperparameter optimization)
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_3\_boston\_house\_prices.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_3_boston_house_prices.ipynb)
\normalsize
\vspace{5ex}
a) Can you find better hyperparameters (number of hidden layers, neurons per layer, loss function, ...)? Try this first by hand.
b) Now use [\textcolor{gray}{sklearn.model\_selection.GridSearchCV}](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find optimal parameters (a minimal sketch of the API follows below).
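For orientation only, a minimal sketch of the `GridSearchCV` API; the parameter grid is just an example, and `X_train`, `y_train` are assumed to come from the train/test split used in the Boston example above.

\footnotesize
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],   # example values only
    "activation": ["relu", "logistic"],
}
search = GridSearchCV(MLPRegressor(max_iter=5000, random_state=1),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)          # X_train, y_train as defined earlier
print(search.best_params_, search.best_score_)
```
\normalsize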
## TensorFlow
::: columns
:::: {.column width=70%}
* Powerful open-source library with a focus on deep neural networks
* Performs computations on data flow graphs
* Takes care of computing gradients of the defined functions (\textit{automatic differentiation}, see the sketch below)
* Computations in parallel on multiple CPUs or GPUs
* Developed by the Google Brain team
* Initial release in 2015
* [https://www.tensorflow.org/](https://www.tensorflow.org/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/tensorflow.png}
\end{center}
::::
:::
## Keras
::: columns
:::: {.column width=70%}
* Open-source library providing high-level building blocks for developing deep-learning models
* Uses TensorFlow as \textit{backend engine} for low-level tensor manipulation (version 2.4)
* Part of the TensorFlow core API since the TensorFlow 1.4 release
* Over 375,000 individual users as of early 2020
* Primary author: Fran\c{c}ois Chollet (Google engineer)
* [https://keras.io/](https://keras.io/)
::::
:::: {.column width=30%}
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/keras.png}
\end{center}
::::
:::
## Example 2: Boston house prices with Keras
\small
```python
from tensorflow.keras import models
from tensorflow.keras import layers

# train_data, partial_train_data, partial_train_targets, val_data, val_targets
# and num_epochs are assumed to be defined by the train/validation split in the notebook
model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(train_data.shape[1],)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model.fit(partial_train_data, partial_train_targets,
          epochs=num_epochs, batch_size=1, verbose=0)

# Evaluate the model on the validation data
val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
```
\normalsize
\footnotesize
[\textcolor{gray}{05\_neural\_networks\_boston\_keras.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/examples/05_neural_networks_boston_keras.ipynb)
## Convolutional neural networks (CNNs)
\begin{center}
\includegraphics[width=0.7\textwidth]{figures/cnn.png}
\end{center}
::: columns
:::: {.column width=80%}
* CNNs emerged from the study of the visual cortex
* Behind many deep-learning successes
* Partially connected layers
    * \textcolor{gray}{Fully connected layers are impractical for large images (too many neurons, overfitting)}
* Key component: convolutional layers
    * \textcolor{gray}{Set of learnable filters}
    * \textcolor{gray}{Low-level features at the first layers; high-level features at the end}
::::
:::: {.column width=20%}
\small
\textcolor{gray}{Sliding $3 \times 3$ filter}
![](figures/cnn_sliding_filter.png)
::::
:::
## Different types of layers in a CNN
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{1. Convolutional layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_conv_layer.png}
::::
:::: {.column width=50%}
\small \textcolor{gray}{3. Fully connected layers} \newline
\includegraphics[width=0.9\textwidth]{figures/cnn_fully_connected.png}
::::
:::
\vspace{3ex}
::: columns
:::: {.column width=60%}
\vfill
\small \textcolor{gray}{2. Pooling layers} \newline
\includegraphics[width=\textwidth]{figures/cnn_pooling.png}
::::
:::: {.column width=40%}
\textcolor{gray}{\footnotesize Afshine Amidi, Shervine Amidi} \
[\textcolor{gray}{\footnotesize Convolutional Neural Networks cheatsheet}](https://github.com/afshinea/stanford-cs-230-deep-learning/blob/master/en/cheatsheet-convolutional-neural-networks.pdf)
::::
:::
## MNIST classification with a CNN in Keras
\footnotesize
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, MaxPooling2D, Conv2D, Input

# input_shape = (28, 28, 1) and num_classes = 10 for MNIST (defined in the notebook)
# conv layer with 8 3x3 filters
model = Sequential(
    [
        Input(shape=input_shape),
        Conv2D(8, kernel_size=(3, 3), activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(16, activation="relu"),
        Dense(num_classes, activation="softmax"),
    ]
)
model.summary()
```
\normalsize
## MNIST classification with a CNN in Keras (2)
\footnotesize
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 8)         80
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 8)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1352)              0
_________________________________________________________________
dense_2 (Dense)              (None, 16)                21648
_________________________________________________________________
dense_3 (Dense)              (None, 10)                170
=================================================================
Total params: 21,898
Trainable params: 21,898
Non-trainable params: 0
```
\normalsize
## Model definition
Using Keras, you have to `compile` a model, which means adding the loss function, the optimizer algorithm and validation metrics to your training setup.
\vspace{5ex}
\footnotesize
```python
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```
\normalsize
## Model training
\footnotesize
```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# x_train, y_train: preprocessed MNIST images and one-hot labels (see notebook)
checkpoint = ModelCheckpoint(
    filepath="mnist_keras_model.h5",
    save_best_only=True,
    verbose=1)
early_stopping = EarlyStopping(patience=2)
history = model.fit(x_train, y_train,          # Training data
                    batch_size=200,            # Batch size
                    epochs=50,                 # Maximum number of training epochs
                    validation_split=0.5,      # Use 50% of the train dataset for validation
                    callbacks=[checkpoint, early_stopping])  # Register callbacks
```
\normalsize
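Relevant for exercise 4 below: the checkpoint file written above can be restored with `load_model` and evaluated on held-out data. Here `x_test`, `y_test` denote the preprocessed MNIST test set, prepared as in the notebook.

\footnotesize
```python
from tensorflow.keras.models import load_model

model = load_model("mnist_keras_model.h5")                  # restore the trained model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.3f}")
```
\normalsize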
## Exercise 4: Training a digit-classification neural network on the MNIST dataset using Keras
\small
[\textcolor{gray}{05\_neural\_networks\_ex\_4\_mnist\_keras\_train.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/exercises/05_neural_networks_ex_4_mnist_keras_train.ipynb)
\normalsize
\vspace{5ex}
a) Plot the training and validation loss as well as the training and validation accuracy as a function of the number of epochs.
b) Determine the accuracy of the fully trained model.
c) Create a second notebook that reads the trained model (`mnist_keras_model.h5`). Read `your_own_digit.png` and classify it. Create your own $28 \times 28$ pixel digits with a program like GIMP and check how the model performs.
<!--
## Exercise 5: Higgs data set (1)
Application of deep neural networks for the separation of signal and background in an exotic Higgs scenario
\vfill
\small
\color{gray}
In this exercise we want to explore various techniques to optimize the event selection in the search for supersymmetric Higgs bosons at the LHC. In supersymmetry the Higgs sector consists of five Higgs bosons, in contrast to the single Higgs boson in the standard model. Here we deal with a heavy Higgs boson which decays into two W bosons and a standard Higgs boson ($H^0 \to W^+ W^- h$), which decay further into leptons ($W^\pm \to l^\pm \nu$) and b-quarks ($h\to b \bar{b}$), respectively.
This exercise is based on a [Nature paper](https://www.nature.com/articles/ncomms5308) (Pierre Baldi, Peter Sadowski, Daniel Whiteson) which contains much more information, such as general background information, details about the selection variables, and links to large sets of simulated events. You might also use the paper as inspiration for the solution of this exercise.
## Exercise 5: Higgs data set (2)
The two datasets consist of 10k and 100k events, respectively. For each event 29 variables are stored:
\footnotesize
```
0: classification (1 = signal, 0 = background)
1 - 21: low level quantities (var1 - var21)
22 - 28: high level quantities (var22 - var28)
```
\normalsize
You can read the data as follows:
\scriptsize
```python
import pandas as pd
#filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/data/HIGGS_10k.csv"
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/data/HIGGS_100k.csv"
df = pd.read_csv(filename, engine='python')
```
\normalsize
a) Use a classifier of your choice to separate signal and background events. Determine the accuracy score.
b) Compare the results when using i) the low level quantities and ii) the high level quantities.
-->
## Practical advice -- Which algorithm to choose?
\textcolor{gray}{From Kaggle competitions:}
\vspace{3ex}
Structured data ("high-level" features that have meaning):
* feature engineering + decision trees
* Random forests
* XGBoost
\vspace{3ex}
Unstructured data ("low-level" features with no individual meaning):
* deep neural networks
* e.g. image classification: convolutional NNs
## Outlook: Autoencoders
::: columns
:::: {.column width=50%}
* Unsupervised method based on neural networks to learn a representation of the input data
* Autoencoders learn to copy the input to the output layer
    * low-dimensional coding of the input in the central layer
* The decoder generates data based on the coding (*generative model*)
* Applications
    * Dimensionality reduction
    * Denoising of data
    * Machine translation
::::
:::: {.column width=50%}
\vspace{3ex}
\begin{center}
\includegraphics[width=\textwidth]{figures/autoencoder_example.pdf}
\end{center}
::::
:::
## Outlook: Generative adversarial networks (GANs)
\begin{center}
\includegraphics[width=0.65\textwidth]{figures/gan.png}
\end{center}
\scriptsize
[\textcolor{gray}{https://developers.google.com/machine-learning/gan/gan\_structure}](https://developers.google.com/machine-learning/gan/gan_structure)
\normalsize
* The discriminator's classification provides a signal that the generator uses to update its weights
* Application in particle physics: fast detector simulation
    * Full GEANT simulation is usually very CPU intensive
## The future
"The interesting thing about our intelligence is that we can play Go, then get up from the table and cook a meal, which a machine cannot do."
\vspace{2ex}
\color{gray}
\small
\hfill Bernhard Schölkopf, Max-Planck-Institut für intelligente Systeme ([interview, FAZ](https://www.faz.net/aktuell/wirtschaft/kuenstliche-intelligenz/ki-fachmann-wie-gut-europa-in-der-forschung-aufgestellt-ist-16650700.html))
\normalsize
\color{black}
\vfill
"My view is throw it all away and start again"
\color{gray}
\small
\hfill Geoffrey Hinton (DNN pioneer) on deep neural networks and backpropagation ([interview, 2017](https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html))
\normalsize
\color{black}