update 03_ml_intro_mva.md

Joerg Marks 2023-04-03 11:49:25 +02:00
parent 77f81cdb2b
commit 6073838ab2


@@ -1,7 +1,8 @@
% Introduction to Data Analysis and Machine Learning in Physics: \ 3. Machine Learning Basics, Multivariate Analysis
% Jörg Marks, Klaus Reygers
% Studierendentage, 11-14 April 2023
## Multi-variate analyses (MVA)
* General Question
@@ -190,15 +191,506 @@ The $k$-NN classifier has best performance when the boundary that separates sign
## Linear regression revisited
\vfill
::: columns
:::: {.column width=50%}
\small \textcolor{gray}{"Galton family heights data": \\ origin of the term "regression"} \normalsize
![](figures/03_ml_basics_galton_linear_regression_iminuit.pdf)
::::
:::: {.column width=50%}
* data: $\{x_i,y_i\}$ \
* objective: predict $y = f(x)$
* model: $f(x; \vec \theta) = m x + b, \quad \vec \theta = (m, b)$
* loss function: $J(\theta|x,y) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2$
* model training: optimal parameters $\hat{\vec{\theta}} = \mathrm{arg\,min} \, J(\vec \theta)$
::::
:::
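\vfill
\textcolor{gray}{A minimal sketch of this fit (not from the lecture material), using synthetic data and \texttt{np.polyfit} for the least-squares estimate of $(m, b)$:}
\footnotesize
```python
import numpy as np

# synthetic (x, y) pairs for illustration -- not the Galton data
rng = np.random.default_rng(0)
x = rng.uniform(60., 75., size=200)
y = 0.65 * x + 24. + rng.normal(0., 2., size=200)

def loss(m, b):
    """mean squared error J(m, b)"""
    return np.mean((y - (m * x + b))**2)

# least-squares fit of a degree-1 polynomial gives the optimal (m, b)
m_hat, b_hat = np.polyfit(x, y, deg=1)
print(f"m = {m_hat:.3f}, b = {b_hat:.3f}, J = {loss(m_hat, b_hat):.3f}")
```
\normalsize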
## Linear regression
* Data: vectors with $p$ components ("features"): $\vec x = (x_1, ..., x_p)$
* $n$ observations: $\{\vec x_i, y_i\}, \quad i = 1, ..., n$
* Prediction for given vector $x$:
$$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x \quad \text{where } x_0 := 1 $$
* Find weights that minimize the loss function:
$$\hat{\vec{w}} = \underset{\vec w}{\mathrm{arg\,min}} \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2$$
* In the case of linear regression a closed-form solution exists:
$$ \hat{\vec{w}} = (X^\intercal X)^{-1} X^\intercal \vec y \quad \text{where} \; X \in \mathbb{R}^{n \times p}$$
* $X$ is called the design matrix, row $i$ of $X$ is $\vec x_i$
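\vfill
\textcolor{gray}{A small numerical check of the closed-form solution (an illustrative sketch, not from the lecture material):}
\footnotesize
```python
import numpy as np

# toy data: n = 100 observations, p = 3 features plus a bias column x_0 = 1
rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])   # design matrix
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# normal equation; np.linalg.solve is numerically preferable to an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat, 2))   # close to w_true
```
\normalsize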
## Linear regression with regularization
::: columns
:::: {.column width=45%}
* Standard loss function
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 $$
* Ridge regression ($L_2$ penalty)
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \|\vec w\|_2^2$$
* LASSO regression ($L_1$ penalty)
$$ C(\vec w) = \sum_{i=1}^{n} (\vec w^\intercal \vec x_i - y_i)^2 + \lambda \|\vec w\|_1 $$
::::
:::: {.column width=55%}
\vfill
![](figures/L1vsL2.pdf)
\small \textcolor{gray}{LASSO regression tends to give sparse solutions (many components $w_j = 0$). This is why LASSO regression is also called sparse regression.} \normalsize
::::
:::
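\vfill
\textcolor{gray}{In scikit-learn the regularization strength $\lambda$ is called \texttt{alpha}. A minimal sketch on toy data (not from the lecture material) illustrating the sparsity of the LASSO solution:}
\footnotesize
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# toy data: only the first two of ten features carry information
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 3. * X[:, 0] - 2. * X[:, 1] + rng.normal(scale=0.5, size=200)

for model in [LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# the Lasso coefficients of the irrelevant features are typically exactly zero
```
\normalsize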
## Logistic regression (1)
* Consider a binary classification task, e.g., $y_i \in \{0,1\}$
* Objective: Predict probability for outcome $y=1$ given an observation $\vec x$
* Starting with linear "score"
$$ s = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p \equiv \vec w^\intercal \vec x$$
* Define function that translates $s$ into a quantity that has the properties of a probability
$$ \sigma(s) = \frac{1}{1+e^{-s}} $$
* We would like to determine the optimal weights for a given training data set. They result from the maximum-likelihood principle.
## Logistic regression (2)
* Consider feature vector $\vec x$. For a given set of weights $\vec w$ the model predicts
* a probability $p(1|\vec w) = \sigma(\vec w^\intercal \vec x)$ for outcome $y=1$
* a probability $p(0|\vec w) = 1 - \sigma(\vec w^\intercal \vec x)$ for outcome $y=0$
* The probability $p(y_i | \vec w)$ defines the likelihood $L_i(\vec w) = p(y_i | \vec w)$ (the likelihood is a function of the parameters $\vec w$ and the observations $y_i$ are fixed).
* Likelihood for the full data sample ($n$ observations)
$$ L(\vec w) = \prod_{i=1}^n L_i(\vec w) = \prod_{i=1}^n \sigma(\vec w^\intercal \vec x_i)^{y_i} \,(1-\sigma(\vec w^\intercal \vec x_i))^{1-y_i} $$
* Maximizing the log-likelihood $\ln L(\vec w)$ corresponds to minimizing the loss function
$$ C(\vec w) = - \ln L(\vec w) = \sum_{i=1}^n - y_i \ln \sigma(\vec w^\intercal \vec x_i) -
(1-y_i) \ln(1-\sigma(\vec w^\intercal \vec x_i))$$
* This is nothing else but the cross-entropy loss function
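\vfill
\textcolor{gray}{A compact numerical sketch of this maximum-likelihood fit (illustrative only, not from the lecture material), minimizing the cross-entropy loss with \texttt{scipy.optimize.minimize}:}
\footnotesize
```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(s):
    return 1. / (1. + np.exp(-s))

def cross_entropy_loss(w, X, y, eps=1e-12):
    # negative log-likelihood; X contains a bias column, y holds labels in {0, 1}
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p + eps) + (1. - y) * np.log(1. - p + eps))

# toy data generated with true weights (w0, w1) = (-0.5, 2)
rng = np.random.default_rng(3)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = (rng.uniform(size=500) < sigmoid(-0.5 + 2. * x)).astype(float)

res = minimize(cross_entropy_loss, x0=np.zeros(2), args=(X, y))
print(res.x)   # roughly recovers (-0.5, 2)
```
\normalsize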
## Example 1 - Probability of passing an exam (logistic regression) (1)
Objective: predict the probability that someone passes an exam based on the number of hours studying
$$ p_\mathrm{pass} = \sigma(s) = \frac{1}{1+e^{-s}}, \quad s = w_1 t + w_0, \quad t = \text{\# hours}$$
::: columns
:::: {.column width=40%}
* Data set: \
* preparation time $t$ in hours
* passed / not passed (1/0)
* Parameters need to be determined through numerical minimization
* $w_0 = -4.0777$
* $w_1 = 1.5046$
* Determination of the underlying data structure (regression problem) \vspace{1.5ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_logistic\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_logistic_regression.ipynb)
\normalsize
::::
:::: {.column width=60%}
![](figures/03_ml_basics_logistic_regression.pdf){width=90%}
::::
:::
## Example 1 - Probability of passing an exam (logistic regression) (2)
\footnotesize
\textcolor{gray}{Read data from file:}
```python
import numpy as np
import pandas as pd

# data: 1. hours studied, 2. passed (0/1); 'filename' points to the data file
df = pd.read_csv(filename, engine='python', sep=r'\s+')
x = np.reshape(df['hours_studied'].values, (-1, 1))
y = df['passed'].values
```
\vfill
\textcolor{gray}{Fit the data:}
```python
from sklearn.linear_model import LogisticRegression
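# note: scikit-learn >= 1.2 expects penalty=None instead of penalty='none'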
clf = LogisticRegression(penalty='none', fit_intercept=True)
clf.fit(x, y);
```
\vfill
\textcolor{gray}{Calculate predictions:}
```python
hours_studied_tmp = np.linspace(0., 6., 1000)
hours_studied = np.reshape(hours_studied_tmp, (-1, 1))
y_pred = clf.predict_proba(hours_studied)
```
\normalsize
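\vfill
\footnotesize
\textcolor{gray}{As a quick cross-check (not part of the original notebook), the fitted parameters can be read off the classifier and compared with the values quoted on the previous slide:}
```python
print("w0 =", clf.intercept_[0])   # should be close to -4.08
print("w1 =", clf.coef_[0][0])     # should be close to 1.50
```
\normalsize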
## Precision and recall
::: columns
:::: {.column width=50%}
\textcolor{blue}{Precision:}\
Fraction of correctly classified instances among all instances that obtain a certain class label.
$$ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
\begin{center}
\textcolor{gray}{"purity"}
\end{center}
::::
:::: {.column width=50%}
\textcolor{blue}{Recall:}\
Fraction of positive instances that are correctly classified.
\vspace{2.9ex}
$$ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
\begin{center}
\textcolor{gray}{"efficiency"}
\end{center}
::::
:::
\vfill
\begin{center}
\textcolor{gray}{TP: true positives, FP: false positives, FN: false negatives}
\end{center}
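\vfill
\footnotesize
\textcolor{gray}{A small illustration (not from the lecture material) of how these two quantities are computed with scikit-learn:}
```python
from sklearn.metrics import precision_score, recall_score

# toy labels: TP = 3, FP = 1, FN = 2, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print("recall:", recall_score(y_true, y_pred))          # TP / (TP + FN) = 3/5
```
\normalsize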
## Example 2: Heart disease data set (logistic regression) (1)
\scriptsize
\textcolor{gray}{Read data:}
```python
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/data/heart.csv"
df = pd.read_csv(filename)
df
```
\vfill
![](figures/heart_table.png){width=70%}
\normalsize
\vspace{1.5ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_log\_regr\_heart\_disease.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)
\normalsize
## Example 2: Heart disease data set (logistic regression) (2)
\footnotesize
\textcolor{gray}{Define array of labels and feature vectors}
```python
y = df['target'].values
X = df[[col for col in df.columns if col!="target"]]
```
\vfill
\textcolor{gray}{Generate training and test data sets}
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
```
\vfill
\textcolor{gray}{Fit the model}
```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='none', fit_intercept=True, max_iter=1000, tol=1E-5)
lr.fit(X_train, y_train)
```
\normalsize
## Example 2: Heart disease data set (logistic regression) (3)
\footnotesize
\textcolor{gray}{Test predictions on test data set:}
```python
from sklearn.metrics import classification_report
y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))
```
\vfill
\textcolor{gray}{Output:}
```
              precision    recall  f1-score   support

           0       0.75      0.86      0.80        63
           1       0.89      0.80      0.84        89

    accuracy                           0.82       152
   macro avg       0.82      0.83      0.82       152
weighted avg       0.83      0.82      0.82       152
```
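\vfill
\textcolor{gray}{The counts behind these numbers can be inspected with a confusion matrix (a small addition, not in the original notebook):}
\footnotesize
```python
from sklearn.metrics import confusion_matrix

# rows: true class, columns: predicted class
print(confusion_matrix(y_test, y_pred_lr))
```
\normalsize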
## Example 2: Heart disease data set (logistic regression) (4)
\textcolor{gray}{Compare to another classifier using the \textit{receiver operating characteristic} (ROC) curve}
\vfill
\textcolor{gray}{Let's take the random forest classifier}
\footnotesize
```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=3)
rf.fit(X_train, y_train)
```
\normalsize
\vfill
\textcolor{gray}{Use \texttt{roc\_curve} from scikit-learn}
\footnotesize
```python
from sklearn.metrics import roc_curve
y_pred_prob_lr = lr.predict_proba(X_test) # predicted probabilities
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_prob_lr[:,1])
y_pred_prob_rf = rf.predict_proba(X_test) # predicted probabilities
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_prob_rf[:,1])
```
\normalsize
## Example 2: Heart disease data set (logistic regression) (5)
::: columns
:::: {.column width=50%}
\scriptsize
```python
plt.plot(tpr_lr, 1-fpr_lr, label="log. regression")
plt.plot(tpr_rf, 1-fpr_rf, label="random forest")
```
\vspace{5ex}
\normalsize
\textcolor{gray}{Classifiers can be compared with the \textit{area under curve} (AUC) score.}
\scriptsize
```python
from sklearn.metrics import roc_auc_score
auc_lr = roc_auc_score(y_test, y_pred_prob_lr[:,1])
auc_rf = roc_auc_score(y_test, y_pred_prob_rf[:,1])
print(f"AUC scores: {auc_lr:.2f}, {auc_rf:.2f}")
```
\vspace{5ex}
\normalsize
\textcolor{gray}{This gives}
\scriptsize
```
AUC scores: 0.82, 0.83
```
\normalsize
::::
:::: {.column width=50%}
\begin{figure}
\centering
\includegraphics[width=0.96\textwidth]{figures/03_ml_basics_log_regr_heart_disease.pdf}
\end{figure}
::::
:::
## Multinomial logistic regression: Softmax function
In the previous example we considered two classes (0, 1). For multi-class classification, the logistic function can be generalized to the softmax function.
\vfill
Now consider $k$ classes and let $s_i$ be the score for class $i$: $\vec s = (s_1, ..., s_k)$
\vfill
A probability for class $i$ can be predicted with the softmax function:
$$ \sigma(\vec s)_i = \frac{e^{s_i}}{\sum_{j=1}^k e^{s_j}} \quad \text{ for } \quad i = 1, ... , k $$
The softmax function is often used as the last activation function of a neural network in order to predict probabilities in a classification task.
\vfill
Multinomial logistic regression is also known as softmax regression.
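\vfill
\textcolor{gray}{A minimal numpy implementation (illustrative sketch, not from the lecture material):}
\footnotesize
```python
import numpy as np

def softmax(s):
    # subtract the maximum score for numerical stability
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # scores s_i for k = 3 classes
print(softmax(scores))               # approx. [0.66 0.24 0.10], sums to 1
```
\normalsize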
## Example 3: Iris data set (softmax regression) (1)
Iris flower data set
* Introduced in 1936 in a paper by Ronald Fisher
* Task: classify flowers
* Three species: iris setosa, iris virginica and iris versicolor
* Four features: petal length and width, sepal length and width (in centimeters)
::: columns
:::: {.column width=40%}
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/iris_dataset.png}
\end{figure}
::::
:::: {.column width=60%}
\vspace{2ex}
\footnotesize
[\textcolor{gray}{03\_ml\_basics\_iris\_softmax\_regression.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_iris_softmax_regression.ipynb)
\vspace{19ex}
\scriptsize
[https://archive.ics.uci.edu/ml/datasets/Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
[https://en.wikipedia.org/wiki/Iris_flower_data_set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
\normalsize
::::
:::
## Example 3: Iris data set (softmax regression) (2)
\textcolor{gray}{Get data set}
\footnotesize
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import some data to play with
# columns: Sepal Length, Sepal Width, Petal Length and Petal Width
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split data into training and test data sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
```
\normalsize
\vfill
\textcolor{gray}{Softmax regression}
\footnotesize
```python
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(multi_class='multinomial', penalty='none')
log_reg.fit(x_train, y_train);
```
\normalsize
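\vfill
\textcolor{gray}{The fitted model returns one probability per class (a quick check, not part of the original notebook):}
\footnotesize
```python
# class probabilities for the first two test flowers; each row sums to 1
print(log_reg.predict_proba(x_test[:2]))
```
\normalsize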
## Example 3: Iris data set (softmax regression) (3)
::: columns
:::: {.column width=70%}
\textcolor{gray}{Accuracy and confusion matrix for different classifiers}
\footnotesize
```python
from sklearn.metrics import accuracy_score, confusion_matrix

# kn_neigh (k-nearest neighbours) and fisher_ld (linear discriminant analysis)
# are defined and fitted earlier in the notebook
for clf in [log_reg, kn_neigh, fisher_ld]:
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    print(type(clf).__name__)
    print(f"accuracy: {acc:0.2f}")
    # confusion matrix:
    # rows: true class, columns: predicted class
    print(confusion_matrix(y_test, y_pred), "\n")
```
\normalsize
::::
:::: {.column width=30%}
\footnotesize
```
LogisticRegression
accuracy: 0.96
[[29 0 0]
[ 0 23 0]
[ 0 3 20]]
KNeighborsClassifier
accuracy: 0.95
[[29 0 0]
[ 0 23 0]
[ 0 4 19]]
LinearDiscriminantAnalysis
accuracy: 0.99
[[29 0 0]
[ 0 23 0]
[ 0 1 22]]
```
\normalsize
::::
:::
## Exercise 1: Classification of air showers measured with the MAGIC telescope
::: columns
:::: {.column width=50%}
\small
* Cosmic gamma rays (30 GeV - 30 TeV).
* Cherenkov light from air showers
* Background: air showers caused by hadrons.
\normalsize
\begin{figure}
\centering
\includegraphics[width=0.85\textwidth]{figures/magic_photo_small.png}
\end{figure}
::::
:::: {.column width=50%}
![](figures/magic_sketch.png)
::::
:::
## Exercise 1: Classification of air showers measured with the MAGIC telescope
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/magic_shower_em_had_small.png}
\end{figure}
::: columns
:::: {.column width=50%}
\begin{center}
Gamma shower
\end{center}
::::
:::: {.column width=50%}
\begin{center}
Hadronic shower
\end{center}
::::
:::
## Exercise 1: Classification of air showers measured with the MAGIC telescope
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/magic_shower_parameters.png}
\end{figure}
## Exercise 1: Classification of air showers measured with the MAGIC telescope
MAGIC data set \
\tiny
[\textcolor{gray}{https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope}](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope)
\normalsize
\scriptsize
```
1. fLength: continuous # major axis of ellipse [mm]
2. fWidth: continuous # minor axis of ellipse [mm]
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]
7. fM3Long: continuous # 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous # angle of major axis with vector to origin [deg]
10. fDist: continuous # distance from origin to center of ellipse [mm]
11. class: g,h # gamma (signal), hadron (background)
g = gamma (signal): 12332
h = hadron (background): 6688
For technical reasons, the number of h events is underestimated.
In the real data, the h class represents the majority of the events.
```
\normalsize
## Exercise 1: Classification of air showers measured with the MAGIC telescope
\small
[\textcolor{gray}{03\_ml\_basics\_ex\_1\_magic.ipynb}](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/03_ml_basics_ex_1_magic.ipynb)
\normalsize
a) Create for each variable a figure with the distributions for gammas and hadrons overlaid.
b) Create training and test data sets. The test data should amount to 50% of the total data set (a possible starting point is sketched below).
c) Define the logistic regressor and fit the training data
d) Determine the model accuracy and the AUC score
e) Plot the ROC curve (background rejection vs signal efficiency)
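\vfill
\footnotesize
\textcolor{gray}{A possible starting point for b) (an illustrative sketch; it assumes the raw file \texttt{magic04.data} is read directly from the UCI repository, while the exercise notebook may provide the data differently):}
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# column names as listed above
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data"
df = pd.read_csv(url, names=cols)

X = df.drop(columns="class")
y = (df["class"] == "g").astype(int)   # gamma = signal (1), hadron = background (0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
```
\normalsize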
## General remarks on multi-variate analyses (MVAs)
* MVA Methods
    * More effective than classic cut-based analyses
@@ -214,27 +706,17 @@ The $k$-NN classifier has best performance when the boundary that separates sign
* Make use of obvious symmetries: if e.g. a particle production process is symmetric in polar angle $\theta$ use $|\cos \theta|$ and not $\cos \theta$ as input variable
* It is generally useful to bring all input variables to a similar numerical range
## Example of feature transformation
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{figures/feature_transformation.png}
\end{figure}
## Exercise 2: Data preprocessing
a) Read the description of the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.
b) Start from the example notebook on the logistic regression for the heart disease data set ([03_ml_basics_log_regr_heart_disease.ipynb](https://nbviewer.jupyter.org/urls/www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/examples/03_ml_basics_log_regr_heart_disease.ipynb)). Pre-process the heart disease data set according to the given example. Does preprocessing make a difference in this case? (One possible scaling step is sketched below.)
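\footnotesize
\textcolor{gray}{One possible scaling step (an illustrative sketch, assuming the variables \texttt{X\_train}, \texttt{X\_test} and the classifier \texttt{lr} from the heart disease notebook):}
```python
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance;
# fit the scaler on the training data only and apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr.fit(X_train_scaled, y_train)   # then refit and re-evaluate as before
```
\normalsize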