# Exercise: Classification of air showers measured with the MAGIC telescope

The [MAGIC telescope](https://en.wikipedia.org/wiki/MAGIC_(telescope)) is a Cherenkov telescope situated on La Palma, one of the Canary Islands. The [MAGIC machine learning dataset](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope) can be obtained from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

The task is to separate signal events (gamma showers) and background events (hadron showers) based on the features of a measured Cherenkov shower.

The features of a shower are:

 1. fLength: continuous # major axis of ellipse [mm]
 2. fWidth: continuous # minor axis of ellipse [mm] 
 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]
 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]
 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]
 6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]
 7. fM3Long: continuous # 3rd root of third moment along major axis [mm] 
 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]
 9. fAlpha: continuous # angle of major axis with vector to origin [deg]
 10. fDist: continuous # distance from origin to center of ellipse [mm]
 11. class: g,h # gamma (signal), hadron (background)

g = gamma (signal): 12332
h = hadron (background): 6688

For technical reasons, the number of h events is underestimated.
In the real data, the h class represents the majority of the events.
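
The class counts quoted above already fix a useful baseline. As a minimal sketch (plain arithmetic on the two numbers given, nothing read from the data file), the class fractions can be computed as follows. A classifier that always predicts "gamma" would reach an accuracy equal to the gamma fraction, so any trained model should be judged against that value.

```python
# Event counts quoted in the dataset description above
n_gamma = 12332   # signal
n_hadron = 6688   # background

n_total = n_gamma + n_hadron
frac_gamma = n_gamma / n_total
frac_hadron = n_hadron / n_total

print(f"total events:    {n_total}")
print(f"gamma fraction:  {frac_gamma:.3f}")
print(f"hadron fraction: {frac_hadron:.3f}")
```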

You can find further information about the MAGIC telescope and the data discrimination studies in the following [paper](https://reader.elsevier.com/reader/sd/pii/S0168900203025051?token=8A02764E2448BDC5E4DD0ED53A301295162A6E9C8F223378E8CF80B187DBFD98BD3B642AB83886944002206EB1688FF4): R. K. Bock et al., "Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope", NIM A 516 (2004) 511-528. You need to be within the university network to get free access.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
filename = "https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/magic04_data.txt"
df = pd.read_csv(filename, engine='python')

In [None]:
# use categories 1 and 0 instead of "g" and "h"
df['class'] = df['class'].map({'g': 1, 'h': 0})

In [None]:
df.head()

#### a) For each variable, create a figure with the gamma and hadron distributions overlaid.

In [None]:
import matplotlib.pyplot as plt

In [None]:
df0 = df[df['class'] == 0] # hadron data set
df1 = df[df['class'] == 1] # gamma data set

print(len(df0),len(df1))

### YOUR CODE ###
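
One possible way to overlay the two classes is sketched below. It is not the reference solution: it runs on a small synthetic data frame named `demo` (a hypothetical stand-in with invented values and only two of the ten features), and it draws the features as panels of one figure rather than as separate figures. The idea carried over to the real `df` is to use shared bin edges and `density=True`, so that the different numbers of gamma and hadron events do not distort the comparison.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the MAGIC data frame (invented values, two features only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fLength": np.concatenate([rng.normal(40, 10, 500), rng.normal(60, 20, 300)]),
    "fWidth": np.concatenate([rng.normal(15, 5, 500), rng.normal(25, 10, 300)]),
    "class": np.concatenate([np.ones(500, dtype=int), np.zeros(300, dtype=int)]),
})

features = [col for col in demo.columns if col != "class"]
fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4))

for ax, col in zip(axes, features):
    # Shared bin edges so that both classes are directly comparable
    bins = np.linspace(demo[col].min(), demo[col].max(), 41)
    ax.hist(demo.loc[demo["class"] == 1, col], bins=bins,
            density=True, alpha=0.5, label="gamma")
    ax.hist(demo.loc[demo["class"] == 0, col], bins=bins,
            density=True, alpha=0.5, label="hadron")
    ax.set_xlabel(col)
    ax.set_ylabel("normalized counts")
    ax.legend()

fig.tight_layout()
```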



#### b) Create a training and a test data set. The test data should amount to 50\% of the total data set.

In [None]:
y = df['class'].values
X = df[[col for col in df.columns if col!="class"]]

### YOUR CODE ### 
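
A minimal sketch of a 50/50 split with `train_test_split`, shown on synthetic arrays `X_demo` and `y_demo` (hypothetical stand-ins for `X` and `y`). The `stratify` and `random_state` arguments are optional choices made here, not requirements of the exercise: stratifying keeps the gamma/hadron ratio the same in both halves, and a fixed seed makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 events, 10 features, roughly 65% labelled 1
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 10))
y_demo = (rng.random(1000) < 0.65).astype(int)

# test_size=0.5 puts half of the events into the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class-1 fraction in each half
```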



#### c) Define the logistic regressor and fit it to the training data

In [None]:
from sklearn import linear_model

# define logistic regressor

### YOUR CODE ###



# fit training data

### YOUR CODE ###
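
The following sketch shows one way to define and fit a logistic regressor, again on synthetic training data (invented values, three features on very different scales). Two choices here are assumptions rather than part of the exercise: the model is wrapped in a pipeline with `StandardScaler`, and `max_iter` is raised above its default. Both are meant to help the solver converge when features span different ranges, as the MAGIC features do. The name `logreg` is chosen to match the later cell that calls `logreg.predict_proba`.

```python
import numpy as np
from sklearn import linear_model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: three features on very different scales
rng = np.random.default_rng(2)
n = 1000
y_train = rng.integers(0, 2, size=n)
X_train = rng.normal(size=(n, 3)) * [1.0, 50.0, 200.0]
X_train[:, 0] += 1.5 * y_train  # only feature 0 separates the classes

# define logistic regressor (scaling step added as a precaution)
logreg = make_pipeline(
    StandardScaler(),
    linear_model.LogisticRegression(max_iter=1000),
)

# fit training data
logreg.fit(X_train, y_train)

print(logreg.score(X_train, y_train))  # accuracy on the training sample
```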



#### d) Determine the model accuracy, the AUC score, and the run time

In [None]:
from sklearn.metrics import roc_auc_score

### YOUR CODE ###
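
As a hedged illustration of the three quantities, the sketch below trains a plain `LogisticRegression` on synthetic data and measures accuracy, AUC, and fit time. The data, the variable names `model`, `acc`, `auc`, and `run_time`, and the decision to time only the `fit` call are assumptions made for this example; "run time" could equally include prediction. Note that accuracy is computed from hard class predictions, whereas the AUC needs a continuous score, here the predicted probability of class 1.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one discriminating feature
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=2000)
X = rng.normal(size=(2000, 4))
X[:, 0] += 1.5 * y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Time only the training step
t0 = time.perf_counter()
model.fit(X_train, y_train)
run_time = time.perf_counter() - t0

y_pred = model.predict(X_test)               # hard labels, used for accuracy
y_score = model.predict_proba(X_test)[:, 1]  # P(class 1), used for AUC

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(f"accuracy: {acc:.3f}")
print(f"AUC:      {auc:.3f}")
print(f"fit time: {run_time:.4f} s")
```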



#### e) Plot the ROC curve (background rejection vs. signal efficiency)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
%matplotlib inline

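# predicted probabilities, shape (n_samples, 2); column 1 is P(class 1) = P(gamma)
# (assumes logreg and X_test were defined in parts b) and c))
y_pred_prob = logreg.predict_proba(X_test)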

### YOUR CODE ###
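
The relation between `roc_curve` and the requested axes can be sketched as follows, using synthetic labels and scores (hypothetical stand-ins for the test labels and for column 1 of `y_pred_prob`). `roc_curve` returns the false positive rate and the true positive rate; with gammas as the positive class, the signal efficiency is the true positive rate and the background rejection is one minus the false positive rate.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic labels and classifier scores (invented for illustration)
rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, size=2000)
score = rng.normal(size=2000) + 1.5 * y_true  # higher score means more gamma-like

fpr, tpr, thresholds = roc_curve(y_true, score)

signal_eff = tpr     # fraction of gammas kept
bkg_rej = 1.0 - fpr  # fraction of hadrons rejected

fig, ax = plt.subplots()
ax.plot(signal_eff, bkg_rej)
ax.set_xlabel("signal efficiency (true positive rate)")
ax.set_ylabel("background rejection (1 - false positive rate)")
ax.set_title("ROC curve")
```

A curve that stays close to the upper right corner indicates good separation; the diagonal from (0, 1) to (1, 0) corresponds to random guessing.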



In [None]:
### YOUR CODE ###

