ML-Kurs-SS2023/notebooks/03_ml_basics_ex_1_magic.ipynb
2023-04-05 17:35:33 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise: Classification of air showers measured with the MAGIC telescope\n",
"\n",
"The [MAGIC telescope](https://en.wikipedia.org/wiki/MAGIC_(telescope)) is a Cherenkov telescope situated on La Palma, one of the Canary Islands. The [MAGIC machine learning dataset](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope) can be obtained from [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).\n",
"\n",
"The task is to separate signal events (gamma showers) and background events (hadron showers) based on the features of a measured Cherenkov shower.\n",
"\n",
"The features of a shower are:\n",
"\n",
" 1. fLength: continuous # major axis of ellipse [mm]\n",
" 2. fWidth: continuous # minor axis of ellipse [mm] \n",
" 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]\n",
" 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]\n",
" 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]\n",
" 6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]\n",
" 7. fM3Long: continuous # 3rd root of third moment along major axis [mm] \n",
" 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]\n",
" 9. fAlpha: continuous # angle of major axis with vector to origin [deg]\n",
" 10. fDist: continuous # distance from origin to center of ellipse [mm]\n",
" 11. class: g,h # gamma (signal), hadron (background)\n",
"\n",
"g = gamma (signal): 12332\n",
"h = hadron (background): 6688\n",
"\n",
"For technical reasons, the number of h events is underestimated.\n",
"In the real data, the h class represents the majority of the events.\n",
"\n",
"You can find further information about the MAGIC telescope and the data discrimination studies in the following [paper](https://reader.elsevier.com/reader/sd/pii/S0168900203025051?token=8A02764E2448BDC5E4DD0ED53A301295162A6E9C8F223378E8CF80B187DBFD98BD3B642AB83886944002206EB1688FF4) (R. K. Bock et al., \"Methods for multidimensional event classification: a case studyusing images from a Cherenkov gamma-ray telescope\" NIM A 516 (2004) 511-528) (You need to be within the university network to get free access.) "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/magic04_data.txt\"\n",
"df = pd.read_csv(filename, engine='python')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use categories 1 and 0 insted of \"g\" and \"h\"\n",
"df['class'] = df['class'].map({'g': 1, 'h': 0})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### a) Create for each variable a figure with a plot for gammas and hadrons overlayed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df0 = df[df['class'] == 0] # hadron data set\n",
"df1 = df[df['class'] == 1] # gamma data set\n",
"\n",
"print(len(df0),len(df1))\n",
"\n",
"### YOUR CODE ###\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### b) Create training and test data set. The tast data should amount to 50\\% of the total data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = df['class'].values\n",
"X = df[[col for col in df.columns if col!=\"class\"]]\n",
"\n",
"### YOUR CODE ### \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### c) Define the logistic regressor and fit the training data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"# define logistic regressor\n",
"\n",
"### YOUR CODE ###\n",
"\n",
"\n",
"\n",
"# fit training data\n",
"\n",
"### YOUR CODE ###\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### d) Determine the Model Accuracy, the AUC score and the Run time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import roc_auc_score\n",
"\n",
"### YOUR CODE ###\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### e) Plot the ROC curve (Backgropund Rejection vs signal efficiency)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import roc_curve\n",
"%matplotlib inline\n",
"\n",
"y_pred_prob = logreg.predict_proba(X_test) # predicted probabilities\n",
"\n",
"### YOUR CODE ###\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### YOUR CODE ###\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}