{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise: Classification of air showers measured with the MAGIC telescope\n", "\n", "The [MAGIC telescope](https://en.wikipedia.org/wiki/MAGIC_(telescope)) is a Cherenkov telescope situated on La Palma, one of the Canary Islands. The [MAGIC machine learning dataset](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope) can be obtained from [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).\n", "\n", "The task is to separate signal events (gamma showers) and background events (hadron showers) based on the features of a measured Cherenkov shower.\n", "\n", "The features of a shower are:\n", "\n", " 1. fLength: continuous # major axis of ellipse [mm]\n", " 2. fWidth: continuous # minor axis of ellipse [mm] \n", " 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]\n", " 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]\n", " 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]\n", " 6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]\n", " 7. fM3Long: continuous # 3rd root of third moment along major axis [mm] \n", " 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]\n", " 9. fAlpha: continuous # angle of major axis with vector to origin [deg]\n", " 10. fDist: continuous # distance from origin to center of ellipse [mm]\n", " 11. 
class: g,h # gamma (signal), hadron (background)\n", "\n", "g = gamma (signal): 12332\n", "h = hadron (background): 6688\n", "\n", "For technical reasons, the number of h events is underestimated.\n", "In the real data, the h class represents the majority of the events.\n", "\n", "You can find further information about the MAGIC telescope and the data discrimination studies in the following [paper](https://reader.elsevier.com/reader/sd/pii/S0168900203025051?token=8A02764E2448BDC5E4DD0ED53A301295162A6E9C8F223378E8CF80B187DBFD98BD3B642AB83886944002206EB1688FF4) (R. K. Bock et al., \"Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope\", NIM A 516 (2004) 511-528). You need to be within the university network to get free access." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/magic04_data.txt\"\n", "df = pd.read_csv(filename, engine='python')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use categories 1 and 0 instead of \"g\" and \"h\"\n", "df['class'] = df['class'].map({'g': 1, 'h': 0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### a) For each variable, create a figure with the distributions for gammas and hadrons overlaid."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df0 = df[df['class'] == 0] # hadron data set\n", "df1 = df[df['class'] == 1] # gamma data set\n", "\n", "print(len(df0), len(df1))\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Create training and test data sets. The test data should amount to 50\\% of the total data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = df['class'].values\n", "X = df[[col for col in df.columns if col != \"class\"]]\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### c) Define the logistic regressor and fit it to the training data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# define logistic regressor (the cell in part e expects it to be named logreg)\n", "\n", "### YOUR CODE ###\n", "\n", "\n", "\n", "# fit training data\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### d) Determine the model accuracy, the AUC score and the run time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### e) Plot the ROC curve (background rejection vs. signal efficiency)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.metrics import roc_curve\n", "%matplotlib inline\n", "\n", "y_pred_prob = logreg.predict_proba(X_test) # predicted probabilities\n", "\n", "### YOUR CODE ###\n", "\n" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "### YOUR CODE ###\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }