Machine learning course held as part of the Studierendentage in the summer semester (SS) 2023
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {},
  6. "source": [
  7. "# Exercise: Classification of air showers measured with the MAGIC telescope\n",
  8. "\n",
  9. "The [MAGIC telescope](https://en.wikipedia.org/wiki/MAGIC_(telescope)) is a Cherenkov telescope situated on La Palma, one of the Canary Islands. The [MAGIC machine learning dataset](https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope) can be obtained from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).\n",
  10. "\n",
  11. "The task is to separate signal events (gamma showers) and background events (hadron showers) based on the features of a measured Cherenkov shower.\n",
  12. "\n",
  13. "The features of a shower are:\n",
  14. "\n",
  15. " 1. fLength: continuous # major axis of ellipse [mm]\n",
  16. " 2. fWidth: continuous # minor axis of ellipse [mm] \n",
  17. " 3. fSize: continuous # 10-log of sum of content of all pixels [in #phot]\n",
  18. " 4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio]\n",
  19. " 5. fConc1: continuous # ratio of highest pixel over fSize [ratio]\n",
  20. " 6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm]\n",
  21. " 7. fM3Long: continuous # 3rd root of third moment along major axis [mm] \n",
  22. " 8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm]\n",
  23. " 9. fAlpha: continuous # angle of major axis with vector to origin [deg]\n",
  24. " 10. fDist: continuous # distance from origin to center of ellipse [mm]\n",
  25. " 11. class: g,h # gamma (signal), hadron (background)\n",
  26. "\n",
  27. "g = gamma (signal): 12332\n",
  28. "h = hadron (background): 6688\n",
  29. "\n",
  30. "For technical reasons, the number of h events is underestimated.\n",
  31. "In the real data, the h class represents the majority of the events.\n",
  32. "\n",
  33. "You can find further information about the MAGIC telescope and the data discrimination studies in the following [paper](https://reader.elsevier.com/reader/sd/pii/S0168900203025051?token=8A02764E2448BDC5E4DD0ED53A301295162A6E9C8F223378E8CF80B187DBFD98BD3B642AB83886944002206EB1688FF4): R. K. Bock et al., \"Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope\", NIM A 516 (2004) 511-528. (You need to be within the university network to get free access.)"
  34. ]
  35. },
  36. {
  37. "cell_type": "code",
  38. "execution_count": 2,
  39. "metadata": {},
  40. "outputs": [],
  41. "source": [
  42. "import pandas as pd\n",
  43. "import numpy as np\n",
  44. "from sklearn.model_selection import train_test_split"
  45. ]
  46. },
  47. {
  48. "cell_type": "code",
  49. "execution_count": 3,
  50. "metadata": {},
  51. "outputs": [],
  52. "source": [
  53. "filename = \"https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/data/magic04_data.txt\"\n",
  54. "df = pd.read_csv(filename, engine='python')"
  55. ]
  56. },
  57. {
  58. "cell_type": "code",
  59. "execution_count": null,
  60. "metadata": {},
  61. "outputs": [],
  62. "source": [
  63. "# use categories 1 and 0 instead of \"g\" and \"h\"\n",
  64. "df['class'] = df['class'].map({'g': 1, 'h': 0})"
  65. ]
  66. },
  67. {
  68. "cell_type": "code",
  69. "execution_count": null,
  70. "metadata": {},
  71. "outputs": [],
  72. "source": [
  73. "df.head()"
  74. ]
  75. },
  76. {
  77. "cell_type": "markdown",
  78. "metadata": {},
  79. "source": [
  80. "#### a) For each variable, create a figure in which the distributions for gammas and hadrons are overlaid."
  81. ]
  82. },
  83. {
  84. "cell_type": "code",
  85. "execution_count": null,
  86. "metadata": {},
  87. "outputs": [],
  88. "source": [
  89. "import matplotlib.pyplot as plt"
  90. ]
  91. },
  92. {
  93. "cell_type": "code",
  94. "execution_count": null,
  95. "metadata": {},
  96. "outputs": [],
  97. "source": [
  98. "df0 = df[df['class'] == 0] # hadron data set\n",
  99. "df1 = df[df['class'] == 1] # gamma data set\n",
  100. "\n",
  101. "print(len(df0),len(df1))\n",
  102. "\n",
  103. "### YOUR CODE ###\n",
  104. "\n"
  105. ]
  106. },
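One possible way to approach part a) is sketched below. It is not the course's reference solution. To stay self-contained it replaces the MAGIC file with a small synthetic DataFrame (an assumption made only for illustration) that has two of the feature names and a 0/1 `class` column. The `figures` dictionary is a hypothetical helper for collecting the plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the MAGIC DataFrame (assumption, not real data).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "fLength": rng.normal(50.0, 20.0, n),
    "fWidth": rng.normal(20.0, 8.0, n),
    "class": rng.integers(0, 2, n),
})

df0 = df[df["class"] == 0]  # hadron data set
df1 = df[df["class"] == 1]  # gamma data set

features = [col for col in df.columns if col != "class"]
figures = {}  # hypothetical container: one figure per feature
for col in features:
    fig, ax = plt.subplots()
    # Shared bin edges so that both histograms are directly comparable.
    bins = np.histogram_bin_edges(df[col], bins=50)
    # density=True normalizes each class separately, which compensates
    # for the different numbers of gamma and hadron events.
    ax.hist(df1[col], bins=bins, density=True, alpha=0.5, label="gamma (signal)")
    ax.hist(df0[col], bins=bins, density=True, alpha=0.5, label="hadron (background)")
    ax.set_xlabel(col)
    ax.set_ylabel("normalized counts")
    ax.legend()
    figures[col] = fig
```

Inside the notebook, the `matplotlib.use("Agg")` line can be dropped, and the loop can run directly on the `df0` and `df1` frames defined in the cell above.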
  107. {
  108. "cell_type": "markdown",
  109. "metadata": {},
  110. "source": [
  111. "#### b) Create training and test data sets. The test data should amount to 50\% of the total data set."
  112. ]
  113. },
  114. {
  115. "cell_type": "code",
  116. "execution_count": null,
  117. "metadata": {},
  118. "outputs": [],
  119. "source": [
  120. "y = df['class'].values\n",
  121. "X = df[[col for col in df.columns if col!=\"class\"]]\n",
  122. "\n",
  123. "### YOUR CODE ### \n",
  124. "\n"
  125. ]
  126. },
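A minimal sketch of part b), assuming a synthetic stand-in for the MAGIC DataFrame so that the example runs without downloading the file. The `stratify` and `random_state` arguments are optional choices made here and are not required by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MAGIC DataFrame (assumption, not real data).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "fLength": rng.normal(50.0, 20.0, n),
    "fWidth": rng.normal(20.0, 8.0, n),
    "class": rng.integers(0, 2, n),
})

y = df["class"].values
X = df[[col for col in df.columns if col != "class"]]

# test_size=0.5 puts half of the events into the test set.
# stratify=y keeps the gamma/hadron ratio the same in both halves;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=True, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying is a reasonable default for this data set because the two classes are not equally populated, but a plain random split would also satisfy the task as stated.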
  127. {
  128. "cell_type": "markdown",
  129. "metadata": {},
  130. "source": [
  131. "#### c) Define the logistic regression model and fit it to the training data"
  132. ]
  133. },
  134. {
  135. "cell_type": "code",
  136. "execution_count": null,
  137. "metadata": {},
  138. "outputs": [],
  139. "source": [
  140. "from sklearn import linear_model\n",
  141. "\n",
  142. "# define logistic regressor\n",
  143. "\n",
  144. "### YOUR CODE ###\n",
  145. "\n",
  146. "\n",
  147. "\n",
  148. "# fit training data\n",
  149. "\n",
  150. "### YOUR CODE ###\n",
  151. "\n"
  152. ]
  153. },
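Part c) could look roughly like the following sketch. The data here are synthetic (two Gaussian features whose mean shifts with the class label), which is an assumption made to keep the example self-contained. The variable name `logreg` follows the name used later in the notebook, and `max_iter=1000` is a choice made here, not a value prescribed by the exercise.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption, not the MAGIC data set):
# class 1 events are shifted by +1.5 in both features.
rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, 2)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# define logistic regressor
logreg = linear_model.LogisticRegression(max_iter=1000)

# fit training data
logreg.fit(X_train, y_train)
print(logreg.coef_, logreg.intercept_)
```

On the real MAGIC features, which differ considerably in scale, the default solver may emit a convergence warning. Raising `max_iter` or standardizing the features first (for example with `sklearn.preprocessing.StandardScaler`) are two common remedies.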
  154. {
  155. "cell_type": "markdown",
  156. "metadata": {},
  157. "source": [
  158. "#### d) Determine the model accuracy, the AUC score, and the run time"
  159. ]
  160. },
  161. {
  162. "cell_type": "code",
  163. "execution_count": null,
  164. "metadata": {},
  165. "outputs": [],
  166. "source": [
  167. "from sklearn.metrics import roc_auc_score\n",
  168. "\n",
  169. "### YOUR CODE ###\n",
  170. "\n"
  171. ]
  172. },
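The three quantities asked for in part d) can be obtained along the following lines. This is a sketch on synthetic data (an assumption), and it interprets "run time" as the wall-clock time of defining and fitting the model, which is one plausible reading of the task.

```python
import time

import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption, not the MAGIC data set).
rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, 2)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Time the definition and fit of the model.
start = time.perf_counter()
logreg = linear_model.LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
run_time = time.perf_counter() - start

# Accuracy uses hard class predictions (probability threshold 0.5).
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# The AUC uses the predicted probability of class 1 (gamma),
# i.e. the second column of predict_proba.
y_score = logreg.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)

print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}, run time = {run_time:.3f} s")
```

Passing hard 0/1 predictions to `roc_auc_score` instead of the probabilities would collapse the ROC curve to a single working point and generally give a lower, less informative AUC.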
  173. {
  174. "cell_type": "markdown",
  175. "metadata": {},
  176. "source": [
  177. "#### e) Plot the ROC curve (background rejection vs. signal efficiency)"
  178. ]
  179. },
  180. {
  181. "cell_type": "code",
  182. "execution_count": null,
  183. "metadata": {},
  184. "outputs": [],
  185. "source": [
  186. "import matplotlib.pyplot as plt\n",
  187. "from sklearn.metrics import roc_curve\n",
  188. "%matplotlib inline\n",
  189. "\n",
  190. "y_pred_prob = logreg.predict_proba(X_test) # predicted probabilities\n",
  191. "\n",
  192. "### YOUR CODE ###\n",
  193. "\n"
  194. ]
  195. },
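For part e), a sketch under the same assumptions (synthetic data, model named `logreg`) is given below. It takes signal efficiency to be the true positive rate and background rejection to be one minus the false positive rate, which is the usual convention but is not spelled out in the exercise text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption, not the MAGIC data set).
rng = np.random.default_rng(4)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, 2)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
logreg = linear_model.LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred_prob = logreg.predict_proba(X_test)  # predicted probabilities

# Column 1 holds the probability of class 1 (gamma, signal).
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:, 1])
signal_efficiency = tpr            # fraction of gammas kept
background_rejection = 1.0 - fpr   # fraction of hadrons rejected

fig, ax = plt.subplots()
ax.plot(signal_efficiency, background_rejection, label="logistic regression")
ax.set_xlabel("signal efficiency (true positive rate)")
ax.set_ylabel("background rejection (1 - false positive rate)")
ax.legend()
```

In the notebook, `%matplotlib inline` takes the place of the `matplotlib.use("Agg")` line, and `logreg`, `X_test`, and `y_test` come from the earlier cells.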
  196. {
  197. "cell_type": "code",
  198. "execution_count": null,
  199. "metadata": {},
  200. "outputs": [],
  201. "source": [
  202. "### YOUR CODE ###\n",
  203. "\n"
  204. ]
  205. }
  206. ],
  207. "metadata": {
  208. "kernelspec": {
  209. "display_name": "Python 3",
  210. "language": "python",
  211. "name": "python3"
  212. },
  213. "language_info": {
  214. "codemirror_mode": {
  215. "name": "ipython",
  216. "version": 3
  217. },
  218. "file_extension": ".py",
  219. "mimetype": "text/x-python",
  220. "name": "python",
  221. "nbconvert_exporter": "python",
  222. "pygments_lexer": "ipython3",
  223. "version": "3.8.5"
  224. }
  225. },
  226. "nbformat": 4,
  227. "nbformat_minor": 4
  228. }