Machine learning course as part of the Studierendentage (student days) in the summer semester 2023

---
title: |
    | Introduction to Data Analysis and Machine Learning in Physics:
    | 1. Introduction to python
author: "Martino Borsato, Jörg Marks, Klaus Reygers"
date: "Studierendentage, 11-14 April 2022"
---

## Outline of the $1^{st}$ day

* Technical instructions for your interactions with the CIP pool, like
    * using the jupyter hub
    * using python locally in your own linux environment (anaconda)
    * accessing the CIP pool from your own windows or linux system
    * transferring data from and to the CIP pool

    can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.pdf)\normalsize

* Summary of NumPy
* Plotting with matplotlib
* Input / output of data
* Summary of pandas
* Fitting with iminuit and pyROOT

## A glimpse into python classes

The following python libraries are important for data analysis and machine
learning and will be used during the course:

* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
  multi-dimensional arrays and matrices, along with high-level
  mathematical functions to operate on these arrays
* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library
* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
  mathematical algorithms for minimization, regression,
  Fourier transformation, linear algebra and image processing
* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
  python wrapper to the data fitting toolkit
  [\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html),
  the C++ successor of MINUIT, which was developed at CERN by F. James in the 1970s
* [\textcolor{violet}{pyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
  ROOT used at the LHC
* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
  python, which makes extensive use of NumPy for high-performance
  linear algebra algorithms

## NumPy

\textcolor{blue}{NumPy} (Numerical Python) is an open source Python library,
which contains multidimensional array and matrix data structures and methods
to efficiently operate on these. The core object is
a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
with arrays and matrices} due to the extensive usage of compiled code.

* It is heavily used in numerous scientific python packages
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
  creates a new array
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* For a summary see, e.g.,
  \small
  [\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize

\vfill

::: columns
:::: {.column width=30%}
::::
:::

::: columns
:::: {.column width=35%}
`c = []`
`for i in range(len(a)):`
    `c.append(a[i]*b[i])`
::::
:::: {.column width=35%}
with NumPy
`c = a * b`
::::
:::
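
The comparison between the explicit loop and the NumPy expression can be tried out directly. The following is only a minimal sketch; the values stored in `a` and `b` are made up for illustration and are not part of the course material.

\footnotesize
```python
import numpy as np

# illustrative input data (assumed values)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# explicit python loop: multiply element by element
c_loop = []
for i in range(len(a)):
    c_loop.append(a[i] * b[i])

# NumPy: the same elementwise product as one vectorized expression
c_vec = a * b

print(np.allclose(c_loop, c_vec))   # True
```
\normalsize

Both variants give the same numbers; the vectorized form runs the loop in compiled code, which is where the speed advantage of NumPy comes from.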
<!---
It seems we need to indent by hand.
I don't manage to align under the bullet text.
If we do it with columns, the vertical space with code sections is not good.
If we do it without code sections, the vertical space is ok, but there is no
code highlighting.
See the different versions of the same page in the following.
-->

## NumPy - array basics

* numpy arrays form a grid of values of the \textcolor{blue}{same type}, which are accessed by index.
  The *rank* is the number of dimensions of the array.
  There are functions to create and preset arrays.

\footnotesize
```python
import numpy as np
myA = np.array([2, 5 , 11])       # create rank 1 array (vector like)
type(myA)                         # <class 'numpy.ndarray'>
myA.shape                         # (3,)
print(myA[2])                     # 11   access 3rd element
myA[0] = 12                       # set 1st element to 12
myB = np.array([[1,5],[7,9]])     # create rank 2 array
myB.shape                         # (2,2)
print(myB[0,0],myB[0,1],myB[1,1]) # 1 5 9
myC = np.arange(6)                # create rank 1 array set to 0 - 5
myC = myC.reshape(2,3)            # reshape returns the array with shape (2,3),
                                  # it does not change myC in place
zero = np.zeros((2,5))            # 2 rows, 5 columns, set to 0
one = np.ones((2,2))              # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)          # 2 rows, 2 columns, set to 5
e = np.eye(2)                     # create 2x2 identity matrix
```
\normalsize

## NumPy - array indexing (1)

* select slices of a numpy array

\footnotesize
```python
import numpy as np
a = np.array([[1,2,3,4],
              [5,6,7,8],          # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                    # subarray of rows 0 and 1
                                  # and columns 1 and 2
print(b)                          # [[2 3]
                                  #  [6 7]]
```
\normalsize

* a slice of an array points into the same data, *modifying* it changes the original array!

\footnotesize
```python
b[0, 0] = 77                      # b[0,0] and a[0,1] are 77
r1_row = a[1, :]                  # get 2nd row -> rank 1
r1_row.shape                      # (4,)
r2_row = a[1:2, :]                # get 2nd row -> rank 2
r2_row.shape                      # (1,4)
a=np.array([[1,2],[3,4],[5,6]])   # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]]         # d contains [1 4 6]
e=a[[1, 2], [1, 1]]               # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,0]])  # address elements explicitly -> [1 4 5]
```
\normalsize
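
The statement that a slice points into the same data can be checked with a short sketch. It contrasts a slice (a view) with an explicit `.copy()`; the variable names `view` and `copy` are chosen here for illustration only.

\footnotesize
```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

view = a[:2, 1:3]           # basic slicing returns a view on the data of a
copy = a[:2, 1:3].copy()    # .copy() allocates independent memory

view[0, 0] = 77             # also changes a[0, 1]
copy[1, 1] = -1             # leaves a untouched (a[1, 2] stays 7)

print(a[0, 1])                      # 77
print(a[1, 2])                      # 7
print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, copy))    # False
```
\normalsize

Use `.copy()` whenever a subarray should be modified without touching the original array.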

## NumPy - array indexing (2)

* integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements

\footnotesize
```python
import numpy as np
a = np.array([[1,2,3,4],
              [5,6,7,8],          # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])           # create an array of column indices
s = a[np.arange(3), p_a]          # number the rows, p_a points to cols
print(s)                          # s contains [1 7 9]
a[np.arange(3),p_a] += 10         # add 10 to corresponding elements
x=np.array([[8,2],[7,4]])         # create 2x2 array
mask = (x > 5)                    # mask : array of booleans
                                  # [[ True False]
                                  #  [ True False]]
print(x[x>5])                     # select elements, prints [8 7]
```
\normalsize

* data type in numpy - create according to input numbers or set explicitly

\footnotesize
```python
x = np.array([1.1, 2.1])              # create float array
print(x.dtype)                        # prints float64
y=np.array([1.1,2.9],dtype=np.int64)  # create integer array [1 2], values are truncated
```
\normalsize
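
Boolean masks can also be used on the left-hand side of an assignment, and the data type of an existing array can be converted with `astype`. The sketch below illustrates both; the threshold of 5 and the input values are arbitrary choices for the example.

\footnotesize
```python
import numpy as np

x = np.array([[8, 2], [7, 4]])

# a boolean mask selects the elements which are overwritten
x_clipped = x.copy()
x_clipped[x_clipped > 5] = 5      # set all values above 5 to 5

# explicit conversion of the data type (float -> int truncates towards zero)
f = np.array([1.1, 2.9, -1.7])
i = f.astype(np.int64)            # values 1, 2, -1

print(x_clipped)                  # [[5 2]
                                  #  [5 4]]
print(i.dtype)                    # int64
```
\normalsize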

## NumPy - functions

* math functions operate elementwise either as operator overload or as functions

\footnotesize
```python
import numpy as np
x=np.array([[1,2],[3,4]],dtype=np.float64) # define 2x2 float array
y=np.array([[3,1],[5,1]],dtype=np.float64) # define 2x2 float array
s = x + y                          # elementwise sum
s = np.add(x,y)                    # same as x + y
s = np.subtract(x,y)               # same as x - y
s = np.multiply(x,y)               # same as x * y, no matrix multiplication!
s = np.divide(x,y)                 # same as x / y
s = np.sqrt(x)                     # also np.exp(x), np.sin(x), ...
m = x @ y                          # matrix product, same as np.dot(x, y)
np.sum(x, axis=0)                  # sum of each column
np.sum(x, axis=1)                  # sum of each row
xT = x.T                           # transpose of x
x = np.linspace(0,2*np.pi,100)     # 100 equally spaced points in [0, 2 pi]
r = np.random.default_rng(seed=42) # create a random number generator
b = r.random((2,3))                # random 2x3 matrix
```
\normalsize

##

* broadcasting in numpy

\vspace{0.4cm}

The term broadcasting describes how numpy treats arrays with different
shapes during arithmetic operations

* add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  $[b,b,b]$

\vspace{0.2cm}

* add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  $\rightarrow$ expand $b$ to $[[b,b,b],[b,b,b]]$ and add elementwise

\vspace{0.2cm}

* add a 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ the 1D array is broadcast
  across each row of the 2D array to $[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added elementwise

\vspace{0.2cm}

Arithmetic operations can only be performed if, comparing the shapes
starting from the last dimension, the sizes in each dimension are equal or
one of them is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details.

\footnotesize
```python
import numpy as np
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])  # x has shape (2, 3)
v = np.array([1,2,3])             # v has shape (3,)
x + v                             # [[2 4 6]
                                  #  [5 7 9]]
```
\normalsize
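
The compatibility rule can be illustrated with a case that fails and its repair. This is a minimal sketch; the array `w` and the helper flag `failed` are introduced only for this example.

\footnotesize
```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
w = np.array([10, 20])                 # shape (2,)

# x + w fails: the last dimensions (3 and 2) are neither equal nor 1
try:
    x + w
    failed = False
except ValueError:
    failed = True

# reshaping w to a column of shape (2, 1) makes the shapes compatible:
# the column is broadcast across the 3 columns of x
col = w.reshape(2, 1)
y = x + col

print(failed)     # True
print(y)          # [[11 12 13]
                  #  [24 25 26]]
```
\normalsize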

## Plot data

A popular library to present data is the `pyplot` module of `matplotlib`.

* Drawing a function in one plot

\footnotesize

::: columns
:::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine^2')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range
# show the plot
plt.show()
```
::::
:::: {.column width=40%}
![](figures/matplotlib_Figure_1.png)
::::
:::

\normalsize

##

* Drawing subplots in one canvas

\footnotesize

::: columns
:::: {.column width=35%}
```python
# ... (imports, x and f as in the previous plot)
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create first subplot
plt.subplot(1,2,1)
# plot first one
plt.title('exp(-0.2 x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create subplot and plot second one
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine^2'])
# show the plot
plt.show()
```
::::
:::: {.column width=40%}
\vspace{3cm}
![](figures/matplotlib_Figure_2.png)
::::
:::

\normalsize

## Image data

The `image` module of the `matplotlib` library can be used to load an image
into a numpy array and to render it.

* There are 3 common formats for the numpy array
    * (M, N) scalar data used for greyscale images
    * (M, N, 3) for RGB images (each pixel has an array with RGB color attached)
    * (M, N, 4) for RGBA images (each pixel has an array with RGB color
      and transparency attached)

The function `imread` loads the image into an `ndarray`, which can be
manipulated.

The function `imshow` renders the image data.

\vspace {2cm}
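
The three array formats can be built by hand to see how they relate. The sketch below uses only NumPy; the image size, the color value and the simple channel average for the greyscale version are assumptions made for this illustration.

\footnotesize
```python
import numpy as np

M, N = 4, 6                                  # assumed image size (rows, columns)

# (M, N, 3): RGB image, here filled with a uniform orange
rgb = np.zeros((M, N, 3), dtype=np.uint8)
rgb[:, :] = [255, 128, 0]

# (M, N): greyscale image, here simply the mean over the color channels
grey = rgb.mean(axis=2)

# (M, N, 4): RGBA image, RGB plus a fully opaque alpha channel
alpha = np.full((M, N, 1), 255, dtype=np.uint8)
rgba = np.concatenate([rgb, alpha], axis=2)

print(rgb.shape, grey.shape, rgba.shape)     # (4, 6, 3) (4, 6) (4, 6, 4)
```
\normalsize

Each of the three arrays can be passed to `imshow` for rendering.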

##

* Drawing pixel data and images

\footnotesize

::: columns
:::: {.column width=50%}
```python
# ...
# create data array with pixel position and RGB color code
width, height = 400, 400
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[175:225, 175:225] = [255, 0, 0]
x = np.random.randint(0,width,100)   # upper bound is exclusive
y = np.random.randint(0,height,100)
data[y,x]= [0,255,0] # random green pixels (row = y, column = x)
plt.imshow(data)
plt.show()
# ...
import matplotlib.image as mpimg
#read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0] # grab slice 0 of the colors
plt.imshow(mod_pic)  # default color map, try also cmap='hot'
plt.colorbar()       # add a color scale
plt.show()
```
::::
:::: {.column width=25%}
![](figures/matplotlib_Figure_3.png)
\vspace{1cm}
![](figures/matplotlib_Figure_4.png)
::::
:::

\normalsize

## Input / output

For the analysis of measured data efficient input/output plays an
important role. In numpy, `ndarrays` can be saved to and read in from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension),
which contain data, shape, dtype and other information required to
reconstruct the `ndarray` from the disk file.

\footnotesize
```python
import numpy as np
r = np.random.default_rng()      # instantiate random number generator
a = r.random((4,3))              # random 4x3 array
np.save('myBinary.npy', a)       # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b to one .npz archive (not
                                 # compressed, see np.savez_compressed)
# ...
b = np.load('myBinary.npy')      # read content of myBinary.npy into b
```
\normalsize

The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.

\footnotesize
```python
x = np.array([1,2,3,4,5,6,7])          # create ndarray
np.savetxt('myText.txt',x,fmt='%d')    # write array x to text file myText.txt
# ...
y = np.loadtxt('myText.txt',dtype=int) # read content of myText.txt in y
```
\normalsize
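
Reading back an `.npz` archive and using the delimiter and header parameters of the text functions could look as follows. This is a sketch under assumptions: the file name `myTable.txt`, the header text and the temporary directory are chosen only for this example.

\footnotesize
```python
import os
import tempfile
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)
b = np.arange(4)

with tempfile.TemporaryDirectory() as tmp:       # scratch directory for the demo
    npz_file = os.path.join(tmp, 'myComp.npz')
    np.savez_compressed(npz_file, a=a, b=b)      # compressed archive with 2 arrays
    with np.load(npz_file) as archive:           # arrays are accessed by keyword
        a_in = archive['a']
        b_in = archive['b']

    txt_file = os.path.join(tmp, 'myTable.txt')  # hypothetical file name
    np.savetxt(txt_file, a, fmt='%.3f', delimiter=',', header='x,y,z')
    t_in = np.loadtxt(txt_file, delimiter=',')   # the '#' header line is skipped

print(np.array_equal(a, a_in), np.array_equal(b, b_in))   # True True
print(t_in.shape)                                          # (2, 3)
```
\normalsize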

## Exercise 1

i) Display a numpy array as a figure of a blue cross. The size should be 200
   by 200 pixels. Use the array format (M, N, 3), where the first 2 dimensions specify
   the pixel position and the last one holds the 3 RGB color values from 0 to 255.

    - In addition, draw a red square at an arbitrary position in the figure.
    - Draw a circle in the center of the figure. Try to create a mask which
      selects the inner part of the circle using indexing.

\small
[Solution: 01_intro_ex_1a_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1a_sol.py) \normalsize

ii) Read data which contains pixels from the binary file horse.py into a
    numpy array. Display the data and the following transformations in 4
    subplots: scaling and translation, compression in x and y, rotation
    and mirroring.

\small
[Solution: 01_intro_ex_1b_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_1b_sol.py) \normalsize

## Pandas

[\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in Python for
\textcolor{blue}{data manipulation and analysis}.

\vspace{0.4cm}

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Offers data structures and operations for manipulating numerical tables with
  integrated indexing
* Imports data from various file formats, e.g. comma-separated values, JSON,
  SQL or Excel
* Provides tools for reading and writing data structures and allows analyzing, filtering,
  splitting, merging and joining
* Built on top of `NumPy`
* Visualizes the data with `matplotlib`
* Most machine learning tools support `pandas` $\rightarrow$
  it is widely used to preprocess data sets for machine learning

## Pandas micro introduction

Goal: Exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* `Series` -> vector (list) of data elements of arbitrary type
* `DataFrame` -> tabular arrangement of data elements, with an arbitrary
  type per column

Both allow cleaning data by removing `empty` or `nan` data entries

\footnotesize
```python
import numpy as np
import pandas as pd                          # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8])       # create a Series of float64
r = pd.Series(np.random.randn(4))            # Series of random numbers float64
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print(df)                                    # print the DataFrame
#                    A         B         C         D
# 2013-01-01  1.618395  1.210263 -1.276586 -0.775545
# 2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
# 2013-01-03 -0.359081  0.296019  1.541571  0.235337
new_s = s.dropna()       # return a new Series without the NaN entries
```
\normalsize
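
For a `DataFrame`, missing entries can be counted, dropped row or column wise, or replaced. The following sketch shows these options on a small table; the values and the fill value 0.0 are made up for the example.

\footnotesize
```python
import numpy as np
import pandas as pd

# small table with missing entries (illustrative values)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [4.0, 5.0, np.nan],
                   "C": [7.0, 8.0, 9.0]})

n_missing = df.isna().sum()      # number of NaN entries per column
rows_ok = df.dropna()            # keep only rows without any NaN
cols_ok = df.dropna(axis=1)      # keep only columns without any NaN
filled = df.fillna(0.0)          # replace NaN by a fixed value

print(len(rows_ok))              # 1
print(list(cols_ok.columns))     # ['C']
```
\normalsize

All of these calls return new objects; the original `df` keeps its `NaN` entries.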

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* pandas data can be saved in different file formats (CSV, JSON, html, XML,
  Excel, OpenDocument, HDF5 format, .....). `NaN` entries are kept
  in the output file.

* csv file

\footnotesize
```python
df.to_csv("myFile.csv") # Write the DataFrame df to a csv file
```
\normalsize

* HDF5 output

\footnotesize
```python
df.to_hdf("myFile.h5",key='df',mode='w') # Write the DataFrame df to HDF5
s.to_hdf("myFile.h5", key='s',mode='a')  # append the Series s to the same file
```
\normalsize

* Writing to an excel file

\footnotesize
```python
df.to_excel("myFile.xlsx", sheet_name="Sheet1")
```
\normalsize

* Deleting a data file in python

\footnotesize
```python
import os
os.remove('myFile.h5')
```
\normalsize
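
A round trip through a csv file shows how the index and the `NaN` entries survive. This is a minimal sketch; the table content and the use of a temporary directory are assumptions for the example.

\footnotesize
```python
import os
import tempfile
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

with tempfile.TemporaryDirectory() as tmp:
    csv_file = os.path.join(tmp, "myFile.csv")
    df.to_csv(csv_file)                    # NaN is written as an empty field
    # the first column holds the index; parse it back into dates
    df_in = pd.read_csv(csv_file, index_col=0, parse_dates=True)

print(df_in["A"].isna().tolist())          # [False, True, False]
print(df_in.shape)                         # (3, 2)
```
\normalsize

Without `index_col=0` the dates would be read as an ordinary data column and a new integer index would be created.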

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* read in data from various formats

* csv file

\footnotesize
```python
# ...
df = pd.read_csv('heart.csv') # read csv data table
df.info()                     # prints a summary (info() itself returns None)
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 303 entries, 0 to 302
# Data columns (total 14 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   age     303 non-null    int64
#  1   sex     303 non-null    int64
#  2   cp      303 non-null    int64
# ...
print(df.head(5))     # prints the first 5 rows of the data table
print(df.describe())  # shows a quick statistic summary of your data
```
\normalsize

* Reading an excel file

\footnotesize
```python
df = pd.read_excel("myFile.xlsx", sheet_name="Sheet1", na_values=["NA"])
```
\normalsize

\textcolor{olive}{There are many options specifying details for IO.}

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Various functions exist to select and view data from pandas objects

* Display column and index

\footnotesize
```python
df.index    # show datetime index of df
# DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
#               dtype='datetime64[ns]',freq='D')
df.columns  # show columns info
# Index(['A', 'B', 'C', 'D'], dtype='object')
```
\normalsize

* `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data

\footnotesize
```python
df.to_numpy() # one dtype for the entire array, not per column!
# [[-0.62660101 -0.67330526  0.23269168 -0.67403546]
#  [-0.53033339  0.32872063 -0.09893568  0.44814084]
#  [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
\normalsize

The output does not include the index or column labels.

* more on viewing

\footnotesize
```python
df.T                                  # transpose the DataFrame df
df.sort_values(by="B")                # sort rows by the values of column B
df.sort_index(axis=0,ascending=False) # sort rows by index, descending
df.sort_index(axis=1,ascending=False) # display columns in inverse order
```
\normalsize

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions

* get a named column as a Series

\footnotesize
```python
df["A"]         # selects column A from df as a Series, similar to df.A
df.iloc[:, 0:1] # slices column A by position and keeps a DataFrame,
                # same as df.loc[:, ["A"]]
```
\normalsize

* select rows of a DataFrame

\footnotesize
```python
df[0:2]                   # selects row 0 and 1 from df
df["20130102":"20130103"] # use index labels, endpoints are included!
df.iloc[2]                # select the row at integer position 2
df.iloc[1:3, :]           # selects row 1 and 2 from df
```
\normalsize

* select by label

\footnotesize
```python
df.loc["20130102":"20130103",["C","D"]] # selects row 1 and 2 and only C and D
df.loc[dates[0], "A"]                   # selects a single value (scalar)
```
\normalsize

* select by lists of integer position (as in `NumPy`)

\footnotesize
```python
df.iloc[[0, 2], [1, 3]] # select rows 0 and 2 and columns B and D
df.iloc[1, 1]           # get a value explicitly
```
\normalsize

* select according to expressions

\footnotesize
```python
df.query('B<C')                   # select rows where B < C
df1=df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
```
\normalsize
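
The difference between label based and position based selection is easiest to see with fixed numbers. The sketch below replaces the random values by the integers 0 to 11 (an assumption made only to obtain checkable results).

\footnotesize
```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# deterministic values 0..11 instead of random numbers
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list("ABCD"))

col_series = df["B"]                  # Series, dimension reduced
col_frame = df.loc[:, ["B"]]          # DataFrame with a single column
by_pos = df.iloc[1:3, 0:2]            # positions: end points excluded
by_label = df.loc[dates[1]:dates[2], "A":"B"]   # labels: end points included
rows = df.query("A > 0 and D < 11")   # expression on column names

print(type(col_series).__name__, type(col_frame).__name__)  # Series DataFrame
print(by_pos.equals(by_label))                              # True
print(rows["A"].tolist())                                   # [4]
```
\normalsize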

##

\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}

* Selecting data of pandas objects continued

* Boolean indexing

\footnotesize
```python
df[df["A"] > 0] # select the rows of df where the value in column A is > 0
df[df > 0]      # keep values > 0 of the entire DataFrame, others become NaN
```
\normalsize

more complex example

\footnotesize
```python
df2 = df.copy()                     # copy df
df2["E"] = ["eight","one","four"]   # add column E
df2[df2["E"].isin(["two", "four"])] # select rows where column E contains
                                    # "two" or "four"
```
\normalsize

* Operations (in general exclude missing data)

\footnotesize
```python
df3 = df.copy()        # numeric copy (df2 has strings in column E)
df3[df3 > 0] = -df3    # all elements > 0 change sign
df.mean(0)             # get column wise mean (numbers=axis)
df.mean(1)             # get row wise mean
df.std(0)              # standard deviation according to axis
df.cumsum()            # cumulative sum of each column
df.apply(np.sin)       # apply function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply lambda function column wise
df + 10                # add scalar 10
df - [1, 2, 10 , 100]  # subtract one value per column
df.corr()              # compute pairwise correlation of columns
```
\normalsize
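
The meaning of the axis argument and the treatment of missing data can be checked on a small table with known values. This is an illustrative sketch; the numbers are made up.

\footnotesize
```python
import numpy as np
import pandas as pd

# deterministic 3x2 table with one missing value
df = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                   "B": [10.0, np.nan, 30.0]})

col_mean = df.mean(0)            # one value per column, NaN is skipped
row_mean = df.mean(1)            # one value per row
spread = df.apply(lambda x: x.max() - x.min())   # column wise range
shifted = df - [1, 10]           # the list is matched to the columns A, B

print(col_mean.tolist())         # [2.0, 20.0]
print(row_mean.tolist())         # [5.5, 2.0, 16.5]
print(spread.tolist())           # [2.0, 20.0]
```
\normalsize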

## Pandas - plotting data

[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are only 2 examples

* Plot random data in a histogram and a scatter plot

\footnotesize
```python
import matplotlib.pyplot as plt
# create DataFrame with random normal distributed data
df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
df = df + [1, 3, 8 , 10]     # shift mean to 1, 3, 8 , 10
df.plot.hist(bins=20)        # histogram all 4 columns in a new figure
g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
plt.show()
```
\normalsize

::: columns
:::: {.column width=35%}
![](figures/pandas_histogramm.png)
::::
:::: {.column width=35%}
![](figures/pandas_scatterplot.png)
::::
:::

## Pandas - plotting data

The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame, which counts how often each
combination of the input values occurs.

\footnotesize
```python
df = pd.DataFrame(    # create DataFrame of 2 categories
    {"sex": np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    } )
pd.crosstab(df.sex,df.heart) # create cross table of possibilities
pd.crosstab(df.sex,df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
\normalsize

::: columns
:::: {.column width=42%}
![](figures/pandas_crosstabplot.png)
::::
:::
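
For the small table used here, the counts can be verified by hand. The sketch below prints the cross table as plain numbers and additionally normalizes each row; the `normalize="index"` option is an extra shown for illustration and is not used elsewhere in these slides.

\footnotesize
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex":   np.array([0,0,0,0,1,1,1,1,0,0,0]),
                   "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])})

ct = pd.crosstab(df.sex, df.heart)       # counts per (sex, heart) combination
ct_norm = pd.crosstab(df.sex, df.heart, normalize="index")  # fractions per row

print(ct.to_numpy().tolist())            # [[3, 4], [1, 3]]
print(ct_norm.loc[1, 1])                 # 0.75
```
\normalsize

Rows correspond to the values of `sex`, columns to the values of `heart`; the sum of all counts equals the number of rows of `df`.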

## Exercise 2

Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/exercises/heart.csv) into a DataFrame.
[\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)

\setbeamertemplate{itemize item}{\color{red}$\square$}

* Which columns do we have?
* Print the first 3 rows
* Print the statistics summary and the correlations
* Print mean values for each column with and without disease
* Select the data according to `sex` and `target` (heart disease 0=no 1=yes).
* Plot the `age` distribution of male and female in one histogram
* Plot the heart disease distribution according to chest pain type `cp`
* Plot `thalach` according to `target` in one histogram
* Plot `sex` and `target` in a histogram figure
* Correlate `age` and max heart rate (`thalach`) according to `target`
* Correlate `age` and cholesterol (`chol`) according to `target`

\small
[Solution: 01_intro_ex_2_sol.py](https://www.physi.uni-heidelberg.de/~reygers/lectures/2021/ml/solutions/01_intro_ex_2_sol.py) \normalsize