Machine learning course as part of the Studierendentage (student days) in the summer semester 2023
  1. % Introduction to Data Analysis and Machine Learning in Physics: \ 1. Introduction to python
  2. % Day 1: 11. April 2023
  3. % \underline{Jörg Marks}, Klaus Reygers
  4. ## Outline of the $1^{st}$ day
  5. * Technical instructions for your interactions with the CIP pool, like
  6. * using the jupyter hub
  7. * using python locally in your own linux environment (anaconda)
  8. * access the CIP pool from your own windows or linux system
  9. * transfer data from and to the CIP pool
  10. Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](\normalsize
  11. * Summary of NumPy
  12. * Plotting with matplotlib
  13. * Input / output of data
  14. * Summary of pandas
  15. * Fitting with iminuit and PyROOT
## A glimpse into python libraries
The following python libraries are important for \textcolor{red}{data analysis and machine
learning} and will be useful during the course
  19. * [\textcolor{violet}{NumPy}]( - python library adding support for large,
  20. multi-dimensional arrays and matrices, along with high-level
  21. mathematical functions to operate on these arrays
  22. * [\textcolor{violet}{matplotlib}]( - a python plotting library
  23. * [\textcolor{violet}{SciPy}]( - extension of NumPy by a collection of
  24. mathematical algorithms for minimization, regression,
  25. fourier transformation, linear algebra and image processing
  26. * [\textcolor{violet}{iminuit}]( -
  27. python wrapper to the data fitting toolkit
  28. [\textcolor{violet}{Minuit2}](
developed at CERN by F. James in the 1970s
  30. * [\textcolor{violet}{PyROOT}]( - python wrapper to the C++ data analysis toolkit
  31. ROOT [\textcolor{violet}{(lecture WS 2021 / 22)}]( used at the LHC
* [\textcolor{violet}{scikit-learn}]( - machine learning library written in
python, which makes extensive use of NumPy for high-performance
linear algebra algorithms
  35. ## NumPy
  36. \textcolor{blue}{NumPy} (Numerical Python) is an open source python library,
  37. which contains multidimensional array and matrix data structures and methods
  38. to efficiently operate on these. The core object is
  39. a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
  40. allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
  41. with arrays and matrices} due to the extensive usage of compiled code.
  42. * It is heavily used in numerous scientific python packages
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
creates a new array
  45. * Array elements are all required to be of the same data type
  46. * Facilitates advanced mathematical operations on large datasets
  47. * See for a summary, e.g.   
  48. \small
  49. [\textcolor{violet}{\#numpy}]( \normalsize
  50. \vfill
  51. ::: columns
  52. :::: {.column width=30%}
  53. ::::
  54. :::
  55. ::: columns
  56. :::: {.column width=35%}
  57. `c = []`
  58. `for i in range(len(a)):`
  59.     `c.append(a[i]*b[i])`
  60. ::::
  61. :::: {.column width=35%}
  62. with NumPy
  63. `c = a * b`
  64. ::::
  65. :::
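The comparison above can be tried out with a minimal sketch. The two input vectors `a` and `b` are not given on the slide, so their values here are assumptions chosen for illustration.

\footnotesize

```python
import numpy as np

# Two small example vectors (values are illustrative assumptions)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Without NumPy: explicit Python loop over the indices
c_loop = []
for i in range(len(a)):
    c_loop.append(a[i] * b[i])

# With NumPy: one elementwise multiplication executed in compiled code
c_vec = a * b

print(np.allclose(c_loop, c_vec))   # both approaches give the same numbers
```

\normalsize

For large arrays the vectorized form is usually much faster, because the loop runs in compiled code instead of the Python interpreter.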
  74. ## NumPy - array basics (1)
* A numpy array is a grid of values of the \textcolor{blue}{same type}, which are indexed by integers.
The *rank* is the number of dimensions of the array.
There are functions to create and preset arrays.
  78. \footnotesize
```python
import numpy as np
myA = np.array([12, 5, 11])         # create rank 1 array (vector like)
type(myA)                           # <class 'numpy.ndarray'>
myA.shape                           # (3,)
print(myA[2])                       # 11, access 3rd element
myA[0] = 12                         # set 1st element to 12
myB = np.array([[1,5],[7,9]])       # create rank 2 array
myB.shape                           # (2,2)
print(myB[0,0],myB[0,1],myB[1,1])   # 1 5 9
myC = np.arange(6)                  # create rank 1 array set to 0 - 5
myC = myC.reshape(2,3)              # reshape returns a new array of shape (2,3)
zero = np.zeros((2,5))              # 2 rows, 5 columns, set to 0
one = np.ones((2,2))                # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)            # 2 rows, 2 columns, set to 5
e = np.eye(2)                       # create 2x2 identity matrix
```
  95. \normalsize
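The statement that an `ndarray` has a fixed size can be checked directly. The following is a small sketch with made-up values: "growing" an array with `np.append` returns a new array, while `reshape` only rearranges the same number of elements.

\footnotesize

```python
import numpy as np

a = np.arange(3)                 # [0 1 2], fixed size of 3 elements
b = np.append(a, 3)              # "growing" returns a NEW array, a is unchanged
r = np.arange(6).reshape(2, 3)   # reshape keeps the 6 elements, new shape (2, 3)
flat = r.reshape(-1)             # -1 lets numpy infer the length: shape (6,)

print(a.shape, b.shape)          # (3,) (4,)
print(np.shares_memory(a, b))    # False: b owns a fresh copy of the data
print(r.shape, flat.shape)       # (2, 3) (6,)
```

\normalsize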
  96. ## NumPy - array basics (2)
  97. * Similar to a coordinate system numpy arrays also have \textcolor{blue}{axes}. numpy operations
  98. can be performed along these axes.
  99. \footnotesize
  100. ::: columns
  101. :::: {.column width=35%}
```python
# 2D arrays
five = np.full((2,3), 5) # 2 rows, 3 columns, set to 5
seven = np.full((2,3), 7) # 2 rows, 3 columns, set to 7
np.concatenate((five,seven), axis = 0) # results in a 4 x 3 array
np.concatenate((five,seven), axis = 1) # results in a 2 x 6 array
# 1D array
one = np.array([1, 1 , 1]) # 1D array of 3 elements, shape (3,), set to 1
four = np.array([4, 4 , 4]) # 1D array of 3 elements, shape (3,), set to 4
np.concatenate((one,four), axis = 0) # concat. arrays horizontally!
```
  113. ::::
  114. :::: {.column width=50%}
  115. \vspace{3cm}
  116. ![](figures/numpy_axes.png)
  117. ::::
  118. :::
  119. \normalsize
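As a short sketch of what "along an axis" means, the example below sums and concatenates a small 2 x 3 array. The array `m` and the variable names are illustrative assumptions, not part of the lecture material.

\footnotesize

```python
import numpy as np

m = np.arange(6).reshape(2, 3)                   # [[0 1 2], [3 4 5]]

col_sums = m.sum(axis=0)                         # collapse axis 0 (rows): [3 5 7]
row_sums = m.sum(axis=1)                         # collapse axis 1 (columns): [3 12]

stacked_rows = np.concatenate((m, m), axis=0)    # more rows: shape (4, 3)
stacked_cols = np.concatenate((m, m), axis=1)    # more columns: shape (2, 6)

print(col_sums, row_sums)
print(stacked_rows.shape, stacked_cols.shape)
```

\normalsize

The axis passed to a reduction such as `sum` is the one that disappears from the result, while `concatenate` extends the array along the given axis.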
  120. ## NumPy - array indexing (1)
  121. * select slices of a numpy array
  122. \footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],      # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                # subarray of rows 0 and 1 and
                              # columns 1 and 2:
                              # [[2 3]
                              #  [6 7]]
```
  131. \normalsize
* a slice of an array is a view into the same data, *modifying* it changes the original array!
  133. \footnotesize
```python
b[0, 0] = 77                       # b[0,0] and a[0,1] are 77
r1_row = a[1, :]                   # get 2nd row -> rank 1
r1_row.shape                       # (4,)
r2_row = a[1:2, :]                 # get 2nd row -> rank 2
r2_row.shape                       # (1,4)
a=np.array([[1,2],[3,4],[5,6]])    # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]]          # d contains [1 4 6]
e=a[[1, 2], [1, 1]]                # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,0]])   # address elements explicitly: [1 4 5]
```
  145. \normalsize
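The difference between a view and a copy can be made explicit with `np.shares_memory`. This is a minimal sketch using the 3 x 4 example array from above; the names `view`, `copy` and `fancy` are chosen here for illustration.

\footnotesize

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

view = a[:2, 1:3]            # basic slice: a view on the data of a
copy = a[:2, 1:3].copy()     # explicit copy: independent data
fancy = a[[0, 1], [1, 2]]    # integer array indexing: always a copy

view[0, 0] = 77              # also changes a[0, 1]
copy[0, 1] = -1              # a is not affected
fancy[0] = -5                # a is not affected

print(a[0, 1], a[0, 2])                # 77 3
print(np.shares_memory(a, view))       # True
print(np.shares_memory(a, copy))       # False
```

\normalsize

Calling `.copy()` is the usual way to protect the original array when a slice has to be modified independently.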
  146. ## NumPy - array indexing (2)
  147. * integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements
  148. \footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],       # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])        # create an array of indices
s = a[np.arange(3), p_a]       # number the rows, p_a points to cols
print(s)                       # s contains [1 7 9]
a[np.arange(3),p_a] += 10      # add 10 to corresponding elements
x = np.array([[8,2],[7,4]])    # create 2x2 array
mask = (x > 5)                 # mask: array of booleans
                               # [[ True False]
                               #  [ True False]]
print(x[x>5])                  # select elements, prints [8 7]
```
  163. \normalsize
  164. * data type in numpy - create according to input numbers or set explicitly
  165. \footnotesize
```python
x = np.array([1.1, 2.1]) # create float array
print(x.dtype) # print float64
y=np.array([1.1,2.9],dtype=np.int64) # create integer array [1 2] (truncated)
```
  171. \normalsize
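How the data type affects the stored values can be seen in a short sketch. The variable names are illustrative; `astype` and `np.rint` are standard NumPy calls.

\footnotesize

```python
import numpy as np

x = np.array([1.1, 2.9])                 # float64 by default
as_int = x.astype(np.int64)              # conversion truncates toward zero: [1 2]
rounded = np.rint(x).astype(np.int64)    # round first, then convert: [1 3]
mixed = np.array([1, 2.5])               # int and float input -> upcast to float64

print(as_int, rounded, mixed.dtype)
```

\normalsize

Converting float data to an integer type silently drops the fractional part, so rounding has to be requested explicitly when it is wanted.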
  172. ## NumPy - functions
* math functions operate elementwise, either as overloaded operators or as functions
  174. \footnotesize
```python
import numpy as np
x=np.array([[1,2],[3,4]],dtype=np.float64) # define 2x2 float array
y=np.array([[3,1],[5,1]],dtype=np.float64) # define 2x2 float array
s = x + y                  # elementwise sum
s = np.add(x,y)            # same as x + y
s = np.subtract(x,y)       # elementwise difference
s = np.multiply(x,y)       # elementwise product, no matrix multiplication!
s = np.divide(x,y)         # elementwise quotient
s = np.sqrt(x)             # elementwise square root, likewise np.exp(x), ...
p = x @ y                  # matrix product, same as np.dot(x, y)
np.sum(x, axis=0)          # sum of each column
np.sum(x, axis=1)          # sum of each row
xT = x.T                   # transpose of x
x = np.linspace(0,2*np.pi,100)       # 100 equally spaced points from 0 to 2 pi
r = np.random.default_rng(seed=42)   # create a random number generator
b = r.random((2,3))                  # random 2x3 matrix
```
  192. \normalsize
  193. ##
  194. * broadcasting in numpy
  195. \vspace{0.4cm}
  196. The term \textcolor{blue}{broadcasting} describes how numpy treats arrays
  197. with different shapes during arithmetic operations
  198. * add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  199. $[b,b,b]$
  200. \vspace{0.2cm}
  201. * add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  202. $\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element wise
  203. \vspace{0.2cm}
  204. * add 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ 1D array is broadcast
  205. across each row of the 2D array $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added element wise
  206. \vspace{0.2cm}
Arithmetic operations can only be performed when the shapes are compatible:
compared dimension by dimension, starting from the last one, the sizes must be equal or one of them must be 1. Look
  209. [\textcolor{violet}{here}]( for more details
  210. \footnotesize
  211. ```python
  212. # Add a vector to each row of a matrix
  213. x = np.array([[1,2,3], [4,5,6]]) # x has shape (2, 3)
  214. v = np.array([1,2,3]) # v has shape (3,)
  215. x + v # [[2 4 6]
  216. # [5 7 9]]
  217. ```
  218. \normalsize
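The compatibility rule can be explored with a small sketch. The column vector `w` and the failing case are assumptions added for illustration; `np.broadcast_shapes` requires NumPy 1.20 or newer.

\footnotesize

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])    # shape (2, 3)
v = np.array([1, 2, 3])                 # shape (3,)   -> repeated for each row
w = np.array([[10], [20]])              # shape (2, 1) -> repeated for each column

row_wise = x + v                        # [[2 4 6], [5 7 9]]
col_wise = x + w                        # [[11 12 13], [24 25 26]]

# shapes are compared starting from the last dimension
result_shape = np.broadcast_shapes((2, 3), (3,))   # (2, 3)

# shape (2,) does not match the last dimension of size 3 -> error
try:
    x + np.array([1, 2])
    failed = False
except ValueError as err:
    failed = True
    print("not broadcastable:", err)
```

\normalsize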
  219. ## Plot data
  220. A popular library to present data is the `pyplot` module of `matplotlib`.
  221. * Drawing a function in one plot
  222. \footnotesize
  223. ::: columns
  224. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine^2')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range
# show the plot
plt.show()
```
  241. ::::
  242. :::: {.column width=40%}
  243. ![](figures/matplotlib_Figure_1.png)
  244. ::::
  245. :::
  246. \normalsize
  247. ##
  248. * Drawing a scatter plot of data
  249. \footnotesize
  250. ::: columns
  251. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create x,y data points
num = 75
x = np.arange(num)
y = x + np.random.randint(0,num*2//3,num)
z = - (x + np.random.randint(0,num//3,num)) + num
# create colored scatter plot, sample 1
plt.scatter(x, y, color = 'green',
            label='Sample 1')
# create colored scatter plot, sample 2
plt.scatter(x, z, color = 'orange',
            label='Sample 2')
plt.title('scatter plot')
plt.xlabel('x')
plt.ylabel('y')
# description and plot
plt.legend()
plt.show()
```
  272. ::::
  273. :::: {.column width=35%}
  274. \vspace{3cm}
  275. ![](figures/matplotlib_Figure_6.png)
  276. ::::
  277. :::
  278. \normalsize
  279. ##
  280. * Drawing a histogram of data
  281. \footnotesize
  282. ::: columns
  283. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create normalized gaussian distribution
g = np.random.normal(size=10000)
# histogram the data
plt.hist(g,bins=40)
# plot rotated histogram
plt.hist(g,bins=40,orientation='horizontal')
# normalize area to 1
plt.hist(g,bins=40,density=True)
# change color
plt.hist(g,bins=40,density=True,
         edgecolor='lightgreen',color='orange')
plt.title('Gaussian Histogram')
plt.xlabel('bin')
plt.ylabel('entries')
# description and plot
plt.legend(['Normalized distribution'])
plt.show()
```
  304. ::::
  305. :::: {.column width=35%}
  306. \vspace{3.5cm}
  307. ![](figures/matplotlib_Figure_5.png)
  308. ::::
  309. :::
  310. \normalsize
  311. ##
  312. * Drawing subplots in one canvas
  313. \footnotesize
  314. ::: columns
  315. :::: {.column width=35%}
```python
# ... (x and f as defined for the sin^2 plot)
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create first subplot
plt.subplot(1,2,1)
# plot first one
plt.title('exp(-0.2 x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create subplot and plot second one
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine^2'])
# show the plot
plt.show()
```
  337. ::::
  338. :::: {.column width=40%}
  339. \vspace{3cm}
  340. ![](figures/matplotlib_Figure_2.png)
  341. ::::
  342. :::
  343. \normalsize
  344. ## Image data
The `image` module of the `matplotlib` library can be used to load an image
into a numpy array and to render it.
* There are 3 common formats for the numpy array
    * (M, N) scalar data used for greyscale images
    * (M, N, 3) for RGB images (each pixel has an array with RGB color attached)
    * (M, N, 4) for RGBA images (each pixel has an array with RGB color
      and transparency attached)
The function `imread` loads the image into an `ndarray`, which can be
manipulated.
The function `imshow` renders the image data.
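The three array formats can be built by hand, which is a quick way to get used to the shapes. This is a sketch with a tiny 4 x 6 image; the unweighted channel mean used for the greyscale conversion is a simplifying assumption, weighted conversions are also common.

\footnotesize

```python
import numpy as np

M, N = 4, 6                                        # tiny image: 4 rows, 6 columns
grey = np.zeros((M, N), dtype=np.uint8)            # (M, N): greyscale
rgb = np.zeros((M, N, 3), dtype=np.uint8)          # (M, N, 3): RGB
rgb[..., 2] = 255                                  # set the blue channel everywhere
alpha = np.full((M, N, 1), 128, dtype=np.uint8)    # half transparent
rgba = np.concatenate((rgb, alpha), axis=2)        # (M, N, 4): RGBA

# simple greyscale conversion: unweighted mean over the color axis
grey_from_rgb = rgb.mean(axis=2)

print(grey.shape, rgb.shape, rgba.shape, grey_from_rgb.shape)
```

\normalsize

Any of these arrays can be passed to `plt.imshow` for rendering.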
  355. \vspace {2cm}
  356. ##
  357. * Drawing pixel data and images
  358. \footnotesize
  359. ::: columns
  360. :::: {.column width=50%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create data array with pixel position and RGB color code
width, height = 200, 200
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[75:125, 75:125] = [255, 0, 0]
x = np.random.randint(0,width,100)    # column indices, upper bound excluded
y = np.random.randint(0,height,100)   # row indices
data[y,x]= [0,255,0] # 100 random green pixels, indexed as [row, column]
plt.imshow(data)
plt.show()
# ....
import matplotlib.image as mpimg
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0] # grab slice 0 of the colors
plt.imshow(mod_pic) # uses the default colormap, also try cmap='hot'
plt.colorbar()
plt.show()
```
  382. ::::
  383. :::: {.column width=25%}
  384. ![](figures/matplotlib_Figure_3.png)
  385. \vspace{1cm}
  386. ![](figures/matplotlib_Figure_4.png)
  387. ::::
  388. :::
  389. \normalsize
  390. ## Input / output
For the analysis of measured data, efficient input/output plays an
important role. In numpy, `ndarray`s can be saved to and read from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension),
which contain the data, shape, dtype and other information required to
reconstruct the `ndarray` from the file on disk.
  396. \footnotesize
```python
import numpy as np
r = np.random.default_rng() # instantiate random number generator
a = r.random((4,3)) # random 4x3 array
np.save('myBinary.npy', a) # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b) # write a and b to one uncompressed .npz archive
                                 # (np.savez_compressed compresses the data)
# ......
b = np.load('myBinary.npy') # read content of myBinary.npy into b
```
  406. \normalsize
The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.
  410. \footnotesize
  411. ```python
  412. x = np.array([1,2,3,4,5,6,7]) # create ndarray
  413. np.savetxt('myText.txt',x,fmt='%d', delimiter=',') # write array x to file myText.txt
  414. # with comma separation
  415. ```
  416. \normalsize
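A full round trip through a text file and a compressed archive might look like the sketch below. It writes into a temporary directory so that no files are left behind; the header text and the array contents are assumptions for illustration.

\footnotesize

```python
import os
import tempfile
import numpy as np

a = np.arange(12).reshape(4, 3)
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # text file: the header is written as a comment line starting with '#'
    txt = os.path.join(tmp, "myText.txt")
    np.savetxt(txt, a, fmt="%d", delimiter=",", header="c0,c1,c2")
    a_txt = np.loadtxt(txt, dtype=int, delimiter=",")   # comment lines are skipped

    # compressed archive holding several named arrays
    npz = os.path.join(tmp, "myComp.npz")
    np.savez_compressed(npz, a=a, b=b)
    with np.load(npz) as archive:
        a_npz = archive["a"]
        b_npz = archive["b"]

print(np.array_equal(a, a_txt), np.array_equal(a, a_npz))
```

\normalsize

Text files are readable by other programs but lose the dtype information, which is why `dtype=int` is passed to `loadtxt` here.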
  417. ## Input / output
  418. Import tabular data from table processing programs in office packages.
  419. \vspace{0.4cm}
  420. \footnotesize
  421. ::: columns
  422. :::: {.column width=35%}
  423. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  424. delimiter.
  425. ::::
  426. :::: {.column width=35%}
  427. ![](figures/numpy_excel.png)
  428. ::::
  429. :::
  430. \footnotesize
```python
import numpy as np
# read content of file myData_01.csv into data
data = np.loadtxt('myData_01.csv',dtype=int,delimiter=',')
print (data.shape) # (12, 9)
print (data)       # [[1 1 1 1 0 0 0 0 0]
                   # [0 0 1 1 0 0 1 1 0]
                   # .....
                   # [0 0 0 0 1 1 1 1 1]]
```
  441. \normalsize
  442. ## Input / output
  443. Import tabular data from table processing programs in office packages.
  444. \vspace{0.4cm}
  445. \footnotesize
  446. ::: columns
  447. :::: {.column width=35%}
  448. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  449. delimiter. \newline
\textcolor{blue}{Often many files are available (myData\_*.csv)}
  451. ::::
  452. :::: {.column width=35%}
  453. ![](figures/numpy_multi_excel.png)
  454. ::::
  455. :::
  456. \footnotesize
```python
import numpy as np
# find files and directories with names matching a pattern
import glob
# read content of all files myData_*.csv into data
file_list = sorted(glob.glob('myData_*.csv')) # generate a sorted file list
for filename in file_list:
    data = np.loadtxt(fname=filename, dtype=int, delimiter=',')
    print(data[:,3]) # print column 3 of each file
                     # [1 1 1 1 1 1 1 1 1 1 1 0]
                     # ......
                     # [0 1 0 1 0 1 0 1 0 1 0 1]
```
  470. \normalsize
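When the files share the same layout, it can be convenient to collect them in a single array instead of overwriting `data` in each iteration. The sketch below first creates three small example files in a temporary directory (their content is made up for illustration) and then stacks them.

\footnotesize

```python
import glob
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # write three small example files myData_01.csv ... myData_03.csv
    for k in range(3):
        name = os.path.join(tmp, f"myData_{k + 1:02d}.csv")
        np.savetxt(name, np.full((2, 4), k), fmt="%d", delimiter=",")

    # read all matching files in sorted order
    file_list = sorted(glob.glob(os.path.join(tmp, "myData_*.csv")))
    tables = [np.loadtxt(f, dtype=int, delimiter=",") for f in file_list]

all_data = np.stack(tables)    # new leading axis: one entry per file
print(all_data.shape)          # (3, 2, 4)
```

\normalsize

`np.stack` requires all tables to have the same shape; files of different length would have to be kept in a list or concatenated along the row axis.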
  471. ## Exercise 1
i) Display a numpy array as a figure of a blue cross. The size should be 200
by 200 pixels. Use the array format (M, N, 3), where the first two dimensions specify
the pixel position and the last one holds the RGB color with values from 0 to 255.
- In addition, draw a red square at an arbitrary position in the figure.
- Draw a circle in the center of the figure. Try to create a mask which
selects the inner part of the circle using indexing.
  478. \small
  479. [Solution: 01_intro_ex_1a_sol.ipynb]( \normalsize
  480. ii) Read data which contains pixels from the binary file into a
  481. numpy array. Display the data and the following transformations in 4
  482. subplots: scaling and translation, compression in x and y, rotation
  483. and mirroring.
  484. \small
  485. [Solution: 01_intro_ex_1b_sol.ipynb]( \normalsize
  486. ## Pandas
  487. [\textcolor{violet}{pandas}]( is a software library written in python for
  488. \textcolor{blue}{data manipulation and analysis}.
  489. \vspace{0.4cm}
  490. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  491. * Offers data structures and operations for manipulating numerical tables with
  492. integrated indexing
  493. * Imports data from various file formats, e.g. comma-separated values, JSON,
  494. SQL or Excel
* Tools for reading and writing data structures; allows analyzing, filtering,
splitting, grouping and aggregating, merging and joining, and plotting
  497. * Built on top of `NumPy`
  498. * Visualize the data with `matplotlib`
  499. * Most machine learning tools support `pandas` $\rightarrow$
  500. it is widely used to preprocess data sets for analysis and machine learning
  501. in various scientific fields
  502. ## Pandas micro introduction
Goal: exploring, cleaning, transforming, and visualizing data.
  504. The basic indexable objects are
  505. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* `Series` -> vector (list) of data elements of arbitrary type
* `DataFrame` -> tabular arrangement of data elements, where each column
can have its own type
Both allow cleaning data by removing `empty` or `nan` data entries
  510. \footnotesize
```python
import numpy as np
import pandas as pd # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8]) # create a Series, float64 because of the NaN
r = pd.Series(np.random.randn(4)) # Series of random numbers float64
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print (df) # print the DataFrame
#                    A         B         C         D
# 2013-01-01  1.618395  1.210263 -1.276586 -0.775545
# 2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
# 2013-01-03 -0.359081  0.296019  1.541571  0.235337
new_s = s.dropna() # return a new Series without the NaN entries
```
  525. \normalsize
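For a `DataFrame`, cleaning can act on rows or on columns. The following minimal sketch uses a small made-up table `df_nan` to show the most common options.

\footnotesize

```python
import numpy as np
import pandas as pd

# small example table with one missing value (illustrative data)
df_nan = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                       "B": [4.0, 5.0, 6.0]})

rows_dropped = df_nan.dropna()          # drop rows containing NaN -> 2 rows left
cols_dropped = df_nan.dropna(axis=1)    # drop columns containing NaN -> only B left
filled = df_nan.fillna(0.0)             # replace NaN by a fixed value
n_missing = df_nan.isna().sum()         # number of missing entries per column

print(rows_dropped.shape, list(cols_dropped.columns))
print(n_missing)
```

\normalsize

All three calls return new objects and leave `df_nan` unchanged.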
  526. ##
  527. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* pandas data can be saved in different file formats (CSV, JSON, HTML, XML,
Excel, OpenDocument, HDF5 format, ...). `NaN` entries are kept
in the output file, unless they are removed with `DataFrame.dropna()`
  531. * csv file
  532. \footnotesize
  533. ```python
  534. df.to_csv("myFile.csv") # Write the DataFrame df to a csv file
  535. ```
  536. \normalsize
  537. * HDF5 output
  538. \footnotesize
  539. ```python
  540. df.to_hdf("myFile.h5",key='df',mode='w') # Write the DataFrame df to HDF5
  541. s.to_hdf("myFile.h5", key='s',mode='a')
  542. ```
  543. \normalsize
  544. * Writing to an excel file
  545. \footnotesize
  546. ```python
  547. df.to_excel("myFile.xlsx", sheet_name="Sheet1")
  548. ```
  549. \normalsize
  550. * Deleting file with data in python
  551. \footnotesize
  552. ```python
  553. import os
  554. os.remove('myFile.h5')
  555. ```
  556. \normalsize
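That `NaN` entries survive a write and read cycle can be checked without touching the disk. The sketch below writes a small made-up table to an in-memory text buffer (`io.StringIO`) instead of a file and reads it back.

\footnotesize

```python
import io
import numpy as np
import pandas as pd

df_out = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]},
                      index=["r0", "r1"])

buffer = io.StringIO()             # in-memory replacement for a csv file
df_out.to_csv(buffer)              # NaN is written as an empty field
csv_text = buffer.getvalue()
print(csv_text)

# read the text back, using the first column as index
df_in = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_in)
```

\normalsize

The empty field in the CSV text is interpreted as `NaN` again when the data are read back.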
  557. ##
  558. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  559. * read in data from various formats
  560. * csv file
  561. \footnotesize
```python
import pandas as pd
df = pd.read_csv('heart.csv') # read csv data table
df.info() # print a summary of the table
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 303 entries, 0 to 302
# Data columns (total 14 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   age     303 non-null    int64
#  1   sex     303 non-null    int64
#  2   cp      303 non-null    int64
# ...
print(df.head(5)) # prints the first 5 rows of the data table
print(df.describe()) # shows a quick statistic summary of your data
```
  577. \normalsize
  578. * Reading an excel file
  579. \footnotesize
  580. ```python
  581. df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
  582. ```
  583. \normalsize
  584. \textcolor{olive}{There are many options specifying details for IO.}
  585. ##
  586. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  587. * Various functions exist to select and view data from pandas objects
  588. * Display column and index
  589. \footnotesize
```python
df.index # show datetime index of df
# DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
#               dtype='datetime64[ns]',freq='D')
df.columns # show columns info
# Index(['A', 'B', 'C', 'D'], dtype='object')
```
  597. \normalsize
  598. * `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data
  599. \footnotesize
  600. ```python
  601. df.to_numpy() # one dtype for the entire array, not per column!
  602. [[-0.62660101 -0.67330526 0.23269168 -0.67403546]
  603. [-0.53033339 0.32872063 -0.09893568 0.44814084]
  604. [-0.60289996 -0.22352548 -0.43393248 0.47531456]]
  605. ```
  606. \normalsize
  607. Does not include the index or column labels in the output
  608. * more on viewing
  609. \footnotesize
```python
df.T # transpose the DataFrame df
df.sort_values(by="B") # sort by the values of column B of df
df.sort_index(axis=0) # sort rows by index, ascending values
df.sort_index(axis=1,ascending=False) # display columns in inverse order
```
  616. \normalsize
  617. ##
  618. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  619. * Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions
  620. * get a named column as a Series
  621. \footnotesize
```python
df["A"] # selects column A from df as a Series, similar to df.A
df.iloc[:, 0:1] # slices column A by position (DataFrame), same as df.loc[:, ["A"]]
```
  626. \normalsize
  627. * select rows of a DataFrame
  628. \footnotesize
```python
df[0:2] # selects row 0 and 1 from df
df["20130102":"20130103"] # use index labels, endpoints are included!
df.iloc[2] # select the row at integer position 2 (the last of the 3 rows)
df.iloc[1:3, :] # selects row 1 and 2 from df
```
  635. \normalsize
  636. * select by label
  637. \footnotesize
  638. ```python
  639. df.loc["20130102":"20130103",["C","D"]] # selects row 1 and 2 and only C and D
  640. df.loc[dates[0], "A"] # selects a single value (scalar)
  641. ```
  642. \normalsize
  643. * select by lists of integer position (as in `NumPy`)
  644. \footnotesize
```python
df.iloc[[0, 2], [1, 3]] # select row 0 and 2 and col B and D (data only)
df.iloc[1, 1] # get a value explicitly (data only, no index lines)
```
  649. \normalsize
  650. * select according to expressions
  651. \footnotesize
  652. ```python
  653. df.query('B<C') # select rows where B < C
  654. df1=df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
  655. ```
  656. \normalsize
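The difference between label based and position based selection is easiest to see on a table with known values. The sketch below rebuilds the date indexed DataFrame with the integers 0 to 11 instead of random numbers, which is an assumption made so that the results can be checked.

\footnotesize

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130103", ["C", "D"]]   # labels: endpoints included
by_pos = df.iloc[1:3, 2:4]                             # positions: end excluded
series_a = df["A"]                                     # one column -> Series (1D)
frame_a = df[["A"]]                                    # list of columns -> DataFrame (2D)
scalar = df.loc[dates[0], "A"]                         # single value

print(by_label.equals(by_pos))     # True, both select the same 2 x 2 block
print(series_a.ndim, frame_a.shape, scalar)
```

\normalsize

Whether a selection keeps or reduces the dimension depends on passing a single label or a list or slice.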
  657. ##
  658. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  659. * Selecting data of pandas objects continued
  660. * Boolean indexing
  661. \footnotesize
```python
df[df["A"] > 0] # select the rows of df where column A is > 0
df[df > 0] # keep values > 0 of the entire DataFrame, others become NaN
```
  666. \normalsize
  667. more complex example
  668. \footnotesize
  669. ```python
  670. df2 = df.copy() # copy df
  671. df2["E"] = ["eight","one","four"] # add column E
  672. df2[df2["E"].isin(["two", "four"])] # test if elements "two" and "four" are
  673. # contained in Series column E
  674. ```
  675. \normalsize
  676. * Operations (in general exclude missing data)
  677. \footnotesize
```python
df3 = df.copy() # numeric copy (df2 contains the string column E)
df3[df3 > 0] = -df3 # all elements > 0 change sign
df.mean(0) # get column wise mean (numbers=axis)
df.mean(1) # get row wise mean
df.std(0) # standard deviation according to axis
df.cumsum() # cumulative sum of each column
df.apply(np.sin) # apply function to each element of df
df.apply(lambda x: x.max() - x.min()) # apply lambda function column wise
df + 10 # add scalar 10
df - [1, 2, 10 , 100] # subtract one value from each column
df.corr() # compute pairwise correlation of columns
```
  690. \normalsize
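That operations exclude missing data by default can be verified on a small table with known values. The table `df_m` and its numbers are assumptions for illustration.

\footnotesize

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                     "B": [2.0, 4.0, 6.0]})

col_mean = df_m.mean(axis=0)                          # NaN skipped: A -> 2.0, B -> 4.0
col_mean_strict = df_m.mean(axis=0, skipna=False)     # NaN propagates: A -> NaN
row_mean = df_m.mean(axis=1)                          # 1.5, 4.0, 4.5
col_range = df_m.apply(lambda x: x.max() - x.min())   # A -> 2.0, B -> 4.0

print(col_mean)
print(col_mean_strict)
```

\normalsize

Passing `skipna=False` is a simple way to detect that a column contains missing entries before trusting its statistics.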
  691. ## Pandas - plotting data
  692. [\textcolor{violet}{Visualization}]( is integrated in pandas using matplotlib. Here are only 2 examples
* Plot random data in a histogram and a scatter plot
  694. \footnotesize
  695. ```python
  696. # create DataFrame with random normal distributed data
  697. df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
  698. df = df + [1, 3, 8 , 10] # shift column wise mean by 1, 3, 8 , 10
  699. df.plot.hist(bins=20) # histogram all 4 columns
  700. g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
  701. df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
  702. ```
  703. \normalsize
  704. ::: columns
  705. :::: {.column width=35%}
  706. ![](figures/pandas_histogramm.png)
  707. ::::
  708. :::: {.column width=35%}
  709. ![](figures/pandas_scatterplot.png)
  710. ::::
  711. :::
  712. ## Pandas - plotting data
  713. The function crosstab() takes one or more array-like objects as indexes or
  714. columns and constructs a new DataFrame of variable counts on the inputs
  715. \footnotesize
```python
df = pd.DataFrame( # create DataFrame of 2 categories
    {"sex": np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    } )
pd.crosstab(df.sex,df.heart) # create cross table of possibilities
pd.crosstab(df.sex,df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
  724. \normalsize
  725. ::: columns
  726. :::: {.column width=38%}
  727. ![](figures/pandas_crosstabplot.png)
  728. ::::
  729. :::
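The counts behind the bar plot can be inspected directly. This sketch reuses the two category arrays from the example above under the name `df_c` (a name chosen here) and adds row and column totals with `margins=True`.

\footnotesize

```python
import numpy as np
import pandas as pd

df_c = pd.DataFrame(
    {"sex": np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]),
     "heart": np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1])})

ct = pd.crosstab(df_c["sex"], df_c["heart"])                    # counts per combination
ct_all = pd.crosstab(df_c["sex"], df_c["heart"], margins=True)  # adds the 'All' totals

print(ct)
print(ct_all)
```

\normalsize

Each cell holds the number of rows with the given combination of `sex` and `heart`, so the grand total equals the number of rows in the table.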
  730. ## Exercise 2
  731. Read the file [\textcolor{violet}{heart.csv}]( into a DataFrame.
  732. [\textcolor{violet}{Information on the dataset}](
  733. \setbeamertemplate{itemize item}{\color{red}$\square$}
  734. * Which columns do we have
  735. * Print the first 3 rows
  736. * Print the statistics summary and the correlations
  737. * Print mean values for each column with and without disease (target)
  738. * Select the data according to `sex` and `target` (heart disease 0=no 1=yes).
  739. * Plot the `age` distribution of male and female in one histogram
  740. * Plot the heart disease distribution according to chest pain type `cp`
* Plot `thalach` according to `target` in one histogram
* Plot `sex` and `target` in a histogram figure
* Correlate `age` and `max heart rate` according to `target`
* Correlate `age` and `cholesterol` according to `target`
  745. \small
  746. [Solution: 01_intro_ex_2_sol.ipynb]( \normalsize