Machine learning course as part of the Studierendentage (student days) in the summer semester 2023
  1. % Introduction to Data Analysis and Machine Learning in Physics: \ 1. Introduction to python
  2. % Day 1: 11. April 2023
  3. % \underline{Jörg Marks}, Klaus Reygers
  4. ## Outline of the $1^{st}$ day
* Technical instructions for your interactions with the CIP pool, like
    * using the jupyter hub
    * using python locally in your own linux environment (anaconda)
    * accessing the CIP pool from your own windows or linux system
    * transferring data from and to the CIP pool

    can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.PDF)\normalsize
* Summary of NumPy
* Plotting with matplotlib
* Input / output of data
* Summary of pandas
* Fitting with iminuit and PyROOT
## A glimpse into python packages
The following python packages are important for \textcolor{red}{data analysis and machine
learning} and will be useful during the course

* [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
multi-dimensional arrays and matrices, along with high-level
mathematical functions to operate on these arrays
* [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library
* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
mathematical algorithms for minimization, regression,
Fourier transformation, linear algebra and image processing
* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
python wrapper to the data fitting toolkit
[\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
developed at CERN by F. James in the 1970s
* [\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
ROOT [\textcolor{violet}{(lecture WS 2021 / 22)}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/) used at the LHC
* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
python, which makes extensive use of NumPy for high-performance
linear algebra algorithms
  35. ## NumPy
  36. \textcolor{blue}{NumPy} (Numerical Python) is an open source python library,
  37. which contains multidimensional array and matrix data structures and methods
  38. to efficiently operate on these. The core object is
  39. a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
  40. allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
  41. with arrays and matrices} due to the extensive usage of compiled code.
* It is heavily used in numerous scientific python packages
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
creates a new array
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* For a summary see, e.g.   
  48. \small
  49. [\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize
  50. \vfill
  51. ::: columns
  52. :::: {.column width=30%}
  53. ::::
  54. :::
  55. ::: columns
  56. :::: {.column width=35%}
  57. `c = []`
  58. `for i in range(len(a)):`
  59.     `c.append(a[i]*b[i])`
  60. ::::
  61. :::: {.column width=35%}
  62. with NumPy
  63. `c = a * b`
  64. ::::
  65. :::
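##

The comparison above can be tried out directly. The following is a minimal sketch with made-up example values for `a` and `b` (they are not part of the original slides); the names `c_loop` and `c_vec` are chosen here only for illustration.

\footnotesize
```python
import numpy as np

# two small example vectors (values chosen only for illustration)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# plain python: explicit loop over all indices
c_loop = []
for i in range(len(a)):
    c_loop.append(a[i] * b[i])

# NumPy: one vectorized expression, evaluated in compiled code
c_vec = a * b

print(c_loop)
print(c_vec)
```
\normalsize

Both variants give the same numbers; the vectorized form is shorter and typically much faster for large arrays.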
  66. <!---
  67. It seem we need to indent by hand.
  68. I don't manage to align under the bullet text
  69. If we do it with column the vertical space is with code sections not good
  70. If we do it without code section the vertical space is ok, but there is no
  71. code high lightning.
  72. See the different versions of the same page in the following
  73. -->
  74. ## NumPy - array basics (1)
* numpy arrays form a grid of values of the \textcolor{blue}{same type}, which are indexed by integers.
The *rank* is the number of dimensions of the array.
There are functions to create and preset arrays.
  78. \footnotesize
```python
import numpy as np

myA = np.array([12, 5, 11])              # create rank 1 array (vector like)
type(myA)                                # <class 'numpy.ndarray'>
myA.shape                                # (3,)
print(myA[2])                            # 11, access 3rd element
myA[0] = 12                              # set 1st element to 12
myB = np.array([[1,5],[7,9]])            # create rank 2 array
myB.shape                                # (2,2)
print(myB[0,0],myB[0,1],myB[1,1])        # 1 5 9
myC = np.arange(6)                       # create rank 1 array set to 0 - 5
myC = myC.reshape(2,3)                   # reshape returns a new (2,3) array
zero = np.zeros((2,5))                   # 2 rows, 5 columns, set to 0
one = np.ones((2,2))                     # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)                 # 2 rows, 2 columns, set to 5
e = np.eye(2)                            # create 2x2 identity matrix
```
  95. \normalsize
  96. ## NumPy - array basics (2)
  97. * Similar to a coordinate system numpy arrays also have \textcolor{blue}{axes}. numpy operations
  98. can be performed along these axes.
  99. \footnotesize
  100. ::: columns
  101. :::: {.column width=35%}
```python
import numpy as np
# 2D arrays
five = np.full((2,3), 5)                # 2 rows, 3 columns, set to 5
seven = np.full((2,3), 7)               # 2 rows, 3 columns, set to 7
np.concatenate((five,seven), axis = 0)  # results in a 4 x 3 array
np.concatenate((five,seven), axis = 1)  # results in a 2 x 6 array
# 1D arrays
one = np.array([1, 1, 1])               # 1D array of shape (3,), set to 1
four = np.array([4, 4, 4])              # 1D array of shape (3,), set to 4
np.concatenate((one,four), axis = 0)    # 1D array of shape (6,): [1 1 1 4 4 4]
```
  113. ::::
  114. :::: {.column width=50%}
  115. \vspace{3cm}
  116. ![](figures/numpy_axes.png)
  117. ::::
  118. :::
  119. \normalsize
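##

To make the meaning of the `axis` argument concrete, here is a small sketch on a (2,3) array. The array `m` and the result names are assumptions made for this illustration only.

\footnotesize
```python
import numpy as np

m = np.arange(6).reshape(2, 3)     # [[0 1 2]
                                   #  [3 4 5]], shape (2, 3)

# a reduction along an axis removes that axis from the shape
col_sums = m.sum(axis=0)           # sum over the rows    -> shape (3,)
row_sums = m.sum(axis=1)           # sum over the columns -> shape (2,)

# concatenation along an axis makes that axis longer
stacked_rows = np.concatenate((m, m), axis=0)   # shape (4, 3)
stacked_cols = np.concatenate((m, m), axis=1)   # shape (2, 6)

print(col_sums, row_sums)
print(stacked_rows.shape, stacked_cols.shape)
```
\normalsize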
  120. ## NumPy - array indexing (1)
  121. * select slices of a numpy array
  122. \footnotesize
```python
import numpy as np
a = np.array([[1,2,3,4],
              [5,6,7,8],       # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                 # subarray of rows 0 and 1 and
                               # columns 1 and 2:
                               # array([[2, 3],
                               #        [6, 7]])
```
  131. \normalsize
* a slice of an array is a view on the same data, *modifying* it changes the original array!
  133. \footnotesize
```python
b[0, 0] = 77                        # b[0,0] and a[0,1] are 77
r1_row = a[1, :]                    # get 2nd row -> rank 1
r1_row.shape                        # (4,)
r2_row = a[1:2, :]                  # get 2nd row -> rank 2
r2_row.shape                        # (1,4)
a=np.array([[1,2],[3,4],[5,6]])     # set a , 3 rows 2 cols
d=a[[0, 1, 2], [0, 1, 1]]           # d contains [1 4 6]
e=a[[1, 2], [1, 1]]                 # e contains [4 6]
np.array([a[0,0],a[1,1],a[2,1]])    # address the elements of d explicitly
```
  145. \normalsize
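##

The difference between a slice (a view on the same data) and integer array indexing (a copy) can be checked with `np.shares_memory`. This is a minimal sketch; the names `view`, `fancy` and `safe` are chosen here for illustration.

\footnotesize
```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

view = a[:2, 1:3]            # basic slicing -> view on the data of a
view[0, 0] = 77              # also changes a[0, 1]

fancy = a[[0, 1], [1, 2]]    # integer array indexing -> independent copy
fancy[0] = -1                # a is not affected

safe = a[:2, 1:3].copy()     # explicit copy of a slice
safe[0, 0] = 0               # a is not affected

print(a[0, 1])                       # 77
print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, fancy))    # False
```
\normalsize

If the original array must stay unchanged, taking an explicit `.copy()` of the slice is the safe choice.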
  146. ## NumPy - array indexing (2)
  147. * integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements
  148. \footnotesize
```python
import numpy as np
a = np.array([[1,2,3,4],
              [5,6,7,8],         # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])          # create an array of indices
s = a[np.arange(3), p_a]         # number the rows, p_a points to cols
print (s)                        # s contains [1 7 9]
a[np.arange(3),p_a] += 10        # add 10 to corresponding elements
x=np.array([[8,2],[7,4]])        # create 2x2 array
mask = (x > 5)                   # mask : array of booleans
                                 # [[True False]
                                 #  [True False]]
print(x[mask])                   # select elements, prints [8 7]
```
  163. \normalsize
  164. * data type in numpy - create according to input numbers or set explicitly
  165. \footnotesize
```python
x = np.array([1.1, 2.1])               # create float array
print(x.dtype)                         # print float64
y=np.array([1.1,2.9],dtype=np.int64)   # create integer array [1 2], decimals are cut off
```
  171. \normalsize
  172. ## NumPy - functions
* math functions operate elementwise either as overloaded operators or as functions
\footnotesize
```python
import numpy as np
x=np.array([[1,2],[3,4]],dtype=np.float64) # define 2x2 float array
y=np.array([[3,1],[5,1]],dtype=np.float64) # define 2x2 float array
s = x + y                            # elementwise sum
s = np.add(x,y)                      # same as x + y
s = np.subtract(x,y)                 # same as x - y
s = np.multiply(x,y)                 # same as x * y, no matrix multiplication!
s = np.divide(x,y)                   # same as x / y
s = np.sqrt(x)                       # elementwise functions: np.sqrt, np.exp, ...
p = x @ y                            # matrix product, same as np.dot(x, y)
np.sum(x, axis=0)                    # sum of each column
np.sum(x, axis=1)                    # sum of each row
xT = x.T                             # transpose of x
x = np.linspace(0,2*np.pi,100)       # 100 equally spaced points from 0 to 2 pi
r = np.random.default_rng(seed=42)   # create a random number generator
b = r.random((2,3))                  # random 2x3 matrix
```
  192. \normalsize
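##

Since the elementwise product and the matrix product are easily confused, here is a small sketch that works both out by hand for the two 2x2 arrays used above. The names `elementwise` and `matrix` are assumptions of this example.

\footnotesize
```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[3, 1], [5, 1]], dtype=np.float64)

# elementwise product: [[1*3, 2*1], [3*5, 4*1]]
elementwise = x * y

# matrix product: [[1*3+2*5, 1*1+2*1], [3*3+4*5, 3*1+4*1]]
matrix = x @ y

print(elementwise)         # [[ 3.  2.] [15.  4.]]
print(matrix)              # [[13.  3.] [29.  7.]]
print(np.sum(x, axis=0))   # [4. 6.]  sum of each column
print(np.sum(x, axis=1))   # [3. 7.]  sum of each row
```
\normalsize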
  193. ##
  194. * broadcasting in numpy
  195. \vspace{0.4cm}
  196. The term \textcolor{blue}{broadcasting} describes how numpy treats arrays
  197. with different shapes during arithmetic operations
  198. * add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  199. $[b,b,b]$
  200. \vspace{0.2cm}
  201. * add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  202. $\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element wise
  203. \vspace{0.2cm}
  204. * add 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ 1D array is broadcast
  205. across each row of the 2D array $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added element wise
  206. \vspace{0.2cm}
Arithmetic operations can only be performed when, for each dimension, the sizes of
the two arrays are equal or one of them is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details
  210. \footnotesize
  211. ```python
  212. # Add a vector to each row of a matrix
  213. x = np.array([[1,2,3], [4,5,6]]) # x has shape (2, 3)
  214. v = np.array([1,2,3]) # v has shape (3,)
  215. x + v # [[2 4 6]
  216. # [5 7 9]]
  217. ```
  218. \normalsize
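##

The broadcasting rule can be illustrated with one compatible row vector, one compatible column vector and one incompatible shape. This is a sketch under the assumption of small made-up arrays; `row`, `col` and `compatible` are names introduced only here.

\footnotesize
```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,)   -> added to each row
col = np.array([[100], [200]])         # shape (2, 1) -> added to each column

print(x + row)     # [[11 22 33] [14 25 36]]
print(x + col)     # [[101 102 103] [204 205 206]]

# shapes (2, 3) and (2,) do not match: the trailing sizes 3 and 2 differ
try:
    x + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```
\normalsize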
  219. ## Plot data
  220. A popular library to present data is the `pyplot` module of `matplotlib`.
  221. * Drawing a function in one plot
  222. \footnotesize
  223. ::: columns
  224. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine^2')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2]) # limit the plot range
# show the plot
plt.show()
```
  241. ::::
  242. :::: {.column width=40%}
  243. ![](figures/matplotlib_Figure_1.png)
  244. ::::
  245. :::
  246. \normalsize
  247. ##
  248. * Drawing a scatter plot of data
  249. \footnotesize
  250. ::: columns
  251. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create x,y data points
num = 75
x = np.arange(num)
y = x + np.random.randint(0,int(num/1.5),num)
z = - (x + np.random.randint(0,num//3,num)) + num
# create colored scatter plot, sample 1
plt.scatter(x, y, color = 'green',
            label='Sample 1')
# create colored scatter plot, sample 2
plt.scatter(x, z, color = 'orange',
            label='Sample 2')
plt.title('scatter plot')
plt.xlabel('x')
plt.ylabel('y')
# description and plot
plt.legend()
plt.show()
```
  272. ::::
  273. :::: {.column width=35%}
  274. \vspace{3cm}
  275. ![](figures/matplotlib_Figure_6.png)
  276. ::::
  277. :::
  278. \normalsize
  279. ##
  280. * Drawing a histogram of data
  281. \footnotesize
  282. ::: columns
  283. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# draw 10000 samples from a standard normal distribution
g = np.random.normal(size=10000)
# the following hist calls are alternatives, use one at a time
# histogram the data
plt.hist(g,bins=40)
# plot rotated histogram
plt.hist(g,bins=40,orientation='horizontal')
# normalize area to 1
plt.hist(g,bins=40,density=True)
# change color
plt.hist(g,bins=40,density=True,
         edgecolor='lightgreen',color='orange')
plt.title('Gaussian Histogram')
plt.xlabel('bin')
plt.ylabel('entries')
# description and plot
plt.legend(['Normalized distribution'])
plt.show()
```
  304. ::::
  305. :::: {.column width=35%}
  306. \vspace{3.5cm}
  307. ![](figures/matplotlib_Figure_5.png)
  308. ::::
  309. :::
  310. \normalsize
  311. ##
  312. * Drawing subplots in one canvas
  313. \footnotesize
  314. ::: columns
  315. :::: {.column width=35%}
```python
...   # x and f as defined for the first plot
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create first subplot
plt.subplot(1,2,1)
# plot first one
plt.title('exp(-0.2 x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create subplot and plot second one
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine^2'])
# show the plot
plt.show()
```
  337. ::::
  338. :::: {.column width=40%}
  339. \vspace{3cm}
  340. ![](figures/matplotlib_Figure_2.png)
  341. ::::
  342. :::
  343. \normalsize
  344. ## Image data
The `image` module of the `matplotlib` library can be used to load an image
into a numpy array and to render the image.

* There are 3 common formats for the numpy array
    * (M, N) scalar data used for greyscale images
    * (M, N, 3) for RGB images (each pixel has an array with RGB color attached)
    * (M, N, 4) for RGBA images (each pixel has an array with RGB color
    and transparency attached)

The function `imread` loads the image into an `ndarray`, which can be
manipulated.

The function `imshow` renders the image data
  355. \vspace {2cm}
  356. ##
  357. * Drawing pixel data and images
  358. \footnotesize
  359. ::: columns
  360. :::: {.column width=50%}
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# create data array with pixel position and RGB color code
width, height = 200, 200
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[75:125, 75:125] = [255, 0, 0]
rows = np.random.randint(0,height,100)   # random row indices
cols = np.random.randint(0,width,100)    # random column indices
data[rows,cols] = [0,255,0]              # up to 100 random green pixels
plt.imshow(data)
plt.show()
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0]    # grab slice 0 of the colors
plt.imshow(mod_pic)     # single channel, shown with the default colormap
plt.colorbar()          # try cmap='hot' in imshow
plt.show()
```
  382. ::::
  383. :::: {.column width=25%}
  384. ![](figures/matplotlib_Figure_3.png)
  385. \vspace{1cm}
  386. ![](figures/matplotlib_Figure_4.png)
  387. ::::
  388. :::
  389. \normalsize
  390. ## Input / output
For the analysis of measured data, efficient input / output plays an
important role. In numpy, `ndarray`s can be saved to and read from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension),
which contain data, shape, dtype and other information required to
reconstruct the `ndarray` from the file.
\footnotesize
```python
import numpy as np
r = np.random.default_rng()       # instantiate random number generator
a = r.random((4,3))               # random 4x3 array
np.save('myBinary.npy', a)        # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b)  # write a and b to one uncompressed .npz archive
                                  # (np.savez_compressed compresses the data)
# ...
b = np.load('myBinary.npy')       # read content of myBinary.npy into b
```
\normalsize
The storage and retrieval of array data in text file format is done
with the `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.
\footnotesize
```python
x = np.array([1,2,3,4,5,6,7])     # create ndarray
np.savetxt('myText.txt',x,fmt='%d', delimiter=',') # write array x to file myText.txt
                                                   # with comma separation
```
  416. \normalsize
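##

A complete write and read cycle for the three file types can be sketched as follows. The example writes into a temporary directory so that no files are left behind; the variable names ending in `_back` are assumptions of this illustration.

\footnotesize
```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(seed=1)
a = rng.random((4, 3))             # random 4x3 float array
b = np.arange(12)                  # integers 0 - 11

with tempfile.TemporaryDirectory() as tmp:
    # binary .npy file: one array per file
    npy_file = os.path.join(tmp, "myBinary.npy")
    np.save(npy_file, a)
    a_back = np.load(npy_file)

    # compressed .npz archive: several named arrays in one file
    npz_file = os.path.join(tmp, "myComp.npz")
    np.savez_compressed(npz_file, a=a, b=b)
    with np.load(npz_file) as archive:
        b_back = archive["b"]

    # text file with comma separation
    txt_file = os.path.join(tmp, "myText.txt")
    np.savetxt(txt_file, b.reshape(3, 4), fmt="%d", delimiter=",")
    t_back = np.loadtxt(txt_file, dtype=int, delimiter=",")

print(np.array_equal(a, a_back), np.array_equal(b, b_back), t_back.shape)
```
\normalsize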
  417. ## Input / output
  418. Import tabular data from table processing programs in office packages.
  419. \vspace{0.4cm}
  420. \footnotesize
  421. ::: columns
  422. :::: {.column width=35%}
  423. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  424. delimiter.
  425. ::::
  426. :::: {.column width=35%}
  427. ![](figures/numpy_excel.png)
  428. ::::
  429. :::
  430. \footnotesize
```python
import numpy as np
# read content of file myData_01.csv into data
data = np.loadtxt('myData_01.csv',dtype=int,delimiter=',')
print (data.shape)     # (12, 9)
print (data)           # [[1 1 1 1 0 0 0 0 0]
                       #  [0 0 1 1 0 0 1 1 0]
                       #  .....
                       #  [0 0 0 0 1 1 1 1 1]]
```
  441. \normalsize
  442. ## Input / output
  443. Import tabular data from table processing programs in office packages.
  444. \vspace{0.4cm}
  445. \footnotesize
  446. ::: columns
  447. :::: {.column width=35%}
  448. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  449. delimiter. \newline
  450. $\color{blue}{Often~many~files~are~available~(myData\_*.csv)}$
  451. ::::
  452. :::: {.column width=35%}
  453. ![](figures/numpy_multi_excel.png)
  454. ::::
  455. :::
  456. \footnotesize
```python
import numpy as np
# find files and directories with names matching a pattern
import glob
# read content of all files myData_*.csv into data
file_list = sorted(glob.glob('myData_*.csv')) # generate a sorted file list
for filename in file_list:
    data = np.loadtxt(fname=filename, dtype=int, delimiter=',')
    print(data[:,3])          # print column 3 of each file
                              # [1 1 1 1 1 1 1 1 1 1 1 0]
                              # ......
                              # [0 1 0 1 0 1 0 1 0 1 0 1]
```
  470. \normalsize
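##

When the tables of all files have the same shape, they can be combined into one array. The following sketch first creates three small example files in a temporary directory (their content is made up for this illustration) and then stacks them with `np.stack`.

\footnotesize
```python
import glob
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # write three example tables, file k is filled with the value k
    for k in range(3):
        table = np.full((2, 4), k)
        name = os.path.join(tmp, f"myData_{k:02d}.csv")
        np.savetxt(name, table, fmt="%d", delimiter=",")

    # read all files in sorted order
    file_list = sorted(glob.glob(os.path.join(tmp, "myData_*.csv")))
    tables = [np.loadtxt(f, dtype=int, delimiter=",") for f in file_list]

data = np.stack(tables)    # new first axis numbers the files
print(data.shape)          # (3, 2, 4)
print(data[:, :, 3])       # column 3 of each file
```
\normalsize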
  471. ## Exercise 1
i) Display a numpy array as figure of a blue cross. The size should be 200
by 200 pixels. Use as array format (M, N, 3), where the first 2 specify
the pixel positions and the last 3 the RGB color from 0:255.
- Draw in addition a red square of arbitrary position into the figure.
- Draw a circle in the center of the figure. Try to create a mask which
selects the inner part of the circle using the indexing.
  478. \small
  479. [Solution: 01_intro_ex_1a_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/solutions/01_intro_ex_1a_sol.ipynb) \normalsize
  480. ii) Read data which contains pixels from the binary file horse.py into a
  481. numpy array. Display the data and the following transformations in 4
  482. subplots: scaling and translation, compression in x and y, rotation
  483. and mirroring.
  484. \small
  485. [Solution: 01_intro_ex_1b_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2022/ml/solutions/01_intro_ex_1b_sol.ipynb) \normalsize
  486. ## Pandas
  487. [\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in python for
  488. \textcolor{blue}{data manipulation and analysis}.
  489. \vspace{0.4cm}
  490. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  491. * Offers data structures and operations for manipulating numerical tables with
  492. integrated indexing
  493. * Imports data from various file formats, e.g. comma-separated values, JSON,
  494. SQL or Excel
* Tools for reading and writing data structures, allows analyzing, filtering,
splitting, grouping and aggregating, merging and joining and plotting
  497. * Built on top of `NumPy`
  498. * Visualize the data with `matplotlib`
  499. * Most machine learning tools support `pandas` $\rightarrow$
  500. it is widely used to preprocess data sets for analysis and machine learning
  501. in various scientific fields
  502. ## Pandas micro introduction
  503. Goal: Exploring, cleaning, transforming, and visualization of data.
  504. The basic indexable objects are
  505. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* `Series` -> vector (list) of data elements of arbitrary type
* `DataFrame` -> tabular arrangement of data elements of column wise
arbitrary type

Both allow cleaning data by removing `empty` or `nan` data entries
  510. \footnotesize
```python
import numpy as np
import pandas as pd                       # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8])    # create a Series of float64 (NaN is a float)
r = pd.Series(np.random.randn(4))         # Series of random numbers float64
dates = pd.date_range("20130101", periods=3) # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print (df)                                # print the DataFrame
#                    A         B         C         D
# 2013-01-01  1.618395  1.210263 -1.276586 -0.775545
# 2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
# 2013-01-03 -0.359081  0.296019  1.541571  0.235337
new_s = s.dropna()   # return a new Series without the NaN entries
```
  525. \normalsize
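##

How `dropna()` acts on a `Series` and on a `DataFrame` can be sketched with a tiny table. The column names `A` and `B`, the values and the result names are assumptions made for this example only.

\footnotesize
```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [4.0, 5.0, 6.0]})

s_clean = s.dropna()              # Series without the NaN entry
rows_clean = df.dropna()          # drop all rows containing NaN
cols_clean = df.dropna(axis=1)    # drop all columns containing NaN
filled = df.fillna(0.0)           # alternative: replace NaN by a value

print(len(s), len(s_clean))       # 6 5
print(rows_clean.shape)           # (2, 2)
print(list(cols_clean.columns))   # ['B']
```
\normalsize

All three calls return new objects; the original `s` and `df` still contain the `NaN` entries.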
  526. ##
  527. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* pandas data can be saved in different file formats (CSV, JSON, html, XML,
Excel, OpenDocument, HDF5 format, .....). `NaN` entries are kept
in the output file, except if they are removed with `dataframe.dropna()`
  531. * csv file
  532. \footnotesize
  533. ```python
  534. df.to_csv("myFile.csv") # Write the DataFrame df to a csv file
  535. ```
  536. \normalsize
  537. * HDF5 output
  538. \footnotesize
  539. ```python
  540. df.to_hdf("myFile.h5",key='df',mode='w') # Write the DataFrame df to HDF5
  541. s.to_hdf("myFile.h5", key='s',mode='a')
  542. ```
  543. \normalsize
  544. * Writing to an excel file
  545. \footnotesize
  546. ```python
  547. df.to_excel("myFile.xlsx", sheet_name="Sheet1")
  548. ```
  549. \normalsize
  550. * Deleting file with data in python
  551. \footnotesize
  552. ```python
  553. import os
  554. os.remove('myFile.h5')
  555. ```
  556. \normalsize
  557. ##
  558. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  559. * read in data from various formats
  560. * csv file
  561. \footnotesize
```python
import pandas as pd
df = pd.read_csv('heart.csv')   # read csv data table
df.info()                       # prints a summary of the DataFrame
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 303 entries, 0 to 302
# Data columns (total 14 columns):
#  #   Column   Non-Null Count  Dtype
# ---  ------   --------------  -----
#  0   age      303 non-null    int64
#  1   sex      303 non-null    int64
#  2   cp       303 non-null    int64
print(df.head(5))       # prints the first 5 rows of the data table
print(df.describe())    # shows a quick statistic summary of your data
```
  577. \normalsize
  578. * Reading an excel file
  579. \footnotesize
  580. ```python
  581. df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
  582. ```
  583. \normalsize
  584. \textcolor{olive}{There are many options specifying details for IO.}
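##

Reading and writing csv data can be tried without any file on disk by using an in-memory text buffer. This is a sketch with a made-up three row table; the values are not taken from `heart.csv`, and the marker `NA` is one of the strings that `read_csv` treats as missing by default.

\footnotesize
```python
import io
import pandas as pd

# made-up csv content with one missing value
csv_text = "age,sex,chol\n63,1,233\n37,1,NA\n41,0,204\n"

df = pd.read_csv(io.StringIO(csv_text))   # NA is read as NaN
print(df.dtypes)                          # chol becomes float64 due to NaN

# write the table to a buffer and read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)            # index=False: do not write row labels
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_back.equals(df))                 # True
```
\normalsize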
  585. ##
  586. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  587. * Various functions exist to select and view data from pandas objects
  588. * Display column and index
  589. \footnotesize
```python
df.index       # show datetime index of df
# DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
#               dtype='datetime64[ns]',freq='D')
df.columns     # show columns info
# Index(['A', 'B', 'C', 'D'], dtype='object')
```
  597. \normalsize
  598. * `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data
  599. \footnotesize
```python
df.to_numpy()  # one dtype for the entire array, not per column!
# [[-0.62660101 -0.67330526  0.23269168 -0.67403546]
#  [-0.53033339  0.32872063 -0.09893568  0.44814084]
#  [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
  606. \normalsize
  607. Does not include the index or column labels in the output
  608. * more on viewing
  609. \footnotesize
```python
df.T                                   # transpose the DataFrame df
df.sort_values(by="B")                 # sort rows by the values of column B
df.sort_index(axis=0)                  # sort rows by ascending index values
df.sort_index(axis=0,ascending=False)  # sort rows by descending index values
```
  616. \normalsize
  617. ##
  618. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  619. * Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions
  620. * get a named column as a Series
  621. \footnotesize
```python
df["A"]            # selects column A from df as a Series, similar to df.A
df.iloc[:, 0:1]    # slices column A by position, stays a DataFrame like df.loc[:, ["A"]]
```
  626. \normalsize
  627. * select rows of a DataFrame
  628. \footnotesize
```python
df[0:2]                      # selects row 0 and 1 from df
df["20130102":"20130103"]    # use indices, endpoints are included!
df.iloc[2]                   # select the row at integer position 2
df.iloc[1:3, :]              # selects row 1 and 2 from df
```
  635. \normalsize
  636. * select by label
  637. \footnotesize
  638. ```python
  639. df.loc["20130102":"20130103",["C","D"]] # selects row 1 and 2 and only C and D
  640. df.loc[dates[0], "A"] # selects a single value (scalar)
  641. ```
  642. \normalsize
  643. * select by lists of integer position (as in `NumPy`)
  644. \footnotesize
```python
df.iloc[[0, 2], [1, 3]]   # select rows at position 0 and 2 and col B and D
df.iloc[1, 1]             # get a single value explicitly (scalar, no labels)
```
  649. \normalsize
  650. * select according to expressions
  651. \footnotesize
  652. ```python
  653. df.query('B<C') # select rows where B < C
  654. df1=df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
  655. ```
  656. \normalsize
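##

The selection methods above can be compared on a small `DataFrame` with known content. This is a sketch that fills the frame with the integers 0 to 11 instead of random numbers, so that the results can be checked by eye; the result names are assumptions of this example.

\footnotesize
```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=dates, columns=list("ABCD"))
#             A  B   C   D
# 2013-01-01  0  1   2   3
# 2013-01-02  4  5   6   7
# 2013-01-03  8  9  10  11

col_series = df["A"]              # one column -> Series (dimension reduced)
col_frame = df.loc[:, ["A"]]      # list of labels -> DataFrame (dimension kept)

# label slices include the endpoint, position slices exclude it
by_label = df.loc["20130102":"20130103", ["C", "D"]]
by_pos = df.iloc[1:3, 2:4]

value = df.loc[dates[0], "A"]     # single value (scalar)

print(type(col_series).__name__, col_frame.shape)   # Series (3, 1)
print(by_label.equals(by_pos))                      # True
print(df.query("B < C").shape)                      # (3, 4)
```
\normalsize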
  657. ##
  658. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  659. * Selecting data of pandas objects continued
  660. * Boolean indexing
  661. \footnotesize
```python
df[df["A"] > 0]   # select the rows of df where the value in column A is >0
df[df > 0]        # keep values >0 from the entire DataFrame, others become NaN
```
  666. \normalsize
  667. more complex example
  668. \footnotesize
  669. ```python
  670. df2 = df.copy() # copy df
  671. df2["E"] = ["eight","one","four"] # add column E
  672. df2[df2["E"].isin(["two", "four"])] # test if elements "two" and "four" are
  673. # contained in Series column E
  674. ```
  675. \normalsize
  676. * Operations (in general exclude missing data)
  677. \footnotesize
```python
df3 = df.copy()                        # numeric copy of df (no string column)
df3[df3 > 0] = -df3                    # All elements > 0 change sign
df.mean(0)                             # get column wise mean (numbers=axis)
df.mean(1)                             # get row wise mean
df.std(0)                              # standard deviation according to axis
df.cumsum()                            # cumulative sum of each column
df.apply(np.sin)                       # apply function to each element of df
df.apply(lambda x: x.max() - x.min())  # apply lambda function column wise
df + 10                                # add scalar 10
df - [1, 2, 10 , 100]                  # subtract values of each column
df.corr()                              # Compute pairwise correlation of columns
```
  690. \normalsize
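##

Boolean indexing and the column wise operations can be verified on a table with a few fixed numbers. The following is a sketch with made-up values; the names `rows`, `masked`, `flipped`, `row_mean` and `spread` are introduced only for this example.

\footnotesize
```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0],
                   "B": [-4.0, 5.0, -6.0]})

rows = df[df["A"] > 0]      # keeps rows 0 and 2
masked = df[df > 0]         # same shape, NaN where the condition fails

flipped = df.copy()
flipped[flipped > 0] = -flipped     # all elements > 0 change sign

row_mean = df.mean(1)               # row wise mean
spread = df.apply(lambda x: x.max() - x.min())   # column wise max - min

print(rows.index.tolist())          # [0, 2]
print(masked.isna().sum().sum())    # 3
print(row_mean.tolist())            # [-1.5, 1.5, -1.5]
print(spread.tolist())              # [5.0, 11.0]
```
\normalsize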
  691. ## Pandas - plotting data
[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Only 2 examples are shown here
* Plot random data in a histogram and a scatter plot
  694. \footnotesize
  695. ```python
  696. # create DataFrame with random normal distributed data
  697. df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
  698. df = df + [1, 3, 8 , 10] # shift column wise mean by 1, 3, 8 , 10
  699. df.plot.hist(bins=20) # histogram all 4 columns
  700. g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
  701. df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
  702. ```
  703. \normalsize
  704. ::: columns
  705. :::: {.column width=35%}
  706. ![](figures/pandas_histogramm.png)
  707. ::::
  708. :::: {.column width=35%}
  709. ![](figures/pandas_scatterplot.png)
  710. ::::
  711. :::
  712. ## Pandas - plotting data
The function `crosstab()` takes one or more array-like objects as indexes or
columns and constructs a new DataFrame, which counts how often each
combination of values occurs in the inputs
\footnotesize
```python
import numpy as np
import pandas as pd
df = pd.DataFrame(                    # create DataFrame of 2 categories
    {"sex": np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    } )
pd.crosstab(df.sex,df.heart)          # create cross table of possibilities
pd.crosstab(df.sex,df.heart).plot(kind="bar",color=['red','blue']) # plot counts
```
  724. \normalsize
  725. ::: columns
  726. :::: {.column width=38%}
  727. ![](figures/pandas_crosstabplot.png)
  728. ::::
  729. :::
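##

For the two category columns used above, the cross table can be worked out by counting the pairs by hand. This sketch prints the counts and, as an addition not shown on the slides, the row wise fractions obtained with the `normalize="index"` option of `pd.crosstab`.

\footnotesize
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"sex": np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]),
     "heart": np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1])})

counts = pd.crosstab(df.sex, df.heart)
print(counts)
# heart  0  1
# sex
# 0      3  4
# 1      1  3

# fraction of each heart value within one sex value (rows sum to 1)
fractions = pd.crosstab(df.sex, df.heart, normalize="index")
print(fractions.loc[1, 1])    # 0.75
```
\normalsize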
  730. ## Exercise 2
  731. Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/heart.csv) into a DataFrame.
  732. [\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)
  733. \setbeamertemplate{itemize item}{\color{red}$\square$}
* Which columns do we have?
* Print the first 3 rows
* Print the statistics summary and the correlations
* Print mean values for each column with and without disease (target)
* Select the data according to `sex` and `target` (heart disease 0=no 1=yes).
* Plot the `age` distribution of male and female in one histogram
* Plot the heart disease distribution according to chest pain type `cp`
* Plot `thalach` according to `target` in one histogram
* Plot `sex` and `target` in a histogram figure
* Correlate `age` and `max heart rate` according to `target`
* Correlate `age` and `cholesterol` according to `target`
  745. \small
  746. [Solution: 01_intro_ex_2_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/solutions/01_intro_ex_2_sol.ipynb) \normalsize