Machine learning course held as part of the Studierendentage in the summer semester (SS) 2023
  1. % Introduction to Data Analysis and Machine Learning in Physics: \ 1. Introduction to python
  2. % Day 1: 11. April 2023
  3. % \underline{Jörg Marks}, Klaus Reygers
  4. ## Outline of the $1^{st}$ day
  5. * Technical instructions for your interactions with the CIP pool, like
  6. * using the jupyter hub
  7. * using python locally in your own linux environment (anaconda)
  8. * access the CIP pool from your own windows or linux system
  9. * transfer data from and to the CIP pool
  10. Can be found in [\textcolor{violet}{CIPpoolAccess.PDF}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/Folien/CIPpoolAccess.PDF)\normalsize
  11. * Summary of NumPy
  12. * Plotting with matplotlib
  13. * Input / output of data
  14. * Summary of pandas
  15. * Fitting with iminuit and PyROOT
## A glimpse into python packages
The following python packages are important for \textcolor{red}{data analysis and machine
learning} and will be useful during the course
  19. * [\textcolor{violet}{NumPy}](https://numpy.org/doc/stable/user/basics.html) - python library adding support for large,
  20. multi-dimensional arrays and matrices, along with high-level
  21. mathematical functions to operate on these arrays
  22. * [\textcolor{violet}{matplotlib}](https://matplotlib.org/stable/tutorials/index.html) - a python plotting library
* [\textcolor{violet}{SciPy}](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) - extension of NumPy by a collection of
mathematical algorithms for minimization, regression,
Fourier transformation, linear algebra and image processing
* [\textcolor{violet}{iminuit}](https://iminuit.readthedocs.io/en/stable/) -
python wrapper to the data fitting toolkit
[\textcolor{violet}{Minuit2}](https://root.cern.ch/doc/master/Minuit2Page.html)
developed at CERN by F. James in the 1970s
* [\textcolor{violet}{PyROOT}](https://root.cern/manual/python/) - python wrapper to the C++ data analysis toolkit
ROOT [\textcolor{violet}{(lecture WS 2021 / 22)}](https://www.physi.uni-heidelberg.de/~marks/root_einfuehrung/) used at the LHC
* [\textcolor{violet}{scikit-learn}](https://scikit-learn.org/stable/) - machine learning library written in
python, which makes extensive use of NumPy for high-performance
linear algebra algorithms
* [\textcolor{violet}{TensorFlow}](https://www.tensorflow.org/) - machine learning library with Keras as python interface
  36. ## NumPy
  37. \textcolor{blue}{NumPy} (Numerical Python) is an open source python library,
  38. which contains multidimensional array and matrix data structures and methods
  39. to efficiently operate on these. The core object is
  40. a homogeneous n-dimensional array object, \textcolor{blue}{ndarray}, which
  41. allows for a wide variety of \textcolor{blue}{fast operations and mathematical calculations
  42. with arrays and matrices} due to the extensive usage of compiled code.
* It is heavily used in numerous scientific python packages
* `ndarray`s have a fixed size at creation $\rightarrow$ changing the size
creates a new array
* Array elements are all required to be of the same data type
* Facilitates advanced mathematical operations on large datasets
* For a summary see, e.g.
  49. \small
  50. [\textcolor{violet}{https://cs231n.github.io/python-numpy-tutorial/\#numpy}](https://cs231n.github.io/python-numpy-tutorial/#numpy) \normalsize
  51. \vfill
  52. ::: columns
  53. :::: {.column width=30%}
  54. ::::
  55. :::
  56. ::: columns
  57. :::: {.column width=35%}
  58. `c = []`
  59. `for i in range(len(a)):`
  60.     `c.append(a[i]*b[i])`
  61. ::::
  62. :::: {.column width=35%}
  63. with NumPy
  64. `c = a * b`
  65. ::::
  66. :::
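As a minimal sketch of the comparison above, the loop and the vectorized NumPy expression can be run side by side. The input values of `a` and `b` and the names `c_loop` and `c_numpy` are assumptions chosen only for illustration.
\footnotesize
```python
import numpy as np

# Hypothetical input data, chosen only for illustration
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Plain python: explicit loop over the elements
c_loop = []
for i in range(len(a)):
    c_loop.append(a[i] * b[i])

# NumPy: one vectorized expression on ndarrays
c_numpy = np.array(a) * np.array(b)

print(c_loop)                          # [4.0, 10.0, 18.0]
print(np.array_equal(c_numpy, c_loop)) # True
```
\normalsize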
  67. <!---
  68. It seem we need to indent by hand.
  69. I don't manage to align under the bullet text
  70. If we do it with column the vertical space is with code sections not good
  71. If we do it without code section the vertical space is ok, but there is no
  72. code high lightning.
  73. See the different versions of the same page in the following
  74. -->
  75. ## NumPy - array basics (1)
* A numpy array is an indexed grid of values which all have the \textcolor{blue}{same type}.
The *rank* is the number of dimensions of the array.
There are methods to create and preset arrays.
  79. \footnotesize
```python
import numpy as np
myA = np.array([12, 5, 11])        # create rank 1 array (vector like)
type(myA)                          # <class 'numpy.ndarray'>
myA.shape                          # (3,)
print(myA[2])                      # 11, access 3rd element
myA[0] = 12                        # set 1st element to 12
myB = np.array([[1,5],[7,9]])      # create rank 2 array
myB.shape                          # (2,2) (rows,columns)
print(myB[0,0],myB[0,1],myB[1,1])  # 1 5 9
myC = np.arange(6)                 # create rank 1 array set to 0 - 5
myC = myC.reshape(2,3)             # reshape returns a (2,3) array,
                                   # it does not modify myC in place
zero = np.zeros((2,5))             # 2 rows, 5 columns, set to 0
one = np.ones((2,2))               # 2 rows, 2 columns, set to 1
five = np.full((2,2), 5)           # 2 rows, 2 columns, set to 5
e = np.eye(2)                      # create 2x2 identity matrix
```
  96. \normalsize
  97. ## NumPy - array basics (2)
  98. * Similar to a coordinate system numpy arrays also have \textcolor{blue}{axes}. numpy operations
  99. can be performed along these axes.
  100. \footnotesize
  101. ::: columns
  102. :::: {.column width=35%}
```python
# 2D arrays
five = np.full((2,3), 5)    # 2 rows, 3 columns, set to 5
seven = np.full((2,3), 7)   # 2 rows, 3 columns, set to 7
np.concatenate((five,seven), axis = 0)  # stack rows: 4 x 3 array
np.concatenate((five,seven), axis = 1)  # stack columns: 2 x 6 array
# 1D array
one = np.array([1, 1, 1])   # 1D array of shape (3,), set to 1
four = np.array([4, 4, 4])  # 1D array of shape (3,), set to 4
np.concatenate((one,four), axis = 0)    # 1D arrays only have axis 0: shape (6,)
```
  114. ::::
  115. :::: {.column width=50%}
  116. \vspace{3cm}
  117. ![](figures/numpy_axes.png)
  118. ::::
  119. :::
  120. \normalsize
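The effect of the axis argument is easiest to check through the resulting shapes. The following sketch reuses the arrays `five` and `seven` from above; the names `rows_stacked` and `cols_stacked` are introduced here only for illustration.
\footnotesize
```python
import numpy as np

five = np.full((2, 3), 5)
seven = np.full((2, 3), 7)

rows_stacked = np.concatenate((five, seven), axis=0)  # append along the rows
cols_stacked = np.concatenate((five, seven), axis=1)  # append along the columns
print(rows_stacked.shape)   # (4, 3)
print(cols_stacked.shape)   # (2, 6)

# A reduction removes the axis it runs along
print(rows_stacked.sum(axis=0).tolist())   # [24, 24, 24]
print(cols_stacked.sum(axis=1).tolist())   # [36, 36]
```
\normalsize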
  121. ## NumPy - array indexing (1)
  122. * select slices of a numpy array
  123. \footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],       # 3 rows 4 columns array
              [9,10,11,12]])
b = a[:2, 1:3]                 # subarray of rows 0 and 1
                               # and columns 1 and 2
# b is array([[2, 3],
#             [6, 7]])
```
  132. \normalsize
  133. * a slice of an array points into the same data, *modifying* changes the original array!
  134. \footnotesize
  135. ```python
  136. b[0, 0] = 77 # b[0,0] and a[0,1] are 77
  137. r1_row = a[1, :] # get 2nd row -> rank 1
  138. r1_row.shape # (4,)
  139. r2_row = a[1:2, :] # get 2nd row -> rank 2
  140. r2_row.shape # (1,4)
  141. a=np.array([[1,2],[3,4],[5,6]]) # set a , 3 rows 2 cols
  142. d=a[[0, 1, 2], [0, 1, 1]] # d contains [1 4 6]
  143. e=a[[1, 2], [1, 1]] # e contains [4 6]
  144. np.array([a[0,0],a[1,1],a[2,0]]) # address elements explicitly
  145. ```
  146. \normalsize
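Whether a selection shares its data with the original array can be checked directly. The sketch below assumes the same 3 x 4 array as above and uses `np.shares_memory` together with an explicit `.copy()`; the names `view` and `copy` are illustrative.
\footnotesize
```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)   # 3 rows, 4 columns, values 1..12

view = a[:2, 1:3]           # basic slicing returns a view on the data of a
copy = a[:2, 1:3].copy()    # an explicit copy owns its data

view[0, 0] = 77             # also changes a[0, 1]
copy[0, 0] = -1             # leaves a untouched

print(a[0, 1])                     # 77
print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False
```
\normalsize
Use `.copy()` whenever a selection should be modified without touching the original array.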
  147. ## NumPy - array indexing (2)
  148. * integer array indexing by setting an array of indices $\rightarrow$ selecting/changing elements
  149. \footnotesize
```python
a = np.array([[1,2,3,4],
              [5,6,7,8],         # 3 rows 4 columns array
              [9,10,11,12]])
p_a = np.array([0,2,0])          # create an array of indices
s = a[np.arange(3), p_a]         # number the rows, p_a points to cols
print(s)                         # s contains [1 7 9]
a[np.arange(3),p_a] += 10        # add 10 to corresponding elements
x = np.array([[8,2],[7,4]])      # create 2x2 array
mask = (x > 5)                   # mask: array of booleans
                                 # [[True False]
                                 #  [True False]]
print(x[mask])                   # select elements, prints [8 7]
```
  164. \normalsize
  165. * data type in numpy - create according to input numbers or set explicitly
  166. \footnotesize
```python
x = np.array([1.1, 2.1])                 # create float array
print(x.dtype)                           # prints float64
y = np.array([1.1,2.9],dtype=np.int64)   # create integer array [1 2]
```
  172. \normalsize
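Converting floats to an integer type cuts off the fractional part instead of rounding. A short sketch with illustrative values, using `astype` and `np.rint`:
\footnotesize
```python
import numpy as np

x = np.array([1.1, 2.9, -1.7])           # illustrative values
truncated = x.astype(np.int64)           # cuts off the fractional part
rounded = np.rint(x).astype(np.int64)    # round to the nearest integer first

print(truncated.tolist())   # [1, 2, -1]
print(rounded.tolist())     # [1, 3, -2]
```
\normalsize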
  173. ## NumPy - functions
* math functions operate elementwise, either through overloaded operators or as functions
\footnotesize
```python
x = np.array([[1,2],[3,4]],dtype=np.float64)  # define 2x2 float array
y = np.array([[3,1],[5,1]],dtype=np.float64)  # define 2x2 float array
s = x + y                        # elementwise sum
s = np.add(x,y)                  # same as x + y
s = np.subtract(x,y)
s = np.multiply(x,y)             # elementwise, no matrix multiplication!
s = np.divide(x,y)
s = np.sqrt(x)                   # also np.exp(x), np.sin(x), ...
m = x @ y                        # matrix product, same as np.dot(x, y)
np.sum(x, axis=0)                # sum of each column
np.sum(x, axis=1)                # sum of each row
xT = x.T                         # transpose of x
x = np.linspace(0,2*np.pi,100)   # 100 equally spaced points from 0 to 2 pi
r = np.random.default_rng(seed=42)   # create a random number generator
b = r.random((2,3))              # random 2x3 matrix
```
  193. \normalsize
  194. ##
  195. * broadcasting in numpy
  196. \vspace{0.4cm}
  197. The term \textcolor{blue}{broadcasting} describes how numpy treats arrays
  198. with different shapes during arithmetic operations
  199. * add a scalar $b$ to a 1D array $a = [a_1,a_2,a_3]$ $\rightarrow$ expand $b$ to
  200. $[b,b,b]$
  201. \vspace{0.2cm}
  202. * add a scalar $b$ to a 2D [2,3] array $a =[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$
  203. $\rightarrow$ expand $b$ to $b =[[b,b,b],[b,b,b]]$ and add element wise
  204. \vspace{0.2cm}
  205. * add 1D array $b = [b_1,b_2,b_3]$ to a 2D [2,3] array $a=[[a_{11},a_{12},a_{13}],[a_{21},a_{22},a_{23}]]$ $\rightarrow$ 1D array is broadcast
  206. across each row of the 2D array $b =[[b_1,b_2,b_3],[b_1,b_2,b_3]]$ and added element wise
  207. \vspace{0.2cm}
Arithmetic operations between two arrays are only possible if, comparing the
shapes dimension by dimension starting from the last one, the sizes are equal
or one of them is 1. Look
[\textcolor{violet}{here}](https://numpy.org/doc/stable/user/basics.broadcasting.html) for more details
  211. \footnotesize
  212. ```python
  213. # Add a vector to each row of a matrix
  214. x = np.array([[1,2,3], [4,5,6]]) # x has shape (2, 3)
  215. v = np.array([1,2,3]) # v has shape (3,)
  216. x + v # [[2 4 6]
  217. # [5 7 9]]
  218. ```
  219. \normalsize
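The compatibility rule can be tried out on shapes alone. The following sketch assumes NumPy 1.20 or newer for `np.broadcast_shapes`; the vector `w` and the name `col` are introduced only for illustration.
\footnotesize
```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
v = np.array([1, 2, 3])                # shape (3,), one value per column
w = np.array([10, 20])                 # shape (2,), one value per row

print(np.broadcast_shapes(x.shape, v.shape))   # (2, 3)

# w does not match the last dimension of x (2 vs 3). Turning it into
# a column of shape (2, 1) makes the shapes compatible.
col = w[:, np.newaxis]
print((x + col).tolist())   # [[11, 12, 13], [24, 25, 26]]

try:
    x + w                   # shapes (2, 3) and (2,) are not compatible
except ValueError as err:
    print("not broadcastable:", err)
```
\normalsize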
  220. ## Plot data
  221. A popular library to present data is the `pyplot` module of `matplotlib`.
  222. * Drawing a function in one plot
  223. \footnotesize
  224. ::: columns
  225. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# generate 100 points from 0 to 10 pi
x = np.linspace( 0, 10*np.pi, 100 )
f = np.sin(x)**2
# plot function
plt.plot(x,f,'blueviolet',label='sine^2')
plt.xlabel('x [radian]')
plt.ylabel('f(x)')
plt.title('Plot sin^2')
plt.legend(loc='upper right')
plt.axis([0,30,-0.1,1.2])   # limit the plot range
# show the plot
plt.show()
```
  242. ::::
  243. :::: {.column width=40%}
  244. ![](figures/matplotlib_Figure_1.png)
  245. ::::
  246. :::
  247. \normalsize
  248. ##
  249. * Drawing a scatter plot of data
  250. \footnotesize
  251. ::: columns
  252. :::: {.column width=35%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create x,y data points
num = 75
x = np.arange(num)
y = x + np.random.randint(0,int(num/1.5),num)      # integer upper bound
z = - (x + np.random.randint(0,num//3,num)) + num
# create colored scatter plot, sample 1
plt.scatter(x, y, color = 'green',
            label='Sample 1')
# create colored scatter plot, sample 2
plt.scatter(x, z, color = 'orange',
            label='Sample 2')
plt.title('scatter plot')
plt.xlabel('x')
plt.ylabel('y')
# description and plot
plt.legend()
plt.show()
```
  273. ::::
  274. :::: {.column width=35%}
  275. \vspace{3cm}
  276. ![](figures/matplotlib_Figure_6.png)
  277. ::::
  278. :::
  279. \normalsize
  280. ##
  281. * Drawing a histogram of data
  282. \footnotesize
  283. ::: columns
  284. :::: {.column width=35%}
  285. ```python
  286. ...
  287. # create normalized gaussian Distribution
  288. g = np.random.normal(size=10000)
  289. # histogram the data
  290. plt.hist(g,bins=40)
  291. # plot rotated histogram
  292. plt.hist(g,bins=40,orientation='horizontal')
  293. # normalize area to 1
  294. plt.hist(g,bins=40,density=True)
  295. # change color
  296. plt.hist(g,bins=40,density=True,
  297. edgecolor='lightgreen',color='orange')
  298. plt.title('Gaussian Histogram')
  299. plt.xlabel('bin')
  300. plt.ylabel('entries')
  301. # description and plot
  302. plt.legend(['Normalized distribution'])
  303. plt.show()
  304. ```
  305. ::::
  306. :::: {.column width=35%}
  307. \vspace{3.5cm}
  308. ![](figures/matplotlib_Figure_5.png)
  309. ::::
  310. :::
  311. \normalsize
  312. ##
  313. * Drawing subplots in one canvas
  314. \footnotesize
  315. ::: columns
  316. :::: {.column width=35%}
```python
# x and f = np.sin(x)**2 as defined in the first plot example
g = np.exp(-0.2*x)
# create figure
plt.figure(num=2,figsize=(10.0,7.5),dpi=150,facecolor='lightgrey')
plt.suptitle('1 x 2 Plot')
# create first subplot
plt.subplot(1,2,1)
# plot first one
plt.title('exp(-0.2 x)')
plt.xlabel('x')
plt.ylabel('g(x)')
plt.plot(x,g,'blueviolet')
# create subplot and plot second one
plt.subplot(1,2,2)
plt.plot(x,f,'orange')
plt.plot(x,f*g,'red')
plt.legend(['sine^2','exp*sine^2'])
# show the plot
plt.show()
```
  338. ::::
  339. :::: {.column width=40%}
  340. \vspace{3cm}
  341. ![](figures/matplotlib_Figure_2.png)
  342. ::::
  343. :::
  344. \normalsize
  345. ## Image data
The `image` module of the `matplotlib` library can be used to load an image
into a numpy array and to render it.
* There are 3 common formats for the numpy array
    * (M, N) scalar data used for greyscale images
    * (M, N, 3) for RGB images (each pixel has an array with RGB color attached)
    * (M, N, 4) for RGBA images (each pixel has an array with RGB color
      and transparency attached)
The function `imread` loads the image into an `ndarray`, which can be
manipulated.
The function `imshow` renders the image data
  356. \vspace {2cm}
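The three array formats differ only in the size of the last dimension. A minimal sketch, assuming a hypothetical image size of 4 x 6 pixels and float values between 0 and 1:
\footnotesize
```python
import numpy as np

M, N = 4, 6                                    # hypothetical image size
grey = np.zeros((M, N))                        # (M, N) greyscale image
rgb = np.stack([grey, grey, grey], axis=-1)    # (M, N, 3) RGB image
alpha = np.ones((M, N, 1))                     # fully opaque
rgba = np.concatenate((rgb, alpha), axis=-1)   # (M, N, 4) RGBA image

rgb[1, 2] = [1.0, 0.0, 0.0]                    # red pixel at row 1, column 2
print(grey.shape, rgb.shape, rgba.shape)       # (4, 6) (4, 6, 3) (4, 6, 4)
```
\normalsize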
  357. ##
  358. * Drawing pixel data and images
  359. \footnotesize
  360. ::: columns
  361. :::: {.column width=50%}
```python
import numpy as np
import matplotlib.pyplot as plt
# create data array with pixel position and RGB color code
width, height = 200, 200
data = np.zeros((height, width, 3), dtype=np.uint8)
# red patch in the center
data[75:125, 75:125] = [255, 0, 0]
rows = np.random.randint(0,height,100)
cols = np.random.randint(0,width,100)
data[rows,cols] = [0,255,0]   # 100 random green pixels
plt.imshow(data)
plt.show()

import matplotlib.image as mpimg
# read image into numpy array
pic = mpimg.imread('picture.jpg')
mod_pic = pic[:,:,0]   # grab slice 0 of the colors
plt.imshow(mod_pic)    # uses the default color map, try cmap='hot'
plt.colorbar()         # show the color scale
plt.show()
```
  383. ::::
  384. :::: {.column width=25%}
  385. ![](figures/matplotlib_Figure_3.png)
  386. \vspace{1cm}
  387. ![](figures/matplotlib_Figure_4.png)
  388. ::::
  389. :::
  390. \normalsize
  391. ## Input / output
For the analysis of measured data efficient input / output plays an
important role. In numpy, `ndarray`s can be saved to and read from files.
The `load()` and `save()` functions handle numpy binary files (.npy extension)
which contain data, shape, dtype and other information required to
reconstruct the `ndarray` from the file on disk.
  397. \footnotesize
```python
r = np.random.default_rng()        # instantiate random number generator
a = r.random((4,3))                # random 4x3 array
np.save('myBinary.npy', a)         # write array a to binary file myBinary.npy
b = np.arange(12)
np.savez('myComp.npz', a=a, b=b)   # write a and b to one uncompressed .npz archive
                                   # (np.savez_compressed compresses the data)
# ......
b = np.load('myBinary.npy')        # read content of myBinary.npy into b
```
  407. \normalsize
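Arrays stored in an `.npz` archive are read back by the keyword names used when saving. The sketch below writes to a temporary directory so that no files are left behind; the seed and the names `a_back` and `b_back` are assumptions made for illustration.
\footnotesize
```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(seed=1)
a = rng.random((4, 3))
b = np.arange(12)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myComp.npz")
    np.savez_compressed(path, a=a, b=b)   # compressed archive with keys 'a' and 'b'
    with np.load(path) as archive:
        names = sorted(archive.files)     # names of the stored arrays
        a_back = archive["a"]
        b_back = archive["b"]

print(names)                       # ['a', 'b']
print(np.array_equal(a, a_back))   # True
```
\normalsize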
Array data is stored in and retrieved from text files
with the `savetxt()` and `loadtxt()` functions. Parameters controlling the delimiter,
line separators, file header and footer can be specified.
  411. \footnotesize
  412. ```python
  413. x = np.array([1,2,3,4,5,6,7]) # create ndarray
  414. np.savetxt('myText.txt',x,fmt='%d', delimiter=',') # write array x to file myText.txt
  415. # with comma separation
  416. ```
  417. \normalsize
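A round trip through a text file shows how `savetxt()` and `loadtxt()` fit together. In this sketch the file name `myTable.txt` and the header text are hypothetical; the header is written as a comment line starting with `#`, which `loadtxt()` skips by default.
\footnotesize
```python
import os
import tempfile
import numpy as np

x = np.arange(12).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myTable.txt")   # hypothetical file name
    np.savetxt(path, x, fmt="%d", delimiter=",", header="a,b,c")
    y = np.loadtxt(path, dtype=int, delimiter=",")

print(np.array_equal(x, y))   # True
```
\normalsize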
  418. ## Input / output
Import tabular data from the spreadsheet programs of office packages.
  420. \vspace{0.4cm}
  421. \footnotesize
  422. ::: columns
  423. :::: {.column width=35%}
  424. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  425. delimiter.
  426. ::::
  427. :::: {.column width=35%}
  428. ![](figures/numpy_excel.png)
  429. ::::
  430. :::
  431. \footnotesize
```python
import numpy as np
# read content of the file myData_01.csv into data
data = np.loadtxt('myData_01.csv',dtype=int,delimiter=',')
print(data.shape)   # (12, 9)
print(data)         # [[1 1 1 1 0 0 0 0 0]
                    #  [0 0 1 1 0 0 1 1 0]
                    #  .....
                    #  [0 0 0 0 1 1 1 1 1]]
```
  442. \normalsize
  443. ## Input / output
Import tabular data from the spreadsheet programs of office packages.
  445. \vspace{0.4cm}
  446. \footnotesize
  447. ::: columns
  448. :::: {.column width=35%}
  449. `Excel data` can be exported as text file (myData_01.csv) with a comma as
  450. delimiter. \newline
  451. $\color{blue}{Often~many~files~are~available~(myData\_*.csv)}$
  452. ::::
  453. :::: {.column width=35%}
  454. ![](figures/numpy_multi_excel.png)
  455. ::::
  456. :::
  457. \footnotesize
```python
import glob   # find files and directories with names matching a pattern
import numpy as np
# read content of all files myData_*.csv into data
file_list = sorted(glob.glob('myData_*.csv'))   # generate a sorted file list
for filename in file_list:
    data = np.loadtxt(fname=filename, dtype=int, delimiter=',')
    print(data[:,3])   # print column 3 of each file
# [1 1 1 1 1 1 1 1 1 1 1 0]
# ......
# [0 1 0 1 0 1 0 1 0 1 0 1]
```
  471. \normalsize
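If all files have the same layout, their tables can be collected into one 3D array instead of being processed one by one. The sketch below first writes three small hypothetical files into a temporary directory, so the file contents and the name `tables` are assumptions made only for illustration.
\footnotesize
```python
import glob
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # write three small hypothetical files myData_01.csv ... myData_03.csv
    for i in range(1, 4):
        table = np.full((2, 4), i)
        np.savetxt(os.path.join(tmp, f"myData_{i:02d}.csv"),
                   table, fmt="%d", delimiter=",")

    file_list = sorted(glob.glob(os.path.join(tmp, "myData_*.csv")))
    tables = [np.loadtxt(f, dtype=int, delimiter=",") for f in file_list]

data = np.stack(tables)         # shape (files, rows, columns)
print(data.shape)               # (3, 2, 4)
print(data[:, 0, 3].tolist())   # [1, 2, 3], column 3 of row 0 in each file
```
\normalsize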
  472. ## Exercise 1
i) Display a numpy array as figure of a blue cross. The size should be 200
by 200 pixels. Use the array format (M, N, 3), where the first 2 dimensions specify
the pixel positions and the last one the RGB color with values from 0 to 255.
  476. - Draw in addition a red square of arbitrary position into the figure.
  477. - Draw a circle in the center of the figure. Try to create a mask which
  478. selects the inner part of the circle using the indexing.
  479. \small
  480. [Solution: 01_intro_ex_1a_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/solutions/01_intro_ex_1a_sol.ipynb) \normalsize
  481. ii) Read data which contains pixels from the binary file horse.py into a
  482. numpy array. Display the data and the following transformations in 4
  483. subplots: scaling and translation, compression in x and y, rotation
  484. and mirroring.
  485. \small
  486. [Solution: 01_intro_ex_1b_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/solutions/01_intro_ex_1b_sol.ipynb) \normalsize
  487. ## Pandas
  488. [\textcolor{violet}{pandas}](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) is a software library written in python for
  489. \textcolor{blue}{data manipulation and analysis}.
  490. \vspace{0.4cm}
  491. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  492. * Offers data structures and operations for manipulating numerical tables with
  493. integrated indexing
  494. * Imports data from various file formats, e.g. comma-separated values, JSON,
  495. SQL or Excel
* Tools for reading and writing data structures; allows analyzing, filtering,
splitting, grouping and aggregating, merging and joining, and plotting
  498. * Built on top of `NumPy`
  499. * Visualize the data with `matplotlib`
  500. * Most machine learning tools support `pandas` $\rightarrow$
  501. it is widely used to preprocess data sets for analysis and machine learning
  502. in various scientific fields
  503. ## Pandas micro introduction
Goal: Exploring, cleaning, transforming, and visualizing data.
The basic indexable objects are
\setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* `Series` -> vector (list) of data elements of arbitrary type
* `DataFrame` -> tabular arrangement of data elements, where each column
can have its own type
Both allow cleaning data by removing `empty` or `nan` data entries
  511. \footnotesize
```python
import numpy as np
import pandas as pd                           # use together with numpy
s = pd.Series([1, 3, 5, np.nan, 6, 8])        # create a Series, float64 because of NaN
r = pd.Series(np.random.randn(4))             # Series of random numbers float64
dates = pd.date_range("20130101", periods=3)  # index according to dates
df = pd.DataFrame(np.random.randn(3,4),index=dates,columns=list("ABCD"))
print(df)                                     # print the DataFrame
#                    A         B         C         D
# 2013-01-01  1.618395  1.210263 -1.276586 -0.775545
# 2013-01-02  0.676783 -0.754161 -1.148029 -0.244821
# 2013-01-03 -0.359081  0.296019  1.541571  0.235337
new_s = s.dropna()                            # return a new Series without the NaN entries
```
  526. \normalsize
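For a `DataFrame`, `dropna()` removes whole rows or whole columns. A minimal sketch with a small hypothetical table:
\footnotesize
```python
import numpy as np
import pandas as pd

# hypothetical table with one missing entry
df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [4.0, 5.0, 6.0]})

rows_kept = df.dropna()          # drop all rows that contain NaN
cols_kept = df.dropna(axis=1)    # drop all columns that contain NaN

print(rows_kept.index.tolist())     # [0, 2]
print(cols_kept.columns.tolist())   # ['B']
print(df.isna().sum().tolist())     # [1, 0], number of NaN per column
```
\normalsize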
  527. ##
  528. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
* pandas data can be saved in different file formats (CSV, JSON, HTML, XML,
Excel, OpenDocument, HDF5 format, .....). `NaN` entries are kept
in the output file, unless they are removed with `DataFrame.dropna()`
  532. * csv file
  533. \footnotesize
  534. ```python
  535. df.to_csv("myFile.csv") # Write the DataFrame df to a csv file
  536. ```
  537. \normalsize
  538. * HDF5 output
  539. \footnotesize
  540. ```python
  541. df.to_hdf("myFile.h5",key='df',mode='w') # Write the DataFrame df to HDF5
  542. s.to_hdf("myFile.h5", key='s',mode='a')
  543. ```
  544. \normalsize
  545. * Writing to an excel file
  546. \footnotesize
  547. ```python
  548. df.to_excel("myFile.xlsx", sheet_name="Sheet1")
  549. ```
  550. \normalsize
  551. * Deleting file with data in python
  552. \footnotesize
  553. ```python
  554. import os
  555. os.remove('myFile.h5')
  556. ```
  557. \normalsize
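That `NaN` entries survive a write and read cycle can be checked without creating a file. This sketch writes CSV text into an in-memory buffer (`io.StringIO`); the small table is hypothetical.
\footnotesize
```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.5, np.nan], "B": [3, 4]})   # hypothetical data

buffer = io.StringIO()
df.to_csv(buffer, index=False)    # NaN is written as an empty field
buffer.seek(0)
df_back = pd.read_csv(buffer)     # the empty field is read back as NaN

print(buffer.getvalue())
print(df_back.equals(df))         # True
```
\normalsize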
  558. ##
  559. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  560. * read in data from various formats
  561. * csv file
  562. \footnotesize
```python
df = pd.read_csv('heart.csv')   # read csv data table
df.info()                       # prints a summary of the table:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 303 entries, 0 to 302
# Data columns (total 14 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   age     303 non-null    int64
#  1   sex     303 non-null    int64
#  2   cp      303 non-null    int64
print(df.head(5))       # prints the first 5 rows of the data table
print(df.describe())    # shows a quick statistic summary of your data
```
  578. \normalsize
  579. * Reading an excel file
  580. \footnotesize
  581. ```python
  582. df = pd.read_excel("myFile.xlsx","Sheet1", na_values=["NA"])
  583. ```
  584. \normalsize
  585. \textcolor{olive}{There are many options specifying details for IO.}
  586. ##
  587. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  588. * Various functions exist to select and view data from pandas objects
  589. * Display column and index
  590. \footnotesize
```python
df.index     # show datetime index of df
# DatetimeIndex(['2013-01-01','2013-01-02','2013-01-03'],
#               dtype='datetime64[ns]',freq='D')
df.columns   # show columns info
# Index(['A', 'B', 'C', 'D'], dtype='object')
```
  598. \normalsize
  599. * `DataFrame.to_numpy()` gives a `NumPy` representation of the underlying data
  600. \footnotesize
```python
df.to_numpy()   # one dtype for the entire array, not per column!
# [[-0.62660101 -0.67330526  0.23269168 -0.67403546]
#  [-0.53033339  0.32872063 -0.09893568  0.44814084]
#  [-0.60289996 -0.22352548 -0.43393248  0.47531456]]
```
\normalsize
The output does not include the index or column labels
  609. * more on viewing
  610. \footnotesize
```python
df.T                                    # transpose the DataFrame df
df.sort_values(by="B")                  # sort by the values of column B of df
df.sort_index(axis=0)                   # sort rows by ascending index values
df.sort_index(axis=1,ascending=False)   # display columns in inverse order
```
  617. \normalsize
  618. ##
  619. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  620. * Selecting data of pandas objects $\rightarrow$ keep or reduce dimensions
  621. * get a named column as a Series
  622. \footnotesize
```python
df["A"]           # selects column A from df as a Series, similar to df.A
df.iloc[:, 0:1]   # slices column A explicitly from df (stays a DataFrame),
                  # same as df.loc[:, ["A"]]
```
  627. \normalsize
  628. * select rows of a DataFrame
  629. \footnotesize
```python
df[0:2]                     # selects row 0 and 1 from df
df["20130102":"20130103"]   # use indices, endpoints are included!
df.iloc[2]                  # select the row at integer position 2 (as a Series)
df.iloc[1:3, :]             # selects row 1 and 2 from df
```
  636. \normalsize
  637. * select by label
  638. \footnotesize
  639. ```python
  640. df.loc["20130102":"20130103",["C","D"]] # selects row 1 and 2 and only C and D
  641. df.loc[dates[0], "A"] # selects a single value (scalar)
  642. ```
  643. \normalsize
  644. * select by lists of integer position (as in `NumPy`)
  645. \footnotesize
```python
df.iloc[[0, 2], [1, 3]]   # select row 0 and 2 and col B and D (data only)
df.iloc[1, 1]             # get a value explicitly (data only, no index lines)
```
  650. \normalsize
  651. * select according to expressions
  652. \footnotesize
  653. ```python
  654. df.query('B<C') # select rows where B < C
  655. df1=df[(df["B"]==0)&(df["D"]==0)] # conditions on rows
  656. ```
  657. \normalsize
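The difference between label based and position based selection is easiest to see when both pick the same cells. The sketch below uses a deterministic table (values 0 to 11) instead of random numbers, which is an assumption made so that the results are reproducible.
\footnotesize
```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130103", ["C", "D"]]   # label slice, end included
by_position = df.iloc[1:3, 2:4]                        # integer slice, end excluded

print(by_label.equals(by_position))   # True
print(df.loc[dates[0], "A"])          # 0
print(df.query("B < C").shape)        # (3, 4), B < C holds in every row
```
\normalsize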
  658. ##
  659. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  660. * Selecting data of pandas objects continued
  661. * Boolean indexing
  662. \footnotesize
```python
df[df["A"] > 0]   # select the rows of df where column A is > 0
df[df > 0]        # keep values > 0 of the entire DataFrame, others become NaN
```
  667. \normalsize
  668. more complex example
  669. \footnotesize
  670. ```python
  671. df2 = df.copy() # copy df
  672. df2["E"] = ["eight","one","four"] # add column E
  673. df2[df2["E"].isin(["two", "four"])] # test if elements "two" and "four" are
  674. # contained in Series column E
  675. ```
  676. \normalsize
  677. * Operations (in general exclude missing data)
  678. \footnotesize
```python
df3 = df.copy()                         # copy of the numeric DataFrame df
df3[df3 > 0] = -df3                     # all elements > 0 change sign
df.mean(0)                              # get column wise mean (number = axis)
df.mean(1)                              # get row wise mean
df.std(0)                               # standard deviation along the axis
df.cumsum()                             # cumulative sum of each column
df.apply(np.sin)                        # apply function to each element of df
df.apply(lambda x: x.max() - x.min())   # apply lambda function column wise
df + 10                                 # add scalar 10
df - [1, 2, 10, 100]                    # subtract one value from each column
df.corr()                               # compute pairwise correlation of columns
```
  691. \normalsize
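How missing data is excluded from such operations can be seen on a small table. The values below are hypothetical; `skipna=False` switches the default behaviour off.
\footnotesize
```python
import numpy as np
import pandas as pd

# hypothetical table with one missing entry in column A
df = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                   "B": [10.0, 20.0, 60.0]})

col_mean = df.mean(axis=0)                   # NaN is skipped by default
row_mean = df.mean(axis=1)
strict = df.mean(axis=0, skipna=False)       # NaN propagates into the result

print(col_mean.tolist())        # [2.0, 30.0]
print(row_mean.tolist())        # [5.5, 20.0, 31.5]
print(strict.isna().tolist())   # [True, False]
```
\normalsize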
  692. ##
  693. \setbeamertemplate{itemize item}{\color{red}\tiny$\blacksquare$}
  694. * Selecting data of pandas objects continued
  695. \vspace{0.5cm}
  696. * More operations
  697. \footnotesize
```python
df.drop(['col1', 'col2'], axis=1)   # removes columns 'col1' and 'col2'
df.fillna(0)                        # fills missing values with 0
df.ffill()                          # fills missing values with previous
                                    # non-missing value in the column
df.replace('old_val', 'new_val')    # replaces 'old_val' with 'new_val'
df.groupby('col1').mean()           # groups by 'col1' and computes
                                    # the mean of each group
pd.merge(df1, df2, on='column1')    # merges df1 and df2 on 'column1'
df['column1'].value_counts()        # counts the number of occurrences
                                    # of each unique value in 'column1'
```
  710. \normalsize
  711. \vspace{3cm}
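The grouping, merging and counting operations listed above can be tried on two small tables. All column names and values in this sketch are hypothetical and chosen only for illustration.
\footnotesize
```python
import pandas as pd

df1 = pd.DataFrame({"column1": ["a", "b", "a", "c"],
                    "value":   [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"column1": ["a", "b"],
                    "label":   ["first", "second"]})

means = df1.groupby("column1")["value"].mean()   # mean of 'value' per group
merged = pd.merge(df1, df2, on="column1")        # inner join, 'c' has no partner
counts = df1["column1"].value_counts()           # occurrences of each value

print(means.to_dict())   # {'a': 2.0, 'b': 2.0, 'c': 4.0}
print(len(merged))       # 3
print(counts["a"])       # 2
```
\normalsize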
  712. ## Pandas - plotting data
[\textcolor{violet}{Visualization}](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) is integrated in pandas using matplotlib. Here are only 2 examples
* Plot random data in a histogram and a scatter plot
  715. \footnotesize
  716. ```python
  717. # create DataFrame with random normal distributed data
  718. df = pd.DataFrame(np.random.randn(1000,4),columns=["a","b","c","d"])
  719. df = df + [1, 3, 8 , 10] # shift column wise mean by 1, 3, 8 , 10
  720. df.plot.hist(bins=20) # histogram all 4 columns
  721. g1 = df.plot.scatter(x="a",y="c",color="DarkBlue",label="Group 1")
  722. df.plot.scatter(x="b",y="d",color="DarkGreen",label="Group 2",ax=g1)
  723. ```
  724. \normalsize
  725. ::: columns
  726. :::: {.column width=35%}
  727. ![](figures/pandas_histogramm.png)
  728. ::::
  729. :::: {.column width=35%}
  730. ![](figures/pandas_scatterplot.png)
  731. ::::
  732. :::
  733. ## Pandas - plotting data
The function `crosstab()` takes one or more array-like objects as index or
columns and constructs a new DataFrame with the counts of each combination of the input values
  736. \footnotesize
```python
df = pd.DataFrame(                              # create DataFrame of 2 categories
    {"sex":   np.array([0,0,0,0,1,1,1,1,0,0,0]),
     "heart": np.array([1,1,1,0,1,1,1,0,0,0,1])
    })
pd.crosstab(df.sex,df.heart)                    # create cross table of possibilities
pd.crosstab(df.sex,df.heart).plot(kind="bar",color=['red','blue'])  # plot counts
```
  745. \normalsize
  746. ::: columns
  747. :::: {.column width=38%}
  748. ![](figures/pandas_crosstabplot.png)
  749. ::::
  750. :::
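The counts behind the bar plot can be inspected directly. This sketch uses the same two columns as above and leaves out the plotting step.
\footnotesize
```python
import pandas as pd

df = pd.DataFrame({"sex":   [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
                   "heart": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1]})

table = pd.crosstab(df.sex, df.heart)   # rows: sex, columns: heart
print(table)
print(table.loc[0, 1])                     # 4 entries with sex=0 and heart=1
print(table.to_numpy().sum() == len(df))   # True, every row is counted once
```
\normalsize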
  751. ## Exercise 2
  752. Read the file [\textcolor{violet}{heart.csv}](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/exercises/heart.csv) into a DataFrame.
  753. [\textcolor{violet}{Information on the dataset}](https://archive.ics.uci.edu/ml/datasets/heart+Disease)
  754. \setbeamertemplate{itemize item}{\color{red}$\square$}
* Which columns do we have?
* Print the first 3 rows
* Print the statistics summary and the correlations
* Print mean values for each column with and without disease (target)
* Select the data according to `sex` and `target` (heart disease 0=no 1=yes).
* Plot the `age` distribution of male and female in one histogram
* Plot the heart disease distribution according to chest pain type `cp`
* Plot `thalach` according to `target` in one histogram
* Plot `sex` and `target` in a histogram figure
* Correlate `age` and `max heart rate` according to `target`
* Correlate `age` and `cholesterol` according to `target`
  766. \small
  767. [Solution: 01_intro_ex_2_sol.ipynb](https://www.physi.uni-heidelberg.de/~reygers/lectures/2023/ml/solutions/01_intro_ex_2_sol.ipynb) \normalsize