profile-img
The merit of an action lies in finishing it to the end.
slide-image

Exploratory Data Analysis

- ๋ฐ์ดํ„ฐ๋ฅผ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ 

* ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘, ์ „์ฒ˜๋ฆฌ์—์„œ์˜ ์‹ค์ˆ˜ ๊ตฌ๋ณ„

* ํ†ต๊ณ„์  ๊ฐ€์ •์„ ์–ด๊ธฐ๋Š” ๊ฒฝ์šฐ๋ฅผ ํŒŒ์•…

* ๋ฐ์ดํ„ฐ ํŒจํ„ด ํƒ์ƒ‰

* ๊ฐ€์„ค ์„ค์ •

 

Anscombe's Quartet

- ๊ฐ™์€ ํ‰๊ท , ํŽธ์ฐจ, ์ƒ๊ด€๊ด€๊ณ„, ํšŒ๊ท€์ง์„ ์„ ๊ฐ€์ง€์ง€๋งŒ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ ๋ชจ์–‘ ์ž์ฒด๊ฐ€ ๋งค์šฐ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ.

 

Mapping Data to Image

- ํšจ์œจ์„ฑ ์ˆœ์œ„: ์œ„์น˜ > ๊ธธ์ด > ๊ธฐ์šธ๊ธฐ, ๊ฐ๋„ > ๋ฉด์  > ์ƒ‰ ์ง„ํ•˜๊ธฐ > ์ƒ‰, ๋ชจ์–‘

- ๋ฉด์ , ์ƒ‰ ์ง„ํ•˜๊ธฐ ๋ฐ์ดํ„ฐ๋Š” ordinal data์— ์‚ฌ์šฉ ๊ฐ€๋Šฅ

- ์›๊ทธ๋ž˜ํ”„๋Š” ๋ฉด์ ๊ณผ ๊ฐ๋„๋ฅผ ๊ฐ™์ด ์‚ฌ์šฉํ•˜์ง€๋งŒ, ๋„๋„› ๊ทธ๋ž˜ํ”„๋Š” ๊ฐ€์šด๋ฐ๊ฐ€ ๋น„์–ด์žˆ์œผ๋ฏ€๋กœ ๊ฐ์ด ์ƒ๋žต๋œ ํ˜•ํƒœ๋‹ค.

- ๊ฐ€์žฅ ๋น„ํšจ์œจ์ ์ธ ์‹œ๊ฐํ™” ์‚ฌ๋ก€

์ƒ‰์˜ ์šฐ์„ ์ˆœ์œ„๊ฐ€ ๋ช…ํ™•ํ•˜์ง€ ์•Š์Œ

- ์ƒ‰์˜ ์šฐ์„ ์ˆœ์œ„๋ฅผ ๊ฒฐ์ •

์™ผ์ชฝ์˜ ๊ฒฝ์šฐ ordering ๋ถˆ๊ฐ€๋Šฅํ•˜์ง€๋งŒ, ์˜ค๋ฅธ์ชฝ์€ intensity๋กœ ๋งค๊ธธ ์ˆ˜ ์žˆ๋‹ค.

- ์ง€๋‚˜์น˜๊ฒŒ ๋ณต์žกํ•œ ์‹œ๊ฐํ™”๋Š” ์ข‹์ง€ ์•Š๋‹ค. (ํ•˜๋‚˜์˜ ๊ทธ๋ฆผ์— ๋„ˆ๋ฌด ๋งŽ์€ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ๋Š” ๊ฒฝ์šฐ, ์ƒ‰ ์ˆœ์œ„๋ฅผ ๊ตฌ๋ณ„ํ•˜๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ ๋“ฑ๋“ฑ)

 

Tufte's Design Principle

1. Maximize data ink-ratio

- data-ink ratio = data ink / total ink used in graphic

- ์ด ๊ฐ’์ด ์ตœ๋Œ€ํ•œ 1์— ๊ฐ€๊นŒ์›Œ์ ธ์•ผ ํ•œ๋‹ค.

- ์˜ˆ์‹œ: 3D ๋ง‰๋Œ€ ๊ทธ๋ž˜ํ”„๋ฅผ 2D๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ๊ทธ๋ฆผ์ž๋ฅผ ์ œ๊ฑฐํ•œ๋‹ค.

2. Minimize lie factor

- Lie factor = size of effect shown in graphic / size of effect in data -> 1์— ๊ฐ€๊นŒ์›Œ์ ธ์•ผ ํ•œ๋‹ค.

- ์‹ค์ œ๋ณด๋‹ค ๊ณผ์žฅํ•˜์—ฌ ๊ทธ๋ฆผ์„ ๊ทธ๋ฆฌ๋Š” ๊ฒฝ์šฐ

- 3D ๊ทธ๋ž˜ํ”„๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ฐ’์— ๋น„ํ•ด ๊ทธ๋ฆผ์ด ๋” ์ปค๋ณด์ด๋Š” ๊ฒฝ์šฐ

 

3. Minimize chartjunk

- ๋ถˆํ•„์š”ํ•œ ์ฐจ์›, ์ •๋ณด ์ „๋‹ฌ์šฉ์ด ์•„๋‹Œ ์ƒ‰ ์ž…ํžˆ๊ธฐ, ๊ณผ๋„ํ•œ ๊ฒฉ์ž๋ฌด๋Šฌ ๋ฐ ๋ฐ์ฝ”๋ ˆ์ด์…˜ ๋“ฑ์€ ์ •๋ณด์ „๋‹ฌ์— ๋ฐฉํ•ด๊ฐ€ ๋˜๋Š” ์š”์†Œ.

- ๊ฒฉ์ž ์ œ๊ฑฐ, ๋ฐฐ๊ฒฝ์ƒ‰ ์ œ๊ฑฐ, ํ…Œ๋‘๋ฆฌ ์ œ๊ฑฐ ๋“ฑ์œผ๋กœ ์‹œ๊ฐํ™”ํ•œ ๋‚ด์šฉ์„ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ค์ž.

- ๋ˆˆ๊ธˆ์˜ ๊ฒฝ์šฐ ๋ง‰๋Œ€๊ทธ๋ž˜ํ”„ ์•ˆ์— ํ‘œ์‹œํ•˜๋Š” ๋“ฑ ํ•ด์„œ ๋” ์ค„์ผ ์ˆ˜ ์žˆ์Œ.

- Matplotlib ๋“ฑ์„ ์ด์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

 

4. Use proper scales and clear labeling

- Scale Distortion: ๋ˆˆ๊ธˆ์„ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์ง€ ์•Š๊ณ  ์›ํ•˜๋Š” ๊ฐ’์— ํฌ์ปค์Šค๋ฅผ ๋งž์ถฐ์„œ ์‹ค์ œ๋ณด๋‹ค ์ฐจ์ด๋ฅผ ๋” ํฌ๊ฒŒ ๋งŒ๋“ค์–ด๋ฒ„๋ฆผ.

ex) Bush Tax cuts expire~ -> ๋ˆˆ๊ธˆ์€ ํ•ญ์ƒ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋„๋ก ํ•˜์ž.

 

์‹œ๊ฐํ™” ํ‘œ์˜ ์ข…๋ฅ˜

1. Bar Chart

2. Line Chart: ์‹œ๊ฐ„์— ๋”ฐ๋ฅธ ํŠธ๋ Œ๋“œ์˜ ๋ณ€ํ™”

* Bar vs. Line: Line์€ ์—ฐ๊ฒฐ์˜ ์˜๋ฏธ๋ฅผ ๊ฐ–๊ณ  ์žˆ์œผ๋ฏ€๋กœ, Categorical data์— ์‚ฌ์šฉํ•˜๋ฉด ์•ˆ ๋œ๋‹ค.

1๋ฒˆ์งธ ์ค„์—์„œ๋Š” Bar, 2๋ฒˆ์งธ ์ค„์—์„œ๋Š” Line์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒŒ ํ•ฉ๋ฆฌ์ 

* book rating์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด์„œ bar chart๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ๋งˆ๋ผ.

* Banking to 45º: 2๊ฐœ์˜ line segment๋Š” 45๋„์˜ ๊ฐ์— ๊ฐ€๊นŒ์›Œ์งˆ์ˆ˜๋ก ๊ตฌ๋ณ„์ด ์ž˜ ๋œ๋‹ค.

2๋ฒˆ์งธ ๊ทธ๋ž˜ํ”„๋Š” ๋„ˆ๋ฌด ํ”Œ๋žซํ•˜๊ณ , 3๋ฒˆ์งธ ๊ทธ๋ž˜ํ”„๋Š” ๋„ˆ๋ฌด stiffํ•˜๋‹ค.

* ๋„ˆ๋ฌด ํ”Œ๋žซํ•œ ๊ทธ๋ž˜ํ”„๋‚˜ stiffํ•œ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆด ๊ฒฝ์šฐ ๊ทธ๋ž˜ํ”„๋ฅผ ์ž˜๋ชป ํ•ด์„ํ•˜๊ฒŒ ๋  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, 45๋„๋ฅผ ์œ ์ง€ํ•˜์ž.

 

3. Scatter Plots / Bubble Charts

- Scatter plot์€ ๊ฐ๊ฐ์˜ ์ ์˜ ๊ฐ’์„ 2D์—์„œ ๋ณด์—ฌ์ฃผ๋Š” ๋ฐ ํšจ๊ณผ์ 

- 3, 4๊ฐœ์˜ ๋ณ€์ˆ˜๋ฅผ ๊ฐ–๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ์…‹์€ bubble chart๋ฅผ ์ด์šฉ

- principle component analysis๋ฅผ ํ†ตํ•ด ๊ณ ์ฐจ์›์˜ ๋ฐ์ดํ„ฐ๋ฅผ 2D๋กœ ํˆฌ์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

- ์ ์˜ ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ์ปค์ง€๋ฉด overplotting์ด ๋ฐœ์ƒํ•˜๋ฏ€๋กœ ์ ์˜ ํฌ๊ธฐ๋ฅผ ์ค„์ด์ž.

- ์ ์ด ํ•œ ๊ตฐ๋ฐ์— ๋„ˆ๋ฌด ๋ชฐ๋ฆฌ๋ฉด Overplotting์ด ๋ฐœ์ƒํ•˜๋ฏ€๋กœ ๋ถˆํˆฌ๋ช…๋„๋ฅผ ๋‚ฎ์ถ”์ž.

- heatmap: ๋นˆ๋„์— ๋”ฐ๋ผ์„œ ์ƒ‰์„ ๋‹ฌ๋ฆฌํ•˜๋ฉด ์ž๋ฃŒ๋ฅผ ํŒŒ์•…ํ•˜๊ธฐ ๋” ์‰ฝ๋‹ค.

- ์ ์˜ ํฌ๊ธฐ, ๋ชจ์–‘, ์ƒ‰์„ ๋‹ฌ๋ฆฌํ•˜์—ฌ 3์ฐจ์›์˜ ์ž๋ฃŒ๋ฅผ ์‹œ๊ฐํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. -> Bubble Chart

* 3์ฐจ์›์˜ ๋ฐ์ดํ„ฐ๋ผ๊ณ  3D ๊ณต๊ฐ„์— ๋‚˜ํƒ€๋‚ด๋ฉด ์•ˆ ๋œ๋‹ค. ์ฐจ์› ์ถ•์†Œ๋ฅผ ํ•˜์ž.

 

4. Pie Chart

- Pie Chart vs. Bar Chart

* Pie Chart๋Š” ๋น„์œจ์— ๋Œ€ํ•œ ์งˆ๋ฌธ์— ๋Œ€๋‹ตํ•˜๊ธฐ ํŽธ๋ฆฌํ•˜๋‹ค.

* Bar Chart๋Š” ์‹ค์ œ ๊ฐ’์— ๋Œ€ํ•œ ์งˆ๋ฌธ์— ๋Œ€๋‹ตํ•˜๊ธฐ ํŽธ๋ฆฌํ•˜๋‹ค.

 

5. Donut Chart

- ๋ฉด์ ๋งŒ ์žˆ์œผ๋ฏ€๋กœ pie chart์— ๋น„ํ•ด ๊ฐ€๋…์„ฑ์ด ๋–จ์–ด์ง„๋‹ค. 

 

6. Stacked Bar Chart

7. Stacked Area Chart

Stacked Area Chart / 100% stacked Area Chart

- Stacked Area Chart vs Line Graphs

* Stacked Area Chart์—์„œ๋Š” ๊ฐ๊ฐ์˜ ์„ฑ๋ถ„์ด ์ฐจ์ง€ํ•˜๊ณ  ์žˆ๋Š” ๋น„์œจ ๋ณ€ํ™”๋ฅผ ํŒŒ์•…ํ•˜๊ธฐ ์‰ฝ๋‹ค.

* Line Chart์—์„œ๋Š” ๊ฐ๊ฐ์˜ ์„ฑ๋ถ„์˜ ์ ˆ๋Œ“๊ฐ’์„ ํŒŒ์•…ํ•˜๊ธฐ ์‰ฝ๋‹ค.

 

8. Histograms

- Bin Size๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

- Count๊ฐ€ ์ค‘์š”ํ•˜๋‹ค.

- ๋„ˆ๋ฌด ์ž˜๊ฒŒ ์ชผ๊ฐœ๋ฉด ๊ฐ’์˜ ๋ถ„ํฌ๊ฐ€ ์ž˜ ๋“œ๋Ÿฌ๋‚˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ๋‹ค.

- Frequency vs. Density Histograms

* ๊ฐ ํ•ญ๋ชฉ์˜ ์ˆ˜๋ฅผ ์ „์ฒด ์ˆ˜๋กœ ๋‚˜๋ˆ„๋ฉด ํ™•๋ฅ  ๋ฐ€๋„ plot์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค. -> ๋” ํ•ด์„ํ•˜๊ธฐ ์‰ฝ๋‹ค.

 

9. Heat Maps

 

10. Box & Whisker Plots

- max, min, 4๋ถ„์œ„์ˆ˜, ์ค‘์•™๊ฐ’ ๋“ฑ์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋Š” ์ž๋ฃŒ 

 

๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ํˆด

- Excel: ๊ฐ€์žฅ ์œ ๋ช…ํ•˜์ง€๋งŒ ์ข‹์€ ๊ทธ๋ž˜ํ”„๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์€ ์•„๋‹˜.

- R: ํ†ต๊ณ„์–ธ์–ด

- Matplotlib: ํŒŒ์ด์ฌ

 

Multivariate Data๋ฅผ ๋‚˜ํƒ€๋‚ด๊ธฐ

- ์ž‘์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ plot์„ ๋งŒ๋“ค์–ด์„œ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด ์ข‹์€ ๋ฐฉ๋ฒ•์ด๋‹ค.

- ์˜ˆ: ๊ตญ๊ฐ€๋ณ„ ์ธ๊ตฌ ๋ถ„ํฌ -> ๊ตญ๊ฐ€๋ณ„๋กœ ์ธ๊ตฌ ๋ถ„ํฌ๋ฅผ ๋งŒ๋“ค์–ด์„œ ํ•˜๋‚˜์˜ ๊ทธ๋ฆผ์— ํ•ฉ์นœ๋‹ค.

 

Overinterpreting Variance

- ๋ ๋ถ€๋ถ„ ๋ฐ์ดํ„ฐ๊ฐ€ ์š”๋™์น˜๋Š” ์ด์œ : gene๋งˆ๋‹ค size๊ฐ€ ๋ชจ๋‘ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ!

-> size๊ฐ€ ํฐ gene์ด ๋งŽ์ง€ ์•Š์•„์„œ ๊ฒฝํ–ฅ์„ ํŒŒ์•…ํ•˜๊ธฐ ์–ด๋ ค์›€.

 

๋น„ํŒ์  ์‹œ๊ฐ์œผ๋กœ ๋ฐ”๋ผ๋ณด๊ธฐ: ์˜ˆ์œ ๋ฐ์ดํ„ฐ๋Š” ์˜ˆ์œ ์‹œ๊ฐํ™”๋ฅผ ๋„์ถœํ•ด๋‚ธ๋‹ค.

'School/COSE471 ๋ฐ์ดํ„ฐ๊ณผํ•™' Related Articles +