
Big Data Intro

- Up to 2003, humanity had produced a total of about 5 exabytes of information

- Now we produce 5 exabytes every two days -> big data

 

Big Data์˜ 4V

- Volume

- Variety

- Velocity

- Veracity

 

Science Paradigms

- Thousands of years ago: science was empirical -> describing natural phenomena

- A few hundred years ago: a theoretical branch emerged -> theories and models

- A few decades ago: computational -> simulation

- Today: data exploration

- Future: data-driven science -> data-driven hypothesis generation

 

Big Data Challenges

- ๋น…๋ฐ์ดํ„ฐ

* ํฌ๊ณ  ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ

* ์˜ˆ: social, web, financial transaction data, academic articles, genomic data...

- 2๊ฐ€์ง€์˜ ๊ณ ๋‚œ

* ํšจ์œจ์  ์ €์žฅ ๋ฐ ์ ‘๊ทผ

* Data Analytics -> ๊ฐ€์น˜ ์žˆ๋Š” ์ •๋ณด ์บ๋‚ด๊ธฐ

 

Data Science and Big Data

- The term big data refers to the analysis of massive datasets

- Examples: all of Twitter/Facebook, the web logs of major sites, the genome sequences of thousands of people, every image on Flickr...

- The larger the data gets, the harder it is to work with

 

Big data as Bad data

- Unrepresentative participation (bias): Instagram is used mostly by the young, The New York Times skews liberal, Fox News skews conservative, and The Wall Street Journal is read mainly by the wealthy

- Spam and machine-generated content: bots, fake news, etc.

- Power laws: imply redundancy -> in a building-detection task, landmark data is so abundant that the model may recognize landmarks well while failing on ordinary buildings

- Susceptibility to temporal bias (e.g. Google Flu Trends): Google's autocomplete feature changed the distribution of search queries

 

Large-Scale Machine Learning

- The algorithms we have studied so far are not a good fit for very large datasets.

- On very large datasets, models with only a few parameters do not work well.

- An algorithm's time complexity should be linear -> O(n).

- Big matrices tend to be sparse in big data.

* Example: a user-rating matrix over every movie for 100 million Netflix customers has enormous dimensions but is sparse,

-> because no one has watched every movie.
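A rough sketch of why sparsity matters here, using a plain dict-of-dicts as the sparse representation (the user/movie names and ratings are made up):

```python
# Hypothetical sparse storage for a user x movie ratings matrix:
# only entries that actually exist are stored; unrated pairs take no space.

ratings = {}  # user_id -> {movie_id: rating}

def rate(user, movie, score):
    ratings.setdefault(user, {})[movie] = score

def get_rating(user, movie, default=None):
    # A missing entry is simply absent, not a stored zero.
    return ratings.get(user, {}).get(movie, default)

rate("alice", "The Matrix", 5)
rate("alice", "Up", 4)
rate("bob", "Up", 3)

# Three stored entries, no matter how many users x movies the full
# dense matrix would contain.
stored = sum(len(movies) for movies in ratings.values())
```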

 

Filtering Data -> domain specific criteria

- ๋น…๋ฐ์ดํ„ฐ์˜ ์žฅ์ ์€ ๋ถ„์„์„ ๋” ๊น”๋”ํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋น…๋ฐ์ดํ„ฐ์˜ ์ƒ๋‹น ๋ถ€๋ถ„์„ ๋ฒ„๋ ค๋„ ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

- ์˜ˆ: ํŠธ์œ„ํ„ฐ์˜ ์ „์ฒด ํŠธ์œ— ์ค‘ 34%๋งŒ์ด ์˜์–ด๊ถŒ ๊ณ„์ • -> ๋‚˜๋จธ์ง€๋Š” ๋ฒ„๋ ค๋„ ๋จ

- irrelevant or hard-to-interpret data๋ฅผ ํ•„ํ„ฐ๋ง ํ•˜๋Š” ๊ฒƒ์€ application-specific knowledge๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค.
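As a toy illustration of such domain-specific filtering (the records and the `lang` field are hypothetical), dropping non-English items is a one-line predicate once you know which field to test:

```python
# Hypothetical records with a language tag; knowing that only the "en"
# subset is relevant is exactly the application-specific knowledge needed.
tweets = [
    {"text": "hello world", "lang": "en"},
    {"text": "์•ˆ๋…•ํ•˜์„ธ์š”", "lang": "ko"},
    {"text": "good morning", "lang": "en"},
]

english_only = [t for t in tweets if t["lang"] == "en"]
```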

 

Subsampling Data

- Why subsample?

* It gives a clean separation into training/validation/testing data.

* Simple, robust models have few parameters -> big data is overkill for them

cf) hyperparameter fitting and model selection -> the model must be trained many times

* A spreadsheet-sized dataset is quick and easy to analyze.
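A minimal sketch of subsampling and splitting with the standard library (the sizes and the 80/10/10 ratios are illustrative, not from the notes):

```python
import random

random.seed(0)  # reproducible for illustration

big_data = list(range(1_000_000))         # stand-in for a huge dataset
sample = random.sample(big_data, 10_000)  # spreadsheet-sized subsample

# Clean train/validation/test split of the subsample.
random.shuffle(sample)
n = len(sample)
train = sample[: int(0.8 * n)]
valid = sample[int(0.8 * n) : int(0.9 * n)]
test = sample[int(0.9 * n) :]
```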

 

Truncating Data

- Taking only the first n records is easy, but it invites several kinds of error:

* temporal biases (you may end up analyzing only the oldest data)

* lexicographic biases (you may end up analyzing only data starting with 'A' -> Arabic may fall inside the analysis range while Korean does not)

* numerical biases (ID numbers may be meaningful rather than randomly assigned)

 

Stream Sampling

- n์„ ๋ชจ๋ฅผ ๋•Œ size k ์งœ๋ฆฌ ๊ท ์ผ ์ƒ˜ํ”Œ์„ ์ถ”๊ตฌํ•  ๋•Œ๊ฐ€ ์žˆ๋‹ค.

- ํ•ด๊ฒฐ์ฑ…: k๊ฐœ์˜ active element์— ๋Œ€ํ•œ ๋ฐฐ์—ด์„ ์œ ์ง€ํ•˜๊ณ , n๋ฒˆ์จฐ ์š”์†Œ๋ฅผ k/n์˜ ํ™•๋ฅ ๋กœ ๋Œ€์ฒด -> ๋“ค์–ด์™”๋˜ ๊ฒƒ๋„ ์ซ“๊ฒจ๋‚  ์ˆ˜๋„ ์žˆ์Œ

- cut์„ ๋งŒ๋“ค ๋•Œ ์ƒˆ๋กœ์šด ์š”์†Œ์˜ ์œ„์น˜๋ฅผ ๋žœ๋ค์œผ๋กœ ์„ ์ •ํ•  ์ˆ˜ ์žˆ์Œ.

 

Distributed vs. Parallel Processing

- Parallel processing: runs on a single machine (using threads, OS processes, etc.)

- Distributed processing: runs across multiple machines, communicating over a network

 

Data Parallelism

- One way to exploit parallelism is to split big data across many machines and train a model on each independently.

- As with k-means, merging the per-machine results can be difficult.

* Parallelized k-means

For example, suppose we have 100 seeds:

1. One master machine (which assigns jobs, etc.) picks k random seeds and broadcasts them to the other machines.

2. Locally, each machine computes which cluster each of its own data points belongs to.

3. From that data it computes new centroids and point counts and sends them back (sending sum vectors etc. avoids transmitting every point individually).

4. The master machine collects this information, performs the final centroid update, and broadcasts the result again.

5. Repeat.
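A minimal single-process simulation of steps 1-5, with each list partition standing in for one machine's local data (1-D points for brevity; the data and parameters are illustrative):

```python
import random

def local_stats(partition, centroids):
    # Steps 2-3: assign each local point to its nearest centroid and
    # accumulate per-cluster (sum, count) pairs, so the master never
    # needs to see individual points.
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        j = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[j] += x
        counts[j] += 1
    return sums, counts

def parallel_kmeans(partitions, k, iters=10, seed=0):
    rng = random.Random(seed)
    # Step 1: the master picks k random seeds and "broadcasts" them.
    centroids = rng.sample([x for p in partitions for x in p], k)
    for _ in range(iters):
        # Each call would run on a separate machine in a real cluster.
        all_stats = [local_stats(p, centroids) for p in partitions]
        # Step 4: the master merges the partial stats and updates centroids.
        for j in range(k):
            total = sum(s[j] for s, _ in all_stats)
            count = sum(c[j] for _, c in all_stats)
            if count:
                centroids[j] = total / count
    return sorted(centroids)

parts = [[1.0, 1.2, 0.8], [9.9, 10.1], [1.1, 10.0]]  # 3 "machines"
centers = parallel_kmeans(parts, k=2)
```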

 

Grid Search

- ์ •์˜: right hyper-parameter๋ฅผ ์ฐพ๋Š” ๊ฒƒ

- ์˜ˆ: k-means clustering์—์„œ ์˜ฌ๋ฐ”๋ฅธ k๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ

- parallelํ•˜๊ฒŒ ์‹คํ–‰๋  ๊ฒฝ์šฐ ์ตœ์ ์˜ ๊ฐ’์„ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค.
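A sketch of evaluating grid candidates in parallel; `score` here is a hypothetical stand-in for "train a model with this k and measure validation quality" (a real search would dispatch the training jobs to processes or machines):

```python
from concurrent.futures import ThreadPoolExecutor

def score(k):
    # Hypothetical validation score for hyperparameter k; peaks at k = 7.
    # In practice this would train a model (e.g. k-means with k clusters)
    # and return a quality measure.
    return -(k - 7) ** 2

def grid_search(candidates):
    # Each candidate is independent, so all can be evaluated in parallel.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(score, candidates))
    best_score, best_k = max(zip(scores, candidates))
    return best_k

best = grid_search(range(2, 15))
```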

 

Complexity of Distributed Processing

- It grows rapidly with the number of machines:

* 1 machine: keep the cores of my own box busy

* 2 machines: run the program on a few boxes

* many machines: systems such as MapReduce are needed to manage them efficiently

 

MapReduce / Hadoop

- Google's MapReduce paradigm for distributed computing is implemented in open-source systems such as Hadoop and Spark

- A simple parallel programming model

- Straightforward scaling to hundreds or thousands of machines

- Fault tolerance through redundancy

 

Divide and Conquer

1. Partition: split one job into several pieces

2. Perform each piece separately

3. Combine: merge the partial results into one

 

Typical Big Data Problem

- Iterate over a huge number of records

- Extract something of interest from each

- Shuffle and sort the intermediate results

- Aggregate the intermediate results

- Generate the final output

- Think of word counting and k-means clustering

 

MapReduce์˜ ์•„์ด๋””์–ด

- Scale Out: ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ธฐ๊ณ„๋ฅผ ๋ถ™์ด๊ธฐ (up-> ๊ธฐ๊ณ„ ์„ฑ๋Šฅ ์žฅ์ฒด๋ฅผ ์˜ฌ๋ฆฌ๋Š” ๊ฒƒ->X)

* ์Šค์ผ€์ผ ์—…์€ ๋ฉ”๋ชจ๋ฆฌ bottleneck ํ˜„์ƒ ๋“ฑ์ด ๋ฐœ์ƒํ•  ์œ„ํ—˜์ด ์žˆ๋‹ค.

- Move processing to the data: ํด๋Ÿฌ์Šคํ„ฐ๋“ค์ด ์ œํ•œ๋œ bandwidth๋ฅผ ๊ฐ€์ง„๋‹ค

- ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๋žœ๋ค ์ ‘๊ทผ์„ ํ”ผํ•˜๋ผ: seek์€ ๋น„์šฉ์ด ๋งŽ์ด ๋“ค์ง€๋งŒ, disk throughput์€ ํ•ฉ๋ฆฌ์ 

* HDD ๋“ฑ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์„ ๋•Œ head๋ฅผ ์ด๋™ํ•ด๊ฐ€๋ฉฐ ์ฝ๋Š”๋‹ค. ์ด๊ฒƒ์„ ์–ด๋–ป๊ฒŒ ์ตœ์ ํ™”ํ•˜๋Š๋ƒ?

 

Components of Hadoop

1. Hadoop/MapReduce

- ๋ถ„์‚ฐ๋œ ๋น…๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์ธํ”„๋ผ๊ตฌ์กฐ

- abstract/paradigm

- fault-tolerant: ๋ฐ์ดํ„ฐ๋ฅผ 100๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด์ฃผ๋Š”๋ฐ, ๊ทธ์ค‘ 1๊ฐœ์˜ ์ฒ˜๋ฆฌ๊ธฐ๊ฐ€ ๊ณ ์žฅ๋‚˜๋ฉด ์ผ๋‹จ ์žฌ์‹œ์ž‘๋ถ€ํ„ฐ ํ•จ -> ๊ทธ๋ž˜๋„ ๋™์ž‘ํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ๊ธฐ์—๊ฒŒ ํ• ๋‹น

- schedule: job assignment ๋“ฑ์„ ์กฐ์œจ

- execution

 

2. HDFS (Hadoop Distributed File System)

- Fault-tolerant: survives hard-disk failures

- High bandwidth

- High-availability distributed storage

- At least 3 replicas of the data are kept

 

MapReduce Word Count

- The Map and Reduce functions

map(k,v) -> [(k', v')]
reduce(k', [v']) -> [(k',v')]

- Word Count

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
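The pseudocode runs almost verbatim in Python if we simulate the shuffle/sort phase with a dictionary of lists (a single-process sketch, not a real distributed run):

```python
from collections import defaultdict

def map_fn(docid, text):
    # Map(docid, text): emit (word, 1) for each word in the document.
    return [(w, 1) for w in text.split()]

def reduce_fn(term, values):
    # Reduce(term, [counts]): sum the counts for one word.
    return (term, sum(values))

def mapreduce(docs):
    # Shuffle/sort: group all intermediate (k', v') pairs by key.
    groups = defaultdict(list)
    for docid, text in docs.items():
        for k, v in map_fn(docid, text):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = mapreduce({"d1": "to be or not to be", "d2": "to do"})
```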

 

The MapReduce Runtime

- Handles scheduling: assigns map and reduce tasks to workers

- Handles data distribution: moves processes to the data

- Handles synchronization: gathers, sorts, and merges intermediate data

- Handles errors and faults: detects worker failures and restarts tasks

 

Hadoop Distributed File System (HDFS)

- Data is stored on the local disks of the cluster's nodes -> there is not enough RAM to hold all the data in memory

- Disk access is slow, but disk throughput is reasonable -> a linear scan through a file is fine

- All data is replicated 3 times -> reliability on commodity hardware

 

Cloud Computing Services

- Platforms such as Amazon's can provide many machines for short-term jobs

 

Feel Free to Experiment: micro instances are cheap enough to be worth trying
