
Ensemble Learning์ด๋ž€?

์—ฌ๋Ÿฌ ๊ฐœ์˜ base-learner(base-model)๋ฅผ ์กฐํ•ฉํ•˜๋Š” ๋ชจ๋ธ

 

Bagging

- ๋‹ค๋ฅธ ์ด๋ฆ„: Bootstrap Aggregating

- Bootstrap: Random Sampling ๋ฐฉ๋ฒ•๋ก  ์ค‘ ํ•˜๋‚˜. ๋ณต์› ์ถ”์ถœ์„ ์‹œํ–‰ํ•œ๋‹ค.

- ๋ณต์› ์ถ”์ถœ์„ ํ†ตํ•ด ๋™์ผํ•œ ํฌ๊ธฐ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์—ฌ๋Ÿฌ ๊ฐœ ์ƒ์„ฑํ•œ ํ›„, ์•™์ƒ๋ธ”์„ ๊ตฌ์„ฑํ•˜๋Š” ํ•™์Šต ๋ฐฉ์‹

- ๋‹จ์ : ๋žœ๋ค ์ƒ˜ํ”Œ๋ง์€ ์šด์— ์˜์กดํ•œ๋‹ค.

๋งŒ์•ฝ, decision tree๋ฅผ ํ•™์Šต์‹œํ‚จ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด์ž.

boostrap์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•  ๊ฒฝ์šฐ, ๋ณต์› ์ถ”์ถœ์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฝ‘ํž ์ˆ˜ ์žˆ๊ณ , ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์…‹๋ผ๋ฆฌ ์œ ์‚ฌํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ ์ด๋Ÿฐ ๊ฒฝ์šฐ decision tree์˜ root node๋Š” ๋ชจ๋ธ ๊ฐ„์— ํ•ญ์ƒ ๋น„์Šทํ•ด์งˆ ์ˆ˜๋ฐ–์— ์—†๋‹ค. 

 

Boosting

- Bagging๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์ด randomness์— ์˜์กดํ•œ๋‹ค๋Š” ๋‹จ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆ๋œ ๋ฐฉ์‹

- ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์„ 3๊ฐœ๋กœ ๋ถ„ํ• ํ•œ๋‹ค. (D1, D2, D3)

- ํ•™์Šต ๋ฐฉ์‹

base-learner๋ฅผ 3๊ฐœ ํ•™์Šต์‹œํ‚จ๋‹ค.

1. h1์„ D1์œผ๋กœ ํ•™์Šต์‹œํ‚จ๋‹ค.

2. h2๋ฅผ D2 ์ค‘ h1์ด ํ‹€๋ ธ๋˜ ๋ฌธ์ œ + ๋งž์•˜๋˜ ๋ฌธ์ œ ๊ฐ™์€ ๋น„์œจ๋กœ ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต์‹œํ‚จ๋‹ค.

3. h3์„ h2๊ฐ€ ํ‹€๋ ธ๋˜ ๋ฌธ์ œ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต์‹œํ‚จ๋‹ค.

- ์˜ˆ์ธก ๋ฐฉ์‹

h1(x)=h2(x)๋ผ๋ฉด, h1(x)๋ฅผ ๋ฐ˜ํ™˜.

์•„๋‹ˆ๋ผ๋ฉด h3(x)์„ ๋ฐ˜ํ™˜. => ํ‹€๋ ธ๋˜ ๋ฌธ์ œ๋งŒ์œผ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์ด๋ฏ€๋กœ ๊ฐ€์žฅ ๋งž์„ ํ™•๋ฅ ์ด ๋†’๋‹ค.

Statement = Proposition

- Definition: a sentence that can be judged true or false.

- Kinds

1) Universal Statement: true for every element of a set.

2) Conditional Statement: asserts that if one condition holds, something else holds.

3) Existential Statement: at least one element satisfies the condition.

4) Universal Conditional Statement: universal & conditional.

5) Universal Existential Statement: the first part is universal, the second part existential.

6) Existential Universal Statement: the first part is existential, the second part universal.

 

Set : ์ง‘ํ•ฉ

- ์ •์˜: ํŠน์ • ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ์š”์†Œ๋“ค์˜ ๋ชจ์ž„

- ex) { x | 0 < x < 5 }

 

Russell's Paradox

R = {x | x is a set and x is not an element of itself}

์ •์˜์— ๋”ฐ๋ฅด๋ฉด R์ด R์˜ ์š”์†Œ๋ผ๋ฉด, R์€ R์˜ ์š”์†Œ์ผ ์ˆ˜ ์—†๋‹ค.

๋˜ํ•œ, R์ด R์˜ ์š”์†Œ๊ฐ€ ์•„๋‹ˆ๋ผ๋ฉด, R์€ R์˜ ์š”์†Œ์ด๋‹ค.

-> ๋ชจ์ˆœ์ 

 

Cartesian product

A X B : Cartesian product of A and B

B X A : Cartesian product of B and A

์ •์˜: A X B = {(x, y) | x is a member of A, y is a member of B}

ex) A = {1, 2} / B = {3, 4} -> A X B = {(1,3), (1,4), (2,3), (2,4)}
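The same product can be computed with Python's `itertools.product`:

```python
from itertools import product

A = {1, 2}
B = {3, 4}
AxB = set(product(A, B))   # {(1, 3), (1, 4), (2, 3), (2, 4)}
BxA = set(product(B, A))   # note: B x A is a different set
```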

 

Relation

- ๊ณต์ง‘ํ•ฉ์ด ์•„๋‹Œ ์ง‘ํ•ฉ A์— ๋Œ€ํ•ด A์— ๋Œ€ํ•œ relation = A X A์˜ subset

- ์ข…๋ฅ˜

1) Reflexive

ex) A = {1, 2, 3, 4}์ผ ๋•Œ

R1 = { (1, 1), (2, 2) } ๋ผ๋ฉด, ์ด ์ง‘ํ•ฉ์€ Reflexiveํ•˜์ง€ ์•Š๋‹ค.

์ด์œ : (3, 3), (4, 4)๊ฐ€ ์—†๋‹ค.

R2 = { (1, 1), (2, 2), (3, 3), (4, 4), (2, 3) } ๋ผ๋ฉด, ์ด ์ง‘ํ•ฉ์€ Reflexiveํ•˜๋‹ค.

์ด์œ : (1, 1)~(4, 4)๊ฐ€ ๋ชจ๋‘ ์žˆ๋‹ค. (2, 3)์€ ์ƒ๊ด€์ด ์—†๋‹ค.

 

2) Symmetric

ex) R3 = ๊ณต์ง‘ํ•ฉ์ด๋ผ๋ฉด, ์ด ์ง‘ํ•ฉ์€ Symmetricํ•˜๋‹ค.

์ด์œ : ์ง‘ํ•ฉ์—์„œ ์š”์†Œ๋“ค์˜ ์Œ์ด ์—†๊ธฐ ๋•Œ๋ฌธ์—

R4 = { (1, 2), (2, 1), (2, 2) } ๋ผ๋ฉด, ์ด ์ง‘ํ•ฉ์€ Symmetricํ•˜๋‹ค.

R5 =  { (1, 2), (2, 1), (2, 3) } ๋ผ๋ฉด, (3, 2)๊ฐ€ ์—†์–ด Symmetricํ•˜์ง€ ์•Š๋‹ค.

 

3) Transitive

If R6 = { (1, 5), (5, 1), (1, 1) }, transitivity fails.

(1, 5), (5, 1) -> (1, 1) holds, but

(5, 1), (1, 5) -> (5, 5) would also be required, and (5, 5) is missing.
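All three properties can be checked mechanically; a small sketch over the examples above:

```python
def is_reflexive(R, A):
    """R is reflexive on A iff (a, a) is in R for every a in A."""
    return all((a, a) in R for a in A)

def is_symmetric(R):
    """R is symmetric iff (b, a) is in R whenever (a, b) is."""
    return all((b, a) in R for (a, b) in R)

def is_transitive(R):
    """R is transitive iff (a, b), (b, d) in R imply (a, d) in R."""
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

A = {1, 2, 3, 4}
assert not is_reflexive({(1, 1), (2, 2)}, A)                      # R1
assert is_reflexive({(1, 1), (2, 2), (3, 3), (4, 4), (2, 3)}, A)  # R2
assert is_symmetric(set())                                        # R3: vacuous
assert is_symmetric({(1, 2), (2, 1), (2, 2)})                     # R4
assert not is_symmetric({(1, 2), (2, 1), (2, 3)})                 # R5
assert not is_transitive({(1, 5), (5, 1), (1, 1)})                # R6: no (5, 5)
```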

Greedy Best-First Search

Evaluation Func & Heuristic Func

  1. Evaluation function f(n)
    1. the estimated cost of node n
  2. Heuristic function h(n)
    1. an estimate of the cheapest cost from node n to a goal state
    2. e.g., in a route-finding problem, the straight-line distance from each city to the goal

Greedy Best-First Search

  1. ์ •์˜: best-first search์˜ ์ผ์ข… → h(n) value๊ฐ€ ๊ฐ€์žฅ ์ ์€ ๋…ธ๋“œ๋ฅผ ํ™•์žฅ
    1. table์„ ์œ ์ง€ํ•˜๋Š” ๊ฒฝ์šฐ ๊ทธ๋ž˜ํ”„ (์ค‘๋ณต์„ ๊ณ ๋ ค)
    2. ์—†์• ๋Š” ๊ฒฝ์šฐ tree search (์ค‘๋ณต์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์Œ)
  2. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Completeness → ์‚ฌ์ดํด ๋ฐœ์ƒ
      1. Not complete: tree-like search
      2. complete: graph search
    2. Not cost-optimal
    3. Time Complexity: O(b^m)
    4. Space Complexity: O(b^m)

A* Search

  1. ์ •์˜: f(n) = g(n) + h(n)์„ evaluation function์œผ๋กœ ์ด์šฉ
    1. g(n): n๊นŒ์ง€ ์˜ค๋Š” ๋ฐ ๋น„์šฉ
    2. h(n): n๋ถ€ํ„ฐ goal๊นŒ์ง€ ๊ฐ€๋Š” ๋ฐ ์ถ”์ •๋˜๋Š” ๋น„์šฉ
  2. Admissible Heuristic Function
    1. goal์— ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•œ ๋น„์šฉ์„ ์ ˆ๋Œ€ overestimateํ•˜์ง€ ์•Š์Œ
    2. h(n) <= h*(n) : h(n)์€ ์‹ค์ œ ๊ฐ’
    3. admissible heuristic function์„ ์ด์šฉํ•œ๋‹ค๋ฉด A* search๋Š” cost-optimalํ•˜๋‹ค.
  3. Consistent Heuristic Function
    1. consistent: ๋ชจ๋“  ๋…ธ๋“œ n๊ณผ a๋ผ๋Š” ํ–‰๋™์œผ๋กœ ์ƒ์„ฑ๋œ ์ž์‹ n’์ด ์žˆ์„ ๋•Œ
    2. h(n) <= c(n,a,n') + h(n') ์„ฑ๋ฆฝ
    3. consistent heuristic function์„ ์ด์šฉํ•œ๋‹ค๋ฉด A* search๋Š” cost-optimalํ•˜๋‹ค → ๋ชจ๋“  consistent heuristic์€ admissibleํ•˜๋‹ค.
  4. Search contours
    1. A*๋Š” ๊ฐ€์žฅ ์ ์€ f-cost๋ฅผ ๊ฐ€์ง„ frontier ๋…ธ๋“œ๋ฅผ ํ™•์žฅํ•˜๋ฏ€๋กœ ๋“ฑ๊ณ ์„ ์ด ์ปค์ง€๋ฉด์„œ cost๊ฐ€ ์ ์  ์ปค์ง€๋Š” ํ˜•ํƒœ์ด๋‹ค.
    2. ์ข‹์€ ํœด๋ฆฌ์Šคํ‹ฑ์„ ๊ฐ€์กŒ๋‹ค๋ฉด, ๋“ฑ๊ณ ์„ ์€ ์ ์  goal state์— ๊ฐ€๊นŒ์›Œ์งˆ ๊ฒƒ์ด๋ฉฐ ์ตœ์  ๊ฒฝ๋กœ์— ๊ฐ€๊นŒ์›Œ์งˆ์ˆ˜๋ก ๋” ์ข์•„์งˆ ๊ฒƒ์ด๋‹ค.
  5. Optimally Efficient → C* : optimal cost
    1. surely expanded: f(n)<C*
    2. depends: f(n)=C*
    3. never expended: f(n) > C*
  6. Weighted A* Search
    1. ์ •์˜: C์™€ WC*์‚ฌ์ด์˜ cost๋ฅผ ๊ฐ€์ง€๋Š” ํ•ด๋‹ต์„ ์ฐพ๋Š” ๊ฒƒ
    2. f(n)=g(n)+W*h(n)
    3. Suboptimal search techniques
      1. bounded suboptimal search: ์‹ค์ œ optimal cost์˜ 1/2๋ฐฐ ..
      2. bounded-cost search: ์‹ค์ œ cost๋Š” ๋ชจ๋ฅด๊ณ  ์–ผ๋งˆ ์•ˆ์— ๋„๋‹ฌํ•˜๋Š” ๊ฒฝ๋กœ ์ฐพ๊ธฐ
      3. unbounded-cost search
      4. beam search: ๋ฉ”๋ชจ๋ฆฌ๋Š” ์ ๊ฒŒ ์œ ์ง€ → ๊ฐ€์žฅ ์ข‹์€ score์ธ ๊ฒƒ๋งŒ ์œ ์ง€ํ•˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ๊ฐ€์ง€์น˜๊ธฐ
  7. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete
    2. Cost-optimal
    3. Time-complexity <= O(b^d)
    4. Space complexity <= O(b^d)
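A minimal A* sketch over an explicit graph (the graph, step costs, and heuristic values are invented for illustration; the heuristic is admissible):

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* with f(n) = g(n) + h(n); returns (cost, path) or None."""
    frontier = [(h(start), 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, n, path = heapq.heappop(frontier)
        if n == goal:
            return g, path
        for m, step_cost in neighbors(n):
            g2 = g + step_cost
            if g2 < best_g.get(m, float("inf")):
                best_g[m] = g2
                heapq.heappush(frontier, (g2 + h(m), g2, m, path + [m]))
    return None

graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
hvals = {"S": 4, "A": 4, "B": 1, "G": 0}   # admissible: never overestimates
cost, path = a_star("S", "G", lambda n: graph[n], hvals.get)
# cost 5 along S -> B -> G, even though S -> A looks cheaper at first
```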

Iterative-Deepening A* Search

  1. ์ •์˜ (IDA*)
    1. iterative deepening search + A* Search
    2. ์ด์ „ ๋ฐ˜๋ณต์—์„œ cutoff๋ฅผ ์ดˆ๊ณผํ•˜์˜€๋˜, f-cost๊ฐ€ ๊ฐ€์žฅ ์ž‘์€ ์ž„์˜์˜ ๋…ธ๋“œ๋ฅผ cutoff → ๊ธฐ์ค€๋ณด๋‹ค ํฌ์ง€ ์•Š์•˜๋‹ค๋ฉด ๊ณ„์† ๋‚ด๋ ค๊ฐ
  2. ํŠน์ง•
    1. f-cost๊ฐ€ ๋ชจ๋‘ ์ •์ˆ˜์ผ ๋•Œ ์ž˜ ์ž‘๋™
    2. f-cost๊ฐ€ ๋ชจ๋‘ ๋‹ค๋ฅธ ๋…ธ๋“œ์ผ ๋•Œ, ๋ฐ˜๋ณต ์ˆซ์ž๊ฐ€ ๊ฒฐ๊ตญ state์ˆ˜์™€ ๊ฐ™์•„์งˆ ์ˆ˜ ์žˆ๋‹ค.

Recursive Best-First Search

  1. ์ •์˜ (RBFS)
    1. recursive depth-first search + best-first search
    2. f-limit์„ ํ˜„์žฌ ๋…ธ๋“œ๊ฐ€ ์ดˆ๊ณผํ•œ๋‹ค๋ฉด ๋‹ค์‹œ ๊ฐ์•„ ๋‚ด๋ ค๊ฐ€์„œ alternative path๋กœ ์ด๋™
  2. ์ง„ํ–‰ ๊ณผ์ •
    1. ์ผ๋‹จ ์ฒ˜์Œ limit = ๋ฌดํ•œ๋Œ€
    2. ๋‹ค์Œ ๊ฒฝ๋กœ ์ค‘ ๊ฐ€์žฅ ์ž‘์€ cost์ธ ์ชฝ์„ ์„ ํƒ, limit์€ ๋‘ ๋ฒˆ์งธ๋กœ ๊ฐ€์žฅ ์ž‘์€ ๊ฐ’์œผ๋กœ ๋ณ€๊ฒฝ
    3. ๋‚ด๋ ค๊ฐ€์„œ limit๊ณผ cost์˜ ๊ฐ’์„ ๋น„๊ต → cost๊ฐ€ limit๋ณด๋‹ค ํฌ๋ฉด ํ•ด๋‹น cost๋ฅผ limit์œผ๋กœ ๋ณ€๊ฒฝํ•˜๊ณ , limit๊ณผ ๊ฐ™์€ cost๋ฅผ ๊ฐ€์กŒ๋˜ node๋ฅผ ํ™•์ธ
  3. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete
    2. Cost-optimal
    3. time complexity <= O(b^d)
    4. space complexity <= O(bd)
      • ์ง€๋‚˜์น˜๊ฒŒ ์ ์€ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ → ๊ณ„์† ๊ฐ™์€ state๋ฅผ ๋ฐ˜๋ณตํ•˜์—ฌ ํƒ์ƒ‰

Simplified Memory-Bounded A* Search

  1. ์ •์˜
    1. A* search๋ฅผ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ฝ‰ ์ฐฐ ๋•Œ๊นŒ์ง€ ์ง„ํ–‰
    2. ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ฝ‰ ์ฐผ๋‹ค๋ฉด f-cost๊ฐ€ ๊ฐ€์žฅ ํฐ worst leaf node๋ฅผ ๋ฒ„๋ฆผ
    3. ์žŠ์–ด๋ฒ„๋ฆฐ ๋…ธ๋“œ์˜ ๊ฐ’์„ ๋‹ค์‹œ ๋ถ€๋ชจ์—๊ฒŒ ์ „๋‹ฌ
  2. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete
    2. Cost-optimal
    3. time complexity <= O(b^d)
    4. space complexity <= O(b^d)

Example Problems

  1. Robotic Vacuum Cleaner
  2. 8-Puzzle
  3. 8-queens
  4. Route-finding problem

Solving Problems by Searching

์šฉ์–ด ์ •์˜

  1. Problem formulation: ์–ด๋–ค ๊ณ ๋ คํ•  state, action์„ ๊ฒฐ์ •ํ• ์ง€ ์ •ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค → ๋ชฉํ‘œ๋Š” ์ฃผ์–ด์ ธ ์žˆ์Œ
  2. ๋ฌธ์ œ๋Š” state space, initial state, goal state, action, transition model, action cost function์œผ๋กœ ๊ตฌ์„ฑ
  3. action์˜ ์—ฐ์†์€ path๋ฅผ ๊ตฌ์„ฑ
  4. solution: initial state→goal state๋กœ ๊ฐ€๋Š” path
  5. optimal solution: path cost๊ฐ€ ๊ฐ€์žฅ ์ ์€ ํ•ด๋‹ต
  6. search: ๋ชฉํ‘œ์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ๋Š” ํ–‰๋™์˜ ์—ฐ์†์„ ์ฐพ๋Š” ๊ฒƒ
  7. execution phase์—์„œ ํ–‰๋™์„ ํ•œ ๋ฒˆ์— ํ•œ ๊ฐœ์”ฉ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

Abstraction

  1. ์ •์˜: ํ‘œํ˜„์— ํ•„์š” ์—†๋Š” ๋””ํ…Œ์ผ์€ ๊ฐ์ถ”๋Š” ๊ฒƒ
    1. Useful abstraction
    2. Valid abstraction
    3. level of abstraction

State space

  1. ์ •์˜: ํ™˜๊ฒฝ์ด ์ฒ˜ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€๋Šฅํ•œ state์˜ ์ง‘ํ•ฉ
  2. ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„: vertices → state, edges → actions
  3. solution

Search Tree

  1. Node expansion: Node generation, parent / child / successor node
  2. reached state: a state for which a corresponding node has been generated
  3. frontier: the set of unexpanded nodes
  4. if redundant paths and cycles matter → graph search; if not → tree-like search

ํƒ์ƒ‰ ์ „๋žต ๋ฐ ์„ฑ๋Šฅ ํ‰๊ฐ€

  1. ํƒ์ƒ‰ ์ „๋žต
    1. FIFO
    2. LIFO
    3. Highest priority first
  2. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. completeness
    2. cost optimality
    3. time complexity
    4. space complexity

Task Environment์˜ ํŠน์ง•

  • Task environment: ์„ฑ๋Šฅํ‰๊ฐ€, ํ™˜๊ฒฝ, actuator, sensor
  • fully observable, partially observable, unobservable ⇒ sensor
  • deterministic, nondeterministic, stochastic ⇒ transition model
  • discrete, continuous ⇒ state, action, sensor, time
  • known, unknown ⇒ laws of physics
  • single, multiagent ⇒ agent
  • episodic, sequential ⇒ dependency
  • static, dynamic, semi-dynamic ⇒ state and action
  • fully observable, deterministic, known environment → solution = fixed sequence of actions

Uninformed Searches

Breadth-First Search

  1. ์ •์˜: ๋ฃจํŠธ ๋…ธ๋“œ๊ฐ€ ๊ฐ€์žฅ ๋จผ์ € ํ™•์žฅ๋˜๊ณ , ๋ฃจํŠธ ๋…ธ๋“œ์˜ ์ž์‹์ด ํ™•์žฅ๋˜๊ณ , ๊ทธ ๋‹ค์Œ ๊ทธ๋“ค์˜ ์ž์‹์ด ํ™•์žฅ๋˜๊ณ … ํ•˜๋Š” ์‹
  2. ์ˆ˜๋„์ฝ”๋“œ ํŠน์ง•
    • IsGoal ํ•จ์ˆ˜: frontier์— FIFO queue๋ฅผ ์‚ฌ์šฉ
    • reached: ๋…ธ๋“œ๊ฐ€ ์ƒ์„ฑ๋˜๋ฉด reached list์— ์ €์žฅ
  3. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete: ์œ ํ•œํ•œ branching factor, ์œ ํ•œ/๋ฌดํ•œ state space
    2. Cost optimalํ•˜์ง€ ์•Š์Œ
    3. time complexity: O(b^d) → b: branching factor, d: ๊ฐ€์žฅ ์–•์€ ํ•ด
    4. space complexity: O(b^d)
    5. ๋‹ต์ด ์žˆ์œผ๋ฉด ๋ฌด์กฐ๊ฑด ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค.

Uniform-Cost Search

  1. ์ •์˜: ๊ฐ€์žฅ ์ ์€ path cost g(n)์„ ๊ฐ€์ง„ ๋…ธ๋“œ n์œผ๋กœ ํ™•์žฅ
  2. ์ˆ˜๋„์ฝ”๋“œ ํŠน์ง•
    • reached: lookup table → key problem.Initial & value node
    • child.PathCost < reached[s].PathCost : ๋ฐฉ๋ฌธ์€ ์ด๋ฏธ ํ•œ ์ƒํƒœ์—์„œ ๋น„๊ต๋งŒ
    • late goal test
  3. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete: ๋ชจ๋“  cost๊ฐ€ ์–‘์ˆ˜๋ผ๋Š” ์กฐ๊ฑด์ด ์žˆ์–ด์•ผ ํ•จ
    2. Cost-Optimal
    3. Time complexity: O(b^{1+\lfloor{C/\epsilon}\rfloor})
    4. Space complexity: O(b^{1+\lfloor{C/\epsilon}\rfloor}) → C*: optimal solution์˜ cost
  4. ํ•˜๋‚˜๋งŒ ์„ ํƒํ•˜๋ ค๋ฉด time๋ณด๋‹ค๋Š” space๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์ด ํ•„์š”

Depth-First Search

  1. ์ •์˜: Frontier์— ์žˆ๋Š” ๊ฐ€์žฅ ๊นŠ์€ ๋…ธ๋“œ๋ฅผ ํ™•์žฅ
    • backing search: ๋‹จ 1๊ฐœ์˜ successor๋งŒ์ด ์ƒ์„ฑ๋จ → ๋ถ€๋ถ„์ ์œผ๋กœ ํ™•์žฅ๋œ ๋…ธ๋“œ๋Š” ๋‹ค์Œ ๋ฒˆ์— ํ™•์žฅ๋  ๋…ธ๋“œ์— ๋Œ€ํ•ด์„œ ๊ธฐ์–ต
    • ์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ํ˜„์žฌ ์ƒํƒœ ์„ค๋ช…์„ ์ง์ ‘ ๋ณ€๊ฒฝํ•จ์œผ๋กœ์จ successor ์ƒ์„ฑ → backtrack์ด ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋˜๋Œ๋ฆฌ๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•ด์•ผ ํ•จ
  2. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Completeness
      • not complete: infinite state space, tree-like search
      • complete: finite, graph
    2. Not cost-optimal
    3. Time complexity: $O(b^m)$ → m: ํŠธ๋ฆฌ์˜ ์ตœ๋Œ€ ๊นŠ์ด
    4. Space complexity
      1. O(b^m): graph search → backtracking ์ •๋ณด๋ฅผ ์œ ์ง€ํ•ด์•ผ ํ•˜๋ฏ€๋กœ
      2. O(bm): tree-like search → ํŠธ๋ฆฌ ์„œ์น˜๋ฅผ ํ•˜๋ฉด completeํ•ด์ง€์ง€ ์•Š์œผ๋ฏ€๋กœ
      3. O(m): backtracking search

Depth-Limited Search

  1. ์ •์˜: depth limit l์ด ์ •ํ•ด์ ธ ์žˆ์Œ → ๊นŠ์ด l์ธ ๋…ธ๋“œ์˜ ๊ฒฝ์šฐ๋Š” successor๊ฐ€ ์—†๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ทจ๊ธ‰
  2. ์ˆ˜๋„์ฝ”๋“œ ํŠน์ง•
    1. result ← cutoff
  3. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Not complete
    2. Not cost-optimal
    3. time complexity: $O(b^l)$ → l: depth limit
    4. space complexity: $O(bl)$ → ๋ฐฉ๋ฌธํ•œ ๋…ธ๋“œ ๋ฆฌ์ŠคํŠธ๋ฅผ ์œ ์ง€ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ

Iterative Deepening Search

  1. ์ •์˜: depth-limited search + increasing limits
  2. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete
    2. Not cost-optimal
    3. Time complexity: O(b^d)
    4. Space complexity: O(bd) → ๋ฉ”๋ชจ๋ฆฌ ์ ˆ์•ฝ
  3. ํŠน์ง•
    1. depth first search์™€ breadth first search์˜ ์žฅ์ ์„ ๋ชจ๋‘ ํฌํ•จ
    2. hybrid approach๋กœ ์ด์šฉ ๊ฐ€๋Šฅ
    3. state space๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ๋ณด๋‹ค ํฌ๊ณ  depth๊ฐ€ ์•Œ๋ ค์ง€์ง€ ์•Š์•˜์„ ๋•Œ ์ด์šฉ
    4. Iterative lengthening search: cost๊ฐ€ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ๋ฉด cost๊ฐ€ ์ปค์ง€๋Š” ๋ฐฉ์‹์œผ๋กœ deepening
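A compact iterative deepening sketch, i.e., depth-limited DFS restarted with growing limits (the graph is illustrative):

```python
def depth_limited(node, goal, neighbors, limit, path):
    """DFS that treats nodes at the depth limit as having no successors."""
    if node == goal:
        return path
    if limit == 0:
        return None                    # cutoff
    for m in neighbors(node):
        if m not in path:              # avoid cycles along the current path
            found = depth_limited(m, goal, neighbors, limit - 1, path + [m])
            if found:
                return found
    return None

def iterative_deepening(start, goal, neighbors, max_depth=20):
    """Restart depth-limited search with limits 0, 1, 2, ..."""
    for limit in range(max_depth + 1):
        found = depth_limited(start, goal, neighbors, limit, [start])
        if found:
            return found
    return None

g = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
path = iterative_deepening("A", "D", g.get)
# found at limit 2: ["A", "B", "D"]
```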

Bidirectional Search

  1. ์ •์˜: 2๊ฐœ์˜ frontier๋กœ ํ™•์žฅ → 1๊ฐœ๋Š” initial state, 1๊ฐœ๋Š” goal state → ์ค‘๊ฐ„์—์„œ ๋งŒ๋‚˜๊ธฐ๋ฅผ ๊ธฐ์›
  2. ์ˆ˜๋„์ฝ”๋“œ ํŠน์ง•
    1. solution ์„ ํƒ ์‹œ ์–ด๋–ค ๋ฐฉํ–ฅ์ด ๊ฐ€์žฅ ๋น„์šฉ์ด ์ ์„์ง€ ํŒ๋‹จํ•˜์—ฌ ๊ทธ๊ฒƒ์„ ์„ ํƒํ•จ
    2. proceed ํ•จ์ˆ˜: ์ƒ๋Œ€๋ฐฉ์˜ reached state ์ •๋ณด๋ฅผ ๋ฐ›์•„์„œ ์‹คํ–‰
  3. ์„ฑ๋Šฅ ํ‰๊ฐ€
    1. Complete
    2. Not cost-optimal
    3. time complexity: O(b^{d/2})
    4. space complexity: O(b^{d/2})


Big Data Intro

- 2003๋…„ ์ „๊นŒ์ง€ 5 ์—‘์‚ฌ๋ฐ”์ดํŠธ์˜ ์ •๋ณด๋ฅผ ์ƒ์‚ฐํ–ˆ์Œ

- ์ด์ œ ์ดํ‹€๋งˆ๋‹ค 5 ์—‘์‚ฌ๋ฐ”์ดํŠธ์˜ ์ •๋ณด๋ฅผ ์ƒ์‚ฐ -> ๋น…๋ฐ์ดํ„ฐ

 

Big Data์˜ 4V

- Volume

- Variety

- Velocity

- Veracity

 

Science Paradigms

- ๋ช‡์ฒœ ๋…„ ์ „: ๊ณผํ•™์€ empirical ํ–ˆ๋‹ค. -> ์ž์—ฐ ํ˜„์ƒ์„ ์„œ์ˆ 

- ๋ช‡๋ฐฑ ๋…„ ์ „: theoretical branch -> ์ด๋ก , ๋ชจ๋ธ๋ง

- ๋ช‡์‹ญ ๋…„ ์ „: computational -> ์‹œ๋ฎฌ๋ ˆ์ด์…˜

- ์˜ค๋Š˜๋‚ : data exploration

- ๋ฏธ๋ž˜: Data-driven Science -> Data-driven Hypothesis Generation

 

Big Data Challenges

- ๋น…๋ฐ์ดํ„ฐ

* ํฌ๊ณ  ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ

* ์˜ˆ: social, web, financial transaction data, academic articles, genomic data...

- 2๊ฐ€์ง€์˜ ๊ณ ๋‚œ

* ํšจ์œจ์  ์ €์žฅ ๋ฐ ์ ‘๊ทผ

* Data Analytics -> ๊ฐ€์น˜ ์žˆ๋Š” ์ •๋ณด ์บ๋‚ด๊ธฐ

 

Data Science and Big data

- ๋น…๋ฐ์ดํ„ฐ๋ผ๋Š” ์šฉ์–ด๋Š” ๊ฑฐ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์…‹์˜ ๋ถ„์„์„ ๋œปํ•จ

- ์˜ˆ: ํŠธ์œ„ํ„ฐ/ํŽ˜์ด์Šค๋ถ ์ „์ฒด, ์ฃผ์š” ์‚ฌ์ดํŠธ์˜ ์›น ๋กœ๊ทธ, ๋ช‡์ฒœ ๋ช…์˜ genome sequence, Flickr์˜ ์ „์ฒด ์ด๋ฏธ์ง€...

- ํฌ๊ธฐ๊ฐ€ ์ปค์ง€๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๊ธฐ ๋” ์–ด๋ ค์›Œ์ง„๋‹ค

 

Big data as Bad data

- Unrepresentative participation (bias): Instagram skews young, The New York Times skews liberal, Fox News skews conservative, The Wall Street Journal is read mostly by the wealthy.

- Spam and machine-generated content: bots, fake news, etc.

- Power laws: imply redundancy -> in a building-detection task, landmark data is plentiful, so the model may recognize landmarks well while failing on ordinary buildings.

- Susceptibility to temporal bias (e.g., Google Flu Trends): Google's autocomplete feature changed the distribution of search queries.

 

Large-Scale Machine Learning

- ์šฐ๋ฆฌ๊ฐ€ ์ง€๊ธˆ๊นŒ์ง€ ๊ณต๋ถ€ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ์•„์ฃผ ํฐ ๋ฐ์ดํ„ฐ์…‹์— ์ž˜ ๋งž์ง€ ์•Š๋Š”๋‹ค.

- ์•„์ฃผ ํฐ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ ์€ ๋ชจ๋ธ์ด ์ž˜ ์ž‘๋™ํ•˜์ง€ ์•Š๋Š”๋‹ค.

- ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์‹œ๊ฐ„๋ณต์žก๋„๊ฐ€ linear -> O(n)์ด์–ด์•ผ ํ•œ๋‹ค.

- Big matrices๋Š” ๋น…๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํฌ์†Œํ•  ์ˆ˜ ์žˆ๋‹ค.

* ์˜ˆ) ๋„ทํ”Œ๋ฆญ์Šค ๊ณ ๊ฐ 1์–ต๋ช…์„ ๋Œ€์ƒ์œผ๋กœ ์ „์ฒด ์˜ํ™”์— ๋Œ€ํ•œ ์œ ์ € ํ‰์  matrix๋ฅผ ๋ถ„์„ํ•œ๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ ์ฐจ์›์ด ์•„์ฃผ ํฌ๊ณ  ํฌ์†Œํ•˜๋‹ค.

-> ๋ชจ๋“  ์‚ฌ๋žŒ์ด ๋ชจ๋“  ์˜ํ™”๋ฅผ ๋ณด์ง€๋Š” ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์—

 

Filtering Data -> domain-specific criteria

- One advantage of big data is that a large portion of it can be thrown away to make the analysis cleaner.

- e.g., only 34% of all tweets on Twitter come from English-language accounts -> the rest can be discarded.

- Filtering out irrelevant or hard-to-interpret data requires application-specific knowledge.

 

Subsampling Data

- Why subsample?

* It allows a clean split into training/validation/testing data.

* Simple, robust models have few parameters -> big data is overkill for them.

cf) hyperparameter fitting and model selection require training many times.

* A spreadsheet-sized dataset can be analyzed easily and quickly.

 

์ ˆ๋‹จ(truncation)์„ ์ด์šฉํ•˜๊ธฐ

- ์ฒซ n๊ฐœ์˜ ๊ธฐ๋ก๋งŒ์„ ์ทจํ•˜๋Š” ๊ฒƒ์€ ์‰ฝ์ง€๋งŒ ์—ฌ๋Ÿฌ ์˜ค๋ฅ˜๋ฅผ ๋ฒ”ํ•  ์ˆ˜ ์žˆ๋‹ค.

* temporal biases (์˜ค๋ž˜๋œ ๋ฐ์ดํ„ฐ๋งŒ์„ ๋ถ„์„ํ•˜๊ฒŒ ๋  ์ˆ˜๋„)

* lexicographic biases (A์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋งŒ ๋ถ„์„ํ•˜๊ฒŒ ๋  ์ˆ˜๋„ -> Arabic์€ ๋ถ„์„๋ฒ”์œ„์— ์žˆ์ง€๋งŒ Korean์€ ํฌํ•จ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค)

* numerical biases (ID ๋ฒˆํ˜ธ ๊ฐ™์€ ๊ฑด ๋ฌด์ž‘์œ„ ๋ถ€์—ฌ๊ฐ€ ์•„๋‹ˆ๋ผ ์˜๋ฏธ๊ฐ€ ์žˆ๋Š” ์ˆซ์ž์ผ ์ˆ˜ ์žˆ์Œ)

 

Stream Sampling

- n์„ ๋ชจ๋ฅผ ๋•Œ size k ์งœ๋ฆฌ ๊ท ์ผ ์ƒ˜ํ”Œ์„ ์ถ”๊ตฌํ•  ๋•Œ๊ฐ€ ์žˆ๋‹ค.

- ํ•ด๊ฒฐ์ฑ…: k๊ฐœ์˜ active element์— ๋Œ€ํ•œ ๋ฐฐ์—ด์„ ์œ ์ง€ํ•˜๊ณ , n๋ฒˆ์จฐ ์š”์†Œ๋ฅผ k/n์˜ ํ™•๋ฅ ๋กœ ๋Œ€์ฒด -> ๋“ค์–ด์™”๋˜ ๊ฒƒ๋„ ์ซ“๊ฒจ๋‚  ์ˆ˜๋„ ์žˆ์Œ

- cut์„ ๋งŒ๋“ค ๋•Œ ์ƒˆ๋กœ์šด ์š”์†Œ์˜ ์œ„์น˜๋ฅผ ๋žœ๋ค์œผ๋กœ ์„ ์ •ํ•  ์ˆ˜ ์žˆ์Œ.

 

Distributed vs. Parallel Processing

- Parallel processing: runs on a single machine (via threads, OS processes, etc.).

- Distributed processing: runs on multiple machines, communicating over the network.

 

Data Parallelism

- One way to exploit parallelism is to split big data across multiple machines and train a model independently on each.

- As with k-means, merging the individual results is hard.

* Parallelized k-means

For example, suppose there are 100 seeds available:

1. One master machine (which handles job assignment, etc.) picks k random seeds and broadcasts them to the other machines.

2. Locally, each machine computes which cluster each of its data points belongs to.

3. From that data, each machine computes new centroid contributions and data counts and sends them back (sending them as sum vectors avoids transmitting every point individually).

4. The master machine collects this information, performs the final centroid update, and broadcasts it again.

5. Repeat.

 

Grid Search

- ์ •์˜: right hyper-parameter๋ฅผ ์ฐพ๋Š” ๊ฒƒ

- ์˜ˆ: k-means clustering์—์„œ ์˜ฌ๋ฐ”๋ฅธ k๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ

- parallelํ•˜๊ฒŒ ์‹คํ–‰๋  ๊ฒฝ์šฐ ์ตœ์ ์˜ ๊ฐ’์„ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

 

Distributed processing์˜ ๋ณต์žก๋„

- machine ์ˆ˜์— ๋”ฐ๋ผ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•œ๋‹ค

* 1๊ฐœ: ๋‚˜์˜ box์˜ ์ค‘์‹ฌ๋ถ€๋ฅผ ๋ฐ”์˜๊ฒŒ ํ•จ

* 2๊ฐœ: ๋ช‡ ๊ฐœ์˜ box์— ๋Œ€ํ•ด ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰์‹œํ‚จ๋‹ค.

* ์—ฌ๋Ÿฌ๊ฐœ: MapReduce ๋“ฑ์˜ ์‹œ์Šคํ…œ์„ ์ด์šฉํ•˜์—ฌ ํšจ์œจ์ ์œผ๋กœ ๊ด€๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 

MapReduce / Hadoop

- distributed computing์„ ์œ„ํ•œ ๊ตฌ๊ธ€์˜ MapReduce ํŒจ๋Ÿฌ๋‹ค์ž„์€ Hadoop, Spark ๋“ฑ์˜ ์˜คํ”ˆ์†Œ์Šค๋กœ ๊ตฌํ˜„๋จ

- ๊ฐ„๋‹จํ•œ parallel programming model

- ์ˆ˜๋ฐฑ/์ˆ˜์ฒœ ๊ฐœ์˜ ๊ธฐ๊ณ„์— ๋Œ€ํ•ด Straightforward scaling

- redundancy(์—ฌ๋ถ„)์„ ํ†ตํ•œ fault tolerance

 

Divide and Conquer

1. ๋ถ„ํ• (partition): ํ•˜๋‚˜์˜ ์ผ์„ ์—ฌ๋Ÿฌ ๊ฐœ๋กœ ์ชผ๊ฐฌ

2. ๊ฐ๊ฐ ์ˆ˜ํ–‰

3. ๋ณ‘ํ•ฉ(combine): ๊ฐ๊ฐ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ•˜๋‚˜๋กœ ํ•ฉ์นจ

 

Typical Big Data Problem

- ์•„์ฃผ ํฐ ์ˆซ์ž์˜ ๊ธฐ๋ก์„ ๋ฐ˜๋ณต

- ๊ฐ๊ฐ์—์„œ ๊ด€์‹ฌ์žˆ๋Š” ๊ฒƒ์„ ๋ฝ‘์•„๋‚ด๊ธฐ

- ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ shuffle, sort

- ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ง‘ํ•ฉ

- ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœ

- word counting, k-means clustering์„ ์ƒ๊ฐํ•ด๋ณด๊ธฐ

 

MapReduce์˜ ์•„์ด๋””์–ด

- Scale Out: ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ธฐ๊ณ„๋ฅผ ๋ถ™์ด๊ธฐ (up-> ๊ธฐ๊ณ„ ์„ฑ๋Šฅ ์žฅ์ฒด๋ฅผ ์˜ฌ๋ฆฌ๋Š” ๊ฒƒ->X)

* ์Šค์ผ€์ผ ์—…์€ ๋ฉ”๋ชจ๋ฆฌ bottleneck ํ˜„์ƒ ๋“ฑ์ด ๋ฐœ์ƒํ•  ์œ„ํ—˜์ด ์žˆ๋‹ค.

- Move processing to the data: ํด๋Ÿฌ์Šคํ„ฐ๋“ค์ด ์ œํ•œ๋œ bandwidth๋ฅผ ๊ฐ€์ง„๋‹ค

- ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๋žœ๋ค ์ ‘๊ทผ์„ ํ”ผํ•˜๋ผ: seek์€ ๋น„์šฉ์ด ๋งŽ์ด ๋“ค์ง€๋งŒ, disk throughput์€ ํ•ฉ๋ฆฌ์ 

* HDD ๋“ฑ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์„ ๋•Œ head๋ฅผ ์ด๋™ํ•ด๊ฐ€๋ฉฐ ์ฝ๋Š”๋‹ค. ์ด๊ฒƒ์„ ์–ด๋–ป๊ฒŒ ์ตœ์ ํ™”ํ•˜๋Š๋ƒ?

 

Components of Hadoop

1. Hadoop/MapReduce

- ๋ถ„์‚ฐ๋œ ๋น…๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์ธํ”„๋ผ๊ตฌ์กฐ

- abstract/paradigm

- fault-tolerant: ๋ฐ์ดํ„ฐ๋ฅผ 100๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด์ฃผ๋Š”๋ฐ, ๊ทธ์ค‘ 1๊ฐœ์˜ ์ฒ˜๋ฆฌ๊ธฐ๊ฐ€ ๊ณ ์žฅ๋‚˜๋ฉด ์ผ๋‹จ ์žฌ์‹œ์ž‘๋ถ€ํ„ฐ ํ•จ -> ๊ทธ๋ž˜๋„ ๋™์ž‘ํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ๊ธฐ์—๊ฒŒ ํ• ๋‹น

- schedule: job assignment ๋“ฑ์„ ์กฐ์œจ

- execution

 

2. HDFS (Hadoop Distributed File System)

- fault-tolerant: ํ•˜๋“œ๋””์Šคํฌ๊ฐ€ ๋ง๊ฐ€์ ธ๋„ ๊ดœ์ฐฎ์Œ

- high-bandwidth

- high availability distributed storage

- ์ตœ์†Œ 3๊ฐœ์˜ ๋ณต์‚ฌ๋ณธ์€ ์œ ์ง€

 

MapReduce Word Count

- The Map and Reduce functions

map(k,v) -> [(k', v')]
reduce(k', [v']) -> [(k',v')]

- Word Count

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
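The same word count, simulated in plain Python (the shuffle step stands in for the framework's group-by-key between map and reduce; the documents are invented):

```python
from collections import defaultdict

def map_phase(docs):
    """Emit (word, 1) for every word in every document."""
    for text in docs.values():
        for w in text.split():
            yield (w, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts per term."""
    return {term: sum(values) for term, values in groups.items()}

docs = {"d1": "to be or not to be", "d2": "to do"}
counts = reduce_phase(shuffle(map_phase(docs)))
# {"to": 3, "be": 2, "or": 1, "not": 1, "do": 1}
```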

 

MapReduce ์‹คํ–‰ ์‹œ๊ฐ„

- scheduling ์กฐ์ ˆ: map, reduce task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋„๋ก worker์— ๋ถ€์—ฌ

- ๋ฐ์ดํ„ฐ ๋ถ„์‚ฐ ์กฐ์ ˆ: ํ”„๋กœ์„ธ์Šค์—์„œ ๋ฐ์ดํ„ฐ๋กœ ์ด๋™

- synchronization ์กฐ์ ˆ: intermediate data์˜ ์ˆ˜์ง‘, ์ •๋ ฌ, ํ˜ผํ•ฉ

- error, fault ์กฐ์ ˆ: worker failure ๊ฐ์ง€ ๋ฐ ์žฌ์‹œ์ž‘

 

Hadoop Distributed File System (HDFS)

- ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด๋ถ€์˜ ๋…ธ๋“œ์˜ local disk์— ๋ฐ์ดํ„ฐ ์ €์žฅ -> ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ–๊ณ  ์žˆ์„ RAM์ด ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์Œ

- Disk access๋Š” ๋Š๋ฆฌ์ง€๋งŒ disk throughput์€ ํ•ฉ๋ฆฌ์  -> file์„ ํ†ตํ•œ linear scan์€ ๊ดœ์ฐฎ์Œ

- ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ 3๋ฒˆ ๋ณต์ œ -> reliability on commodity hardware

 

Cloud Computing Services

- Amazon ๋“ฑ์˜ ํ”Œ๋žซํผ์€ ๋‹จ๊ธฐ ์ž‘์—…์— ๋Œ€ํ•œ ์—ฌ๋Ÿฌ machine ์ œ๊ณต ๊ฐ€๋Šฅ

 

Feel Free to Experiment: micro instances are worth trying out.
