
When evaluating a binary classifier, we draw a confusion matrix and count the true positives, true negatives, false positives, and false negatives; from those counts we can compute precision, recall, accuracy, and so on.

But what should we do when there are three or more classes to classify?

There are two approaches: microaveraging and macroaveraging. Let's look at how they differ.

Macroaveraging

Compute the metric separately for each class of interest; the mean of those per-class values is the final score.

With four classes, for example, you draw four small (one-vs-rest) confusion matrices and average the metric over the four.

Microaveraging

Pool the results for all classes into a single confusion matrix and compute each metric from the pooled counts.

Because every example counts equally, a microaveraged score is dominated by the classes that occur most often.
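To make the difference concrete, here is a minimal sketch that computes macro- and micro-averaged precision. The three classes and their TP/FP/FN counts are made-up numbers for illustration, and the helper name precision is my own.

```python
# Hypothetical per-class counts for a 3-class problem (illustrative numbers only).
counts = {
    "A": {"tp": 90, "fp": 10, "fn": 5},  # frequent class
    "B": {"tp": 8, "fp": 2, "fn": 4},
    "C": {"tp": 1, "fp": 4, "fn": 6},    # rare class
}


def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Macroaveraging: one precision per class, then the plain mean of those values.
per_class = {name: precision(c["tp"], c["fp"]) for name, c in counts.items()}
macro_precision = sum(per_class.values()) / len(per_class)

# Microaveraging: pool the counts of all classes first, then compute once.
total_tp = sum(c["tp"] for c in counts.values())
total_fp = sum(c["fp"] for c in counts.values())
micro_precision = precision(total_tp, total_fp)

print(f"macro precision = {macro_precision:.3f}")
print(f"micro precision = {micro_precision:.3f}")
```

With these numbers the rare class C drags the macro average down, while the micro average stays close to the precision of the frequent class A.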


Statement = Proposition

- Definition: a sentence that can be judged to be either true or false

- Types

1) Universal Statement: true for every element of a set

2) Conditional Statement: holds under a specific condition (if ..., then ...)

3) Existential Statement: at least one element satisfies the condition

4) Universal Conditional Statement: both universal and conditional

5) Universal Existential Statement: the first part is universal, the second part is existential

6) Existential Universal Statement: the first part is existential, the second part is universal
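Over a finite domain, these statement types can be checked mechanically with all() and any(). The sketch below assumes a small hypothetical domain D = {1, ..., 5}; the predicates are my own examples.

```python
# Hypothetical finite domain used only for illustration.
D = range(1, 6)

# Universal: for all x in D, x > 0
universal = all(x > 0 for x in D)

# Existential: there exists x in D such that x is even
existential = any(x % 2 == 0 for x in D)

# Universal conditional: for all x in D, if x > 3 then x*x > 9
# ("if P then Q" is encoded as "not P or Q")
universal_conditional = all((not x > 3) or (x * x > 9) for x in D)

# Universal existential: for every x in D there exists y in D with y >= x
universal_existential = all(any(y >= x for y in D) for x in D)

# Existential universal: there exists x in D such that x <= y for all y in D
existential_universal = any(all(x <= y for y in D) for x in D)

print(universal, existential, universal_conditional,
      universal_existential, existential_universal)
```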

 

Set

- Definition: a collection of elements that satisfy a given condition

- ex) { x | 0 < x < 5 }

 

Russell's Paradox

R = {x | x is a set and x is not an element of itself}

By this definition, if R is an element of R, then R cannot be an element of R.

Conversely, if R is not an element of R, then R must be an element of R.

-> Either way we reach a contradiction.

 

Cartesian product

A × B : Cartesian product of A and B

B × A : Cartesian product of B and A

Definition: A × B = {(x, y) | x is a member of A, y is a member of B}

ex) A = {1, 2} / B = {3, 4} -> A × B = {(1,3), (1,4), (2,3), (2,4)}
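The example above can be reproduced with itertools.product from the Python standard library. This is just a small sketch; the variable names a_x_b and b_x_a are my own.

```python
from itertools import product

A = [1, 2]
B = [3, 4]

# A x B: every pair (x, y) with x from A and y from B
a_x_b = list(product(A, B))

# B x A: the pairs are reversed, so in general A x B != B x A
b_x_a = list(product(B, A))

print(a_x_b)
print(b_x_a)
```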

 

Relation

- For a nonempty set A, a relation on A is a subset of A × A

- Properties

1) Reflexive: (x, x) is in R for every x in A

ex) Let A = {1, 2, 3, 4}.

R1 = { (1, 1), (2, 2) } is not reflexive.

Reason: (3, 3) and (4, 4) are missing.

R2 = { (1, 1), (2, 2), (3, 3), (4, 4), (2, 3) } is reflexive.

Reason: (1, 1) through (4, 4) are all present; (2, 3) makes no difference.

2) Symmetric: whenever (x, y) is in R, (y, x) is in R too

ex) R3 = the empty set is symmetric.

Reason: the set contains no pairs, so the condition holds vacuously.

R4 = { (1, 2), (2, 1), (2, 2) } is symmetric.

R5 = { (1, 2), (2, 1), (2, 3) } is not symmetric because (3, 2) is missing.

3) Transitive: whenever (x, y) and (y, z) are in R, (x, z) is in R too

R6 = { (1, 5), (5, 1), (1, 1) } is not transitive.

(1, 5), (5, 1) -> (1, 1) is present, but

(5, 1), (1, 5) -> (5, 5) would also have to be present, and it is not.
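The three properties can be checked directly against R1 through R6. This is a minimal sketch with relations stored as Python sets of pairs; the function names are my own.

```python
def is_reflexive(relation, domain):
    """Every element of the domain must be related to itself."""
    return all((x, x) in relation for x in domain)


def is_symmetric(relation):
    """For every (x, y) in the relation, (y, x) must be there too."""
    return all((y, x) in relation for (x, y) in relation)


def is_transitive(relation):
    """For every (x, y) and (y, w) in the relation, (x, w) must be there too."""
    return all(
        (x, w) in relation
        for (x, y) in relation
        for (z, w) in relation
        if y == z
    )


A = {1, 2, 3, 4}
R1 = {(1, 1), (2, 2)}
R2 = {(1, 1), (2, 2), (3, 3), (4, 4), (2, 3)}
R3 = set()
R4 = {(1, 2), (2, 1), (2, 2)}
R5 = {(1, 2), (2, 1), (2, 3)}
R6 = {(1, 5), (5, 1), (1, 1)}

print(is_reflexive(R1, A), is_reflexive(R2, A))
print(is_symmetric(R3), is_symmetric(R4), is_symmetric(R5))
print(is_transitive(R6))
```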

I recently started working as an undergraduate researcher in a bioinformatics lab, and there turned out to be less Korean-language material than I expected, so I'm leaving these notes!

Note that some of the explanations may be wrong.

What is DEG analysis?

DEG stands for Differentially Expressed Gene!

The approach measures gene expression values and, through a statistical procedure, selects candidate genes whose expression differs significantly between the control group and the comparison group. (source)

DEG analysis requires RNA-seq data. There are several ways to prepare RNA-seq data for DEG analysis; the one I used is as follows.

1) Trim unneeded sequences such as adapters from the fq (FASTQ) files obtained from RNA-seq.

2) Map the reads with TopHat2.

3) Compute expression levels with programs such as Cufflinks and Cuffdiff.

4) Parse and process the data (-> I used existing code for this step).

Once step 4 is done, the per-replicate expression values for each gene let us obtain the p-value, FPKM, Fold Change, and similar quantities.

(FPKM = Fragments Per Kilobase of transcript per Million mapped fragments: the fragment count for a transcript, normalized by transcript length and sequencing depth)

(Fold Change = Transgenic / Wild-type)
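As a rough sketch of the two quantities defined above: the function names, the example numbers, and the pseudocount of 1 (added so a zero wild-type value does not cause division by zero) are all my own assumptions. Tools like Cufflinks compute FPKM with more elaborate models.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(transgenic, wild_type, pseudocount=1.0):
    """log2(Transgenic / Wild-type); the pseudocount is an assumption, not part of the definition."""
    return math.log2((transgenic + pseudocount) / (wild_type + pseudocount))


# Hypothetical transcript: 500 fragments, 2,000 bp long, 10 million mapped fragments in total.
example_fpkm = fpkm(500, 2000, 10_000_000)
print(example_fpkm)

# Positive means higher in the transgenic sample, negative means lower.
print(log2_fold_change(7.0, 1.0))
print(log2_fold_change(1.0, 7.0))
```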

Volcano Plot

A volcano plot is a kind of scatter plot. Draw it as instructed and it looks like a volcano erupting (!), hence the name.

The x-axis shows log2(Fold Change) and the y-axis shows -log(p-value) (usually -log10).

A rough sketch looks like this. Taking x=0 as the boundary, genes on the left have log2(Fold Change) below 0 and can be read as down-regulated genes, while genes on the right are up-regulated genes.

Since p<0.05 is usually treated as significant, genes that do not meet that threshold (the gray dots) are not considered significantly differentially expressed.
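The reading rule for a volcano plot can be written as a small function. This is only a sketch: the function names are my own, the p<0.05 threshold is the one mentioned above, and I assume base-10 logs for the y-axis.

```python
import math


def volcano_point(log2_fc, p_value):
    """Coordinates of one gene on the volcano plot: (log2 fold change, -log10 p)."""
    return log2_fc, -math.log10(p_value)


def classify_gene(log2_fc, p_value, alpha=0.05):
    """Label a gene as 'up', 'down', or 'not significant' (the gray dots)."""
    if p_value >= alpha or log2_fc == 0:
        return "not significant"
    return "up" if log2_fc > 0 else "down"


# Hypothetical genes: (log2 fold change, p-value)
genes = {"geneA": (2.0, 0.001), "geneB": (-1.5, 0.01), "geneC": (3.0, 0.2)}
for name, (lfc, p) in genes.items():
    print(name, volcano_point(lfc, p), classify_gene(lfc, p))
```

Real pipelines often add a minimum |log2(Fold Change)| cutoff and use adjusted p-values, which this sketch leaves out.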

 

MA plot

It is described as a very powerful statistical method for revealing and visualizing the intensity-dependent ratio in RNA-seq data. (source)

That sounds complicated, but this one can also be thought of as a kind of scatter plot.

The x-axis shows the mean of the gene's log2(FPKM) values, and the y-axis shows log2(Fold Change). An example figure follows.

I have one I drew myself on the lab computer, but I couldn't be bothered, so I used the one from the reference material..

In an MA plot we also go through a step that removes values that are not meaningful (?), and apparently this step is called the Volume Cut.

There is no universally correct threshold; you inspect the plot by eye and pick a suitable vertical line x=n so that only the values to its right are used.

It seems the usual order is to draw the MA plot first to decide the volume cut, then draw the volcano plot to find the Up/Down Regulated Genes...
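Here is a minimal sketch of the MA-plot coordinates and the volume cut described above. The function names, the pseudocount of 1, the FPKM values, and the cut position n = 1.0 are all assumptions for illustration.

```python
import math


def ma_values(fpkm_wild_type, fpkm_transgenic, pseudocount=1.0):
    """Return (A, M): A = mean log2 expression (x-axis), M = log2 fold change (y-axis)."""
    log_wt = math.log2(fpkm_wild_type + pseudocount)
    log_tg = math.log2(fpkm_transgenic + pseudocount)
    return (log_wt + log_tg) / 2, log_tg - log_wt


def apply_volume_cut(points, n):
    """Keep only the genes that lie to the right of the vertical line x = n."""
    return {gene: (a, m) for gene, (a, m) in points.items() if a > n}


# Hypothetical FPKM values: gene -> (wild-type, transgenic)
expression = {"geneA": (3.0, 15.0), "geneB": (0.0, 1.0)}
points = {gene: ma_values(wt, tg) for gene, (wt, tg) in expression.items()}
kept = apply_volume_cut(points, n=1.0)

print(points)
print(kept)
```

geneB has a large fold change but very low expression, so it falls to the left of the cut and is dropped.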

 

What comes next?

That completes the first step of DEG analysis: finding the Up/Down Regulated Genes!

From here, you can feed this information into sites such as DAVID, GOrilla, and Revigo to explore the differences in gene expression.

In this post I want to introduce Mean Target Encoding, one of the ways to encode categorical data.

To be honest, I thought label encoding and one-hot encoding were the only ways to encode categorical variables.

So I was surprised to learn, while analyzing a loan default dataset, that the procedure I had been following was in fact Mean Target Encoding.

Anyway, let's get to the point.

For what a categorical variable is, see a post I wrote earlier.

https://iamnotwhale.tistory.com/5

 


In short, a categorical variable is data that can be divided into categories.

For example, sex can be classified as female or male, so it is a categorical variable.

Categorical variables can be further divided into nominal (female, male) and ordinal (1st, 2nd, 3rd, ...) types, but you don't need to know that for this post.

In a pandas DataFrame, categorical variables usually come with the object dtype. They cannot be fed to a machine learning algorithm as-is, because the computer has no notion of what female/male means; it can only interpret data given in numeric form.

So we have to convert them into numbers (int, float) that the computer can work with, and encoding is how we do that.

There are many ways to encode categorical variables, but the most representative are probably Label Encoding and One-Hot Encoding.

Label Encoding simply assigns a number to each category. The numbers carry no information about which category is better or worse; think of them as group 1, group 2, and so on.

However, most machine learning algorithms other than tree-based ones treat those numbers as having an order and a magnitude, so it is hard to get good performance with it and it is not used often.
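For reference, here is a minimal sketch of label encoding with pandas. The Color values are hypothetical, and pd.factorize is just one way to do it (scikit-learn's LabelEncoder is another); it numbers the categories in order of first appearance.

```python
import pandas as pd

# Hypothetical categorical column.
colors = pd.Series(["Pink", "Yellow", "Pink", "Blue"], name="Color")

# codes: one integer per row; categories: the label each integer stands for.
codes, categories = pd.factorize(colors)

print(list(codes))
print(list(categories))
```

The integers are only group IDs: Blue = 2 does not mean Blue is "larger" than Pink = 0, which is exactly what many models would wrongly assume.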

 

One-Hot Encoding gets its name from the idea that only one column is "hot" and the rest are "cold".

That may not click right away, so to restate: each value of the categorical variable becomes a new column, and a row gets 1 in the column matching its value and 0 everywhere else.

Here is a table to illustrate.

ID  Color
0   Pink
1   Yellow
2   Pink
3   Blue

Given data like this, applying one-hot encoding to the Color column transforms the DataFrame as follows.

ID  Color_Pink  Color_Yellow  Color_Blue
0   1           0             0
1   0           1             0
2   1           0             0
3   0           0             1

Each value of Color gets its own column that carries the information.

One-hot encoding generally performs better than label encoding, but when the categorical variable has too many distinct values the dimensionality blows up and performance can suffer.
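The Color table above can be one-hot encoded with pandas in one call. This is a small sketch using pd.get_dummies; dtype=int is passed so the result shows 0/1 rather than booleans. Note that pandas orders the new columns alphabetically, so the column order differs from the table.

```python
import pandas as pd

# The small Color table from the text.
df = pd.DataFrame({"ID": [0, 1, 2, 3],
                   "Color": ["Pink", "Yellow", "Pink", "Blue"]})

# One new 0/1 column per distinct Color value; the original Color column is dropped.
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)

print(encoded)
```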

 

So when do we use Mean Target Encoding?

The two encodings introduced above only distinguish the categories from one another; they cannot express how each category relates to the target.

One-hot encoding also has the problem that the dimensionality grows excessively as the number of values increases.

Mean Target Encoding is used to address these problems.

Put simply, each label in the column is encoded as the mean of the target values of the rows that share that label.

I can't post the data as-is, but the loan default dataset has a column called Loan Title, and the labels in that column are extremely varied.

'Credit card refinancing', 'Debt consolidation',
'Credit card refinance', 'Credit Card Consolidation',
'Debt Consolidation', 'Home improvement', 'Consolidation', 'Other',
'relief', 'debt consolidation loan', 'Major purchase', 'Loan',
'credit card refinance', 'Medical', 'Pool', 'Vacation',
'CC consolidation', 'Medical expenses', 'Moving and relocation',
'payoff', 'Personal Loan', 'debt consolidation', 'Debt Loan',
'House', 'consolidation loan', 'consolidate', 'Credit payoff',
'Bathroom', 'Green loan', 'Debt Payoff', 'Consolidate', 'Business',
'Lending Club', 'Refinance', 'Home Improvement',

There are too many, so I only brought this much.

For each of these labels I wanted the proportion, that is, the mean, of the target (loan status in this loan default dataset).

(The mean is the sum of the values divided by the number of values; because the target consists of 0s and 1s, the mean equals the proportion of 1s in the data.)

The code I wrote is as follows.

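import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv("train_loan.csv")

# Keep only the categorical variable and the target
data = df[['Loan Title', 'Loan Status']]

# Mean Target Encoding: mean of the target for each Loan Title label
ratio = data.groupby('Loan Title', as_index=False)['Loan Status'].mean()

# Visualize the encoding result, one bar per label
ratio.sort_values(by='Loan Status').plot.bar(x='Loan Title', y='Loan Status')
plt.show()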

I grouped data by the Loan Title column with the groupby method and applied mean() to carry out the Mean Target Encoding; the result is the DataFrame ratio.
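To actually use the encoding as a feature, the per-label means need to be mapped back onto the rows. Here is a minimal sketch on a made-up toy table; the rows and the new column name "Loan Title (encoded)" are my own, not from the real dataset.

```python
import pandas as pd

# Toy stand-in for the loan data (made-up rows).
toy = pd.DataFrame({
    "Loan Title": ["Medical", "Medical", "House", "House", "House", "Other"],
    "Loan Status": [1, 0, 0, 0, 1, 0],
})

# Mean of the 0/1 target per label = proportion of 1s for that label.
means = toy.groupby("Loan Title")["Loan Status"].mean()

# Replace each label by its mean to get a numeric feature.
toy["Loan Title (encoded)"] = toy["Loan Title"].map(means)

print(means)
print(toy)
```

In practice the means would normally be computed on the training split only and then applied to validation and test data, so that the target does not leak into the feature.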

The visualization came out as follows.

With so many labels it came out a bit... ugly, but only the overall trend matters.

You can run clustering on the encoded values or use them as they are.

I went as far as clustering.

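from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []

# Fit k-means for k = 1..10 on the encoded values only
# (the string column 'Loan Title' cannot be passed to KMeans)
for i in range(1, 11):
    kmeans_plus = KMeans(n_clusters = i, init = 'k-means++')
    kmeans_plus.fit(ratio[['Loan Status']])
    inertia.append(kmeans_plus.inertia_)

# Elbow plot: inertia against the number of clusters
plt.plot(range(1,11), inertia, marker = 'o')

plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()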

Using the elbow method, I settled on 3 clusters.

The biggest difference from the encodings I already knew is that this one tries to capture the relationship with the target variable during the encoding itself.


Greedy Best-First Search

Evaluation Func & Heuristic Func

  1. Evaluation func f(n)
    1. estimated cost of node n
  2. Heuristic func h(n)
    1. estimated cost of the cheapest path from node n to a goal state
    2. e.g. the straight-line distance from each city to the goal in a route-finding problem

Greedy Best-First Search

  1. Definition: a form of best-first search → expands the node with the smallest h(n) value
    1. keeping a table of reached states gives graph search (repeated states are checked)
    2. dropping it gives tree-like search (repeated states are not checked)
  2. Performance
    1. Completeness → cycles can occur
      1. Not complete: tree-like search
      2. complete: graph search (in finite state spaces)
    2. Not cost-optimal
    3. Time Complexity: O(b^m)
    4. Space Complexity: O(b^m)
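The notes above can be sketched as a small graph-search version of greedy best-first search. The toy graph, the heuristic values, and the function name are my own assumptions, chosen so that the greedy choice turns out to be suboptimal.

```python
import heapq


def greedy_best_first_search(graph, h, start, goal):
    """Graph-search variant: always expand the frontier node with the smallest h(n)."""
    frontier = [(h[start], start, [start])]  # priority queue ordered by h only
    reached = {start}                        # table of reached states (avoids repeats)
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in reached:
                reached.add(neighbor)
                heapq.heappush(frontier, (h[neighbor], neighbor, path + [neighbor]))
    return None


# Hypothetical directed graph: node -> {neighbor: step cost}
graph = {"S": {"A": 2, "B": 3}, "A": {"G": 6}, "B": {"G": 2}, "G": {}}
# Hypothetical heuristic estimates of the remaining cost to G
h = {"S": 3, "A": 1, "B": 2, "G": 0}

greedy_path = greedy_best_first_search(graph, h, "S", "G")
print(greedy_path)
```

On this toy graph the search goes S -> A -> G (cost 8) because A looks closest to the goal, even though S -> B -> G costs only 5. That is the "not cost-optimal" point in the list above.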

A* Search

  1. Definition: uses f(n) = g(n) + h(n) as the evaluation function
    1. g(n): cost of the path from the start to n
    2. h(n): estimated cost from n to the goal
  2. Admissible Heuristic Function
    1. never overestimates the cost to reach the goal
    2. h(n) <= h*(n), where h*(n) is the true cost from n to the goal
    3. with an admissible heuristic function, A* search is cost-optimal.
  3. Consistent Heuristic Function
    1. consistent: for every node n and every child n' generated by an action a,
    2. h(n) <= c(n,a,n') + h(n') holds
    3. with a consistent heuristic function, A* search is cost-optimal → every consistent heuristic is admissible.
  4. Search contours
    1. A* expands the frontier node with the lowest f-cost, so the search spreads outward in contours of increasing f-cost.
    2. with a good heuristic, the contours stretch toward the goal state and become narrower around the optimal path.
  5. Optimally Efficient → C* : optimal cost
    1. surely expanded: f(n) < C*
    2. depends: f(n) = C*
    3. never expanded: f(n) > C*
  6. Weighted A* Search
    1. Definition: finds a solution whose cost lies between C* and W × C*
    2. f(n) = g(n) + W × h(n), with W > 1
    3. Suboptimal search techniques
      1. bounded suboptimal search: the solution is guaranteed to cost no more than a constant factor times the optimal cost
      2. bounded-cost search: find a path whose cost is within some given bound, without knowing the optimal cost
      3. unbounded-cost search
      4. beam search: keeps memory small → keeps only the best-scoring nodes and prunes the rest
  7. Performance
    1. Complete
    2. Cost-optimal
    3. Time complexity <= O(b^d)
    4. Space complexity <= O(b^d)
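Here is a minimal sketch of A* with an optional weight, so the same function also illustrates weighted A*. The toy graph, the heuristic (which is consistent for this graph), and the function name are assumptions of mine.

```python
import heapq


def a_star_search(graph, h, start, goal, weight=1.0):
    """A* with f(n) = g(n) + weight * h(n); weight > 1 gives weighted A*."""
    frontier = [(weight * h[start], 0, start, [start])]  # entries: (f, g, node, path)
    best_g = {start: 0}                                  # cheapest known cost to each node
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper path to this node was found later
        for neighbor, step_cost in graph[node].items():
            new_g = g + step_cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                f = new_g + weight * h[neighbor]
                heapq.heappush(frontier, (f, new_g, neighbor, path + [neighbor]))
    return None, float("inf")


# Hypothetical directed graph and heuristic; optimal cost C* from S to G is 5.
graph = {"S": {"A": 2, "B": 3}, "A": {"G": 6}, "B": {"G": 2}, "G": {}}
h = {"S": 3, "A": 1, "B": 2, "G": 0}

optimal = a_star_search(graph, h, "S", "G")
weighted = a_star_search(graph, h, "S", "G", weight=3.0)
print(optimal)
print(weighted)
```

With weight 1 the search returns S -> B -> G at cost 5. With W = 3 it commits to S -> A -> G at cost 8, which is suboptimal but still within the W × C* = 15 bound.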

Iterative-Deepening A* Search

  1. Definition (IDA*)
    1. iterative deepening search + A* Search
    2. the cutoff for each iteration is the smallest f-cost of any node that exceeded the cutoff on the previous iteration → as long as a node's f-cost does not exceed the cutoff, the search keeps going deeper
  2. Characteristics
    1. works well when all f-costs are integers
    2. when every node has a different f-cost, the number of iterations can end up equal to the number of states.
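The cutoff rule can be sketched as a depth-first search that reports the smallest f-cost it had to cut off. The toy graph, the heuristic, and the function name are assumptions for illustration.

```python
import math


def ida_star(graph, h, start, goal):
    """Iterative-deepening A*: repeated depth-first searches with a growing f-cost cutoff."""

    def search(path, g, cutoff):
        node = path[-1]
        f = g + h[node]
        if f > cutoff:
            return None, f           # report the f-cost that exceeded the cutoff
        if node == goal:
            return list(path), g     # found: return the path and its cost
        next_cutoff = math.inf
        for neighbor, step_cost in graph[node].items():
            if neighbor in path:     # avoid cycles along the current path
                continue
            path.append(neighbor)
            found, value = search(path, g + step_cost, cutoff)
            path.pop()
            if found is not None:
                return found, value
            next_cutoff = min(next_cutoff, value)
        return None, next_cutoff     # smallest f-cost seen beyond the cutoff

    cutoff = h[start]
    while True:
        found, value = search([start], 0, cutoff)
        if found is not None:
            return found, value
        if value == math.inf:        # nothing left to explore: no solution
            return None, math.inf
        cutoff = value               # next iteration uses the smallest exceeded f-cost


# Hypothetical directed graph and heuristic.
graph = {"S": {"A": 2, "B": 3}, "A": {"G": 6}, "B": {"G": 2}, "G": {}}
h = {"S": 3, "A": 1, "B": 2, "G": 0}

ida_result = ida_star(graph, h, "S", "G")
print(ida_result)
```

On this graph the first iteration runs with cutoff 3 and fails. The smallest exceeded f-cost is 5, so the second iteration uses cutoff 5 and finds S -> B -> G.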

Recursive Best-First Search

  1. Definition (RBFS)
    1. recursive depth-first search + best-first search
    2. if the current node exceeds f-limit, the recursion unwinds and moves to the alternative path
  2. Procedure
    1. the initial limit = infinity
    2. choose the successor with the lowest cost; the limit becomes the second-lowest value (or stays at the current limit if that is smaller)
    3. descend and compare cost with limit → if the cost exceeds the limit, back that cost up to the abandoned node and switch to the node whose cost was equal to the limit
  3. Performance
    1. Complete
    2. Cost-optimal
    3. time complexity <= O(b^d)
    4. space complexity <= O(bd)
      • uses too little memory → keeps re-exploring the same states
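The procedure above can be sketched as a recursive function. This follows the usual textbook formulation as I understand it; the toy graph, the heuristic, and the names are assumptions, and successors are kept as small [f, g, node] lists so that a backed-up f-value can be written in place.

```python
import math


def rbfs(graph, h, start, goal):
    """Recursive best-first search; returns (path, cost) or (None, inf)."""

    def search(node, g, f_node, f_limit, path):
        if node == goal:
            return path, g
        successors = []
        for neighbor, step_cost in graph[node].items():
            if neighbor in path:  # skip states already on the current path
                continue
            child_g = g + step_cost
            # A child's f is at least its parent's (possibly backed-up) f.
            successors.append([max(child_g + h[neighbor], f_node), child_g, neighbor])
        if not successors:
            return None, math.inf
        while True:
            successors.sort()  # lowest f first
            best = successors[0]
            if best[0] > f_limit or best[0] == math.inf:
                return None, best[0]  # unwind and report the best f as backed-up value
            alternative = successors[1][0] if len(successors) > 1 else math.inf
            result, best[0] = search(best[2], best[1], best[0],
                                     min(f_limit, alternative), path + [best[2]])
            if result is not None:
                return result, best[0]

    return search(start, 0, h[start], math.inf, [start])


# Hypothetical directed graph and heuristic.
graph = {"S": {"A": 2, "B": 3}, "A": {"G": 6}, "B": {"G": 2}, "G": {}}
h = {"S": 3, "A": 1, "B": 2, "G": 0}

rbfs_result = rbfs(graph, h, "S", "G")
print(rbfs_result)
```

Here the search first descends into A with limit 5 (the f-cost of B). A's only child has f = 8, so A is abandoned with a backed-up value of 8, and the search switches to B and reaches G at cost 5.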

Simplified Memory-Bounded A* Search

  1. Definition
    1. run A* search until memory is full
    2. when memory is full, drop the worst leaf node, the one with the highest f-cost
    3. back the forgotten node's value up to its parent
  2. Performance
    1. Complete (if a reachable solution fits in memory)
    2. Cost-optimal (if an optimal solution fits in memory)
    3. time complexity <= O(b^d)
    4. space complexity <= O(b^d)