
Data Science Analysis Pipeline

- Modeling: the process of encapsulating information in a tool that can make predictions

- Core steps: building, fitting, and validating the model

 

Philosophies of Modeling

1. Occam's Razor

- 14์„ธ๊ธฐ ์˜๊ตญ ์ˆ˜๋„์Šน

- ๋œป: ๊ฐ€์žฅ ๋‹จ์ˆœํ•œ ์„ค๋ช…์ด ๊ฐ€์žฅ ์ข‹๋‹ค.

- ๊ฐ€์žฅ ์ ์€ ๊ฐ€์ •์„ ๋งŒ๋“œ๋Š” ๋‹ต์„ ์„ ํƒํ•ด์•ผ ํ•œ๋‹ค. -> ๋ชจ๋ธ์—์„œ parameter์˜ ์ˆ˜๋ฅผ ์ค„์—ฌ์•ผ ํ•จ์„ ์˜๋ฏธ

- LASSO/ridge regression ๋“ฑ์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฒ•์€ ํ”ผ์ณ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด penalty function์„ ์‚ฌ์šฉ -> ๋ถˆํ•„์š”ํ•œ coefficient๋ฅผ ์ตœ์†Œํ™”
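
A minimal sketch of this idea, assuming scikit-learn and synthetic data (the feature counts and alpha values are illustrative, not from the lecture): LASSO's L1 penalty zeroes out the coefficients of uninformative features, while ridge's L2 penalty only shrinks them.

```python
# Sketch: L1 (LASSO) vs. L2 (ridge) penalties on synthetic data.
# Only the first 3 of 10 features are informative; the rest are noise.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)  # 7 useless features
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # penalty: sum of |coefficients|
ridge = Ridge(alpha=1.0).fit(X, y)  # penalty: sum of squared coefficients

# LASSO sets most noise coefficients exactly to zero (Occam's razor);
# ridge only shrinks them toward zero.
print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
```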

 

2. Bias-Variance Tradeoffs

- "๋ชจ๋“  ๋ชจ๋ธ์€ ํ‹€๋ฆฌ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ์–ด๋–ค ๋ชจ๋ธ์€ ์œ ์šฉํ•˜๋‹ค."

- Bias: ๋ชจ๋ธ์—์„œ ์—๋Ÿฌ๊ฐ€ ์žˆ๋Š” ๊ฐ€์ •์œผ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ ์˜ค๋ฅ˜ -> underfitting; ๊ฐ€์ • ์ž์ฒด๊ฐ€ ์ •๋‹ต์—์„œ ๋ฉ€์–ด์ ธ ์žˆ๋Š” ๊ฒฝ์šฐ -> linear๋กœ ๋ณ€ํ™˜๋จ

- Variance: training set ์•ˆ์˜ ์ž‘์€ ๋ณ€ํ™”์— ๋ฏผ๊ฐํ•จ์œผ๋กœ์จ ์ƒ๊ธฐ๋Š” ์˜ค๋ฅ˜ -> overfitting; ๊ฐ€์ • ์ž์ฒด๋Š” ๋งž์œผ๋‚˜ ๋ณ€์ˆ˜์˜ ํผ์ง„ ์ •๋„๊ฐ€ ๋„ˆ๋ฌด ํฐ ๊ฒฝ์šฐ
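
A small sketch of the tradeoff, assuming numpy and a synthetic sine-curve dataset (degrees and noise level are arbitrary choices): a degree-1 polynomial underfits (high bias), a degree-15 polynomial overfits (high variance), and a moderate degree balances the two.

```python
# Sketch: polynomials of increasing degree fit to noisy samples of a sine.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)  # noisy training set
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)                          # noiseless truth

for degree in (1, 4, 15):
    coefs = np.polyfit(x, y, degree)    # fit on the noisy training data
    y_pred = np.polyval(coefs, x_test)  # evaluate on a dense grid
    test_mse = np.mean((y_pred - y_true) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")
```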

 

3. Principles of Nate Silver

- ํ™•๋ฅ ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์ƒ๊ฐํ•˜๊ธฐ

- ์ƒˆ๋กœ์šด ์ •๋ณด์— ๋ฐ˜์‘ํ•˜์—ฌ ์˜ˆ์ธก์„ ๋ฐ”๊พธ๊ธฐ

- consensus๋ฅผ ์ถ”๊ตฌํ•˜๊ธฐ

 

๋‚˜์˜ ๋ชจ๋ธ์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ

- ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ 1๊ฐœ์˜ ๊ฒฐ์ •๋ก ์ ์ธ ์˜ˆ์ธก์„ ์š”๊ตฌํ•˜๋Š” ๊ฒƒ์€ ๋ถˆ๊ฐ€๋Šฅํ•œ ์ผ์ด๋‹ค.

- ์ข‹์€ ์˜ˆ์ธก ๋ชจ๋ธ์€ ๋ชจ๋“  ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ œ๊ณต

- logistic regression, k-nearest neighbor ๋“ฑ์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ
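
A quick illustration, assuming scikit-learn and the iris dataset: logistic regression exposes a full probability distribution over the classes through predict_proba, not just a single hard label.

```python
# Sketch: a classifier that outputs a probability distribution per sample.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One probability per class for each sample; each row sums to 1.
print(model.predict_proba(X[:3]).round(3))
```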

 

Live Models

- ์ƒˆ๋กœ์šด ์ •๋ณด์— ๋ฐ˜์‘ํ•˜์—ฌ ์—์ธก์„ ๊พธ์ค€ํžˆ ๋ณ€๊ฒฝํ•˜๋ฉด Liveํ•œ ๋ชจ๋ธ์ด๋ผ๊ณ  ํ•œ๋‹ค.

- ์—์ธก์ด ์ •ํ™•ํ•œ ๋‹ต์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š”๊ฐ€? / ์ด์ „ ์˜ˆ์ธก์„ ๋ณด์—ฌ์ค˜์„œ ๋ชจ๋ธ์˜ ์ผ๊ด€์„ฑ์„ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š”๊ฐ€? / ๋ชจ๋ธ์ด ๋” ์ตœ๊ทผ ๋ฐ์ดํ„ฐ๋กœ ๋‹ค์‹œ ํ•™์Šต๋˜๋Š”๊ฐ€?

 

Consensus ์ถ”๊ตฌ

- Google Flu Trends: predicted flu outbreaks from the search frequency of illness terms -> broke down after Google introduced search suggestions

- Is there a comparable competing prediction? / What does a baseline model say? / Are several models with different prediction approaches in use?

- Boosting, bagging: machine learning techniques that explicitly combine ensembles of classifiers (see the sketch below)
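
A minimal sketch, assuming scikit-learn and a synthetic task: bagging and boosting both build ensembles of decision trees and typically beat a single tree.

```python
# Sketch: ensembles of decision trees vs. a single tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```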

 

Taxonomy of Models

1. Linear vs. Non-linear Model

- Linear: each feature variable's importance is expressed as a coefficient -> the score is computed by summing these weighted terms

- Non-linear: formulas involving higher-order polynomials, logarithms, exponentials, etc. (see the sketch below)
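
A toy sketch of the distinction (all numbers here are made up): the linear score is just a coefficient-weighted sum of the features, while a non-linear score can mix powers, logarithms, and exponentials.

```python
# Sketch: linear score = weighted sum of features; non-linear scores mix in
# polynomials, logs, and exponentials.
import numpy as np

w = np.array([0.5, -1.2, 2.0])  # one coefficient (importance) per feature
b = 0.1                         # intercept
x = np.array([1.0, 3.0, 2.0])   # one observation's feature values

linear_score = w @ x + b  # sum of coefficient * feature
nonlinear_score = w[0] * x[0] ** 2 + np.log(x[1]) + np.exp(w[2] * x[2])
print(linear_score, nonlinear_score)
```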

 

2. Blackbox vs. Descriptive Model

- ๋ชจ๋ธ์€ ๋Œ€๋ถ€๋ถ„ ์„ค๋ช…์ ์ž„ = ์™œ ๊ทธ๋Ÿฌํ•œ ๊ฒฐ์ •์„ ๋‚ด๋ ธ๋Š”์ง€ ์„ค๋ช…ํ•จ

- Linear regression model์ด ๊ทธ ์˜ˆ์‹œ, ํŠธ๋ฆฌ๋„ ์˜ˆ์‹œ์ผ ์ˆ˜ ์žˆ์Œ.

- Neural Network Model์€ ์™œ ๊ทธ๋Ÿฌํ•œ ๊ฒฐ์ •์„ ๋‚ด๋ ธ๋Š”์ง€ ์•Œ ์ˆ˜ ์—†์Œ.

-> opaque

 

3. First principle vs. Data-driven Model

- First principle: based on a theoretical explanation of how the system works -> simulations, scientific formulas

- Data-driven: based on observed correlations between input parameters and outcome variables -> most machine learning models

- The best models blend the two appropriately

 

4. General vs. Ad Hoc Model

- ๋ถ„๋ฅ˜๋‚˜ ํšŒ๊ท€ ๋“ฑ์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์€ general = ๋ฌธ์ œ์— ํŠน์ •๋˜๋Š” ์•„์ด๋””์–ด๊ฐ€ ์•„๋‹Œ ํŠน์ • ๋ฐ์ดํ„ฐ๋งŒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•จ

- Ad Hoc Model์€ ํŠน์ • ๋„๋ฉ”์ธ์˜ ์ง€์‹์„ ์ด์šฉ - ํŠน๋ณ„ํ•œ ๋ชฉ์ 

 

5. Stochastic vs. Deterministic Model

6. Flat vs. Hierarchical Model

 

๋ชจ๋ธํ‰๊ฐ€ : Baseline

- simplest reasonable models to compare against

- ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์‹œ (๋ณต์žก๋„ ์˜ค๋ฆ„์ฐจ์ˆœ)

* uniform/random selection among labels

* training data์˜ ๊ฐ€์žฅ ํ”ํ•œ label

* ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์€ single-variable model

* ์ด์ „ ์‹œ๊ฐ„ ํฌ์ธํŠธ์™€ ๊ฐ™์€ label

* mean, median

* linear regression

* existing model: SOTA model...

- baseline ๋ชจ๋ธ์€ ๊ณต์ •ํ•ด์•ผํ•œ๋‹ค.
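
A minimal sketch, assuming scikit-learn: DummyClassifier implements the simplest baseline strategies above (random labels, most-frequent label), here on a deliberately imbalanced synthetic dataset.

```python
# Sketch: two cheap baselines on a 90/10 imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for strategy in ("uniform", "most_frequent"):
    baseline = DummyClassifier(strategy=strategy).fit(X_tr, y_tr)
    # "most_frequent" already reaches ~90% accuracy here, so any real
    # model must clear that bar, not 50%.
    print(f"{strategy}: accuracy = {baseline.score(X_te, y_te):.3f}")
```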

 

๋ถ„๋ฅ˜๊ธฐ ํ‰๊ฐ€

- ์ด์ง„ ๋ถ„๋ฅ˜๊ธฐ์— ์˜ํ•œ ๊ฒฐ๊ณผ๋Š” ์ด 4๊ฐ€์ง€๊ฐ€ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์Œ

 

Threshold Classifiers

- ์ ์ ˆํ•œ threshold๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์ค‘์š”

Accuracy

- ์ „์ฒด ์˜ˆ์ธก์— ๋Œ€ํ•œ ์˜ณ์€ ์˜ˆ์ธก์˜ ๋น„์œจ

- Accuracy๊ฐ€ ๋†’๋‹ค๊ณ  ํ•˜๋”๋ผ๋„ ์‹ค์ œ์˜ ๊ฒฐ๊ณผ์™€๋Š” ์ฐจ์ด๊ฐ€ ์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ ์ •ํ™•๋„๋งŒ์œผ๋กœ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์€ ์˜ณ์ง€ ์•Š๋‹ค

- |P| << |N| ์ผ ๋•Œ ํŠนํžˆ ๋ถ€์ •ํ™•

 

Precision

- TP / (TP + FP): the fraction of predicted positives that are actually positive

Recall

- TP / (TP + FN): the fraction of actual positives that are correctly found

F-score

- ์กฐํ™”ํ‰๊ท ์€ ์‚ฐ์ˆ ํ‰๊ท ๋ณด๋‹ค ํ•ญ์ƒ ์ž‘๊ฑฐ๋‚˜ ๊ฐ™์œผ๋ฏ€๋กœ F-score๊ฐ€ ๋†’์•„์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€

- ๋‘ ๊ฐœ์˜ ์ ์ˆ˜๊ฐ€ ๋†’์•„์•ผ ์ตœ์ข…์ ์œผ๋กœ F ๊ฐ’๋„ ๋†’์•„์ง

 

Sensitivity vs. Specificity

- Sensitivity: the true positive rate, TP / (TP + FN)

- Specificity: the true negative rate, TN / (TN + FP)

- For cancer patients, spam, etc., which label counts as "positive" depends not on whether the content is good or bad, but on which target we are trying to detect

 

์ข…ํ•ฉ์ ์œผ๋กœ๋Š”!

- Accuracy: class size๊ฐ€ ๋Œ€์ฒด๋กœ ๋‹ค๋ฅผ ๋•Œ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ณ์Œ

- Recall: ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ ๊ท ํ˜•์ด ์žกํ˜€์žˆ์„ ๋•Œ ์ •ํ™•๋„์™€ ๊ฐ™์€ ๊ฐ’์ด ๋จ

- High precision: ๊ท ํ˜• ์žกํžˆ์ง€ ์•Š์€ class size์— ๋Œ€ํ•ด์„œ๋Š” ๋งค์šฐ ๋‹ฌ์„ฑํ•˜๊ธฐ ์–ด๋ ค์›€

- F-score: ์–ด๋–ค ๋‹จ์ผ ํ†ต๊ณ„๋Ÿ‰์— ๋Œ€ํ•ด์„œ๋Š” ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ„, but 4๊ฐœ ๊ฐ’์ด ๋ชจ๋‘ ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์Œ

 

Receiver Operating Characteristic (ROC) curves

- Changing the threshold changes the recall/precision and TPR/FPR values.

- The area under the ROC curve (AUC) serves as a single criterion for judging classifier quality (see the sketch below)
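
A minimal sketch, assuming scikit-learn and synthetic data: roc_curve sweeps the threshold over the classifier's scores, and roc_auc_score reduces the resulting curve to a single number.

```python
# Sketch: tracing the ROC curve and computing its AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scores (positive-class probabilities), not hard labels.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) per threshold
print(f"AUC = {roc_auc_score(y_te, scores):.3f}")
```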

 

๋ณต์ˆ˜ ๋ถ„๋ฅ˜ ์‹œ์Šคํ…œ ํ‰๊ฐ€

- class๊ฐ€ ๋งŽ์•„์ง€๋ฉด ๋ถ„๋ฅ˜๊ฐ€ ๋” ์–ด๋ ค์›Œ์ง

- top-k success rate: first k๊ฐœ์˜ ์˜ˆ์ธก์— ๋Œ€ํ•ด ์„ฑ๊ณตํ•  ๊ฒฝ์šฐ ๋ณด์ƒ์„ ์ œ๊ณต

(๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ: ์˜ˆ์ธกํ•œ ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ N๊ฐœ์˜ ํด๋ž˜์Šค์™€ ๋™์ผํ•œ ์‹ค์ œ ํด๋ž˜์Šค์˜ ํ‘œ์ค€ ์ •ํ™•๋„)

 

Summary statistics: Numerical Error

- f: forecast, o: observation

- absolute error: (f - o)

- relative error: (f - o) / o -> more meaningful, since it is normalized by the magnitude of the observation

- Several ways to summarize error (see the sketch below):

* mean or median squared error (MSE)

* root mean squared error (RMSE)

- Error histograms: show the full distribution of errors, not just a single number
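
A minimal sketch with made-up forecasts and observations: absolute and relative error per prediction, plus MSE, RMSE, and a histogram of the error distribution.

```python
# Sketch: summary statistics for numerical error (f = forecast, o = observation).
import numpy as np

f = np.array([102.0, 98.0, 110.0, 95.0])    # forecasts
o = np.array([100.0, 100.0, 100.0, 100.0])  # observations

abs_err = f - o
rel_err = (f - o) / o                # normalized by the observed magnitude

mse = np.mean(abs_err ** 2)          # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error, same units as o
print(mse, rmse)

# An error histogram shows the full distribution rather than one number.
counts, bin_edges = np.histogram(abs_err, bins=5)
```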

 

Evaluation Data

- out-of-sample prediction: producing results on data never seen while the model was built

- Split the data, e.g., training 60%, validation 20%, testing 20%

- The testing data must never be looked at in any other situation; it may only be used at final test time (see the sketch below)

 

Sins in evaluation

- Did I mix the training and test data?

- Does my implementation have bugs?

- Asking questions like these helps uncover evaluation errors

 

Cross-validation

- ๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ, ๋ฐ์ดํ„ฐ๋ฅผ k๊ฐœ๋กœ ์ชผ๊ฐœ์„œ k-1๋ฒˆ์งธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํ•™์Šต์„ ์ง„ํ–‰, ๋‚˜๋จธ์ง€๋กœ ํ‰๊ฐ€, ๊ทธ๋ฆฌ๊ณ  ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์กฐ๊ฐ์œผ๋กœ ๊ณ„์† ๋ฐ˜๋ณต 

- limiting case: leave one out validation 

- k-fold
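
A minimal sketch, assuming scikit-learn: 5-fold cross-validation trains on 4 folds and evaluates on the held-out fold, rotating through all 5.

```python
# Sketch: k-fold cross-validation (k = 5) on a small dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # one score per held-out fold
print(scores.round(3), scores.mean().round(3))

# Leave-one-out is the limiting case: k equals the number of samples.
```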
