
Data Science Analysis Pipeline

- Modeling: the process of encapsulating information into a tool that can make predictions

- Key steps: building, fitting, validating the model

 

Philosophies of Modeling

1. Occam's Razor

- Named after a 14th-century English friar

- Meaning: the simplest explanation is the best explanation.

- Choose the answer that makes the fewest assumptions. -> For models, this means reducing the number of parameters.

- Machine learning techniques such as LASSO/ridge regression use penalty functions to minimize the features used -> they shrink unnecessary coefficients.

 

2. Bias-Variance Tradeoffs

- "All models are wrong, but some models are useful."

- Bias: error that comes from erroneous assumptions built into the model -> underfitting; the assumption itself is far from the truth (for example, forcing a linear form onto the data)

- Variance: error that comes from sensitivity to small fluctuations in the training set -> overfitting; the assumption itself is right, but the fitted result spreads too widely

 

3. Principles of Nate Silver

- Think probabilistically

- Change your forecast in response to new information

- Look for consensus

 

Results from My Model

- Demanding a single deterministic prediction from a model is asking for the impossible.

- A good forecasting model provides a probability distribution over all possible outcomes.

- Machine learning models such as logistic regression and k-nearest neighbor can do this.

 

Live Models

- A model is called live if it continually updates its predictions in response to new information.

- Do the predictions converge on the right answer? / Does it show earlier predictions so that the consistency of the model can be judged? / Is the model retrained on more recent data?

 

Seeking Consensus

- Google Flu Trends: predicted flu outbreaks from the search frequency of illness terms -> it failed after Google introduced search suggestions

- Are there competing forecasts to compare against? / What does the baseline model say? / Do you use several models that approach the prediction in different ways?

- Boosting, Bagging: machine learning techniques that explicitly combine ensembles of classifiers

 

Taxonomy of Models

1. Linear vs. Non-linear Model

- Linear: an expression in which each feature variable is weighted by a coefficient reflecting its importance -> these values are summed to compute a score

- Non-linear: expressions involving higher-order polynomials, logarithms, exponentials, and so on

 

2. Blackbox vs. Descriptive Model

- Models are mostly descriptive = they explain why they made a given decision

- Linear regression models are an example; decision trees can be one too.

- With a neural network model we cannot tell why it made a given decision.

-> opaque

 

3. First principle vs. Data-driven Model

- First principle: based on a theoretical explanation of how the system works -> simulations, scientific formulas

- Data-driven: based on observed correlations between input parameters and outcome variables -> most machine learning models

- Good models come from a suitable mix of the two

 

4. General vs. Ad Hoc Model

- Machine learning models for classification or regression are general = they rely only on the specific data, not on ideas particular to the problem

- Ad hoc models use knowledge of a specific domain and serve a special purpose

 

5. Stochastic vs. deterministic

6. Flat vs. Hierarchical

 

Model Evaluation: Baselines

- simplest reasonable models to compare against

- Typical examples (in increasing order of complexity)

* uniform/random selection among labels

* the most common label in the training data

* the best-performing single-variable model

* the same label as the previous time point

* mean, median

* linear regression

* existing model: SOTA model...

- Baseline models must be fair.

 

Evaluating Classifiers

- A binary classifier can produce four kinds of outcome: true positive, false positive, true negative, false negative

 

Threshold Classifiers

- Finding an appropriate threshold is important

Accuracy

- The ratio of correct predictions to all predictions

- Even a high accuracy can diverge from what actually matters, so comparing classifiers on accuracy alone is not appropriate

- It is especially misleading when |P| << |N|

 

Precision

Recall

F-score

- The harmonic mean is always less than or equal to the arithmetic mean, which keeps the F-score from being inflated

- Both scores (precision and recall) must be high for the final F value to be high
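The three measures above can be sketched from the four confusion-matrix counts. This is a minimal illustration rather than the lecture's own code; the function name `classification_metrics` and the example counts are assumptions made up for this sketch, and F-score here means the balanced F1 variant.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical imbalanced data set: 10 positives, 990 negatives
acc, prec, rec, f1 = classification_metrics(tp=1, fp=0, tn=990, fn=9)
print(acc, prec, rec, f1)
```

In this made-up case accuracy is very high even though only one of ten positives is found; recall and F1 expose the problem.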

 

Sensitivity vs. Specificity

- Sensitivity: True Positive Rate

- Specificity: True Negative Rate

- For cancer patients, spam, and similar cases, which label counts as "true" (positive) does not depend on whether the content is good or bad but on whether the item is the thing we are targeting

 

In summary

- Accuracy: misleading when the class sizes differ substantially

- Recall: equals accuracy when the classifier is balanced

- High precision: very hard to achieve when class sizes are unbalanced

- F-score: does best among single statistics, but all four values together are what describe the performance of a classifier

 

Receiver-Operator Characteristic (ROC) curves

- As the threshold changes, the recall/precision and TPR/FPR values change.

- The area under the ROC curve (AUC) serves as a measure of accuracy

 

Evaluating Multiclass Systems

- Classification gets harder as the number of classes grows

- top-k success rate: gives credit when the correct class is among the first k predictions

(From a web search: the standard accuracy obtained when the true class counts as correct if it matches any of the N most probable predicted classes)

 

Summary statistics: Numerical Error

- f: forecast, o: observation

- absolute error: (f-o)

- relative error: (f-o)/o -> usually the better measure

- Several ways to summarize error

* Mean or median squared error (MSE)

* Root mean squared error 

Error histograms
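A small sketch of these error summaries, assuming forecasts and observations arrive as equal-length lists and that no observation is zero (otherwise the relative error is undefined). The function name `error_summary` and the sample numbers are illustrative assumptions.

```python
import math


def error_summary(forecasts, observations):
    """Return absolute errors, relative errors, MSE and RMSE."""
    errors = [f - o for f, o in zip(forecasts, observations)]
    relative = [e / o for e, o in zip(errors, observations)]
    mse = sum(e * e for e in errors) / len(errors)
    return {"errors": errors, "relative": relative,
            "mse": mse, "rmse": math.sqrt(mse)}


summary = error_summary([11, 18, 33], [10, 20, 30])
print(summary)
```

RMSE is reported in the same units as the observations, which is one reason it is often preferred over MSE when presenting results.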

 

Evaluation Data

- out-of-sample prediction: produce results on data the model never saw while it was being built

- Split the data along the lines of training 60%, validation 20%, testing 20%

- The testing data must never be looked at for any other purpose; it should be used only at the moment of final testing.

 

Sins in evaluation

- Did I mix training and test data?

- Is there a bug in my implementation?

- Asking questions like these helps uncover mistakes.

 

Cross-validation

- When data is scarce, split it into k pieces, train on k-1 of them, evaluate on the remaining piece, and repeat with a different held-out piece each time

- limiting case: leave one out validation 

- k-fold
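The splitting step of k-fold cross-validation might look like the sketch below. It only produces index lists; shuffling the data beforehand and the model fitting itself are left out, and the name `k_fold_splits` is an assumption for illustration.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_splits(10, 3))
for train, test in folds:
    print(len(train), test)
```

Setting k equal to the number of items gives the leave-one-out limiting case mentioned above.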

'School/COSE471 Data Science' Related Articles +

Exploratory Data Analysis

- Why looking closely at the data matters

* Spot mistakes made during data collection or preprocessing

* Identify violations of statistical assumptions

* Discover patterns in the data

* Form hypotheses

 

Anscombe's Quartet

- Data sets can share the same mean, variance, correlation, and regression line while the shape of their distributions looks completely different.

 

Mapping Data to Image

- Effectiveness ranking: position > length > slope, angle > area > color intensity > color hue, shape

- Area and color intensity can be used for ordinal data

- A pie chart uses area and angle together, whereas a donut chart has an empty center, so the angle cue is left out.

- The least effective visualization examples

The ordering of the colors is not clear

- Decide on a color ordering

In the left-hand case no ordering is possible, but on the right the colors can be ordered by intensity.

- Overly complicated visualizations are bad (for example, a single figure that packs in too much information, or colors whose order is hard to tell apart).

 

Tufte's Design Principle

1. Maximize data-ink ratio

- data-ink ratio = data ink / total ink used in graphic

- This value should be as close to 1 as possible.

- Example: convert a 3D bar chart to 2D and remove the shadows.

2. Minimize lie factor

- Lie factor = size of effect shown in graphic / size of effect in data -> should be close to 1.

- Cases where the picture exaggerates reality

- Cases where a 3D graph makes the drawing look larger than the value it represents

 

3. Minimize chartjunk

- Unnecessary dimensions, coloring that conveys no information, excessive grid lines, and decoration all get in the way of communicating the data.

- Simplify the visualization by removing the grid, the background color, the borders, and so on.

- Tick labels can be reduced further, for example by writing the values inside the bars.

- Tools such as Matplotlib can be used for this.

 

4. Use proper scales and clear labeling

- Scale distortion: not starting the axis at 0 and zooming in on the values of interest makes a difference look larger than it really is.

ex) Bush Tax cuts expire~ -> Make the axis start at 0.

 

Types of Charts

1. Bar Chart

2. Line Chart: changes in a trend over time

* Bar vs. Line: a line implies a connection between points, so it should not be used for categorical data.

In the first row a bar chart is the sensible choice; in the second row, a line chart.

* Do not use a bar chart to compare book ratings.

* Banking to 45º: two line segments are easiest to tell apart when their angle is close to 45 degrees.

The second graph is too flat and the third graph is too steep.

* A graph that is too flat or too steep can be misread, so keep the slopes near 45 degrees.

 

3. Scatter Plots / Bubble Charts

- A scatter plot is effective for showing the value of each individual point in 2D

- Use a bubble chart for data sets with three or four variables

- Principal component analysis can project high-dimensional data down to 2D.

- Points that are too large cause overplotting, so reduce the point size.

- Points crowded into one region cause overplotting, so lower the opacity.

- heatmap: coloring by frequency makes the data easier to read.

- Varying the size, shape, and color of points lets you visualize three-dimensional data. -> Bubble Chart

* Do not plot data in 3D space just because it has three dimensions. Reduce the dimensionality instead.

 

4. Pie Chart

- Pie Chart vs. Bar Chart

* A pie chart is convenient for answering questions about proportions.

* A bar chart is convenient for answering questions about actual values.

 

5. Donut Chart

- It offers only area (no angle), so it is less readable than a pie chart.

 

6. Stacked Bar Chart

7. Stacked Area Chart

Stacked Area Chart / 100% stacked Area Chart

- Stacked Area Chart vs Line Graphs

* In a stacked area chart it is easy to see how the share of each component changes.

* In a line chart it is easy to read the absolute value of each component.

 

8. Histograms

- The bin size matters.

- The count matters.

- If the bins are too fine, the shape of the distribution may not come through.

- Frequency vs. Density Histograms

* Dividing each bin count by the total count gives a probability density plot. -> Easier to interpret.

 

9. Heat Maps

 

10. Box & Whisker Plots

- Shows the max, min, quartiles, median, and so on

 

๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ํˆด

- Excel: ๊ฐ€์žฅ ์œ ๋ช…ํ•˜์ง€๋งŒ ์ข‹์€ ๊ทธ๋ž˜ํ”„๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์€ ์•„๋‹˜.

- R: ํ†ต๊ณ„์–ธ์–ด

- Matplotlib: ํŒŒ์ด์ฌ

 

Representing Multivariate Data

- Drawing several small plots is a good approach.

- Example: population distribution by country -> draw one distribution per country and combine them into a single figure.

 

Overinterpreting Variance

- Why the data fluctuates at the tail end: every gene has a different size!

-> There are few large genes, so the trend there is hard to read.

 

Look with a critical eye: pretty data yields pretty visualizations.


Central Dogma of Statistics

 

 

Statistical Data Distributions

- Every random variable has a particular frequency/probability distribution.

- Types: binomial distribution, normal distribution, Poisson distribution, power law distribution

 

Importance of Classical Distributions

- They do show up in practice

- They come with closed-form formulas (cdf, pdf) and tests (t-test)

- A similar shape does not mean the data actually follows one of these distributions.

 

Binomial Distribution

- An experiment made of n independent trials -> each trial must have exactly two outcomes

- Example: coin tossing

- Shape: discrete but bell-shaped (or half-bell shaped)

 

Normal Distribution

- Bell-shaped

- Height, IQ, and so on

- Fully described by its mean and standard deviation.

- Not every bell-shaped distribution is a normal distribution.

- It is the limit of the binomial distribution as n goes to infinity

- The sum of independent normally distributed variables is normal.

- A mixture of normal distributions is not normal.

-> Women's heights and men's heights are each normally distributed, but the heights of the whole population are not.

 

Lifespan Distribution

- If the probability of surviving each day is p, the probability of a lifespan of exactly n days (surviving n-1 days, then dying) is p^(n-1)*(1-p)

 

Poisson Distribution

- Describes the frequency of intervals between rare events

 

Power law Distribution

- Definition: P(X=x) = cx^(-a)

* c: normalization constant, a: exponent

* c: once a is given, c is fixed to a single value because the probabilities must sum to 1.

- Unlike the normal distribution it is not concentrated around the center; very large values keep occurring.

- 80-20 rule: 20% of the X hold 80% of the Y. -> For example, in wealth inequality a small number of rich people hold most of the wealth

ex) City Population - Power Law

: The rich get richer.

- When x doubles, the probability drops by a factor of 2^a.

- Examples of power laws

1) Internet sites with x links

2) Richter magnitude and earthquake frequency

3) Word usage frequency

4) Number of wars that killed x people

 

Word Frequencies and Zipf's Law

- Zipf's law: the k-th most popular word is used 1/k as often as the most popular word.

-> a power law with a = 1

-> The word of rank 2x is used half as often as the word of rank x

 

Properties of Power Laws

- The mean is not meaningful.

- The standard deviation is not meaningful either, because some values are far larger than the mean.

- The median is meaningful.

- The distribution is scale invariant = a zoomed-in part of the distribution looks like the whole distribution.

 

Statisticians vs. Data Miners

- Statisticians: care whether what was found in the data is significant

- Data miners: care whether what was found in the data is interesting

- Meaningful findings: in a large data set a strong correlation can look important on the surface, but the question may need a closer look.

 

Comparing Population Means

- T-test: evaluates the difference between the means of two samples

- When the means or the standard deviations differ clearly, the difference is easy to judge by eye.

 

T-test

- Two means differ significantly when:

* the difference between the means is relatively large

* the standard deviations are small enough

* the samples are large enough

- Welch's t-statistic

* s^2: sample variance

- t distribution table

* df: degrees of freedom (for a sample of size n the degrees of freedom are n-1)

* For a one-tailed test the critical t value is lower.

* At df = 120 the critical value is 1.98, and as df goes to infinity it becomes 1.96 (two-tailed, significance level 0.05).
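As a rough sketch of Welch's t-statistic, the difference of the sample means is divided by sqrt(s1^2/n1 + s2^2/n2), where s^2 is the sample variance. The degrees-of-freedom line uses the Welch-Satterthwaite approximation, which the notes do not spell out, so treat it and the function name `welch_t` as assumptions of this example.

```python
import math
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Return (t statistic, approximate degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances s^2
    se2 = v1 / n1 + v2 / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, df)
```

The resulting |t| is then compared with the critical value from the t table at the computed degrees of freedom.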

 

Kolmogorov-Smirnov Test (KS-test)

- Expresses the difference between two probability distributions through the maximum vertical (y) distance between their cdfs.

- max distance between two cdfs: D(C1, C2) = max|C1(x)-C2(x)| (-∞ <= x <= ∞)

- At significance level a, the distributions are judged different if D(C1, C2) > c(a) * √((n1+n2)/(n1*n2)).

* c(a): can be found by table lookup
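The two-sample KS statistic and its rejection threshold could be computed as in the sketch below. The empirical cdfs are evaluated at every observed value, which is where the maximum gap must occur. The default `c_alpha=1.36` is the commonly tabulated value for a = 0.05; the function names are assumptions for illustration.

```python
import bisect
import math


def ks_statistic(sample1, sample2):
    """Maximum vertical distance between the two empirical cdfs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    d = 0.0
    for x in s1 + s2:
        c1 = bisect.bisect_right(s1, x) / len(s1)  # fraction of s1 <= x
        c2 = bisect.bisect_right(s2, x) / len(s2)  # fraction of s2 <= x
        d = max(d, abs(c1 - c2))
    return d


def ks_threshold(n1, n2, c_alpha=1.36):
    """Critical distance c(a) * sqrt((n1 + n2) / (n1 * n2))."""
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))


print(ks_statistic([1, 2, 3], [4, 5, 6]), ks_threshold(100, 100))
```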

 

Normality Testing

- Run the KS-test between the data and a sample drawn from the theoretical distribution.

 

Bonferroni Correction

- A statistical significance level of 0.05 means there is a 1/20 chance that the result arose by chance.

-> This is why results should be held to a higher standard when many tests are run

- When testing n hypotheses, each p-value must reach the a/n level. -> Only then can the result be considered significant at level a.

 

Significance of Significance

- With a large enough sample, an extremely small difference can register as highly significant.

- Significance measures how confident we can be that a difference between distributions exists; it does not measure the effect size or the importance/magnitude of the difference.

 

Measuring Effect Size

- Pearson correlation coefficient: 0.2 = small effect, 0.5 = medium, 0.8 = large

- Percentage of overlap between distributions: 53% = small, 67% = medium, 85% = large

- Cohen's d = (u - u')/sigma : small > 0.2, medium > 0.5, large > 0.8

 

Permutation test, p-values

- If the data truly supports your hypothesis, randomly shuffled versions of the data set should be less likely to support it.

- The rank of the real data among the random permutations determines the significance.

- Run at least 1,000 permutations (enough to resolve the p-value to three decimal places); the more permutations you run, the more convincing the significance estimate becomes.

A uniformly random permutation can be generated as follows:

for i=1 to n do a[i]=i;
for i=1 to n-1 do swap(a[i], a[Random[i,n]]);

* If Random instead picks from the whole range 1..n at every step, the resulting permutations are not uniformly distributed.
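The pseudocode above translates to Python roughly as follows. Indices are 0-based here, so the swap partner is drawn from positions i..n-1; the function name and the optional `rng` parameter are assumptions added to make the sketch testable.

```python
import random


def random_permutation(n, rng=random):
    """Return a uniformly random permutation of 1..n."""
    a = list(range(1, n + 1))
    for i in range(n - 1):
        # Pick the partner from the not-yet-fixed suffix i..n-1 (inclusive),
        # not from the whole array; that is what keeps the shuffle uniform.
        j = rng.randint(i, n - 1)
        a[i], a[j] = a[j], a[i]
    return a


print(random_permutation(10))
```

In everyday code `random.shuffle` does the same job; the explicit loop is shown only to mirror the pseudocode.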

 

Sampling from distributions

- When sampling points in a circle, choosing a random radius and a random angle puts too many points near the center.

- Sample the points as (x, y) coordinates instead.
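One way to sample by (x, y), assuming the unit circle centered at the origin, is rejection sampling: draw a point uniformly from the enclosing square and retry until it falls inside the circle. The function name is an assumption for this sketch.

```python
import random


def sample_in_unit_circle(rng=random):
    """Return a point drawn uniformly from the unit disk."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:  # keep only points inside the circle
            return x, y


rng = random.Random(0)
points = [sample_in_unit_circle(rng) for _ in range(4000)]
inner = sum(1 for x, y in points if x * x + y * y <= 0.25)
print(inner / len(points))
```

With uniform sampling the inner disk of radius 0.5 should hold about a quarter of the points, since it covers a quarter of the area; sampling radius and angle uniformly would put about half of them there.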

 

Sampling in One dimension

- To sample from any probability distribution, convert it to its cdf and draw through the cdf (pick a uniform random number and find the value whose cdf equals it).
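As an illustration of sampling through the cdf, the sketch below uses the exponential distribution, chosen here only because its cdf F(x) = 1 - e^(-rate*x) inverts in closed form; the notes do not name a specific distribution.

```python
import math
import random


def sample_exponential(rate, rng=random):
    """Inverse-transform sample from an exponential distribution."""
    u = rng.random()  # uniform in [0, 1)
    # Solve u = 1 - exp(-rate * x) for x
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(20000)]
print(sum(draws) / len(draws))
```

The sample mean should come out near 1/rate. For distributions without a closed-form inverse, the same idea works with a numerically tabulated cdf and a binary search.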

 

Statistical Hypothesis Testing

Central Limit Theorem

- Consider the mean of a large number of independent, identically distributed (i.i.d.) random variables.

- That mean is approximately normally distributed.

- If x1, ..., xn are random variables with mean μ and variance σ^2, and n is very large, then

Z = 1/n * (x1+...+xn) is approximately normally distributed (mean μ, variance σ^2/n, i.e. standard deviation σ/√n).

- For very large n, Binomial(n, p) can be approximated by Normal(np, np(1-p)), where the second parameter is the variance.

(A binomial variable is the sum of n independent Bernoulli trials.)

- Why the central limit theorem matters: problems involving other kinds of distributions can be handled with normal-distribution tools

 

Statistical hypothesis testing

Suppose we suspect that a coin is not fair.

H0 (null hypothesis): the coin is fair, i.e. p = 0.5.

H1 (alternative hypothesis): the coin is not fair. -> This is what we want to demonstrate

 

The test proceeds as follows.

- Flip the coin n times and count the number of heads, X.

- Each flip is a Bernoulli trial, so X is Binomial(n, p).

- By the CLT, X can be approximated by Normal(np, np(1-p)).

- Choose the significance level: how much Type I error (false positive) we are willing to tolerate

- Suppose the significance level is 0.05 and n = 1,000. If heads comes up 532 times, the p-value is 4.63%, which is below 5%. -> The result is significant, so we reject H0 and accept H1.

 

Types of Error

Decision on H0 | H0 is true | H0 is false
Reject (positive call) | Type I error (False Positive) | Correct (True Positive)
Fail to reject (negative call) | Correct (True Negative) | Type II error (False Negative)

Statistical Hypothesis Testing with p-value

- P-value: the probability, assuming the null hypothesis (H0) is true, of seeing a result at least as extreme as the one observed

- The significance level is usually set to 0.05 or 0.01

- When X = 530, p-value = 0.062; when X = 532, p-value = 0.0463
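The two p-values above can be approximated with the normal approximation from the CLT. The sketch assumes n = 1,000 flips and a continuity correction of 0.5; neither is stated explicitly at this point in the notes, but with those assumptions the results land close to the quoted 0.062 and 0.0463.

```python
import math
from statistics import NormalDist


def two_sided_p_value(heads, n=1000, p=0.5):
    """Two-sided p-value for `heads` heads in n flips under H0: P(heads) = p."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    null_dist = NormalDist(mu, sigma)
    if heads >= mu:
        # Continuity correction (assumption): treat "X >= heads" as "X > heads - 0.5"
        tail = 1 - null_dist.cdf(heads - 0.5)
    else:
        tail = null_dist.cdf(heads + 0.5)
    return 2 * tail


print(two_sided_p_value(530), two_sided_p_value(532))
```

At a significance level of 0.05 the first value fails to reject H0 and the second rejects it, which shows how sharp the cutoff is.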

 

Confidence interval

If we observe 529 heads in 1,000 flips, the estimate is p = 0.529.

import math
from statistics import NormalDist

p_hat = 0.529
sigma_hat = math.sqrt(p_hat * (1 - p_hat) / 1000)
# 0.95 is the confidence level, so cut off 2.5% in each tail
z = NormalDist().inv_cdf(0.975)
print([round(p_hat - z * sigma_hat, 3), round(p_hat + z * sigma_hat, 3)])

>>> [0.498, 0.560]
# We are 95% confident that the true p lies inside this interval.

 

Example: Running an A/B test

- We must choose whichever of two ads attracts more clicks.

- Na = number of people who see ad A, na = number of people who click ad A, pa = probability of clicking ad A

-> na/Na can be approximated by Normal(pa, pa(1-pa)/Na). (Reason: it is a Bernoulli trial repeated Na times.)

-> nb/Nb can be approximated by Normal(pb, pb(1-pb)/Nb).

- The two distributions are independent, so their difference is also normal.

- We can test with H0: pa = pb.

- Let us use a significance level of 0.05.
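A minimal sketch of that test: estimate each click rate and its standard deviation, then standardize the difference. The function names and the click counts (200 of 1,000 for A, 180 of 1,000 for B) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def estimated_parameters(views, clicks):
    """Estimated click probability and its standard deviation."""
    p = clicks / views
    return p, math.sqrt(p * (1 - p) / views)


def a_b_test_statistic(views_a, clicks_a, views_b, clicks_b):
    """Standardized difference p_B - p_A, approximately N(0, 1) under H0."""
    p_a, sigma_a = estimated_parameters(views_a, clicks_a)
    p_b, sigma_b = estimated_parameters(views_b, clicks_b)
    return (p_b - p_a) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)


z = a_b_test_statistic(1000, 200, 1000, 180)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
print(z, p_value)
```

With these made-up counts the p-value is well above 0.05, so H0 (pa = pb) would not be rejected.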


Scores and Rankings

- Scoring functions: reduce multi-dimensional data to a single value that highlights a particular property

- Rankings: order the items by sorting their scores

 

Assigning Grades

- Course grades are assigned by a scoring function.

- Characteristics: arbitrariness (every professor uses different criteria), no validation data (there is no "right" grade), general robustness (a given student tends to get similar grades across different courses)

 

Scoring vs. Regression

- There is no gold standard/right answer.

- Machine learning methods such as linear regression could be used to train a scoring function, but usually they are not.

 

BMI

- BMI = mass / height^2

- BMI is easy to interpret.

- Looking at the BMI of athletes, few basketball players fall in the obese range and few football players fall in the normal range.

 

Gold Standards and Proxies

- Gold standards: labels or answers that we trust to be right and that reflect the scoring goal

- Proxies: readily accessible data related to what we want to measure

ex) GPA -> used as a proxy for how well a student is engaging with their courses

 

Scoring vs. Machine Learning

- With a gold standard, a regression function can be trained to predict the value accurately.

- With only a proxy, all we can do is use it to evaluate the scoring function.

 

Scores vs. Rankings

- Can the value be interpreted in isolation? (111th out of 351 teams / RPI = 39.18)

- What does the distribution of values look like? (How large is the gap between 1st and 2nd?)

- Do we care about the middle or the extremes? (The gap between 1st and 2nd vs. the gap between 50th and 51st?)

 

Recognizing Good Scoring Functions

- Easy to compute and easy to understand

- Allows a monotonic interpretation

- Produces satisfying results on outliers

- Uses systematically normalized variables

 

Normalization and Z-scores

Z_i = (X_i - Xbar)/sigma

- Makes the values unitless

- The result has mean 0 and sigma 1
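The z-score formula above can be sketched as follows. This version uses the population standard deviation (`pstdev`), which is an assumption; with the sample standard deviation the resulting sigma would not be exactly 1.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / sigma for every value."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


zs = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(zs)
```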

 

Advanced Ranking Techniques

- Linear combinations of normalized values

- Elo rankings

- Merging rank orderings

- Directed graph orderings

 

Binary Comparisons

- A comparison between team A and team B

- Simply counting such comparisons is not appropriate when teams face opponents of differing strength.

 

Elo Rankings

- A system in which everyone starts with the same rating and scores are then adjusted to reflect the difficulty of each match

After each game the value k*(S_A - u_A) is added to the rating, which changes the ranking.

- S_A: the score team A earned in the match, usually 1 for a win and -1 for a loss

- u_A: the score A was expected to earn in the match against B

* Against a weak team u_A is close to 1, and against a strong team it is close to -1.

* This is so that beating a strong team counts heavily toward the rating.

- k: controls the maximum adjustment possible from a single match

 

Expected Match Score

- If P(A>B) is the probability that A beats B, then u_A is defined as

u_A = 1 * P(A>B) + (-1) * (1 - P(A>B))

That is, u_A is the expected value of the match score.

- If the rating system is meaningful, P(A>B) should be a function of r(A)-r(B).

* How do we judge the chance of winning when the teams have no head-to-head record?

-> We can map from the difference in rating. (A well-built system should make this work well.)

-> How to do the mapping? Use the logit function

 

Logit Function

f(0) = 1/2

f(∞) = 1

f(-∞) = 0; a function with these properties is

f(x) = 1 / (1 + e^(-cx))

Its graph is an S-shaped curve rising from 0 to 1.

Here x stands for r(A)-r(B).

 

Borda's method

- A method for combining the results of several competitions (rankings)

- Assign a different score to each rank position and determine the final rank from the total score

- Linear position weights are fine when we are equally confident about every position, but distinguishing items near the top and the bottom is usually easier than in the middle

-> so normally distributed weights should be used
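A minimal sketch of Borda's method with linear position weights (position 1 costs 1 point, position 2 costs 2 points, and the lowest total wins). The function name, the alphabetical tie-break, and the three example rankings are assumptions; normally distributed weights would replace the `position` term.

```python
from collections import defaultdict


def borda(rankings):
    """Merge several rankings of the same items into one consensus order."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            scores[item] += position  # linear weight: cost equals position
    # Lowest total first; ties broken alphabetically
    return sorted(scores, key=lambda item: (scores[item], item))


consensus = borda([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]])
print(consensus)
```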

 

Directed Graph Orderings

- A>B can be represented as edge(A, B)

- If there are no inconsistencies, the result is a directed acyclic graph.

- Topological sorting

How to extract the order

1. Select a node A with no incoming edges and delete it

2. Among the remaining nodes, pick one with no incoming edges and delete it -> repeat

Result) ABCGDEF or GABCDEF
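The two steps above correspond to what is commonly called Kahn's algorithm. Since the original graph figure is not available, the example below uses a small hypothetical graph; the function name is also an assumption.

```python
from collections import deque


def topological_order(nodes, edges):
    """Order nodes so that every edge (a, b) has a before b."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()       # a node with no incoming edges
        order.append(v)
        for w in successors[v]:   # "delete" v by removing its outgoing edges
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle (inconsistent comparisons)")
    return order


# Hypothetical comparisons: A>B, A>C, B>D, C>D
order = topological_order("ABCD", [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(order)
```

When several nodes have no incoming edges at the same time, any of them may be picked, which is why more than one valid ordering can exist.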


Typical Data Science Pipeline

Data Munging / Data wrangling

- Acquiring data and preparing/cleaning it for analysis

 

๋ฐ์ดํ„ฐ๊ณผํ•™์— ์‚ฌ์šฉํ•˜๋Š” ์–ธ์–ด

- Python: ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ, ํ”ผ์ณ ๋“ฑ ํฌํ•จ - regular expressions

* regular expression ์˜ˆ

    - [hc]at : hat or cat

    - .at : .์ž๋ฆฌ์— ์•„๋ฌด๊ฑฐ๋‚˜ ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ์Œ

    - ab*c : b๊ฐ€ 0๊ฐœ ์ด์ƒ ์žˆ์–ด์•ผ ํ•จ

    - ab+c : b๊ฐ€ 1๊ฐœ ์ด์ƒ ์žˆ์–ด์•ผ ํ•จ

- R: ํ†ต๊ณ„ํ•™์ž๋“ค์ด ์ฃผ๋กœ ์‚ฌ์šฉ

- Matlab: ํ–‰๋ ฌ

- Java/C++: ๋น…๋ฐ์ดํ„ฐ ์‹œ์Šคํ…œ์„ ์œ„ํ•œ ์–ธ์–ด

- Excel: bread and butter toool
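The four regular expression examples above can be tried with Python's `re` module. The helper `full_matches` and the candidate strings are made up for this sketch; `re.fullmatch` is used so that the whole string has to match the pattern.

```python
import re


def full_matches(pattern, candidates):
    """Return the candidates that match the pattern in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


print(full_matches(r"[hc]at", ["hat", "cat", "bat"]))
print(full_matches(r".at", ["hat", "bat", "at"]))
print(full_matches(r"ab*c", ["ac", "abc", "abbc"]))
print(full_matches(r"ab+c", ["ac", "abc", "abbc"]))
```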

 

Notebook Environments

- reproducible, tweakable, documented

- They make the data pipeline easy to maintain.

 

๋ฐ์ดํ„ฐ ์ˆ˜์ง‘

- Files (CSV, XML, JSON)

- Databases (SQL server)

- API: Application Programming Interface

- Web Scraping (HTML)

 

๋ฐ์ดํ„ฐ ํ˜•์‹

1. CSV (Comma-Separated Values)

2. XML (eXtensible Markup Language)

<student>

    <name> 000 </name>

    <age> 19 </age>

</student>

Fields are delimited by opening <tag> and closing </tag> markers like this

3. JSON (JavaScript Object Notation)

{

    "a": {

        "name": "000",

        "age": 19

    }

... and so on; fields are delimited by {} and quotation marks

 

Loading Data from a CSV File with pandas

import pandas
student_data = pandas.read_csv("student_table.csv")

 

Loading Data from a Relational DB

import pandas
# sql_string holds a query such as the SQL examples below
student_data = pandas.read_sql_query(sql_string, db_uri)
SELECT * FROM student_table -- load every row
SELECT studentid, name FROM student_table WHERE gpa > 3.5
-- ids and names of the students whose gpa exceeds 3.5

SELECT S.name, A.name
FROM student_table S, advisor_table A
WHERE S.advisorid = A.id -- pair each student's name with the name of their advisor

SELECT S.advisorid, avg(S.gpa)
FROM student_table S
GROUP BY S.advisorid -- advisor id and average gpa of the students sharing that advisor

SELECT S.advisorid, avg(S.gpa)
FROM student_table S
WHERE S.birthdate >= '2000-01-01' AND S.birthdate < '2010-01-01'
GROUP BY S.advisorid

 

Getting Data through an API

- REST (Representational State Transfer) API

import json
import requests

url = 'https://...format=json'

data = requests.get(url).text
parsed_data = json.loads(data)

 

Web Scraping

<html>
	<head>
    	<title> ... </title>
    </head>
    <body>
        <h1> hello world! </h1>
        <p> COSE471 data science </p>
    </body>
</html>

 

๋ฐ์ดํ„ฐ์˜ ์›์ฒœ

1. Proprietary Data Sources

- ๊ธฐ์—…์ด ์ˆ˜์ง‘ํ•˜๋Š” ๋‚ด๋ถ€ ๋ฐ์ดํ„ฐ

- API๋กœ ์ œํ•œ์ ์œผ๋กœ ๊ณต๊ฐœํ•  ์ˆ˜ ์žˆ์Œ

- ์™ธ๋ถ€๋กœ ๋ฐ์ดํ„ฐ ์ „์ฒด๋ฅผ ๊ณต๊ฐœํ•˜๊ธฐ ์–ด๋ ค์šด ์ด์œ : ๋น„์ฆˆ๋‹ˆ์Šค์  ์ด์œ (๊ฒฝ์Ÿ์‚ฌ๋ฅผ ๋„์™€์ค„ ์ˆ˜ ์žˆ์Œ), ํ”„๋ผ์ด๋ฒ„์‹œ ์ด์Šˆ

- ์˜ˆ์‹œ: 2006๋…„ AOL search log release

 

2. Government Data Sources

- Open data published by cities, countries, and so on

- Freedom of Information Act: lets you request that data be released

- Privacy must be protected when the data is released

 

3. Academic Data Sets

- Data published alongside papers and similar work

 

4. Web Search/Scraping

- Scraping: extracting data or text from web pages

- Before writing one, check whether an API already exists or whether someone has already written a scraper

 

5. Available Data Sources

- Bulk Downloads: Wikipedia, IMDB, Million Song Database

- API access: New York Times, Twitter, and so on

 

6. Sensor Data Logging

- Image and video data: observing the weather using Flickr images

- Observing earthquakes using the accelerometers in mobile phones

- Observing traffic flow using taxi GPS

- Building a logging system: storage is cheap

 

7. Crowdsourcing

- Data contributed by many authors, such as Wikipedia, Freebase, IMDB

- Paying platforms such as Amazon Turk or CrowdFlower lets a large number of people gather data for you

 

8. Sweat Equity

- Historical records and the like are stored on paper, as PDFs, and so on, so they need to be converted

 

Cleaning Data: Garbage In, Garbage Out

- Issues that come up while cleaning data

* Distinguishing artifacts from errors

* Data compatibility, unification

* Handling missing values

* Estimating unobserved (zero) counts

* Detecting outliers

 

Errors vs. Artifacts

- error: information fundamentally lost during data collection

- artifact: a systematic problem introduced while processing the data

- sniff test: examining the product up close to catch the scent that something is wrong

 

Data on the Number of Scientists by Year of First Publication

- Using PubMed, the year of first publication was looked up for the 100,000 most frequently cited scientists

- Expected distribution: a spike at 1960, a decline toward 0 approaching 2023, and a roughly constant level in between

(Reason: data collection started in 1960, so earlier papers may be counted under 1960, and toward recent years many people have careers too short to make the ranking)

- Actual distribution (left)

Data before 1965 was entered by hand, so some of it may be missing

After 2000 the way author names are recorded changed, so different authors can be treated as the same person, which introduces errors

 

Data Compatibility

1. Unit Conversion

- In 1999 NASA's Mars Climate Orbiter was lost

- The cause was a mismatch of units

- Z-scores are useful because they make the units disappear

 

2. Number / Character Representation

- The 1996 Ariane 5 rocket explosion: caused by converting a 64-bit float to a 16-bit integer

- Avoid approximating real-valued quantities with integers

- Record measurements as decimal numbers and counts as integers

- Record fractional quantities as decimal values

 

3. Character Representations

- Pay attention to the encoding of text data

- UTF-8: multibyte encoding for all Unicode characters

 

4. Name Unification

- Author names can be recorded in many different ways. ex) Steve, Steven, S., and so on

- Unify the names into one form: case conversion, dropping middle names, and so on

- There is a tradeoff between false positives and false negatives

 

5. Time / Date Unification

- Unify dates and times, for example by using UTC

 

6. Financial Unification

- Currency conversion: use exchange rates

- Use returns / percentage change rather than absolute price changes

- Adjust stock prices properly (splits, dividends)

- Time value of money: correct for inflation to make long-term comparisons fair.

 

Dealing with Missing Values

- Setting a value to 0 just because it is unknown is wrong.

- Handling missing values: estimation or imputation is preferable to leaving the field blank.

 

Imputation methods

- Heuristic-based imputation: for example, estimate a date of death from average life expectancy

- Mean value imputation: replace with the mean

- Random value imputation: replace with a random value -> repeating this lets us assess statistically how much the imputation affects the result

- Imputation by nearest neighbor: replace using the closest record

- Imputation by interpolation: replace using linear regression -> works when records do not have many empty fields
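Mean value imputation, the simplest of these, might be sketched as below. Missing entries are represented as `None`, which is an assumption of this example, as is the function name.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([1.0, None, 3.0, None, 5.0]))
```

Mean imputation leaves the column mean unchanged but shrinks its variance, which is one reason the other methods in the list exist.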

 

Outlier Detection

- Check the maximum and minimum values carefully

- Normally distributed data does not have large outliers (k sigma from the mean)

- Do not just delete outliers; work out why they occurred and fix the cause

- Easy to spot through visualization (only feasible in low dimensions)

- After clustering, look for points that lie far from the cluster centers
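The k-sigma rule can be sketched as follows, flagging values for inspection rather than deleting them. The default k = 3 and the sample data are assumptions for illustration.

```python
from statistics import mean, stdev


def k_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]


data = [10] * 20 + [100]
print(k_sigma_outliers(data))
```

Because the outlier itself inflates the mean and the standard deviation, very small samples can hide an outlier from this rule; median-based variants are more robust in that situation.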

 

Delete Outliers Prior to Fitting?

- Removing outliers before fitting can produce a better model.

ex) When the outliers come from measurement error

- It can also produce a worse model.

ex) When you delete every point simply because a simple model fails to explain it
