
Probability
- experiment: a procedure that yields one of a set of possible outcomes
- sample space S: the set of all possible outcomes of an experiment
- event E: a specified subset of the outcomes of an experiment
- p(s): probability of an outcome s
0 <= p(s) <= 1, and the p(s) sum to 1 over all s in S -> numbers satisfying both conditions are probabilities
- probability of an event E: the sum of the probabilities of the outcomes in E
P(E) = sum of p(s) over s in E = 1 - P(E^C)
- random variable V: a numerical function on the outcomes of a probability space
- expected value of a random variable V
E[V] = sum of p(s) * V(s) over s in S
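
The definitions above can be checked on a small example. The sketch below assumes a fair six-sided die; the die and the helper names `prob` and `expected_value` are illustrative and not part of the original notes.

```python
from fractions import Fraction

# Sample space S of one roll of a fair six-sided die (assumed example).
S = [1, 2, 3, 4, 5, 6]
p = {s: Fraction(1, 6) for s in S}  # p(s) for each outcome

def prob(event):
    """P(E): sum of p(s) over the outcomes s in the event E."""
    return sum(p[s] for s in event)

def expected_value(V):
    """E[V]: sum of p(s) * V(s) over all outcomes s in S."""
    return sum(p[s] * V(s) for s in S)

even = {2, 4, 6}
complement = set(S) - even

total = sum(p.values())                       # the p(s) must sum to 1
p_even = prob(even)                           # P(E)
p_even_via_complement = 1 - prob(complement)  # 1 - P(E^C)
e_face = expected_value(lambda s: s)          # E[V] for V(s) = s
```

Using `Fraction` keeps the arithmetic exact, so the two ways of computing P(E) can be compared with `==`.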

Probability vs. Statistics
- Probability deals with predicting the likelihood of future events.
- Statistics analyzes the frequency of past events and is used to make sense of observations of the real world.

Compound Events and Independence
- independent: two events A and B that satisfy the following condition
P(A ∩ B) = P(A) * P(B)

Conditional Probability P(A|B)

- Definition: the probability that A occurs given that B has occurred, P(A|B) = P(A ∩ B) / P(B)
- If A and B are independent, P(A|B) = P(A)
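
As a rough illustration of independence and conditional probability, the sketch below enumerates two fair dice. The dice, the chosen events, and the helper names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two fair dice, all 36 equally likely (assumed example).
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for s in S if event(s)), len(S))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda s: a(s) and b(s)) / prob(b)

first_is_six = lambda s: s[0] == 6
second_is_even = lambda s: s[1] % 2 == 0
total_at_least_10 = lambda s: s[0] + s[1] >= 10

# Independent events: P(A and B) == P(A) * P(B), and conditioning changes nothing.
independent = (prob(lambda s: first_is_six(s) and second_is_even(s))
               == prob(first_is_six) * prob(second_is_even))
p_six_given_even = cond_prob(first_is_six, second_is_even)

# Dependent events: knowing the first die is a six raises the chance of a high total.
p_high = prob(total_at_least_10)
p_high_given_six = cond_prob(total_at_least_10, first_is_six)
```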

Bayes Theorem
- Used to reverse the direction of a dependency: P(B|A) = P(A|B) * P(B) / P(A)
- Lets us compute a posterior probability from prior probabilities

Ex) B: the event that an email is spam / A: the event that the email is received
Used to find the probability that a received email is spam
- P(A) and P(B) are the prior probabilities of the respective events.
- Computing the posterior exactly is hard, so an approximation such as naive Bayes is used
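
A minimal sketch of the spam calculation follows. All numbers are hypothetical, and the evidence A is assumed here to be "the email contains a particular word", which is one common way to set up the problem, not something stated in the notes.

```python
def posterior(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A) via Bayes' theorem, expanding P(A) by total probability."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    return p_a_given_b * p_b / p_a

# Hypothetical numbers, not from the notes:
p_spam = 0.2               # prior P(B): fraction of all mail that is spam
p_word_given_spam = 0.6    # P(A|B): spam that contains the word
p_word_given_ham = 0.05    # P(A|not B): legitimate mail that contains the word

p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word_given_ham)
```

With these assumed inputs the posterior is well above the prior, which is the sense in which observing A updates the belief in B.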

Distributions of Random Variables
- Random variables: numerical functions whose values come with associated probabilities
- Probability density functions (pdfs): represent an RV, e.g. as a histogram
- Cumulative distribution functions (cdfs): running sum of the pdf
- The pdf and the cdf carry the same information.

- A cdf can give a misleading impression of growth rate -> the curve may appear to grow very fast, while the pdf shows that the growth in each individual year is not that high.
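
The relationship between the two can be sketched with a discrete example. The yearly counts below are made up for illustration.

```python
from itertools import accumulate

# Hypothetical yearly counts of new users (not from the notes).
new_per_year = [10, 12, 11, 13, 12]
total = sum(new_per_year)

pdf = [c / total for c in new_per_year]   # share of the total in each year
cdf = list(accumulate(pdf))               # running sum of the pdf

# The pdf can be recovered from the cdf by differencing,
# so both hold the same information.
recovered = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
```

The cdf climbs steadily toward 1 even though no single year contributes more than about a quarter of the total, which is why reading growth off a cdf can mislead.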

 

Descriptive Statistics

- Central tendency measures: describe the center around which the data is distributed

- Variation or variability measures: describe how spread out the data is

 

Centrality Measure

(Arithmetic) Mean

- Advantage: meaningful for symmetric distributions without outliers (e.g. normally distributed quantities such as height and weight)

 

median: middle value

- skewed distribution

- data with outliers

- e.g. wealth, income

 

mode: the most frequently occurring element

- It may not be close to the center.

 

geometric mean: nth root of the product of n values

- The geometric mean is always less than or equal to the arithmetic mean.

- It is more sensitive to values near zero.

- Used when averaging ratios
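
The four centrality measures can be compared with Python's standard `statistics` module. The income values and growth factors below are made up for the example.

```python
import statistics

# Hypothetical incomes with one large outlier (not from the notes).
incomes = [30, 32, 35, 35, 40, 1000]

mean = statistics.mean(incomes)      # pulled far upward by the outlier
median = statistics.median(incomes)  # middle value, barely affected
mode = statistics.mode(incomes)      # most frequent element

# Averaging ratios: growth factors of x2 and x0.5 cancel out.
growth = [2.0, 0.5]
arith = statistics.mean(growth)            # overstates the net growth
geo = statistics.geometric_mean(growth)    # matches the net effect (no change)
```

`statistics.geometric_mean` requires Python 3.8 or later.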

 

Aggregation as Data Reduction

- Reduces the number of data points, not the number of features

- If you are not careful when splitting into train and test sets, the split can end up biased.

 

Variance Metric: Standard Deviation

- Variance: the square of the standard deviation sigma

- population SD: divide by n

- sample SD: divide by n-1

- When n is very large, n ~ (n-1), so the difference hardly matters

- hat: denotes a sample estimate

- The mean and standard deviation together can characterize a distribution.
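
The two divisors can be compared directly. The data values below are an assumed toy sample; the results are cross-checked in the test against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # small illustrative sample (assumed values)
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

population_sd = math.sqrt(ss / n)         # divide by n
sample_sd = math.sqrt(ss / (n - 1))       # divide by n - 1
```

The sample SD is always slightly larger, and the gap shrinks as n grows.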

 

Parameterizing Distributions

- Regardless of how the data is distributed, at least a 1 - 1/k^2 fraction of the points must lie within k sigma of the mean (Chebyshev's inequality).

- At least 75% of the points lie within 2 sigma of the mean.

- For power laws this is not very meaningful (skewed data).

- Measuring the signal-to-noise ratio is hard -> sampling error and measurement error make the observed variance inaccurate
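
The bound can be checked empirically. The sketch below uses a deliberately skewed, made-up sample; `fraction_within` is an illustrative helper name.

```python
import statistics

def fraction_within(data, k):
    """Fraction of points within k population standard deviations of the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)

# A heavily skewed, made-up sample: many small values and one huge one.
skewed = [1] * 50 + [2] * 30 + [5] * 15 + [20] * 4 + [500]

bounds = {k: 1 - 1 / k**2 for k in (2, 3, 4)}           # guaranteed minimum
observed = {k: fraction_within(skewed, k) for k in bounds}
```

The bound holds here, but because a single extreme value inflates sigma, "within 2 sigma" covers almost everything and says little about the bulk of the data, which is the sense in which the guarantee is weak for skewed data.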

 

Batting Average - Interpreting Variance

- Even a true .300 hitter has about a 10% chance of batting .275 or lower, and about a 10% chance of batting .325 or higher.
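
A rough way to reproduce figures of this size is to model each at-bat as an independent coin flip with success probability 0.300. The season length of 500 at-bats is an assumption made here; the notes do not state it.

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

at_bats = 500      # assumed season length, not stated in the notes
true_avg = 0.300

# 137 hits or fewer gives an average of at most .275;
# 163 hits or more gives an average of at least .325.
p_low = sum(binom_pmf(at_bats, k, true_avg) for k in range(0, 138))
p_high = sum(binom_pmf(at_bats, k, true_avg) for k in range(163, at_bats + 1))
```

Under these assumptions both tail probabilities come out in the neighborhood of 10%, so a single season's average is a noisy measurement of true ability.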

 

Correlation Analysis

- correlation coefficient r(X, Y): measures the degree to which Y is a function of X.

- Takes values between -1 and 1.

-1: anti-correlated

1: fully-correlated

0: uncorrelated

 

Pearson Correlation Coefficient

- r(X, Y) = Cov(X, Y) / (sigma(X) * sigma(Y)); the numerator is the covariance

 

r^2

- The proportion of the variance in Y that is explained by X

 

Variance Reduction & r^2

- Given a good linear fit f(x), the residuals d = y - f(x) should have lower variance than y.

- 1-r^2 = V(d)/V(y)

- ex. When r = 0.94, the fit explains 88.4% of V(y) (0.94^2 = 0.8836).
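
The identity 1-r^2 = V(d)/V(y) can be verified numerically with a least-squares line. The data points below are made up; the helper names are illustrative.

```python
import math

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pearson_r(x, y):
    """r = Cov(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(x, y) / math.sqrt(variance(x) * variance(y))

# Made-up data with a roughly linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Least-squares line f(x) = slope * x + intercept.
slope = covariance(x, y) / variance(x)
intercept = mean(y) - slope * mean(x)
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]

r = pearson_r(x, y)
unexplained = variance(residuals) / variance(y)   # V(d) / V(y)
```

The identity is exact only when f is the least-squares fit; for any other line, V(d)/V(y) is larger than 1-r^2.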

 

Significance

- How significant is the correlation?

- Both the sample size and r matter

- p value < 0.05 (the probability of seeing such a result by chance alone is below 5%)

- Even a weak correlation can become significant if the sample size is large enough

- permutation test: keep X fixed, shuffle Y, and recompute r -> repeat e.g. 10,000 times, then check in what top percentage of the shuffled values the observed r(X, Y) falls
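
The permutation test described above could be sketched as follows. The data, the trial count, and the function names are assumptions for illustration; the test is two-sided, comparing absolute values of r.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, trials=10_000, seed=0):
    """Fraction of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)            # break any real X-Y pairing
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Made-up data: y follows x closely, so the p-value should be tiny.
x = list(range(20))
y = [2 * a + (-1) ** a for a in x]
p_value = permutation_p_value(x, y, trials=2000)
```

Fixing the seed makes the estimate reproducible; more trials give a finer-grained p-value at proportionally higher cost.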

 

Spearman Rank Correlation

- Essentially counts the number of disordered (out-of-order) pairs

- Does not measure how well the data fits a straight line

=> robust to non-linear relationships and outliers

- Computation: replace each value with its rank, then take the Pearson correlation of the ranks

- Because it is a Pearson correlation of ranks, it lies between -1 and 1
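
A minimal sketch of the rank-based computation is shown below, using made-up data that is monotonic but far from linear. Ties receive the average of their positions, which is one common convention and an assumption here.

```python
import math

def ranks(values):
    """Ranks from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up monotonic but non-linear data with one extreme value.
x = [1, 2, 3, 4, 5, 6]
y = [1, 8, 27, 64, 125, 10_000]

linear = pearson_r(x, y)      # dragged down by the extreme value
rank_based = spearman(x, y)   # perfect, since the order is fully preserved
```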

 

Correlation vs. Causation

- Correlation does not imply causation.

- causation: directed information, from cause to effect

 

Autocorrelation and Periodicity

- Time-series data often exhibit cycles

- Computing the lag-k autocorrelation for a single lag is O(n), so computing it for every lag is O(n^2); with the Fast Fourier Transform (FFT) all lags can be computed in O(n log n)

- Correlating the series with shifted copies of itself reveals its periodicity
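
The shifting idea can be sketched directly, without the FFT speed-up. The series below is synthetic, with an assumed period of 7 samples.

```python
import math

def autocorrelation(series, lag):
    """Lag-k autocorrelation: correlation of the series with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Synthetic series with a period of 7 samples (for example, a weekly cycle).
period = 7
series = [math.sin(2 * math.pi * t / period) for t in range(10 * period)]

# Each lag costs O(n); scanning all lags this way is O(n^2).
scores = {lag: autocorrelation(series, lag) for lag in range(1, 15)}
best_lag = max(scores, key=scores.get)   # the lag with the strongest self-similarity
```

The peak at the true period is what identifies the cycle; multiples of the period also score highly but slightly lower, because fewer overlapping samples remain.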

 

Logarithms

- Definition: the inverse of the exponential function

- Because of numerical issues on computers (e.g. products of many small numbers), working with logarithms is more practical.

- Comparing raw ratios can show huge asymmetries, but comparing the logarithms of the ratios gives equal displacement.

- This is why power-law data are compared on a log scale.
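
Both points about logarithms can be seen in a few lines. The probabilities below are made-up values chosen to trigger floating-point underflow.

```python
import math

# Ratios are asymmetric on a linear scale but symmetric on a log scale.
doubled, halved = 2.0, 0.5
linear_gap = (doubled - 1.0, 1.0 - halved)          # unequal distances from 1
log_gap = (math.log(doubled), -math.log(halved))    # equal displacement from 0

# Multiplying many small probabilities underflows in floating point;
# summing their logarithms stays well within range.
probs = [1e-5] * 100
product = math.prod(probs)                      # too small to represent
log_product = sum(math.log(q) for q in probs)   # still an ordinary float
```

`math.prod` requires Python 3.8 or later.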

 

Normalizing Skewed Distributions

- Taking logarithms can normalize skewed data such as power laws and ratios

 

'School/COSE471 Data Science' Related Articles +

What is data science?

- Exploratory data analysis & visualization

- Machine learning & statistics

- High-performance computing technologies for dealing with scale

 

Skills needed for data science

- computer science + domain science + statistics

 

Typical Data Science Pipeline

1. Ask an interesting question: the most important part

2. Collect data

3. Explore the data

4. Model the data

5. Communicate and visualize the results

 

Differences between computer science and the real sciences

Scientists
- Data-driven
- Try to understand the complicated natural world
- Focus on findings: the result matters, and the method is of less interest
- Discover new things
- Meaning matters more than accuracy
- Nothing is ever completely true (a phenomenon may turn out one way or another, e.g. nothing always follows the laws of inheritance)

Computer scientists
- Algorithm-driven
- Build their own clean virtual worlds
- Focus on methods
- Invent new things
- Accuracy matters

 

Data scientists

Data scientists must learn to think like real scientists.

Software developers produce code, whereas data scientists produce insight.

Good data scientists are curious about their field of study and always try to take a broader view of the world.

 

Asking good questions

Unlike software developers, data scientists have to ask questions.

Examples)

- What can be learned from a given dataset?

- What do I, or the public, want to know?

- What datasets are available in this field?

Frame questions in terms of Who / What / Where / When / Why.

 

ex1) Given baseball data

- Does batting performance vary with fielding position?

(I would like to try analyzing this question using a certain player's data from the 2017-2018 seasons.)

- Will a rookie actually perform well?

(Rookies have no track record, so prediction can be hard. One approach is to cluster existing players whose rookie-season data resemble the rookie in question and use their average as the prediction.)

 

ex2) Demographics

- Do left-handed people have shorter lifespans than right-handed people?

- Are height and weight increasing across the population as a whole?

 

ex3) Movie data (IMDb)

- Explore the social network of movie actors

- Predict box-office success

 

ex4) NYC Taxi Cab Data - data released in response to a Freedom of Information Act (FOIA) request

- How to deploy taxis efficiently

- Find high-demand areas when planning late-night and regular bus routes

 

Google Ngrams

A dataset that Google built from book data. Enter a word and it shows, as a number, how frequently that word appeared in published books; several words can be entered at once.

It can be used to gauge the values of an era, or to track how the usage of particular words (e.g. profanity) has changed over time.

 

Google Trends

A tool that shows how often a particular search term was entered relative to total search volume. The relative search frequency of several terms can be compared.

It was used in the paper Quantifying Trading Behavior in Financial Markets Using Google Trends. The paper visualizes the relationship between the search frequency of the word debt and the Dow Jones Industrial Average, and reports that trading stocks according to the trend in debt search frequency yielded higher returns than common strategies (buy and hold, random...). The paper highlights debt because it performed best among a number of keywords that appeared repeatedly in the Financial Times. However, there is no guarantee that the strategy would perform equally well on unseen data (it may simply have been luck).

 

Properties of data

1. Structured / Unstructured Data

2. Quantitative / Categorical Data

See the post below

 

[EDA] Visualization methods by data type

Types of data: data can be broadly divided into two classes, categorical and numerical. Categorical data consists of names that distinguish categories. ex) gender - female, male / education - elementary school,

iamnotwhale.tistory.com

 

Types of data

1. Nominal (Categorical) (N) : data that can be distinguished but not ranked (operations: =, !=)

2. Ordinal (O) : data that can be ordered, e.g. with < (operations: =, !=, >, <)

3. Quantitative (Q) : data that supports arithmetic

   3-1. Q - Interval (location of zero arbitrary) : data without a fixed origin, ex) dates (operations: =, !=, >, <, +, -)

   3-2. Q - Ratio (zero fixed) : data that supports all four arithmetic operations, ex) length (operations: =, !=, >, <, +, -, *, /)

 

Classification vs Regression

Classification assigns a label to an input; it can be thought of as splitting inputs into distinct sets. Regression, by contrast, predicts a continuous target. So "Will the price go up or down?" is classification, while "What will the price be?" is regression.

 

+

I hear that the lab of the professor teaching this semester's data science course developed BioBERT. The lab has also won many awards at DreamChallenges, Kaggle-like competitions in the bio-medicine field. This is exactly the field I am really, really interested in! ... my heart is racing (?) ... Anyway, beyond the coursework, my goal for 2023 is to take on some of the papers that came out of the professor's lab!
