
Typical Data Science Pipeline

Data Munging / Data Wrangling

- Obtaining data and preparing/cleaning it for analysis

 

Languages Used in Data Science

- Python: comes with rich libraries and features, including regular expressions

* Regular expression examples

    - [hc]at : matches hat or cat

    - .at : any single character can fill the . position

    - ab*c : zero or more b's between a and c

    - ab+c : one or more b's between a and c

- R: used mainly by statisticians

- Matlab: matrices

- Java/C++: languages for big data systems

- Excel: the bread-and-butter tool
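
The four regular-expression patterns listed under Python can be checked with the built-in `re` module. This is a minimal sketch; the test strings and the helper name `full_matches` are made up for illustration.

```python
import re

# (pattern, strings that should match, strings that should not)
cases = [
    (r"[hc]at", ["hat", "cat"], ["bat"]),
    (r".at", ["hat", "bat", "9at"], ["at"]),
    (r"ab*c", ["ac", "abc", "abbbc"], ["adc"]),
    (r"ab+c", ["abc", "abbbc"], ["ac"]),
]


def full_matches(pattern, text):
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in cases:
    assert all(full_matches(pattern, s) for s in good)
    assert not any(full_matches(pattern, s) for s in bad)
    print(pattern, "matches", good, "but not", bad)
```

`re.fullmatch` is used so that each pattern has to cover the entire string; `re.search` would also accept the pattern anywhere inside a longer string.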

 

Notebook Environments

- reproducible, tweakable, documented

- They make the data pipeline easy to maintain.

 

Data Collection

- Files (CSV, XML, JSON)

- Databases (SQL server)

- API: Application Programming Interface

- Web Scraping (HTML)

 

Data Formats

1. CSV (Comma-Separated Values)

2. XML (eXtensible Markup Language)

<student>
    <name> 000 </name>
    <age> 19 </age>
</student>

Fields are delimited by opening <> and closing </> tags, as shown.

3. JSON (JavaScript Object Notation)

{
    "a": {
        "name": "000",
        "age": 19
    }

... and so on: fields are delimited by {}, quotation marks, and similar punctuation.
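
To see how the two formats differ once they are loaded, here is a small sketch using only the standard library. The sample strings are closed-off versions of the fragments above (the JSON object is assumed to end after the first record), so treat them as illustrations rather than the original data.

```python
import json
import xml.etree.ElementTree as ET

# Assumed, self-contained versions of the XML and JSON fragments.
xml_text = "<student><name> 000 </name><age> 19 </age></student>"
json_text = '{"a": {"name": "000", "age": 19}}'

# XML: every value arrives as text, so strip whitespace and convert types by hand.
root = ET.fromstring(xml_text)
xml_record = {
    "name": root.findtext("name").strip(),
    "age": int(root.findtext("age")),
}

# JSON: strings and numbers are already typed after parsing.
json_record = json.loads(json_text)["a"]

print(xml_record)
print(json_record)
```

Both routes end up with the same dictionary, but the XML path needs explicit type conversion while JSON carries the number type in the format itself.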

 

Loading Data from a CSV File with pandas

import pandas
student_data = pandas.read_csv("student_table.csv")

 

Loading Data from a Relational DB

import pandas

# sql_string holds one of the SQL queries below.
# db_uri is a database URI (pandas needs SQLAlchemy for this) or an open connection.
student_data = pandas.read_sql_query(sql_string, db_uri)

-- Fetch every row and column
SELECT * FROM student_table

-- Fetch the id and name of students whose GPA is above 3.5
SELECT studentid, name FROM student_table WHERE gpa > 3.5

-- Join: output (student name, advisor name) pairs where the student's
-- advisor id matches an id in the advisor table
SELECT S.name, A.name
FROM student_table S, advisor_table A
WHERE S.advisorid = A.id

-- For each advisor, output the advisor id and the average GPA of their students
SELECT S.advisorid, avg(S.gpa)
FROM student_table S
GROUP BY S.advisorid

-- Same as above, restricted to students born in the 2000s
-- (dates must be quoted; an unquoted 2000/1/1 is evaluated as arithmetic)
SELECT S.advisorid, avg(S.gpa)
FROM student_table S
WHERE S.birthdate >= '2000-01-01' AND S.birthdate < '2010-01-01'
GROUP BY S.advisorid
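
The queries above can be tried without a database server. The sketch below uses the standard-library `sqlite3` module with an in-memory database instead of pandas, so it runs without extra dependencies; the three student rows and the ISO-formatted text birthdates are invented for the example. The same SQL string could be passed to `pandas.read_sql_query` together with the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student_table (
        studentid INTEGER PRIMARY KEY, name TEXT, gpa REAL,
        advisorid INTEGER, birthdate TEXT
    );
    INSERT INTO student_table VALUES
        (1, 'Kim',  3.8, 10, '2001-03-02'),
        (2, 'Lee',  3.2, 10, '2002-07-15'),
        (3, 'Park', 3.6, 20, '1999-11-30');
""")

# Average GPA per advisor, counting only students born in the 2000s.
sql_string = """
    SELECT S.advisorid, avg(S.gpa)
    FROM student_table S
    WHERE S.birthdate >= '2000-01-01' AND S.birthdate < '2010-01-01'
    GROUP BY S.advisorid
"""
rows = conn.execute(sql_string).fetchall()
print(rows)
conn.close()
```

Because the birthdates are stored as ISO `YYYY-MM-DD` strings, plain string comparison orders them correctly; the student born in 1999 is filtered out and only advisor 10 remains.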

 

Getting Data Through an API

- REST (Representational State Transfer) API

import json
import requests

url = 'https://...format=json'

# Fetch the response body as text, then parse the JSON into Python objects
data = requests.get(url).text
parsed_data = json.loads(data)

 

Web Scraping

<html>
    <head>
        <title> ... </title>
    </head>
    <body>
        <h1> hello world! </h1>
        <p> COSE471 data science </p>
    </body>
</html>
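
Scraping means pulling the text out of markup like this. Below is a minimal sketch with the standard-library `html.parser`; the class name `TextByTag` is hypothetical, and real scrapers usually rely on a dedicated library instead.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect the text found inside a chosen set of tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a wanted tag.
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))


page = """
<html>
    <head><title> ... </title></head>
    <body>
        <h1> hello world! </h1>
        <p> COSE471 data science </p>
    </body>
</html>
"""

parser = TextByTag(["h1", "p"])
parser.feed(page)
print(parser.found)
```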

 

Sources of Data

1. Proprietary Data Sources

- Internal data that companies collect

- May be opened up in limited form through an API

- Why releasing the full data externally is hard: business reasons (it could help competitors) and privacy issues

- Example: the 2006 AOL search log release

 

2. Government Data Sources

- Open data published by cities, national governments, and similar bodies

- Freedom of Information Act: lets you request that data be released

- Privacy must be protected when the data is released

 

3. Academic Data Sets

- Data released alongside publications such as research papers

 

4. Web Search/Scraping

- Scraping: extracting data or text from web pages

- Before writing a scraper, check whether an API already exists or whether someone has already written one

 

5. Available Data Sources

- Bulk downloads: Wikipedia, IMDB, Million Song Dataset

- API access: The New York Times, Twitter, etc.

 

6. Sensor Data Logging

- Image and video data: observing the weather from Flickr images

- Detecting earthquakes with phone accelerometers

- Observing traffic flow with taxi GPS data

- Build logging systems: storage is cheap

 

7. Crowdsourcing

- Data built by many contributors, such as Wikipedia, Freebase, and IMDB

- On platforms such as Amazon Mechanical Turk and CrowdFlower, you can pay to have a large number of people gather data

 

8. Sweat Equity

- Historical records and similar material are stored on paper, as PDFs, and so on, so they need to be converted first

 

Cleaning Data: Garbage In, Garbage Out

- Issues that come up while cleaning data

* Distinguishing artifacts from errors

* Data compatibility, unification

* Handling missing values

* Estimating unobserved (zero) values

* Outlier detection

 

Errors vs. Artifacts

- error: information that is fundamentally lost during data collection

- artifact: a systematic problem introduced by data processing

- sniff test: examining the product up close to catch a whiff of something wrong

 

Example: Number of Scientists Publishing Their First Paper, by Year

- Using PubMed, look up the year of first publication for the 100,000 most frequently cited scientists

- Expected distribution: a spike at 1960, a roughly constant level in the middle, and a decline toward 0 as the year approaches 2023

(Reason: data collection started in 1960, so earlier papers may be recorded as 1960; and the closer to the present, the more authors have short careers and are unlikely to make the ranking)

- Actual distribution (left)

Data before 1965 was entered by hand, so some of it may be missing

After 2000 the way author names were recorded changed, so different authors could be treated as the same author, which introduced errors

 

Data Compatibility

1. Unit Conversion

- In 1999 NASA's Mars Climate Orbiter was lost

- The cause was a mismatch in units

- Z-scores are useful here because they are unitless
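
A Z-score is Z = (x - mean) / standard deviation, so any constant unit factor cancels out. The sketch below checks this on made-up heights recorded in centimeters and in inches; the data and function name are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]  # same people, measured in inches

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)

# The two columns agree up to floating-point rounding.
for a, b in zip(z_cm, z_in):
    print(round(a, 6), round(b, 6))
```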

 

2. Number / Character Representation

- 1996 Ariane 5 rocket explosion: caused by converting a 64-bit float to a 16-bit integer

- Avoid integer approximations of real-valued quantities

- Measurements should be decimal numbers; counts should be integers

- Record fractional quantities as decimal values
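
To get a feel for why squeezing a large float into 16 bits is dangerous, here is a small sketch. It is not a reconstruction of the Ariane 5 software; it only shows what an unchecked cast does to a value outside the signed 16-bit range, next to a checked alternative. The function names and the sample value are hypothetical.

```python
import ctypes

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1  # -32768 .. 32767


def to_int16_unchecked(x):
    """Truncate to an integer and keep only the low 16 bits, as a raw cast would."""
    return ctypes.c_int16(int(x)).value


def to_int16_checked(x):
    """Refuse values that a signed 16-bit integer cannot hold."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return n


velocity = 40000.75  # made-up reading, larger than INT16_MAX
print(to_int16_unchecked(velocity))  # wraps around to a negative number

try:
    to_int16_checked(velocity)
except OverflowError as err:
    print("rejected:", err)
```

The unchecked version silently produces a wrong value with the wrong sign, and the fractional part is lost in both versions, which is the reason to keep measurements as decimal numbers.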

 

3. Character Representations

- Pay attention to how text data is encoded

- UTF-8: multibyte encoding for all Unicode characters
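
The sketch below shows what "multibyte" means in practice and what happens when UTF-8 bytes are decoded with the wrong codec. Latin-1 is chosen as the wrong codec only because it accepts every byte value; the sample word is an arbitrary example.

```python
text = "데이터"  # Korean for "data": 3 characters

encoded = text.encode("utf-8")
print(len(text), len(encoded))  # characters vs. bytes

# Decoding with the wrong codec does not fail here; it silently yields garbage.
garbled = encoded.decode("latin-1")
print(repr(garbled))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Each Hangul syllable takes three bytes in UTF-8, which is why a mis-decoded Korean file turns into runs of three unrelated symbols per character.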

 

4. Name Unification

- The same author can be recorded in many different ways, e.g. Steve, Steven, S., and so on

- Unify names into one form: convert case, drop middle names, etc.

- There is a tradeoff between false positives and false negatives
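
The tradeoff can be made concrete with two normalization keys of different aggressiveness. This is only a sketch: the surname, the sample names, and both key functions are hypothetical, and a real pipeline would need to handle empty or single-word names.

```python
import re


def simple_key(name):
    """Conservative key: lowercase and collapse whitespace only."""
    return " ".join(name.lower().split())


def aggressive_key(name):
    """Aggressive key: lowercase, drop punctuation, keep first initial + last name."""
    parts = re.sub(r"[^\w\s]", "", name.lower()).split()
    return f"{parts[0][0]} {parts[-1]}"


names = ["Steve Smith", "Steven Smith", "S. Smith", "STEVEN  SMITH", "Sarah Smith"]

simple = {simple_key(n) for n in names}
aggressive = {aggressive_key(n) for n in names}

print(len(simple), sorted(simple))
print(len(aggressive), sorted(aggressive))
```

The conservative key leaves "Steve", "Steven", and "S." as separate people (false negatives), while the aggressive key merges all of them but also swallows "Sarah Smith" (a false positive).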

 

5. Time / Date Unification

- Unify dates and times: use UTC or a similar standard
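
As a small illustration, two timestamps logged in different local time zones can be compared only after both are converted to UTC. The timestamps are invented, and fixed UTC offsets are used as a simplifying assumption (no daylight-saving handling).

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9), "KST")   # Korea Standard Time, UTC+9
EST = timezone(timedelta(hours=-5), "EST")  # US Eastern Standard Time, UTC-5

seoul_log = datetime(2023, 3, 2, 9, 0, tzinfo=KST)
new_york_log = datetime(2023, 3, 1, 19, 0, tzinfo=EST)

seoul_utc = seoul_log.astimezone(timezone.utc)
new_york_utc = new_york_log.astimezone(timezone.utc)

print(seoul_utc.isoformat())
print(new_york_utc.isoformat())
```

The local clock readings differ by a calendar day, yet both describe the same instant once expressed in UTC.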

 

6. Financial Unification

- Currency conversion: use exchange rates

- Use returns / percentage changes rather than absolute changes

- Correct stock prices for splits and dividends

- Time value of money: correct for inflation to make long-term comparisons fair
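
The sketch below shows why raw prices mislead around a stock split and how percentage returns behave once the series is adjusted. The prices, the split date, and the function names are hypothetical; dividends and inflation would need similar corrections.

```python
def percentage_returns(prices):
    """Return period-over-period returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def adjust_for_split(prices, split_index, ratio):
    """Divide prices before a split by the split ratio so the series is comparable."""
    return [p / ratio if i < split_index else p for i, p in enumerate(prices)]


# Hypothetical prices: a 2-for-1 split takes effect at index 2.
raw = [100.0, 110.0, 56.0, 58.8]
adjusted = adjust_for_split(raw, split_index=2, ratio=2)

raw_returns = percentage_returns(raw)
adjusted_returns = percentage_returns(adjusted)

print([round(r, 4) for r in raw_returns])
print([round(r, 4) for r in adjusted_returns])
```

Without the adjustment the split looks like a crash of roughly 49%; after it, the same step is a gain of under 2%.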

 

Handling Missing Values

- Setting a value to 0 just because it is unknown is wrong.

- Rather than leaving missing values blank, estimate or impute them.

 

Imputation methods

- Heuristic-based imputation: e.g. predict a date of death from average life expectancy

- Mean value imputation: replace with the mean

- Random value imputation: replace with a random value -> lessens the impact of imputation on statistical evaluation

- Imputation by nearest neighbor: replace using the closest record

- Imputation by interpolation: replace using linear regression -> works when records do not have many empty fields
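
Three of these strategies are sketched below on a made-up column where `None` marks a missing value. Note the swap in the last function: instead of fitting a regression across fields, it fills gaps by simple linear interpolation between the nearest known neighbors in the same column, which is the easiest variant to show in a few lines.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_random(values, rng):
    """Replace each None with a randomly chosen observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def impute_linear(values):
    """Fill interior gaps by linear interpolation between known neighbors."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


ages = [20.0, None, 30.0, None, None, 60.0]
print(impute_mean(ages))
print(impute_random(ages, random.Random(0)))
print(impute_linear(ages))
```

Mean imputation gives every gap the same value, random imputation preserves the spread of the observed values, and interpolation follows the local trend.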

 

Outlier Detection

- Check the maximum and minimum values carefully

- Normally distributed data should not have large outliers (points many sigma from the mean, "k sigma from the mean")

- Do not just delete an outlier; work out why it occurred and fix the cause

- Visualization makes outliers easy to spot (only feasible in low dimensions)

- After clustering, points that lie far from their cluster center are outlier candidates
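
The "k sigma from the mean" rule can be sketched in a few lines. The readings and the default k = 3 are assumptions for illustration; note that on very small samples a single extreme value inflates the standard deviation so much that it may not exceed the threshold.

```python
from statistics import mean, pstdev


def k_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]


# Twenty ordinary readings plus one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11] * 2 + [95]

print(k_sigma_outliers(readings))
```

The flagged value is a candidate for inspection, not automatic deletion: the next step is to find out why it occurred.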

 

Delete Outliers Prior to Fitting?

- Removing outliers before fitting can produce a better model.

ex) when the outliers were caused by measurement error

- It can also produce a worse model.

ex) when you delete every point just because a simple model does not explain it
