There are several ways to present the same dataset.

Tidy data is the "standard way to organize data values within a dataset."

3 principles:

1) Columns represent separate variables

2) Rows represent individual observations

3) Observational units form tables

*Note that 'better for reporting' does not always mean 'better for analysis,'
and tidy data makes fixing common data problems (changing a value, adding an observation, ...) easier.
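The three principles can be illustrated with a small, hypothetical example (the frame and column names below are made up for illustration): the same data stored wide, with treatment names as column headers, versus tidy, with one column per variable and one row per observation.

```python
import pandas as pd

# Untidy: 'treatment a' / 'treatment b' are values, not variables,
# yet they appear as column headers (violates principle #1)
untidy = pd.DataFrame({
    'name': ['Daniel', 'John'],
    'treatment a': [18, 12],
    'treatment b': [42, 31],
})

# Tidy: one 'treatment' variable column, one row per observation
tidy = untidy.melt(id_vars='name', var_name='treatment', value_name='result')
print(tidy)
```

In the tidy frame, each row is one (name, treatment) observation, which is exactly the shape that groupby, plotting, and modeling functions expect.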

- Converting to tidy data

Problem A) Columns containing values, instead of variables (violating principle #1)

Solution A) Melt (turning columns into rows)

import pandas as pd
 
pd.melt(frame = df, id_vars = 'name',
        value_vars = ['treatment a', 'treatment b'],
        var_name = 'treatment', value_name = 'result')
# frame: DataFrame to melt
# id_vars: columns to keep fixed (identifiers)
# value_vars: columns to melt; if not specified, all columns except id_vars
# var_name: name for the new variable column
# value_name: name for the new value column

Problem B) One observation is split across multiple rows & different variables are merged into one column
(or the DataFrame needs reshaping into a more report-friendly form)

Solution B) Pivot (the opposite of melting), which turns unique values into separate columns

import numpy as np
import pandas as pd
 
weather_tidy = weather.pivot(index = 'date',
                             columns = 'element',
                             values = 'value')
# index: column(s) to keep fixed
# columns: column whose unique values become separate columns
# values: column used to fill in the new columns' values
 
"""
But if there are duplicate values for the same index-column pair,
a ValueError will occur.
-> pivot_table is the solution
"""
 
weather2_tidy = weather.pivot_table(values = 'value',
                                    index = 'date',
                                    columns = 'element',
                                    aggfunc = np.mean)
# aggfunc tells how to aggregate the duplicate values
# But this gives a hierarchical index (a MultiIndex);
# '.reset_index()' returns it to the original flat-index format
 
weather2_tidy.reset_index()

Beyond melt and pivot

Problem C) Columns contain multiple bits of info (e.g. variables like 'm014' and 'm1524', which merge sex and age group)

Solution C) Parsing the string

# df name is 'tb', containing variables where sex and age are merged
 
# First, melt the variables in order to parse them
tb_melt = pd.melt(frame = tb, id_vars = ['country', 'year'])
 
# Second, make new columns for the parsed variables
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt['age_group'] = tb_melt.variable.str[1:]
 
# If the column name cannot be parsed by index or position,
# use the .str.split() method (built-in for strings)
 
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# the values of this column are lists containing the split strings,
# so distribute each split string into a separate column
ebola_melt['type'] = ebola_melt.str_split.str[0]
ebola_melt['country'] = ebola_melt.str_split.str[1]


Data preparation in Python consists of 4 parts:


1) Exploration

2) Tidying (transforming)

3) Combining

4) Cleaning


Here, I reorganize and practice what I learned from DataCamp course #7.1: data exploration.


Exploration is needed for diagnosing several problems in my dataset before getting into analysis.


Common data problems include*:


1) Inconsistent column names

2) Missing data

3) Outliers

4) Duplicate rows

5) Untidy data

6) Columns and their types needing processing


*Well, these are not necessarily problems in themselves, but in general, one would want to handle them before analysis.
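A quick sketch of how some of these problems show up in pandas (the toy frame below is invented for illustration; the column with a trailing space stands in for an inconsistent name):

```python
import pandas as pd

# Toy frame with several of the problems above baked in
df = pd.DataFrame({
    'Name ': ['a', 'b', 'b', None],     # 1) inconsistent column name, 2) missing data
    'value': [1.0, 2.0, 2.0, 1000.0],   # 3) 1000.0 looks like an outlier
})                                       # 4) rows 1 and 2 are duplicates

df.columns = df.columns.str.strip().str.lower()  # normalize column names
print(df.isnull().sum())                         # missing values per column
print(df.duplicated().sum())                     # number of duplicate rows
df = df.drop_duplicates()                        # remove exact duplicates
```

Each line here maps onto one item of the list above; the rest of this post covers untidiness (5) and type processing (6) in more detail.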


The first step of exploration is to inspect the data visually, using the methods and attributes below:


import pandas as pd
 
df = pd.read_csv('literary_birth_rate.csv')
 
df.head() # first five rows
df.tail() # last five rows
 
df.columns # Index of column names
 
df.shape # shape as (number of rows, number of columns)
 
df.info() # per-column characteristics (non-null count, dtype)

Second, if any column/variable consists of categorical values, one would like to count the frequency of each value,

while summary statistics should be reported for numeric columns:

df.population.value_counts(dropna = False)
df['continent'].value_counts(dropna = False)
# if there are too many categorical values, add '.head()' or '.tail()'
# the column named 'population' can be accessed with df.population
# 'dropna = False' also counts NA values
# this gives a frequency Series, with its data type
 
df.describe() # only returns results for numeric columns
# count, mean, std, min, quartiles, max

Also, one can visualize the data to explore it more simply, especially for datasets consisting of many variables. For example, in the lecture, recognizing outliers (both real values and errors) is done at this step.

Histograms, bar plots, boxplots, and scatter plots (for the relationship between two numeric variables):

import pandas as pd
import matplotlib.pyplot as plt
 
# Histogram
df.population.plot(kind='hist')
# df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# rot: rotation angle of the x-axis tick labels; logx/logy: rescale to a log axis
plt.show()
 
df[df.population > 1000000000] # to see the outliers
 
# Boxplot
df.boxplot(column = 'population', by = 'continent')
# df.boxplot(column='initial_cost', rot=90, by='Borough')
# by=: shows how the boxplot of initial_cost varies with the (categorical) values of Borough
plt.show()
 
# Scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

