There are several ways for one dataset to presentation
Tidy data, the "Standard way to organize data values within a dataset"
3 principles:
1) Columns represent seperate variables
2) Rows represent individual observations
3) Observational units form tables
*note that 'Better for reporting' does not always mean 'Better for analysis.'
and Tidy data makes fixing common data problems (changing value, adding observation, ...) easier
- Converting to tidy data
Problem A) Columns containing values, instead of variables (violating principle #1)
Solution A) Melt (turning columns into rows)
1 2 3 4 5 6 7 8 9 10 11 | import pandas as pd pd.melt(frame = df, id_vars = 'name', value_vars = ['treatment a', 'treatment b'], var_name = 'treatment', value_name = 'result') #frame: dataframe to melt #id_vars: columns to fix #value_vars: columns to melt, if not specified, else of id_vars #var_name: rename the variable of melted columns #value_name: rename the value of melted columns | cs |
Problem B) one observation is divided to number of rows & different variables are merged into one column
(or to reshape df into more report-friendly)
Solution B) Pivot (opposite of melting), turns unique values into seperate columns
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | import pandas as pd weather_tidy = weather.pivot(index = 'date, columns = 'element'. values = 'value') #index: columns to be fixed #columns: columns to pivot into seperate columns #values: value to fill in the new columns' values """ But, if there are duplicate values for same index-column pair, Value error would occur. -> Pivot table is the solution """ weather2_tidy = weather.pivot_table(values='value', index = 'date', columns = 'element', aggfunc = np.mean) #np.mean orders how to handle the duplicate values #But this will give a hierarchical index (or Multiindex) # by using '.reset_index()', it goes back to the original dataframe format weather2_tidy.reset_index() | cs |
Beyond melt and pivot
Problem C) Columns contain multiple bits of info (e.g. variables as 'm014', 'm1524' as merged form of sex and age)
Solution C) Parsing the string
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | #df name is 'tb', containing variables that sex and age are merged #First, melt the variable in order to parse it pd.melt(frame = tb, id_vaes = ['country', 'year']) #Second, make new columns for parsed variables tb_melt['sex'] = tb_melt.variable.str[0] tb_melt['age_group'] = tb_melt.variable.str[1:] #if the column name cannot be parsed by index or position, #use .split() method (for string, built-in) ebola_melt['str_split'] = ebola_melt.type_country.str.split('_') #values of this column are the lists containing splitted strs #so, distribute each splitted str into seperate column ebola_melt['type'] = ebola_melt.str_split.str[0] ebola_melt['country'] = ebola_melt.str_split.str[1] | cs |
'어쩌다 회사 > Data Science' 카테고리의 다른 글
7.1. Exploring your data (DataCamp Data Scientist Career Track) (0) | 2018.12.13 |
---|