There are several ways to present the same dataset.

Tidy data is the "standard way to organize data values within a dataset."

3 principles:

1) Columns represent separate variables

2) Rows represent individual observations

3) Observational units form tables

*Note that 'better for reporting' does not always mean 'better for analysis,'
and tidy data makes fixing common data problems (changing a value, adding an observation, ...) easier.
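The three principles can be illustrated with a small, hypothetical example (the frame and column names below are made up for illustration): the same data stored wide, with treatment names as column headers, versus tidy, with one column per variable and one row per observation.

```python
import pandas as pd

# Untidy: 'treatment a' / 'treatment b' are values, not variables,
# yet they appear as column headers (violates principle #1)
untidy = pd.DataFrame({
    'name': ['Daniel', 'John'],
    'treatment a': [18, 12],
    'treatment b': [42, 31],
})

# Tidy: one 'treatment' variable column, one row per observation
tidy = untidy.melt(id_vars='name', var_name='treatment', value_name='result')
print(tidy)
```

In the tidy frame, each row is one (name, treatment) observation, which is exactly the shape that groupby, plotting, and modeling functions expect.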

- Converting to tidy data

Problem A) Columns containing values, instead of variables (violating principle #1)

Solution A) Melt (turning columns into rows)

import pandas as pd
 
pd.melt(frame = df, id_vars = 'name',
        value_vars = ['treatment a', 'treatment b'],
        var_name = 'treatment', value_name = 'result')
# frame: DataFrame to melt
# id_vars: columns to keep fixed (identifiers)
# value_vars: columns to melt; if not specified, all columns except id_vars
# var_name: name for the new variable column
# value_name: name for the new value column

Problem B) One observation is split across multiple rows & different variables are merged into one column
(or the DataFrame needs reshaping into a more report-friendly form)

Solution B) Pivot (the opposite of melting), which turns unique values into separate columns

import numpy as np
import pandas as pd
 
weather_tidy = weather.pivot(index = 'date',
                             columns = 'element',
                             values = 'value')
# index: column(s) to keep fixed
# columns: column whose unique values become separate columns
# values: column used to fill in the new columns' values
 
"""
But if there are duplicate values for the same index-column pair,
a ValueError will occur.
-> pivot_table is the solution
"""
 
weather2_tidy = weather.pivot_table(values = 'value',
                                    index = 'date',
                                    columns = 'element',
                                    aggfunc = np.mean)
# aggfunc tells how to aggregate the duplicate values
# But this gives a hierarchical index (a MultiIndex);
# '.reset_index()' returns it to the original flat-index format
 
weather2_tidy.reset_index()

Beyond melt and pivot

Problem C) Columns contain multiple bits of info (e.g. variables like 'm014' and 'm1524', which merge sex and age group)

Solution C) Parsing the string

# df name is 'tb', containing variables where sex and age are merged
 
# First, melt the variables in order to parse them
tb_melt = pd.melt(frame = tb, id_vars = ['country', 'year'])
 
# Second, make new columns for the parsed variables
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt['age_group'] = tb_melt.variable.str[1:]
 
# If the column name cannot be parsed by index or position,
# use the .str.split() method (built-in for strings)
 
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# the values of this column are lists containing the split strings,
# so distribute each split string into a separate column
ebola_melt['type'] = ebola_melt.str_split.str[0]
ebola_melt['country'] = ebola_melt.str_split.str[1]


Data preparation in Python consists of 4 parts:


1) Exploration

2) Tidying (transforming)

3) Combining

4) Cleaning


Here, I reorganize and practice what I learned from DataCamp course #7.1: data exploration.


Exploration is needed for diagnosing several problems in my dataset before getting into analysis.


Common data problems include*:


1) Inconsistent column names

2) Missing data

3) Outliers

4) Duplicate rows

5) Untidy data

6) Columns and their types needing processing


*Well, these are not necessarily problems in themselves, but in general, one would want to handle them before analysis.
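A quick sketch of how some of these problems show up in pandas (the toy frame below is invented for illustration; the column with a trailing space stands in for an inconsistent name):

```python
import pandas as pd

# Toy frame with several of the problems above baked in
df = pd.DataFrame({
    'Name ': ['a', 'b', 'b', None],     # 1) inconsistent column name, 2) missing data
    'value': [1.0, 2.0, 2.0, 1000.0],   # 3) 1000.0 looks like an outlier
})                                       # 4) rows 1 and 2 are duplicates

df.columns = df.columns.str.strip().str.lower()  # normalize column names
print(df.isnull().sum())                         # missing values per column
print(df.duplicated().sum())                     # number of duplicate rows
df = df.drop_duplicates()                        # remove exact duplicates
```

Each line here maps onto one item of the list above; the rest of this post covers untidiness (5) and type processing (6) in more detail.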


The first step of exploration is to inspect the data visually, using the methods and attributes below:


import pandas as pd
 
df = pd.read_csv('literary_birth_rate.csv')
 
df.head() # first five rows
df.tail() # last five rows
 
df.columns # Index of column names
 
df.shape # shape as (number of rows, number of columns)
 
df.info() # per-column characteristics (non-null count, dtype)

Second, if any column/variable consists of categorical values, one would like to count the frequency of each value,

while summary statistics should be reported for numeric columns:

df.population.value_counts(dropna = False)
df['continent'].value_counts(dropna = False)
# if there are too many categorical values, add '.head()' or '.tail()'
# the column named 'population' can be accessed with df.population
# 'dropna = False' also counts NA values
# this gives a frequency Series, with its data type
 
df.describe() # only returns results for numeric columns
# count, mean, std, min, quartiles, max

Also, one can visualize the data to explore it more simply, especially for datasets consisting of many variables. For example, in the lecture, recognizing outliers (both real values and errors) is done at this step.

Histograms, bar plots, boxplots, and scatter plots (for the relationship between two numeric variables):

import pandas as pd
import matplotlib.pyplot as plt
 
# Histogram
df.population.plot(kind='hist')
# df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# rot: rotation angle of the x-axis tick labels; logx/logy: rescale to a log axis
plt.show()
 
df[df.population > 1000000000] # to see the outliers
 
# Boxplot
df.boxplot(column = 'population', by = 'continent')
# df.boxplot(column='initial_cost', rot=90, by='Borough')
# by=: shows how the boxplot of initial_cost varies with the (categorical) values of Borough
plt.show()
 
# Scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

