Data preparation in Python consists of four parts:


1) Exploration

2) Tidying (transforming)

3) Combining

4) Cleaning


Here, I reorganize and practice what I learned from DataCamp course #7.1: data exploration.


Exploration is needed for diagnosing several problems in my dataset before getting into analysis.


Common data problems include*:


1) Inconsistent column names

2) Missing data

3) Outliers

4) Duplicate rows

5) Untidy data

6) Columns whose contents or types need processing


*Well, these are not necessarily problems in themselves, but in general one would want to check for them before analysis.
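
As a rough illustration of the list above, here is a minimal sketch that checks a tiny made-up DataFrame for some of these issues (column names, missing data, duplicate rows, column types). The data and every column name in it are assumptions for the example, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "Country ": ["A", "B", "B", "C"],              # trailing space in the name
    "population": [1.0e6, 2.5e6, 2.5e6, np.nan],   # one missing value
    "fertility": ["1.8", "2.1", "2.1", "3.0"],     # numbers stored as text
})

# 1) Inconsistent column names: names that change when whitespace is stripped
untrimmed = [c for c in df.columns if c != c.strip()]

# 2) Missing data: number of missing values in each column
missing_per_column = df.isna().sum()

# 4) Duplicate rows: rows identical to an earlier row
n_duplicates = int(df.duplicated().sum())

# 6) Column types: columns that are not stored as numbers
non_numeric = [c for c in df.columns
               if not pd.api.types.is_numeric_dtype(df[c])]

print(untrimmed)
print(missing_per_column)
print(n_duplicates)
print(non_numeric)
```

Outliers and untidy layout are harder to catch with a one-liner, so they are left to the inspection and plotting steps.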


The first step of exploration is to inspect the data visually, using the methods and attributes below:


import pandas as pd
 
df = pd.read_csv('literary_birth_rate.csv')
 
df.head()  # first five rows
df.tail()  # last five rows

df.columns  # Index of column names

df.shape  # (number of rows, number of columns)

df.info()  # per-column characteristics (non-null count, dtype)

Second, if any column/variable holds categorical values, one would like to count the frequency of each value, while summary statistics should be reported for columns of numerical values:

df.population.value_counts(dropna=False)
df['continent'].value_counts(dropna=False)
# if there are too many distinct values, append .head() or .tail()
# the column named 'population' can also be accessed as df.population
# dropna=False also counts missing values (NaN)
# the result is a Series of frequencies, shown with its dtype

df.describe()  # returns results only for numeric columns
# count, mean, std, min, quartiles, max
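
To see what `dropna=False` changes, here is a small sketch on made-up data (the values are my own assumption, not the course dataset). It compares the default counts with the counts that include missing values, and shows that `describe()` only counts the non-missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "continent": ["ASI", "ASI", "EUR", None, "AF"],
    "population": [10.0, 20.0, 30.0, 40.0, np.nan],
})

# By default, missing values are left out of the counts...
counts_default = df["continent"].value_counts()
# ...while dropna=False adds an entry for them.
counts_with_na = df["continent"].value_counts(dropna=False)

# describe() skips NaN, so 'count' is the number of non-missing values.
summary = df["population"].describe()

print(counts_default)
print(counts_with_na)
print(summary)
```

If the two counts differ in their totals, the difference is the number of missing entries in that column.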

Also, one could visualize the data to explore it more simply, especially for datasets that consist of many variables. In the lecture, for example, outliers (both genuine values and errors) are recognized at this step.

Histograms, box plots, and scatter plots (for the relationship between two numeric variables):

import pandas as pd
import matplotlib.pyplot as plt

# Histogram
df.population.plot(kind='hist')
# df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# rot: rotation angle of the x-axis tick labels; logx, logy: rescale the axes to log; x='~' etc. are for labeling
plt.show()

df[df.population > 1000000000]  # to see the outliers

# Boxplot
df.boxplot(column='population', by='continent')
# df.boxplot(column='initial_cost', rot=90, by='Borough')
# by= literally shows how the boxplot of initial_cost differs across the (categorical) values of Borough
plt.show()

# Scatterplot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
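
The outlier filter and the grouped boxplot can also be checked numerically. Below is a minimal sketch on made-up data (all names and numbers are assumptions for illustration): a boolean mask pulls out the rows above a threshold, and `groupby(...).describe()` gives the median and quartiles that a boxplot grouped by continent would draw.

```python
import pandas as pd

# Hypothetical toy data; names and numbers are made up for illustration.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["ASI", "ASI", "EUR", "EUR", "EUR"],
    "population": [1.3e9, 5.0e7, 8.0e7, 6.0e7, 1.0e7],
})

# Boolean mask: True for rows whose population exceeds one billion
is_large = df["population"] > 1_000_000_000
outliers = df[is_large]

# Per-continent summary: count, mean, std, min, quartiles, max
stats = df.groupby("continent")["population"].describe()

print(outliers)
print(stats[["25%", "50%", "75%"]])
```

Whether a row found this way is a genuine value or an error still has to be judged by looking at the data itself.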

