Data preparation in Python consists of four parts:
1) Exploration
2) Tidying (transforming)
3) Combining
4) Cleaning
Here, I reorganize and practice what I learned from DataCamp course #7.1: data exploration.
Exploration is needed to diagnose problems in my dataset before getting into the analysis.
Common data problems include*:
1) Inconsistent column names
2) Missing data
3) Outliers
4) Duplicate rows
5) Untidy data
6) Columns and column types that need processing
*Well, these are not necessarily problems in themselves, but in general one would want to have them under control before analysis.
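Several of the items above can be checked with a line of pandas each. The following is only a minimal sketch on a made-up toy DataFrame (the column names and values are invented for illustration and are not from the course dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'Country Name': ['A', 'B', 'B', 'C'],
    'population': [1.0e6, 2.5e6, 2.5e6, np.nan],
    'fertility': ['1.8', '2.1', '2.1', '3.0'],  # numbers stored as strings
})

# 1) Inconsistent column names: flag names that are not lowercase_with_underscores
odd_names = [c for c in df.columns if c != c.strip().lower().replace(' ', '_')]

# 2) Missing data: number of missing values per column
missing_per_column = df.isna().sum()

# 4) Duplicate rows: rows identical to an earlier row
n_duplicates = int(df.duplicated().sum())

# 6) Column types: a non-numeric dtype on a column of numbers suggests a conversion is needed
dtypes = df.dtypes

print(odd_names)
print(missing_per_column)
print(n_duplicates)
print(dtypes)
```

Outliers and untidiness are harder to reduce to a single check, which is why the steps below rely on looking at the data.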
The first step in exploration is to inspect the data visually, using the methods and attributes below:
```python
import pandas as pd

df = pd.read_csv('literary_birth_rate.csv')

df.head()     # first five rows
df.tail()     # last five rows
df.columns    # Index of column names
df.shape      # DataFrame shape as (number of rows, number of columns)
df.info()     # per-column characteristics (non-null count, dtype)
```
Second, if any column (variable) holds categorical values, one would like to count the frequency of each value,
while summary statistics should be reported for numeric columns:
```python
df.population.value_counts(dropna=False)
df['continent'].value_counts(dropna=False)
# if there are too many categorical values, add '.head()' or '.tail()'
# the column named 'population' can also be accessed as df.population
# 'dropna=False' includes NA in the counts
# this returns a Series of frequencies, with its data type

df.describe()  # only returns results for numeric columns
# count, mean, std, min, quartiles, max
```
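To see what `dropna=False` changes, here is a small sketch on a made-up DataFrame (the values are invented for illustration, not taken from the course data). It compares the default counts with the counts that include missing values, and shows that `describe()` summarizes only the numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    'continent': ['ASI', 'EUR', 'ASI', np.nan, 'AF'],
    'population': [1.3e9, 6.0e7, 1.2e8, 5.0e6, np.nan],
})

# Default: missing values are left out of the counts
counts_default = df['continent'].value_counts()

# dropna=False: NaN gets its own row in the frequency Series
counts_with_na = df['continent'].value_counts(dropna=False)

# describe() summarizes only the numeric column; its 'count' row excludes NaN
summary = df.describe()

print(counts_with_na)
print(summary)
```

Comparing the totals of the two frequency Series is a quick way to tell how many values are missing in a categorical column.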
Also, visualizing the data makes exploration simpler, especially for datasets that consist of many variables. For example, in the lecture, outliers (both genuine values and errors) were recognized in this step.
Histograms, box plots, and scatter plots (the last for the relationship between two numeric variables):
```python
import pandas as pd
import matplotlib.pyplot as plt

# Histogram
df.population.plot(kind='hist')
# df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# rot: rotation angle of the x-axis tick labels; logx, logy: rescale to log axes; x='~' etc. are for labeling
plt.show()

df[df.population > 1000000000]  # to see the outliers

# Boxplot
df.boxplot(column='population', by='continent')
# df.boxplot(column='initial_cost', rot=90, by='Borough')
# by= literally shows how the boxplot of initial_cost varies with the (categorical) Borough value
plt.show()

# Scatterplot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```
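The hard-coded threshold above (`population > 1000000000`) comes from reading the plot. As a complement, and not something taken from the course itself, the points that a box plot draws individually can be listed numerically. This is a rough sketch assuming Matplotlib's default whisker length of 1.5 times the interquartile range, with made-up values:

```python
import pandas as pd

# Hypothetical population values, invented for illustration only
population = pd.Series([5.0e6, 8.0e6, 1.0e7, 1.2e7, 2.0e7, 1.3e9], name='population')

q1 = population.quantile(0.25)
q3 = population.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles fall outside the default whiskers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = population[(population < lower) | (population > upper)]

print(outliers)
```

Whether a flagged value is a genuine observation or an error still has to be judged by looking at the row itself.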