Python Beginners(9) -- Clean Data

Read several csv into a dictionary

import pandas as pd
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data = {}
for item in data_files:
    path = 'schools/'+item
    keyname = item.split(".")[0]  # get the filename before '.'
    data[keyname] = pd.read_csv(path)

Good resource for encoding
reference
Read files with other encoding

all_survey = pd.read_csv("schools/survey_all.txt", delimiter = "\t", encoding = "windows-1252")
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter = "\t", encoding = "windows-1252")
survey = pd.concat([all_survey, d75_survey], axis = 0)  # new rows = row_x + row_y
print (survey.head(5))

Change column names and select a few columns from original

survey["DBN"] = survey["dbn"]
columns_to_keep = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:, columns_to_keep]  # select a few columns
data["survey"] = survey

**apply function & lambda **
Create a new column called padded_csd, to turn original csd like '1' to '01', '3' to '03' while keep it as it is when len(csd) > 1.

data["class_size"]['padded_csd'] = data["class_size"]['CSD'].apply(lambda s: str(s).zfill(2)) #could use lambda here

data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Another way to do lambda -- apply function

def padding(num):
    a = str(num)
    if len(a) == 1:
        return a.zfill(2)
    else:
        return a
data["class_size"]['padded_csd'] = data["class_size"]['CSD'].apply(padding)

pd.to_numeric()

df["new_score"] = pd.to_numeric(df["score"], errors = "coerce")
df.loc[5, ["new_score"]] # print the first 5 rows

Group by

import numpy
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
print(data["class_size"].head())

Select rows based on column values

temp = data["graduation"]
temp = temp.loc[(temp["Cohort"]== '2006') & (temp["Demographic"] == "Total Cohort")]
data["graduation"] = temp
print (data["graduation"].head(5))

merge

combined = combined.merge(data["class_size"], on = "DBN", how = "inner")

fillna

means = combined.mean()
combined = combined.fillna(means)
combined = combined.fillna(value = 0)
print (combined.head(5))

pd.corr()
Quickly to find the correlations between columns
see doc

correlations = combined.corr() # the r value
print (correlations["sat_score"])
# draw a bar chart for corr between sat_scores and other survey fields
combined.corr()["sat_score"][survey_fields].plot.bar()

df.plot

import matplotlib.pyplot as plt
combined.plot.scatter(x="total_enrollment", y="sat_score")

draw school locations with color indicates the values of another column

from mpl_toolkits.basemap import Basemap
m = Basemap(
    projection='merc', 
    llcrnrlat=40.496044, 
    urcrnrlat=40.915256, 
    llcrnrlon=-74.255735, 
    urcrnrlon=-73.700272,
    resolution='i'
)
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)

longitudes = combined["lon"].tolist()
latitudes = combined["lat"].tolist()
# show value of 'ell_percent' column
m.scatter(longitudes, latitudes, s=20, zorder=2, latlon=True,
         c = combined["ell_percent"], cmap= "summer")
plt.show()

Apply Function
for avenger data practice

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
 
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)

OR could do

death_col = list()

death_col += ['Death' + str(i) for i in range(1,6)]

true_avengers['Deaths'] = (true_avengers[death_col] == 'YES').sum(axis = 1)

check if the column value is correct

joined_accuracy_count  = int()
correct_joined_years = true_avengers[true_avengers['Years since joining'] == (2015 - true_avengers['Year'])]
joined_accuracy_count = len(correct_joined_years)
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,937评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,503评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,712评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,668评论 1 276
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,677评论 5 366
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,601评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,975评论 3 396
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,637评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,881评论 1 298
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,621评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,710评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,387评论 4 319
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,971评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,947评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,189评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 44,805评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,449评论 2 342

推荐阅读更多精彩内容

  • ①喂宝宝要循序渐进,循循善诱 ②避免让宝宝一边吃饭,一边看电视 ③零食是造成宝宝挑食的一大“罪魁” ④用餐时不要让...
    深优U妈咪阅读 150评论 0 0
  • 醒来的时候隐约感觉到外面光线的强烈,朦胧中的我错觉的认为自己睡了一个很舒服的懒觉。按剧情我应该是慵懒地爬起来,揉一...
    弓谷所長阅读 672评论 0 5
  • 看过泰国广告的人都知道,它有一个共性就是会讲故事,要么让你潸然泪下,要么让你忍俊不禁。相比近两年国内各路小花小草代...
    苏穆凉阅读 974评论 1 1
  • 误打误撞进来,谁来告诉我,简书是啥?是心灵的自言自语?
    游子在简书阅读 300评论 2 3