我先评价一下这门课,总体来说,这门课难度的确不高,会一些简单的字符串处理和哈希表知识就可以上手了,实现的主要语言是python,也是比较流行的大数据处理语言。闲话不多说,开始入门。
进行数据处理编程的时候,我最大的感受就是一定要做好原始数据的预处理。一是可以减少之后面对具体问题的代码量,同时可以将数据有效化,筛去坏数据。因此我们要先做几件事情。
首先读取csv数据,需要引入unicodecsv的库,然后我们打开csv文件,rb表示文件可以被读取,with open的好处是可以自动关闭文件。unicodecsv.DictReader()函数使每一行都是一个字典,键是列名,值是相应列的值;并且函数的返回值为一个迭代器。迭代器转化为list进行储存从而方便处理,具体代码如下:
import unicodecsv
def read_csv(filename):
with open(filename, 'rb') as f:
reader = unicodecsv.DictReader(f)
return list(reader)
enrollments = read_csv("enrollments.csv")
daily_engagement = read_csv("daily_engagement.csv")
project_submissions = read_csv("project_submissions.csv")
print enrollments[0]
print daily_engagement[0]
print project_submissions[0]
此时我们已经将数据存入列表并输出每一个列表第一个值,如下:
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2014-11-10', u'account_key': u'448', u'cancel_date': u'2015-01-14', u'days_to_cancel': u'65'}
{u'lessons_completed': u'0.0', u'num_courses_visited': u'1.0', u'total_minutes_visited': u'11.6793745', u'projects_completed': u'0.0', u'acct': u'0', u'utc_date': u'2015-01-09'}
{u'lesson_key': u'3176718735', u'processing_state': u'EVALUATED', u'account_key': u'256', u'assigned_rating': u'UNGRADED', u'completion_date': u'2015-01-16', u'creation_date': u'2015-01-14'}
不难发现,无论键还是值都是字符串。因此我们要对数据进行预处理。
对日期对象进行预处理函数如下:
from datetime import datetime as dt
def parse_date(date):
if date =='':
return None
else:
return dt.strptime(data,"%Y-%m-%d")
对整型对象进行预处理函数如下(简化版):
def parse_maybe_int(i):
return None if string == '' else int(i)
我们发现在不同的csv文件中,account_key和acct都表示账号ID,为了数据的统一从而方便处理,我们将用acct表示账号ID的值转换为accout_key表示。代码如下:
def acct_to_account_key():
for engagement_record in daily_engagement:
engagement_record['account_key'] = engagement_record['acct']
del[engagement_record['acct']]
预处理结束后,我们就开始真正的数据处理了,我们先做第一件事情,就是统计一共注册过多少人(也就是有多少'account_key'),并验证注册和参与课程活动的人数是否相符合。
def get_unique_students(data):
unique_students = set()
for data_point in data:
unique_students.add(data_point['account_key'])
return unique_students
acct_to_account_key()
unique_students_enrollments = get_unique_students(enrollments)
unique_students_engagement = get_unique_students(daily_engagement)
print len(unique_students_enrollments)
print len(unique_students_engagement)
我们得到的输出如下:
1302
1237
那么问题来了,为什么两个数值不相等呢?代表我们的数据是存在缺陷的,所以我们要审核这两份数据。
我们审核的方法是某一个学员注册了账号但在参与数据中并没有体现,具体代码如下:
def find_problem():
for enrollment in enrollments:
if enrollment["account_key"] not in unique_students_engagement:
print enrollment
break
find_problem()
我们得到如下数据:
{u'status': u'canceled', u'is_udacity': u'False', u'is_canceled': u'True', u'join_date': u'2014-11-12', u'account_key': u'1219', u'cancel_date': u'2014-11-12', u'days_to_cancel': u'0'}
我们会发现这位学员注册日期与取消日期相等。经核实,当学员当日注销,则在参与数据中不体现。那么除了这部分数据无法匹配,是否还有其他数据有问题呢?我们继续审查。
def find_problem2():
num_problem_students = 0
for enrollment in enrollments:
if (enrollment['account_key'] not in unique_students_engagement \
and enrollment['join_date'] != enrollment['cancel_date']):
print enrollment
num_problem_students += 1
find_problem2()
我们得到如下的输出:
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2015-01-10', u'account_key': u'1304', u'cancel_date': u'2015-03-10', u'days_to_cancel': u'59'}
{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2015-03-10', u'account_key': u'1304', u'cancel_date': u'2015-06-17', u'days_to_cancel': u'99'}
{u'status': u'current', u'is_udacity': u'True', u'is_canceled': u'False', u'join_date': u'2015-02-25', u'account_key': u'1101', u'cancel_date': u'', u'days_to_cancel': u''}
这三个数据的问题在于它们同属于测试账号,也就是u'is_udacity': u'True',这些测试账号并不一定会在参与数据当中。此时,我们就已经找到了注册数据与参与数据不匹配的两个原因。
由于测试账号对于数据分析毫无意义,下面我们需要将这些测试账号从注册数据的列表中删除。
首先,我们创建测试账号的集合,函数如下:
def test_set():
data_test_set = set()
for enrollment in enrollments:
if enrollment["is_udacity"] == "True" :
data_test_set.add(enrollment['account_key'])
return data_test_set
enrollments_test_set = test_set()
然后,我们将账号数据列表和参与数据列表中的测试账号相关信息删去,函数如下:
def remove_udacity_accounts(data):
non_udacity_accounts = []
for data_point in data:
if data_point["account_key"] not in enrollments_test_set:
non_udacity_accounts.append(data_point)
return non_udacity_accounts
enrollments_remove_udacity_accounts = remove_udacity_accounts(enrollments)
engagement_remove_udacity_accounts = remove_udacity_accounts(daily_engagement)
print len(enrollments_remove_udacity_accounts)
print len(engagement_remove_udacity_accounts)
输出结果如下:
1622
135656
此时,我们已经将测试账号从我们需要处理的数据列表中删除。
现在我们开始做第二件事情:统计付费学员在其注册开始后的第一周时间内的参与数据。
首先,我们要创建一个函数,满足如下要求:
(1)返回一个哈希表,哈希表内的学生账号ID来自于enrollments_remove_udacity_accounts(也就是将测试账号删除的数据列表)
(2)学生要满足还没有注销账号,或者超过7天才注销账号。
(3)哈希表的键为学生ID,值为学生最近的一次注册日期。
函数如下:
def paid_student_dict():
paid_student = {}
for student in enrollments_remove_udacity_accounts:
if student["is_canceled"] != "True" or parse_maybe_int(student["days_to_cancel"]) > 7:
account_key = student["account_key"]
join_date = student["join_date"]
if account_key not in paid_student or join_date > paid_student[account_key]:
paid_student[account_key] = join_date
return paid_student
paid_student_in_enrollments = paid_student_dict()
print len(paid_student_in_enrollments)
输出结果如下:
995
我们现在所得到的哈希表也就是所有付费学生的ID以及其相应的注册日期。所以根据这个哈希表,我们处理以下三个数据,删去那些不符合条件的学生记录:
enrollments_remove_udacity_accounts
engagement_remove_udacity_accounts
project_remove_udacity_accounts
函数如下:
def remove_free_trial_cancels(data):
new_data = []
for data_point in data:
if data_point['account_key'] in paid_students:
new_data.append(data_point)
return new_data
paid_enrollments = remove_free_trial_cancels(enrollments_remove_udacity_accounts)
paid_engagement = remove_free_trial_cancels(engagement_remove_udacity_accounts)
paid_project = remove_free_trial_cancels(project_remove_udacity_accounts)
print len(paid_enrollments)
print len(paid_engagement)
print len(paid_project)
输出结果如下:
1293
134549
3618
这样,我们最新的三个数据内的学生均为付费学生。之后我们要做的就是获取这些学生第一周的参与数据。
def within_one_week(join_date, engagement_date):
time_delta = parse_date(engagement_date) - parse_date(join_date)
return time_delta.days < 7
def first_paid_week():
paid_engagement_in_first_week = []
for engagement_record in paid_engagement:
account_key = engagement_record['account_key']
join_date = paid_students[account_key]
engagement_record_date = engagement_record['utc_date']
if within_one_week(join_date, engagement_record_date):
paid_engagement_in_first_week.append(engagement_record)
return paid_engagement_in_first_week
paid_engagement_in_first_week = first_paid_week()
print len(paid_engagement_in_first_week)
输出结果如下:
21508
第三件事情就是得到付费学员第一周上课的平均时间。所以我们首先将付费学员与付费学员的参与课程记录建立相对应的哈希表。
def division_by_account():
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
account_key = engagement_record['account_key']
engagement_by_account[account_key].append(engagement_record)
return engagement_by_account
engagement_by_account = division_by_account()
然后统计每一名学员和与之对应的参与课堂时间,从而得到学员们参与课程的平均时间,这里用到了numpy函数库。
def minutes_by_account():
total_minutes_by_account = {}
for account_key, engagement_for_students in engagement_by_account.items():
total_minutes = 0
for engagement_record in engagement_for_students:
total_minutes += float(engagement_record['total_minutes_visited'])
total_minutes_by_account[account_key] = total_minutes
return total_minutes_by_account
total_minutes_by_account = minutes_by_account()
print 'Mean:', np.mean(total_minutes_by_account.values())
print 'Std:', np.std(total_minutes_by_account.values())
print 'Min:', np.min(total_minutes_by_account.values())
print 'Max:', np.max(total_minutes_by_account.values())
输出结果:
Mean: 647.590173826
Std: 1129.27121042
Min: 0.0
Max: 10568.1008673
发现Max值问题很大,因为时间的单位是min。所以我们要找到max对应的这位学员的参与课堂记录:
def find_max_minutes_student():
student_with_max_minutes = None
max_minutes = 0
for student, total_minutes in total_minutes_by_account.items():
if total_minutes > max_minutes:
max_minutes = total_minutes
student_with_max_minutes = student
return student_with_max_minutes
def max_minutes_student_enagement():
for engagement_record in paid_engagement_in_first_week:
if engagement_record["account_key"] == student_with_max_minutes:
print engagement_record
student_with_max_minutes = find_max_minutes_student()
#print total_minutes_by_account[student_with_max_minutes]
max_minutes_student_enagement()
输出结果(由于数据量大我只取首位两个):
{u'lessons_completed': u'0.0', u'num_courses_visited': u'1.0', u'total_minutes_visited': u'50.9938951667', u'projects_completed': u'0.0', 'account_key': u'108', u'utc_date': u'2015-01-07'}
{u'lessons_completed': u'0.0', u'num_courses_visited': u'0.0', u'total_minutes_visited': u'0.0', u'projects_completed': u'0.0', 'account_key': u'108', u'utc_date': u'2015-04-26'}
我们会发现时间跨度远超1周,我首先想到是不是注册时间有问题。因此我调用如下函数:
print paid_students[student_with_max_minutes]
得到的结果是:
2015-07-09
所以问题出在了我们within_one_week()函数不够完整,我们只记录了学员最后一次注册时间,那么原有函数将7天及所有7天以前的记录(学员之前注册过的信息)都考虑进去了,这是明显不对的,修正后的函数为:
def within_one_week(join_date, engagement_date):
time_delta = parse_date(engagement_date) - parse_date(join_date)
return time_delta.days < 7 and time_delta.days >=0
print 'Mean:', np.mean(total_minutes_by_account.values())
print 'Max:', np.max(total_minutes_by_account.values())
得到如下结果:
Mean: 306.708326753
Max: 3564.7332645
为了能够处理类似数据,我们将上面函数进行改写,新的函数如下:
def group_data(data,key_name):
grouped_data = defaultdict(list)
for data_point in data:
key = data_point[key_name]
grouped_data[key].append(data_point)
return grouped_data
def sum_grouped_items(grouped_data,field_name):
summed_data = {}
for key, data_points in grouped_data.items():
total = 0
for data_point in data_points:
total += data_point[field_name]
summed_data[key] = total
return summed_data
def describe_data(data):
print 'Mean:', np.mean(data)
print 'Std:', np.std(data)
print 'Min:', np.min(data)
print 'Max:', np.max(data)
这样我们用同一函数既能够获得关于学员上课总分钟数的统计也能获得学员上课数量的统计信息。
engagement_by_account = group_data(paid_engagement_in_first_week,'account_key')
for engagement_record in paid_engagement_in_first_week:
engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
total_minutes_by_account = sum_grouped_items(engagement_by_account,'total_minutes_visited')
#describe_data(total_minutes_by_account.values())
for engagement_record in paid_engagement_in_first_week:
engagement_record['lessons_completed'] = float(engagement_record['lessons_completed'])
lessons_completed_by_account = sum_grouped_items(engagement_by_account,'lessons_completed')
describe_data(lessons_completed_by_account.values())
输出结果结果为:
Mean: 1.63618090452
Std: 3.00256129983
Min: 0.0
Max: 36.0
得到付费学员第一周上课的平均时间和课程数后,我们可以利用类似的方法求得每周付费学员的上课天数。但是我们要在原数据中加入了一个参数'has_visited',也就是统计某天真正的参与课堂而不是只在网站闲逛。
def has_visited_update():
for engagement_record in paid_engagement:
if float(engagement_record['num_courses_visited']) > 0:
engagement_record['has_visited'] = 1
else:
engagement_record['has_visited'] = 0
has_visited_update()
days_visited_by_account = sum_grouped_items(engagement_by_account,'has_visited')
describe_data(days_visited_by_account.values())
下面开始划分及格学员,所谓的及格学员就是某课程pass的,反之是不及格学员,我们统计的课号为:['746169184', '3176718735']。我们会将及格学员的学号放在一个集合中。
def find_pass_subway():
subway_project_lesson_keys = ['746169184', '3176718735']
pass_subway_project = set()
for submission in paid_project:
project = submission['lesson_key']
rating = submission['assigned_rating']
if ((project in subway_project_lesson_keys) and(rating == 'PASSED' or rating == 'DISTINCTION')):
pass_subway_project.add(submission['account_key'])
return pass_subway_project
print len(find_pass_subway())
输出结果为:
647
我们将及格与不及格学员的第一周参与课堂记录分别存在两个列表中。
def devision_engagement():
passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
if engagement_record['account_key'] in pass_subway_project:
passing_engagement.append(engagement_record)
else:
non_passing_engagement.append(engagement_record)
return passing_engagement,non_passing_engagement
print len(passing_engagement)
print len(non_passing_engagement)
输出结果为:
4527
2392
然后我们比较两组学员的情况。比较的内容分别为(第一周内的):访问分钟数,完成课程数和参与课堂天数。
passing_engagement_by_account = group_data(passing_engagement,'account_key')
non_passing_engagement_by_account = group_data(non_passing_engagement,'account_key')
print 'non-passing students:'
non_passing_minutes = sum_grouped_items(non_passing_engagement_by_account,'total_minutes_visited')
describe_data(non_passing_minutes.values())
print 'passing students:'
passing_minutes = sum_grouped_items(passing_engagement_by_account,'total_minutes_visited')
describe_data(passing_minutes.values())
print 'non-passing students:'
non_passing_lessons = sum_grouped_items(non_passing_engagement_by_account,'lessons_completed')
describe_data(non_passing_lessons.values())
print 'passing students:'
passing_lessons = sum_grouped_items(passing_engagement_by_account,'lessons_completed')
describe_data(passing_lessons.values())
print 'non-passing students:'
non_passing_visits = sum_grouped_items(non_passing_engagement_by_account,'has_visited')
describe_data(non_passing_visits.values())
print 'passing students:'
passing_visits = sum_grouped_items(passing_engagement_by_account,'has_visited')
describe_data(passing_visits.values())
得到如下结果:
non-passing students:
Mean: 143.326474267
Std: 269.538619011
Min: 0.0
Max: 1768.52274933
passing students:
Mean: 394.586046484
Std: 448.499519327
Min: 0.0
Max: 3564.7332645
non-passing students:
Mean: 0.862068965517
Std: 2.54915994183
Min: 0.0
Max: 27.0
passing students:
Mean: 2.05255023184
Std: 3.14222705558
Min: 0.0
Max: 36.0
non-passing students:
Mean: 1.90517241379
Std: 1.90573144136
Min: 0
Max: 7
passing students:
Mean: 3.38485316847
Std: 2.25882147092
Min: 0
Max: 7
完