1. Data Description
UNIT: Remote unit that collects turnstile information. Can collect from multiple banks of turnstiles. Large subway stations can have more than one unit.
DATEn: Date in "yyyymmdd" (20110521) format.
TIMEn: Time in "hh:mm:ss" (08:05:02) format.
ENTRIESn: Raw reading of cumulative turnstile entries from the remote unit. Occasionally resets to 0.
EXITSn: Raw reading of cumulative turnstile exits from the remote unit. Occasionally resets to 0.
ENTRIESn_hourly: Difference in ENTRIES from the previous REGULAR reading.
EXITSn_hourly: Difference in EXITS from the previous REGULAR reading.
datetime: Date and time in "yyyymmdd hh:mm:ss" format (20110501 00:00:00). Can be parsed into a Pandas datetime object without modifications.
hour: Hour of the timestamp from TIMEn. Truncated rather than rounded.
day_week: Integer (0-6, Mon-Sun) corresponding to the day of the week.
weekday: Indicator (0 or 1) if the date is a weekday (Mon-Fri).
station: Subway station corresponding to the remote unit.
latitude: Latitude of the subway station corresponding to the remote unit.
longitude: Longitude of the subway station corresponding to the remote unit.
conds: Categorical variable of the weather conditions (Clear, Cloudy, etc.) for the time and location.
fog: Indicator (0 or 1) if there was fog at the time and location.
precipi: Precipitation in inches at the time and location.
pressurei: Barometric pressure in inches Hg at the time and location.
rain: Indicator (0 or 1) if rain occurred within the calendar day at the location.
tempi: Temperature in °F at the time and location.
wspdi: Wind speed in mph at the time and location.
meanprecipi: Daily average of precipi for the location.
meanpressurei: Daily average of pressurei for the location.
meantempi: Daily average of tempi for the location.
meanwspdi: Daily average of wspdi for the location.
weather_lat: Latitude of the weather station the weather data is from.
weather_lon: Longitude of the weather station the weather data is from.
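As the datetime description above promises, the column parses straight into Pandas datetimes — a minimal sketch (the two sample timestamps are made up; an explicit format string is passed for robustness):

```python
import pandas as pd

# Two made-up timestamps in the "yyyymmdd hh:mm:ss" format described above.
df = pd.DataFrame({'datetime': ['20110501 00:00:00', '20110521 08:05:02']})

# Parse into proper Pandas datetimes.
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H:%M:%S')

# Derived fields like hour and day_week fall out of the .dt accessor.
print(df['datetime'].dt.hour.tolist())       # truncated hour of each timestamp
print(df['datetime'].dt.dayofweek.tolist())  # 0-6, Mon-Sun
```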
Questions I thought of:
What variables are related to subway ridership?
-- Which stations have the most riders?
-- What are the ridership patterns over time?
-- How does the weather affect ridership?
What patterns can I find in the weather?
-- Is the temperature rising throughout the month?
-- How does weather vary across the city?
3. Two-Dimensional NumPy Arrays
Two-dimensional data:
- Python: list of lists
- NumPy: 2D array
- Pandas: DataFrame
2D arrays, as opposed to arrays of arrays:
- more memory efficient
- element access uses a slightly different syntax: a[1, 3] instead of a[1][3]
- mean(), std(), etc. operate on the entire array
import numpy as np
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
print(ridership)
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])
# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])
# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)
Write a function that:
- finds the station with the maximum riders on the first day
- returns the mean riders per day for that station, along with the overall mean
def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    max_station = ridership[0, :].argmax()
    mean_for_max = ridership[:, max_station].mean()
    return (overall_mean, mean_for_max)
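Applied to the ridership array above, the first-day maximum is at column index 3 (5 riders), so the function averages that station's column. The block below restates the data so it runs standalone:

```python
import numpy as np

ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def mean_riders_for_max_station(ridership):
    overall_mean = ridership.mean()
    max_station = ridership[0, :].argmax()           # busiest station on day 1
    mean_for_max = ridership[:, max_station].mean()  # its mean daily riders
    return (overall_mean, mean_for_max)

overall_mean, mean_for_max = mean_riders_for_max_station(ridership)
print(overall_mean, mean_for_max)  # 2342.6 3239.9
```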
4. NumPy Axes
Mean of each row:
ridership.mean(axis=1)
Mean of each column:
ridership.mean(axis=0)
import numpy as np
a = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(a.sum())
print(a.sum(axis=0))
print(a.sum(axis=1))
ridership = np.array([
[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]
])
def min_and_max_riders_per_day(ridership):
    mean_ridership_per_station = ridership.mean(axis=0)
    max_daily_ridership = mean_ridership_per_station.max()
    min_daily_ridership = mean_ridership_per_station.min()
    return (max_daily_ridership, min_daily_ridership)
5. NumPy and Pandas Data Types
Each column of a Pandas DataFrame can have a different data type.
dataframe.mean() computes the mean of each column.
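A quick illustration of both points — the column names here are made up, and numeric_only=True is passed so the string column is skipped rather than raising an error:

```python
import pandas as pd

df = pd.DataFrame({
    'count': [1, 2, 3],        # integer column
    'ratio': [0.5, 1.5, 2.5],  # float column
    'label': ['a', 'b', 'c'],  # string column
})

print(df.dtypes)                   # one dtype per column
print(df.mean(numeric_only=True))  # per-column means; strings excluded
```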
6. Accessing DataFrame Elements
.loc['index_name']  # access the row with that index label
.iloc[9]            # access a row by position
.iloc[1, 3]         # access a single element by position
df['column_name']   # access a column
df.values  # returns a 2D NumPy array of just the values, without column names or row index, so you can compute statistics over the entire DataFrame
import pandas as pd
# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
data=[[ 0, 0, 2, 5, 0],
[1478, 3877, 3674, 2328, 2539],
[1613, 4088, 3991, 6461, 2691],
[1560, 3392, 3826, 4787, 2613],
[1608, 4802, 3932, 4477, 2705],
[1576, 3933, 3909, 4979, 2685],
[ 95, 229, 255, 496, 201],
[ 2, 0, 1, 27, 0],
[1438, 3785, 3589, 4174, 2215],
[1342, 4043, 4009, 4665, 3033]],
index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
# Accessing elements
print(ridership_df.iloc[0])
print(ridership_df.loc['05-05-11'])
print(ridership_df['R003'])
print(ridership_df.iloc[1, 3])
print(ridership_df[['R003', 'R005']])
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df.sum())
print(df.sum(axis=1))
print(df.values.sum())
def mean_riders_for_max_station(ridership):
    overall_mean = ridership.values.mean()
    max_station = ridership.iloc[0].idxmax()  # returns the column name of the max value
    mean_for_max = ridership.loc[:, max_station].mean()
    return (overall_mean, mean_for_max)
7. Loading Data into a DataFrame
A DataFrame is a natural representation of a CSV file's contents, since each column can have its own data type.
df = pd.read_csv('filename.csv')
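A runnable sketch of the same idea — the CSV text is inlined via StringIO here so the example is self-contained (with a real file you would pass the path instead), and the columns are made-up stand-ins:

```python
import io
import pandas as pd

# Stand-in for a file on disk.
csv_text = "UNIT,DATEn,ENTRIESn\nR003,05-01-11,4388333\nR003,05-02-11,4388348\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # a dtype inferred per column: strings stay object, counts become int64
print(df)
```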
8. Calculating Correlation
By default, Pandas' std() function uses Bessel's correction to compute the standard deviation. Call std(ddof=0) to disable Bessel's correction.
When computing the Pearson correlation this way, use ddof=0.
NumPy's corrcoef() function can also be used to compute the Pearson product-moment correlation coefficient, often just called the "correlation coefficient."
import pandas as pd
def correlation(x, y):
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()
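A quick sanity check on toy data (the two series are made up): the hand-rolled formula should agree with NumPy's corrcoef().

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Standardize with the population std (ddof=0), then average the products.
    x_standard = (x - x.mean()) / x.std(ddof=0)
    y_standard = (y - y.mean()) / y.std(ddof=0)
    return (x_standard * y_standard).mean()

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 1, 4, 3, 5])

print(correlation(x, y))        # 0.8
print(np.corrcoef(x, y)[0, 1])  # same value
```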
9. Pandas Axis Names
axis=1 or axis='columns': operate across the columns (one result per row)
axis=0 or axis='index': operate down the rows (one result per column)
10. DataFrame Vectorized Operations
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
print(df1 + df2)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
index=['row1', 'row2', 'row3'])
df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
index=['row4', 'row3', 'row2'])
print(df1 + df2)
# Cumulative entries and exits for one station for a few hours.
entries_and_exits = pd.DataFrame({
'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
3144808, 3144895, 3144905, 3144941, 3145094],
'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
1088317, 1088328, 1088331, 1088420, 1088753]
})
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits (entries in the
    first column, exits in the second) and return a DataFrame with hourly
    entries and exits (entries in the first column, exits in the second).
    '''
    return entries_and_exits - entries_and_exits.shift(1)
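For the record, pandas' built-in diff() computes the same first difference, so the shift-and-subtract above can be replaced by a one-liner — checked here on a shortened copy of the cumulative counts:

```python
import pandas as pd

entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424],
    'EXITSn':   [1088151, 1088159, 1088177, 1088231],
})

# shift-and-subtract and diff() agree element for element
a = entries_and_exits - entries_and_exits.shift(1)
b = entries_and_exits.diff()
print(a.equals(b))  # True
print(b)
```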
11. DataFrame applymap()
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [10, 20, 30],
'c': [5, 10, 15]
})
def add_one(x):
    return x + 1
print(df.applymap(add_one))
grades_df = pd.DataFrame(
data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio',
'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
def convert_grade(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'F'
def convert_grades(grades):
    return grades.applymap(convert_grade)
12. DataFrame apply()
def standardize_column(column):
    return (column - column.mean()) / column.std(ddof=0)
def standardize(df):
    return df.apply(standardize_column)
The default standard deviation differs between NumPy's .std() and pandas' .std(). By default, NumPy computes the population standard deviation (ddof=0), while pandas computes the sample standard deviation (ddof=1). If we know all the scores, we have the whole population — so to standardize with pandas, we need to set ddof to 0.
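The two defaults can be seen side by side on a made-up list of scores — the population std (ddof=0) is always the smaller of the two:

```python
import numpy as np
import pandas as pd

scores = [65, 70, 75, 80, 85]

population_std = np.array(scores).std()  # NumPy default: ddof=0
sample_std = pd.Series(scores).std()     # pandas default: ddof=1

print(population_std, sample_std)  # the two defaults disagree
# Forcing ddof=0 in pandas reproduces the NumPy value.
print(pd.Series(scores).std(ddof=0))
```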
13. DataFrame apply(), Use Case 2
Reducing a column to a single value:
def column_second_largest(column):
    sorted_values = column.sort_values(ascending=False)
    return sorted_values.iloc[1]
def second_largest(df):
    '''
    Return the second-largest value of each column of the input DataFrame.
    '''
    return df.apply(column_second_largest)
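A tiny usage example (the frame below is made up): apply() collapses each column to a scalar, so the result is a Series indexed by column name.

```python
import pandas as pd

def column_second_largest(column):
    # Sort descending, then take the element in position 1.
    return column.sort_values(ascending=False).iloc[1]

df = pd.DataFrame({'a': [4, 5, 3, 1, 2], 'b': [20, 10, 40, 50, 30]})

result = df.apply(column_second_largest)
print(result)  # a: 4, b: 40
```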
14. Adding a DataFrame to a Series
import pandas as pd
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
0: [10, 20, 30, 40],
1: [50, 60, 70, 80],
2: [90, 100, 110, 120],
3: [130, 140, 150, 160]
})
# Adding a Series to a square DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
# Adding a Series to a one-row DataFrame
print(df + s)
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})
# Adding a Series to a one-column DataFrame
print(df + s)
# Adding when DataFrame column names match Series index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
# Adding when DataFrame column names don't match Series index
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
'a': [10, 20, 30, 40],
'b': [50, 60, 70, 80],
'c': [90, 100, 110, 120],
'd': [130, 140, 150, 160]
})
print(df + s)
df.add(s) is equivalent to df + s.
df.add(s, axis='columns') matches the Series index against the DataFrame's column names (the default, same as +).
df.add(s, axis='index') matches the Series index against the DataFrame's row index instead.
Adding a DataFrame and a Series adds the Series to every row of the DataFrame: the two are matched up by the Series' index values and the DataFrame's column names.
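The axis='index' variant is the one the + operator cannot express — a small sketch with made-up labels, where the Series index matches the row index rather than the column names:

```python
import pandas as pd

s = pd.Series([1, 2], index=['r1', 'r2'])
df = pd.DataFrame({'a': [10, 30], 'b': [20, 40]}, index=['r1', 'r2'])

# Default matching is against column names; nothing lines up, so all NaN.
print(df + s)

# axis='index' matches the Series against the row index: row r1 gets +1,
# row r2 gets +2.
print(df.add(s, axis='index'))
```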
15. Standardizing Each Column, Again
def standardize(df):
    '''
    Standardize each column.
    '''
    return (df - df.mean()) / df.std(ddof=0)
def standardize_rows(df):
    '''
    Standardize each row. Plain - and / would match the per-row statistics
    against the column names, so use sub()/div() with axis='index'.
    '''
    mean = df.mean(axis='columns')
    mean_difference = df.sub(mean, axis='index')
    std = df.std(axis='columns', ddof=0)
    return mean_difference.div(std, axis='index')
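Row-wise standardization is easy to get wrong, so it is worth verifying on toy data that every row really ends up with mean 0 and std 1. A self-contained check, using sub()/div() with axis='index' so the per-row statistics align on the row index:

```python
import numpy as np
import pandas as pd

def standardize_rows(df):
    mean = df.mean(axis='columns')
    std = df.std(axis='columns', ddof=0)
    # sub/div with axis='index' align the row statistics on the row index.
    return df.sub(mean, axis='index').div(std, axis='index')

df = pd.DataFrame({'a': [1.0, 4.0], 'b': [2.0, 6.0], 'c': [3.0, 8.0]})
out = standardize_rows(df)

print(out.mean(axis='columns'))          # ~0 for every row
print(out.std(axis='columns', ddof=0))   # ~1 for every row
```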
16. Pandas groupby()
import numpy as np
import pandas as pd
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
'value': values,
'even': values % 2 == 0,
'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(example_df)
grouped_data = example_df.groupby('even')
print(grouped_data.groups)
# Group by multiple columns
grouped_data = example_df.groupby(['even', 'above_three'])
print(grouped_data.groups)
# Get sum of each group
grouped_data = example_df.groupby('even')
print(grouped_data.sum())
grouped_data = example_df.groupby('even')
print(grouped_data.sum()['value'])
print(grouped_data['value'].sum())
17. Hourly Entries and Exits
def hourly(column):
    return column - column.shift(1)
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Take a DataFrame with cumulative entries and exits and return a
    DataFrame with hourly entries and exits. The hourly entries and exits
    are calculated separately for each station (the 'UNIT' column).
    '''
    return entries_and_exits.groupby('UNIT')[['ENTRIESn', 'EXITSn']].apply(hourly)
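pandas also ships a built-in groupby diff() that gives the same per-group first differences without a helper function — a sketch on made-up cumulative counts for two units. The first reading of each unit has no previous reading, so it becomes NaN, and differences never cross from one unit into the next:

```python
import pandas as pd

df = pd.DataFrame({
    'UNIT': ['R003', 'R003', 'R004', 'R004'],
    'ENTRIESn': [100, 130, 500, 580],
    'EXITSn': [50, 60, 200, 230],
})

# diff() within each group: NaN at each unit's first row, then differences.
out = df.groupby('UNIT')[['ENTRIESn', 'EXITSn']].diff()
print(out)
```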
18. Merging Pandas DataFrames
import pandas as pd
subway_df = pd.DataFrame({
'UNIT': ['R003', 'R003', 'R003', 'R003', 'R003', 'R004', 'R004', 'R004',
'R004', 'R004'],
'DATEn': ['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
'05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'ENTRIESn': [ 4388333, 4388348, 4389885, 4391507, 4393043, 14656120,
14656174, 14660126, 14664247, 14668301],
'EXITSn': [ 2911002, 2911036, 2912127, 2913223, 2914284, 14451774,
14451851, 14454734, 14457780, 14460818],
'latitude': [ 40.689945, 40.689945, 40.689945, 40.689945, 40.689945,
40.69132 , 40.69132 , 40.69132 , 40.69132 , 40.69132 ],
'longitude': [-73.872564, -73.872564, -73.872564, -73.872564, -73.872564,
-73.867135, -73.867135, -73.867135, -73.867135, -73.867135]
})
weather_df = pd.DataFrame({
'DATEn': ['05-01-11', '05-01-11', '05-02-11', '05-02-11', '05-03-11',
'05-03-11', '05-04-11', '05-04-11', '05-05-11', '05-05-11'],
'hour': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'latitude': [ 40.689945, 40.69132 , 40.689945, 40.69132 , 40.689945,
40.69132 , 40.689945, 40.69132 , 40.689945, 40.69132 ],
'longitude': [-73.872564, -73.867135, -73.872564, -73.867135, -73.872564,
-73.867135, -73.872564, -73.867135, -73.872564, -73.867135],
'pressurei': [ 30.24, 30.24, 30.32, 30.32, 30.14, 30.14, 29.98, 29.98,
30.01, 30.01],
'fog': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'rain': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'tempi': [ 52. , 52. , 48.9, 48.9, 54. , 54. , 57.2, 57.2, 48.9, 48.9],
'wspdi': [ 8.1, 8.1, 6.9, 6.9, 3.5, 3.5, 15. , 15. , 15. , 15. ]
})
def combine_dfs(subway_df, weather_df):
    '''
    Take two DataFrames, one with subway data and one with weather data,
    and return a single DataFrame with one row for each date, hour, and
    location. Only include times and locations that have both subway data
    and weather data available.
    '''
    return subway_df.merge(weather_df,
                           on=['DATEn', 'hour', 'latitude', 'longitude'],
                           how='inner')
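The effect of how='inner' is easiest to see on a tiny made-up pair of frames: rows whose key appears in only one side are dropped, while matching keys produce one combined row. (Note that merging on float columns like latitude/longitude relies on exact equality of the values in both frames.)

```python
import pandas as pd

left = pd.DataFrame({'DATEn': ['05-01-11', '05-02-11'], 'ENTRIESn': [100, 130]})
right = pd.DataFrame({'DATEn': ['05-02-11', '05-03-11'], 'tempi': [52.0, 48.9]})

# Only 05-02-11 appears in both frames, so the inner merge keeps one row.
out = left.merge(right, on='DATEn', how='inner')
print(out)
```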