import numpy as np
from numpy import random
import pandas as pd
Indexing Practice
Indexing is about the most common operation there is. By default df['Date'] selects a column; for rows there is label indexing with ix, e.g. df.ix['2015-01-05'], as well as loc and iloc. (Note: .ix has since been deprecated and was removed in pandas 1.0; prefer .loc and .iloc in new code.) The differences are as follows.
loc is label-based indexing, so it basically looks up a row by its index label; iloc is integer position-based indexing; ix is a general method that first tries label-based access and, if that fails, falls back to integer-based access. .ix[] supports mixed integer and label-based access: it is primarily label-based, but will fall back to integer positional access unless the corresponding axis is of integer type.
df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=['100', '200', '300'])
df
      A   B
100   a  54
200   b  67
300   c  89
df['A'] # default: index by column name
100 a
200 b
300 c
Name: A, dtype: object
df.loc['100'] # label-based row indexing
A a
B 54
Name: 100, dtype: object
df.iloc[0] # more like numpy indexing: select the row by integer position
A a
B 54
Name: 100, dtype: object
df.ix['200'] # more general label-based indexing
A b
B 67
Name: 200, dtype: object
df.ix[1] # combines the functionality of loc and iloc
# note: when the index labels are themselves integers, this fails
A b
B 67
Name: 200, dtype: object
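A minimal sketch of that integer-label pitfall (hypothetical data; assumes an old pandas version that still ships .ix):
df_int = pd.DataFrame({'A': ['a', 'b', 'c']}, index=[100, 200, 300])
# df_int.ix[1] raises KeyError: with an integer-typed axis, .ix stays
# label-based, and there is no label 1. Being explicit avoids the ambiguity:
df_int.loc[100]   # label-based: the row labeled 100
df_int.iloc[0]    # position-based: the first row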
Text Practice
String operations on Series objects. Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
Series
s = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])
# built-in str methods can all be used directly
s.str.lower()
0 a
1 b
2 aaba
3 baca
4 NaN
5 caba
6 dog
dtype: object
# element-wise operation
s.str.len()
0 1.0
1 1.0
2 4.0
3 4.0
4 NaN
5 4.0
6 3.0
dtype: float64
Index
df.columns is an Index object, so this is a convenient way to tidy up column names.
idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
# strip() / lstrip() / rstrip()
idx.str.strip()
Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')
df_str = pd.DataFrame(np.random.random((3, 2)), columns=[' Column A ', ' Column B '],index=range(3))
df_str.columns
Index([u' Column A ', u' Column B '], dtype='object')
df_str.columns.str.strip() # 依然是Index 对象
Index([u'Column A', u'Column B'], dtype='object')
df_str.columns = df_str.columns.str.strip().str.lower().str.replace(' ','_')
df_str
   column_a  column_b
0  0.444976  0.441526
1  0.846533  0.549270
2  0.492742  0.363660
Replace
replace and findall accept regular expressions.
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^A|dog', 'XX', regex=True)  # regex=True makes the intent explicit (and is required for regex behavior on pandas >= 2.0)
0 XX
1 B
2 C
3 XXaba
4 Baca
5
6 NaN
7 CABA
8 XX
9 cat
dtype: object
Pattern Match (contains)
The distinction between match and contains is strictness: match relies on strict re.match, while contains relies on re.search.
pattern = r'[a-z][0-9]'
s = pd.Series(['1', 'b2', '3a', '3b', 'c2c'])
s.str.contains(pattern)
0 False
1 True
2 False
3 False
4 True
dtype: bool
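On the Series above the two happen to agree, so here is a minimal sketch (hypothetical data) where they differ:
s2 = pd.Series(['a1', 'xa1'])
s2.str.contains(pattern)  # [True, True]  -- the pattern may occur anywhere
s2.str.match(pattern)     # [True, False] -- the pattern must match at the start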
Summary of str methods
url = 'http://pandas.pydata.org/pandas-docs/stable/text.html'
df_str_methods = pd.read_html(url, header=0, attrs={'id': 'text-summary'})[0].set_index('Method')
df_str_methods.head()
Method    Description
cat()     Concatenate strings
split()   Split strings on delimiter
rsplit()  Split strings on delimiter working from the en...
get()     Index into each element (retrieve i-th element)
join()    Join strings in each element of the Series wit...
Tutorial
Creating a DataFrame
names = ['Bob','Jessica','Mary','John','Mel']
# Make a random list of 1,000 baby names using the five above
random.seed(500)
random_names = [names[random.randint(low=0,high=len(names))] for i in range(1000)]
random_names[:5]
['Mary', 'Jessica', 'Jessica', 'Bob', 'Jessica']
births = [random.randint(low=0,high=1000) for i in range(1000)]
births[:5]
[905, 52, 796, 217, 874]
# To merge these two lists together we will use the zip function.
BabyDataSet = list(zip(random_names, births))  # zip returns an iterator in Python 3, so materialize it as a list
BabyDataSet[:3]
[('Mary', 905), ('Jessica', 52), ('Jessica', 796)]
# the data argument accepts lists, dicts, and other types
df_baby = pd.DataFrame(data=BabyDataSet,columns=['Names','Births'])
df_baby.head()
     Names  Births
0     Mary     905
1  Jessica      52
2  Jessica     796
3      Bob     217
4  Jessica     874
df_baby['Names'].unique()
array(['Mary', 'Jessica', 'Bob', 'John', 'Mel'], dtype=object)
df_baby['Names'].describe()
count 1000
unique 5
top Bob
freq 206
Name: Names, dtype: object
Reading and Writing CSV
df_baby.to_csv('births1880.csv',index=False,header=False)
df_baby_read = pd.read_csv('births1880.csv',names=['Names','Births'])
df_baby_read.head()
     Names  Births
0     Mary     905
1  Jessica      52
2  Jessica     796
3      Bob     217
4  Jessica     874
Reading and Writing Excel
d = {'Channel':[1], 'Number':[255]}
df = pd.DataFrame(d)
df
df.to_excel('test1.xlsx',sheet_name='test1',index=False)
df.to_excel('test2.xlsx', sheet_name = 'test2', index = False)
df.to_excel('test3.xlsx', sheet_name = 'test3', index = False)
import os
file_names = []
for file in os.listdir('.'):
    if file.endswith('.xlsx'):
        file_names.append(file)
file_names
['test1.xlsx', 'test2.xlsx', 'test3.xlsx']
Read the three Excel files into a single DataFrame
def get_file(fname):
    df = pd.read_excel(fname, 0)
    df['File'] = fname
    return df.set_index(['File'])
df_list = [get_file(fname) for fname in file_names]
df_list[0]
            Channel  Number
File
test1.xlsx        1     255
big_df = pd.concat(df_list)
big_df
            Channel  Number
File
test1.xlsx        1     255
test2.xlsx        1     255
test3.xlsx        1     255
Stack and Unstack
stack pivots the columns into the innermost level of the index, yielding a long-format object; unstack is its inverse, pivoting an index level back out into columns.
# Our small data set
d = {'one': [1, 1], 'two': [2, 2]}
i = ['a', 'b']
# Create dataframe
df = pd.DataFrame(data = d, index = i)
df
stack = df.stack()
stack
a one 1
two 2
b one 1
two 2
dtype: int64
stack.index
MultiIndex(levels=[[u'a', u'b'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
unstack = df.unstack()
unstack
one a 1
b 1
two a 2
b 2
dtype: int64
unstack.index
MultiIndex(levels=[[u'one', u'two'], [u'a', u'b']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
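A quick sanity check of the inverse relationship (a minimal sketch):
df.stack().unstack().equals(df)  # True: unstacking undoes the stack here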
Groupby
d = {'one':[1,1,1,1,1],
'two':[2,2,2,2,2],
'letter':['a','a','b','b','c']}
# Create dataframe
df = pd.DataFrame(d)
df
  letter  one  two
0      a    1    2
1      a    1    2
2      b    1    2
3      b    1    2
4      c    1    2
one = df.groupby('letter')
one.sum()
        one  two
letter
a         2    4
b         2    4
c         1    2
letterone = df.groupby(['letter','one']).sum()
letterone
            two
letter one
a      1      4
b      1      4
c      1      2
letterone.index
MultiIndex(levels=[[u'a', u'b', u'c'], [1]],
labels=[[0, 1, 2], [0, 0, 0]],
names=[u'letter', u'one'])
Outliers
# Create a dataframe with dates as your index
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'FL', 'FL']
data = [1.0, 1.2, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=9, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States
# Create a second dataframe
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 6, 11]
idx2 = pd.date_range('1/1/2013', periods=9, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
# combine the two dataframes
df = pd.concat([df1, df2])
pd.concat([df.head(2), df.tail(2)])  # peek at both ends; DataFrame.append was removed in pandas 2.0
            Revenue State
2012-01-01      1.0    NY
2012-02-01      1.2    NY
2013-08-01      6.0    FL
2013-09-01     11.0    FL
Method 1
# a new copy
newdf = df.copy()
newdf['x-Mean'] = abs(newdf['Revenue']-newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf
            Revenue State    x-Mean  1.96*std Outlier
2012-01-01      1.0    NY  6.177778  5.666103    True
2012-02-01      1.2    NY  5.977778  5.666103    True
2012-03-01      4.0    NY  3.177778  5.666103   False
2012-04-01      5.0    NY  2.177778  5.666103   False
2012-05-01      6.0    FL  1.177778  5.666103   False
2012-06-01      7.0    FL  0.177778  5.666103   False
2012-07-01      8.0    GA  0.822222  5.666103   False
2012-08-01      9.0    FL  1.822222  5.666103   False
2012-09-01     10.0    FL  2.822222  5.666103   False
2013-01-01     10.0    NY  2.822222  5.666103   False
2013-02-01     10.0    NY  2.822222  5.666103   False
2013-03-01      9.0    NY  1.822222  5.666103   False
2013-04-01      9.0    NY  1.822222  5.666103   False
2013-05-01      8.0    FL  0.822222  5.666103   False
2013-06-01      8.0    FL  0.822222  5.666103   False
2013-07-01      7.0    GA  0.177778  5.666103   False
2013-08-01      6.0    FL  1.177778  5.666103   False
2013-09-01     11.0    FL  3.822222  5.666103   False
Method 2
Group by item: compute the Outlier flag per state.
The methods commonly applied after a groupby differ as follows (a sketch contrasting the three follows the list).
Agg: aggregate within each group, computing a summary statistic (or statistics) for each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
Transform: perform some group-specific computation within each group and return a like-indexed object. Some examples:
- Standardizing data (zscore) within a group
- Filling NAs within groups with a value derived from each group
Filter: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
- Discarding data that belongs to groups with only a few members
- Filtering out data based on the group sum or mean
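A minimal sketch contrasting the three (toy data and illustrative lambdas, not part of the original tutorial):
toy = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'c'],
                    'two': [2, 2, 2, 2, 2]})
g = toy.groupby('letter')
g['two'].agg('sum')                         # agg: one value per group -> a: 4, b: 4, c: 2
g['two'].transform(lambda x: x - x.mean())  # transform: like-indexed with toy; all 0.0 here
g.filter(lambda x: len(x) > 1)              # filter: keep groups with more than one row (drops 'c')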
newdf = df.copy()
State = newdf.groupby('State')
newdf['x-Mean'] = State.transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = State.transform(lambda x: 1.96*x.std())
newdf['Outlier'] = State.transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf
            Revenue State  x-Mean  1.96*std Outlier
2012-01-01      1.0    NY   5.150  7.513293   False
2012-02-01      1.2    NY   4.950  7.513293   False
2012-03-01      4.0    NY   2.150  7.513293   False
2012-04-01      5.0    NY   1.150  7.513293   False
2012-05-01      6.0    FL   2.125  3.543134   False
2012-06-01      7.0    FL   1.125  3.543134   False
2012-07-01      8.0    GA   0.500  1.385929   False
2012-08-01      9.0    FL   0.875  3.543134   False
2012-09-01     10.0    FL   1.875  3.543134   False
2013-01-01     10.0    NY   3.850  7.513293   False
2013-02-01     10.0    NY   3.850  7.513293   False
2013-03-01      9.0    NY   2.850  7.513293   False
2013-04-01      9.0    NY   2.850  7.513293   False
2013-05-01      8.0    FL   0.125  3.543134   False
2013-06-01      8.0    FL   0.125  3.543134   False
2013-07-01      7.0    GA   0.500  1.385929   False
2013-08-01      6.0    FL   2.125  3.543134   False
2013-09-01     11.0    FL   2.875  3.543134   False