import numpy as np
from numpy import random
import pandas as pd
Indexing Practice
Indexing is about the most common operation there is. By default df['Date'] selects a column; for rows there is label indexing with ix, e.g. df.ix['2015-01-05'], as well as loc and iloc. (Note: .ix has since been deprecated and was removed in pandas 1.0; prefer .loc and .iloc in new code.) The differences are as follows.
loc is label-based indexing, so it basically looks up a row by its index label; iloc is integer position-based indexing; ix is a general method that first tries label-based access and, if that fails, falls back to integer-based access. .ix[] supports mixed integer and label-based access: it is primarily label-based, but will fall back to integer positional access unless the corresponding axis is of integer type.
df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=['100', '200', '300'])
df
      A   B
100   a  54
200   b  67
300   c  89
df['A'] # default: index by column name
100 a
200 b
300 c
Name: A, dtype: object
df.loc['100'] # label-based row indexing
A a
B 54
Name: 100, dtype: object
df.iloc[0] # more like numpy indexing: select the row by integer position
A a
B 54
Name: 100, dtype: object
df.ix['200'] # more general label-based indexing
A b
B 67
Name: 200, dtype: object
df.ix[1] # combines the functionality of loc and iloc
# note: when the index labels are themselves integers, this fails
A b
B 67
Name: 200, dtype: object
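A minimal sketch of that integer-label pitfall (hypothetical data; assumes an old pandas version that still ships .ix):
df_int = pd.DataFrame({'A': ['a', 'b', 'c']}, index=[100, 200, 300])
# df_int.ix[1] raises KeyError: with an integer-typed axis, .ix stays
# label-based, and there is no label 1. Being explicit avoids the ambiguity:
df_int.loc[100]   # label-based: the row labeled 100
df_int.iloc[0]    # position-based: the first row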
Text Practice
String operations on Series objects. Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
Series
s = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])
# built-in str methods can all be used directly
s.str.lower()
0 a
1 b
2 aaba
3 baca
4 NaN
5 caba
6 dog
dtype: object
# element-wise operation
s.str.len()
0 1.0
1 1.0
2 4.0
3 4.0
4 NaN
5 4.0
6 3.0
dtype: float64
Index
df.columns is an Index object, so this is a convenient way to tidy up column names.
idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
# strip() / lstrip() / rstrip()
idx.str.strip()
Index([u'jack', u'jill', u'jesse', u'frank'], dtype='object')
df_str = pd.DataFrame(np.random.random((3, 2)), columns=[' Column A ', ' Column B '],index=range(3))
df_str.columns
Index([u' Column A ', u' Column B '], dtype='object')
df_str.columns.str.strip() # 依然是Index 对象
Index([u'Column A', u'Column B'], dtype='object')
df_str.columns = df_str.columns.str.strip().str.lower().str.replace(' ','_')
df_str
   column_a  column_b
0  0.444976  0.441526
1  0.846533  0.549270
2  0.492742  0.363660
Replace
replace and findall accept regular expressions.
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'])
s3.str.replace('^A|dog', 'XX', regex=True)  # regex=True makes the intent explicit (and is required for regex behavior on pandas >= 2.0)
0 XX
1 B
2 C
3 XXaba
4 Baca
5
6 NaN
7 CABA
8 XX
9 cat
dtype: object
Pattern Match (contains)
The distinction between match and contains is strictness: match relies on strict re.match, while contains relies on re.search.
pattern = r'[a-z][0-9]'
s = pd.Series(['1', 'b2', '3a', '3b', 'c2c'])
s.str.contains(pattern)
0 False
1 True
2 False
3 False
4 True
dtype: bool
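On the Series above the two happen to agree, so here is a minimal sketch (hypothetical data) where they differ:
s2 = pd.Series(['a1', 'xa1'])
s2.str.contains(pattern)  # [True, True]  -- the pattern may occur anywhere
s2.str.match(pattern)     # [True, False] -- the pattern must match at the start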
Summary of str methods
url = 'http://pandas.pydata.org/pandas-docs/stable/text.html'
df_str_methods = pd.read_html(url, header=0, attrs={'id': 'text-summary'})[0].set_index('Method')
df_str_methods.head()
Method    Description
cat()     Concatenate strings
split()   Split strings on delimiter
rsplit()  Split strings on delimiter working from the en...
get()     Index into each element (retrieve i-th element)
join()    Join strings in each element of the Series wit...
Tutorial
Creating a DataFrame
names = ['Bob','Jessica','Mary','John','Mel']
# Make a random list of 1,000 baby names using the five above
random.seed(500)
random_names = [names[random.randint(low=0,high=len(names))] for i in range(1000)]
random_names[:5]
['Mary', 'Jessica', 'Jessica', 'Bob', 'Jessica']
births = [random.randint(low=0,high=1000) for i in range(1000)]
births[:5]
[905, 52, 796, 217, 874]
# To merge these two lists together we will use the zip function.
BabyDataSet = list(zip(random_names, births))  # zip returns an iterator in Python 3, so materialize it as a list
BabyDataSet[:3]
[('Mary', 905), ('Jessica', 52), ('Jessica', 796)]
# the data argument accepts lists, dicts, and other types
df_baby = pd.DataFrame(data=BabyDataSet,columns=['Names','Births'])
df_baby.head()
     Names  Births
0     Mary     905
1  Jessica      52
2  Jessica     796
3      Bob     217
4  Jessica     874
df_baby['Names'].unique()
array(['Mary', 'Jessica', 'Bob', 'John', 'Mel'], dtype=object)
df_baby['Names'].describe()
count 1000
unique 5
top Bob
freq 206
Name: Names, dtype: object
Reading and Writing CSV
df_baby.to_csv('births1880.csv',index=False,header=False)
df_baby_read = pd.read_csv('births1880.csv',names=['Names','Births'])
df_baby_read.head()
     Names  Births
0     Mary     905
1  Jessica      52
2  Jessica     796
3      Bob     217
4  Jessica     874
Reading and Writing Excel
d = {'Channel':[1], 'Number':[255]}
df = pd.DataFrame(d)
df
df.to_excel('test1.xlsx',sheet_name='test1',index=False)
df.to_excel('test2.xlsx', sheet_name = 'test2', index = False)
df.to_excel('test3.xlsx', sheet_name = 'test3', index = False)
import os
file_names = []
for file in os.listdir('.'):
    if file.endswith('.xlsx'):
        file_names.append(file)
file_names
['test1.xlsx', 'test2.xlsx', 'test3.xlsx']
Read the three Excel files into a single DataFrame
def get_file(fname):
    df = pd.read_excel(fname, 0)
    df['File'] = fname
    return df.set_index(['File'])
df_list = [get_file(fname) for fname in file_names]
df_list[0]
            Channel  Number
File
test1.xlsx        1     255
big_df = pd.concat(df_list)
big_df
            Channel  Number
File
test1.xlsx        1     255
test2.xlsx        1     255
test3.xlsx        1     255
Stack and Unstack
stack pivots the columns into the innermost level of the index, yielding a long-format object; unstack is its inverse, pivoting an index level back out into columns.
# Our small data set
d = {'one': [1, 1], 'two': [2, 2]}
i = ['a', 'b']
# Create dataframe
df = pd.DataFrame(data = d, index = i)
df
stack = df.stack()
stack
a one 1
two 2
b one 1
two 2
dtype: int64
stack.index
MultiIndex(levels=[[u'a', u'b'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
unstack = df.unstack()
unstack
one a 1
b 1
two a 2
b 2
dtype: int64
unstack.index
MultiIndex(levels=[[u'one', u'two'], [u'a', u'b']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
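A quick sanity check of the inverse relationship (a minimal sketch):
df.stack().unstack().equals(df)  # True: unstacking undoes the stack here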
Groupby
d = {'one':[1,1,1,1,1],
'two':[2,2,2,2,2],
'letter':['a','a','b','b','c']}
# Create dataframe
df = pd.DataFrame(d)
df
  letter  one  two
0      a    1    2
1      a    1    2
2      b    1    2
3      b    1    2
4      c    1    2
one = df.groupby('letter')
one.sum()
        one  two
letter
a         2    4
b         2    4
c         1    2
letterone = df.groupby(['letter','one']).sum()
letterone
            two
letter one
a      1      4
b      1      4
c      1      2
letterone.index
MultiIndex(levels=[[u'a', u'b', u'c'], [1]],
labels=[[0, 1, 2], [0, 0, 0]],
names=[u'letter', u'one'])
Outliers
# Create a dataframe with dates as your index
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'FL', 'FL']
data = [1.0, 1.2, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=9, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States
# Create a second dataframe
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 6, 11]
idx2 = pd.date_range('1/1/2013', periods=9, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
# combine the two dataframes
df = pd.concat([df1, df2])
pd.concat([df.head(2), df.tail(2)])  # peek at both ends; DataFrame.append was removed in pandas 2.0
            Revenue State
2012-01-01      1.0    NY
2012-02-01      1.2    NY
2013-08-01      6.0    FL
2013-09-01     11.0    FL
Method 1
# a new copy
newdf = df.copy()
newdf['x-Mean'] = abs(newdf['Revenue']-newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf
            Revenue State    x-Mean  1.96*std Outlier
2012-01-01      1.0    NY  6.177778  5.666103    True
2012-02-01      1.2    NY  5.977778  5.666103    True
2012-03-01      4.0    NY  3.177778  5.666103   False
2012-04-01      5.0    NY  2.177778  5.666103   False
2012-05-01      6.0    FL  1.177778  5.666103   False
2012-06-01      7.0    FL  0.177778  5.666103   False
2012-07-01      8.0    GA  0.822222  5.666103   False
2012-08-01      9.0    FL  1.822222  5.666103   False
2012-09-01     10.0    FL  2.822222  5.666103   False
2013-01-01     10.0    NY  2.822222  5.666103   False
2013-02-01     10.0    NY  2.822222  5.666103   False
2013-03-01      9.0    NY  1.822222  5.666103   False
2013-04-01      9.0    NY  1.822222  5.666103   False
2013-05-01      8.0    FL  0.822222  5.666103   False
2013-06-01      8.0    FL  0.822222  5.666103   False
2013-07-01      7.0    GA  0.177778  5.666103   False
2013-08-01      6.0    FL  1.177778  5.666103   False
2013-09-01     11.0    FL  3.822222  5.666103   False
Method 2
Group by item: compute the Outlier flag per state.
The methods commonly applied after a groupby differ as follows (a sketch contrasting the three follows the list).
Agg: aggregate within each group, computing a summary statistic (or statistics) for each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
Transform: perform some group-specific computation within each group and return a like-indexed object. Some examples:
- Standardizing data (zscore) within a group
- Filling NAs within groups with a value derived from each group
Filter: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
- Discarding data that belongs to groups with only a few members
- Filtering out data based on the group sum or mean
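A minimal sketch contrasting the three (toy data and illustrative lambdas, not part of the original tutorial):
toy = pd.DataFrame({'letter': ['a', 'a', 'b', 'b', 'c'],
                    'two': [2, 2, 2, 2, 2]})
g = toy.groupby('letter')
g['two'].agg('sum')                         # agg: one value per group -> a: 4, b: 4, c: 2
g['two'].transform(lambda x: x - x.mean())  # transform: like-indexed with toy; all 0.0 here
g.filter(lambda x: len(x) > 1)              # filter: keep groups with more than one row (drops 'c')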
newdf = df.copy()
State = newdf.groupby('State')
newdf['x-Mean'] = State.transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = State.transform(lambda x: 1.96*x.std())
newdf['Outlier'] = State.transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf
            Revenue State  x-Mean  1.96*std Outlier
2012-01-01      1.0    NY   5.150  7.513293   False
2012-02-01      1.2    NY   4.950  7.513293   False
2012-03-01      4.0    NY   2.150  7.513293   False
2012-04-01      5.0    NY   1.150  7.513293   False
2012-05-01      6.0    FL   2.125  3.543134   False
2012-06-01      7.0    FL   1.125  3.543134   False
2012-07-01      8.0    GA   0.500  1.385929   False
2012-08-01      9.0    FL   0.875  3.543134   False
2012-09-01     10.0    FL   1.875  3.543134   False
2013-01-01     10.0    NY   3.850  7.513293   False
2013-02-01     10.0    NY   3.850  7.513293   False
2013-03-01      9.0    NY   2.850  7.513293   False
2013-04-01      9.0    NY   2.850  7.513293   False
2013-05-01      8.0    FL   0.125  3.543134   False
2013-06-01      8.0    FL   0.125  3.543134   False
2013-07-01      7.0    GA   0.500  1.385929   False
2013-08-01      6.0    FL   2.125  3.543134   False
2013-09-01     11.0    FL   2.875  3.543134   False