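Introduction to Pandas
pandas is a Python data analysis library built around two main data structures, Series and DataFrame. A Series is a one-dimensional labeled array and is commonly used for time series data; a DataFrame holds structured, two-dimensional (tabular) data.
Installing Pandas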
pip install pandas
Reading data in different formats with Pandas
Reading a CSV file
import pandas as pd
df = pd.read_csv('file.csv')
print(df)
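To try read_csv without a file on disk, the sketch below parses CSV text from an in-memory buffer. The sample rows and the column names name and num are made up for illustration; a real path such as 'file.csv' would be passed in the same position.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "name,num\nMovies,12\nSports,5\n"

# read_csv accepts a path or any file-like object, such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels taken from the header row
```

The first line of the text is treated as the header row, so it supplies the column labels rather than a data row.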
Reading an HTML page
import pandas as pd
# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://www.jianshu.com/u/e635858eda0b')
print(tables)
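Data structures provided by Pandas
· Series: a one-dimensional labeled array, often used for time series data.
· DataFrame: structured data, a two-dimensional dataset with an index and column labels.
· Panel: three-dimensional data. Panel has been removed from current pandas releases.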
1. Series
Created from an array (list)
import pandas as pd
languages = ['python', 'ruby', 'c', 'c++']  # avoid naming a variable "list", which shadows the built-in
select = pd.Series(languages)
print(select)
Output:
0 python
1 ruby
2 c
3 c++
dtype: object
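A Series pairs each value with an index label. As a small sketch (the variable names are illustrative), the labels and values of a Series built from a list can be inspected separately:

```python
import pandas as pd

languages = ['python', 'ruby', 'c', 'c++']
select = pd.Series(languages)

# With no index given, pandas assigns integer labels counting from 0.
labels = list(select.index)
values = list(select)

print(labels)
print(values)
print(len(select))
```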
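Created from a dictionary
import pandas as pd
data = {'key1': '1', 'key2': '2', 'key3': '3'}  # avoid naming a variable "dict", which shadows the built-in
select = pd.Series(data, index=list(data.keys()))
print(select)
Output (on Python 3.7+ the keys keep their insertion order):
key1 1
key2 2
key3 3
dtype: object
print(select.iloc[0])  # by position
1
print(select.iloc[2])
3
print(select['key3'])  # by label
3
print(select.iloc[[2]])  # a list of positions returns a Series
key3 3
dtype: object
print(select.iloc[[0, 2, 1]])
key1 1
key3 3
key2 2
dtype: object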
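When a Series is built from a dictionary, the index argument also controls which keys are kept and in what order. The sketch below is illustrative only; 'key9' is a made-up label that is deliberately absent from the dictionary, to show that missing labels are filled with NaN.

```python
import pandas as pd

data = {'key1': '1', 'key2': '2', 'key3': '3'}

# index selects and orders the dictionary keys; 'key9' is not in data.
select = pd.Series(data, index=['key3', 'key1', 'key9'])

print(select)
print(select.isna().tolist())  # True marks the label with no matching key
```

Here 'key2' is dropped because it is not listed in index, and 'key9' appears with a missing value.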
Created from a single scalar value
import pandas as pd
string = 'henry'
select = pd.Series(string, index=range(3))  # the scalar is repeated for every index label
print(select)
Output:
0 henry
1 henry
2 henry
Slicing
print(select[1:])
1 henry
2 henry
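Slices can also be written with .iloc (positions) or .loc (labels), which makes the intent explicit. A minimal sketch on the same three-element Series; note that a .loc slice includes its end label while an .iloc slice excludes its end position:

```python
import pandas as pd

select = pd.Series('henry', index=range(3))

tail = select[1:]            # drops the first element
first_two = select.iloc[:2]  # positions 0 and 1 (end is exclusive)
by_label = select.loc[0:1]   # labels 0 through 1 (end is inclusive)

print(list(tail.index))
print(list(first_two.index))
print(list(by_label.index))
```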
2. DataFrame
2.1 Creating a DataFrame
A DataFrame can be created from a dictionary or an array, or from data read from an external source.
Dictionary
import pandas as pd
groups = ['Movies', 'Sports', 'Conding', 'Fishing', 'Dancing']
num = [12, 5, 18, 99, 88]
data = {'groups': groups, 'num': num}  # each key becomes a column
df = pd.DataFrame(data)
print(df)
Output:
groups num
0 Movies 12
1 Sports 5
2 Conding 18
3 Fishing 99
4 Dancing 88
Array
import pandas as pd
array = [['Movies', 12], ['Sports', 5], ['Conding', 18], ['Fishing', 99], ['Dancing', 88]]
df = pd.DataFrame(array, columns=['name', 'num'])  # each inner list becomes a row
print(df)
Output:
name num
0 Movies 12
1 Sports 5
2 Conding 18
3 Fishing 99
4 Dancing 88
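The two approaches differ only in orientation: a dictionary describes the data column by column, while a nested list describes it row by row. As a rough sketch with a shortened, illustrative dataset, both routes should produce the same table when the column names match:

```python
import pandas as pd

groups = ['Movies', 'Sports']
num = [12, 5]

# Column-oriented: one dictionary entry per column.
from_dict = pd.DataFrame({'groups': groups, 'num': num})

# Row-oriented: one inner list per row, with column names supplied separately.
rows = [['Movies', 12], ['Sports', 5]]
from_rows = pd.DataFrame(rows, columns=['groups', 'num'])

# equals compares shape, labels, dtypes, and values.
print(from_dict.equals(from_rows))
```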
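2.2 DataFrame operations
DataFrame methods and attributes
.shape returns the number of rows and columns
.describe() returns descriptive statistics for the numeric columns
.head() returns the first rows (5 by default)
.tail() returns the last rows (5 by default)
.columns holds the column labels
.index holds the row labels
.info() prints a summary of the index, columns, dtypes, and memory usage
import pandas as pd
groups = ['Movies', 'Sports', 'Conding', 'Fishing', 'Dancing']
num = [12, 5, 18, 99, 88]
data = {'groups': groups, 'num': num}
df = pd.DataFrame(data)
print(df.shape)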
(5, 2)
print(df.describe())
num
count 5.000000
mean 44.400000
std 45.224993
min 5.000000
25% 12.000000
50% 18.000000
75% 88.000000
max 99.000000
print(df.head())
groups num
0 Movies 12
1 Sports 5
2 Conding 18
3 Fishing 99
4 Dancing 88
print(df.columns)
Index(['groups', 'num'], dtype='object')
print(df.index)
RangeIndex(start=0, stop=5, step=1)
df.info()  # info is a method and must be called; it prints its summary directly
The summary lists the index range (5 entries, 0 to 4), the non-null count and dtype of each column, and the memory usage.
print(df.tail(3))
groups num
2 Conding 18
3 Fishing 99
4 Dancing 88
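The figures reported by describe() above can be recomputed from the column itself, which is a quick way to see what each row means. A minimal sketch using the same data; the variable names summary, mean, std, and median are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'groups': ['Movies', 'Sports', 'Conding', 'Fishing', 'Dancing'],
    'num': [12, 5, 18, 99, 88],
})

summary = df.describe()

# Recompute a few statistics directly from the column.
mean = df['num'].mean()
std = df['num'].std()        # sample standard deviation (ddof=1)
median = df['num'].median()  # corresponds to the 50% row in describe()

print(mean, summary.loc['mean', 'num'])
print(round(std, 6), round(summary.loc['std', 'num'], 6))
print(median, summary.loc['50%', 'num'])
```

Each printed pair should agree, since describe() uses the sample standard deviation and reports the median as the 50% quantile.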