Creating a DataFrame:
import numpy as np
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]
        }
# Build the DataFrame from a dict that maps column names to lists of values
football = pd.DataFrame(data)
print(football)
Output:
   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
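The column order above (losses, team, wins, year) is alphabetical, which is how older pandas releases ordered dict keys; newer releases keep the dict's insertion order instead. As a minimal sketch, assuming you want the same order on any version, the `columns` argument of `pd.DataFrame` can name the columns explicitly (the order chosen here is only an illustration):

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

# columns= fixes the column order explicitly instead of relying on
# the default of whichever pandas version is installed.
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
print(list(football.columns))
```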
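Pandas also provides several methods and attributes that help you understand basic information about a DataFrame:
- dtypes: returns the data type of each column
- describe: returns basic summary statistics for the DataFrame's numeric columns
- head: shows the first 5 rows of the dataset
- tail: shows the last 5 rows of the dataset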
Test:
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print(football.dtypes)      # data type of each column
print()
print(football.describe())  # summary statistics of the numeric columns
print()
print(football.head())      # first 5 rows
print()
print(football.tail())      # last 5 rows
Result:
# data types
losses     int64
team      object
wins       int64
year       int64
dtype: object

# summary statistics
          losses       wins         year
count   8.000000   8.000000     8.000000  # count (number of values)
mean    6.625000   9.375000  2011.125000  # mean
std     3.377975   3.377975     0.834523  # standard deviation
min     1.000000   4.000000  2010.000000  # minimum
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000  # maximum

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
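The results above use the default of 5 rows for head and tail. As a small sketch, both methods also accept an optional row count, and describe leaves out the non-numeric 'team' column by default (the variable names `first_three`, `last_two`, and `summary` are only illustrative):

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

# head(n) and tail(n) return the first or last n rows; n defaults to 5.
first_three = football.head(3)
last_two = football.tail(2)
print(first_three)
print(last_two)

# describe() summarises only the numeric columns unless told otherwise,
# so 'team' does not appear among the summary's columns.
summary = football.describe()
print(sorted(summary.columns))
```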
Another code example:
from pandas import DataFrame, Series

#################
# Syntax Reminder:
#
# The following code would create a two-column pandas DataFrame
# named df with columns labeled 'name' and 'age':
#
# people = ['Sarah', 'Mike', 'Chrisna']
# ages = [28, 32, 25]
# df = DataFrame({'name' : Series(people),
#                 'age' : Series(ages)})

def create_dataframe():
    '''
    Create a pandas dataframe called 'olympic_medal_counts_df' containing
    the data from the table of 2014 Sochi winter olympics medal counts.

    The columns for this dataframe should be called
    'country_name', 'gold', 'silver', and 'bronze'.

    There is no need to specify row indexes for this dataframe
    (in this case, the rows will automatically be assigned numbered indexes).

    You do not need to call the function in your code when running it in the
    browser - the grader will do that automatically when you submit or test it.
    '''

    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea',
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

    # The docstring asks for 'country_name'; the key 'countries' is kept here
    # because it is the label that appears in the printed result.
    data = {
        'countries': countries,
        'gold': gold,
        'silver': silver,
        'bronze': bronze}
    olympic_medal_counts_df = DataFrame(data)
    return olympic_medal_counts_df
Result:
    bronze       countries  gold  silver
0        9    Russian Fed.    13      11
1       10          Norway    11       5
2        5          Canada    10      10
3       12   United States     9       7
4        9     Netherlands     8       7
5        5         Germany     8       6
6        2     Switzerland     6       3
7        1         Belarus     5       0
8        5         Austria     4       8
9        7          France     4       4
10       1          Poland     4       1
11       2           China     3       4
12       2           Korea     3       3
13       6          Sweden     2       7
14       2  Czech Republic     2       4
15       4        Slovenia     2       2
16       3           Japan     1       4
17       1         Finland     1       3
18       2   Great Britain     1       1
19       1         Ukraine     1       0
20       0        Slovakia     1       0
21       6           Italy     0       2
22       2          Latvia     0       2
23       1       Australia     0       2
24       0         Croatia     0       1
25       1      Kazakhstan     0       0
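The syntax reminder wraps each column in a Series, while create_dataframe passes plain lists. As a minimal sketch using the reminder's own sample data, the two forms can be compared side by side (the names `df_from_series` and `df_from_lists` are only illustrative):

```python
from pandas import DataFrame, Series

people = ['Sarah', 'Mike', 'Chrisna']
ages = [28, 32, 25]

# Form used in the syntax reminder: each column wrapped in a Series.
df_from_series = DataFrame({'name': Series(people), 'age': Series(ages)})

# Form used in create_dataframe: plain lists passed directly.
df_from_lists = DataFrame({'name': people, 'age': ages})

# Both get a default numbered index, so the comparison looks only
# at whether the rows and columns hold the same values.
print(df_from_series.equals(df_from_lists))
```

Wrapping in a Series mainly matters when the columns carry their own index labels; for plain lists of equal length, passing the lists directly is the shorter option.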