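A DataFrame represents a rectangular table of data and has both a row index and a column index.
Construction (the sessions below assume import pandas as pd and import numpy as np)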
In [43]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...: 'year' : [2000, 2001, 2002, 2001, 2001, 2003],
...: 'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [44]: frame = pd.DataFrame(data)
In [45]: frame
Out[45]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2001 2.9
5 Nevada 2003 3.2
For a large DataFrame, the head method selects only the first five rows
In [46]: frame.head()
Out[46]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2001 2.9
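As a small sketch of the same idea, head also accepts the number of rows to return, and tail does the same for the end of the table. The variable names first_two and last_two are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2001, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

first_two = frame.head(2)  # head accepts the number of rows to return
last_two = frame.tail(2)   # tail is the counterpart for the final rows

print(first_two)
print(last_two)
```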
Passing columns arranges the columns in the specified order
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2001 Nevada 2.9
5 2003 Nevada 3.2
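If a passed column is not a key of the dictionary, it appears in the result filled with missing values (NaN)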
In [49]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
...: index=['one', 'two', 'three', 'four', 'five', 'six'])
In [50]: frame2
Out[50]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2001 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
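The following minimal sketch, on a shortened two-row version of the data (an assumption made to keep it small), shows both effects of the columns argument: it selects and orders the dictionary keys, and any name that is not a key becomes an all-NaN column.

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}

# Only the listed keys are used, in the listed order.
subset = pd.DataFrame(data, columns=["pop", "state"])

# "debt" is not a key of data, so the column is created and filled with NaN.
with_debt = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

print(list(subset.columns))
print(with_debt["debt"].isna().all())
```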
A column can be retrieved as a Series either with dict-like notation or as an attribute
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2001
six 2003
Name: year, dtype: int64
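One caveat worth sketching: attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The column pop from this data is such a collision, because DataFrame already has a pop method. The small frame below is an illustrative assumption.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]},
                      index=["one", "two"])

by_key = frame2["year"]
by_attr = frame2.year        # same data as frame2["year"]

pop_column = frame2["pop"]   # the column, as a Series
pop_attr = frame2.pop        # DataFrame.pop, a bound method, not the column

print(by_key.equals(by_attr))
print(callable(pop_attr))
```

Bracket notation works for every column name, so it is the safer default.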
Rows can also be selected, by label with the special loc attribute (or by integer position with iloc)
In [53]: frame2.loc['three']
Out[53]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
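To make the label/position distinction concrete, here is a short sketch on a three-row frame (an illustrative assumption): loc looks a row up by its index label, iloc by its integer position.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002],
     "state": ["Ohio", "Ohio", "Ohio"],
     "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_label = frame2.loc["three"]   # row whose index label is "three"
by_position = frame2.iloc[2]     # third row, counting from zero

print(by_label.equals(by_position))
```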
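Columns can be modified by assignment. A scalar is broadcast to every row; an array must match the length of the DataFrame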
In [54]: frame2['debt'] = 16.5
In [55]: frame2
Out[55]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2001 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [56]: frame2['debt'] = np.arange(6.)
In [57]: frame2
Out[57]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2001 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
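The length requirement can be checked with a small sketch. The three-row frame and the name length_error are assumptions for illustration; the point is that a list of the wrong length is rejected with a ValueError.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]},
                      index=["one", "two", "three"])

frame2["debt"] = 16.5            # scalar: repeated for every row
frame2["debt"] = np.arange(3.0)  # array: length equals the number of rows

try:
    frame2["debt"] = [1.0, 2.0]  # two values for three rows
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame2["debt"].tolist())
print(type(length_error).__name__)
```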
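When a Series is assigned to a column, its labels are aligned to the DataFrame's index, and missing values are inserted for any labels it lacks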
In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [59]: frame2['debt'] = val
In [60]: frame2
Out[60]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2001 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
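A brief sketch of the alignment rule, using a hypothetical three-row frame: the order of the Series does not matter, and a label that is absent from the DataFrame's index ("seven" here) is simply dropped.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]},
                      index=["one", "two", "three"])

# Labels are out of order, and "seven" does not exist in frame2.
val = pd.Series([-1.7, -1.2, 9.9], index=["three", "two", "seven"])
frame2["debt"] = val

print(frame2)
```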
The del keyword deletes a column. Here a new boolean column, eastern, is added first and then removed
In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2
Out[62]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2001 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
In [63]: del frame2['eastern']
In [64]: frame2.columns
Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
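For comparison, a sketch of a non-destructive alternative: the drop method returns a new DataFrame without the column, whereas del modifies the existing object in place. The names without_eastern and still_present are illustrative.

```python
import pandas as pd

frame2 = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]},
                      index=["one", "two"])
frame2["eastern"] = frame2["state"] == "Ohio"

# drop returns a new DataFrame; frame2 itself still has the column.
without_eastern = frame2.drop(columns=["eastern"])
still_present = "eastern" in frame2.columns

# del removes the column from frame2 in place.
del frame2["eastern"]

print(list(without_eastern.columns))
print(list(frame2.columns))
```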
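In the pandas version used for these notes, a column obtained by indexing is a view of the underlying data, so in-place changes to that Series show up in the DataFrame. To work on an independent copy, call the Series's copy method explicitly.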
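A minimal sketch of the safe pattern, on a hypothetical one-column frame: after an explicit copy, changes to the copied Series leave the DataFrame untouched.

```python
import pandas as pd

frame = pd.DataFrame({"debt": [1.0, 2.0, 3.0]},
                     index=["one", "two", "three"])

debt_copy = frame["debt"].copy()  # independent copy of the column
debt_copy["one"] = 99.0           # modifies only the copy

print(frame.loc["one", "debt"])
print(debt_copy["one"])
```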
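Another common form of data is a nested dict of dicts. The outer keys become the columns and the inner keys become the row index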
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
...: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3
Out[67]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
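The sketch below rebuilds the same frame. Whether the combined inner keys come out sorted may depend on the pandas version, so this example calls sort_index explicitly instead of relying on the constructor; that call is an addition, not part of the original session.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, union of inner keys -> row index.
frame3 = pd.DataFrame(pop).sort_index()

print(frame3)
```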
Transposing swaps the rows and columns
In [68]: frame3.T
Out[68]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
If an index is specified explicitly, the inner dict keys are not combined and sorted; the given index is used as is
In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[69]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
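As a quick check of this behavior, the sketch below passes an explicit index: inner keys outside it (2000) are left out, and index entries with no data (2003) become NaN rows. The name frame is illustrative.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame)
```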
A dict of Series can also be used to construct a DataFrame
In [70]: pdata = {'Ohio': frame3['Ohio'][: -1],
...: 'Nevada': frame3['Nevada'][: 2]}
In [71]: pd.DataFrame(pdata)
Out[71]:
Ohio Nevada
2000 1.5 NaN
2001 1.7 2.4
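The same construction can be sketched with two hand-built Series (hypothetical values matching the output above): the row index of the result is the union of the Series indexes, with NaN where a Series has no entry.

```python
import pandas as pd

ohio = pd.Series([1.5, 1.7], index=[2000, 2001])
nevada = pd.Series([2.4], index=[2001])

frame = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})

print(frame)
```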
The index and the columns each have a name attribute; when set, the names are displayed as well
In [72]: frame3.index.name = 'year'
In [73]: frame3.columns.name = 'state'
In [74]: frame3
Out[74]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [75]: frame3.values
Out[75]:
array([[nan, 1.5],
[2.4, 1.7],
[2.9, 3.6]])
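When the columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns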
In [77]: frame2.values
Out[77]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2001, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
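A small sketch of how the common dtype is picked, using two hypothetical frames: integer and float columns together yield a float64 array, while a mix of strings and numbers falls back to object.

```python
import pandas as pd

numeric = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})
mixed = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

numeric_dtype = numeric.values.dtype  # int64 and float64 columns combined
mixed_dtype = mixed.values.dtype      # strings and floats combined

print(numeric_dtype)
print(mixed_dtype)
```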
Index objects
Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index object
In [78]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [79]: index = obj.index
In [80]: index
Out[80]: Index(['a', 'b', 'c'], dtype='object')
In [81]: index[1:]
Out[81]: Index(['b', 'c'], dtype='object')
In [82]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-82-a452e55ce13b> in <module>
----> 1 index[1] = 'd'
c:\users\a\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
3881
3882 def __setitem__(self, key, value):
-> 3883 raise TypeError("Index does not support mutable operations")
3884
3885 def __getitem__(self, key):
TypeError: Index does not support mutable operations
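Since individual labels cannot be assigned, one way to change them is to replace the whole index. The sketch below catches the TypeError and then assigns a new list of labels to obj.index; the name error is illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

try:
    obj.index[1] = "d"      # element assignment is not supported
    error = None
except TypeError as exc:
    error = exc

obj.index = ["a", "d", "c"]  # replacing the entire index is allowed

print(type(error).__name__)
print(list(obj.index))
```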
In [83]: labels = pd.Index(np.arange(3))
In [84]: labels
Out[84]: Int64Index([0, 1, 2], dtype='int64')
In [85]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [86]: obj2
Out[86]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [87]: obj2.index is labels
Out[87]: True
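Index objects are immutable, as the TypeError above shows, which makes it safe to share one Index among several data structures. An Index also behaves like a fixed-size set, so membership can be tested with in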
In [88]: frame3
Out[88]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [89]: frame3.columns
Out[89]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [90]: 'Ohio' in frame3.columns
Out[90]: True
In [91]: 2003 in frame3.columns
Out[91]: False
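The membership tests above can be sketched in a self-contained form. The same in test works on the row index, and applying in to the DataFrame itself checks the column labels. The variable names are illustrative.

```python
import pandas as pd

frame3 = pd.DataFrame({"Nevada": {2001: 2.4, 2002: 2.9},
                       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}})

in_columns = "Ohio" in frame3.columns
missing_in_columns = 2003 in frame3.columns
in_index = 2002 in frame3.index
in_frame = "Ohio" in frame3   # in on a DataFrame tests the column labels

print(in_columns, missing_in_columns, in_index, in_frame)
```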
In [92]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [93]: dup_labels
Out[93]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
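Unlike a Python set, an Index may hold duplicate labels, as dup_labels shows. A short sketch of the consequences, with a hypothetical Series s built on that index: is_unique reports the duplication, unique returns the distinct labels, and selecting a duplicated label returns every matching entry.

```python
import pandas as pd

dup_labels = pd.Index(["foo", "foo", "bar", "bar"])
s = pd.Series([1, 2, 3, 4], index=dup_labels)

print(dup_labels.is_unique)
print(list(dup_labels.unique()))
print(s["foo"].tolist())   # both rows labelled "foo"
```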