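Because numpy requires every element of an array to have the same type, it becomes awkward when you need to store several kinds of information together.
pandas was created to fill this gap. Its data storage format closely resembles the data.frame in R.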
pandas
Building a DataFrame
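1) Build a dictionary, example_dict, whose keys are the column names and whose values are lists holding each column's data.
2) Convert the dictionary into a pandas DataFrame: example = pd.DataFrame(example_dict).
You can also import an external file instead, e.g. example = pd.read_csv('example.csv').
3) If the external file contains no row labels, you can assign them to the DataFrame with example.index = row_labels. If the file already contains row labels in its first column, pass index_col = 0 when importing it; otherwise pandas treats that column as ordinary data and adds its own default integer row labels.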
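The effect of index_col = 0 is easier to see side by side. The following is a minimal sketch that assumes a small, made-up CSV held in memory (csv_text is a hypothetical stand-in for a real file such as example.csv):

```python
import io
import pandas as pd

# Hypothetical CSV content whose first column holds the row labels.
csv_text = """,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JPN,Japan,False,588
"""

# Without index_col, the label column becomes an ordinary column
# (named 'Unnamed: 0') and pandas adds a default integer index.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row labels.
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(labeled)
```

io.StringIO is used only so the sketch runs without a file on disk; with a real file you would pass its path to pd.read_csv directly.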
import pandas as pd
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index = row_labels
# Print cars again
print(cars)
Selecting data from a DataFrame
data_frame[] with a single column name returns a pandas Series.
To get the result back as a DataFrame instead, use double square brackets: data_frame[[]].
data_frame[] also supports slicing: columns are selected by name, while rows can be selected only with a slice.
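The three bracket forms can be compared in a short sketch. It assumes a reduced version of the cars data built inline; the variable names as_series, as_frame, and middle_rows are illustrative only:

```python
import pandas as pd

# A small stand-in for the cars data, built inline for illustration.
cars = pd.DataFrame(
    {
        'country': ['United States', 'Australia', 'Japan', 'India'],
        'cars_per_cap': [809, 731, 588, 18],
    },
    index=['US', 'AUS', 'JPN', 'IN'],
)

as_series = cars['country']    # single brackets: a pandas Series
as_frame = cars[['country']]   # double brackets: a one-column DataFrame
middle_rows = cars[1:3]        # a slice selects rows by position (1 and 2)

print(type(as_series))
print(type(as_frame))
print(middle_rows)
```

Note that the row slice excludes its end position, just like slicing a Python list.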
loc and iloc
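data_frame.loc[] selects by label: pass it a list of row labels and a list of column labels, [row_labels, col_labels], to specify which rows and columns to return.
iloc works the same way as loc, except that it selects by integer position rather than by name.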
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out drives_right column as Series
print(cars.iloc[:, 2])
# Print out drives_right column as DataFrame
print(cars.iloc[:, [2]])
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ['cars_per_cap', 'drives_right']])
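Because cars.csv is not included here, the following self-contained sketch shows the same idea on an inline table. The column order is an assumption chosen so that drives_right sits at position 2; it illustrates that loc and iloc can address the same cells, one by label and one by position:

```python
import pandas as pd

# Inline stand-in for cars.csv; the column order here is assumed.
cars = pd.DataFrame(
    {
        'cars_per_cap': [809, 731, 588],
        'country': ['United States', 'Australia', 'Japan'],
        'drives_right': [True, False, False],
    },
    index=['US', 'AUS', 'JPN'],
)

# Select rows US and JPN, columns country and drives_right, by label.
by_label = cars.loc[['US', 'JPN'], ['country', 'drives_right']]

# Select the same cells by integer position.
by_position = cars.iloc[[0, 2], [1, 2]]

print(by_label)
print(by_label.equals(by_position))
```

Selecting by label is usually more robust, since positions shift whenever columns are reordered or added.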
Filtering with comparison operators
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500 # Boolean Series, one value per row
car_maniac = cars[many_cars] # keeps only the rows where the value is True
# Print car_maniac
print(car_maniac)
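- Conditions can also be combined with numpy's logical_and, logical_or, and logical_not for more flexible filtering. Python's plain and, or, and not keywords do not work element-wise on a Series.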
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Import numpy, you'll need this
import numpy as np
# Create medium: observations with cars_per_cap between 100 and 500
medium = cars[np.logical_and(cars['cars_per_cap'] > 100, cars['cars_per_cap'] < 500)]
# Print medium
print(medium)
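As an alternative to the numpy functions, pandas Series also support the element-wise operators &, |, and ~. The sketch below is an illustration rather than part of the original exercise; it rebuilds the cars_per_cap column inline, and the names extremes and not_many are hypothetical:

```python
import pandas as pd

# Inline copy of the cars_per_cap values used earlier.
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'],
)

cpc = cars['cars_per_cap']

# Each comparison needs its own parentheses, because & and |
# bind more tightly than > and <.
medium = cars[(cpc > 100) & (cpc < 500)]    # element-wise "and"
extremes = cars[(cpc < 50) | (cpc > 800)]   # element-wise "or"
not_many = cars[~(cpc > 500)]               # element-wise "not"

print(medium)
print(extremes)
print(not_many)
```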