Data Analysis: The Two Data Structures in Pandas

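I. Data structures

pandas builds on NumPy. NumPy handles homogeneous numerical arrays, while pandas adds labels and is suited to tabular data that mixes numbers with non-numerical values such as strings and timestamps.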

II. Common data types:

1. Series: a one-dimensional labeled array

① Creating a Series

# Import the packages
import numpy as np
import pandas as pd

# Build a Series from a list
obj = pd.Series([4, 7, -5, 3])
print(obj)
print(obj.values)
print(obj.index)

# Build a Series from a NumPy array
s2 = pd.Series(np.arange(7))
print(s2)

# Build a Series from a dict; the dict keys become the index
s3 = pd.Series({'1': 1, '2': 2, '3': 3})
print(s3)
print(s3.values)
print(s3.index)

② The default index starts at 0; the examples below use a custom index

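obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2['a'])  # look up a single value by label

obj2['d'] = 6  # assign by label
print(obj2[['c', 'a', 'd']])  # select several labels at once

obj2[obj2 > 0]  # boolean filtering keeps the index

obj2 * 2  # element-wise arithmetic keeps the index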

③ Common attributes and methods

np.exp(obj2)  # NumPy functions apply element-wise and keep the index

print('b' in obj2)  # membership tests check the index labels
print('e' in obj2)

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)  # values are matched by dict key; 'California' has no match and becomes NaN
obj4

print(pd.isnull(obj4))
print(pd.notnull(obj4))

print(obj3 + obj4)  # data is aligned on the index labels automatically
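To make the alignment rule concrete, here is a minimal sketch that reuses the sdata dictionary from above. The use of Series.add with fill_value is an addition for illustration and is not part of the original walkthrough.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain addition aligns on labels; a label missing on either side gives NaN.
total = obj3 + obj4

# Series.add with fill_value treats a value missing on one side as 0.
# When the value is missing on both sides, the result is still NaN.
filled = obj3.add(obj4, fill_value=0)

print(total)
print(filled)
```

Here 'Utah' exists only in obj3, so it is NaN in total but keeps its value in filled, while 'California' has no data on either side and stays NaN in both.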

import string

a = {string.ascii_uppercase[i]: i for i in range(10)}
obj = pd.Series(a)
print(obj[:3])  # positional slice: the first three items

obj4.name = 'population'  # name of the Series itself
obj4.index.name = 'state'  # name of the index
obj4

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index in place
obj

2. DataFrame: a two-dimensional table in which each column is a Series

① Creating a DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)  # the dict keys become the column names
frame

frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order
frame

# Specify both row and column labels; a column with no data, such as debt, is filled with NaN
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
frame2

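print(frame2['state'])  # selecting a column by name returns a Series
print(frame2.year)  # attribute access works for column names that are valid identifiers
print(type(frame.state))

print(frame2.loc['three'])  # loc selects rows by label
print(frame2.iloc[0])  # iloc selects rows by integer position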
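The difference between label-based and position-based access is easy to mix up, so here is a small self-contained sketch. The frame and its values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['Ohio', 'Ohio', 'Nevada']},
    index=['one', 'two', 'three'],
)

row_by_label = frame.loc['three']     # label-based row lookup
row_by_position = frame.iloc[2]       # position-based row lookup, counting from 0
cell = frame.loc['two', 'year']       # a single value by row label and column name

# Position slices exclude the end, like ordinary Python slices.
block = frame.iloc[0:2, 0:1]

# Label slices include the end label.
label_slice = frame.loc['one':'two']
```

Both row lookups return the same row here because 'three' happens to be the third label.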

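frame2['debt'] = 16.5  # assign a scalar to the whole column
frame2

frame2['debt'] = np.arange(5.)  # assign an array of matching length
frame2

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val  # aligned on the index; rows without a match get NaN
frame2

frame2['eastern'] = frame2.state == 'Ohio'  # add a boolean column so there is something to delete
del frame2['eastern']  # delete the column
frame2.columns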

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)  # nested dict: outer keys become columns, inner keys become the row index
frame3.T  # transpose

pd.DataFrame(pop, index=[2001, 2002, 2003])  # 2003 has no matching data, so that row is filled with NaN

pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}  # a dict of Series works as well as a dict of arrays
pd.DataFrame(pdata)

frame3.index.name = 'year'  # name the row index and the columns
frame3.columns.name = 'state'
frame3

print('Ohio' in frame3.columns)
print(2003 in frame3.index)

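III. Index methods and attributes

append:                  concatenates another Index object, producing a new Index.
difference:              computes the set difference as a new Index (very old pandas releases called this diff).
intersection:            computes the intersection.
union:                   computes the union.
isin:                    returns a boolean array indicating whether each value is contained in the passed collection.
delete:                  removes the element at position i and returns a new Index.
drop:                    removes the passed values and returns a new Index.
insert:                  inserts an element at position i and returns a new Index.
is_monotonic_increasing: True if the values never decrease (older releases also exposed this as is_monotonic).
is_unique:               True if the Index has no duplicate values.
unique:                  returns the unique values in the Index.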
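The following sketch runs each of these on two small Index objects. The labels are arbitrary, and the method names are those of current pandas releases.

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

appended = a.append(b)            # keeps duplicates: a, b, c, b, c, d
difference = a.difference(b)      # labels in a but not in b
intersection = a.intersection(b)  # labels in both
union = a.union(b)                # labels in either
mask = a.isin(['a', 'z'])         # one boolean per element of a
deleted = a.delete(0)             # remove by position
dropped = a.drop('b')             # remove by value
inserted = a.insert(1, 'x')       # insert 'x' at position 1

increasing = a.is_monotonic_increasing
unique_flag = appended.is_unique
unique_values = appended.unique()
```

Every call returns a new object; an Index is immutable, so a and b are left unchanged.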

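IV. Summary of pandas essentials

Series: one-dimensional, a labeled sequence of values
DataFrame: two-dimensional, with both row labels and column labels
Column dtypes: df.dtypes
Number of rows and columns: df.shape
Column index: df.columns
Row index: df.index
Non-null count per column: df.count()
First five rows: df.head()
Last five rows: df.tail()
Number of dimensions: df.ndim
Underlying values as a two-dimensional array: df.values
Overview (row count, column count, column index, non-null count per column, column dtypes, memory usage): df.info()
Quick summary statistics (count, mean, standard deviation, minimum, quartiles, maximum): df.describe()
Rename a column in place: df.rename(columns={'old_name': 'new_name'}, inplace=True)
Check whether each column contains nulls: df.isnull().any()
Rows with a missing value in any of the given columns: df[df[['col1', 'col2']].isnull().any(axis=1)]
Drop duplicate rows: df.drop_duplicates()
Drop duplicates based on specific columns: salesDf.drop_duplicates(subset=['col1', 'col2'])
Drop rows with missing data in the given columns: df = df.dropna(subset=['col1', 'col2'], how='any')
Renumber the index: df.reset_index(drop=True)
Type conversion, str to float ('销售数量' is the sales-quantity column): df['销售数量'] = df['销售数量'].astype('float')
Convert to datetime ('销售时间' is the sale-time column): pd.to_datetime(df['销售时间'], format='%Y-%m-%d', errors='coerce')
Sort (ascending by default; pass ascending=False for descending): salesDf.sort_values(by='col')
Select one value by row label and column name: df.loc[0, '销售时间']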
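As a rough illustration of how the cleaning steps above chain together, here is a sketch on a tiny made-up table. The column names sale_date and quantity and all values are hypothetical stand-ins for a real sales dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'sale_date': ['2018-01-01', '2018-01-02', 'not a date', '2018-01-02', None],
    'quantity': ['3', '2', '5', '2', '1'],
})

# Drop rows whose sale_date is missing.
df = df.dropna(subset=['sale_date'], how='any')

# Drop rows that repeat an earlier (sale_date, quantity) pair.
df = df.drop_duplicates(subset=['sale_date', 'quantity'])

# Renumber the index after rows were removed.
df = df.reset_index(drop=True)

# Convert the string quantities to floats.
df['quantity'] = df['quantity'].astype('float')

# Parse dates; values that do not match the format become NaT.
df['sale_date'] = pd.to_datetime(df['sale_date'], format='%Y-%m-%d', errors='coerce')

print(df.dtypes)
print(df)
```

With errors='coerce' the unparseable string does not raise an error; it turns into a missing timestamp that can be handled in a later step.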
Group and sum by month (assumes df has a DatetimeIndex):
gb = df.groupby(df.index.month)
monthDf = gb.sum()
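Grouping by df.index.month only works when the index holds datetimes. A minimal sketch, with an invented amount column and invented dates:

```python
import pandas as pd

df = pd.DataFrame(
    {'amount': [10.0, 20.0, 5.0, 7.5]},
    index=pd.to_datetime(['2018-01-03', '2018-01-20', '2018-02-01', '2018-02-15']),
)

# df.index.month yields the month number of each row, which serves as the group key.
monthly = df.groupby(df.index.month).sum()
print(monthly)
```

Note that the month number alone merges the same month of different years; if the data spans several years, the year would have to be part of the key as well.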
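Show all columns:
pd.set_option('display.max_columns', None)
Show all rows:
pd.set_option('display.max_rows', None)
Set the maximum displayed width of a value to 100 characters (the default is 50):
pd.set_option('display.max_colwidth', 100)
Change the data type: astype('str')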
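Add a column: df[column_name] = value
Delete a column: del df[column_name]
Modify a column: df[column_name] = new_value
Query rows: df.query("(a==1) & (b==2)")
df.loc[row_label] is label-based and accepts boolean masks: df.loc[df.b == 5, 'a'] = 20
df.iloc[row_positions, column_positions] is position-based and does not accept a boolean Series as a mask
df.T swaps the row and column labels (transpose)
df.sort_values([label, label1], ascending=False) sorts by values
df.sort_index(ascending=False) sorts by index
df.avg.rank(ascending=False) ranks the values
df['panduan'] = df.state == 'Ohio' flags whether state equals Ohio (True or False)
df.workYear.unique() lists the unique values
df.workYear.value_counts() counts the occurrences of each unique value
df.reindex([]) conforms the data to a new index
df.avg.cumsum() cumulative sum
df.describe() descriptive statistics
pd.cut(df.avg, bins=[0,5,10,20,30,40,100], labels=['0~5','5~10','10~20','20~30','30~40','40~100']) bins values into levels
df.groupby(by='city').count() groups by city and counts the non-null values per column in each group
df.groupby(by=['city','workYear']).min() groups by several keys, producing a MultiIndex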
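Binning and grouped counting from the list above can be sketched as follows. The city names and avg values are invented; the bins and labels are the ones listed.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Shanghai'],
    'avg': [4, 12, 8, 25, 60],
})

bins = [0, 5, 10, 20, 30, 40, 100]
labels = ['0~5', '5~10', '10~20', '20~30', '30~40', '40~100']

# Each interval is open on the left and closed on the right by default,
# so a value of exactly 5 would fall into '0~5'.
df['level'] = pd.cut(df['avg'], bins=bins, labels=labels)

# Non-null count per column within each city.
per_city = df.groupby(by='city').count()

# How many rows fall into each level.
level_counts = df['level'].value_counts()

print(df)
print(per_city)
```

The number of labels must be one less than the number of bin edges, otherwise pd.cut raises an error.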
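Combining tables
concat: stacks several tables, similar to UNION ALL in SQL. pd.concat([table1, table2]) stacks vertically by default; pd.concat([table1, table2], axis=1) places them side by side
join: table1.join(table2) joins on the index by default
merge: joins on key columns
position.merge(right=company, how='inner', on='companyId') when the key has the same name in both tables
position.merge(right=company, how='inner', left_on='companyId', right_on='Id') when the key names differ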
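The three ways of combining tables behave quite differently on the same input. The sketch below uses two small invented tables that borrow the position and company names from the lines above; the title and name columns are assumptions.

```python
import pandas as pd

position = pd.DataFrame({'companyId': [1, 2, 3],
                         'title': ['analyst', 'engineer', 'designer']})
company = pd.DataFrame({'Id': [1, 2], 'name': ['Acme', 'Globex']})

# concat, default axis=0: rows are stacked and duplicates are kept.
stacked = pd.concat([position, position])

# concat, axis=1: tables are placed side by side, aligned on the row index.
side_by_side = pd.concat([position, company], axis=1)

# merge on differently named key columns; an inner join drops companyId 3.
merged = position.merge(right=company, how='inner',
                        left_on='companyId', right_on='Id')

# join works on the index, so the key columns are moved into the index first.
# It is a left join by default, so companyId 3 is kept with a missing name.
joined = position.set_index('companyId').join(company.set_index('Id'))
```

A quick way to check a join is to compare row counts before and after: here merged has 2 rows while joined has 3.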
MultiIndex
df.sort_values(['city', 'workYear']).set_index(['city', 'workYear']) turns columns into the index
df.groupby(by=['city','workYear']).max().reset_index() turns the index back into columns
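A short sketch of moving between ordinary columns and a MultiIndex. The salary column and all values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Shanghai', 'Beijing', 'Shanghai', 'Beijing'],
    'workYear': ['1-3', '1-3', '3-5', '1-3'],
    'salary': [10, 12, 20, 15],
})

# Two columns become a two-level row index.
indexed = df.sort_values(['city', 'workYear']).set_index(['city', 'workYear'])

# Grouping by two keys also produces a two-level index.
grouped = df.groupby(by=['city', 'workYear']).max()

# A tuple selects one combination of the two levels.
beijing_max = grouped.loc[('Beijing', '1-3'), 'salary']

# reset_index turns the index levels back into ordinary columns.
flat = grouped.reset_index()
```

Selecting with only the first level, such as indexed.loc['Shanghai'], returns all rows for that city.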
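Text functions
df.field_name.str.count('数据分析') number of times the substring occurs ('数据分析' means "data analysis")
df.field_name.str.find('数据分析') position of the first occurrence, or -1 if absent
df.field_name.str[1:-1] drops the first and last characters, for example the surrounding square brackets
df.field_name.str[1:-1].str.replace("'", "") additionally removes a given character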
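The string accessor applies each operation to every element. A minimal sketch on two invented strings shaped like bracketed tag lists:

```python
import pandas as pd

s = pd.Series(["['data analysis', 'sql']", "['python']"])

counts = s.str.count('data analysis')   # occurrences per element
positions = s.str.find('python')        # first position, or -1 if absent
inner = s.str[1:-1]                     # drop the surrounding brackets
cleaned = inner.str.replace("'", "")    # remove the quote characters

print(cleaned)
```

Slicing with .str[1:-1] is purely positional, so it assumes every value really starts and ends with a bracket.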
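Handling missing values in pandas
df.loc[df.city == '上海', 'secondType'] = np.nan sets the column to NaN where the condition holds ('上海' is Shanghai)
df.fillna(1) fills NaN with 1
df.dropna() drops rows that contain NaN
s = pd.Series([1, 1, 2, 3]); s.drop_duplicates() removes duplicate values
s = pd.Series([1, 1, 2, 3]); p = s[~s.duplicated()] removes duplicate values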
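The missing-value and duplicate handling above can be tried on a small invented frame. The fill value 'unknown' is an assumption chosen to suit a text column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['Shanghai', 'Beijing', 'Shanghai'],
                   'secondType': ['a', 'b', 'c']})

# Set one column to NaN on the rows that match a condition.
df.loc[df.city == 'Shanghai', 'secondType'] = np.nan

filled = df.fillna('unknown')   # replace NaN with a placeholder
dropped = df.dropna()           # keep only rows without NaN

# Two equivalent ways to remove duplicate values from a Series.
# The Series is named s so that it does not shadow the pd module.
s = pd.Series([1, 1, 2, 3])
unique_a = s.drop_duplicates()
unique_b = s[~s.duplicated()]
```

Both fillna and dropna return new objects here; df itself still contains the NaN values afterwards.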
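apply: applies a custom function along the rows or columns of a DataFrame, or to each element of a Series. It is flexible, though usually slower than built-in vectorized methods
df.education.apply(lambda x: str(x) + '生') appends the suffix '生' to each value
Parameter axis=0 (the default) passes each column to the function; axis=1 passes each row
Top 5 rows per city (sorted here by companyId; sort by the salary column instead to rank by salary)
def func(x):
    return x.sort_values('companyId', ascending=False)[:5]
df.groupby('city').apply(func)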
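A top-N-per-group result can also be obtained without apply. The sketch below swaps in a different technique, sort_values followed by groupby(...).head(n), on an invented table with a hypothetical salary column; it is an alternative, not the method used above.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Shanghai', 'Shanghai', 'Shanghai', 'Beijing', 'Beijing'],
    'salary': [10, 30, 20, 15, 12],
})

# Sort the whole table once, then keep the first 2 rows of each city.
top2 = df.sort_values('salary', ascending=False).groupby('city').head(2)
print(top2)
```

The result keeps the original row labels and the global sort order, so rows of different cities may be interleaved.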
Pivot table: builds a new table
df.pivot_table(index='city', columns='workYear', values='companyId', aggfunc=['mean', 'sum'], margins=True, dropna=False)['mean'].loc['上海']
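To show what the pivot table and the trailing ['mean'].loc[...] selection return, here is a sketch on invented data with a hypothetical salary column as the values.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Shanghai', 'Shanghai', 'Beijing', 'Beijing'],
    'workYear': ['1-3', '3-5', '1-3', '1-3'],
    'salary': [10, 20, 12, 16],
})

# One block of columns per aggregation; margins=True adds an 'All' row and column.
table = df.pivot_table(index='city', columns='workYear', values='salary',
                       aggfunc=['mean', 'sum'], margins=True)

# Pick the 'mean' block, then the row for one city.
mean_shanghai = table['mean'].loc['Shanghai']
print(table)
```

Combinations with no data, such as Beijing with '3-5', appear as NaN in the table.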
Output: df····.reset_index().to_csv('filename.csv')

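The points above cover most of the pandas operations needed for day-to-day data processing. None of it is hard; what matters is practice. Keep at it!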
