from datetime import datetime
Python datetime objects
Getting the current time
now = datetime.now()
print(now) # 2022-07-31 14:15:11.898054
Creating a datetime manually
t1 = datetime(1996, 8, 3)
t2 = datetime(1996, 8, 14)
print(t1) # 1996-08-03 00:00:00
Arithmetic on datetime objects
diff = t1 - t2
print(diff) # -11 days, 0:00:00
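Subtracting two datetime objects yields a datetime.timedelta. The following is a minimal sketch of how that result can be inspected and reused; the name t3 is introduced here purely for illustration.

```python
from datetime import datetime, timedelta

t1 = datetime(1996, 8, 3)
t2 = datetime(1996, 8, 14)

diff = t1 - t2                 # result is a datetime.timedelta
print(type(diff).__name__)     # timedelta
print(diff.days)               # -11
print(diff.total_seconds())    # -950400.0

# Adding a timedelta to a datetime gives another datetime.
t3 = t1 + timedelta(days=11)   # t3 is an illustrative name
print(t3 == t2)                # True
```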
Converting to datetime objects
import pandas as pd
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.iloc[:5, :5])
'''
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 9633.0
'''
print(ebola.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 122 non-null object
1 Day 122 non-null int64
2 Cases_Guinea 93 non-null float64
3 Cases_Liberia 83 non-null float64
4 Cases_SierraLeone 87 non-null float64
5 Cases_Nigeria 38 non-null float64
6 Cases_Senegal 25 non-null float64
7 Cases_UnitedStates 18 non-null float64
8 Cases_Spain 16 non-null float64
9 Cases_Mali 12 non-null float64
10 Deaths_Guinea 92 non-null float64
11 Deaths_Liberia 81 non-null float64
12 Deaths_SierraLeone 87 non-null float64
13 Deaths_Nigeria 38 non-null float64
14 Deaths_Senegal 22 non-null float64
15 Deaths_UnitedStates 18 non-null float64
16 Deaths_Spain 16 non-null float64
17 Deaths_Mali 12 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.3+ KB
None
'''
The date information in the Date column is stored as strings (dtype object). Create a date_dt column that converts Date to the datetime type.
ebola['date_dt'] = pd.to_datetime(ebola['Date'])
print(ebola.iloc[:5, -5:])
'''
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali date_dt
0 NaN NaN NaN NaN 2015-01-05
1 NaN NaN NaN NaN 2015-01-04
2 NaN NaN NaN NaN 2015-01-03
3 NaN NaN NaN NaN 2015-01-02
4 NaN NaN NaN NaN 2014-12-31
'''
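You can specify the date format during conversion. format='%m/%d/%Y' describes what each position in the raw value 1/5/2015 means: month, day, then four-digit year.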
ebola['date_dt'] = pd.to_datetime(ebola['Date'], format='%m/%d/%Y')
print(ebola.iloc[:5, -5:])
'''
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali date_dt
0 NaN NaN NaN NaN 2015-01-05
1 NaN NaN NaN NaN 2015-01-04
2 NaN NaN NaN NaN 2015-01-03
3 NaN NaN NaN NaN 2015-01-02
4 NaN NaN NaN NaN 2014-12-31
'''
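An explicit format also makes it easier to notice values that do not match. The sketch below assumes a small hand-made Series (raw, with one deliberately invalid entry) rather than the Ebola file, and uses errors='coerce' so that unparseable strings become NaT instead of raising.

```python
import pandas as pd

# Hypothetical input: two dates in month/day/year form and one bad value.
raw = pd.Series(['1/5/2015', '12/31/2014', 'not a date'])

# errors='coerce' turns values that do not match the format into NaT.
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')
print(parsed)
print(parsed.isna().sum())   # number of values that failed to parse
```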
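to_datetime takes many parameters. If the date format starts with the day (14-08-1996) or with the year (1996-08-14), set dayfirst or yearfirst to True, respectively.
For other date formats, use Python's strptime syntax to specify the layout explicitly.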
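To illustrate the two options, here is a small sketch using the ambiguous string '03-08-1996', which is an example value chosen for this illustration and does not come from the dataset. By default pandas reads it month first; dayfirst=True flips the interpretation. The last line shows the strptime route with an explicit format.

```python
from datetime import datetime
import pandas as pd

# '03-08-1996' could mean March 8 or August 3.
month_first = pd.to_datetime('03-08-1996')               # default: month first
day_first = pd.to_datetime('03-08-1996', dayfirst=True)  # day first
print(month_first.month, day_first.month)                # 3 8

# Explicit layout with strptime format codes: day-month-year.
explicit = datetime.strptime('14-08-1996', '%d-%m-%Y')
print(explicit)                                          # 1996-08-14 00:00:00
```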
Loading data that contains dates
When loading data with read_csv, you can list the columns to parse as dates directly in the parse_dates parameter.
ebola = pd.read_csv('data/country_timeseries.csv', parse_dates=[0])
print(ebola.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 122 non-null datetime64[ns]
1 Day 122 non-null int64
2 Cases_Guinea 93 non-null float64
3 Cases_Liberia 83 non-null float64
4 Cases_SierraLeone 87 non-null float64
5 Cases_Nigeria 38 non-null float64
6 Cases_Senegal 25 non-null float64
7 Cases_UnitedStates 18 non-null float64
8 Cases_Spain 16 non-null float64
9 Cases_Mali 12 non-null float64
10 Deaths_Guinea 92 non-null float64
11 Deaths_Liberia 81 non-null float64
12 Deaths_SierraLeone 87 non-null float64
13 Deaths_Nigeria 38 non-null float64
14 Deaths_Senegal 22 non-null float64
15 Deaths_UnitedStates 18 non-null float64
16 Deaths_Spain 16 non-null float64
17 Deaths_Mali 12 non-null float64
dtypes: datetime64[ns](1), float64(16), int64(1)
memory usage: 17.3 KB
None
'''
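parse_dates also accepts column names instead of positions. The sketch below assumes an in-memory CSV (csv_text) built from three rows of the table shown earlier, so it can be run without the data file.

```python
import io
import pandas as pd

# Three rows copied from the Ebola table above, as an in-memory CSV.
csv_text = """Date,Day,Cases_Guinea
1/5/2015,289,2776
1/4/2015,288,2775
12/31/2014,284,2730
"""

# Select the date column by name rather than by position.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
print(sample.dtypes)
```

Naming the column tends to be more robust than parse_dates=[0] if the column order in the file ever changes.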
Extracting the parts of a date
d = pd.to_datetime('1996-08-14')
print(d) # 1996-08-14 00:00:00
print(type(d)) # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(d.year) # 1996
print(d.month) # 8
print(d.day) # 14
ebola['date_dt'] = pd.to_datetime(ebola['Date'])
print(ebola[['Date', 'date_dt']].head())
'''
Date date_dt
0 2015-01-05 2015-01-05
1 2015-01-04 2015-01-04
2 2015-01-03 2015-01-03
3 2015-01-02 2015-01-02
4 2014-12-31 2014-12-31
'''
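For a column of datetime values (a Series), use the dt accessor to reach datetime attributes and methods. A single Timestamp has no such accessor: d.dt raises AttributeError: 'Timestamp' object has no attribute 'dt'.
The year, month and day attributes below extract each part of the date.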
ebola['year'], ebola['month'], ebola['day'] = (ebola['date_dt'].dt.year, ebola['date_dt'].dt.month, ebola['date_dt'].dt.day)
print(ebola[['Date', 'date_dt','year', 'month', 'day']].head())
'''
Date date_dt year month day
0 2015-01-05 2015-01-05 2015 1 5
1 2015-01-04 2015-01-04 2015 1 4
2 2015-01-03 2015-01-03 2015 1 3
3 2015-01-02 2015-01-02 2015 1 2
4 2014-12-31 2014-12-31 2014 12 31
'''
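The difference between a scalar Timestamp and a datetime Series can be seen side by side in this short sketch; the Series s is made up for the illustration.

```python
import pandas as pd

d = pd.to_datetime('1996-08-14')                              # scalar -> Timestamp
s = pd.to_datetime(pd.Series(['1996-08-14', '2014-03-22']))   # Series of datetimes

print(d.year)                     # 1996  (attribute directly on the Timestamp)
print(s.dt.year.tolist())         # [1996, 2014]  (via the dt accessor)
print(hasattr(d, 'dt'))           # False
print(s.dt.day_name().tolist())   # dt also exposes methods, e.g. weekday names
```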
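Date arithmetic and Timedelta
The first day of the Ebola outbreak (the earliest date in the data) is 2014-03-22. To compute how many days into the outbreak each record falls, subtract that date from every date. The min method on the date column returns the outbreak start date.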
print(ebola.iloc[-5:, :5])
'''
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
117 2014-03-27 5 103.0 8.0 6.0
118 2014-03-26 4 86.0 NaN NaN
119 2014-03-25 3 86.0 NaN NaN
120 2014-03-24 2 86.0 NaN NaN
121 2014-03-22 0 49.0 NaN NaN
'''
print(ebola['date_dt'].min()) # 2014-03-22 00:00:00
ebola['outbreak_d'] = ebola['date_dt'] - ebola['date_dt'].min()
print(ebola[['Date', 'Day', 'outbreak_d']].head())
'''
Date Day outbreak_d
0 2015-01-05 289 289 days
1 2015-01-04 288 288 days
2 2015-01-03 287 287 days
3 2015-01-02 286 286 days
4 2014-12-31 284 284 days
'''
print(ebola[['Date', 'Day', 'outbreak_d']].tail())
'''
Date Day outbreak_d
117 2014-03-27 5 5 days
118 2014-03-26 4 4 days
119 2014-03-25 3 3 days
120 2014-03-24 2 2 days
121 2014-03-22 0 0 days
'''
Date arithmetic like this produces timedelta values; the outbreak_d column has dtype timedelta64[ns].
print(ebola.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 122 non-null datetime64[ns]
1 Day 122 non-null int64
2 Cases_Guinea 93 non-null float64
3 Cases_Liberia 83 non-null float64
4 Cases_SierraLeone 87 non-null float64
5 Cases_Nigeria 38 non-null float64
6 Cases_Senegal 25 non-null float64
7 Cases_UnitedStates 18 non-null float64
8 Cases_Spain 16 non-null float64
9 Cases_Mali 12 non-null float64
10 Deaths_Guinea 92 non-null float64
11 Deaths_Liberia 81 non-null float64
12 Deaths_SierraLeone 87 non-null float64
13 Deaths_Nigeria 38 non-null float64
14 Deaths_Senegal 22 non-null float64
15 Deaths_UnitedStates 18 non-null float64
16 Deaths_Spain 16 non-null float64
17 Deaths_Mali 12 non-null float64
18 date_dt 122 non-null datetime64[ns]
19 year 122 non-null int64
20 month 122 non-null int64
21 day 122 non-null int64
22 outbreak_d 122 non-null timedelta64[ns]
dtypes: datetime64[ns](2), float64(16), int64(4), timedelta64[ns](1)
memory usage: 22.0 KB
None
'''
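A timedelta column can be turned back into plain numbers or compared against a pd.Timedelta. The sketch below assumes a three-value Series (dates) taken from the table above rather than the full DataFrame.

```python
import pandas as pd

# Three dates from the Ebola data, including the outbreak start.
dates = pd.to_datetime(pd.Series(['2015-01-05', '2014-03-27', '2014-03-22']))

elapsed = dates - dates.min()      # timedelta Series
days = elapsed.dt.days             # integer day counts
print(days.tolist())               # [289, 5, 0]

# Timedelta values can be compared directly, e.g. "within the first week".
first_week = elapsed <= pd.Timedelta(days=7)
print(first_week.tolist())         # [False, True, True]
```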
datetime methods
banks = pd.read_csv('data/banklist.csv', parse_dates=[5, 6])
print(banks.head())
'''
Bank Name City ST \
0 Fayette County Bank Saint Elmo IL
1 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI
2 First NBC Bank New Orleans LA
3 Proficio Bank Cottonwood Heights UT
4 Seaway Bank and Trust Company Chicago IL
CERT Acquiring Institution Closing Date Updated Date
0 1802 United Fidelity Bank, fsb 2017-05-26 2017-07-26
1 30003 First-Citizens Bank & Trust Company 2017-05-05 2017-07-26
2 58302 Whitney Bank 2017-04-28 2017-07-26
3 35495 Cache Valley Bank 2017-03-03 2017-05-18
4 19328 State Bank of Texas 2017-01-27 2017-05-18
'''
print(banks.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bank Name 553 non-null object
1 City 553 non-null object
2 ST 553 non-null object
3 CERT 553 non-null int64
4 Acquiring Institution 553 non-null object
5 Closing Date 553 non-null datetime64[ns]
6 Updated Date 553 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 30.4+ KB
None
'''
# Add two columns: the quarter and the year in which each bank failed
banks['closing_quarter'], banks['closing_year'] = (banks['Closing Date'].dt.quarter,
banks['Closing Date'].dt.year)
# Number of bank failures per year
closing_year = banks.groupby(['closing_year']).size()
# Number of bank failures per quarter of each year
closing_year_q = banks.groupby(['closing_year', 'closing_quarter']).size()
# Plot the bank failures
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
closing_year.plot(ax=ax)  # draw on the axes created above
plt.show()
fig, ax = plt.subplots()
closing_year_q.plot(ax=ax)
plt.show()
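To check the grouping logic without the full file, the same steps can be run on a handful of dates. This sketch assumes only the five Closing Date values shown in the banks.head() output above; the names closing, sample, counts and wide are illustrative.

```python
import pandas as pd

# The five closing dates from the head() output above.
closing = pd.to_datetime(pd.Series([
    '2017-05-26', '2017-05-05', '2017-04-28', '2017-03-03', '2017-01-27',
]))

sample = pd.DataFrame({
    'closing_year': closing.dt.year,
    'closing_quarter': closing.dt.quarter,
})

# Count failures per (year, quarter); the result has a two-level index.
counts = sample.groupby(['closing_year', 'closing_quarter']).size()
print(counts)

# unstack pivots the quarter level into columns: one row per year.
wide = counts.unstack('closing_quarter')
print(wide)
```

In this sample two closures fall in the first quarter of 2017 and three in the second.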