LightGBM in Practice: Volatility Prediction (1)

This is a Kaggle competition: Optiver Realized Volatility Prediction
The goal is to use a stock's historical order book and trade data to predict its volatility over the next ten minutes. I won't rehash the basic financial concepts here.

1. Data overview

The training data for this competition covers 112 stocks. Each stock is sampled in 10-minute buckets (time_id), with detailed order book and trade records for each bucket.

1.1 Order book

(order book sample data)
  • time_id: every ten minutes forms one bucket, and time_id is the bucket's id
  • seconds_in_bucket: 10 minutes = 600 seconds, reset at the start of each bucket
  • price: normalized price, effectively a cumulative return, with the first record serving as the base value 1.00
  • size: the quantity quoted at this price
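
From the top of the book, the benchmark code below builds a size-weighted average price (WAP) as the theoretical price:

\text{WAP} = \frac{ \text{bid\_price}_1 \cdot \text{ask\_size}_1 + \text{ask\_price}_1 \cdot \text{bid\_size}_1 }{ \text{bid\_size}_1 + \text{ask\_size}_1 }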

1.2 Trade

(trade sample data)

The trade data shares time_id and seconds_in_bucket with the order book and records executed trades: price, size (the traded volume), and order_count (the number of orders involved), as used by the feature code in section 3.1.2.

2. Benchmark

To measure how good our predictions are, we need a simple prediction to serve as a benchmark; a model that scores worse than this benchmark is useless.

The simplest prediction is to use the volatility of the past 10 minutes directly as the forecast for the next 10 minutes. Scored with the root mean squared percentage error (RMSPE), this gives an error of 0.341, i.e. 34.1%.
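
Here the realized volatility of a bucket is computed from the log returns of the WAP, which is exactly what calculate_realized_volatility in 2.1.1 implements:

\sigma = \sqrt{ \sum_{i} r_i^2 }, \quad r_i = \log \frac{ P_{t_i} }{ P_{t_{i-1}} }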

RMSPE is defined as:
\text{RMSPE} = \sqrt{ \frac{ 1 }{ N } \sum_{i=1}^N (\frac{y_i - \hat{y}_i}{y_i})^2 }

2.1 Benchmark code

The benchmark can be computed with the code below; volatility and RMSPE are calculated in exactly the same way as in the LightGBM section later.

vol = get_historical_volatility('data/book_train.parquet/')
print_score(vol)

2.1.1 Volatility calculation

import pandas as pd
import numpy as np
import glob
import multiprocessing as mp

def vwap(row):
    # size-weighted average price (WAP) from the top of the book;
    # works on a single row as well as vectorized over a whole frame
    return (row.bid_price1 * row.ask_size1 + row.ask_price1 * row.bid_size1) / (row.ask_size1 + row.bid_size1)

def calculate_realized_volatility(frame):
    # theo_price = frame.apply(vwap, axis=1)  # row-wise version, much slower
    theo_price = vwap(frame)
    log_return = np.log(theo_price).diff().dropna()
    return np.sqrt(np.sum(log_return**2))

def get_stock_volatility(path):
    book = pd.read_parquet(path)
    # realized volatility for each 10-minute bucket
    vol = book.groupby('time_id').apply(calculate_realized_volatility).reset_index(name='volatility')
    # paths look like 'data/book_train.parquet/stock_id=<id>'
    vol['stock_id'] = int(path.split('=')[1])
    return vol.reindex(['stock_id', 'time_id', 'volatility'], axis=1)

def get_historical_volatility(train_dir):
    paths = glob.glob(f'{train_dir}/*')
    # process stock files in parallel, 4 at a time
    with mp.Pool(4) as p:
        results = p.map(get_stock_volatility, paths)
        return pd.concat(results)

2.1.2 RMSPE calculation

def convert_to_submit_format(vol):
    # row_id has the form '<stock_id>-<time_id>'
    row_id = pd.Series(vol.stock_id.astype(str) + '-' + vol.time_id.astype(str), name='row_id')
    return pd.concat([row_id, vol.iloc[:, 2]], axis=1)

def print_score(vol):
    from sklearn.metrics import r2_score

    train = convert_to_submit_format(pd.read_csv('data/train.csv'))
    pred = convert_to_submit_format(vol)

    def rmspe(y_true, y_pred):
        return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

    joined = pd.merge(pred, train, on='row_id')
    R2 = round(r2_score(y_true=joined.target, y_pred=joined.volatility), 3)
    RMSPE = round(rmspe(y_true=joined.target, y_pred=joined.volatility), 3)
    print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}')

3. Predicting volatility with LightGBM

In this section we apply LightGBM to predict volatility in a straightforward way, without any parameter tuning or feature-engineering refinement. The aim is simply to verify that LightGBM with default parameters and simple features already beats the benchmark, with the emphasis on how to drive the library :).

3.1 Feature extraction

For both the book and the trade data, we do some simple computations to lightly process the raw records.

Fundamentally, volatility reflects how violently the price moves. In theory, when a stock's liquidity is poor (i.e. a wide bid-ask spread and thin quotes), its volatility rises significantly. Likewise, a stock whose traded volume is large relative to its quoted size will also tend to have higher volatility.

Also, each time_id contains 10 minutes of data; intuitively, the influence of this data on future trading should decay over time, so it makes sense to split each bucket into several batches by time, as illustrated below.
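
For example, cut_by_time (defined in the Appendix) assigns each row a batch_id by integer-dividing seconds_in_bucket by the window length. A minimal sketch, assuming a 100-second window:

df = pd.DataFrame({'time_id': [5, 5, 5],
                   'seconds_in_bucket': [0, 150, 599]})
# a 600-second bucket with 100-second windows yields batches 0..5
print(cut_by_time(df, 100).tolist())  # [0, 1, 5]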

3.1.1 book features

On the book side, since volatility is based on the VWAP, I compute several VWAP variants to represent the cases where a given level is fully taken. Beyond that, the bid-ask spread, the gap between bid 1 and bid 2, and the gap between ask 1 and ask 2 are all worth including.

def vwap(row, bid_idx, ask_idx):
    # TODO: multi-level
    return (row[f'bid_price{bid_idx}'] * row[f'ask_size{ask_idx}']
            + row[f'ask_price{ask_idx}'] * row[f'bid_size{bid_idx}']) / (
                row[f'ask_size{ask_idx}'] + row[f'bid_size{bid_idx}'])

def get_book(stock_id):
    book = pd.read_parquet(f"data/book_train.parquet/stock_id={stock_id}")
    
    # VWAPs
    
    # top-of-book prices unchanged
    book['vwap11'] = vwap(book, 1, 1)
    # ask level 1 is fully taken
    book['vwap12'] = vwap(book, 1, 2)
    # bid level 1 is fully taken
    book['vwap21'] = vwap(book, 2, 1)
    # both bid and ask level 1 are fully taken
    book['vwap22'] = vwap(book, 2, 2)
    
    book['bid_ask_spread'] = book.ask_price1 - book.bid_price1
    book['bid_gap'] = book.bid_price1 - book.bid_price2
    book['ask_gap'] = book.ask_price2 - book.ask_price1
    
    book['bid_imbalance'] = book.bid_size1 / book.bid_size2
    book['ask_imbalance'] = book.ask_size1 / book.ask_size2
    
    return book

def get_book_features(book, window):
    book['batch_id'] = cut_by_time(book, window)
    
    feature_dict = {
        'vwap11': ['mean', 'std', 'max', calculate_realized_volatility],
        'vwap12': ['mean', 'std', 'max', calculate_realized_volatility],
        'vwap21': ['mean', 'std', 'max', calculate_realized_volatility],
        'vwap22': ['mean', 'std', 'max', calculate_realized_volatility],
        'bid_gap': ['mean', 'std', 'max'],
        'ask_gap': ['mean', 'std', 'max'],
        'bid_ask_spread': ['mean', 'std', 'max'],
        'bid_size1': ['mean', 'std', 'max', 'sum'],
        'ask_size1': ['mean', 'std', 'max', 'sum'],
        'bid_imbalance': ['mean', 'std', 'max'],
        'ask_imbalance': ['mean', 'std', 'max'],
    }
    
    return get_features(book, feature_dict)
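
As a quick sanity check of the output shape (a sketch; stock 0 and the 100-second window are my assumptions, the latter consistent with the 294 total features mentioned in section 3.2):

book_features = get_book_features(get_book(0), window=100)
# 39 book aggregates x 6 batches = 234 feature columns, plus time_id_
print(book_features.shape)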

3.1.2 trade features

I extracted a few trade features fairly casually, such as the trade volume, the trade amount, and so on.

def get_trade(stock_id):
    trade = pd.read_parquet(f"data/trade_train.parquet/stock_id={stock_id}").rename({'size':'trade_volume'}, axis=1)
    trade['trade_amount'] = trade.price * trade.trade_volume
    trade['per_trade_quantity'] = trade.trade_volume / trade.order_count
    
    return trade

def get_trade_features(trade, window):
    trade['batch_id'] = cut_by_time(trade, window)
    feature_dict = {
        'trade_volume': ['mean', 'std', 'max', 'sum'],
        'trade_amount': ['mean', 'std', 'max'],
        'per_trade_quantity': ['mean', 'std', 'max']
    }

    return get_features(trade, feature_dict)


def get_all_features(stock_id, window):
    book = get_book(stock_id)
    book_features = get_book_features(book, window)
    trade = get_trade(stock_id)
    trade_features = get_trade_features(trade, window)
    
    merged = pd.merge(book_features, trade_features, on=['time_id_'])
    return merged

def get_train_data(stocks, window):
    df = []
    for stock in stocks:
        features = get_all_features(stock, window)
        features['stock_id_'] = stock
        df.append(features)
    return pd.concat(df).reset_index(drop=True)

3.2 Applying LightGBM

Section 3.1 left us with a pile of features (294 dimensions), which we can now feed into LightGBM for training. The dataset ships with train.csv containing the targets. Here we train with classic cross-validation: the data is randomly split into 5 folds, 4 for training and 1 for validation.
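
The snippet below assumes that stock_id and train_data were prepared with the helpers from section 3.1. A minimal sketch for a single stock; stock 0 is a hypothetical choice, and window=100 is my assumption (6 batches × 49 aggregates per batch gives the 294 features above):

stock_id = 0  # hypothetical: the post trains on a few stocks, one at a time
train_data = get_train_data([stock_id], window=100)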

from sklearn.model_selection import KFold
import lightgbm

# root mean squared percentage error
def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(np.square((y_true - y_pred)/y_true)))
def feval_rmspe(y_pred, lgb_train):
    y_true = lgb_train.get_label()
    # the trailing False tells LightGBM that lower is better
    return 'RMSPE', rmspe(y_true, y_pred), False

result = pd.read_csv('data/train.csv')
# targets for the single stock that train_data was built from
selected_y = result[result.stock_id.isin([stock_id])].target.reset_index(drop=True)

kf = KFold(n_splits = 5, random_state = 2021, shuffle = True)
train_features = train_data.drop('time_id_', axis=1)
for train_index, test_index in kf.split(train_features):
    # print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = train_features.loc[train_index], train_features.loc[test_index]
    y_train, y_test = selected_y.loc[train_index], selected_y.loc[test_index]
    
    # weighting samples by 1/y^2 makes the default L2 objective behave like RMSPE
    train_dataset = lightgbm.Dataset(X_train,
                                     y_train,
                                     weight=1/np.square(y_train))
    validation_dataset = lightgbm.Dataset(X_test, y_test, weight=1/np.square(y_test))
    model = lightgbm.train(params={}, 
                           train_set = train_dataset, 
                           valid_sets = [train_dataset, validation_dataset],
                           early_stopping_rounds=50,
                           feval = feval_rmspe,
                           verbose_eval = 20,
                           num_boost_round=1000)
    # get prediction score
    y_pred = model.predict(X_test)
    print("RMSPE = ", rmspe(y_test, y_pred))
    lightgbm.plot_importance(model, max_num_features=20)

Note that time_id is not a feature; it has to be dropped here so it does not hurt training.

3.2.1 LightGBM steps

There are 3 steps:

  1. Build the datasets (train_dataset = lightgbm.Dataset(...))
  2. Train the model (model = lightgbm.train(...))
  3. Predict (model.predict(X_test))

Let's focus on the parameters of step 2, which is the most involved.

    model = lightgbm.train(params={}, 
                           train_set = train_dataset, 
                           valid_sets = [train_dataset, validation_dataset],
                           early_stopping_rounds=50,
                           feval = feval_rmspe,
                           verbose_eval = 20,
                           num_boost_round=1000)
  • params={} means we use the default parameters. For reasons of space, the meaning of the parameters and how to tune them are left to the next article.

  • valid_sets, early_stopping_rounds, feval: these three optional parameters work together to guard against overfitting. feval evaluates the model's accuracy on the validation set; if it fails to improve for early_stopping_rounds consecutive rounds, training stops early and the locally best model is returned (see the sketch after this list for the newer callbacks-based form).

  • num_boost_round caps the number of boosting iterations; each iteration grows one tree to fit the residuals. If early stopping kicks in, training does not run all the way to the cap.
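
Note that LightGBM >= 4.0 removed the early_stopping_rounds and verbose_eval arguments from lightgbm.train; a sketch of the equivalent call with callbacks (same datasets as above):

model = lightgbm.train(params={},
                       train_set=train_dataset,
                       valid_sets=[train_dataset, validation_dataset],
                       feval=feval_rmspe,
                       num_boost_round=1000,
                       callbacks=[lightgbm.early_stopping(stopping_rounds=50),
                                  lightgbm.log_evaluation(period=20)])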

3.2.2 Feature importance

This is another interesting concept. In our model, the feature importances come out as follows:

(feature importance plot)

3.2.2.1 How feature importance is computed

For tree methods (LightGBM, XGBoost, GBDT) the computation is similar: compare a node's "impurity" before and after it splits; the larger the drop in impurity, the better the split.

\text{importance} = \frac{ n }{ N } ( \rho - \frac{n_l}{n} \rho_l - \frac{n_r}{n} \rho_r )

where:
n - the number of samples at the node
N - the total number of samples
\rho - the impurity
and the subscripts l, r denote the left and right children after the split.

Its meaning is therefore plain: weight the impurity decrease achieved by the split by the fraction of all samples that reach the node; this gives the split's "contribution", i.e. its importance.
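
One caveat: lightgbm.plot_importance defaults to importance_type='split', i.e. the number of times a feature is used in a split, rather than the gain-based measure described above. To inspect the gain-based importance:

# plot gain-based importance instead of split counts
lightgbm.plot_importance(model, max_num_features=20, importance_type='gain')
# or read the raw per-feature gains from the booster
gains = model.feature_importance(importance_type='gain')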

3.2.3 Prediction results

We quite easily brought the prediction error (RMSPE) down from 0.34 to 0.22~0.25. Since my machine is underpowered, I only used data from a few randomly chosen stocks, but statistically the error should be similar to training on the full dataset. So while not rigorous, this is enough to show that LightGBM really is very effective.

Of course, this is still a weak score: the top entries in this competition pushed the error below 0.18. That is because we have not done any parameter tuning yet, and the features were picked off the cuff. The next article covers parameter tuning and improving feature quality in detail. The training log for the five folds of one stock follows.

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007232 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74970
[LightGBM] [Info] Number of data points in the train set: 3064, number of used features: 294
[LightGBM] [Info] Start training from score 0.003745
Training until validation scores don't improve for 50 rounds
[20]    training's RMSPE: 0.185026  valid_1's RMSPE: 0.249763
[40]    training's RMSPE: 0.141187  valid_1's RMSPE: 0.238492
[60]    training's RMSPE: 0.117718  valid_1's RMSPE: 0.240709
[80]    training's RMSPE: 0.100988  valid_1's RMSPE: 0.244922
Early stopping, best iteration is:
[34]    training's RMSPE: 0.150671  valid_1's RMSPE: 0.237703
RMSPE =  0.2377028864491903
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.009398 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74970
[LightGBM] [Info] Number of data points in the train set: 3064, number of used features: 294
[LightGBM] [Info] Start training from score 0.003736
Training until validation scores don't improve for 50 rounds
[20]    training's RMSPE: 0.182278  valid_1's RMSPE: 0.244471
[40]    training's RMSPE: 0.139054  valid_1's RMSPE: 0.242755
[60]    training's RMSPE: 0.116214  valid_1's RMSPE: 0.249089
Early stopping, best iteration is:
[26]    training's RMSPE: 0.164368  valid_1's RMSPE: 0.240414
RMSPE =  0.24041392797196037
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007414 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74968
[LightGBM] [Info] Number of data points in the train set: 3064, number of used features: 294
[LightGBM] [Info] Start training from score 0.003772
Training until validation scores don't improve for 50 rounds
[20]    training's RMSPE: 0.183314  valid_1's RMSPE: 0.263333
[40]    training's RMSPE: 0.140241  valid_1's RMSPE: 0.254926
[60]    training's RMSPE: 0.116984  valid_1's RMSPE: 0.256409
[80]    training's RMSPE: 0.0994774 valid_1's RMSPE: 0.261615
Early stopping, best iteration is:
[38]    training's RMSPE: 0.143255  valid_1's RMSPE: 0.253827
RMSPE =  0.2538268499911712
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007743 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74970
[LightGBM] [Info] Number of data points in the train set: 3064, number of used features: 294
[LightGBM] [Info] Start training from score 0.003678
Training until validation scores don't improve for 50 rounds
[20]    training's RMSPE: 0.185122  valid_1's RMSPE: 0.224511
[40]    training's RMSPE: 0.141241  valid_1's RMSPE: 0.220552
[60]    training's RMSPE: 0.11815   valid_1's RMSPE: 0.22418
[80]    training's RMSPE: 0.101414  valid_1's RMSPE: 0.227895
Early stopping, best iteration is:
[32]    training's RMSPE: 0.154159  valid_1's RMSPE: 0.219429
RMSPE =  0.2194288855505451
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004243 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74969
[LightGBM] [Info] Number of data points in the train set: 3064, number of used features: 294
[LightGBM] [Info] Start training from score 0.003703
Training until validation scores don't improve for 50 rounds
[20]    training's RMSPE: 0.184329  valid_1's RMSPE: 0.238253
[40]    training's RMSPE: 0.141179  valid_1's RMSPE: 0.232793
[60]    training's RMSPE: 0.118668  valid_1's RMSPE: 0.235608
Early stopping, best iteration is:
[28]    training's RMSPE: 0.161778  valid_1's RMSPE: 0.232168
RMSPE =  0.23216812618489302

4. LightGBM parameter tuning and feature optimization

See the next article:
LightGBM in Practice: Volatility Prediction (2)

5. Appendix

Useful tool functions, especially get_features. It uses pandas.pivot to create "flattened" features from chunks separated by time; for example, a column named vwap11_mean_0 holds the mean of vwap11 over the first time window of each bucket.

def calculate_realized_volatility(price):
    log_return = np.log(price).diff().dropna()
    return np.sqrt(np.sum(log_return**2))

def cut_by_time(df, window_seconds):
    # integer-divide seconds_in_bucket by the window length, per time_id,
    # so each 600-second bucket splits into 600 / window_seconds batches
    batch_id = df.groupby('time_id', group_keys=False).apply(
        lambda g: (g.seconds_in_bucket / window_seconds).astype(int))
    return batch_id

def get_features(df, feature_dict):
    features = df.groupby(['time_id', 'batch_id'], group_keys=False
                         ).agg(feature_dict).reset_index()
    
    # concat multi-level column
    features.columns = ['_'.join(col) for col in features.columns]
    # now each time_id has several rows of features, one per time window;
    # use pandas.pivot to flatten these rows into a single row per time_id
    flat_features = features.pivot(index='time_id_', columns='batch_id_')
    flat_features.columns = [f'{col[0]}_{col[1]}' for col in flat_features.columns]
    return flat_features.reset_index()

6. Reference

  1. Optiver Realized Volatility Prediction
  2. lightGBM Baseline for "Optiver Realized Volatility Prediction"