UniformML Paper3-MLR《Learning Piece-wise Linear Models from Large Scale Data for Ad Click》

1 Introduction

线性模型在实际应用中虽然高效，但是实际很多特征空间的分界面未必都是线性的，为了适应这样的场景，我们一般会通过两种方法：
1、复杂的特征工程（人工制造出非线性组合的特征）

To help LR model catch the nonlinearity, feature engineering technique is explored, which is both time and humanity consuming

2、通过模型组合（引入非线性模型）

Another direction, is to capture the nonlinearity with well-designed models. Facebook (He et al. 2014) uses a hybrid model which combines decision trees with logistic regression. Decision tree plays a nonlinear feature transformation role, whose output is fed to LR model. However, tree-based method is not suitable for very sparse and high dimensional data

除了树模型+LR，还有factorization machine (FM)，但是一般只能解决2阶问题，没办法解决更高阶的非线性组合。

为了解决上述问题，本文提出了一个新的模型Large Scale Piece-wise Linear Model(LS-PLM)

In this paper, we present a piece-wise linear model and its training algorithm for large scale data.

这是一个分而治之的策略算法（divide-and-conquer strategy）：先对空间进行划分出多个区域，然后对每个区域采用线性模型

first divides the feature space into several local regions, then fits a linear model in each region, resulting in the output with combinations of weighted linear predictions.

这里的非线性就看多个区域的划分机制，如果划分越细，那非线性的表征能力就越强。

在学习曲线导数的时候，我们知道，在一个极短的线段里，可以近似为直线。

本文提出的模型具有以下优点：

非线性
支持大规模并行训练
稀疏性

model sparsity is a practical issue for online serving in industrial setting. We show LS-PLM with L1 and L2,1 regularizer can achieve good sparsity.

我们先看个例子，本文提出的模型对非线性的拟合能力

Example

2 模型细节

2.1 模型结构

前面已经提到，我们为了表征非线性关系，会把空间进行划分，用多个线性模型表征非线性模型。模型的公式如下：
$p(y=1|x)=g(\sum_{j=1}^m\sigma{(u_j^Tx)}\eta(w_j^Tx))$
这里g里用了两个函数来分别表征分而治之(dividing and fitting)的概念。
这里模型的参数
$\theta={u_1, u_2, ...,u_m, w_1, w_2, ..., w_m}$ ，其中{u1, u2, ..., um}参数是dividing function
$\sigma(.)$ 的；{w1, w2, ..., wm}参数是fitting function $\eta(.)$ 的。
当给定自变量x，我们的预测模型包含两部分：第一部分是dividing函数，负责把特征空间划分成m个区域；第二部分fitting函数是给出各个区域空间的概率预测。外层的g()保证我们的输出结果满足概率定义。
本文对上述三个函数的定义如下：

define of function

个人理解：感觉这里借鉴了non-parametric algorithm，然后和参数学习组合，比如算法局部加权线性回归算法，但是局部加权线性回归算法的在线计算复杂度会随着样本数量增加而增加，这了用了m固定区域来限制，保证了在线计算复杂度，而且也在一段区域内进行参数化
如果我们再对这个函数拆成两步，第一步是先通过m个logistic regression计算，第二步是根据第一步的计算结果再进行一次softmax。这个过程是不是和一个标准的2层神经网络一样？

本文还指出该模型和如下模型是如出一辙的形式。

image

有了模型定义，下面我们来说下损失函数的定义：

image

这里loss同样采用 经验误差+结构误差的模式，经验误差同LR采用的是交叉熵，模型结构误差采用的是L1和L2组合

$||\theta||_{2,1}=\sum_{i=1}^{d}\sqrt{\sum_{j=1}^{2m}\theta_{i,j}^2}$
$||\theta||_1=\sum_{ij}|\theta_{i,j}|$

2.2 模型优化求解方法

正是由于引入了模型结构误差，导致模型非凸非平滑

However, both L1 norm and L2,1 norm are non-smooth functions. This causes the objective function of Eq.(4) to be non-convex and non-smooth, making it difficult to employ those traditional gradient-descent optimization methods

作者为了他解决这个问题，提出新的求解方法(根据梯度方向来更新)

2.3 trick

本文模型在训练的时候采用了一个common feature trick，我们在广告曝光的时候，对于同一个用户在同一个页面展示不同的广告，这里曝光的每条训练数据集的user、context的feature都是一致的，只有item的feature不一致，所以在悬链的时候可以进行预计算公用特征，然后在计算不一致特征即可，能够高效提升计算效率。当然对训练数据集分组的时候就有要求，尽可能把带有公共特征的数据放在一台机器上。
如下，我们把feature空间分乘两部分： xc表示common feature， xnc表示非公共特征

feature重写

3 实验

从试验结果来看，模型超参数m如果越多能够提升模型的效果，但是对训练复杂度也会加大

image