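Common Pitfalls in Machine Learning

Source: http://blog.sina.com.cn/s/blog_5357c0af0102uxoh.html

The pitfalls listed below, which come up when applying machine learning algorithms in practice, cleared up a lot of my confusion. I recommend giving them a read: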

Machine Learning Done Wrong

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one that best suits the data. In this post, I would like to share some common mistakes (the don't-s). I’ll save some of the best practices (the do-s) for a future post.

1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, an off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare class (e.g., through up/down sampling).
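To make the gap between a default metric and the business objective concrete, here is a minimal sketch. The function names, the flat `review_cost` for a false alarm, and the toy transactions are all assumptions made up for illustration; they are not from the original post.

```python
# Hypothetical illustration: every name and number here is an assumption.

def fraud_cost(y_true, y_pred, amounts, review_cost=5.0):
    """Business cost of a set of fraud predictions.

    A missed fraud (false negative) costs the full transaction amount;
    a false alarm (false positive) costs a flat, assumed review fee.
    """
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 1 and pred == 0:      # false negative: fraud slips through
            cost += amount
        elif truth == 0 and pred == 1:    # false positive: needless manual review
            cost += review_cost
    return cost


def error_count(y_true, y_pred):
    """Plain misclassification count, which treats every error equally."""
    return sum(t != p for t, p in zip(y_true, y_pred))


y_true = [1, 1, 0, 0]                    # 1 = fraudulent transaction
amounts = [2000.0, 20.0, 50.0, 30.0]     # dollar amount of each transaction

model_a = [0, 1, 0, 0]                   # misses the $2000 fraud
model_b = [1, 0, 0, 0]                   # misses the $20 fraud

# Both models make exactly one mistake ...
print(error_count(y_true, model_a), error_count(y_true, model_b))
# ... but the dollar-weighted cost separates them clearly.
print(fraud_cost(y_true, model_a, amounts), fraud_cost(y_true, model_b, amounts))
```

Selecting models by a cost function of this kind, rather than by raw error count, is one simple way to bring model selection closer to the business objective.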

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to logistic regression because it’s simple. But many also forget that logistic regression is a linear model and that non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models, such as kernel SVMs or tree-based classifiers, which bake in higher-order interaction features.
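The following sketch shows what "encoded manually" can mean. It assumes a made-up toy problem with two binary flags whose label depends on them jointly (an XOR pattern), and uses a small hand-written logistic regression; the helper names `train_logistic` and `accuracy`, the step count, and the learning rate are illustrative choices, not anything prescribed by the post.

```python
import math

def train_logistic(rows, labels, steps=5000, lr=1.0):
    """Batch gradient descent for logistic regression; the bias is the last weight."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            residual = 1.0 / (1.0 + math.exp(-z)) - y   # predicted prob minus label
            for j, xj in enumerate(x):
                grad[j] += residual * xj
            grad[-1] += residual
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def accuracy(w, rows, labels):
    """Fraction of rows whose predicted class (score > 0) matches the label."""
    hits = 0
    for x, y in zip(rows, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
        hits += int((z > 0) == (y == 1))
    return hits / len(rows)

# Assumed toy data: two binary flags, label is 1 when exactly one flag is set.
flags = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

# Manually encoded interaction: append the product of the two flags.
with_interaction = [(a, b, a * b) for a, b in flags]

acc_plain = accuracy(train_logistic(flags, labels), flags, labels)
acc_inter = accuracy(train_logistic(with_interaction, labels), with_interaction, labels)
print(acc_plain, acc_inter)
```

No linear boundary over the two raw flags can separate this pattern, so the plain model cannot classify all four points; once the product feature is added, the same linear model can.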

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weights on them, while a decision tree might simply count each outlier as one false classification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm that is robust against outliers or filter the outliers out.
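As one possible way to filter outliers before modeling, the sketch below uses the modified z-score based on the median absolute deviation (MAD). This particular rule, the cutoff of 3.5, and the revenue numbers are assumptions chosen for illustration; the post does not prescribe a specific filter.

```python
import statistics

def filter_outliers(values, threshold=3.5):
    """Keep points whose MAD-based modified z-score is within the threshold.

    The constant 0.6745 rescales the MAD so that the score is comparable
    to an ordinary z-score when the data are roughly normal.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:                 # all points (nearly) identical: nothing to filter
        return list(values)
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]

# Assumed daily revenue figures with one recording glitch.
daily_revenue = [100, 102, 98, 101, 99, 103, 97, 5000]

cleaned = filter_outliers(daily_revenue)
print(statistics.mean(daily_revenue), statistics.mean(cleaned))
```

The median and MAD are used instead of the mean and standard deviation because the latter are themselves pulled around by the very outliers being tested. Whether a flagged point should be dropped or investigated still depends on context, as discussed above.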

4. Use high variance model when n<<p

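SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (far fewer samples than features), which is common in fields such as medical data, the richer feature space implies a much higher risk of overfitting the data. In fact, high variance models should be avoided entirely when n<<p.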
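A small sketch of why having far fewer samples than features invites overfitting: with 8 samples and 50 random features, even a plain linear perceptron can fit labels that are pure coin flips, and a kernel would only enlarge the feature space further. The sizes, the seed, and the helper names are assumptions for illustration.

```python
import random

def train_perceptron(rows, labels, max_epochs=1000):
    """Classic perceptron for -1/+1 labels; stops once no training point is misclassified."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(rows, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # nudge toward the missed point
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def perceptron_accuracy(w, b, rows, labels):
    """Fraction of rows classified with the correct sign."""
    hits = sum(1 for x, y in zip(rows, labels)
               if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0)
    return hits / len(rows)

rng = random.Random(0)
n, p = 8, 50                                   # far fewer samples than features

# Features and labels are independent noise, so there is nothing real to learn.
rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
labels = [rng.choice([-1, 1]) for _ in range(n)]

w, b = train_perceptron(rows, labels)
train_accuracy = perceptron_accuracy(w, b, rows, labels)
print(train_accuracy)
```

A perfect training score on random labels says nothing about future data; it only shows how much freedom the model has when features greatly outnumber samples.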

5. L1/L2/... regularization without standardization

Applying L1 or L2 to penalize large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying such regularization.

Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the transaction amount is measured in dollars, the fitted coefficient will be around 100 times larger than it would be if the amount were measured in cents. With regularization, because L1 and L2 penalize larger coefficients more, the transaction amount will be penalized more when the unit is dollars. Hence, the regularization is biased and tends to penalize features with smaller numeric scales. To mitigate the problem, standardize all the features as a preprocessing step to put them on an equal footing.
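The unit dependence can be checked with the closed-form ridge (L2) solution for a single feature with no intercept, where the slope is sum(x*y) / (sum(x^2) + lambda). The sketch below is an illustration under assumed toy numbers and an assumed penalty `lam`; the helper names are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge slope for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def standardize(x):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(x) / len(x)
    std = (sum((a - mean) ** 2 for a in x) / len(x)) ** 0.5
    return [(a - mean) / std for a in x]

dollars = [1.0, 2.0, 3.0, 4.0]
cents = [100.0 * d for d in dollars]     # same amounts, different unit
y = [2.0, 4.0, 6.0, 8.0]
lam = 10.0                               # assumed penalty strength

# Shrinkage = penalized slope / unpenalized slope; 1.0 means no shrinkage.
shrink_dollars = ridge_slope(dollars, y, lam) / ridge_slope(dollars, y, 0.0)
shrink_cents = ridge_slope(cents, y, lam) / ridge_slope(cents, y, 0.0)
print(shrink_dollars, shrink_cents)

# After standardization the unit no longer matters.
z_dollars = standardize(dollars)
z_cents = standardize(cents)
print(max(abs(a - b) for a, b in zip(z_dollars, z_cents)))
```

With the same penalty, the dollar-denominated feature is shrunk far more than the cent-denominated one, even though they carry identical information; standardizing first removes that arbitrary dependence on units.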

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2 and suppose the ground truth model is Y=X1+X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if X1 and X2 are collinear (say X1=X2), then as far as most optimization algorithms are concerned, Y=2*X1, Y=3*X1-X2 or Y=100*X1-98*X2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation. However, it does make the problem ill-conditioned and makes the coefficient weights uninterpretable.
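A quick numeric check of this point, assuming exact collinearity (X2 is a copy of X1): any coefficient pair that sums to 2 then reproduces Y = X1 + X2 perfectly, so the data cannot tell such pairs apart. The toy values and the pairs tried are chosen for illustration.

```python
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = list(x1)                                  # perfectly collinear: X2 equals X1
y = [a + b for a, b in zip(x1, x2)]            # ground truth Y = X1 + X2

# Candidate (coefficient on X1, coefficient on X2) pairs; each pair sums to 2.
candidates = {
    "Y=X1+X2": (1, 1),
    "Y=2*X1": (2, 0),
    "Y=3*X1-X2": (3, -1),
    "Y=100*X1-98*X2": (100, -98),
}

def sse(coefs):
    """Sum of squared errors of a candidate coefficient pair on the toy data."""
    c1, c2 = coefs
    return sum((c1 * a + c2 * b - t) ** 2 for a, b, t in zip(x1, x2, y))

errors = {name: sse(coefs) for name, coefs in candidates.items()}
for name, err in errors.items():
    print(name, err)
```

Every candidate fits the data exactly, which is why the individual coefficients carry no meaning in this setting even though the predictions are fine.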

7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature is. This is rarely true because (a) changing the scale of the variable changes the absolute value of the coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are multi-collinear and the less reliable it is to interpret feature importance from coefficients.
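Point (a) can be seen with ordinary least squares on one feature: re-expressing the same amounts in cents instead of dollars divides the slope by 100 while leaving every prediction unchanged. The data and the helper name `ols_fit` below are assumptions for illustration.

```python
def ols_fit(x, y):
    """Least-squares slope and intercept for a single feature."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

amount_dollars = [10.0, 20.0, 30.0, 40.0]
amount_cents = [100.0 * a for a in amount_dollars]   # same feature, different unit
y = [1.0, 2.0, 2.0, 4.0]

slope_d, icpt_d = ols_fit(amount_dollars, y)
slope_c, icpt_c = ols_fit(amount_cents, y)

pred_d = [slope_d * a + icpt_d for a in amount_dollars]
pred_c = [slope_c * a + icpt_c for a in amount_cents]

print(slope_d, slope_c)                                  # magnitudes differ by 100x
print(max(abs(p - q) for p, q in zip(pred_d, pred_c)))   # predictions agree
```

Since the feature is equally useful in either unit, the raw coefficient magnitude cannot by itself be a measure of importance.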

So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.

Original article: http://ml.posthaven.com/machine-learning-done-wrong

==================================================

Chinese translation:

机器学习实践中的7种常见错误

统计建模非常像工程学。

在工程学中，有多种构建键-值存储系统的方式，每个设计都会构造一组不同的关于使用模式的假设集合。在统计建模中，有很多分类器构建算法，每个算法构造一组不同的关于数据的假设集合。

当处理少量数据时，尝试尽可能多的算法，然后挑选最好的一个的做法是比较合理的，因为此时实验成本很低。但当遇到“大数据”时，提前分析数据，然后设计相应“管道”模型（预处理，建模，优化算法，评价，产品化）是值得的。

正如我之前文章中所指出的，有很多种方法来解决一个给定建模问题。每个模型做出不同假设，如何导引和确定哪些假设合理的方法并不明确。在业界，大多数实践者是挑选他们更熟悉而不是最合适的建模算法。在本文中，我想分享一些常见错误（不能做的），并留一些最佳实践方法（应该做的）在未来一篇文章中介绍。

1. 想当然地使用缺省损失函数

许多实践者使用缺省损失函数(如，均方误差)训练和挑选最好的模型。实际上，现有损失函数很少符合业务目标。以欺诈检测为例，当试图检测欺诈性交易时，业务目标是最小化欺诈损失。现有二元分类器损失函数为误报率和漏报率分配相等权重，为了符合业务目标，损失函数惩罚漏报不仅要多于惩罚误报，而且要与金额数量成比例地惩罚每个漏报数据。此外，欺诈检测数据集通常含有高度不平衡的标签。在这些情况下，偏置损失函数能够支持罕见情况（如，通过上、下采样）。

2．非线性情况下使用简单线性模型

当构建一个二元分类器时，很多实践者会立即跳转到逻辑回归，因为它很简单。但是，很多人也忘记了逻辑回归是一种线性模型，预测变量间的非线性交互需要手动编码。回到欺诈检测问题，要获得好的模型性能，像“billing address = shipping address and transaction amount < $50”这种高阶交互特征是必须的。因此，每个人都应该选择适合高阶交互特征的带核SVM或基于树的分类器。

3．忘记异常值

异常值非常有趣，根据上下文环境，你可以特殊关注或者完全忽略它们。以收入预测为例，如果观察到不同寻常的峰值收入，给予它们额外关注并找出其原因可能是个好主意。但是如果异常是由于机械误差，测量误差或任何其它不可归纳的原因造成的，那么在将数据输入到建模算法之前忽略掉这些异常值是个不错的选择。

相比于其它模型，有些模型对异常值更为敏感。比如，当决策树算法简单地将每个异常值计为一次误分类时，AdaBoost算法会将那些异常值视为“硬”实例，并为异常值分配极大权值。如果一个数据集含有相当数量的异常值，那么，使用一种具有异常值鲁棒性的建模算法或直接过滤掉异常值是非常重要的。

4．样本数少于特征数（n<<P)时使用高方差模型

SVM是现有建模算法中最受欢迎算法之一，它最强大的特性之一是，用不同核函数去拟合模型的能力。SVM核函数可被看作是一种自动结合现有特征，从而形成一个高维特征空间的方式。由于获得这一强大特性不需任何代价，所以大多数实践者会在训练SVM模型时默认使用核函数。然而，当数据样本数远远少于特征数（n<<P)—业界常见情况如医学数据—时,高维特征空间意味着更高的数据过拟合风险。事实上，当样本数远小于特征数时，应该彻底避免使用高方差模型。

5．尚未标准化就进行L1/L2/等正则化

使用L1或L2去惩罚大系数是一种正则化线性或逻辑回归模型的常见方式。然而，很多实践者并没有意识到进行正则化之前标准化特征的重要性。

回到欺诈检测问题，设想一个具有交易金额特征的线性回归模型。不进行正则化，如果交易金额的单位为美元，拟合系数将是以美分为单位时的100倍左右。进行正则化，由于L1/L2更大程度上惩罚较大系数，如果单位为美元，那么交易金额将受到更多惩罚。因此，正则化是有偏的，并且趋向于在更小尺度上惩罚特征。为了缓解这个问题，标准化所有特征并将它们置于平等地位，作为一个预处理步骤。

6．不考虑线性相关直接使用线性模型

设想建立一个具有两变量X1和X2的线性模型，假设真实模型是Y=X1+X2。理想地，如果观测数据含有少量噪声，线性回归解决方案将会恢复真实模型。然而，如果X1和X2线性相关（大多数优化算法所关心的），Y=2*X1, Y=3*X1-X2或Y=100*X1-99*X2都一样好，这一问题可能并无不妥，因为它是无偏估计。然而，它却会使问题变得病态，使系数权重变得无法解释。

7. 将线性或逻辑回归模型的系数绝对值解释为特征重要性

因为很多现有线性回归量为每个系数返回P值，对于线性模型，许多实践者认为，系数绝对值越大，其对应特征越重要。事实很少如此，因为：(a)改变变量尺度就会改变系数绝对值；(b)如果特征是线性相关的，则系数可以从一个特征转移到另一个特征。此外，数据集特征越多，特征间越可能线性相关，用系数解释特征重要性就越不可靠。

这下你就知道了机器学习实践中的七种常见错误。这份清单并不详尽，它只不过是引发读者去考虑，建模假设可能并不适用于手头数据。为了获得最好的模型性能，挑选做出最合适假设的建模算法—而不只是选择你最熟悉那个算法，是很重要的。

Last edited: 2017.12.05 06:40:25
