Tune Machine Learning Algorithms in R (random forest case study)

It is difficult to find a good machine learning algorithm for your problem. But once you do, how do you get the best performance out of it?

In this post you will discover three ways that you can tune the parameters of a machine learning algorithm in R.

Walk through a real example step-by-step with working code in R. Use the code as a template to tune machine learning algorithms on your current or next machine learning project.

Tune Random Forest in R. Photo by Susanne Nilsson, some rights reserved.

Get Better Accuracy From Top Algorithms

It is difficult to find a good or even a well-performing machine learning algorithm for your dataset.

Through a process of trial and error you can settle on a short list of algorithms that show promise, but how do you know which one is the best?

You could use the default parameters for each algorithm. These are the parameters set by rules of thumb or suggestions in books and research papers. But how do you know the algorithms that you are settling on are showing their best performance?

Use Algorithm Tuning To Search For Algorithm Parameters

The answer is to search for good or even best combinations of algorithm parameters for your problem.

You need a process to tune each machine learning algorithm to know that you are getting the most out of it. Once tuned, you can make an objective comparison between the algorithms on your shortlist.

Searching for algorithm parameters can be difficult because there are many options, such as:

What parameters to tune?

What search method to use to locate good algorithm parameters?

What test options to use to limit overfitting the training data?


Tune Machine Learning Algorithms in R

You can tune your machine learning algorithm parameters in R.

Generally, the approaches in this section assume that you already have a short list of well-performing machine learning algorithms for your problem from which you are looking to get better performance.

An excellent way to create your shortlist of well-performing algorithms is to use the caret package.

For more on how to use the caret package, see:

Caret R Package for Applied Predictive Modeling

In this section we will look at three methods that you can use in R to tune algorithm parameters:

Using the caret R package.

Using tools that come with the algorithm.

Designing your own parameter search.

Before we start tuning, let's set up our environment and test data.

Test Setup

Let’s take a quick look at the data and the algorithm we will use in this case study.

Test Dataset

In this case study, we will use the sonar test problem.

This is a dataset from the UCI Machine Learning Repository that describes sonar returns as either bouncing off metal or rock.

It is a binary classification problem with 60 numerical input features that describe the properties of the sonar return. You can learn more about this problem here: Sonar Dataset. You can see world-class published results for this dataset here: Accuracy on the Sonar Dataset.

This is not a particularly difficult dataset, but it is non-trivial and interesting for this example.

Let's load the required libraries and load the dataset from the mlbench package.

library(randomForest)
library(mlbench)
library(caret)

# Load Dataset
data(Sonar)
dataset <- Sonar
x <- dataset[, 1:60]
y <- dataset[, 61]

Test Algorithm

We will use the popular Random Forest algorithm as the subject of our algorithm tuning.

Random Forest is not necessarily the best algorithm for this dataset, but it is a very popular algorithm and no doubt you will find tuning it a useful exercise in your own machine learning work.

When tuning an algorithm, it is important to have a good understanding of your algorithm so that you know what effect the parameters have on the model you are creating.

In this case study, we will stick to tuning two parameters, namely the mtry and ntree parameters, which have the following effects on our random forest model. There are many other parameters, but these two are perhaps the most likely to have the biggest effect on your final accuracy.

Direct from the help page for the randomForest() function in R:

mtry: Number of variables randomly sampled as candidates at each split.

ntree: Number of trees to grow.

Let's create a baseline for comparison by using the recommended defaults for each parameter: mtry=floor(sqrt(ncol(x))) or mtry=7, and ntree=500.

# Create model with default parameters
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed <- 7
metric <- "Accuracy"
set.seed(seed)
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry = mtry)
rf_default <- train(Class ~ ., data = dataset, method = "rf", metric = metric, tuneGrid = tunegrid, trControl = control)
print(rf_default)

We can see our estimated accuracy is 81.3%.

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.8138384  0.6209924  0.0747572    0.1569159

1. Tune Using Caret

The caret package in R provides an excellent facility to tune machine learning algorithm parameters.

Not all machine learning algorithms are available in caret for tuning. The choice of parameters is left to the developers of the package, namely Max Kuhn. Only those algorithm parameters that have a large effect (i.e. that really require tuning, in Kuhn's opinion) are available for tuning in caret.

As such, only the mtry parameter is available in caret for tuning. The reason is its effect on the final accuracy and the fact that it must be found empirically for a dataset.

The ntree parameter is different in that it can be as large as you like, and accuracy continues to increase up to some point as more trees are added. It is less difficult or critical to tune, and it is limited more by the compute time available than anything else.
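Because ntree is not exposed for tuning in caret, a quick way to get a feel for a sensible value is to fit a single forest with a generous number of trees and inspect how the out-of-bag (OOB) error behaves as trees are added. This is a minimal sketch that reuses the x, y and seed objects from the Test Setup above; for a randomForest object, plot() draws the OOB error against the number of trees, so you can see where the curve flattens out.

# Sanity check: OOB error versus number of trees (uses x, y and seed from above)
set.seed(seed)
rf_oob <- randomForest(x, y, ntree = 2500)
# plot() on a randomForest object shows error rates against the number of trees
plot(rf_oob)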

Random Search

One search strategy that we can use is to try random values within a range.

This can be good if we are unsure of what the value might be and we want to overcome any biases we may have for setting the parameter (like the suggested equation above).

Let's try a random search for mtry using caret:

# Random Search
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
set.seed(seed)
mtry <- sqrt(ncol(x))
rf_random <- train(Class ~ ., data = dataset, method = "rf", metric = metric, tuneLength = 15, trControl = control)
print(rf_random)
plot(rf_random)

Note that we are using a test harness similar to the one we would use to spot-check algorithms. Using 10-fold cross-validation with 3 repeats slows down the search process, but it is intended to limit and reduce overfitting on the training set. It won't remove overfitting entirely. Holding back a validation set for final checking is a great idea if you can spare the data (a minimal sketch of doing this follows the results below).

Resampling results across tuning parameters:

mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
11    0.8218470  0.6365181  0.09124610   0.1906693
14    0.8140620  0.6215867  0.08475785   0.1750848
17    0.8030231  0.5990734  0.09595988   0.1986971
24    0.8042929  0.6002362  0.09847815   0.2053314
30    0.7933333  0.5798250  0.09110171   0.1879681
34    0.8015873  0.5970248  0.07931664   0.1621170
45    0.7932612  0.5796828  0.09195386   0.1887363
47    0.7903896  0.5738230  0.10325010   0.2123314
49    0.7867532  0.5673879  0.09256912   0.1899197
50    0.7775397  0.5483207  0.10118502   0.2063198
60    0.7790476  0.5513705  0.09810647   0.2005012

We can see that the most accurate value for mtry was 11 with an accuracy of 82.1%.

Tune Random Forest Parameters in R Using Random Search
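As noted above, holding back a validation set for a final check is a good idea if you can spare the data. A minimal sketch of how you might do this with caret's createDataPartition() function is shown below; the 80/20 split proportion is an arbitrary choice, and you would run the tuning on the training portion only before confirming the chosen model on the hold-out.

# Hold back a validation set for a final check (80/20 split is an arbitrary choice)
set.seed(seed)
validation_index <- createDataPartition(dataset$Class, p = 0.80, list = FALSE)
training <- dataset[validation_index, ]
validation <- dataset[-validation_index, ]
# Tune on 'training' (e.g. pass data=training to train() above), then evaluate the
# final model on 'validation' with predict() and confusionMatrix()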

Grid Search

Another search strategy is to define a grid of algorithm parameters to try.

Each axis of the grid is an algorithm parameter, and points in the grid are specific combinations of parameters. Because we are only tuning one parameter, the grid search is a linear search through a vector of candidate values.

# Grid Search
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "grid")
set.seed(seed)
tunegrid <- expand.grid(.mtry = c(1:15))
rf_gridsearch <- train(Class ~ ., data = dataset, method = "rf", metric = metric, tuneGrid = tunegrid, trControl = control)
print(rf_gridsearch)
plot(rf_gridsearch)

We can see that the most accurate value for mtry was 2 with an accuracy of 83.78%.

Resampling results across tuning parameters:

mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
 1    0.8377273  0.6688712  0.07154794   0.1507990
 2    0.8378932  0.6693593  0.07185686   0.1513988
 3    0.8314502  0.6564856  0.08191277   0.1700197
 4    0.8249567  0.6435956  0.07653933   0.1590840
 5    0.8268470  0.6472114  0.06787878   0.1418983
 6    0.8298701  0.6537667  0.07968069   0.1654484
 7    0.8282035  0.6493708  0.07492042   0.1584772
 8    0.8232828  0.6396484  0.07468091   0.1571185
 9    0.8268398  0.6476575  0.07355522   0.1529670
10    0.8204906  0.6346991  0.08499469   0.1756645
11    0.8073304  0.6071477  0.09882638   0.2055589
12    0.8184488  0.6299098  0.09038264   0.1884499
13    0.8093795  0.6119327  0.08788302   0.1821910
14    0.8186797  0.6304113  0.08178957   0.1715189
15    0.8168615  0.6265481  0.10074984   0.2091663

Tune Random Forest Parameters in R Using Grid Search
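Rather than reading the winning value off the printed table, you can also pull it out of the train object itself. The following is a small sketch, assuming the rf_gridsearch object created above; caret stores the selected parameter combination in bestTune and the full resampling summary in results.

# Best parameter combination and full resampling results from the grid search
print(rf_gridsearch$bestTune)
head(rf_gridsearch$results)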

2. Tune Using Algorithm Tools

Some algorithms provide tools for tuning the parameters of the algorithm.

For example, the random forest algorithm implementation in the randomForest package provides the tuneRF() function that searches for optimal mtry values given your data.

# Algorithm Tune (tuneRF)
set.seed(seed)
bestmtry <- tuneRF(x, y, stepFactor = 1.5, improve = 1e-5, ntree = 500)
print(bestmtry)

You can see that the most accurate value for mtry was 10 with an OOBError of 0.1442308.

This does not really match up with what we saw in the caret repeated cross-validation experiment above, where mtry=10 gave an accuracy of 82.04%. Nevertheless, it is an alternative way to tune the algorithm.

        mtry  OOBError
5.OOB      5 0.1538462
7.OOB      7 0.1538462
10.OOB    10 0.1442308
15.OOB    15 0.1682692

Tune Random Forest Parameters in R using tuneRF
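With the default doBest = FALSE, tuneRF() returns a matrix of the mtry values it tried along with their OOB errors, so you can select the best value programmatically rather than reading it off the printout. A minimal sketch, assuming the bestmtry matrix from above:

# Pick the mtry value with the lowest OOB error from the tuneRF() result
best_mtry <- bestmtry[which.min(bestmtry[, "OOBError"]), "mtry"]
print(best_mtry)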

3. Craft Your Own Parameter Search

Often you want to search both for the parameters that must be tuned (handled by caret) and for those that need to be scaled or adapted more generally for your dataset.

You have to craft your own parameter search.

Two popular options that I recommend are:

Tune Manually: Write R code to create lots of models and compare their accuracy using caret.

Extend Caret: Create an extension to caret that adds in additional parameters to caret for the algorithm you want to tune.

Tune Manually

We want to keep using caret because it provides a direct point of comparison to our previous models (apples to apples, even the same data splits) and because of the repeated cross-validation test harness that we like, as it reduces the severity of overfitting.

One approach is to create many caret models for our algorithm and pass different parameter values directly to the algorithm manually. Let's look at an example of doing this to evaluate different values for ntree while holding mtry constant.

# Manual Search
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "grid")
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(x))))
modellist <- list()
for (ntree in c(1000, 1500, 2000, 2500)) {
  set.seed(seed)
  fit <- train(Class ~ ., data = dataset, method = "rf", metric = metric, tuneGrid = tunegrid, trControl = control, ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}
# compare results
results <- resamples(modellist)
summary(results)
dotplot(results)

You can see that the most accurate value for ntree was perhaps 2000, with a mean accuracy of 82.02% (a lift over our very first experiment using the default mtry value).

The results perhaps suggest an optimal value for ntree between 2000 and 2500. Also note that we held mtry constant at the default value. We could repeat the experiment with the possibly better mtry=2 from the experiment above, or try combinations of ntree and mtry in case they have interaction effects (a sketch of such a combined manual search follows the results below).

Models: 1000, 1500, 2000, 2500
Number of resamples: 30

Accuracy
      Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
1000 0.600  0.8024 0.8500 0.8186  0.8571 0.9048    0
1500 0.600  0.8024 0.8095 0.8169  0.8571 0.9500    0
2000 0.619  0.8024 0.8095 0.8202  0.8620 0.9048    0
2500 0.619  0.8000 0.8095 0.8201  0.8893 0.9091    0

Tune Random Forest Parameters in R Manually
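As mentioned above, you could also try combinations of mtry and ntree manually in case they interact. One way to do that while staying inside the manual approach is to nest the loop over ntree inside a loop over a few candidate mtry values. The sketch below is illustrative only and the candidate values are arbitrary; every extra combination multiplies the number of models trained, so keep both lists short.

# Manual search over combinations of mtry and ntree (sketch only, arbitrary candidates)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "grid")
modellist <- list()
for (mtry in c(2, 7)) {
  for (ntree in c(1000, 2000)) {
    set.seed(seed)
    tunegrid <- expand.grid(.mtry = mtry)
    fit <- train(Class ~ ., data = dataset, method = "rf", metric = metric, tuneGrid = tunegrid, trControl = control, ntree = ntree)
    modellist[[paste(mtry, ntree, sep = "-")]] <- fit
  }
}
# compare results across all mtry/ntree combinations
results <- resamples(modellist)
summary(results)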

Extend Caret

Another approach is to create a “new” algorithm for caret to support.

This is the same random forest algorithm you are using, only modified so that it supports the tuning of multiple parameters.

A risk with this approach is that caret's native support for the algorithm has additional or fancy code wrapping it that subtly but importantly changes its behavior. You may need to repeat prior experiments with your custom algorithm support.

We can define our own algorithm to use in caret by defining a list that contains a number of custom named elements that the caret package looks for, such as how to fit and how to predict. See below for a definition of a custom random forest algorithm for use with caret that takes both mtry and ntree parameters.

customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[, 1]), ]
customRF$levels <- function(x) x$classes

Now, let’s make use of this custom list in our call to the caret train function, and try tuning different values for ntree and mtry.

# train model
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(1000, 1500, 2000, 2500))
set.seed(seed)
custom <- train(Class ~ ., data = dataset, method = customRF, metric = metric, tuneGrid = tunegrid, trControl = control)
summary(custom)
plot(custom)

This may take a minute or two to run.
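If the combined search is slow on your machine, note that caret can run the resampling in parallel when a parallel backend is registered. A minimal sketch using the doParallel package is shown below; the worker count of 4 is an arbitrary choice. Register the cluster before calling train() and stop it when you are done.

# Optional: register a parallel backend so train() runs resamples in parallel
library(doParallel)
cl <- makePSOCKcluster(4)  # 4 workers is an arbitrary choice
registerDoParallel(cl)
# ... run the train() call above ...
stopCluster(cl)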

You can see that the most accurate values for ntree and mtry were 2000 and 2, with an accuracy of 84.43%.

We do perhaps see some interaction effects between the number of trees and the value of mtry. Nevertheless, if we had chosen the best value for mtry found using the grid search above (2) and the best value of ntree found using the manual search above (2000), in this case we would have achieved the same level of tuning found in this combined search. This is a nice confirmation.

mtry  ntree  Accuracy   Kappa      Accuracy SD  Kappa SD

1    1000   0.8442424  0.6828299  0.06505226   0.1352640

1    1500   0.8394805  0.6730868  0.05797828   0.1215990

1    2000   0.8314646  0.6564643  0.06630279   0.1381197

1    2500   0.8379654  0.6693773  0.06576468   0.1375408

2    1000   0.8313781  0.6562819  0.06909608   0.1436961

2    1500   0.8427345  0.6793793  0.07005975   0.1451269

2    2000   0.8443218  0.6830115  0.06754346   0.1403497

2    2500   0.8428066  0.6791639  0.06488132   0.1361329

3    1000   0.8350216  0.6637523  0.06530816   0.1362839

3    1500   0.8347908  0.6633405  0.06836512   0.1418106

3    2000   0.8428066  0.6800703  0.06643838   0.1382763

3    2500   0.8365296  0.6668480  0.06401429   0.1336583

4    1000   0.8316955  0.6574476  0.06292132   0.1317857

4    1500   0.8331241  0.6605244  0.07543919   0.1563171

4    2000   0.8378860  0.6699428  0.07147459   0.1488322

4    2500   0.8315368  0.6568128  0.06981259   0.1450390

5    1000   0.8284343  0.6505097  0.07278539   0.1516109

5    1500   0.8283622  0.6506604  0.07166975   0.1488037

5    2000   0.8219336  0.6375155  0.07548501   0.1564718

5    2500   0.8315440  0.6570792  0.07067743   0.1472716

6    1000   0.8203391  0.6341073  0.08076304   0.1689558

6    1500   0.8186797  0.6302188  0.07559694   0.1588256

6    2000   0.8187590  0.6310555  0.07081621   0.1468780

6    2500   0.8153463  0.6230495  0.07728249   0.1623253

7    1000   0.8217027  0.6367189  0.07649651   0.1606837

7    1500   0.8282828  0.6503808  0.06628953   0.1381925

7    2000   0.8108081  0.6147563  0.07605609   0.1573067

7    2500   0.8250361  0.6437397  0.07737756   0.1602434

8    1000   0.8187590  0.6314307  0.08378631   0.1722251

8    1500   0.8201876  0.6335679  0.07380001   0.1551340

8    2000   0.8266883  0.6472907  0.06965118   0.1450607

8    2500   0.8251082  0.6434251  0.07745300   0.1628087

9    1000   0.8121717  0.6177751  0.08218598   0.1709987

9    1500   0.8184488  0.6300547  0.08077766   0.1674261

9    2000   0.8247980  0.6429315  0.07260439   0.1513512

9    2500   0.8186003  0.6302674  0.07356916   0.1547231

10    1000   0.8235209  0.6407121  0.07991334   0.1656978

10    1500   0.8125541  0.6183581  0.06851683   0.1421993

10    2000   0.8187518  0.6308120  0.08538951   0.1782368

10    2500   0.8169336  0.6263682  0.07847066   0.1649216

11    1000   0.8203463  0.6341158  0.07222587   0.1497558

11    1500   0.8153463  0.6235878  0.09131621   0.1904418

11    2000   0.8234416  0.6402906  0.07586609   0.1576765

11    2500   0.8154906  0.6236875  0.07485835   0.1576576

12    1000   0.8201948  0.6336913  0.08672139   0.1806589

12    1500   0.8139105  0.6206994  0.08638618   0.1804780

12    2000   0.8137590  0.6204461  0.07771424   0.1629707

12    2500   0.8201876  0.6333194  0.07799832   0.1636237

13    1000   0.8123232  0.6173280  0.09299062   0.1936232

13    1500   0.8108802  0.6142721  0.08416414   0.1760527

13    2000   0.8154257  0.6236191  0.08079923   0.1693634

13    2500   0.8106566  0.6138814  0.08074394   0.1687437

14    1000   0.8171645  0.6270292  0.08608806   0.1799346

14    1500   0.8139033  0.6207263  0.08522205   0.1781396

14    2000   0.8170924  0.6276518  0.08766645   0.1822010

14    2500   0.8137590  0.6207371  0.08353328   0.1746425

15    1000   0.8091486  0.6110154  0.08455439   0.1745129

15    1500   0.8109668  0.6154780  0.08928549   0.1838700

15    2000   0.8059740  0.6047791  0.08829659   0.1837809

15    2500   0.8122511  0.6172771  0.08863418   0.1845635

Custom Tuning of Random Forest parameters in R

For more information on defining custom algorithms in caret see:

Using Your Own Model in train

To see the actual wrapper for random forest used by caret that you can use as a starting point, see:

Random Forest Wrapper for Caret Train
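You can also inspect the built-in wrapper from within R. caret's getModelInfo() function returns the list of components (parameters, grid, fit, predict, prob, and so on) that make up the "rf" method, which is a handy template when writing your own; a small sketch is shown below.

# Inspect caret's built-in random forest model definition as a template
rf_info <- getModelInfo("rf", regex = FALSE)[["rf"]]
names(rf_info)
rf_info$parameters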

Summary

In this post you discovered the importance of tuning well-performing machine learning algorithms in order to get the best performance from them.

You worked through an example of tuning the Random Forest algorithm in R and discovered three ways that you can tune a well-performing algorithm.

Using the caret R package.

Using tools that come with the algorithm.

Designing your own parameter search.

You now have a worked example and template that you can use to tune machine learning algorithms in R on your current or next project.
