2018 Forecasting exam

A1. Explain how the moving average method uses n observations to smooth time series data. What would be the difference in using n = 3 compared to n = 20?

The moving average method takes the mean of the last n observations, either to predict the next value (observation t+1) or to smooth the series.

Moving average calculation

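There are two differences between using n = 3 and n = 20: 1. The larger n is, the smoother the result. With n = 20 we see far less fluctuation, but we also lose more of the features in the data points, and the smoothed series reacts more slowly to recent changes. 2. When n is an odd number (n = 3), the window can be centred directly on an observation, so the simple formula above gives a centred moving average. When n is an even number (n = 20), the window cannot be centred symmetrically, so a centred moving average of order n has to average two adjacent n-term moving averages (a 2×n-MA), which is a weighted moving average. A non-centred (trailing) moving average can still be used for any n.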
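To see the effect of n, here is a minimal sketch in base R. The series `x` is simulated (an assumption, not exam data), and `stats::filter` is used with `sides = 1` so that each smoothed value is the mean of the last n observations.

```r
# Simulated noisy series (hypothetical data, for illustration only)
set.seed(1)
x <- ts(cumsum(rnorm(100)) + rnorm(100, sd = 2))

# Trailing moving average: mean of the last n observations
trailing_ma <- function(x, n) stats::filter(x, rep(1 / n, n), sides = 1)

ma3  <- trailing_ma(x, 3)
ma20 <- trailing_ma(x, 20)

# The first n - 1 values are NA because a full window is not yet available
sum(is.na(ma3))
sum(is.na(ma20))

# n = 20 gives a much smoother line than n = 3
plot(x, col = "grey")
lines(ma3, col = "blue")
lines(ma20, col = "red")
```

The larger window costs more observations at the start of the series and lags further behind turning points, which is the trade-off described above.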

A2. Describe how simulated annealing works. Explain how the temperature variable and greed works in your answer.

The algorithm mimics the cooling of metallic solids from the liquid phase, which increases the size of the crystals, makes the metal “harder” and reduces the number of defects. The initial heat applied to the material forces its atoms to move freely in random directions (stochastic nature). As the cooling process occurs, the atoms’ energy slowly decreases, resulting in a new formation.

Process of simulated annealing


Temperature variable and greed in SA

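The temperature is first set high, so the atoms can move around at random (stochastic nature). When the algorithm starts, it is quite willing to accept worse solutions that it encounters. As the temperature gradually falls, the probability of accepting a worse solution also falls, until eventually the search no longer jumps out of the current local optimum, that is, it becomes greedy and only accepts improvements. This reduces the chance of getting stuck in a local optimum while the temperature is high and increases the chance of finding the global optimum. Throughout the process the objective function is being maximised; the temperature is only a parameter, used to calculate the probability of accepting a neighbouring solution.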


When T is low, the probability of accepting a worse solution is low.
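The role of temperature can be sketched in a few lines of R. This is a minimal illustration, not the exam's reference implementation: the objective `f`, the geometric cooling rate and the neighbour step are all assumptions, and the acceptance rule used is the common exp(-delta / T) form.

```r
# Probability of accepting a candidate that is worse by `delta` (> 0)
# at temperature `temp`; candidates that are no worse are always accepted.
accept_prob <- function(delta, temp) {
  if (delta <= 0) 1 else exp(-delta / temp)
}

# Hypothetical objective to maximise, with several local optima
f <- function(x) sin(3 * x) - 0.1 * x^2

simulated_annealing <- function(f, start, temp = 10, cooling = 0.95, iters = 500) {
  current <- start
  best <- start
  for (i in seq_len(iters)) {
    candidate <- current + rnorm(1)        # random neighbour
    delta <- f(current) - f(candidate)     # > 0 means the candidate is worse
    if (runif(1) < accept_prob(delta, temp)) current <- candidate
    if (f(current) > f(best)) best <- current
    temp <- temp * cooling                 # geometric cooling schedule
  }
  best
}

set.seed(42)
best <- simulated_annealing(f, start = 5)
c(x = best, value = f(best))

# The same worse move is accepted far less often once the temperature is low
accept_prob(1, temp = 10)
accept_prob(1, temp = 0.1)
```

At high temperature the search behaves almost like a random walk; as the temperature approaches zero it behaves like greedy hill climbing.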

A3. Explain what deep learning is and give examples of how it is being used.

Deep learning

Packages: TensorFlow and Keras.

Convolutional neural networks (CNNs) are a class of deep learning model. CNNs were first used to solve problems in computer vision and pattern recognition, and they have subsequently been shown to be effective for NLP (Natural Language Processing), achieving excellent results in many NLP tasks.

A4. List the conditional probabilities for the following Bayesian network:


Probabilities for Bayesian network

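P(E | B): the probability of E occurring, given that B has occurred

P(C | A, B): the probability of C occurring, given that A and B have both occurred

P(D | C): the probability of D occurring, given that C has occurred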


B1. Your manager has created a function to model a business process and wants to use a genetic algorithm to identify the minimum solution.

a. Describe how genetic algorithms work. Explain the crossover and mutation stages in your answer giving examples with binary encoding.

The general procedure of a Genetic Algorithm is as follows:

1. Define an end condition (time or number of iterations).

2. Generate a random population of chromosomes.

3. Evaluate fitness of each chromosome in the population.

4. Create a new population by repeating the following steps until a new population is complete:

• Select two parent chromosomes from the population according to their fitness.

• Crossover, also called recombination, is a genetic operator that combines the genetic information of two parents to generate new offspring stochastically. There are a number of crossover techniques; with binary encoding the most common is single-point crossover: select a random cut-off point and form a new offspring by merging one side of the cut point of parent A with the other side of the cut point of parent B. For example, A: 10001|011 and B: 01101|110 produce the offspring 10001110.

• Randomly mutate the offspring. The mutation stage makes a small alteration to the new offspring, for example 10001110 mutates to 10101110. The probability of this occurring to each individual bit of the chromosome is set by the decision-maker or analyst. Generally the probability is fixed to less than 0.1 (<10%). The mutation probability controls how stochastic the algorithm is.

• Place the offspring into the population.

5. Evaluate fitness of each chromosome in the population.

6. If the end condition is met, return the best solution(s) in the current population; otherwise go back to step 4.
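The crossover and mutation examples in step 4 can be reproduced with a short R sketch. The helper names `crossover` and `mutate` are hypothetical, chosen for illustration; they are not functions from the GA package.

```r
# Single-point crossover on bit strings: head of parent A, tail of parent B
crossover <- function(parent_a, parent_b, cut) {
  paste0(substr(parent_a, 1, cut),
         substr(parent_b, cut + 1, nchar(parent_b)))
}

# Bit-flip mutation: each bit flips independently with probability p
mutate <- function(chromosome, p = 0.1) {
  bits <- as.integer(strsplit(chromosome, "")[[1]])
  flip <- runif(length(bits)) < p
  bits[flip] <- 1L - bits[flip]
  paste(bits, collapse = "")
}

# A: 10001|011 and B: 01101|110, cut after the fifth bit
child <- crossover("10001011", "01101110", cut = 5)
child

set.seed(1)
mutate(child, p = 0.1)
```

With a cut after the fifth bit, `crossover` returns 10001110, matching the worked example above. `mutate` may change zero, one or several bits, depending on the random draws.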

b. Solve the function below with a genetic algorithm using the ga function in the GA package in R. Use a real-valued optimisation type and set the minimum input parameters as c(-10, -10) and the maximum input parameters as c(10, 10).

Include the code used, a plot of the fitness value throughout the GA generations, the summary output of the function and an explanation of the summary output.

CODE HERE

# Define the business task function

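B1 <- function(x) {
  # ga() maximises the fitness, so return the negated function value
  # to make the GA search for the minimum
  total <- sum(x^4 - 16 * x^2 + 5 * x)
  return(-total / 2)
}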

# load the GA package

library("GA")

# fit the B1 function into ga model and set parameters

GA <- ga(type = "real-valued", fitness = B1, lower=c(-10, -10), upper=c(10, 10))

# look at the results

summary(GA)

plot(GA)


GA results HERE: 

Iterations             = 100 

Fitness function value = 78.2173 

Solution = 

            x1        x2

[1,] -2.821962 -2.916912

The summary of the GA results shows that after 100 iterations the GA found a largest fitness value of 78.2173, at x1 = -2.82 and x2 = -2.92. Because ga maximises the fitness, the fitness function returns the negative of the business function, so we reverse the sign of the fitness value to get the minimum of the function: -78.2173.
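As a sanity check on the sign, the reported solution can be plugged back into the un-negated function. The sketch below also runs base R's `optim` from that point; this is a different optimiser from the GA, used here only as a cross-check, and `f` is simply the exam function without the sign flip.

```r
# Original (un-negated) objective from the question
f <- function(x) sum(x^4 - 16 * x^2 + 5 * x) / 2

# Solution reported in the GA summary above
ga_solution <- c(-2.821962, -2.916912)
f(ga_solution)   # about -78.2173, the negative of the reported fitness

# Cross-check with a local optimiser started from the GA solution
check <- optim(par = ga_solution, fn = f, method = "BFGS")
check$par
check$value
```

The local optimiser can only refine the GA's answer within the same basin, so it shows how far the 100-iteration result still is from the nearby minimum.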

c. Repeat the ga optimisation with the following custom parameters:

popSize = 100

pcrossover = 0.9

pmutation = 0.2

maxiter = 500

Include the code used and the summary output of the function. Describe what these custom parameters are used for and how they have affected the result in comparison to your previous answer.

CODE:

# fit the B1 function into ga model and modify parameters

GA <- ga(type = "real-valued", fitness = B1, lower = c(-10, -10), upper = c(10, 10), popSize = 100, pcrossover = 0.9, pmutation = 0.2, maxiter = 500)

# look at the results

summary(GA)

plot(GA)

GA results: 

Iterations             = 500 

Fitness function value = 78.33233 

Solution = 

            x1        x2

[1,] -2.903774 -2.903451

EXPLANATION :

popSize = The population size. The default is 50; here it is 100, so more candidate solutions are evaluated in each generation.

pcrossover = The probability of crossover between pairs of chromosomes. Typically this is a large value. The default is 0.8; here it is 0.9.

pmutation = The probability of mutation in a parent chromosome. Usually mutation occurs with a small probability; the default is 0.1. Here it is 0.2, which makes mutation more frequent in the population and so explores more of the search space.

maxiter = The maximum number of iterations to run before the GA search is stopped. The default is 100; here it is 500, which gives the search more generations to improve the result.

Compared with the previous result, this run finds a better solution. The fitness value increased by about 0.12 (from 78.2173 to 78.3323), so the minimum found falls from -78.2173 to -78.3323, with both x1 and x2 now close to -2.90. The larger population size, higher mutation rate and extra iterations increase the probability of finding a better solution.

B2. Iveco has approached your consultancy company asking you to help them forecast the number of 35S12 vans sold in the UK in the next year. They have provided you with the quarterly time series sales data from Q3 2008 to Q3 2017 (B2.csv) for the Iveco Daily 35S12 van.

a. Using the read.csv, ts and plot functions in R, import the data, create a time series object then plot the time series object.

From looking at this plot, what can you say about the trend and seasonality of the data? Include the plot in your answer.

Code :

#load the data set

data <- read.csv("2018B2.csv")

View(data)

# create a quarterly time series object

iveco <- ts(data$Sales,start = c(2008,3),end = c(2017,3),frequency = 4)

# see the plot of time series data

plot (iveco)


From this plot, we can see that the trend is upward before 2011 and then drops dramatically. We can hardly see any seasonality in the data; if there is any, it is quite small.

b. Using the plot and stl functions in R, decompose the data with loess (additive) decomposition and explain what is shown in the plot.

Explain what the bars to the right of the plot represent.

# load "forecast" package

library("forecast")

#using loess decompose the data and set the seasonal window to periodic

lo <- stl(iveco, s.window = "periodic")

#take a look at the results

lo$time.series

plot(lo)

iveco decomposition

The bars indicate relative scale: each bar represents the same magnitude drawn on its panel's own scale, so a large bar means the component is small. The large seasonal bar shows that the seasonal variation is relatively small compared with the data and the trend.

In this plot, the first panel shows the time series of the Iveco sales data. The following panels show the seasonal, trend and remainder components. Because we use the additive method here, the sum of the last three is equal to the first. The trend component accounts for the largest part of the data and the seasonal component is very small.
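The additive property can be checked directly. Since B2.csv is not available here, the sketch below uses `UKgas`, a quarterly series shipped with R, purely as a stand-in for the Iveco data.

```r
# UKgas is a built-in quarterly series; it stands in for the Iveco sales data
fit <- stl(UKgas, s.window = "periodic")
components <- fit$time.series   # columns: seasonal, trend, remainder

# Additive decomposition: the three components sum back to the data
reconstructed <- rowSums(components)
all.equal(as.numeric(reconstructed), as.numeric(UKgas))

# Range of each component, which is what the bars on the plot help compare
apply(components, 2, function(col) diff(range(col)))
```

A component with a small range relative to the data gets a long bar on the stl plot, because its panel has been stretched more to fill the same height.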

c. Using the ets() function in the forecast package in R, predict future sales using exponential smoothing for the next year (4 observations only). Set alpha so that you give more weight to more recent observations. Include an image of the forecast in your answer.

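# Use ets to predict next year's sales; a large alpha puts more weight on recent values
# Pass the ts object itself (not iveco[1:37]) so the quarterly frequency is kept
fit <- ets(iveco, model = "ZZZ", alpha = 0.9)
# Forecast the next 4 quarters
pre <- forecast(fit, h = 4)
plot(pre)
# Overlay the observed series; lines(), not line(), adds to an existing plot
lines(iveco)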
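Why a large alpha favours recent observations can be shown with the weights of simple exponential smoothing, where the observation k periods back receives weight alpha * (1 - alpha)^k. This is a sketch for the level-only case; the model that ets selects with "ZZZ" may also include trend or seasonal terms, and `ses_weights` is a hypothetical helper name.

```r
# Weight given to the observation k periods back under simple exponential smoothing
ses_weights <- function(alpha, k = 0:4) alpha * (1 - alpha)^k

round(ses_weights(0.9), 4)   # almost all weight on the latest observation
round(ses_weights(0.2), 4)   # weight spread over many past observations
```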


B3. HR has approached you to help them study your company’s employees. They have provided you with a dataset (B3.csv) with the following 6 columns about 14,999 employees:

satisfaction_level: Satisfaction Level

last_evaluation: Last evaluation

number_project: Number of projects

average_montly_hours: Average monthly hours

time_spend_company: Time spent at the company

Work_accidents: Number of accidents the employee has had at work

a. Describe the differences between Principal Components Analysis (PCA) and Exploratory Factor Analysis (EFA).

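PCA is a method for reducing the linear correlation between variables: it replaces correlated variables with a smaller set of uncorrelated components. EFA is a method for finding the underlying factors that give rise to the observed variables.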

b. Using read.csv and the corrgram function from the corrgram package, import the data and create a correlogram plot of the 6 measurements of the employees. Discuss the suitability of the data for PCA and include an image of the plot in your answer.

CODE:

pdata <- read.csv("2018B3.csv")

View(pdata)

# load corrgram package

library("corrgram")

corrgram(pdata)

The PCA method is suitable for reducing a large number of correlated variables. In the corrgram plot, blue means two variables are positively correlated and red means they are negatively correlated. A darker colour means the two variables are more strongly correlated. last_evaluation, number_project and average_montly_hours are highly positively correlated, while satisfaction_level and number_project are highly negatively correlated. We could use PCA to reduce those correlated variables.

c. Using the plot and prcomp functions in R, plot a scree plot and describe how you can use this plot to identify the number of components to use in principal components analysis. Include the scree plot in your answer.

code:

# fit the data into pca

hrpca <- prcomp(pdata, scale = TRUE)

plot(hrpca, type= "line", main = "scree plot")

A scree plot displays how much variation each principal component captures from the data. We can use the following rules to choose the number of components:

• Kaiser's rule: keep the components with variances (eigenvalues) over 1.

• The "elbow rule": keep the components before the point where the curve flattens out.

• Proportion of variance: the selected PCs should be able to describe at least 80% of the variance.
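Kaiser's rule and the 80% rule can be applied programmatically. The sketch below uses the built-in `USArrests` data as a stand-in, since B3.csv is not available here; only `prcomp` and base functions are used.

```r
# USArrests is a built-in data set, used here only as a stand-in for the HR data
pca <- prcomp(USArrests, scale. = TRUE)

eigenvalues <- pca$sdev^2                     # variance captured by each component
prop_var <- eigenvalues / sum(eigenvalues)    # proportion of variance

# Kaiser's rule: number of components with eigenvalue above 1
sum(eigenvalues > 1)

# 80% rule: smallest number of components whose cumulative proportion reaches 0.8
which(cumsum(prop_var) >= 0.8)[1]

# Scree plot for the elbow rule
plot(pca, type = "lines", main = "Scree plot")
```

The rules do not always agree, so it is worth reporting which one was used when justifying the number of components.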

d. Using the prcomp function in R, use principal components analysis on the data. Include and describe the results of the analysis. Discuss the loadings and how appropriate it would be to use two components.

code:

hrpca

summary(hrpca)

Result:

Importance of components:

                         PC1    PC2    PC3    PC4    PC5

Standard deviation     1.353 1.0534 1.0000 0.9362 0.7968

Proportion of Variance 0.305 0.1849 0.1667 0.1461 0.1058

Cumulative Proportion  0.305 0.4899 0.6566 0.8027 0.9085

Rotation (n x k) = (6 x 6):

                             PC1         PC2         PC3

satisfaction_level    0.08693115 -0.82848859  0.08271569

last_evaluation      -0.50728391 -0.36995575  0.01296449

number_project       -0.57900111  0.11114716 -0.03199330

average_montly_hours -0.54922118 -0.12501818 -0.00810438

time_spend_company   -0.31310859  0.38036651  0.03235213

Work_accidents       -0.01352139  0.06385507  0.99541656

                             PC4          PC5          PC6

satisfaction_level   -0.37912166  0.272273055 -0.285204994

last_evaluation      -0.04769970 -0.714195147  0.305414036

number_project        0.20810048  0.005747078 -0.779770228

average_montly_hours  0.25387813  0.635654862  0.462763070

time_spend_company   -0.86109751  0.107569645  0.056369470

Work_accidents        0.06886704 -0.011459253 -0.003404899

The result shows that:

PC1 and PC2 together explain only 49% of the variance (cumulative proportion 0.4899). The selected PCs should be able to describe at least 80% of the variance, which here takes four components (0.8027), so two components are not enough.

last_evaluation, number_project and average_montly_hours load most heavily on PC1 (-0.51, -0.58 and -0.55), while satisfaction_level loads most heavily on PC2 (-0.83).
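The proportions in the summary follow directly from the standard deviations: with six standardised variables the total variance is 6, so each proportion is the squared standard deviation divided by 6. A short check using only the figures reported above (PC6 is not shown in the printed summary, so it is left out):

```r
# Standard deviations as reported in the summary output above
sdev <- c(PC1 = 1.353, PC2 = 1.0534, PC3 = 1.0000, PC4 = 0.9362, PC5 = 0.7968)
n_vars <- 6   # scaled variables each contribute variance 1

prop_var <- sdev^2 / n_vars
round(prop_var, 4)
round(cumsum(prop_var), 4)

# First component at which the cumulative proportion reaches 80%
names(which(cumsum(prop_var) >= 0.8))[1]
```

Small differences in the fourth decimal place against the printed table come from the rounding of the reported standard deviations.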
