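This article uses data analysis to explore which factors are related to people's happiness index.
The analysis workflow is:
1. Write a scraper to collect data from the web
2. Data processing
3. Correlation analysis
4. Principal component analysis
5. Clustering
6. Visualization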
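1. Writing a scraper to collect data from the web
For scraping I recommend Python. R can do it too, but more clumsily; even so, this article demonstrates the scraper in R.
Two datasets are needed. Each is processed separately, and then the two are merged.
The first: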
> library(rvest)
> url<- "https://en.wikipedia.org/wiki/World_Happiness_Report"
> web<-read_html(url,encoding = "UTF-8")
> happy_report<-html_nodes(web,"table")%>%html_table()
> happy<-happy_report[[1]]
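The second:
> # The source gives only the page title here; the URL below assumes the English Wikipedia page of that title
> url2<-"https://en.wikipedia.org/wiki/List_of_countries_by_Social_Progress_Index"
> web2<-read_html(url2,encoding = "UTF-8")
> social<-html_nodes(web2,"table")%>%html_table()
> social<-social[[2]]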
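To see what the html_nodes() and html_table() calls return without touching the network, here is a minimal sketch that parses an inline HTML string. The table contents are made up for illustration; only the rvest calls mirror the code above.

```r
library(rvest)

# Hypothetical two-row table standing in for a Wikipedia page
html <- '<html><body><table>
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>A</td><td>7.5</td></tr>
  <tr><td>B</td><td>6.9</td></tr>
</table></body></html>'

page   <- read_html(html)                         # parse the document
tables <- html_table(html_nodes(page, "table"))   # one table object per <table> node
first  <- tables[[1]]                             # pick a table by position, as above
print(first)
```

On a real page the position of the wanted table (1 for the first dataset, 2 for the second) has to be found by inspecting the returned list, and it can change when the page is edited.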
2. Data processing
The two datasets are processed separately.
Preprocessing the first dataset
> str(happy)
> happy<-happy[c(3,6:11)]
> View(happy)
> colnames(happy)<-gsub(" ","",colnames(happy),perl = T) # remove spaces from the column names
> library(plyr) # provides mapvalues()
> happy$Country<-as.character(mapvalues(happy$Country,from = c("United States", "Congo (Kinshasa)", "Congo (Brazzaville)", "Trinidad and Tobago"),to=c("USA","Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad"))) # standardize country names
Preprocessing the second dataset
> social<-social[c(1,5,7,9)] # select a subset of columns
> names(social)<-c("Country","basic_human_needs", "foundations_well_being", "opportunity") # rename the columns
> social$Country<-as.character(mapvalues(social$Country,from = c("United States", "Côte d'Ivoire","Democratic Republic of Congo", "Congo", "Trinidad and Tobago"),to=c("USA", "Ivory Coast","Democratic Republic of the Congo", "Democratic Republic of the Congo", "Trinidad"))) # standardize country names
> social[,2:4]<-sapply(social[,2:4],as.numeric) # convert the values from character to numeric
Warning messages:
1: In lapply(X = X, FUN = FUN, ...) : NAs introduced by coercion
2: In lapply(X = X, FUN = FUN, ...) : NAs introduced by coercion
3: In lapply(X = X, FUN = FUN, ...) : NAs introduced by coercion
Joining the two datasets
> library(dplyr) # provides left_join()
> social.happy<-left_join(happy, social, by = c('Country' = 'Country'))
Checking for missing values and imputing them
> mean(is.na(social.happy[,2:10]))
> for (i in 1:ncol(social.happy[,2:10])){
+   social.happy[,2:10][is.na(social.happy[,2:10][,i]),i]<-median(social.happy[,2:10][,i],na.rm = T)
+ }
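The loop above replaces each NA with the median of its column. A self-contained sketch of the same idea on a small data frame, in base R (the column names and values are hypothetical):

```r
# Toy data frame with missing values (hypothetical values)
df <- data.frame(
  Country = c("A", "B", "C", "D"),
  x = c(1, NA, 3, 10),
  y = c(NA, 2, 4, 6)
)

num_cols <- 2:3  # positions of the numeric columns to impute

# Share of missing cells before imputation
mean(is.na(df[, num_cols]))

# Replace NAs column by column with that column's median
for (col in num_cols) {
  missing <- is.na(df[[col]])
  df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
}
print(df)
```

Median imputation keeps every row in the analysis, at the cost of pulling imputed countries toward the middle of each variable.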
Rescaling the data to values between 0 and 1
> range_transform<-function(x){
+ (x-min(x))/(max(x)-min(x))
+ }
> social.happy[,2:10]<-as.data.frame(apply(social.happy[,2:10],2,range_transform))
Scaling the variables to mean 0 and standard deviation 1
> sd_scale<-function(x){
+ (x-mean(x))/sd(x)
+ }
> social.happy[,2:10]<-as.data.frame(apply(social.happy[,2:10],2,sd_scale))
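One point worth noting: standardizing after min-max rescaling gives the same values as standardizing the raw data, because a z-score is unchanged by any shift and positive rescaling. So the final values are determined by the second transformation alone. A quick check with a made-up vector:

```r
range_transform <- function(x) (x - min(x)) / (max(x) - min(x))
sd_scale        <- function(x) (x - mean(x)) / sd(x)

x <- c(2, 5, 9, 14, 20)  # hypothetical values

both   <- sd_scale(range_transform(x))  # min-max first, then z-score
direct <- sd_scale(x)                   # z-score only

all.equal(both, direct)

# Base R's scale() performs the same standardization
all.equal(as.numeric(scale(x)), direct)
```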
3. Simple correlation analysis
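No code is given for this step. As a minimal sketch, a simple correlation analysis could apply base R's cor() to the numeric columns. The data frame below is a made-up stand-in for social.happy[, 2:10], so its numbers carry no meaning.

```r
set.seed(1)  # reproducible toy data

# Hypothetical stand-in for the numeric columns of the merged data
n   <- 50
gdp <- rnorm(n)
toy <- data.frame(
  GDPpercapita          = gdp,
  Healthylifeexpectancy = 0.8 * gdp + rnorm(n, sd = 0.5),
  Generosity            = rnorm(n)
)

# Pearson correlation matrix, rounded for readability
cor_mat <- cor(toy, method = "pearson")
print(round(cor_mat, 2))
```

Strongly correlated variables are also a hint that a few principal components may summarize most of the variance.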
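4. Principal component analysis
> library(FactoMineR) # provides PCA()
> library(factoextra) # provides fviz_screeplot() and fviz_pca_ind()
> soc.pca <- PCA(social.happy[, 2:10], graph=FALSE)
> fviz_screeplot(soc.pca, addlabels = TRUE, ylim = c(0, 65))
The percentage on each bar is the share of total variance explained by the corresponding principal component. As the plot shows, the first three principal components account for about 80% of the total variance.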
> soc.pca$eig
       eigenvalue percentage of variance cumulative percentage of variance
comp 1  5.0714898              56.349887                          56.34989
comp 2  1.4357885              15.953205                          72.30309
comp 3  0.6786121               7.540134                          79.84323
comp 4  0.6022997               6.692219                          86.53544
comp 5  0.4007136               4.452373                          90.98782
comp 6  0.3362642               3.736269                          94.72409
comp 7  0.2011131               2.234590                          96.95868
comp 8  0.1471443               1.634937                          98.59361
comp 9  0.1265747               1.406386                         100.00000
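The two percentage columns follow directly from the eigenvalues: each percentage is the eigenvalue divided by the sum of all eigenvalues, and for standardized variables that sum equals the number of variables (9 here). This can be checked with the values printed above:

```r
# Eigenvalues copied from the soc.pca$eig output
eig <- c(5.0714898, 1.4357885, 0.6786121, 0.6022997, 0.4007136,
         0.3362642, 0.2011131, 0.1471443, 0.1265747)

sum(eig)                      # total variance of the 9 standardized variables
pct <- 100 * eig / sum(eig)   # percentage of variance per component
cum <- cumsum(pct)            # cumulative percentage of variance

round(data.frame(eigenvalue = eig, pct = pct, cum = cum), 2)
```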
> soc.pca$var$contrib[,1:3]
                              Dim.1        Dim.2        Dim.3
GDPpercapita             15.7263477 2.455323e+00 3.162470e-02
Socialsupport            12.0654754 5.445993e-01 1.345610e+00
Healthylifeexpectancy    15.1886385 2.259270e+00 3.317633e+00
Freedomtomakelifechoices  6.6999181 2.207049e+01 8.064596e+00
Generosity                0.3270114 4.189437e+01 5.406678e+01
Trust                     4.3688692 2.570658e+01 3.211058e+01
basic_human_needs        14.5402807 3.836956e+00 1.021076e+00
foundations_well_being   15.1664220 1.232353e+00 4.169125e-02
opportunity              15.9170370 6.170376e-05 4.113192e-04
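The first principal component explains about 56% of the total variance and appears to represent opportunity, GDP per capita, healthy life expectancy, foundations of well-being, and basic human needs.
The second principal component explains a further 16% of the total variance. According to the contribution table above, it is dominated by generosity, trust, and freedom to make life choices, while opportunity and social support contribute almost nothing to it.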
5. Clustering
> require(NbClust)
> nbc <- NbClust(social.happy[, 2:10], distance="manhattan", min.nc=2, max.nc=30, method="ward.D", index='all')
...
* According to the majority rule, the best number of clusters is 3
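The NbClust call above scores many candidate cluster counts on hierarchical clusterings built with Manhattan distance and Ward linkage. To make those two settings concrete, here is a minimal base R sketch that builds one such tree and cuts it into 3 groups. The toy matrix is made up, and the sketch does not reproduce NbClust's index voting; it only illustrates the distance and linkage choices.

```r
set.seed(42)  # reproducible toy data

# Three hypothetical, well-separated groups of 10 points in 2 dimensions
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

d      <- dist(toy, method = "manhattan")  # same distance as in the NbClust call
hc     <- hclust(d, method = "ward.D")     # same linkage as in the NbClust call
groups <- cutree(hc, k = 3)                # cut the tree into 3 clusters

table(groups)
```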
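> # pamK3 is not defined in the source; assumed here to be a 3-cluster PAM fit (cluster package) on the same columns
> pamK3 <- cluster::pam(social.happy[, 2:10], k = 3)
> social.happy['cluster'] <- as.factor(pamK3$clustering)
> fviz_pca_ind(soc.pca, label="none", habillage = social.happy$cluster, # color by cluster
+   palette = c("#00AFBB", "#E7B800", "#FC4E07", "#7CAE00", "#C77CFF", "#00BFC4"),
+   addEllipses=TRUE)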