R for Data Science (1): ggplot2

1. install packages

install.packages("tidyverse")
library(tidyverse)
tidyverse_update()
##################

Install three data packages

install.packages(c("nycflights13", "gapminder", "Lahman"))

The tidyverse includes the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages, among others.

PART I Explore

CHAPTER 1: Data Visualization with ggplot2

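Take the mpg data set from the ggplot2 package as an example. It is a data frame in which each row is an observation and each column is a variable. mpg covers 38 car models (234 rows, 11 variables).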

# Inspect the data set
head(ggplot2::mpg)

displ: engine displacement in litres; hwy: highway fuel efficiency in miles per gallon
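The shape of the data set can be checked directly. This is a minimal sketch that assumes ggplot2 is installed; the figure of 38 refers to the distinct values in the model column.

```r
library(ggplot2)

# mpg ships with ggplot2; each row is one observation of a car.
dim(mpg)                    # number of rows and columns
names(mpg)                  # variable names, including displ and hwy
length(unique(mpg$model))   # number of distinct car models
```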

Create a first ggplot from this data set

library(ggplot2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

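The plot shows a negative relationship between engine size and fuel efficiency: cars with bigger engines use more fuel.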

ggplot() creates the basic coordinate system, and layers are then added on top of it.

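# An empty plot: the background, colors, and fonts are already set up
ggplot(data = mpg)
# aes() maps variables in the data to visual properties of the plot
ggplot(data = mpg) + geom_point(aes(displ, hwy))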

# Inspect the mpg data
dim(mpg)
head(mpg)
# Plot cyl against hwy
ggplot(mpg,aes(hwy,cyl)) + geom_point()

Here is a reusable plotting template:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
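As an illustration, the template above can be filled in programmatically. This is only a sketch: scatter_from_template() is a hypothetical helper, not part of ggplot2, and it assumes a ggplot2 version that supports the .data pronoun inside aes().

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): fills the template's
# <DATA>, <GEOM_FUNCTION>, and <MAPPINGS> slots for a scatterplot.
scatter_from_template <- function(data, x, y) {
  ggplot(data = data) +
    geom_point(mapping = aes(x = .data[[x]], y = .data[[y]]))
}

# Column names are passed as strings.
p <- scatter_from_template(mpg, "displ", "hwy")
p
```

Calling the helper with other column names, for example scatter_from_template(mpg, "cty", "hwy"), reuses the same template with different mappings.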

Aesthetic Mappings

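An aesthetic is a visual property of the objects in a plot, such as the size or color of the points.
We can map the color of the points to a variable, for example class: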

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))

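class can also be mapped to point size (ggplot2 warns that using size for a discrete variable is not advised):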

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))

Or it can be mapped to transparency (alpha) or to shape:

# Top
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Bottom
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
# By default ggplot2 uses only 6 shapes at a time; class has 7 values, so the SUV points are not drawn

An aesthetic can also be set manually, as an argument of the geom function outside aes():

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Exercises:
1. Why are the points not blue?

ggplot(data = mpg) +
geom_point(
mapping = aes(x = displ, y = hwy, color = "blue")
)

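Because color = "blue" sits inside aes(), ggplot2 treats "blue" as a variable with a single value and assigns it a color from the default palette instead of using blue literally.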
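The difference can be seen by inspecting the colors ggplot2 actually draws with. A small sketch, assuming ggplot2 is installed; layer_data() returns the computed data of a layer.

```r
library(ggplot2)

# "blue" inside aes() is treated as data: a constant variable with one value.
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes() is a fixed setting applied to every point.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Compare the colors each layer ends up using.
unique(layer_data(p_mapped)$colour)  # a palette color, not "blue"
unique(layer_data(p_set)$colour)     # the literal "blue"
```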

ggplot(data = mpg) +
geom_point(
mapping = aes(x = displ, y =hwy, color = cty))

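2. Note how mapping a continuous variable differs from mapping a categorical one. For color, a continuous variable gets a gradient of one color from dark to light, while a categorical variable gets a distinct color for each category.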

ggplot(data = mpg) +
geom_point(
mapping = aes(x = displ, y =hwy, color = displ))
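The contrast between continuous and categorical color mappings can be checked with a short sketch. It assumes ggplot2 is installed and uses cty as the continuous variable and drv as the categorical one.

```r
library(ggplot2)

p_cont <- ggplot(mpg) + geom_point(aes(displ, hwy, color = cty))  # continuous
p_disc <- ggplot(mpg) + geom_point(aes(displ, hwy, color = drv))  # categorical

# A categorical variable gets exactly one color per level.
n_disc <- length(unique(layer_data(p_disc)$colour))
n_disc == length(unique(mpg$drv))

# A continuous variable gets colors interpolated along a gradient.
head(unique(layer_data(p_cont)$colour))
```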

4. Mapping one variable to several aesthetics works, but the information is redundant, so it is usually avoided.

What does the stroke aesthetic do?

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 1)

stroke controls the width of the border of a point.

ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()

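Note: syntax mistakes are easy to make in R. Check that every ( has a matching ) and that every " is paired. If R does not respond after you run code, press Esc to abort the unfinished command.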

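Facets

One way to add information to a plot is to map a variable to an aesthetic. Another is to facet on a categorical variable, which splits the plot into several small panels.
There are two facet functions. facet_wrap(~ variable, nrow, ncol) facets on a single categorical variable.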

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

facet_grid(a ~ b) facets on the combination of two variables: a defines the rows and b the columns.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)

To facet on only one variable with facet_grid(), put a . in place of the other variable.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
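The same grids can also be written without the formula and the . placeholder. This sketch assumes ggplot2 3.0.0 or later, where facet_grid() accepts rows and cols arguments built with vars().

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Equivalent to facet_grid(drv ~ cyl)
p_both <- base + facet_grid(rows = vars(drv), cols = vars(cyl))
# Equivalent to facet_grid(. ~ cyl)
p_cols <- base + facet_grid(cols = vars(cyl))
# Equivalent to facet_grid(drv ~ .)
p_rows <- base + facet_grid(rows = vars(drv))

p_both
```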

Exercises:

1. What happens if you facet on a continuous variable?

head(mpg)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty, nrow = 2)

The continuous variable is treated as categorical, and each distinct value gets its own panel.
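A quick way to confirm this is to compare the number of distinct cty values with the number of panels that receive data. A minimal sketch, assuming ggplot2 is installed:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty, nrow = 2)

# One panel is drawn per distinct value of cty.
n_values <- length(unique(mpg$cty))
n_panels <- length(unique(layer_data(p)$PANEL))
c(distinct_values = n_values, panels = n_panels)
```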
2. This plot has empty positions. What do they mean?

ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))

An empty position means that no observation has that combination of drv and cyl.
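The missing combinations can be listed with a cross-tabulation. A small sketch using base R's table(), assuming ggplot2 is installed for the mpg data:

```r
library(ggplot2)

# Cross-tabulate drive type against number of cylinders.
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Cells with a count of zero are the empty positions in the plot.
which(combos == 0, arr.ind = TRUE)
```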
3. How do the following two pieces of code differ?

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

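The . marks the dimension that is not faceted: drv ~ . lays the panels out in rows, and . ~ cyl lays them out in columns.
4. What are the advantages and disadvantages of faceting instead of using the color aesthetic?
The eye can tell apart only a limited number of colors in one plot (roughly nine at most). Facets can separate more categories, but values in different panels are harder to compare.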

3.6 Geometric Objects

A geom is the geometric object that a plot uses to represent the data.

# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# right
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

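Every geom function takes a mapping argument, but not every aesthetic works with every geom. For example, a point can have a shape but a line cannot, while a line can have a linetype.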

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

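Many geoms use a single geometric object to display several rows of data. Mapping a categorical variable to the group aesthetic draws one object per group without adding a legend. Mapping it to an aesthetic such as color or linetype groups the data automatically, and show.legend = FALSE hides the resulting legend.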

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)

ggplot2 can also display several layers in one plot

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))

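When several geoms share one plot, mappings given to ggplot() are global and apply to every layer, while mappings given to a geom function are local to that layer. If the two conflict, the local mapping wins.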

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(
# dplyr::filter() keeps only the subcompact cars for this layer
data = dplyr::filter(mpg, class == "subcompact"),
se = FALSE
)

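filter() restricts the data passed to geom_smooth() to one class of car. se controls whether the confidence band around the smooth line is drawn; se = FALSE hides it.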
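The rule that a local mapping overrides a global one can be checked on a small example. This is a sketch assuming ggplot2 is installed; the choice of drv as the global color and class as the local one is only for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +  # global mappings
  geom_point(aes(color = class)) +  # local color overrides the global one
  geom_smooth(se = FALSE)           # inherits the global color = drv

# Number of distinct colors used by each layer.
n_point_colors  <- length(unique(layer_data(p, 1)$colour))
n_smooth_colors <- length(unique(layer_data(p, 2)$colour))
c(points = n_point_colors, smooth = n_smooth_colors)
```

The point layer uses one color per class, while the smooth layer uses one color per drv value.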
Exercises:
Exercise 3.6.2 What does the plot drawn by this code look like?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

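colour = drv is a global mapping, so it is passed to both geom_point() and geom_smooth(). The points are colored by drv, and a separate smooth line is drawn for each value of drv.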

Exercise 3.6.3
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, colour = drv)
  )

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, colour = drv),
    show.legend = FALSE)

Re-create the R code necessary to generate the following graphs.

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(se = FALSE)
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(aes(group = drv), se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = drv)) + geom_smooth(se = FALSE)
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = drv)) + geom_smooth(aes(linetype = drv), se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
   geom_point(size = 4, color = "white") +
   geom_point(aes(colour = drv))

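3.7 Statistical Transformations

A statistical transformation (stat) is the algorithm used to compute new values for a plot.
Every geom function has a default stat, and every stat function has a default geom.
geom_bar() draws a bar chart, and its default stat is stat_count().
The default stat is usually fine. There are three reasons to specify a stat explicitly:
1. Overriding the default stat

The default stat of a bar chart is stat_count(), which counts the rows in each x group. When the bar heights are already present in the data, the default has to be overridden.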

library(tibble)

demo <- tribble(
~a, ~b,
"bar_1", 20,
"bar_2", 30,
"bar_3", 40
)
# The default is stat = "count"; here it is changed to "identity"

ggplot(data = demo) +
geom_bar(
mapping = aes(x = a, y = b), stat = "identity"
)
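geom_col() is a shortcut for the same plot, because its default stat is already identity. The sketch below rebuilds the demo table so that it runs on its own and checks that both layers produce the same bar heights; it assumes ggplot2 and tibble are installed.

```r
library(ggplot2)
library(tibble)

demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)

p_bar <- ggplot(demo) + geom_bar(aes(x = a, y = b), stat = "identity")
p_col <- ggplot(demo) + geom_col(aes(x = a, y = b))

# Both layers draw bars with the same heights.
layer_data(p_bar)$y
layer_data(p_col)$y
```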

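2. Overriding the default mapping from computed variables to aesthetics
By default the y axis of a bar chart is the count for each x value. Here x is the five levels of cut (cut quality), and geom_bar() counts the diamonds at each level. To show each level's share of the total instead of the count, map y to the computed variable prop.

# after_stat(prop) is the ggplot2 3.3.0+ spelling of the older ..prop..
ggplot(diamonds, aes(cut, after_stat(prop), group = 1)) + geom_bar()
# group = 1 treats all diamonds as a single group, so each bar shows that cut's share of the total.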
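To see what the prop variable contains, the proportions can be computed by hand and compared with the values ggplot2 computes. A minimal sketch, assuming ggplot2 3.3.0 or later for after_stat():

```r
library(ggplot2)

# Share of each cut level, computed directly from the data.
by_hand <- prop.table(table(diamonds$cut))
by_hand

# The same numbers as computed by stat_count() with group = 1.
p <- ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()
layer_data(p)$prop
```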

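3. Drawing attention to the statistical transformation in the code
Take stat_summary() as an example.

# fun.min, fun.max, and fun replace fun.ymin, fun.ymax, and fun.y, which were deprecated in ggplot2 3.3.0
ggplot(diamonds) + stat_summary(aes(cut, depth),
                              fun.min = min,
                              fun.max = max,
                              fun = median)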
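The summary that stat_summary() draws can be reproduced with base R, which makes it clearer what the stat computes for each cut. This is a sketch assuming ggplot2 3.3.0 or later (for the fun, fun.min, and fun.max arguments).

```r
library(ggplot2)

# Minimum, median, and maximum depth for each cut, computed with base R.
by_hand <- t(sapply(
  split(diamonds$depth, diamonds$cut),
  function(d) c(min = min(d), median = median(d), max = max(d))
))
by_hand

# The same values as computed by stat_summary().
p <- ggplot(diamonds) +
  stat_summary(aes(cut, depth), fun.min = min, fun.max = max, fun = median)
layer_data(p)[, c("ymin", "y", "ymax")]
```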

Exercises:
1. What is the default geom of stat_summary()?
The default geom of stat_summary() is geom_pointrange(), whereas the default stat of geom_pointrange() is identity.

ggplot(diamonds) + geom_pointrange(aes(cut, depth),
                                   stat = "summary",
                                   fun.min = min,
                                   fun.max = max,
                                   fun = median)

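How do geom_col() and geom_bar() differ?
The default stat of geom_col() is stat_identity(); the default stat of geom_bar() is stat_count().
stat_smooth() computes the predicted value, the lower and upper bounds of the confidence interval, and the standard error.
Why is group = 1 needed in geom_bar(aes(y = after_stat(prop))) (written ..prop.. in older ggplot2 code)?
By default the groups are defined by x, and proportions are computed within each group, so without group = 1 every bar has a proportion of 1.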

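# Without an explicit group, every bar (or bar segment) has a proportion of 1
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = color, y = after_stat(prop))
)
# With an explicit group, proportions are computed within that group
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop), group = color))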

3.8 Position Adjustments

Bars can be colored with the color aesthetic (the outline) or the fill aesthetic (the interior).

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))

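The position argument controls how the bars are placed. Besides the default "stack", there are three options: "identity", "dodge", and "fill".
"identity" places each bar exactly where its values fall, so the bars overlap. Making them transparent or unfilled keeps all of them visible.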

ggplot(diamonds, aes(cut, fill = clarity)) + geom_bar(alpha = 1/5, position = "identity")
ggplot(diamonds, aes(cut, color = clarity)) + geom_bar(fill = NA, position = "identity")

"fill" stacks the bars and scales every stack to the same height, so each x group adds up to 100%.

ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = clarity),
position = "fill"
)

"dodge" places the bars side by side, one next to the other.

ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = clarity),
position = "dodge"
)

position = "jitter" adds a small amount of random noise to each point, which reveals points that would otherwise overlap.

ggplot(data = mpg) +
geom_point(
mapping = aes(x = displ, y = hwy),
position = "jitter"
)
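The amount of overplotting can be measured, and the jitter made reproducible, with a short sketch. It assumes ggplot2 is installed; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# Many observations share the same (displ, hwy) pair and overlap exactly.
n_hidden <- sum(duplicated(mpg[, c("displ", "hwy")]))
n_hidden

# position_jitter() exposes the amount of noise and a seed for reproducibility.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 1))
p
```

With a fixed seed the jittered positions are the same every time the plot is built, which helps when a figure has to be regenerated.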

For details, see the help pages ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

ggplot(mtcars,aes(factor(cyl),fill=factor(vs))) + 
  geom_bar(position = position_dodge(preserve = 'total'))

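Exercises:

Which parameters of geom_jitter() control the amount of jitter?
width and height control the jitter in the horizontal and vertical directions.

3. Compare geom_jitter() and geom_count()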

# geom_jitter() adds random noise to the points
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
# geom_count() draws a larger point where more observations overlap
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

What is the default position adjustment of geom_boxplot()?

ggplot(data = mpg, mapping = aes(x = drv, y = hwy,color=class)) +
  geom_boxplot()
ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
  geom_boxplot(position = "identity")

The default is position_dodge2() (position_dodge() in older ggplot2 versions).

3.9 Coordinate Systems

ggplot2 uses the Cartesian coordinate system by default, in which the x and y positions act independently.
coord_flip() swaps the x and y axes.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()

coord_quickmap() sets the correct aspect ratio for maps.
map_data() needs the maps package to be installed; without it the code fails.

library(maps)
# If this fails, install the package first: install.packages("maps")
#library(maps)
nz <- map_data("nz")
 
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
# geom_polygon() draws polygons
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

coord_polar() uses polar coordinates.

bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()

# width = 1 removes the gaps between the bars
ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar()
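# theta = "y" maps the angle to the y variable, which gives a pie chart; without it the angle follows x and the stacked bar becomes concentric rings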
ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar()
ggplot(diamonds) + geom_bar(aes(x=cut,fill=cut))+coord_polar()

# A bar chart with several groups can also be drawn in polar coordinates
head(diamonds)
ggplot(diamonds, aes(cut, fill = color)) +
  geom_bar(position = "fill") # note the position argument; the default for geom_bar() is position = "stack"

ggplot(diamonds,aes(cut,fill=color)) + 
  geom_bar(position = "fill") + 
  coord_polar(theta = "y")

Exercise 3.9.2 labs() adds x and y axis labels and a title to the plot

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "Highway MPG", x = "", title = "Highway MPG by car class")

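Exercise 3.9.4
coord_fixed() fixes the aspect ratio so that one unit on x has the same length as one unit on y, which draws the geom_abline() reference line at 45 degrees.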

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline()
p
p + coord_fixed()

3.10 The Layered Grammar of Graphics
The general ggplot2 template

ggplot(data = <DATA>) + # data set
<GEOM_FUNCTION>( # geometric object (geom)
mapping = aes(<MAPPINGS>), # aesthetic mappings
stat = <STAT>, # statistical transformation
position = <POSITION> # position adjustment
) +
<COORDINATE_FUNCTION> + # coordinate system
<FACET_FUNCTION> # facets

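A plot is built from the first five components above (data, geom, mappings, stat, and position); the last two, the coordinate system and the facets, are used for fine-tuning.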
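As an illustration, here is one way to fill in every slot of the template with the diamonds data. The particular choices (a filled bar chart, flipped coordinates, facets by color) are only an example, not the only valid combination.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +               # <DATA>
  geom_bar(                                  # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),  # <MAPPINGS>
    stat = "count",                          # <STAT>
    position = "fill"                        # <POSITION>
  ) +
  coord_flip() +                             # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                        # <FACET_FUNCTION>
p
```

In practice stat and position are often left at their defaults, and the coordinate system and facets are added only when the plot needs them.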

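Recommended reading:
生信技能树 free video collection. Suggested learning order: Linux, R, software installation, GEO, tips and tricks, NGS omics.
Bilibili: https://m.bilibili.com/space/338686099
YouTube: https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists
A beginner's guide for bioinformatics engineers: https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA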

Last edited: 2018.11.17 10:18:09
