Introduction to R
-
Arithmetic
- Modulo
5 %% 4
# 1
- Exponentiation
2 ^ 5
# 32
-
Vectors
c("hello", "hi", "hola")
c(12, 23, 44, 53)
poker_vector <- c(140, -50, 20)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday")  # name the elements of the vector
poker_vector
#    Monday   Tuesday Wednesday
#       140       -50        20
Poker1_vector <- poker_vector[2:3]  # elements 2 to 3; unlike Python, indices start at 1 and the end of the range is included
Poker1_vector
#   Tuesday Wednesday
#       -50        20
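A minimal sketch of the three common ways to subset a named vector, reusing the poker_vector values from the notes above. The names by_position, by_name and winning_days are made up for this illustration.

```r
poker_vector <- c(140, -50, 20)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday")

# By position: indices start at 1 and 2:3 includes both ends
by_position <- poker_vector[2:3]

# By name: the same two elements
by_name <- poker_vector[c("Tuesday", "Wednesday")]

# By condition: a logical vector keeps the elements where it is TRUE
winning_days <- poker_vector[poker_vector > 0]

print(by_position)
print(identical(by_position, by_name))
print(names(winning_days))
```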
-
Matrices
matrix(1:9, byrow = TRUE, nrow = 3)  # a basic matrix
The argument byrow indicates that the matrix is filled by rows. To fill the matrix by columns, set byrow = FALSE (the default).
rownames(my_matrix) <- row_names_vector  # name the rows
colnames(my_matrix) <- col_names_vector  # name the columns
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                                           c("US", "non-US")))
worldwide_vector <- rowSums(star_wars_matrix)  # sum of each row
total_revenue_vector <- colSums(all_wars_matrix)  # sum of each column
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)  # bind as columns
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)  # bind as rows
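A small runnable sketch of the matrix functions above. The film names and revenue figures are hypothetical placeholders, not the real box office data used in the notes.

```r
# Hypothetical revenue figures, purely for illustration
box_office <- c(10, 20, 30, 40, 50, 60)

film_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                      dimnames = list(c("Film A", "Film B", "Film C"),
                                      c("US", "non-US")))

# byrow = TRUE fills row by row, so Film A gets 10 and 20
worldwide_vector <- rowSums(film_matrix)  # one total per film
region_totals <- colSums(film_matrix)     # one total per region

# cbind() adds a column, rbind() adds a row
with_total <- cbind(film_matrix, worldwide = worldwide_vector)
with_footer <- rbind(film_matrix, Total = region_totals)

print(with_total)
print(with_footer)
```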
-
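Factors: vectors of categorical values, optionally with ordered levels
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
levels(factor_survey_vector) <- c("Female", "Male")  # rename the levels; the new names must follow the order of the existing levels
summary(levels(survey_vector))  # prints Length, Class, and Mode
summary(factor_speed_vector)  # count of observations in each level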
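A short sketch of what an ordered factor buys you, and of why the order matters when renaming levels. The input vectors are invented for the example.

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

# summary() on a factor counts the observations per level
print(summary(factor_temperature_vector))

# Ordered factors can be compared: is "High" greater than "Low"?
first_is_warmer <- factor_temperature_vector[1] > factor_temperature_vector[2]
print(first_is_warmer)

# Renaming levels: levels() are sorted alphabetically ("F", "M") by default,
# so the new names have to be given in that same order
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
print(factor_survey_vector)
```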
-
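Data frames
head(mtcars, 2)  # first 2 rows
tail(mtcars, 3)  # last 3 rows
str(mtcars)  # inspect the structure
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)  # a logical vector
planets_df <- data.frame(name, type, diameter, rotation, rings)  # a data frame built from several vectors
my_df[1:3, 2:4]  # select from a data frame: rows 1 to 3, columns 2 to 4
planets_df[1:5, "diameter"]  # select by column name: rows 1 to 5 of the diameter column
planets_df$rings  # also selects a column by name
subset(my_df, subset = some_condition)  # keep the rows that meet a condition
order(a)  # for a vector a, gives the positions of its elements in ascending order
a[order(a)]  # the vector a, sorted
positions <- order(planets_df$diameter)
planets_df[positions, ]  # sort the rows by the values of one column
The subset() function selects the rows of a data frame that meet a condition, and optionally only some of its columns: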
selectresult=subset(df1,name=="aa")
selectresult=subset(df1,name=="aa",select=c(age,sex))
selectresult=subset(df1,name=="aa" & sex=="f",select=c(age,sex))
names()  # show the column names of a data frame
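A minimal sketch that puts subset() and order() together on a tiny data frame. The four-row planets_df below is a cut-down stand-in for the one in the notes, with diameters expressed relative to Earth.

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1.000, 0.532),  # relative to Earth
  rings = c(FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Rows that meet a condition, restricted to two columns
small_planets <- subset(planets_df, subset = diameter < 1,
                        select = c(name, diameter))

# order() returns row positions; indexing with them sorts the data frame
positions <- order(planets_df$diameter)
sorted_df <- planets_df[positions, ]

print(small_planets)
print(sorted_df$name)
```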
-
Lists
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
shining_list <- list(moviename = mov, actors = act, reviews = rev)  # the pattern is: element name = object to store
Three ways to select an element of a list:
shining_list[[1]]
shining_list[["reviews"]]
shining_list$reviews
ext_list <- c(my_list, my_name = my_val)  # add an element to the list and give it a name
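A quick sketch showing that the selection styles return the same element, while single brackets return a sub-list. The contents of shining_list are placeholder values chosen for the example.

```r
shining_list <- list(moviename = "The Shining",
                     actors = c("Jack Nicholson", "Shelley Duvall"),
                     reviews = c(4.5, 4.0, 5.0))

by_position <- shining_list[[3]]
by_name <- shining_list[["reviews"]]
by_dollar <- shining_list$reviews

# Single brackets keep the list wrapper around the element
sub_list <- shining_list["reviews"]

# c() appends a new named element
ext_list <- c(shining_list, year = 1980)

print(identical(by_position, by_name))
print(class(sub_list))
print(names(ext_list))
```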
Intermediate R
-
Conditionals and Control Flow
if (condition) {
  expr
}  # if statement
if (condition) {
  expr1
} else {
  expr2
}  # if + else statement
if (condition1) {
  expr1
} else if (condition2) {
  expr2
} else if (condition3) {
  expr3
} else {
  expr4
}  # chain of conditions with else if
-
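Loops
while (condition) {
  expr
}  # while loop
A while loop can contain if statements
break  # exit the loop
# loop version 1: iterate over the elements
for (p in primes) {
  print(p)
}
# loop version 2: iterate over the indices
for (i in 1:length(primes)) {
  print(primes[i])
}
paste(..., sep = " ", collapse = NULL)  # concatenate strings
for (var1 in seq1) {
  for (var2 in seq2) {
    expr
  }
}  # nested loops
next: skip the current iteration and continue with the following ones
break: stop the loop at the current iteration
substr("abcdef", 2, 4)  # extract the characters at positions 2 to 4 of "abcdef"
substring("abcdef", 1:6, 1:6)  # extract positions 1 to 1, 2 to 2, ..., 6 to 6 of "abcdef", i.e. split the string into single characters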
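A small sketch contrasting next and break inside a for loop, followed by the string helpers from the notes. The primes vector and the names shown, middle, letters_of_word and joined are illustrative.

```r
primes <- c(2, 3, 5, 7, 11, 13)

shown <- c()
for (p in primes) {
  if (p == 5) next    # skip 5, carry on with 7
  if (p > 10) break   # stop the loop as soon as 11 is reached
  shown <- c(shown, p)
}
print(shown)

# substr() takes one slice; substring() is vectorised over start and stop
middle <- substr("abcdef", 2, 4)
letters_of_word <- substring("abcdef", 1:6, 1:6)

# paste() with collapse glues a vector back into one string
joined <- paste(letters_of_word, collapse = "")
print(middle)
print(joined)
```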
-
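Functions
sample(x, size, replace = FALSE, prob = NULL)  # random sampling: size is the number of items to draw, replace says whether an item may be drawn more than once, prob gives the probability weight of each element
args()  # takes a function name and shows that function's argument names and their default values
mean(x, trim = 0, na.rm = FALSE, ...)
trim gives a trimmed mean: a value between 0 and 0.5, the fraction of observations dropped from each end before averaging. For example, 0.10 discards the largest 10% and the smallest 10% of the values. The default is 0.
na.rm is a logical value: whether to drop NA values before computing.
sd(x, na.rm = FALSE)  # standard deviation
install.packages()  # install a package
library()  # load a package
search()  # list the packages currently attached (the search path), not everything that is installed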
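A minimal sketch of how trim and na.rm change the result of mean(), plus a reproducible sample() call. The vector x is invented, and set.seed(1) is only there so that repeated runs draw the same values.

```r
x <- c(1, 2, 3, 4, 100, NA)

plain_mean <- mean(x)                 # NA, because one value is missing
clean_mean <- mean(x, na.rm = TRUE)   # NA dropped first

# trim = 0.2 drops the lowest and the highest 20% of the 5 remaining
# values (one value at each end), leaving 2, 3 and 4
trimmed_mean <- mean(x, trim = 0.2, na.rm = TRUE)

print(c(plain_mean, clean_mean, trimmed_mean))

# Draw 3 distinct numbers from 1 to 10
set.seed(1)
draw <- sample(1:10, size = 3, replace = FALSE)
print(draw)
```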
-
The apply family
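lapply(X, FUN, ...)
lapply(data, function to apply, arguments of that function)  # works on a list or vector and always returns a list
split_math <- strsplit(pioneers, split = ":")  # split strings, roughly the inverse of paste
tolower() toupper()  # change the case
split_low <- lapply(split_math, tolower)
names <- lapply(split_low, function(x){x[1]})  # the first element of each split string
select_el <- function(x, index) {
  x[index]
}
names <- lapply(split_low, select_el, index = 1)  # define a helper function, then pass it to lapply to pick the first element
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)  # like lapply, but simplifies the result to a vector or matrix when possible
identical(A, B)  # test whether A and B are exactly equal; TRUE if they are
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)  # FUN.VALUE is a template for what FUN returns: its type and its length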
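A short sketch comparing the return types of lapply, sapply and vapply on the same input. The pioneers vector is a small sample in the "NAME:YEAR" format the notes assume.

```r
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623")

split_math <- strsplit(pioneers, split = ":")
split_low <- lapply(split_math, tolower)

select_el <- function(x, index) {
  x[index]
}

# lapply always returns a list
names_list <- lapply(split_low, select_el, index = 1)

# sapply simplifies the same result to a character vector
names_vec <- sapply(split_low, select_el, index = 1)

# vapply does the same, but fails loudly unless every result
# is a single character string, as FUN.VALUE demands
years <- vapply(split_low, select_el, FUN.VALUE = character(1), index = 2)

print(class(names_list))
print(names_vec)
print(years)
```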
-
Utilities
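abs()  # absolute value
round()  # round to a given number of digits
rev()  # reverse
sort(x, decreasing = FALSE)  # sort in ascending order; set decreasing = TRUE for descending order
unlist()  # flatten a list into a vector
append()  # add elements to a vector or list
seq(from, to, by = )  # start, end, and step of the sequence
seq(from, to, length.out = )  # length.out is the number of elements in the sequence
seq(along.with = )  # generate the indices of an existing vector
seq(length.out = )  # generate a sequence that starts at 1, with step 1 and length length.out
rep(x, each = , times = )  # repeat: each repeats every element, times repeats the whole vector
is.*(): Check for the class of an R object.
as.*(): Convert an R object from one class to another.
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)  # indices of the elements of x that match the pattern
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)  # TRUE/FALSE for each element of x: does it match the pattern?
sub and gsub replace text in strings: sub replaces only the first match, gsub replaces all matches
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)  # replacement is the text that is put in place of each match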
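A minimal sketch of grepl, grep, sub and gsub side by side. The emails vector is made up for the example.

```r
emails <- c("john.doe@example.com", "invalid.address", "jane@example.org")

has_at <- grepl("@", emails)                 # one TRUE/FALSE per element
at_index <- grep("@", emails)                # positions of the matches
at_values <- grep("@", emails, value = TRUE) # the matching strings themselves

first_only <- sub("e", "E", "eleven")   # replaces the first "e" only
all_of_them <- gsub("e", "E", "eleven") # replaces every "e"

# "." is a regex wildcard; fixed = TRUE treats it as a literal dot
literal_dot <- sub(".", "_", "a.b.c", fixed = TRUE)

print(has_at)
print(at_values)
print(c(first_only, all_of_them, literal_dot))
```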
today <- Sys.Date()  # current date
now <- Sys.time()  # current date and time
unclass()  # remove the class attribute, e.g. to see the number a Date is stored as
%Y: 4-digit year (1982)
%y: 2-digit year (82)
%m: 2-digit month (01)
%d: 2-digit day of the month (13)
%A: weekday (Wednesday)
%a: abbreviated weekday (Wed)
%B: month (January)
%b: abbreviated month (Jan)
format()  # format a date or time using the symbols listed here
as.Date()
as.Date(ISOdate(year, month, day))  # convert to a Date object
%H: hours as a decimal number (00-23)
%I: hours as a decimal number (01-12)
%M: minutes as a decimal number
%S: seconds as a decimal number
%T: shorthand notation for the typical format %H:%M:%S
%p: AM/PM indicator
as.POSIXct()
diff()  # differences between consecutive elements
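A small sketch of parsing, formatting and differencing dates with the format symbols listed above. Only numeric symbols are used, because weekday and month names depend on the locale. The dates are arbitrary examples.

```r
d <- as.Date("1982-01-13")

# Format a Date using the symbols from the notes
as_text <- format(d, "%d/%m/%Y")

# Parse a non-standard string by describing its layout
parsed <- as.Date("13/01/82", format = "%d/%m/%y")

# A Date is stored as the number of days since 1970-01-01
days_since_epoch <- unclass(d)

# diff() on dates gives the gaps between consecutive entries
visits <- as.Date(c("2024-01-01", "2024-01-08", "2024-01-10"))
gaps <- diff(visits)

# Times work the same way; the time zone is fixed here for reproducibility
t1 <- as.POSIXct("2024-01-01 13:30:00", tz = "UTC")
clock <- format(t1, "%H:%M")

print(as_text)
print(gaps)
print(clock)
```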
Intermediate R - Practice
hist()  # draw a histogram
boxplot()  # draw a box plot
Introduction to the Tidyverse
-
Data wrangling
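library()  # load a package
library(dplyr)  # data manipulation package: filter, arrange, mutate, summarise, and joins between tables
library(gapminder)  # sample data taken from Gapminder
gapminder %>% filter(year == 1957)  # filter() from dplyr keeps only the rows for 1957
arrange(hflights_df, DayofMonth, Month, Year)  # arrange() from dplyr sorts the rows by the given columns
arrange(gapminder, lifeExp)  # ascending order
arrange(gapminder, desc(lifeExp))  # descending order
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))  # in a pipeline, gapminder is passed on to each step automatically, so it is written only once
mutate()  # compute on existing columns and add the result as a new column:
mutate(gapminder, month = 12 * lifeExp)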
-
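Data visualization
library(ggplot2)
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()  # scatter plot
When the x or y values are spread over several orders of magnitude, put that axis on a log scale:
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) + geom_point() + scale_x_log10()
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size = gdpPercap)) + scale_x_log10() +
  geom_point()  # a complete scatter plot, with color and point size mapped to variables
facet_wrap(~ continent)  # faceting: added at the end, splits the plot into one small panel per continent
expand_limits(y = 0)  # added at the end, makes the y-axis include 0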
-
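Grouping and summarizing
median(x, na.rm = FALSE, ...)  # median
summarise(.data, ...)  # collapse the (grouped) data into summary rows; several summaries can be given, separated by commas
group_by(.data, ..., add = FALSE)  # group the rows; add = TRUE adds to the existing grouping instead of replacing it (the argument is named .add in dplyr 1.0.0 and later)
ungroup(x, ...)  # remove the grouping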
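A sketch that chains the dplyr verbs from these notes into one pipeline. It assumes dplyr is installed, and it uses the built-in mtcars data set as a stand-in for gapminder so that no extra data package is needed. The column names hp_per_wt, n_cars and median_ratio are invented for the example.

```r
library(dplyr)

cars_summary <- mtcars %>%
  filter(hp > 100) %>%                 # keep the more powerful cars
  mutate(hp_per_wt = hp / wt) %>%      # add a derived column
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(n_cars = n(),
            median_ratio = median(hp_per_wt)) %>%
  arrange(desc(n_cars))                # largest group first

print(cars_summary)
```

Each step receives the result of the previous one as its first argument, which is why the data frame is named only once at the top.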
-
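Types of visualizations
geom_point()  # scatter plot
geom_line()  # line plot
geom_col()  # bar plot
geom_histogram()  # histogram; only the x aesthetic is needed, the counts are computed for you
geom_boxplot()  # box plot
labs(title = "Comparing GDP per capita across continents")  # added at the end, sets the title; likewise x = "x" sets the x-axis label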
Importing Data in R (Part 1)
pools <- read_csv("swimming_pools.csv")
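pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)  # stringsAsFactors is an argument of base R's read.csv(), not of readr's read_csv()
read.delim("hotdogs.txt")  # header = TRUE by default: the first line holds the column names
read_tsv(...., col_names = c(.....))  # col_names sets the name of each column
read_tsv also accepts skip = the number of lines to skip before reading, and n_max = the maximum number of lines to read. For example, to read only lines 2 and 3, use skip = 1, n_max = 2.
Also in read_tsv, col_types = "cdil_" sets the column types: c = character, d = double, i = integer, l = logical, _ = skip the column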
path <- file.path("data", "hotdogs.txt")
hotdogs <- read.table(path,
sep = "",
col.names = c("type", "calories", "sodium"))
head(hotdogs)
which.min(x)  # returns the position (index) of the minimum value
tom <- hotdogs[which.max(hotdogs$sodium), ]
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE,
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))  # more options for reading txt files: in colClasses, "NULL" skips a column and NA lets R choose the type
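A self-contained sketch of the colClasses behaviour described above. It writes a three-line tab-separated file to a temporary path first; the rows are sample values, not the real hotdogs.txt.

```r
# Create a small tab-separated file without a header row
path <- tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495", "Meat\t173\t458", "Poultry\t129\t430"), path)

# "NULL" in colClasses drops the second column entirely
hotdogs2 <- read.delim(path, header = FALSE,
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))
str(hotdogs2)

# which.max() gives the row position of the largest sodium value
saltiest <- hotdogs2[which.max(hotdogs2$sodium), ]
print(saltiest)
```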
potatoes <- read_delim("potatoes.txt", delim = "\t", col_names = properties)
library(data.table)
read.table()  # read a file into a data frame
fread()  # similar to read.table, but faster and more convenient
potatoes <- fread("potatoes.csv", select = c(6, 8))  # import only columns 6 and 8
plot(potatoes$texture, potatoes$moistness)  # scatter plot
-
Importing Excel data
library(readxl)  # read .xls and .xlsx files
excel_sheets("urbanpop.xlsx")
data <- read_excel("data.xlsx", sheet = "my_sheet")
my_workbook <- lapply(excel_sheets("data.xlsx"),
                      read_excel,
                      path = "data.xlsx")
pop_a <- read_excel("urbanpop_nonames.xlsx", col_names = FALSE)
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)  # import an Excel file that has no header row and supply the column names
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)
head(urban_pop, n = 11)  # first 11 rows: head(..., n = )
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)  # read.xls() comes from the gdata package
na.fail(object, ...)  # returns the object only if it has no missing values, otherwise raises an error
na.omit(object, ...)  # drops the cases with missing values and returns the rest
na.exclude(object, ...)
na.pass(object, ...)  # returns the object unchanged
excel_sheets("urbanpop.xlsx")  # list the sheet names of the workbook
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)  # import one sheet
-
Reproducible Excel work with XLConnect
library(XLConnect)
loadWorkbook(filename, create = FALSE, password = NULL)  # load an Excel workbook; create = TRUE creates the file if it does not exist
my_book <- loadWorkbook("urbanpop.xlsx")
getSheets(my_book)  # list the sheets in my_book
readWorksheet(my_book, sheet = 2)  # read a sheet of the workbook
createSheet(object, name)  # create a new sheet
writeWorksheet(object, data, sheet, startRow, startCol, header, rownames)  # write data to a sheet
saveWorkbook(object, file)  # save the workbook to the associated Excel file
renameSheet(object, sheet, newName)  # rename a sheet
removeSheet(my_book, sheet = 4)  # remove a sheet
Importing Data in R
-
Importing data from databases (Part 1)
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")
table_names <- dbListTables(con)  # names of the tables, as a character vector
dbReadTable(conn, name, ...)
dbWriteTable(conn, name, value, ...)
tables <- lapply(table_names, dbReadTable, conn = con)  # import all tables
dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")  # the result comes back as a data frame
CHAR_LENGTH(name)  # SQL function: the number of characters in name
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")  # send a query to the database without fetching the results yet
dbFetch(res, n = 1, ...)  # fetch the next n rows and return them as a data frame
dbClearResult(res, ...)  # free the result set
dbConnect(drv, ...)
dbDisconnect(conn, ...)
-
Importing data from the web (Part 1)
read.csv
library(readr)
read_csv
library(gdata)
read.xls()
download.file(url_xls, destfile = "local_latitude.xls")  # download an .xls file from a URL
library(httr)
url <- "http://www.example.com/"
resp <- GET(url)  # request the URL and store the response in resp
raw_content <- content(resp, as = "raw")  # extract the body of resp; as sets the form it is returned in ("raw", "text", or "parsed")
library(jsonlite)
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
wine <- fromJSON(wine_json)
str(wine)
sw4 <- fromJSON(url_sw4)
sw4$Title
json1 <- '[1, 2, 3, 4, 5, 6]'
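fromJSON(json1)  # gives a vector of numbers
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)  # gives a list with two separate vectors, a and b
json1 <- '[[1, 2], [3, 4]]'
fromJSON(json1)  # gives a 2 x 2 matrix
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
fromJSON(json2)  # gives a data frame with columns a and b
water_json <- toJSON(water)  # toJSON() converts an R object such as a data frame to JSON
JSON can be written in a pretty (indented) or a mini (compact) form
toJSON(water, pretty = TRUE) produces the pretty form; the default, pretty = FALSE, produces the mini form
For text that is already JSON, prettify() / minify() convert it to the corresponding form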
-
Importing data from statistical software packages
library(haven)
The haven package can load:
SAS: read_sas()
STATA: read_dta(), read_stata()
SPSS: read_sav(), read_por()
traits <- read_sav("person.sav")
summary(traits)  # shows, for each column of traits, the number of NA values, the minimum, the maximum, and so on
subset(traits, Extroversion > 40 & Agreeableness > 40)  # select a subset of the rows by condition
as_factor(work$GENDER)  # convert this column to a factor
tail(florida, n = 6)  # last six rows
demo <- read.spss("international.sav", to.data.frame = TRUE)  # read.spss() from the foreign package reads an SPSS file; to.data.frame = TRUE returns a data frame
boxplot(demo$gdp)  # draw a box plot
cor(size$height, size$width)  # correlation
demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)  # here variables with value labels are not converted to R factors
Cleaning Data in R
-
Introduction and exploring raw data
dim()  # number of rows and columns (dimensions) of a data frame
-
Tidying data
library(tidyr)
gather(data, key, value, -col, na.rm = FALSE, convert = FALSE)  # turn wide data into long data: key is the name of the new column that holds the old column names, value is the name of the new column that holds their values, and -col leaves the identifier column out of the gathering
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)  # turn long data into wide data
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")  # split one column into two
bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")  # merge the columns again: the inverse of separate()
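A minimal sketch of a wide-to-long-to-wide round trip, followed by separate(). It assumes tidyr is installed; gather() and spread() still work but are superseded by pivot_longer() and pivot_wider() in current tidyr. The data frames wide_df and bmi_cc are tiny made-up examples.

```r
library(tidyr)

wide_df <- data.frame(col = c("X", "Y"),
                      A = c(1, 4), B = c(2, 5), C = c(3, 6),
                      stringsAsFactors = FALSE)

# Wide to long: the column names A, B, C go into my_key,
# their values go into my_val, and col is left alone
long_df <- gather(wide_df, key = my_key, value = my_val, -col)

# Long back to wide
back_to_wide <- spread(long_df, key = my_key, value = my_val)

# Split one column into two at the "/"
bmi_cc <- data.frame(Country_ISO = c("Afghanistan/AF", "Albania/AL"),
                     stringsAsFactors = FALSE)
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO,
                         into = c("Country", "ISO"), sep = "/")

print(long_df)
print(back_to_wide)
print(bmi_cc_clean)
```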
-
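Preparing data for analysis
Variable classes in R: "character", "numeric", "integer" (class(99L)), "factor" (class(factor("factor"))), "logical" (TRUE/FALSE)
library(lubridate)  # convert dates stored as character strings into date objects, for example:
mdy_hm("July 15, 2012 12:56")  # month, day, year, hour, minute: the function name follows the order of the components in the raw data
library(stringr)
str_trim(c(" Filip ", "Nick ", " Jonathan"))  # str_trim() removes leading and trailing whitespace
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")  # pad on the left with "0" to width 9, restoring leading zeros that were lost
toupper()  # convert to upper case
tolower()  # convert to lower case
str_detect(c("banana", "kiwi"), "a")  # does each string contain "a"?
str_replace(c("banana", "kiwi"), "a", "o")  # replace "a" with "o" in each string, but only the first "a"
str_replace_all(c("banana", "kiwi"), "a", "o")  # replace every "a"
na.omit(social_df)  # remove the rows of social_df that contain NA
complete.cases()  # returns a logical vector: TRUE for each row that has no missing values
hist()  # draw a histogram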
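A short sketch of finding and removing missing values with base R. The social_df below is an invented four-row example in which one value is NA and two status entries are empty strings.

```r
social_df <- data.frame(name = c("Sarah", "Tom", "David", "Alice"),
                        n_friends = c(244, NA, 145, 43),
                        status = c("Going out!", "", "Movie night...", ""),
                        stringsAsFactors = FALSE)

n_missing <- sum(is.na(social_df))            # total number of NA values
complete_rows <- complete.cases(social_df)    # TRUE where a row has no NA
missing_at <- which(is.na(social_df$n_friends))  # row index of the NA

# na.omit() drops rows with NA; all columns are kept
cleaned <- na.omit(social_df)

# Empty strings are not NA, so recode them before filtering
social_df$status[social_df$status == ""] <- NA
cleaned_strict <- na.omit(social_df)

print(n_missing)
print(complete_rows)
print(cleaned_strict$name)
```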
-
Putting it all together
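class(weather)  # class of the object
dim(weather)  # number of rows and columns
names(weather)  # names of the columns
str(weather)  # structure of the data
library(dplyr)
glimpse(weather)  # another way to look at the structure
summary(weather)  # summary statistics for each column
as.character()
as.numeric()  # convert a variable to a different type
sum(is.na(weather6))  # number of missing values in weather6
summary(weather6)  # shows where the missing values are, column by column
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))  # row indices of the missing values in the given column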