R for Data Science (10): Working with Strings Using stringr

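This chapter introduces string manipulation in R. You will learn the basics of how strings work and how to create them by hand, but the focus of the chapter is regular expressions (regexps). Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.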

library(tidyverse)
library(stringr)

10.2 String basics

# Create strings
string1 <- "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'

To include a literal single or double quote in a string, use \ to "escape" it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

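To include a literal backslash in a string, you need to double it: "\\".
Note that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of a string, use writeLines():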

x <- c("\"", "\\")
x
writeLines(x)

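Other common special characters are the newline "\n" and the tab "\t".
Multiple strings are usually stored in a character vector, which you create with c():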

c("one", "two", "three")

10.2.1 String length
All stringr functions start with str_. For example, str_length() returns the number of characters in a string:

str_length(c("a", "R for data science", NA))

10.2.2 Combining strings

str_c("x", "y")

str_c("x", "y", "z")

str_c("x", "y", sep = ", ")

x <- c("abc", NA)
str_c("|-", x, "-|")

# Like most R functions, missing values are contagious. To print them as "NA", use str_replace_na()
str_c("|-", str_replace_na(x), "-|")

str_c() is vectorised: it automatically recycles shorter vectors to the same length as the longest one.

str_c("prefix-", c("a", "b", "c"), "-suffix")

10.2.3 Subsetting strings
Use str_sub() to extract part of a string. As well as the string, str_sub() takes start and end arguments, which give the (inclusive) positions of the substring:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

# Negative numbers count backwards from the end
str_sub(x, -3, -1)

You can also use the assignment form of str_sub() to modify strings:

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

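10.2.5 Exercises
(1) In code that doesn't use stringr, you'll often see paste() and paste0(). What is the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?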

paste("foo", "bar")
#> [1] "foo bar"
paste0("foo", "bar")
#> [1] "foobar"

str_c("foo", "bar")
#> [1] "foobar"
str_c("foo", NA)
#> [1] NA
paste("foo", NA)
#> [1] "foo NA"
paste0("foo", NA)
#> [1] "fooNA"

(2) In your own words, describe the difference between the sep and collapse arguments to str_c().
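
A small sketch of the difference, using a made-up vector x: sep is the string placed between the arguments being combined, while collapse is the string used to join the elements of the result into a single string.

```r
library(stringr)

x <- c("a", "b", "c")

# sep goes between the arguments, element by element; the result keeps length 3
by_sep <- str_c(x, "1", sep = "-")
by_sep
#> [1] "a-1" "b-1" "c-1"

# collapse joins the elements of the result into one string of length 1
by_collapse <- str_c(x, collapse = "-")
by_collapse
#> [1] "a-b-c"

# both together: combine element-wise first, then collapse
both <- str_c(x, "1", sep = "-", collapse = "|")
both
#> [1] "a-1|b-1|c-1"
```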

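(3) Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
The solution below takes the left of the two middle characters when the length is even.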

x <- c("a", "abc", "abcd", "abcde", "abcdef")
L <- str_length(x)
m <- ceiling(L / 2)
str_sub(x, m, m)
#> [1] "a" "b" "b" "c" "c"

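(4) What does str_wrap() do? When might you want to use it?
It wraps text into lines of a fixed width, which is useful when formatting long text for display.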

thanks_path <- file.path(R.home('doc'),'thanks')
thanks <- str_c(readLines(thanks_path),collapse = '\n')
thanks <- word(thanks,1,3,fixed("\n\n"))
cat(str_wrap(thanks),"\n")

(5) What does str_trim() do? What is its opposite?
str_trim() removes whitespace from both ends of a string; str_pad() adds whitespace to one or both ends.

str_trim(" abc ")
#> [1] "abc"
str_trim(" abc ", side = "left")
#> [1] "abc "
str_trim(" abc ", side = "right")
#> [1] " abc"

str_pad("abc", 5, side = "both")
#> [1] " abc "
str_pad("abc", 4, side = "right")
#> [1] "abc "
str_pad("abc", 4, side = "left")
#> [1] " abc"

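(6) Write a function that turns a character vector into a single string, for example turning c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.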

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep("")
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"

10.3 Matching patterns with regular expressions
10.3.1 Basic matches

x <- c("apple", "banana", "pear")
str_view(x, "an")

str_view(x, ".a.")

# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself contains only one \:
writeLines(dot)
#> \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

x <- "a\\b"
writeLines(x)
#> a\b
str_view(x, "\\\\")

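To match a literal ., the regular expression is \. and the string that creates it is "\\.". To match a literal backslash, the regular expression is \\ and the string that creates it is "\\\\": you need four backslashes to match one!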
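
The two levels of escaping can be checked directly. This is a small sketch; the variable names are made up for illustration.

```r
library(stringr)

# The string "\\\\" holds two characters, i.e. the regular expression \\,
# which matches one literal backslash.
backslash_pattern <- "\\\\"
pattern_chars <- str_length(backslash_pattern)
pattern_chars
#> [1] 2

has_backslash <- str_detect(c("a\\b", "ab"), backslash_pattern)
has_backslash
#> [1]  TRUE FALSE

# The string "a\\.c" is the regular expression a\.c, which matches a literal dot.
has_literal_dot <- str_detect(c("abc", "a.c"), "a\\.c")
has_literal_dot
#> [1] FALSE  TRUE
```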

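10.3.2 Exercises
(1) Explain why each of these strings doesn't match a \: "\", "\\", "\\\".
(2) How would you match the sequence "'\ ?
(3) What patterns will the regular expression \..\..\.. match? How would you represent it as a string?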
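
A possible sketch for exercises (2) and (3). It assumes exercise (3) refers to the regular expression \..\..\.., as in the original R for Data Science text; the test strings are made up for illustration.

```r
library(stringr)

# (2) The sequence "'\ as an R string, and a pattern that matches it.
# In the pattern, \" is a double quote and \\\\ is one literal backslash.
target <- "\"'\\"
writeLines(target)
#> "'\
matches_sequence <- str_detect(target, "\"'\\\\")
matches_sequence
#> [1] TRUE

# (3) \..\..\.. matches a dot followed by any character, three times in a row.
dot_pattern <- "\\..\\..\\.."
extracted <- str_extract(c("x.a.b.cy", "abc"), dot_pattern)
extracted
#> [1] ".a.b.c" NA
```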

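10.3.3 Anchors
Sometimes you need to anchor a regular expression so that R matches from the start or the end of the string. There are two anchors:
• ^ matches the start of the string.
• $ matches the end of the string.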

x <- c("apple", "banana", "pear")
str_view(x, "^a")

str_view(x, "a$")

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")

A mnemonic for remembering which is which: if you begin with power (^), you end up with money ($).

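10.3.4 Exercises
(1) How would you match the literal string "$^$"? The regular expression is ^\$\^\$$, written as the string "^\\$\\^\\$$".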

str_view(c("$^$", "ab$^$sfas"), "^\\$\\^\\$$")

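(2) Given the corpus of common words in stringr::words, create regular expressions that find all words that:
a. Start with y.
b. End with x.
c. Are exactly three letters long. (Don't cheat by using str_length()!)
d. Have seven letters or more.
Since this list is long, you might want to use the match argument of str_view() to show only the matching words (match = TRUE) or the non-matching words (match = FALSE).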

word <- stringr::words
str_view(word, "^y", match = TRUE)
str_view(word, "x$", match = TRUE)
str_view(word, "^...$", match = TRUE)
str_view(word, ".......", match = TRUE)

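10.3.5 Character classes and alternatives
• \d matches any digit.
• \s matches any whitespace (e.g. space, tab, newline).
• [abc] matches a, b, or c.
• [^abc] matches anything except a, b, or c.
Remember that to write \d or \s in an R string you need to escape the backslash: "\\d" or "\\s".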
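
A short sketch of these classes in use; the input strings are made up for illustration.

```r
library(stringr)

# \d is written "\\d" in an R string; str_extract() returns the first match
first_digit <- str_extract("Room 42", "\\d")
first_digit
#> [1] "4"

# \s matches spaces, tabs, and newlines
no_space <- str_replace_all("a b\tc", "\\s", "_")
no_space
#> [1] "a_b_c"

# [abc] matches any one of a, b, c; [^abc] matches any other character
in_class <- str_extract_all("cabbage", "[abc]")[[1]]
in_class
#> [1] "c" "a" "b" "b" "a"
out_class <- str_extract_all("cabbage", "[^abc]")[[1]]
out_class
#> [1] "g" "e"
```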

10.3.6 Exercises
(1) Create regular expressions to find all words that:
a. Start with a vowel.

str_view(word, "^[aeiou]", match = TRUE)

b. Contain only consonants (hint: think about matching "not"-vowels).

str_view(word, "^[^aeiou]+$", match = TRUE)

c. End with ed, but not with eed.

str_view(word, "(^|[^e])ed$", match = TRUE)

d. End with ing or ise.

str_view(word, "ing$|ise$", match = TRUE)
str_view(stringr::words, "i(ng|se)$", match = TRUE)

(2) Empirically verify the rule "i before e except after c".

str_view(stringr::words, "(cei|[^c]ie)", match = TRUE)

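(3) Is q always followed by a u?

# Words that contain qu
str_view(stringr::words, "qu", match = TRUE)
# Words where q is followed by some other character; any match here would be a counterexample
str_view(stringr::words, "q[^u]", match = TRUE)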

(4) Write a regular expression that matches words written in British English but not in American English.

(5) Create a regular expression that matches telephone numbers as commonly written in your country.

x <- c("123-4560-7890", "1235-2351")
str_view(x, "\\d\\d\\d-\\d\\d\\d\\d-\\d\\d\\d\\d")

10.3.7 Repetition
Another powerful feature of regular expressions is controlling how many times a pattern matches:
• ?: 0 or 1 time.
• +: 1 or more times.
• *: 0 or more times.

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')

• {n}: exactly n times.
• {n,}: n or more times.
• {,m}: at most m times (equivalently {0,m}).
• {n,m}: between n and m times.

str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")

By default these matches are "greedy": they match the longest string possible. You can make them "lazy", matching the shortest string possible, by putting a ? after them.
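
Greedy and lazy matching can be compared side by side on the same Roman numeral string. This sketch uses str_extract(), which returns the first match, so the difference is visible in the result.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Greedy: take as many repetitions as the quantifier allows
greedy_c <- str_extract(x, "C{2,3}")
greedy_c
#> [1] "CCC"
greedy_lx <- str_extract(x, "C[LX]+")
greedy_lx
#> [1] "CLXXX"

# Lazy: a trailing ? makes the quantifier take as few repetitions as possible
lazy_c <- str_extract(x, "C{2,3}?")
lazy_c
#> [1] "CC"
lazy_lx <- str_extract(x, "C[LX]+?")
lazy_lx
#> [1] "CL"
```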
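
10.3.8 Exercises
(1) Describe the equivalents of ?, +, and * in {m,n} form.
? is {0,1}, + is {1,}, and * is {0,}.
(2) Describe in words what these regular expressions match (read carefully to see whether each is a regular expression or a string that defines a regular expression):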
a. ^.*$
b. "\\{.+\\}"
c. \d{4}-\d{2}-\d{2}
d. "\\\\{4}"
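
A possible sketch for exercise (2), assuming items b and d are the strings "\\{.+\\}" and "\\\\{4}" as in the original R for Data Science exercise; the test inputs are made up for illustration.

```r
library(stringr)

# a. ^.*$ matches any complete string (as long as it has no newline)
res_a <- str_detect(c("abc", "a b c"), "^.*$")

# b. "\\{.+\\}" matches curly braces surrounding at least one character
res_b <- str_detect(c("{abc}", "{}"), "\\{.+\\}")

# c. \d{4}-\d{2}-\d{2} matches four digits, two digits, two digits,
#    separated by hyphens, such as an ISO-style date
res_c <- str_detect(c("2024-01-31", "24-1-31"), "\\d{4}-\\d{2}-\\d{2}")

# d. "\\\\{4}" is the regular expression \\{4}: four backslashes in a row
res_d <- str_detect(c("a\\\\\\\\b", "a\\\\b"), "\\\\{4}")

rbind(a = res_a, b = res_b, c = res_c, d = res_d)
```

Each row of the printed matrix shows whether the first and the second test string match the corresponding pattern.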
(3) Create regular expressions to find all words that:
a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.

str_view(stringr::words, "^[^aeiou]{3}", match = TRUE)
str_view(stringr::words, "[aeiou]{3,}", match = TRUE)
str_view(stringr::words, "([aeiou][^aeiou]){2,}", match = TRUE)

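Parentheses also define "groups" that you can refer to with backreferences such as \1, \2, and so on. For example, the following regular expression finds all words that contain a repeated pair of letters: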

str_view(stringr::words, "(..)\\1", match = TRUE)
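
To see what the group and the backreference capture, the same pattern can be tried on a few hand-picked words (chosen here for illustration only):

```r
library(stringr)

fruit_words <- c("banana", "coconut", "apple")

# (..)\1 needs the same two characters twice in a row
repeated <- str_subset(fruit_words, "(..)\\1")
repeated
#> [1] "banana"  "coconut"

# The full match is the doubled pair ...
full_match <- str_extract("banana", "(..)\\1")
full_match
#> [1] "anan"

# ... and the first group holds the pair that \1 refers back to
pair <- str_match("banana", "(..)\\1")[, 2]
pair
#> [1] "an"
```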

10.4.1 Detect matches

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99

df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

# Matches never overlap: "aba" is found twice in "abababa", not three times
str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

10.4.3 Extract matches
To extract the actual text of a match, use str_extract(). To illustrate it, we use the Harvard sentences provided in stringr::sentences:

length(sentences)
#> [1] 720
head(sentences)

colors <- c(
"red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"

more <- sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)
str_extract(more, color_match)
#> [1] "blue" "green" "orange"

str_extract_all(more, color_match)
#> [[1]]
#> [1] "blue" "red"
#>
#> [[2]]
#> [1] "green" "red"
#>
#> [[3]]
#> [1] "orange" "red"

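str_extract() gives the complete match; str_match() gives each individual group. Instead of a character vector, str_match() returns a matrix, with one column for the complete match followed by one column for each group: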

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
#> [1] "the smooth" "the sheet" "the depth" "a chicken"
#> [5] "the parked" "the sun" "the huge" "the ball"
#> [9] "the woman" "a helps"
has_noun %>%
str_match(noun)
#> [,1] [,2] [,3]
#> [1,] "the smooth" "the" "smooth"
#> [2,] "the sheet" "the" "sheet"
#> [3,] "the depth" "the" "depth"
#> [4,] "a chicken" "a" "chicken"
#> [5,] "the parked" "the" "parked"
#> [6,] "the sun" "the" "sun"
#> [7,] "the huge" "the" "huge"
#> [8,] "the ball" "the" "ball"
#> [9,] "the woman" "the" "woman"
#> [10,] "a helps" "a" "helps"
# If your data is in a tibble, it is often easier to use tidyr::extract(). It works like
# str_match() but requires you to name each group; the groups are then placed in new columns
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )

10.4.7 Replacing matches
str_replace() and str_replace_all() replace matches with new strings.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
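
Besides fixed strings, the replacement can use backreferences to insert parts of the match. A minimal sketch, with made-up inputs:

```r
library(stringr)

# Swap the second and third words: \\1, \\2, \\3 refer to the three groups
swapped <- str_replace(
  "one two three",
  "([^ ]+) ([^ ]+) ([^ ]+)",
  "\\1 \\3 \\2"
)
swapped
#> [1] "one three two"

# Reorder the parts of a date-like string
reordered <- str_replace_all(
  "2024-01-31",
  "(\\d{4})-(\\d{2})-(\\d{2})",
  "\\3/\\2/\\1"
)
reordered
#> [1] "31/01/2024"
```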

10.4.9 Splitting
Use str_split() to split a string up into pieces.

sentences %>%
head(5) %>%
str_split(" ")

"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]

sentences %>%
  head(5) %>%
  str_split(" ", simplify = TRUE) # simplify = TRUE returns a matrix

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#> [,1] [,2]
#> [1,] "Name" "Hadley"
#> [2,] "Country" "NZ"
#> [3,] "Age" "35"
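
Instead of splitting on a pattern, str_split() can also split on boundaries with stringr's boundary() helper. A brief sketch comparing it with splitting on a single space (the example string is chosen for illustration):

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Splitting on " " keeps punctuation and yields an empty piece at the double space
by_space <- str_split(x, " ")[[1]]
by_space

# Splitting on word boundaries returns only the words
by_word <- str_split(x, boundary("word"))[[1]]
by_word
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"
#> [7] "another"  "sentence"
```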

apropos() searches all objects available from the global environment. This is useful when you can't quite remember the name of a function:

apropos("replace")

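dir() lists all the files in a directory. Its pattern argument takes a regular expression, in which case only the file names that match the pattern are returned. For example, you can find all the R Markdown files in the current directory with: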

head(dir(pattern = "\\.Rmd$"))
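
Because the result of dir() depends on the working directory, the same pattern can be tried on a plain character vector first. The file names below are hypothetical:

```r
library(stringr)

# Hypothetical file names, for illustration only
files <- c("report.Rmd", "notes.txt", "Rmd_notes.md", "analysis.Rmd.bak")

# \\. matches a literal dot and $ anchors the match at the end of the name
rmd_files <- str_subset(files, "\\.Rmd$")
rmd_files
#> [1] "report.Rmd"
```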
