data.table (Part 2) | Reference semantics

This vignette discusses data.table's reference semantics, which allow you to add, update, and delete columns of a data.table by reference, and to combine these operations with i and by. It is aimed at those who are already familiar with data.table syntax, its general form, how to subset rows in i, select and compute on columns, and perform aggregations by group.

If you're not familiar with these concepts, please read the “Introduction to data.table” vignette first.

Data

We will use the same flights data as in the “Introduction to data.table” vignette.

library(data.table)

flights <- fread("flights14.csv")
flights

dim(flights)

Introduction

In this vignette, we will

1. First, discuss reference semantics briefly and look at the two different forms in which the := operator can be used.

2. Then, see how we can add/update/delete columns by reference in j using the := operator, and how to combine it with i and by.

3. Finally, look at using := for its side effect and how we can avoid that side effect using copy().

Reference semantics

All the operations we have seen so far in the previous vignette resulted in a new data set. We will now see how to add new column(s), and update or delete existing column(s), on the original data.

a) Background

Before we look at reference semantics, consider the data.frame shown below:

DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DF

When we did:

DF$c <- 18:13               # (1) -- replace entire column
# or
DF$c[DF$ID == "b"] <- 15:13 # (2) -- sub-assign in column 'c'

both (1) and (2) resulted in a deep copy of the entire data.frame in versions of R < 3.1, and the data was copied more than once. To improve performance by avoiding these redundant copies, data.table utilised the available but unused := operator in R.

Great performance improvements were made in R v3.1, as a result of which only a shallow copy is made for (1) and not a deep copy. However, for (2), the entire column is still deep copied even in R v3.1+. This means the more columns one sub-assigns to in the same query, the more deep copies R makes.

Shallow vs deep copy

A shallow copy is just a copy of the vector of column pointers (corresponding to the columns in a data.frame or data.table). The actual data is not physically copied in memory.

A deep copy on the other hand copies the entire data to another location in memory.

With data.table's := operator, no copies are made in either (1) or (2), irrespective of the R version you are using. This is because the := operator updates data.table columns in place (by reference).
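
As a rough check of this claim, the sketch below repeats assignments (1) and (2) with := on a small toy table instead of flights. The variable names addr_before and addr_after are made up for illustration, and address() is data.table's helper that reports where an object lives in memory. If the table were copied, the two addresses would be expected to differ.

```r
library(data.table)

# Toy data mirroring the data.frame above (illustrative, not the flights data)
DT <- data.table(ID = c("b", "b", "b", "a", "a", "c"),
                 a = 1:6, b = 7:12, c = 13:18)

addr_before <- address(DT)      # memory address of the data.table itself

DT[, c := 18:13]                # (1) replace the entire column by reference
DT[ID == "b", c := 15:13]       # (2) sub-assign in column 'c' by reference

addr_after <- address(DT)

identical(addr_before, addr_after)  # compare the address before and after
DT
```

Note that neither call assigns its result back to DT; the updates are visible in DT directly.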

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form

DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to, you
# can drop the quotes and list(), for convenience
DT[, colA := valA]

(b) The functional form

DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]

Note that the code above only illustrates how := can be used; these are not working examples. We will start using both forms on the flights data.table from the next section.

In (a), the LHS takes a character vector of column names and the RHS a list of values. The RHS just needs to be a list, irrespective of how it is generated (e.g., using lapply(), list(), mget(), mapply() etc.). This form is usually easy to program with and is particularly useful when you don't know in advance which columns to assign values to.
On the other hand, (b) is handy if you would like to jot some comments down for later.
The result is returned invisibly.
Since := is available in j, we can combine it with i and by operations just like the aggregation operations we saw in the previous vignette.

In the two forms of := shown above, note that we don't assign the result back to a variable, because we don't need to. The input data.table is modified by reference. Let's go through examples to understand what we mean by this.

For the rest of the vignette, we will work with the flights data.table.

Add/update/delete columns by reference

a) Add columns by reference

– How can we add the columns speed and total delay of each flight to the flights data.table?

flights[, `:=`(speed = distance / (air_time / 60), # speed in mph (mi/h)
               delay = arr_delay + dep_delay)]

head(flights)

## alternatively, using the 'LHS := RHS' form
# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]

Note that:

We did not have to assign the result back to flights.

The flights data.table now contains the two newly added columns. This is what we mean by added by reference.

We used the functional form so that we could add comments on the side to explain what the computation does. You can also see the LHS := RHS form (commented).

b) Update some rows of columns by reference - sub-assign by reference

Let's take a look at all the hours available in the flights data.table:

# get all 'hour' in flights
flights[,sort(unique(hour))]

We see that there are 25 unique values in total. Both 0 and 24 hours seem to be present. Let's go ahead and replace 24 with 0.

– Replace those rows where hour == 24 with the value 0

## subassign by reference
flights[hour == 24L, hour := 0L]  ## of the form DT[i, j]

We can use i along with := in j the very same way as we have already seen in the “Introduction to data.table” vignette.

Column hour is replaced with 0 only on those row indices where the condition hour == 24L specified in i evaluates to TRUE.

:= returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty [] at the end of the query as shown below:

flights[hour == 24L,hour := 0L][]

Let's look at all the hours again to verify.

## check again
flights[,sort(unique(hour))]

Exercise:

What is the difference between flights[hour == 24L, hour := 0L] and flights[hour == 24L][, hour := 0L]? Hint: The latter needs an assignment (<-) if you want to use the result later.

If you can't figure it out, have a look at the Note section of ?":=".
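
One way to explore the exercise is to try both forms on a toy table and compare what happens to the original. The sketch below is only an experiment under that assumption; the names DT, DT2 and res are illustrative and are not part of the flights data.

```r
library(data.table)

DT <- data.table(id = 1:4, hour = c(5L, 24L, 12L, 24L))

# Form 1: i and := in the same call; DT itself is updated in place
DT[hour == 24L, hour := 0L]
DT$hour

# Form 2: the first [ ] returns a new, subsetted data.table, and := then
# updates that temporary result rather than the table it was taken from
DT2 <- data.table(id = 1:4, hour = c(5L, 24L, 12L, 24L))
res <- DT2[hour == 24L][, hour := 0L]
DT2$hour   # the table the subset was taken from
res        # only the subsetted rows, after the update
```

Comparing DT$hour with DT2$hour shows which of the two forms changed the table it started from.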

c) Delete column by reference

– Remove delay column

flights[,c("delay") := NULL]
head(flights)

Assigning NULL to a column deletes that column, and it happens instantly.
We can also pass column numbers instead of names in the LHS, although it is good programming practice to use column names.
When there is just one column to delete, we can drop the c() and double quotes and just use the column name unquoted, for convenience. That is:

flights[, delay := NULL]

is equivalent to the code above.
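
The deletion variants described above can be tried on a small toy table. This is a minimal sketch; the table and its column names (v to z) are made up for illustration.

```r
library(data.table)

DT <- data.table(v = 1:3, w = 4:6, x = 7:9, y = 10:12, z = 13:15)

DT[, c("w") := NULL]        # quoted name inside c()
DT[, x := NULL]             # single unquoted name, for convenience
DT[, c("y", "z") := NULL]   # several columns at once

names(DT)                   # columns that remain
```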

d) := along with grouping using by

We have already seen the use of i along with := in the section on sub-assigning by reference. Let's now see how we can use := along with by.

– How can we add a new column which contains for each origin, dest pair the maximum speed?

flights[,max_speed := max(speed),by = .(origin,dest)]
head(flights)

We add a new column max_speed using the := operator by reference.
We provide the columns to group by the same way as shown in the Introduction to data.table vignette. For each group, max(speed) is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. flights data.table is modified in-place.
We could have also provided by with a character vector as we saw in the Introduction to data.table vignette, e.g., by = c("origin", "dest").
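
The recycling of the per-group value can be seen on a handful of rows. The sketch below uses invented airports and speeds purely for illustration, not values from flights.

```r
library(data.table)

DT <- data.table(origin = c("JFK", "JFK", "LGA", "LGA", "LGA"),
                 dest   = c("LAX", "LAX", "ORD", "ORD", "BOS"),
                 speed  = c(450, 480, 390, 410, 300))

# max(speed) returns one value per (origin, dest) group; := repeats it
# on every row of that group, so the number of rows does not change
DT[, max_speed := max(speed), by = .(origin, dest)]
DT
```

Unlike an aggregation such as DT[, .(max_speed = max(speed)), by = .(origin, dest)], which returns one row per group, the := version keeps every original row.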

e) Multiple columns and :=

– How can we add two more columns computing max() of dep_delay and arr_delay for each month, using .SD?

in_cols <- c("dep_delay","arr_delay")
out_cols <- c("max_dep_delay","max_arr_delay")
flights[,c(out_cols) := lapply(.SD,max),by = month, .SDcols = in_cols]

head(flights)

We use the LHS := RHS form. We store the input column names and the names of the new columns in separate variables and provide them to .SDcols and to the LHS (for better readability).

Note that since we allow assignment by reference without quoting column names when there is only one column, as explained in the section on deleting a column by reference, we cannot do out_cols := lapply(.SD, max). That would result in adding one new column named out_cols. Instead we should do either c(out_cols) or simply (out_cols). Wrapping the variable name in parentheses is enough to differentiate between the two cases.

The LHS := RHS form allows us to operate on multiple columns. In the RHS, to compute the max on columns specified in .SDcols, we make use of the base function lapply() along with .SD in the same way as we have seen before in the “Introduction to data.table” vignette. It returns a list of two elements, containing the maximum value corresponding to dep_delay and arr_delay for each group.
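
The difference between out_cols and (out_cols) on the LHS is easy to miss, so here is a small sketch of both on toy data. The delay values are invented, and has_literal is a hypothetical helper variable used only to record the intermediate state.

```r
library(data.table)

DT <- data.table(month = c(1L, 1L, 2L, 2L),
                 dep_delay = c(5L, 20L, -3L, 8L),
                 arr_delay = c(10L, 2L, 0L, 15L))

in_cols  <- c("dep_delay", "arr_delay")
out_cols <- c("max_dep_delay", "max_arr_delay")

# Without parentheses the LHS is taken literally: one column called "out_cols"
DT[, out_cols := 0L]
has_literal <- "out_cols" %in% names(DT)
has_literal

DT[, out_cols := NULL]   # remove the accidental column again

# With parentheses the variable is evaluated and its two names are used
DT[, (out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]
DT
```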

Before moving on to the next section, let's clean up the newly created columns speed, max_speed, max_dep_delay and max_arr_delay.

# RHS gets automatically recycled to length of LHS
flights[,c("speed","max_speed","max_dep_delay","max_arr_delay"):= NULL]
head(flights)

:= and copy()

:= modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. At other times it may not be desirable to modify the original object, in which case we can use the copy() function, as we will see in a moment.

a) := for its side effect

Let's say we would like to create a function that would return the maximum speed for each month. But at the same time, we would also like to add the column speed to flights. We could write a simple function as follows:

foo <- function(DT) {
  DT[, speed := distance / (air_time / 60)]
  ## compute in j and name the result column max_speed
  DT[, .(max_speed = max(speed)), by = month]
}

ans = foo(flights)

head(flights)
head(ans)

Note that the new column speed has been added to flights data.table. This is because := performs operations by reference. Since DT (the function argument) and flights refer to the same object in memory, modifying DT also reflects on flights.
And ans contains the maximum speed for each month.

b) The copy() function

In the previous section, we used := for its side effect. But of course this may not always be desirable.
Sometimes we would like to pass a data.table object to a function and use the := operator inside it, but without updating the original object. We can accomplish this using the function copy().

The copy() function deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.

There are two particular places where the copy() function is essential:

1. Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified by reference. As an example, let's consider the task in the previous section, except we don't want to modify flights by reference.

Let's first delete the speed column we generated in the previous section.

flights[,speed := NULL]

Now, we could accomplish the task as follows:

foo <- function(DT) {
  DT <- copy(DT)                              ## deep copy
  DT[, speed := distance / (air_time / 60)]   ## doesn't affect flights
  ## compute in j and name the result column max_speed
  DT[, .(max_speed = max(speed)), by = month]
}

ans = foo(flights)

head(flights)
head(ans)

Using the copy() function did not update the flights data.table by reference. It doesn't contain the column speed.
And ans contains the maximum speed corresponding to each month.
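
The two behaviours can be compared side by side without the flights data. In this sketch the helper names max_speed_by_ref and max_speed_on_copy, the toy distances and air times, and the variables after_copy and after_ref are all hypothetical.

```r
library(data.table)

DT <- data.table(month = c(1L, 1L, 2L),
                 distance = c(300, 600, 450),
                 air_time = c(60, 90, 90))

# Hypothetical helper: adds 'speed' to the caller's table as a side effect
max_speed_by_ref <- function(dt) {
  dt[, speed := distance / (air_time / 60)]
  dt[, .(max_speed = max(speed)), by = month]
}

# Hypothetical helper: works on a deep copy, so the caller's table is untouched
max_speed_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, speed := distance / (air_time / 60)]
  dt[, .(max_speed = max(speed)), by = month]
}

ans1 <- max_speed_on_copy(DT)
after_copy <- "speed" %in% names(DT)   # was 'speed' added to DT?

ans2 <- max_speed_by_ref(DT)
after_ref <- "speed" %in% names(DT)    # and now?

c(after_copy = after_copy, after_ref = after_ref)
all.equal(ans1, ans2)                  # the returned summaries should agree
```

The returned summary is the same either way; the only difference is whether the caller's table gains the speed column. The deep copy costs extra memory and time on large tables.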

However, we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch upon this again in the data.table design vignette.

2. When we store the column names in a variable, e.g., DT_n = names(DT), and then add/update/delete column(s) by reference, it would also modify DT_n, unless we do copy(names(DT)).

DT = data.table(x = 1L, y = 2L)
DT_n = names(DT)
DT_n

## add a new column by reference
DT[, z := 3L]

## DT_n also gets updated
DT_n

## use copy()
DT_n = copy(names(DT))
DT[, w := 4L]

## DT_n doesn't get updated
DT_n

Summary

The := operator

It is used to add/update/delete columns by reference.
We have also seen how to use := along with i and by in the same way as in the Introduction to data.table vignette. In the same way, we can use keyby, chain operations together, and pass expressions to by. The syntax is consistent.
We can use := for its side effect or use copy() to not modify the original object while updating by reference.

So far we have seen a whole lot of j, how to combine it with by, and a little of i. Let's turn our attention back to i in the next vignette, “Keys and fast binary search based subset”, to perform blazing fast subsets by keying data.tables.
