Match as an alternative to merge in R

22 May 2017 - 4 mins read time
Tags: R

One of the most common operations in data wrangling is joining two sets of data by a common variable. Probably the most popular method for this is the obscure vlookup function in Excel (simply because it’s the most widely used software for data manipulaton). The closest alternative in base R is merge and the dplyr package contains the join function family which is even more convenient. But there is a more simple and direct solution when only one variable needs to be added to a dataset.

Suppose we have two data frames, df.a and df.b, and we wish to get the values of other.var from df.b into df.a so that each id gets their “own” value. There are various methods for joining, each yielding a different result. But in my experience left join on a single variable is the most frequent and this is what we will explore here.

df.a <- data.frame(id = sample(LETTERS, 10), 
                   some.var = rnorm(10))
df.a

##    id    some.var
## 1   K  0.18680978
## 2   W -1.20721078
## 3   P -0.05905494
## 4   L  1.55776950
## 5   Y  2.13328219
## 6   B  0.69926471
## 7   A -0.03656637
## 8   S -0.69594103
## 9   T  0.90289095
## 10  X  0.31828502

df.b <- data.frame(id = sample(LETTERS, 20), 
                   other.var = runif(20, 1, 100))
df.b

##    id other.var
## 1   T 18.461094
## 2   M 86.746373
## 3   N 76.536230
## 4   P 79.338731
## 5   B 62.402434
## 6   I 50.785243
## 7   O 65.291921
## 8   X 27.777302
## 9   A 65.784727
## 10  D 45.353893
## 11  H 26.760644
## 12  Q 10.775249
## 13  E 47.873233
## 14  C 74.418394
## 15  Z 67.870696
## 16  Y 35.722271
## 17  K 40.220227
## 18  U 69.011976
## 19  V  3.544297
## 20  F  5.644257

Left join with ‘merge’

When using merge, we specify the arguments of the function, run it and then through some magic a new dataset with requested columns is created. Note that we don’t need to specify by which variable we wish to merge if variable names are the same.

df.merge <- merge(df.a, df.b, all.x = T, all.y = F)
df.merge

##    id    some.var other.var
## 1   A -0.03656637  65.78473
## 2   B  0.69926471  62.40243
## 3   K  0.18680978  40.22023
## 4   L  1.55776950        NA
## 5   P -0.05905494  79.33873
## 6   S -0.69594103        NA
## 7   T  0.90289095  18.46109
## 8   W -1.20721078        NA
## 9   X  0.31828502  27.77730
## 10  Y  2.13328219  35.72227

Left join with ‘match’

A more hands-on approach involves first figuring out which rows in df.a correspond to which rows in df.b according to id. The match function allows us to do just that.

match(df.a$id, df.b$id)

##  [1] 17 NA  4 NA 16  5  9 NA  1  8

Now that we have the row numbers, we can simply return other.var in df.b where the matches occur. A useful side effect is that we can define the name for the new variable while matching.

df.a$other.var <- df.b$other.var[match(df.a$id, df.b$id)]

Now let’s compare the results.

df.merge

##    id    some.var other.var
## 1   A -0.03656637  65.78473
## 2   B  0.69926471  62.40243
## 3   K  0.18680978  40.22023
## 4   L  1.55776950        NA
## 5   P -0.05905494  79.33873
## 6   S -0.69594103        NA
## 7   T  0.90289095  18.46109
## 8   W -1.20721078        NA
## 9   X  0.31828502  27.77730
## 10  Y  2.13328219  35.72227

df.a

##    id    some.var other.var
## 1   K  0.18680978  40.22023
## 2   W -1.20721078        NA
## 3   P -0.05905494  79.33873
## 4   L  1.55776950        NA
## 5   Y  2.13328219  35.72227
## 6   B  0.69926471  62.40243
## 7   A -0.03656637  65.78473
## 8   S -0.69594103        NA
## 9   T  0.90289095  18.46109
## 10  X  0.31828502  27.77730

We can see that the result is essentially the same. What merge has done is rearranged the rows which is something we might not want to happen. So I encourage the use of match when possible since it allows the addition of a single column without running a function over entire data sets.