Looping: lapply, sapply, mapply & apply

The apply family of functions takes the prize for being the most useful yet most confusing and unintuitive (at least initially). Here I hope to demystify this wonderful set of tools.

In short, this set of functions is useful when we need to repeat something over a set of values (list items, vector elements, data frame columns, etc.). In most programming languages, including R, we can write a for() or while() loop. But the apply functions make it easier to iteratively run commands over a set of values. They are more concise than an explicit loop and often run at least as fast, although the speed difference is usually small.

Here is the list of functions:

Each entry below gives the function, an overview, and (where there is one) a sample application.

lapply: “l” for list, because that’s what it outputs
Overview: Loops through each item of a list or a vector and executes a function on each item. Outputs a list of the same length as the input. You can input a data frame as well, if you think of it as a list of column vectors.
Sample application: Build a list of 10 vectors with 1, 2, 3, …, 10 random numbers each: lapply(1:10, function(x) { runif(x) } ).

sapply: “s” for simple, because the list output is simplified to a vector
Overview: Same as lapply, but simplifies the output to a vector (or a matrix) when it can, instead of returning a list.
Sample application: Count the number of distinct values in each column of a data frame and return a vector: sapply(df, function(x){length(unique(x))}).

mapply: “m” for multivariate
Overview: Same as lapply, but instead of looping through each item in a single vector/list, it loops through each item of multiple vectors/lists in tandem. Runs a command on the first item in vector1 and vector2, then the second item of vector1 and vector2, etc. Therefore the vectors or lists should be the same length (shorter ones get recycled).

apply
Overview: Unlike the other functions, it works on the “margins” (rows or columns) of a matrix or data frame. Think of the totals row and totals column in an Excel table.
Sample application: Calculate the mean of each column in a matrix or numeric data frame.

tapply: “t” for table, I guess? Because the output is a table?
Overview: Essentially performs PivotTable summaries. But I strongly prefer dplyr functions over `tapply()` for practically all cases. Just use dplyr 🙂
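
For reference, here is a minimal sketch of the PivotTable-style summary that tapply() produces. The sales and region vectors are made up purely for illustration.

```r
# Hypothetical data: four sales figures and the region each one belongs to
sales  <- c(10, 20, 30, 40)
region <- c("east", "west", "east", "west")

# Sum the sales within each region, like a one-way PivotTable
totals <- tapply(sales, region, sum)
print(totals)   # east = 40, west = 60
```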

Picking between lapply vs sapply vs mapply

In deciding which of these to use, you need to understand the function you wish to run iteratively. That function will dictate what kind of input is needed and the kind of output it returns.

If the function returns a single value, then lapply/sapply/mapply will return N values, in which case the output is best stored in a vector. For example, mean() returns the mean of a vector of numbers, which is a single value. So running any of the apply functions with mean() will return N means.

But if the output of the function is a vector, list, dataframe, etc, then you’ll end up with N of those, in which case you will likely need a list output. For example, the output of runif(Y) is a vector of Y values; if Y is greater than 1 and you want to run this multiple times, you’ll end up with N vectors each of length Y. So you can store that as a list (or a dataframe).
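
A small sketch of the two cases, using only base R. The seed is arbitrary and is there only so the run is repeatable.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# FUN returns a single number each time, so sapply() can return a vector
means <- sapply(1:3, function(n) mean(runif(n)))
is.numeric(means)   # TRUE
length(means)       # 3

# FUN returns vectors of different lengths, so the result has to be a list
draws <- lapply(1:3, runif)
is.list(draws)      # TRUE
lengths(draws)      # 1 2 3
```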

Below are the function structures:

> str(lapply)
function (X, FUN, ...)

> str(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)  

> str(apply)
function (X, MARGIN, FUN, ...)  

X represents the input that we are looping over. FUN represents the function we are running iteratively. In the case of apply(), MARGIN represents whether we want to execute on columns (2) or rows (1).

Notice that mapply() does not have an X input because that function can take on any number of objects to loop over. That is represented by the ... in the function definition.
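
To make the ... idea concrete, here is a rough sketch with three inputs walked through in tandem. The functions and the offset value are invented for illustration.

```r
# Three inputs looped over together: 1+4+7, 2+5+8, 3+6+9
sums <- mapply(function(x, y, z) x + y + z, 1:3, 4:6, 7:9)
print(sums)      # 12 15 18

# MoreArgs holds arguments that stay fixed for every iteration
# (here a hypothetical constant offset of 100)
shifted <- mapply(function(x, y, offset) x + y + offset,
                  1:3, 4:6, MoreArgs = list(offset = 100))
print(shifted)   # 105 107 109
```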

Functions to run

We can apply a function that is already defined, or create a function “on the fly” (aka anonymous function). Here we go over both scenarios:

Applying predefined functions

Let’s say we have a simple list of numeric vectors, list_a:

> list_a
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 0.3992223 0.9515923 0.5431508 0.7627956 0.5508387

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0

The list has three vectors, and we want to calculate the mean of each vector. Since we are dealing with just one input (list_a), and it’s a list, we’ll use either lapply() or sapply():

> lapply(list_a, mean)
[[1]]
[1] 5.5

[[2]]
[1] 0.6415199

[[3]]
[1] 0.2

> sapply(list_a, mean)
[1] 5.5000000 0.6415199 0.2000000

lapply() returns a list of three items, each representing the mean of the corresponding vector; sapply() returns the same result, but coerces it to a vector for convenience.

This is a simple example because the mean() function has only one required input, and the rest are optional (see ?mean). If the function needed a second input that changes from one item to the next, we would use mapply(). But what if we wanted to tweak one of the optional parameters of mean(), for example na.rm, which tells it to ignore NA values? Let’s look at the structure of sapply() and lapply() again:

> str(lapply)
function (X, FUN, ...)

> str(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

The ... represents optional parameters to FUN. So if we want to include the na.rm = TRUE parameter when applying mean(), we can do so by running lapply(list_a, mean, na.rm = TRUE) or sapply(list_a, mean, na.rm = TRUE). This would effectively run mean() with the na.rm parameter set to TRUE.
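
Since list_a has no missing values, here is a quick sketch with a made-up list (list_na) that does, to show what passing na.rm = TRUE through the ... changes.

```r
# Hypothetical list with missing values
list_na <- list(c(1, 2, NA), c(4, NA, 6))

with_na    <- sapply(list_na, mean)                 # NA NA
without_na <- sapply(list_na, mean, na.rm = TRUE)   # 1.5 5.0

print(with_na)
print(without_na)
```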

Any predefined function can be run in this manner, including functions you have written yourself. Let’s say we wrote a function called gap() that calculates the difference between min and max of a vector:

# Define gap function 
> gap = function(v) {
          min = min(v)
          max = max(v)
          return(max - min)
      }

# apply gap() to list_a
> sapply(list_a, gap)
[1] 9.0000000 0.5523699 1.0000000

We run the gap() function within sapply() just as we would any other predefined function. As you can see, sapply() returns a vector with three numbers representing the difference between min and max within each respective vector in the list.

Defining functions in real-time (aka anonymous functions)

In the above example, we could also have defined the function in “real-time” within the sapply() call, instead of creating a named standalone function. This is also known as an anonymous function because the function is not named. This technique is recommended if the function is relatively simple and only being used once. If the function needs to be called again later in the code, it is best to define the function as we did above.

Let’s re-run the anonymous version of the gap() function:

# 1. verbatim copy of gap() function
> sapply(list_a, function(v) { 
        min = min(v)
        max = max(v)
        return(max - min)
       })
[1] 9.0000000 0.5523699 1.0000000

# 2. simpler version of gap() function
> sapply(list_a, function(v) { return(max(v) - min(v)) })
[1] 9.0000000 0.5523699 1.0000000

# 3. simplest version of gap() function
> sapply(list_a, function(v) { max(v) - min(v) })
[1] 9.0000000 0.5523699 1.0000000

The three examples are equivalent, and each is simpler than the last. The two key things to be aware of are: (1) the v within the custom function stands for one item of the X argument of sapply(), and (2) the value of the last expression within the function, or whatever is passed to return(), is what the function returns.

Regarding the v: we could have used any letter or variable name to represent X. Within the curly braces (i.e., inside the function) v represents the value of the X input at each iteration. For example, if X is a vector, then v represents X[1] in the first iteration, then X[2], and so forth, until all values of X have been considered.
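
If it helps, here is roughly what sapply() is doing for us, written out as an explicit loop. The vector x and the doubling function are just placeholders.

```r
x <- c(10, 20, 30)

# Explicit loop: v takes the value x[1], then x[2], then x[3]
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  v <- x[i]
  doubled_loop[i] <- v * 2
}

# Same thing with sapply(); the argument name is arbitrary
doubled_apply <- sapply(x, function(item) item * 2)

identical(doubled_loop, doubled_apply)   # TRUE
```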

Looping through a vector

In the examples above we looped through the items of a list, but we could also loop through a vector. Let’s say you want to create two vectors of random numbers, one with four numbers and one with five.


> lapply(4:5, runif)
[[1]]
[1] 0.1840649 0.2870561 0.8138439 0.4689943

[[2]]
[1] 0.82608012 0.16391831 0.93507711 0.73203296 0.04265434

The runif() function draws random numbers from a uniform distribution; the only required input is the number of random values to draw. In the above example, we effectively run runif() with an input of 4, then with an input of 5; the output is a list of two vectors, one with four and one with five random numbers.
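
A slightly extended sketch of the same idea: the seed makes the draws repeatable, and the optional min and max arguments of runif() are passed along after the function name. The seed and the range are arbitrary choices.

```r
set.seed(123)  # arbitrary seed, only so the draws are repeatable

# Each element of 4:5 becomes the first argument of runif();
# min and max are passed unchanged on every iteration
draws <- lapply(4:5, runif, min = 0, max = 10)

lengths(draws)   # 4 5
```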

It is a strange use case, but hopefully it gets the point across.

Looping through dataframe columns

If you consider a data frame to simply be a list of N equal-length vectors, where N is the number of columns in the data frame, we can easily extend the list-based inputs above to be based on data frame columns. Let’s say you have a data frame with five columns, and you want to see the number of distinct values in each column:

> data2
   BBWAA From   To Seasons  Ages
1      1 1979 2003      25 20-44
2      1 1981 2001      21 20-40
3      1 1982 2001      20 22-41
4      1 1982 1999      18 24-41
5      1 1978 1998      21 21-41
6      1 1977 1997      21 21-41
7      1 1981 1997      17 21-37
8      1 1978 1996      19 23-41
9      1 1973 1995      23 21-43
10     1 1984 1995      12 24-35

> sapply(data2, function(x) {length(unique(x))})
  BBWAA    From      To Seasons    Ages 
      1       7       7       8       9 

Here we created a custom anonymous function to count the number of unique values in each column of data2. Since we used sapply, the output is a vector. Conveniently it outputs a named vector, with each value having a corresponding name.
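
As a side note, base R also has vapply(), which works like sapply() but makes you state what each result should look like. Below is a sketch on a small made-up data frame (df is a stand-in, not data2).

```r
# Hypothetical data frame standing in for data2
df <- data.frame(
  team   = c("A", "A", "B", "C"),
  season = c(2001, 2002, 2002, 2002)
)

# integer(1) says: every call must return exactly one integer
n_distinct <- vapply(df, function(x) length(unique(x)), integer(1))
print(n_distinct)   # team = 3, season = 2
```

If the function ever returned something else, vapply() would stop with an error instead of quietly changing the shape of the output.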

Running mapply()

In the examples above, we ran a function looping through just one input list or vector, because the functions we ran only had one required input. However, if we wish to loop through multiple lists or vectors in tandem, we would use mapply(). Below we will look at a simple extension of the above examples.

Recall we ran our custom function that computed max() minus min() for each vector in our list_a list, using sapply(). What if we have a second list, also having three vectors, and want to apply a similar function across each pair of vectors?

Below are two examples of mapply() on two lists, list_a and list_b:

# print list_a
> list_a
[[1]]
[1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 0.3992223 0.9515923 0.5431508 0.7627956 0.5508387

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0

# print list_b
> list_b
[[1]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[2]]
[1] 0.57162896 0.49288256 0.78369524 0.54034070 0.08800153

[[3]]
 [1] 0 1 0 1 0 1 0 1 0 0


# Example 1: combine each pair of vectors across both lists
> mapply(c, list_a, list_b)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10
[11] 11 12 13 14 15 16 17 18 19 20

[[2]]
 [1] 0.39922235 0.95159225 0.54315077 0.76279560 0.55083868
 [6] 0.57162896 0.49288256 0.78369524 0.54034070 0.08800153

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0
[11] 0 1 0 1 0 1 0 1 0 0


# Example 2: compute max - min across both sets of vectors
> mapply(function(a, b) {max(a, b) - min(a, b)}, a = list_a, b = list_b)
[1] 19.0000000  0.8635907  1.0000000

In this example we have two lists, list_a and list_b, that are similarly structured with three vectors.

In the first mapply() call we use the c() function to combine the vectors across both lists. The result is a list with three vectors; the first vector is the combination of the first vector from each list, the second vector is the combination of the second vector from each list, and so forth.

In the second example we compute the max minus min across both sets of vectors. The result is a vector of length three; the max minus min between the first vectors in the two lists is 19, and so forth.

Notice that the first mapply() resulted in a list of three vectors while the second mapply() resulted in a vector. This is due to the nature of the function we are executing in the two examples. The first function is c(), which combines vectors, so each pair produces a new vector. Doing that for three pairs gives three vectors of different lengths (20, 10 and 20), which cannot be simplified any further and are therefore returned as a list. On the other hand, max() minus min() in a single iteration produces a single number; run three times, we get three numbers, which mapply() simplifies to a vector of length three.
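
One detail worth a quick sketch: when every iteration returns a vector of the same length, mapply() simplifies the result to a matrix rather than a list. The inputs below are made up; SIMPLIFY = FALSE is the argument shown in str(mapply) that turns this behaviour off.

```r
# Every call to c() returns four numbers, so mapply() simplifies to a matrix
same_len <- mapply(c, list(1:2, 3:4), list(5:6, 7:8))
dim(same_len)      # 4 2

# SIMPLIFY = FALSE always gives a list, whatever FUN returns
as_list <- mapply(c, list(1:2, 3:4), list(5:6, 7:8), SIMPLIFY = FALSE)
is.list(as_list)   # TRUE
```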

Running apply()

apply() is actually a lot like sapply(): it can take in a data frame (or matrix) and output a vector. I find that to be the most common use of apply(). Unless you’re doing matrix math, you probably wouldn’t need to use apply() in any other way. The main difference between the two is that when you input a data frame into sapply(), it’s always treated as a list of columns, whereas apply() can treat the data frame as a set of columns (MARGIN=2) or a set of rows (MARGIN=1).

Below is an example of apply() to find the max value in each row and in each column of a data frame:

# print mat data frame
> mat
          a         b          c
1 0.9257055 0.3598718 0.39724358
2 0.6133401 0.1021334 0.90042458
3 0.5458946 0.4004844 0.05772717
4 0.6195114 0.4098930 0.83160198
5 0.2652464 0.4448444 0.96032109

# max of each row (MARGIN = 1)
> apply(mat, 1, max)
[1] 0.9257055 0.9004246 0.5458946 0.8316020 0.9603211

# max of each column (MARGIN = 2)
> apply(mat, 2, max)
        a         b         c
0.9257055 0.4448444 0.9603211

The data frame has five rows and three columns, and the apply() function calculates the max of each row (MARGIN = 1) and of each column (MARGIN = 2).
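
The same idea on a tiny made-up matrix, where the numbers are easy to check by hand. The comparison with rowSums() and colSums() is only there to show that, for sums, base R has dedicated shortcuts that give the same totals.

```r
# Small hypothetical matrix: 2 rows, 3 columns, filled column by column
m <- matrix(1:6, nrow = 2)

row_totals <- apply(m, 1, sum)   # one value per row:    9 12
col_totals <- apply(m, 2, sum)   # one value per column: 3 7 11

# For sums, the dedicated shortcuts give the same totals
all(row_totals == rowSums(m))    # TRUE
all(col_totals == colSums(m))    # TRUE
```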

Conclusion

In my opinion, you know you have reached a new level of R proficiency if you are starting to use the apply functions on a regular basis. For a long time I thought they were so unintuitive, but now they’re a top tool in my arsenal. I hope I have made their uses clearer to you!
