The *apply* family of functions takes the prize for being the most useful yet most confusing and unintuitive (at least initially). Here I hope to demystify this wonderful set of tools.

In short, this set of functions is useful when we need to repeat something over a set of values (list items, vector elements, data frame columns, etc.). In most programming languages, including R, we can write a `for()` or `while()` loop. But the apply functions make it easier to iteratively run commands over a set of values. They are usually more concise than an explicit loop, and they can be faster than a loop that grows its result one element at a time, although they are not guaranteed to beat a well-written loop.
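
To make the comparison concrete, here is a minimal sketch that squares each element of a vector twice, once with a `for()` loop and once with `sapply()`. The names `values`, `squares_loop` and `squares_apply` are made up for this illustration.

```r
# Hypothetical input, used only for illustration
values <- c(2, 4, 6)

# for() loop: pre-allocate the result, then fill it in one element at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# sapply(): the same computation in a single call
squares_apply <- sapply(values, function(v) v^2)

identical(squares_loop, squares_apply)  # TRUE
```

Both versions give the same vector; the apply version simply has less bookkeeping.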

Here is the list of functions:

| Function | Overview | Sample Application |
|---|---|---|
| `lapply` (“l” for list, because that’s what it outputs) | Loop through each item of a list or a vector and execute a function on each item. Outputs a list of the same length as the input. You can input a data frame as well, if you think of it as a list of column vectors. | Build a list of 10 vectors with 1, 2, 3, …, 10 random numbers each: `lapply(1:10, function(x) { runif(x) })`. |
| `sapply` (“s” for simple, because the list output is simplified to a vector) | Same as `lapply`, but simplifies the output to a vector (or matrix) when possible instead of returning a list. | Count the number of distinct values in each column of a data frame and return a vector: `sapply(df, function(x) { length(unique(x)) })`. |
| `mapply` (“m” for multivariate) | Same as `lapply`, but instead of looping through each item in a single vector/list, it loops through each item of multiple vectors/lists in tandem. Runs a command on the first item in vector1 and vector2, then the second item of vector1 and vector2, etc. The vectors or lists should therefore normally be the same length (shorter ones are recycled). | Compute max minus min across paired vectors from two lists (see the `mapply()` section below). |
| `apply` | Unlike the other functions, it works on the “margins” (rows or columns) of a matrix or data frame. Think of the totals row and totals column in an Excel table. | Calculate the mean of each column in a matrix or numeric data frame. |
| `tapply` (“t” for table I guess? Because the output is a table?) | Essentially performs PivotTable summaries. But I strongly prefer dplyr functions over `tapply()` for practically all cases. | Just use dplyr 🙂 |
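
Since `tapply()` is the one function in the table without an example elsewhere in this post, here is a minimal sketch of the PivotTable-style summary it performs. The `scores` and `team` vectors are invented for this illustration.

```r
# Hypothetical data: four scores, each belonging to team "a" or "b"
scores <- c(10, 20, 30, 40)
team   <- c("a", "b", "a", "b")

# Mean score per team, like a one-column PivotTable
team_means <- tapply(scores, team, mean)
team_means
#  a  b
# 20 30
```

The second argument plays the role of the PivotTable “row labels”, and the function is applied to the values within each group.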

### Picking between `lapply` vs `sapply` vs `mapply`

In deciding which of these to use, you need to understand the function you wish to run iteratively. That function will dictate what kind of input is needed and the kind of output it returns.

If the output of the function is a single value, then lapply/sapply/mapply will return N values, in which case the output is best stored in a vector. For example, `mean()` returns the mean of a vector of numbers, so its output is a single value. Running any of these apply functions with `mean()` will therefore return N means.

But if the output of the function is a vector, list, data frame, etc., then you’ll end up with N of those, in which case you will likely need a list output. For example, the output of `runif(Y)` is a vector of Y values; if Y is greater than 1 and you want to run this multiple times, you’ll end up with N vectors each of length Y. So you can store that as a list (or a data frame).
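
The two cases can be sketched side by side. This is only an illustration; the inputs and the names `single_values` and `many_values` are assumptions made for the example.

```r
set.seed(1)  # only so the random draws are reproducible

# FUN returns a single value per item -> sapply() can simplify to a vector
single_values <- sapply(list(1:10, c(2, 4)), mean)
single_values
# [1] 5.5 3.0

# FUN returns vectors of different lengths -> a list is the natural container
many_values <- lapply(2:3, runif)
lengths(many_values)
# [1] 2 3
```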

Below are the function structures:

```r
> str(lapply)
function (X, FUN, ...)
> str(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
> str(apply)
function (X, MARGIN, FUN, ...)
```

`X` represents the input that we are looping over. `FUN` represents the function we are running iteratively. In the case of `apply()`, `MARGIN` represents whether we want to execute on rows (1) or columns (2).

Notice that `mapply()` does not have an `X` input because that function can take on any number of objects to loop over. That is represented by the `...` in the function definition.
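
A small sketch may help connect these argument names to actual calls. The inputs and the `scale` parameter are invented for this example; `MoreArgs` is the `mapply()` argument shown in the signature above, used here for a value that stays fixed across iterations.

```r
# X is what we loop over, FUN is what we run on each item
sums <- lapply(X = list(1:3, 4:6), FUN = sum)

# mapply() has no X: every vector passed through ... is looped over in tandem
pair_sums <- mapply(FUN = function(x, y) x + y, 1:3, 4:6)
pair_sums
# [1] 5 7 9

# MoreArgs holds arguments that are the same in every iteration
scaled <- mapply(function(x, y, scale) (x + y) * scale,
                 1:3, 4:6, MoreArgs = list(scale = 10))
scaled
# [1] 50 70 90
```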

### Functions to run

We can apply a function that is already defined, or create a function “on the fly” (aka anonymous function). Here we go over both scenarios:

#### Applying predefined functions

Let’s say we have a simple list of numeric vectors, *list_a*:

```r
> list_a
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 0.3992223 0.9515923 0.5431508 0.7627956 0.5508387

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0
```

The list has three vectors, and we want to calculate the mean of each vector. Since we are dealing with just one input (list_a), and it’s a list, we’ll use either `lapply()` or `sapply()`:

```r
> lapply(list_a, mean)
[[1]]
[1] 5.5

[[2]]
[1] 0.6415199

[[3]]
[1] 0.2

> sapply(list_a, mean)
[1] 5.5000000 0.6415199 0.2000000
```

`lapply()` returns a list of three items, each representing the mean of the corresponding vector; `sapply()` returns the same result, but simplifies it to a vector for convenience.

This is a simple example because the `mean()` function has only one required input, and the remaining are optional (see `?mean`). If we needed to loop over more than one input at a time, we would use `mapply()`. But what if we wanted to tweak some of the *optional* parameters for `mean()`? For example, `na.rm` to ignore `NA` values? Let’s look at the structure of `sapply()` and `lapply()` again:

```r
> str(lapply)
function (X, FUN, ...)
> str(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
```

The `...` represents additional arguments that are passed on to FUN. So if we want to include the `na.rm = TRUE` parameter when applying `mean()`, we can do so by running `lapply(list_a, mean, na.rm = TRUE)` or `sapply(list_a, mean, na.rm = TRUE)`. This effectively runs `mean()` with the `na.rm` parameter set to `TRUE`.
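
Because *list_a* has no missing values, the effect of `na.rm` is easier to see on a list that does. The sketch below uses a made-up list, `list_na`, purely for illustration.

```r
# Hypothetical list with a missing value in the second vector
list_na <- list(c(1, 2, 3), c(4, NA, 6))

# Without na.rm, the NA propagates into the second mean
with_na <- sapply(list_na, mean)
with_na
# [1]  2 NA

# na.rm = TRUE is passed through ... to mean() on every iteration
without_na <- sapply(list_na, mean, na.rm = TRUE)
without_na
# [1] 2 5
```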

Any predefined function can be run in this manner, including functions you have written yourself. Let’s say we wrote a function called `gap()` that calculates the difference between min and max of a vector:

```r
# Define gap function
> gap = function(v) {
    min = min(v)
    max = max(v)
    return(max - min)
  }

# apply gap() to list_a
> sapply(list_a, gap)
[1] 9.0000000 0.5523699 1.0000000
```

We run the `gap()` function within `sapply()` just as we would any other predefined function. As you can see, `sapply()` returns a vector with three numbers representing the difference between min and max within each respective vector in the list.

#### Defining functions in real-time (aka anonymous functions)

In the above example, we could also have defined the function in “real-time” within the `sapply()` call, instead of creating a named standalone function. This is also known as an anonymous function because the function is not named. This technique is recommended if the function is relatively simple and only being used once. If the function needs to be called again later in the code, it is best to define the function as we did above.

Let’s re-run `gap()` as an anonymous function:

```r
# 1. verbatim copy of gap() function
> sapply(list_a, function(v) {
    min = min(v)
    max = max(v)
    return(max - min)
  })
[1] 9.0000000 0.5523699 1.0000000

# 2. simpler version of gap() function
> sapply(list_a, function(v) { return(max(v) - min(v)) })
[1] 9.0000000 0.5523699 1.0000000

# 3. simplest version of gap() function
> sapply(list_a, function(v) { max(v) - min(v) })
[1] 9.0000000 0.5523699 1.0000000
```

The three examples produce identical results, and each is simpler than the last. The two key things to be aware of are: (1) the *v* within the custom function represents one item of the *X* argument of `sapply()`, and (2) the last command within the function, or whatever is within the `return()` statement, is what the function returns.

Regarding the *v*: we could have used any letter or variable name to represent the current item of X. Within the curly braces (i.e., inside the function) `v` represents the value of the `X` input at each iteration. For example, if X is a vector, then `v` represents X[1] in the first iteration, then X[2], and so forth, until all values of X have been considered.
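
Both points can be checked with a short sketch. The vector `x` and the argument name `item` are arbitrary choices made for this illustration.

```r
x <- c(10, 20, 30)

# The argument name is arbitrary: `v` and `item` behave identically
with_v    <- sapply(x, function(v) v / 10)
with_item <- sapply(x, function(item) item / 10)
identical(with_v, with_item)  # TRUE

# Spelling out the iterations by hand: FUN(X[1]), FUN(X[2]), FUN(X[3])
manual <- c(x[1] / 10, x[2] / 10, x[3] / 10)
identical(with_v, manual)     # TRUE
```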

#### Looping through a vector

In the examples above we looped through values of a list, but we could also loop through a vector. Let’s say you want to create two vectors of random numbers, one with four numbers and one with five.

```r
> lapply(4:5, runif)
[[1]]
[1] 0.1840649 0.2870561 0.8138439 0.4689943

[[2]]
[1] 0.82608012 0.16391831 0.93507711 0.73203296 0.04265434
```

The `runif()` function draws random numbers from a uniform distribution; the only required input is the number of random values to draw. In the above example, we effectively run `runif()` with an input of 4, then with an input of 5; the output is a list of two vectors, one with four and one with five random numbers.

It is a strange use case, but it should get the point across.
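
A slightly more everyday case, sketched here with a made-up character vector `fruits`, is looping over strings. When the input is a character vector, `sapply()` also uses the input values as names for the output (that is what `USE.NAMES = TRUE` in its signature does).

```r
# Hypothetical character vector
fruits <- c("apple", "fig", "banana")

# Count the characters in each string
name_lengths <- sapply(fruits, nchar)
name_lengths
#  apple    fig banana
#      5      3      6
```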

#### Looping through dataframe columns

If you consider a data frame to simply be a list of N equal-length vectors, where N is the number of columns in the data frame, we can easily extend the list-based inputs above to be based on data frame columns. Let’s say you have a data frame with five columns, and you want to see the number of distinct values in each column:

```r
> data2
   BBWAA From   To Seasons  Ages
1      1 1979 2003      25 20-44
2      1 1981 2001      21 20-40
3      1 1982 2001      20 22-41
4      1 1982 1999      18 24-41
5      1 1978 1998      21 21-41
6      1 1977 1997      21 21-41
7      1 1981 1997      17 21-37
8      1 1978 1996      19 23-41
9      1 1973 1995      23 21-43
10     1 1984 1995      12 24-35
> sapply(data2, function(x) {length(unique(x))})
  BBWAA    From      To Seasons    Ages
      1       7       7       8       9
```

Here we created a custom anonymous function to count the number of unique values in each column of `data2`. Since we used `sapply()`, the output is a vector. Conveniently, it is a named vector, with each value labelled by its column name.
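
If you want to try the same pattern without the `data2` data, here is a self-contained sketch on a small invented data frame, `df`. The column names and values are assumptions made only for this example.

```r
# Hypothetical data frame standing in for data2
df <- data.frame(
  id    = 1:4,
  group = c("x", "y", "x", "y"),
  flag  = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE  # keep `group` as character on older R versions
)

# Number of distinct values per column, returned as a named vector
distinct_counts <- sapply(df, function(x) length(unique(x)))
distinct_counts
#    id group  flag
#     4     2     1

# The same pattern works for any per-column summary, e.g. the column classes
column_classes <- sapply(df, class)
```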

### Running `mapply()`

In the examples above, we ran a function looping through just one input list or vector, because the functions we ran only had one required input. However, if we wish to loop through multiple lists or vectors in tandem, we would use `mapply()`. Below we will look at a simple extension of the above examples.

Recall we ran our custom function that computed `max()` minus `min()` for each vector in our *list_a* list, using `sapply()`. What if we have a second list, also having three vectors, and want to apply a similar function across each pair of vectors?

Below are two examples of `mapply()` on two lists, *list_a* and *list_b*:

```r
# print list_a
> list_a
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
[1] 0.3992223 0.9515923 0.5431508 0.7627956 0.5508387

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0

# print list_b
> list_b
[[1]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[2]]
[1] 0.57162896 0.49288256 0.78369524 0.54034070 0.08800153

[[3]]
 [1] 0 1 0 1 0 1 0 1 0 0

# Example 1: combine each pair of vectors across both lists
> mapply(c, list_a, list_b)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10
[11] 11 12 13 14 15 16 17 18 19 20

[[2]]
 [1] 0.39922235 0.95159225 0.54315077 0.76279560 0.55083868
 [6] 0.57162896 0.49288256 0.78369524 0.54034070 0.08800153

[[3]]
 [1] 0 0 0 0 0 1 1 0 0 0
[11] 0 1 0 1 0 1 0 1 0 0

# Example 2: compute max - min across both sets of vectors
> mapply(function(a, b) {max(a, b) - min(a, b)}, a = list_a, b = list_b)
[1] 19.0000000  0.8635907  1.0000000
```

In this example we have two lists, *list_a* and *list_b*, that are similarly structured with three vectors.

In the first `mapply()` call we use the `c()` function to combine the vectors across both lists. The result is a list with three vectors; the first vector is the combination of the first vector from each list, the second vector is the combination of the second vector from each list, and so forth.

In the second example we compute the max minus min across both sets of vectors. The result is a vector of length three; the max minus min between the first vectors in the two lists is 19, and so forth.

Notice that the first `mapply()` resulted in a list of three vectors while the second `mapply()` resulted in a vector. This is due to the nature of the function we are executing in the two examples. The first function is `c()`, which is used to combine vectors; so when combining two vectors we get a vector. Doing that for three pairs of vectors results in three vectors (here of different lengths), which are stored as a list. On the other hand, `max()` minus `min()` in a single iteration produces a single number; run three times, we get three numbers. So this naturally is best stored as a vector of length three.
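
The shape of the result can be explored with a small sketch. The lists `first` and `second` are invented for this illustration. One extra case worth knowing about: when every iteration returns a vector of the same length, `mapply()` simplifies to a matrix rather than a list, and `SIMPLIFY = FALSE` (from the signature shown earlier) turns simplification off.

```r
# Hypothetical paired lists
first  <- list(1:3, 4:6)
second <- list(7:9, 10:12)

# One number per pair -> simplified to a vector
pair_sums <- mapply(function(a, b) sum(a, b), first, second)
pair_sums
# [1] 30 48

# One equal-length vector per pair -> simplified to a matrix (3 rows, 2 columns)
pair_matrix <- mapply(function(a, b) a + b, first, second)
dim(pair_matrix)
# [1] 3 2

# SIMPLIFY = FALSE always keeps the result as a list
pair_list <- mapply(function(a, b) a + b, first, second, SIMPLIFY = FALSE)
```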

### Running `apply()`

`apply()` is actually a lot like `sapply()`: it can take in a data frame (or matrix) and output a vector. I find that to be the most common use of `apply()`. Unless you’re doing matrix math, you probably wouldn’t need to use `apply()` in any other way. The main difference between the two is that when you input a data frame into `sapply()`, it’s treated as a list of *columns*, whereas `apply()` can treat the data frame as a list of columns (MARGIN=2) or rows (MARGIN=1).

Below is an example of `apply()` to find the max value in each row and in each column:

```r
# print mat data frame
> mat
          a         b          c
1 0.9257055 0.3598718 0.39724358
2 0.6133401 0.1021334 0.90042458
3 0.5458946 0.4004844 0.05772717
4 0.6195114 0.4098930 0.83160198
5 0.2652464 0.4448444 0.96032109

# max of each row (MARGIN = 1)
> apply(mat, 1, max)
[1] 0.9257055 0.9004246 0.5458946 0.8316020 0.9603211

# max of each column (MARGIN = 2)
> apply(mat, 2, max)
        a         b         c
0.9257055 0.4448444 0.9603211
```

The data frame has five rows and three columns. With `MARGIN = 1`, `apply()` returns the max of each row (five values); with `MARGIN = 2`, it returns the max of each column (three values).
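
To connect this back to the “totals row and totals column” idea, here is a minimal sketch on a tiny matrix where the arithmetic is easy to check by hand. The matrix `m` is made up for this illustration.

```r
# Hypothetical 2 x 3 matrix (filled column by column)
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# MARGIN = 1: one result per row (the "totals column")
row_totals <- apply(m, 1, sum)
row_totals
# [1]  9 12

# MARGIN = 2: one result per column (the "totals row")
col_totals <- apply(m, 2, sum)
col_totals
# [1]  3  7 11
```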

### Conclusion

In my opinion, you know you have reached a new level of R proficiency if you are starting to use the apply functions on a regular basis. For a long time I thought they were so unintuitive, but now they’re a top tool in my arsenal. I hope I have made their uses clearer to you!
