R Vectors: A Must-Learn

For many beginners, the word “vector” is one of those scary, geeky things that creep us out. But understanding vectors will speed up your learning considerably. So in this post I will try my best to break down R vectors for you.

Note: this post is similar to the book’s chapter on vectors

What are vectors

This is a vector:

[Image: R Vector]

This is another vector:

[Image: R Vector]

And so is this:

[Image: R Vector]

A vector is simply a one-dimensional collection of things. The things in a vector are all of the same type (i.e., they must all be numeric, or all character, etc.). A vector can be as long as you want, or as short as a single element (like the first one).

How to Create Vectors in R

There are essentially two ways to “create” a vector:

  • Make one manually using the c() function or other shortcuts
  • Take one from an existing object, like a data frame

Let’s replicate the examples above manually using the c() function:

# two ways to make a vector of length one
v1 = 55
v1 = c(55)

v2 = c(55, 63, 1000, 2)
v3 = c("James", "John", "Jane", "Jack")

Note the subtle power of the first example. The value 55 may look like a simple number, but in R speak it’s actually a vector of length one. The other examples are trivial; c() stands for combine — as in combining things together.
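
A quick way to convince yourself of both points (a lone number is already a vector, and all elements share one type) is to ask R directly. This is a small sketch using base R only; the names x and mixed are made up for illustration.

```r
# A single number is already a vector of length one
x = 55
is.vector(x)         # TRUE
length(x)            # 1
identical(x, c(55))  # TRUE

# Mixing types inside c() forces everything to one common type
mixed = c(55, "James")
class(mixed)         # "character"
mixed                # "55" "James"
```

Note how the number 55 quietly became the string "55": R will not keep two types in one vector.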

There are sometimes shortcuts for creating vectors:

v = 5:10

There we created a vector v with integer values 5 through 10.
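
The colon is not the only shortcut. As a rough sketch, here are two other base R helpers, seq() and rep(), that also hand back vectors; the variable names evens and zeros are just illustrative.

```r
v = 5:10
class(v)                    # "integer"

# seq() builds a sequence with a custom step
evens = seq(2, 10, by = 2)  # 2 4 6 8 10

# rep() repeats a value (or a whole vector)
zeros = rep(0, 5)           # 0 0 0 0 0
```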

Vectors can also come from existing data objects, like data frames. E.g., df$colname is a reference to a column from the data frame df. That, itself, is a vector. A row, like df[1, ], is a little different: because the columns can hold different types, R returns it as a one-row data frame rather than a plain vector; unlist(df[1, ]) flattens it into one.
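
One nuance is worth checking for yourself: a column and a row do not come back as the same kind of object. The sketch below uses a tiny, made-up data frame (the column names name and age are assumptions for illustration).

```r
# Hypothetical data frame, made up for illustration
df = data.frame(name = c("James", "John"),
                age = c(55, 63),
                stringsAsFactors = FALSE)

# A column is a plain vector
df$age                 # 55 63
is.vector(df$age)      # TRUE

# A row comes back as a one-row data frame, because its
# columns can hold different types
class(df[1, ])         # "data.frame"

# unlist() flattens the row into a named vector of a single type
unlist(df[1, ])
```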

How to Reference Vectors

Here are some ways to reference vectors:

# get second value of vector v2
v2[2]
#result: 63

# get first and third values of vector v3
v3[ c(1, 3) ]
# result: "James" "Jane"

# get the values of the second vector, in reverse order
v2[ 4:1 ]
# result: 2  1000  63  55

# we could also use a vector of TRUE/FALSE to pick specific values
v2[ c(TRUE, TRUE, FALSE, FALSE) ]
# result: 55  63

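Basically, in all those examples, we are supplying a vector inside the brackets (either index numbers or TRUE/FALSE values) that tells R which values to return. Even the lone 2 in v2[2] is a vector of length one.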
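
Because the thing inside the brackets is an ordinary vector, it can be stored in a variable first. A minimal sketch (idx and keep are illustrative names):

```r
v2 = c(55, 63, 1000, 2)

# The index is just another vector, so it can live in a variable
idx = c(1, 3)
v2[idx]              # 55 1000

# Same idea with TRUE/FALSE values
keep = c(TRUE, TRUE, FALSE, FALSE)
v2[keep]             # 55 63
```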

So Why do Vectors Matter?

Ever wonder what’s happening when you run something like this?

dataframe[ order(dataframe$columnname), 1:10 ]

It’s definitely a head-scratcher for newbies. There we are sorting the dataframe by columnname and only keeping the first ten columns. It turns out the two things inside the brackets are simply vectors instructing R how to order the rows and which columns to keep.
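
To see the two vectors separately, it can help to pull them out into variables. The sketch below uses a small made-up data frame with three columns, so it keeps the first two columns rather than the first ten; row_order and col_keep are illustrative names.

```r
# Hypothetical data frame, made up for illustration
dataframe = data.frame(columnname = c(30, 10, 20),
                       other = c("c", "a", "b"),
                       extra = c(TRUE, FALSE, TRUE))

row_order = order(dataframe$columnname)  # 2 3 1
col_keep = 1:2                           # 1 2

sorted = dataframe[row_order, col_keep]
sorted$columnname                        # 10 20 30
```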

You may also see things like this:

names(dataframe)[grepl("blah_", names(dataframe))]

It looks absolutely incomprehensible to newbies, and even some intermediate users. Here we are pulling column names from dataframe that have “blah_” in the name. But it turns out to be simple vector operations.
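
Taken one step at a time, that one-liner is three small vector operations. A sketch with a made-up data frame (the column names id, blah_a and blah_b are assumptions):

```r
# Hypothetical data frame, made up for illustration
dataframe = data.frame(id = 1:2, blah_a = c(1, 2), blah_b = c(3, 4))

col_names = names(dataframe)           # "id" "blah_a" "blah_b"
has_blah = grepl("blah_", col_names)   # FALSE TRUE TRUE
col_names[has_blah]                    # "blah_a" "blah_b"
```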

Vector Math

I’m not sure this is technically “vector math” but it sounds smart, so I’ll stick with that. Let’s run some operations on our vectors above:

# tell me which values > 60
v2 > 60

The result looks like this (second row):

# values of v2:
# 55   63 1000    2

# result:
# FALSE  TRUE  TRUE FALSE

The R operation returns a vector of TRUE / FALSE values indicating which values are > 60.

Recall in the section above, to reference values within a vector we simply supply it a vector of either index numbers, or TRUE / FALSE values indicating which values to keep. So what if we wanted the actual values of the vector that are > 60?

# tell me the values of vector v2 > 60
v2[ v2 > 60 ]

# that basically runs this operation:
# v2[ c(FALSE,  TRUE,  TRUE, FALSE) ]

# result:
# 63 1000
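
Comparisons are not the only operations that work on a whole vector at once. As a small sketch, arithmetic is applied to every element too, and a TRUE/FALSE vector can be counted with sum():

```r
v2 = c(55, 63, 1000, 2)

v2 * 2          # 110 126 2000 4
v2 + 1          # 56 64 1001 3

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(v2 > 60)    # 2
```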

By the same token, let’s find the names in vector v3 that contain the letter “n”:

v3[ grepl("n", v3) ]

# result:
# "John" "Jane"

grepl() is sort of like Excel’s FIND() and SEARCH() functions, but way more powerful. It basically looks for “n” in each element of the vector v3 and returns a vector of TRUE/FALSE values accordingly.
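
Part of that extra power is that the pattern is a regular expression. A short sketch: the first call shows the raw TRUE/FALSE vector, and the second uses the ^ anchor plus the ignore.case argument to find names starting with "ja" in any case.

```r
v3 = c("James", "John", "Jane", "Jack")

grepl("n", v3)                        # FALSE TRUE TRUE FALSE
grepl("^ja", v3, ignore.case = TRUE)  # TRUE FALSE TRUE TRUE
```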

Let’s look at yet another example to bring this whole thing to full clarity. Let’s sort the values of vector v2, using the order() function. The order() function returns the positions (indexes) of the values in sorted order: the position of the smallest value first, then the next smallest, and so on. That result is, naturally, a vector itself:

order(v2)

# values of v2:
# 55   63 1000    2

# result:
# 4 1 2 3

This is actually slightly confusing, but here’s what the result says:

  • the first ordered value (i.e., the smallest) is the 4th value in v2 (which is 2)
  • the second ordered value is the 1st value in v2 (55)
  • the third ordered value is the 2nd value in v2 (63)
  • the largest value is the 3rd value in v2 (1000)

So knowing that, we get a sorted version of v2 by simply putting the order function inside v2’s brackets:

v2[ order(v2) ]

# result:
# 2   55   63 1000
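
A few related base R calls are sketched below for comparison: sort() is the ready-made version of this trick, the decreasing argument flips the direction, and rank() answers a different question from order().

```r
v2 = c(55, 63, 1000, 2)

# sort() gives the same result as indexing with order()
identical(sort(v2), v2[order(v2)])     # TRUE

# decreasing = TRUE reverses the direction
v2[order(v2, decreasing = TRUE)]       # 1000 63 55 2

# rank() tells you where each value would land once sorted
rank(v2)                               # 2 3 4 1
```

The order() approach is still the one to remember, because it is what lets you sort a whole data frame by one of its columns.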

I don’t know about you, but this is magical.

Vectors in Data Frame Operations

Recall that a data frame is like an Excel data table: a collection of rows and columns, where each column holds values of a single type (numeric, character, etc.). You reference a data frame like this:

dataframe[ rows, columns ]

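And most importantly, each column is a vector itself. That is, df$col is a vector (the column named col). A row such as df[1, ] comes back as a one-row data frame, since its columns may differ in type, but the rows and columns selectors you put inside the brackets are always vectors.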

Here is a sampling of data frame operations you will use / see often, involving vectors:


# get first ten rows and first ten columns.  Remember 1:10 is just a shortcut for c(1,2,3,4,5,6,7,8,9,10)
df[ 1:10, 1:10 ]

# SORT df by col2, and get all columns.  df$col2 references col2, and is actually a vector itself
df[ order(df$col2), ]

# FILTER rows where col2 > 50.  Like an Excel filter.  df$col2 > 50 returns a TRUE/FALSE vector
df[ df$col2 > 50, ]

# FILTER rows meeting multiple criteria.  & is like AND() in Excel.  It returns TRUE only for rows where all conditions are met
df[ df$col2 > 50 & df$col4 == "John", ]

# Get columns containing "raw_".  Return all rows
df[ , grepl("raw_", names(df)) ]

# Organize columns alphabetically
df[ , order(names(df)) ]

# Get columns whose names have more than five characters
df[ , nchar(names(df)) > 5 ]

# CREATE new column based on conditional ifelse(), which is like Excel's IF()
df$newcol = ifelse(df$oldcol < 5, "lt 5", "gte 5")

The list is practically endless. But now you can see the powerful role that vectors play in data frame manipulation.
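
To try a few of these yourself, here is a sketch on a tiny made-up data frame (all column names and values are assumptions for illustration). One detail not shown in the list above: when only a single column matches, df[ , ... ] collapses to a plain vector unless you add drop = FALSE.

```r
# Hypothetical data frame, made up for illustration
df = data.frame(col1 = 1:4,
                col2 = c(55, 63, 1000, 2),
                raw_score = c(10, 20, 30, 40),
                col4 = c("James", "John", "Jane", "John"),
                stringsAsFactors = FALSE)

# SORT by col2
df[order(df$col2), ]$col2                        # 2 55 63 1000

# FILTER on two conditions at once
df[df$col2 > 50 & df$col4 == "John", ]$col1      # 2

# Organize columns alphabetically
names(df[, order(names(df))])                    # "col1" "col2" "col4" "raw_score"

# Columns whose names have more than five characters
names(df[, nchar(names(df)) > 5, drop = FALSE])  # "raw_score"
```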

Let’s look a bit closer at the last example, ifelse(). ifelse() evaluates the condition, df$oldcol < 5, which results in a vector of TRUE/FALSE values. Where the value is TRUE, newcol gets “lt 5”; where it is FALSE, newcol gets “gte 5”. Once again, vectors come into play.
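
The same steps can be watched on a plain vector, without a data frame. A minimal sketch, where oldcol stands in for the df$oldcol column and its values are made up:

```r
oldcol = c(2, 5, 9)

# Step 1: the condition produces a TRUE/FALSE vector
oldcol < 5                              # TRUE FALSE FALSE

# Step 2: ifelse() picks a label for each element
ifelse(oldcol < 5, "lt 5", "gte 5")     # "lt 5" "gte 5" "gte 5"
```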

Conclusion

Every data object in R (including lists, matrices, data frames, etc.) can be broken down into vectors. Vectors are the fundamental data structure of R, and once you realize that and get comfortable with vectors, a lot of things in R will suddenly make a lot of sense.

8 Comments on "R Vectors: A Must-Learn"

David Hood
Guest
2 years 5 months ago

Just noting a typo in “so why do vectors matter”

“dataframe[ order(dataframe$columnname), 1:10 ]
It’s definitely a head-scratcher for newbies. There we are sorting the dataframe by columname and only keeping the first ten rows.”

I think you may have intended to write “only keeping the first 10 columns”

Sid
Guest
2 years 1 month ago

This is fabulous, if only someone can explain easily how to import data into R from excel files it’d be great. I love the ease with which you explained vectors, hope to switch from excel pretty soon!!!

Ayan
Guest
2 years 17 days ago

I believe there’s also a parenthesis missing in “You may also see things like this:

names(dataframe)[grepl(“blah_”, names(dataframe)]”

Also, if v3 = c("James", "John", "Jane", "Jack")
then v3[1, 3] returns an error. If we need to return the first and third values of that vector v3, then I believe one way of doing it is:

v3[c(1, 3)]

Otávio Celidonio
Guest
2 years 15 hours ago
Dear all! Congratulations for your web site. It looks very helpful. I’m a advanced excel user, but I’m on the first steps on R. I’m trying to create a data on excel, but to do that I need to calculate a array formula, but no one was able to help me with that. As you can see bellow, the formula is a array, and basically I need to Sum something from a minimum distance due some restriction. In the first “IF” there is the restriction, on this case the colun “G” represents the Year, and I want to sum only the sum from the last year “G2-1”. In the second “IF” it’s a formula who calculates the distance in kilometers from a point to all my data and consider only that one who is smaller than the value on the cell “BD$1”. So my formula Sum everything from the cells “$AA$2:$AA$225667” due to the restrictions that I told before. As you can see, it’s a Big Data and to process this formula in all the 225667 lines in a core i7 with 8GB it would take more than 50 hours, so if you could help to build on R, I…