For many beginners, the word “vector” is one of those scary, geeky things that creep us out. But understanding vectors will speed up your learning process right away. So in this post I will try my best to break down R vectors for you.
Note: this post is similar to the book’s chapter on vectors
What are vectors
This is a vector: 55
This is another vector: 55, 63, 1000, 2
And so is this: "James", "John", "Jane", "Jack"
A vector is simply a one-dimensional collection of things. The things in a vector are all of the same type (i.e., they must all be numeric, or all character, etc.). A vector can be as long as you want, or as short as one (like the first one).
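A quick way to check both claims (one type, any length) is to ask R directly. This is a minimal sketch with made-up values; class() and length() are base R functions, and the variable names are only for illustration.

```r
# A numeric vector and a character vector (values are made up)
nums  = c(10, 20, 30)
words = c("a", "b")

class(nums)    # "numeric"
class(words)   # "character"
length(nums)   # 3

# Mixing types does not fail: R converts everything to one common type
mixed = c(1, "two", 3)
class(mixed)   # "character"
```

So "same type" is something R enforces for you, sometimes silently, which is worth remembering when a column of numbers unexpectedly turns into text.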
How to Create Vectors in R
There are essentially two ways to “create” a vector:
- Make one manually using the c() function or other shortcuts
- Take one from an existing object, like a data frame
Let’s replicate the examples above manually using the c() function:
    # two ways to make a vector of length one
    v1 = 55
    v1 = c(55)
    v2 = c(55, 63, 1000, 2)
    v3 = c("James", "John", "Jane", "Jack")
Note the subtle power of the first example. The value 55 may look like a simple number, but in R speak it’s actually a vector of length one. The other examples are straightforward: c() stands for combine, as in combining things together.
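The "a number is a vector of length one" point is easy to verify in the console. The sketch below only uses base R functions (is.vector(), length(), identical()); the name combined is made up for illustration.

```r
v1 = 55

is.vector(v1)          # TRUE
length(v1)             # 1
identical(v1, c(55))   # TRUE: c(55) builds the same length-one vector

# c() also combines existing vectors into one flat vector
combined = c(v1, c(63, 1000), 2)
length(combined)       # 4
```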
There are sometimes shortcuts for creating vectors:
v = 5:10
There we created a vector v with integer values 5 through 10.
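The colon is not the only shortcut. As a small sketch, here are two other base R helpers, seq() and rep(), next to the colon operator; the variable names are made up for illustration.

```r
v = 5:10
class(v)                         # "integer"

countdown = 10:5                 # the colon also counts down: 10 9 8 7 6 5
steps = seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
xs = rep("x", times = 3)         # "x" "x" "x"
```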
Vectors can also come from existing data objects, like data frames. E.g., df$colname is a reference to a column from the data frame df. That, itself, is a vector. A row from a data frame, like df[1, ], is a close cousin: it is the first row of data frame df, but because a row can mix types, R returns it as a one-row data frame (a list of length-one vectors) rather than a single plain vector.
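It is worth checking in the console what a column and a row actually are. The sketch below assumes a tiny made-up data frame; is.vector(), class() and unlist() are base R.

```r
# Made-up data frame for illustration
df = data.frame(name  = c("James", "John"),
                score = c(55, 63),
                stringsAsFactors = FALSE)

col = df$score
is.vector(col)         # TRUE: a column is a plain vector

first_row = df[1, ]
class(first_row)       # "data.frame": a row can mix types, so it stays a data frame
is.vector(first_row)   # FALSE

# unlist() flattens the row into one vector, converting to a common type
row_vec = unlist(first_row)
is.character(row_vec)  # TRUE
```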
How to Reference Vectors
Here are some ways to reference vectors:
    # get second value of vector v2
    v2[ 2 ]
    # result: 63

    # get first and third values of vector v3
    v3[ c(1, 3) ]
    # result: "James" "Jane"

    # get the values of the second vector, in reverse order
    v2[ 4:1 ]
    # result: 2 1000 63 55

    # we could also use a vector of TRUE/FALSE to pick specific values
    v2[ c(TRUE, TRUE, FALSE, FALSE) ]
    # result: 55 63
Basically, in all those examples, we are supplying a vector of index numbers or TRUE/FALSE values inside the brackets to tell R which values to keep.
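Because the thing inside the brackets is just a vector, it can be stored in a variable first and reused. A minimal sketch, where idx and keep are made-up names:

```r
v3 = c("James", "John", "Jane", "Jack")

# An index vector stored in a variable
idx = c(1, 3)
by_position = v3[idx]     # "James" "Jane"

# A TRUE/FALSE vector stored in a variable
keep = c(TRUE, TRUE, FALSE, FALSE)
by_flag = v3[keep]        # "James" "John"
```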
So Why do Vectors Matter?
Ever wonder what’s happening when you run something like this?
It’s definitely a head-scratcher for newbies. There we are sorting the data frame by a column and keeping only the first ten columns. It turns out the two things inside the brackets are simply vectors instructing R how to order the rows and which columns to keep.
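A line that fits that description might look like the sketch below. The data frame df and its column colname are made-up stand-ins, built here only so the example runs.

```r
# Made-up data frame: colname plus 11 filler columns (12 columns in total)
df = data.frame(colname = c(3, 1, 2), matrix(1:33, nrow = 3))

# Sort rows by colname, keep the first ten columns
result = df[ order(df$colname), 1:10 ]

# The same thing with the two vectors pulled out and named
row_order = order(df$colname)   # 2 3 1
cols_keep = 1:10
same_result = df[ row_order, cols_keep ]

dim(result)        # 3 10
result$colname     # 1 2 3
```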
You may also see things like this:
It looks absolutely incomprehensible to newbies, and even to some intermediate users. Here we are pulling the column names from a data frame that have “blah_” in the name. But it turns out to be simple vector operations.
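One way to write what that paragraph describes is sketched below, assuming a made-up data frame df whose column names are invented for illustration.

```r
# Made-up data frame for illustration
df = data.frame(id = 1:2, blah_a = c(1, 2), blah_b = c(3, 4))

all_names = names(df)                   # "id" "blah_a" "blah_b"
has_blah  = grepl("blah_", all_names)   # FALSE TRUE TRUE

# Keep only the names flagged TRUE
blah_names = names(df)[ grepl("blah_", names(df)) ]
blah_names                              # "blah_a" "blah_b"
```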
Vector Math
I’m not sure this is technically “vector math” but it sounds smart, so I’ll stick with that. Let’s run some operations on our vectors above:
    # tell me which values > 60
    v2 > 60
The result looks like this:
    # values of v2:
    # 55 63 1000 2
    # result:
    # FALSE TRUE TRUE FALSE
The R operation returns a vector of TRUE / FALSE values indicating which values are > 60.
Recall from the section above that to reference values within a vector we simply supply a vector of either index numbers or TRUE / FALSE values indicating which values to keep. So what if we wanted the actual values of the vector that are > 60?
    # tell me the values of vector v2 > 60
    v2[ v2 > 60 ]
    # that basically runs this operation:
    # v2[ c(FALSE, TRUE, TRUE, FALSE) ]
    # result:
    # 63 1000
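Splitting this into two steps makes the TRUE/FALSE vector visible. The sketch also uses two base R helpers that are not part of the example above, which() and sum(), to show what else a TRUE/FALSE vector can tell you.

```r
v2 = c(55, 63, 1000, 2)

keep = v2 > 60       # FALSE TRUE TRUE FALSE
big  = v2[keep]      # 63 1000

which(keep)          # 2 3: the positions that are TRUE
sum(keep)            # 2: how many values are > 60 (TRUE counts as 1)
```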
By the same token, let’s find the names in vector v3 that have the letter “n”:
    v3[ grepl("n", v3) ]
    # result:
    # "John" "Jane"
grepl() is sort of like Excel’s FIND() and SEARCH() functions, but way more powerful. It basically looks for “n” in each element of the vector v3 and returns a vector of TRUE/FALSE values accordingly.
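Running grepl() on its own shows the TRUE/FALSE vector it hands back. Part of the extra power is that the pattern is a regular expression; the "^J" pattern below (names starting with J) is an added illustration, not one of the examples above.

```r
v3 = c("James", "John", "Jane", "Jack")

has_n = grepl("n", v3)          # FALSE TRUE TRUE FALSE
starts_j = grepl("^J", v3)      # TRUE TRUE TRUE TRUE

with_a = v3[ grepl("a", v3) ]   # "James" "Jane" "Jack"
```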
Let’s look at yet another example to bring this whole thing to full clarity. Let’s sort the values of vector v2, using the order() function. The order function returns the positions of the values in sorted order: which element comes first, which comes second, and so on. That result is, naturally, a vector itself:
    order(v2)
    # values of v2:
    # 55 63 1000 2
    # result:
    # 4 1 2 3
This is actually slightly confusing, but here’s what the result says:
- the first ordered value (i.e., smallest) is the 4th value in v2 (which is 2)
- the second ordered value is the 1st value in v2 (55)
- the third ordered value is the 2nd value in v2 (63)
- the largest value is the 3rd value in v2 (1000)
So knowing that, we get a sorted version of v2 by simply putting the order function inside v2’s brackets:
    v2[ order(v2) ]
    # result:
    # 2 55 63 1000
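Since order() just returns a vector of positions, the same positions can reorder a different vector of the same length, which is the idea behind sorting a data frame by one column. A short sketch using v2 and v3 from earlier; the decreasing argument and sort() are base R additions not used above.

```r
v2 = c(55, 63, 1000, 2)
v3 = c("James", "John", "Jane", "Jack")

ascending  = v2[ order(v2) ]                      # 2 55 63 1000
descending = v2[ order(v2, decreasing = TRUE) ]   # 1000 63 55 2

# sort() is the direct route when you only need the sorted values
sorted = sort(v2)                                 # 2 55 63 1000

# Reorder the names by the numbers in v2
names_by_v2 = v3[ order(v2) ]                     # "Jack" "James" "John" "Jane"
```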
I don’t know about you, but this is magical.
Vectors in Data Frame Operations
Recall that a data frame is like an Excel data table: a collection of rows and columns, where all the values in a column are of the same type (numeric, character, etc.). And you reference a data frame like this: df[ rows, columns ]
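And most importantly, each column is a vector itself. That is, df$col is a vector (the column named col). A row such as df[1, ] is the first row of the data frame; R returns it as a one-row data frame, since a row can mix types, but every value in it is still a length-one piece of a column vector.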
Here is a sampling of data frame operations you will use / see often, involving vectors:
    # get first ten rows and first ten columns.
    # Remember 1:10 is just a shortcut for c(1,2,3,4,5,6,7,8,9,10)
    df[ 1:10, 1:10 ]

    # SORT df by col2, and get all columns.
    # df$col2 references col2, and is actually a vector itself
    df[ order(df$col2), ]

    # FILTER rows where col2 > 50. Like an Excel filter.
    # df$col2 > 50 returns a TRUE/FALSE vector
    df[ df$col2 > 50, ]

    # FILTER rows meeting multiple criteria. & is like AND() in Excel.
    # It returns TRUE for each row where all conditions are met, FALSE otherwise
    df[ df$col2 > 50 & df$col4 == "John", ]

    # Get columns containing "raw_". Return all rows
    df[ , grepl("raw_", names(df)) ]

    # Organize columns alphabetically
    df[ , order(names(df)) ]

    # Get columns whose names have more than five characters
    df[ , nchar(names(df)) > 5 ]

    # CREATE new column based on conditional ifelse(), which is like Excel's IF()
    df$newcol = ifelse(df$oldcol < 5, "lt 5", "gte 5")
The list is practically endless. But now you can see the powerful role that vectors play in data frame manipulation.
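To try a few of these on something concrete, here is a sketch on a tiny made-up data frame. The column names mirror the ones used above, but the values are invented for illustration.

```r
# Made-up data frame for illustration
df = data.frame(
  col1  = c("a", "b", "c"),
  col2  = c(70, 20, 90),
  raw_x = c(1, 2, 3),
  raw_y = c(4, 5, 6),
  col4  = c("John", "Jane", "John"),
  stringsAsFactors = FALSE
)

# SORT by col2: order(df$col2) is the vector 2 1 3
sorted = df[ order(df$col2), ]
sorted$col2                      # 20 70 90

# FILTER on two conditions: the row selector is TRUE FALSE TRUE
filtered = df[ df$col2 > 50 & df$col4 == "John", ]
filtered$col1                    # "a" "c"

# Columns containing "raw_": the column selector is FALSE FALSE TRUE TRUE FALSE
raw_cols = df[ , grepl("raw_", names(df)) ]
names(raw_cols)                  # "raw_x" "raw_y"
```

One thing to watch: when the column selector matches exactly one column, df[ , ... ] returns a plain vector instead of a data frame unless you add drop = FALSE.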
Let’s look a bit closer at the last example. ifelse() evaluates the condition, df$oldcol < 5, which results in a vector of TRUE/FALSE. Where it is TRUE, newcol is assigned “lt 5”; where it is FALSE, newcol is assigned “gte 5”. Once again, vectors come into play.
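The same thing works on a plain vector, which makes the element-by-element behavior easier to see. A minimal sketch, with oldcol as a made-up standalone vector rather than a data frame column:

```r
oldcol = c(2, 5, 9)

cond = oldcol < 5                             # TRUE FALSE FALSE
newcol = ifelse(oldcol < 5, "lt 5", "gte 5")  # "lt 5" "gte 5" "gte 5"
```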
Every data object in R (including lists, matrices, data frames, etc.) can be broken down to vectors. Vectors are the fundamental data structure of R, and once you realize that and get comfortable with vectors, a lot of things in R will suddenly make a lot of sense.