6 How to Filter and Transform Data with the dplyr Package

In the last chapter, I showed you how to use R base to filter and transform objects. You will need to learn how to use base R to work with most object types in R.

However, there’s another popular package for data filtering and transformation – the dplyr package.

The dplyr package (pronounced like “data plier”) was developed to allow more intuitive data transformation. It falls under the tidyverse, which is a collection of other popular R packages.

A big difference between this method and the last chapter is our focus. Rather than filtering and transforming any object type, we’ll focus exclusively on data frames.

6.2 Is It Still Worth Learning Base R?

dplyr’s strengths are mostly limited to data frames, whereas base R can work with vectors, matrices, arrays, and lists. That makes it still worthwhile to study base R.

This isn’t a knock against dplyr. Much of the work you’ll do requires a data frame and most R functions work with data frames fairly easily. But I have found that the functions I typically write (which we’ll cover later) work a lot better when you understand base R.

6.3 How to Install and Load dplyr

In our next chapter, I’ll go into more detail about installing and loading packages. For now though, simply run the script below and you can follow along with our examples.

  install.packages("dplyr")
  load(dplyr)

You can also run install.packages("tidyverse"), which will install other packages related to data management, including dplyr.

6.4 The dplyr Syntax

If you recall, filtering a data frame in base R looks like this:

  #If needed, you can reload the Bond data set with this script
  bond <- read.csv("https://raw.githubusercontent.com/taylorrodgers/bond/main/bond.csv")
  
  bond[bond["year"]>=1990,1:3]

That’s different in dplyr, which looks like this:

  bond %>% 
    select(filmname,year,actor) %>% 
    filter(year>=1990)

You can think of the dplyr syntax as a set of instructions to read one-by-one:

  1. Hey data set named “Bond”… bond %>%
  2. Tell me the film name, year released, and actor name for… select(filmname,year,actor) %>%
  3. James Bond films made after 1990 filter(year>=1990)

That’s a lot easier to understand than normal base R, ain’t it?

6.5 Understand the Difference Between Selecting and Filtering

With dplyr, it’s important to remember the difference between selecting and filtering.

Selecting uses the select() function and reduces the columns to those you specify:

In addition to select(), there’s also mutate() and transmute(), which serve similar purposes. We’ll cover those two functions in the next section.

Filtering uses the filter() function and reduces the rows to those you specify.

6.6 How to Select and Mutate a Data Frame with dplyr in R

In dplyr, you can select columns using the select() function:

  bond %>% 
    select(actor,filmname,gross)

Executing the code above will only select the columns actor, filmname, and gross.

Sometimes you’ll need to transform or alter the data. That’s where mutate() comes in handy. mutate() will allow you to create new fields that alter existing ones. Confused?

Execute the script below to see:

  bond %>% 
    mutate(gross_millions=gross*1000000)

The mutate() function added a new column to the end called gross_millions with our new calculation.

One thing to keep in mind with mutate() is that it always includes all existing columns. This can be good or bad, depending on your goal.

Let’s say you want to create a new column, but not include the existing ones in your output. Rather than add another %>% select() to your code, you can use the transmute() function.

transmute() combines the functionality of mutate() and select(). You can both define the columns you want to keep and mutate others.

  bond %>% 
    transmute(actor,filmname,gross=gross*1000000)

I personally like transmute() for selecting columns. However, select() and mutate() are appropriate in many situations as well.

6.7 How to Filter a Data Frame with dplyr in R

To filter rows in a data frame, use the filter() function:

  bond %>% 
    filter(year>=1980 & actor=="Daniel Craig")

If you noticed, we used the same operators as in base R. We can use any R operators shown below:

Table 6.1: R Operators
label symbol
less than <
greater than >
less than or equal <=
greater than or equal >=
equal ==
does not equal !=
and &
or |
in %in%

You can apply any of these operators within the filter() function:

  bond %>% 
    filter(actor == "Daniel Craig" | actor == "Sean Connery")
  bond %>%
    filter(gross >= 700 & year < 2000)

Like in base R, you can also pass vectors in for dynamic filters:

  actor_list <- c("Daniel Craig","Sean Connery","Timothy Dalton")
  bond %>% 
    filter(actor %in% actor_list)

6.8 How to Summarize and Group Data with dplyr in R

Probably the most useful thing about dplyr is the ability to create new data frames that group and summarize data found in the larger data set. This is the primary reason I like dplyr.

Let’s say you wanted to take the Bond data set and find out the mean and standard deviation for gross revenue, but only for the films starring Daniel Craig and Sean Connery.

To accomplish this in base R, you’d have to use the following code:

  data.frame(actor_subselect=c("Daniel Craig","Sean Connery"), 
             average_revenue=c(mean(bond[bond$actor=="Daniel Craig","gross"]),
                               mean(bond[bond$actor=="Sean Connery","gross"])),
             sdev_revenue=c(sd(bond[bond$actor=="Daniel Craig","gross"]),
                            sd(bond[bond$actor=="Sean Connery","gross"])))

It’s not very simple, is it? Imagine having to do that for a much larger data set with even more actors!

dplyr’s group_by() and summarize() functions really cut down on this work. Here’s how we can accomplish the same thing as above:

  bond %>% 
    filter(actor=="Daniel Craig" | actor == "Sean Connery") %>% 
    group_by(actor) %>% 
    summarize(average_revenue=mean(gross),sdev_revenue=sd(gross))

Now you’ll notice this script isn’t shorter than what we did earlier. However, it’s a lot easier to read. That’s the beauty of dplyr!

6.9 Things to Remember

  • dplyr is a package that provides a more intuitive syntax for transforming and analyzing data frames in R
  • Select and create new columns with select(), mutate(), and transmute()
  • Filter rows with filter() and the base R operators
  • Provide summary statistics with group_by() and summarize() functions

6.10 Exercises

  1. Using the mtcars data set and the dplyr package, filter to cars that have a gear with the value of 4 or an hp greater than 115. (Hint: use data(mtcars) to load the data set.)
  2. Using the mtcars data set and the dplyr package, apply the same filter as the first question and select only the columns mpg, cyl, gear, and hp. (Note: the output will include the car models as row names by default. No need to remove those.)
  3. Using the mtcars data set and the dplyr package, create a new column that takes the natural log (log()) of mpg and name it mpg_log. Be sure to still select all the other columns mentioned in the second question except for mpg.
  4. Using the mtcars data set and the dplyr package, determine the average mpg, grouped by gear, and filtered to only wt greater than 2.