In the previous tutorial we covered basics in terms of data wrangling, but as you have more conditions and parameters, even typing the subset function over and over can get very tedious, and that is why the package dplyr is written. In a sense, this is advance grammar for R.
6 Key Objectives in Data Manipulation are:
- Filtering rows (to create subset)
- Selecting columns (selecting specific variables)
- Adding new variables
- Sorting
- Aggregrating
- More Functions
Before we start off, we must install dplyr into our RStudio, either through the package tab, or from the code below (which is recommended, as someone else who view your code might not have it installed), and also install ggplot2 first even though it is for data visualisation that will be covered on the next tutorial (because there is a nice inbuilt dataset called diamonds)
library(dplyr) library(ggplot2) head(diamonds)
Output:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
1. Filtering
Let’s subset the data to look at diamonds that have good cuts that have carats larger than 0.30
Goodcut <- diamonds %>% filter(cut == "Good") %>% filter(carat > 0.30) head(Goodcut)
Output:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 2 0.31 Good H SI1 64.0 54 402 4.29 4.31 2.75
## 3 0.32 Good H SI2 63.1 56 403 4.34 4.37 2.75
## 4 0.32 Good H SI2 63.8 56 403 4.36 4.38 2.79
## 5 0.70 Good E VS2 57.5 58 2759 5.85 5.90 3.38
## 6 0.70 Good F VS1 59.4 62 2759 5.71 5.76 3.40
%>% This is a pipe operator and it chain commands together. What we just did was to chain select the diamond dataset, filter cut that is “Good”, and filter out diamonds which have more than 0.30 carats.
You might be wondering if it is worth learning another format of doing filtering since we already kinda covered it in the basics. dplyr actually have formatting and shorter code as you include more and more complex conditions to your filter. You realise you only wrote diamond once, instead of
Goodcut = subset(diamonds, diamonds$cut=="Good"&diamonds$carat>0.30)
You can understand the pipe operator http://seananderson.ca/2014/09/13/dplyr-intro.html
2. Select Columns
Select only the carat, cut and price
salesinformation <- diamonds %>% select(carat,cut,price) head(salesinformation)
Output:
## # A tibble: 6 x 3
## carat cut price
##
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
3. Adding new variables
Let’s create a new variable called the volume of the diamonds, with the function mutate.
P.S.Here I even pipe the command head with the rest of the commands because I am lazy to type head(diamonds). Trust me, after doing this a while, you will hate typing the same name for a dataset over and over again, as it also give room for alot of error (such as you typing diamond or Diamonds instead of diamonds)
volume <- diamonds %>% select(carat,x,y,z) %>% mutate (volume = x*y*z) %>% head
4. Sorting
Let’s arrange the diamonds based on the highest price.
diamonds %>% arrange(desc(price))%>%head
Output:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
##
## 1 2.29 Premium I VS2 60.8 60 18823 8.50 8.47 5.16
## 2 2.00 Very Good G SI1 63.5 56 18818 7.90 7.97 5.04
## 3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
## 4 2.07 Ideal G SI2 62.5 55 18804 8.20 8.13 5.11
## 5 2.00 Very Good H SI1 62.8 57 18803 7.95 8.00 5.01
## 6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
5. Aggregate Information
What if I wnat to find out the average price for each type of clarity and the number of diamonds that have the particular clarity?
clarity <- diamonds %>% select(carat,clarity,price)%>% #group_by helps to group the diamonds of the same clarity together group_by(clarity)%>% #sumarise helps to summarise the values of the analysis required summarise(total.no.diamonds = n(), average.price = mean(price)) clarity
Output:
## # A tibble: 8 x 3
## clarity total.no.diamonds average.price
##
## 1 I1 741 3924.169
## 2 SI2 9194 5063.029
## 3 SI1 13065 3996.001
## 4 VS2 12258 3924.989
## 5 VS1 8171 3839.455
## 6 VVS2 5066 3283.737
## 7 VVS1 3655 2523.115
## 8 IF 1790 2864.839
6. More functions
There are even more useful functions in the dpylr package, which can be found in this neat cheatsheet for data wrangling (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).
Next: Tutorial 8 Data Visualisation