R Tutorial 7: Data Manipulation Continued

In the previous tutorial we covered basics in terms of data wrangling, but as you have more conditions and parameters, even typing the subset function over and over can get very tedious, and that is why the package dplyr is written. In a sense, this is advance grammar for R.

6 Key Objectives in Data Manipulation are:

  1. Filtering rows (to create subset)
  2. Selecting columns (selecting specific variables)
  3. Adding new variables
  4. Sorting
  5. Aggregrating
  6. More Functions

Before we start off, we must install dplyr into our RStudio, either through the package tab, or from the code below (which is recommended, as someone else who view your code might not have it installed), and also install ggplot2 first even though it is for data visualisation that will be covered on the next tutorial (because there is a nice inbuilt dataset called diamonds)

library(dplyr)
library(ggplot2)
head(diamonds)
Output:
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##                  
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

1. Filtering

Let’s subset the data to look at diamonds that have good cuts that have carats larger than 0.30

Goodcut <- diamonds %>% 
  filter(cut == "Good") %>% 
  filter(carat > 0.30)
head(Goodcut)
Output:
## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##              
## 1  0.31  Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 2  0.31  Good     H     SI1  64.0    54   402  4.29  4.31  2.75
## 3  0.32  Good     H     SI2  63.1    56   403  4.34  4.37  2.75
## 4  0.32  Good     H     SI2  63.8    56   403  4.36  4.38  2.79
## 5  0.70  Good     E     VS2  57.5    58  2759  5.85  5.90  3.38
## 6  0.70  Good     F     VS1  59.4    62  2759  5.71  5.76  3.40

%>% This is a pipe operator and it chain commands together. What we just did was to chain select the diamond dataset, filter cut that is “Good”, and filter out diamonds which have more than 0.30 carats.

You might be wondering if it is worth learning another format of doing filtering since we already kinda covered it in the basics. dplyr actually have formatting and shorter code as you include more and more complex conditions to your filter. You realise you only wrote diamond once, instead of

Goodcut = subset(diamonds, diamonds$cut=="Good"&diamonds$carat>0.30)

You can understand the pipe operator http://seananderson.ca/2014/09/13/dplyr-intro.html

2. Select Columns

Select only the carat, cut and price

salesinformation <- diamonds %>% select(carat,cut,price)
head(salesinformation)
Output:
## # A tibble: 6 x 3
##   carat       cut price
##         
## 1  0.23     Ideal   326
## 2  0.21   Premium   326
## 3  0.23      Good   327
## 4  0.29   Premium   334
## 5  0.31      Good   335
## 6  0.24 Very Good   336

3. Adding new variables

Let’s create a new variable called the volume of the diamonds, with the function mutate.

P.S.Here I even pipe the command head with the rest of the commands because I am lazy to type head(diamonds). Trust me, after doing this a while, you will hate typing the same name for a dataset over and over again, as it also give room for alot of error (such as you typing diamond or Diamonds instead of diamonds)

volume <- diamonds %>%
  select(carat,x,y,z) %>%
  mutate (volume = x*y*z) %>%
  head

4. Sorting

Let’s arrange the diamonds based on the highest price.

diamonds %>% arrange(desc(price))%>%head
Output:
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##                  
## 1  2.29   Premium     I     VS2  60.8    60 18823  8.50  8.47  5.16
## 2  2.00 Very Good     G     SI1  63.5    56 18818  7.90  7.97  5.04
## 3  1.51     Ideal     G      IF  61.7    55 18806  7.37  7.41  4.56
## 4  2.07     Ideal     G     SI2  62.5    55 18804  8.20  8.13  5.11
## 5  2.00 Very Good     H     SI1  62.8    57 18803  7.95  8.00  5.01
## 6  2.29   Premium     I     SI1  61.8    59 18797  8.52  8.45  5.24

5. Aggregate Information

What if I wnat to find out the average price for each type of clarity and the number of diamonds that have the particular clarity?

clarity <- diamonds %>%
  select(carat,clarity,price)%>%
  #group_by helps to group the diamonds of the same clarity together
  group_by(clarity)%>%
  #sumarise helps to summarise the values of the analysis required
  summarise(total.no.diamonds = n(), average.price = mean(price))
clarity
Output:
## # A tibble: 8 x 3
##   clarity total.no.diamonds average.price
##                           
## 1      I1               741      3924.169
## 2     SI2              9194      5063.029
## 3     SI1             13065      3996.001
## 4     VS2             12258      3924.989
## 5     VS1              8171      3839.455
## 6    VVS2              5066      3283.737
## 7    VVS1              3655      2523.115
## 8      IF              1790      2864.839

6. More functions

There are even more useful functions in the dpylr package, which can be found in this neat cheatsheet for data wrangling (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).

Next: Tutorial 8 Data Visualisation

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s