R Tutorial 6: Data Manipulation

In this tutorial, we will be covering basics of data manipulations using in-built R functions.

For the basics, we will be using two inbuilt datasets in R for practice purposes: airquality (you can call this out directly)

Data Viewing

head(<dataframe>) – This allow you to see the first 5 rows of a dataset

head(airquality)
Output:
##   Ozone Solar.R Wind Temp Month Day Year     Date   DateType
## 1    41     190  7.4   67     5   1 2017 2017-5-1 2017-05-01
## 2    36     118  8.0   72     5   2 2017 2017-5-2 2017-05-02
## 3    12     149 12.6   74     5   3 2017 2017-5-3 2017-05-03
## 4    18     313 11.5   62     5   4 2017 2017-5-4 2017-05-04
## 5    NA      NA 14.3   56     5   5 2017 2017-5-5 2017-05-05
## 6    28      NA 14.9   66     5   6 2017 2017-5-6 2017-05-06

tail(<dataframe>) – This allows you to see the last 5 rows of a dataset

tail(airquality)
Output:
##     Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 148    14      20 16.6   63     9  25 2017 2017-9-25 2017-09-25
## 149    30     193  6.9   70     9  26 2017 2017-9-26 2017-09-26
## 150    NA     145 13.2   77     9  27 2017 2017-9-27 2017-09-27
## 151    14     191 14.3   75     9  28 2017 2017-9-28 2017-09-28
## 152    18     131  8.0   76     9  29 2017 2017-9-29 2017-09-29
## 153    20     223 11.5   68     9  30 2017 2017-9-30 2017-09-30

str(<dataframe>) – This allows you to see an overview of your variables in the dataset.

str(airquality)
Output:
## 'data.frame':    153 obs. of  9 variables:
##  $ Ozone   : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R : int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind    : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp    : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month   : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year    : num  2017 2017 2017 2017 2017 ...
##  $ Date    : chr  "2017-5-1" "2017-5-2" "2017-5-3" "2017-5-4" ...
##  $ DateType: Date, format: "2017-05-01" "2017-05-02" ...

summary(<dataframe>) – This helps you see the statistics of observations in the variables present.

summary(airquality)
Output:
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day            Year          Date          
##  Min.   :5.000   Min.   : 1.0   Min.   :2017   Length:153        
##  1st Qu.:6.000   1st Qu.: 8.0   1st Qu.:2017   Class :character  
##  Median :7.000   Median :16.0   Median :2017   Mode  :character  
##  Mean   :6.993   Mean   :15.8   Mean   :2017                     
##  3rd Qu.:8.000   3rd Qu.:23.0   3rd Qu.:2017                     
##  Max.   :9.000   Max.   :31.0   Max.   :2017                     
##                                                                  
##     DateType         
##  Min.   :2017-05-01  
##  1st Qu.:2017-06-08  
##  Median :2017-07-16  
##  Mean   :2017-07-16  
##  3rd Qu.:2017-08-23  
##  Max.   :2017-09-30  
## 

dim(<dataframe>) – A quick way to know the number of rows and column of a dataset.

dim(airquality)
Output:
## [1] 153   9

Subsetting Data

A quick way to isolate rows and colums can be done in the manner dataframe[x,y], where x is the column and y is row number. Leaving x or y blank will return the entire row or column that you want. Also, you can not only isolate for 1 row/column, but a certain range of it if you replace the value from x and y with a vector of integers that correspond with the row/columns number you want.

Singular row/columns

You should assign the subsetted dataframe to another variable for easy access in the future, else you will have to retype the command again if you want that specific subset again.

a <- airquality[3,]
b <- airquality[,4]
a
Output:
##   Ozone Solar.R Wind Temp Month Day Year     Date   DateType
## 3    12     149 12.6   74     5   3 2017 2017-5-3 2017-05-03
b
Output:
##   [1] 67 72 74 62 56 66 65 59 61 69 74 69 66 68 58 64 66 57 68 62 59 73 61
##  [24] 61 57 58 57 67 81 79 76 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79
##  [47] 77 72 65 73 76 77 76 76 76 75 78 73 80 77 83 84 85 81 84 83 83 88 92
##  [70] 92 89 82 73 81 91 80 81 82 84 87 85 74 81 82 86 85 82 86 88 86 83 81
##  [93] 81 81 82 86 85 87 89 90 90 92 86 86 82 80 79 77 79 76 78 78 77 72 75
## [116] 79 81 86 88 97 94 96 94 91 92 93 93 87 84 80 78 75 73 81 76 77 71 71
## [139] 78 67 76 68 82 64 71 81 69 63 70 77 75 76 68

Range of row/columns

Let’s say I want rows from 4 to 10, I just need to create a vector c(4:10) and put it in.
REMEMBER THE COMMA, it was one of the major mistakes I made.

airquality[c(4:10),]
Output:
##    Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 4     18     313 11.5   62     5   4 2017  2017-5-4 2017-05-04
## 5     NA      NA 14.3   56     5   5 2017  2017-5-5 2017-05-05
## 6     28      NA 14.9   66     5   6 2017  2017-5-6 2017-05-06
## 7     23     299  8.6   65     5   7 2017  2017-5-7 2017-05-07
## 8     19      99 13.8   59     5   8 2017  2017-5-8 2017-05-08
## 9      8      19 20.1   61     5   9 2017  2017-5-9 2017-05-09
## 10    NA     194  8.6   69     5  10 2017 2017-5-10 2017-05-10

If you want to exclude a certain row/column, you just need to put an ‘-‘ sign in front of the vector.

airquality[-c(1:140),]
Output:
##     Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 141    13      27 10.3   76     9  18 2017 2017-9-18 2017-09-18
## 142    24     238 10.3   68     9  19 2017 2017-9-19 2017-09-19
## 143    16     201  8.0   82     9  20 2017 2017-9-20 2017-09-20
## 144    13     238 12.6   64     9  21 2017 2017-9-21 2017-09-21
## 145    23      14  9.2   71     9  22 2017 2017-9-22 2017-09-22
## 146    36     139 10.3   81     9  23 2017 2017-9-23 2017-09-23
## 147     7      49 10.3   69     9  24 2017 2017-9-24 2017-09-24
## 148    14      20 16.6   63     9  25 2017 2017-9-25 2017-09-25
## 149    30     193  6.9   70     9  26 2017 2017-9-26 2017-09-26
## 150    NA     145 13.2   77     9  27 2017 2017-9-27 2017-09-27
## 151    14     191 14.3   75     9  28 2017 2017-9-28 2017-09-28
## 152    18     131  8.0   76     9  29 2017 2017-9-29 2017-09-29
## 153    20     223 11.5   68     9  30 2017 2017-9-30 2017-09-30

P.S. I would like to point out that instead of calling columns out by its column number, it is more intuitive to call out the header of the column through the method below:

c <- airquality$Wind
c
Output:
##   [1]  7.4  8.0 12.6 11.5 14.3 14.9  8.6 13.8 20.1  8.6  6.9  9.7  9.2 10.9
##  [15] 13.2 11.5 12.0 18.4 11.5  9.7  9.7 16.6  9.7 12.0 16.6 14.9  8.0 12.0
##  [29] 14.9  5.7  7.4  8.6  9.7 16.1  9.2  8.6 14.3  9.7  6.9 13.8 11.5 10.9
##  [43]  9.2  8.0 13.8 11.5 14.9 20.7  9.2 11.5 10.3  6.3  1.7  4.6  6.3  8.0
##  [57]  8.0 10.3 11.5 14.9  8.0  4.1  9.2  9.2 10.9  4.6 10.9  5.1  6.3  5.7
##  [71]  7.4  8.6 14.3 14.9 14.9 14.3  6.9 10.3  6.3  5.1 11.5  6.9  9.7 11.5
##  [85]  8.6  8.0  8.6 12.0  7.4  7.4  7.4  9.2  6.9 13.8  7.4  6.9  7.4  4.6
##  [99]  4.0 10.3  8.0  8.6 11.5 11.5 11.5  9.7 11.5 10.3  6.3  7.4 10.9 10.3
## [113] 15.5 14.3 12.6  9.7  3.4  8.0  5.7  9.7  2.3  6.3  6.3  6.9  5.1  2.8
## [127]  4.6  7.4 15.5 10.9 10.3 10.9  9.7 14.9 15.5  6.3 10.9 11.5  6.9 13.8
## [141] 10.3 10.3  8.0 12.6  9.2 10.3 10.3 16.6  6.9 13.2 14.3  8.0 11.5

At this point, you might question the usefulness of subsetting via column/row numbers. This is just to give you a basic understanding of how R handles data frames. I will introduce a more efficient method in Tutorial 7.

Conditions with Logical Operators

Now we will move on to subset data using logical operators (less than / greater than / equal to). This is a big advantage R has over Excel as Excel commands and filters are rather limited.

An inbuilt function called subset(, condition) is key to doing so. We now want to isolate observations in the dataset where the temperature fell below 65F:

colddays <- subset(airquality, airquality$Temp<65)
colddays
Output:
##     Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 4      18     313 11.5   62     5   4 2017  2017-5-4 2017-05-04
## 5      NA      NA 14.3   56     5   5 2017  2017-5-5 2017-05-05
## 8      19      99 13.8   59     5   8 2017  2017-5-8 2017-05-08
## 9       8      19 20.1   61     5   9 2017  2017-5-9 2017-05-09
## 15     18      65 13.2   58     5  15 2017 2017-5-15 2017-05-15
## 16     14     334 11.5   64     5  16 2017 2017-5-16 2017-05-16
## 18      6      78 18.4   57     5  18 2017 2017-5-18 2017-05-18
## 20     11      44  9.7   62     5  20 2017 2017-5-20 2017-05-20
## 21      1       8  9.7   59     5  21 2017 2017-5-21 2017-05-21
## 23      4      25  9.7   61     5  23 2017 2017-5-23 2017-05-23
## 24     32      92 12.0   61     5  24 2017 2017-5-24 2017-05-24
## 25     NA      66 16.6   57     5  25 2017 2017-5-25 2017-05-25
## 26     NA     266 14.9   58     5  26 2017 2017-5-26 2017-05-26
## 27     NA      NA  8.0   57     5  27 2017 2017-5-27 2017-05-27
## 144    13     238 12.6   64     9  21 2017 2017-9-21 2017-09-21
## 148    14      20 16.6   63     9  25 2017 2017-9-25 2017-09-25

You can also have many other conditions, that can be joined together with the ‘and’ and ‘or’ functions.

coldandwindydays 15&airquality$Temp<65)
coldandwindydays 
Output:
##     Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 9       8      19 20.1   61     5   9 2017  2017-5-9 2017-05-09
## 18      6      78 18.4   57     5  18 2017 2017-5-18 2017-05-18
## 25     NA      66 16.6   57     5  25 2017 2017-5-25 2017-05-25
## 148    14      20 16.6   63     9  25 2017 2017-9-25 2017-09-25

Max & Min

Thus, with the understanding how subsetting works, we can move on to utilise more functions that help you refine your subsets.

Max() and Min () returns the max/min of the column Temperature

max(airquality$Temp)
Output:
## [1] 97
min(airquality$Temp)
Output:
## [1] 56

which.min & which.max

‘which’ function here helps you isolate the row/column number of the min or max. ‘which’ can be used with other functions as well, so try it out yourself!

which.min(airquality$Temp)
Output:
## [1] 5
which.max(airquality$Temp)
Output:
## [1] 120

mean() & median()

If you would like to know if the observations are above/below the average or median you can use mean() and median().

mean(airquality$Temp)
Output:
## [1] 77.88235
median(airquality$Temp)
Output:
## [1] 79

Quantile

There is also this function called quantile, for those people who want to find out the bell curve 😉

quantile(airquality$Temp,probs = seq(0,1,0.25))
Output:
##   0%  25%  50%  75% 100% 
##   56   72   79   85   97

Data Cleansing

Any data scientist/or analyst handling data will tell you they spend 90% of the time on this part, as real-world data always look like a hot mess of raw text. Identifying potential pitfalls of your data (e.g. missing, incomplete, badly formatted data) comes with experience.

The writer of this article (me) has very little experience in this. Hence, we will be covering some elementary techniques of “data-cleaning”

Removing Empty Observations

#before deleting observations with missing values, you might want to check it out first.
#is.na(dataframe) returns a vector of numbers that are the rows that have missing values.
airquality[is.na(airquality),]
Output:
##      Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 5       NA      NA 14.3   56     5   5 2017  2017-5-5 2017-05-05
## 10      NA     194  8.6   69     5  10 2017 2017-5-10 2017-05-10
## 25      NA      66 16.6   57     5  25 2017 2017-5-25 2017-05-25
## 26      NA     266 14.9   58     5  26 2017 2017-5-26 2017-05-26
## 27      NA      NA  8.0   57     5  27 2017 2017-5-27 2017-05-27
## 32      NA     286  8.6   78     6   1 2017  2017-6-1 2017-06-01
## 33      NA     287  9.7   74     6   2 2017  2017-6-2 2017-06-02
## 34      NA     242 16.1   67     6   3 2017  2017-6-3 2017-06-03
## 35      NA     186  9.2   84     6   4 2017  2017-6-4 2017-06-04
## 36      NA     220  8.6   85     6   5 2017  2017-6-5 2017-06-05
## 37      NA     264 14.3   79     6   6 2017  2017-6-6 2017-06-06
## 39      NA     273  6.9   87     6   8 2017  2017-6-8 2017-06-08
## 42      NA     259 10.9   93     6  11 2017 2017-6-11 2017-06-11
## 43      NA     250  9.2   92     6  12 2017 2017-6-12 2017-06-12
## 45      NA     332 13.8   80     6  14 2017 2017-6-14 2017-06-14
## 46      NA     322 11.5   79     6  15 2017 2017-6-15 2017-06-15
## 52      NA     150  6.3   77     6  21 2017 2017-6-21 2017-06-21
## 53      NA      59  1.7   76     6  22 2017 2017-6-22 2017-06-22
## 54      NA      91  4.6   76     6  23 2017 2017-6-23 2017-06-23
## 55      NA     250  6.3   76     6  24 2017 2017-6-24 2017-06-24
## 56      NA     135  8.0   75     6  25 2017 2017-6-25 2017-06-25
## 57      NA     127  8.0   78     6  26 2017 2017-6-26 2017-06-26
## 58      NA      47 10.3   73     6  27 2017 2017-6-27 2017-06-27
## 59      NA      98 11.5   80     6  28 2017 2017-6-28 2017-06-28
## 60      NA      31 14.9   77     6  29 2017 2017-6-29 2017-06-29
## 61      NA     138  8.0   83     6  30 2017 2017-6-30 2017-06-30
## 65      NA     101 10.9   84     7   4 2017  2017-7-4 2017-07-04
## 72      NA     139  8.6   82     7  11 2017 2017-7-11 2017-07-11
## 75      NA     291 14.9   91     7  14 2017 2017-7-14 2017-07-14
## 83      NA     258  9.7   81     7  22 2017 2017-7-22 2017-07-22
## 84      NA     295 11.5   82     7  23 2017 2017-7-23 2017-07-23
## 102     NA     222  8.6   92     8  10 2017 2017-8-10 2017-08-10
## 103     NA     137 11.5   86     8  11 2017 2017-8-11 2017-08-11
## 107     NA      64 11.5   79     8  15 2017 2017-8-15 2017-08-15
## 115     NA     255 12.6   75     8  23 2017 2017-8-23 2017-08-23
## 119     NA     153  5.7   88     8  27 2017 2017-8-27 2017-08-27
## 150     NA     145 13.2   77     9  27 2017 2017-9-27 2017-09-27
## NA      NA      NA   NA   NA    NA  NA   NA             
## NA.1    NA      NA   NA   NA    NA  NA   NA             
## NA.2    NA      NA   NA   NA    NA  NA   NA             
## NA.3    NA      NA   NA   NA    NA  NA   NA             
## NA.4    NA      NA   NA   NA    NA  NA   NA             
## NA.5    NA      NA   NA   NA    NA  NA   NA             
## NA.6    NA      NA   NA   NA    NA  NA   NA             
#you might be just concerned with the NA values of 1 certain column such as Solar.R, so you can refine the search as below
airquality[is.na(airquality$Solar.R),]
Output:
##    Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 5     NA      NA 14.3   56     5   5 2017  2017-5-5 2017-05-05
## 6     28      NA 14.9   66     5   6 2017  2017-5-6 2017-05-06
## 11     7      NA  6.9   74     5  11 2017 2017-5-11 2017-05-11
## 27    NA      NA  8.0   57     5  27 2017 2017-5-27 2017-05-27
## 96    78      NA  6.9   86     8   4 2017  2017-8-4 2017-08-04
## 97    35      NA  7.4   85     8   5 2017  2017-8-5 2017-08-05
## 98    66      NA  4.6   87     8   6 2017  2017-8-6 2017-08-06
#From the above methods, you can assign dummy values to the rows, but if you want to remove those observations, you can use na.omit(dataframe)

na.omit(airquality)
Output:
##     Ozone Solar.R Wind Temp Month Day Year      Date   DateType
## 1      41     190  7.4   67     5   1 2017  2017-5-1 2017-05-01
## 2      36     118  8.0   72     5   2 2017  2017-5-2 2017-05-02
## 3      12     149 12.6   74     5   3 2017  2017-5-3 2017-05-03
## 4      18     313 11.5   62     5   4 2017  2017-5-4 2017-05-04
## 7      23     299  8.6   65     5   7 2017  2017-5-7 2017-05-07
## 8      19      99 13.8   59     5   8 2017  2017-5-8 2017-05-08
## 9       8      19 20.1   61     5   9 2017  2017-5-9 2017-05-09
.
.
.
## 152    18     131  8.0   76     9  29 2017 2017-9-29 2017-09-29
## 153    20     223 11.5   68     9  30 2017 2017-9-30 2017-09-30

Changing data types of variables

Though there are other data types that can be used for analyses, most of the time raw data come in raw text or number (e.g. phone numbers, address).

But for this tutorial, we will just look at the example of formatting raw text into the ‘date’ class. Why bother to change it to date type? When you do time series analysis, you will realise R packages would require the column’s data class to be “Date”.

First, we create a column for a year.

airquality$Year <- 2017

Then, we create another column for a formatted version of the date, separated with ‘-‘

airquality$Date <- with(airquality,paste(Year,Month,Day,sep='-'))

Now the class of the Date Column in the data frame is “character”.

class(airquality$Date) 

Output:
## [1] "character"

Now we can use as.Date(dataframe, format) to change this raw text into a ‘Date’ class. As you type as.Date, you might realise the autocomplete shows you many other “as.xxxx” function, so you can actually explore different classes.

P.S. you can actually type the previous line into this line, without creating another column, but for the sake of understanding, it is separated.

airquality$DateType <- as.Date(airquality$Date,"%Y-%m-%d")

Check the class of the Date Column again.

class(airquality$DateType)

Output:
## [1] "Date"

Next: Tutorial 7 Data Manipulation (Continued)

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s