In this second practice, we will cover concepts from the following topics:
- R Tutorial 6: Data Manipulation
- R Tutorial 7: Data Manipulation (Continued)
- R Tutorial 8: Data Visualisation
Exercise 1 – Data Manipulation
In this exercise, load the Boston data set from the MASS library and answer the questions below about it.
The columns of the Boston data set have the following meanings:
- crim – per capita crime rate by town.
- zn – proportion of residential land zoned for lots over 25,000 sq.ft.
- indus – proportion of non-retail business acres per town.
- chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox – nitrogen oxides concentration (parts per 10 million).
- rm – average number of rooms per dwelling.
- age – proportion of owner-occupied units built prior to 1940.
- dis – weighted mean of distances to five Boston employment centres.
- rad – index of accessibility to radial highways.
- tax – full-value property-tax rate per \$10,000.
- ptratio – pupil-teacher ratio by town.
- black – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
- lstat – lower status of the population (percent).
- medv – median value of owner-occupied homes in \$1000s.
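As a starting point, the data set can be loaded and inspected roughly as follows. This is a minimal sketch, assuming the MASS package is installed (it ships with standard R distributions); names() and head() are just one way to check the columns against the list above.

```r
# Load the MASS package, which contains the Boston data set
library(MASS)

# Make the data frame available in the workspace
data(Boston)

# Check the column names against the descriptions above
names(Boston)

# Peek at the first few rows
head(Boston, 3)
```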
- Boston is a data frame. What are the dimensions of this data frame?
- What is the mean of the age column?
- Find all entries of age that are greater than the mean of the age column.
- The 10th to the 85th observations (rows) were recorded incorrectly, so they are to be excluded from the data set. What is the new mean of the age column?
- If the 85th to 100th observations are now recorded incorrectly, what is the median of the age column?
- What is the minimum value in the age column, and in which row does it occur?
- The column chas is a dummy variable: an entry of 1 means the tract bounds the river, and 0 means it does not. We are also concerned about the crime rate in the region, so we will look at the column crim as well. Isolate the observations that satisfy at least one of the following conditions (you will need to combine several “and” and “or” operators):
  - The tract bounds the river, and the crime rate is lower than the mean crime rate of all data.
  - The tract does not bound the river, and the crime rate is lower than the median crime rate of all data.
  - The crime rate is extremely low, i.e. lower than the first quartile of all data. You may use the quantile function to help you.
- What is the mean of medv over all observations that satisfy at least one of the three conditions?
For each of questions 1 to 7, process the data and store the answer in a variable named “Q1” to “Q7” respectively.
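The techniques these questions rely on can be sketched on a small made-up data frame. Everything below is an illustration under stated assumptions: the data frame toy and its columns flag, score and value are hypothetical and are not part of the Boston data, so the numbers are not answers to the questions.

```r
# Hypothetical toy data (NOT the Boston data set)
toy <- data.frame(
  flag  = c(1, 0, 0, 1, 0, 1),
  score = c(2, 8, 1, 9, 3, 4),
  value = c(10, 20, 30, 40, 50, 60)
)

# Logical subsetting: keep the entries above the column mean
above_mean <- toy$score[toy$score > mean(toy$score)]

# Negative indexing drops rows by position (here rows 2 and 3)
trimmed <- toy[-(2:3), ]
trimmed_mean <- mean(trimmed$score)

# which.min() returns the position of the smallest value
min_row <- which.min(toy$score)

# Combine conditions with & ("and") and | ("or");
# quantile(x, 0.25) gives the first quartile
keep <- (toy$flag == 1 & toy$score < mean(toy$score)) |
  (toy$flag == 0 & toy$score < median(toy$score)) |
  (toy$score < quantile(toy$score, 0.25))

# Rows that satisfy at least one condition, and a summary of them
selected <- toy[keep, ]
selected_mean <- mean(selected$value)
```

Wrapping each "and" condition in its own parentheses before joining them with | keeps the grouping explicit, which makes longer filters easier to check.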
Exercise 2 – Data Visualisation
This will be a simple graph: draw a scatter plot of house prices (medv) against the crime rate in the region (crim) to observe the relationship between the two variables.
Scatter plots are not covered in the tutorial, so this is a test of your ability to search for a solution online. (Hint: look for ggplot2 cheat sheets.)
Just make sure that the line of code creating the visualisation runs without errors.
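For orientation, the general shape of a ggplot2 scatter plot is sketched below on made-up data. This assumes the ggplot2 package is installed; the data frame toy and its columns x and y are hypothetical, so the column names and labels would need adapting for the exercise.

```r
library(ggplot2)

# Hypothetical toy data (NOT the Boston data set)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# aes() maps columns to axes; geom_point() draws one point per row
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  labs(x = "x (hypothetical)", y = "y (hypothetical)")

# Display the plot
print(p)
```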
– End of Assignment 2 –