Scraping & Analysing Tripadvisor Reviews

Project Series is an initiative by BACT AY18/19. It aims to excite students about learning Data Analytics tools through practical projects. The Series is open to all students, regardless of prior programming knowledge.

Project Description

Working for a hotel and want to find out which aspects of the hotel can be improved? Check this project guide out! Learn to scrape all reviews for a hotel from Tripadvisor and analyse the scraped data to glean meaningful insights.

Desired Output

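1. Scrape all reviews for a hotel (screenshot: Output_ScrapeReviews)
2. Categorize the reviews (screenshot: Output_CategorizeReviews)
3. Derive a sentiment score to differentiate between positive & negative comments (screenshot: Output_SentimentScore)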

Learning Objectives

  1. Learn the fundamentals of Python programming
  2. Scrape data from a website using Selenium (browser automation package in Python, commonly used for web scraping)
  3. Store scraped data in DataFrames using Pandas (Data manipulation package in Python)
  4. Categorize reviews in self-defined categories (e.g. Food, Location, Rooms, Services) using Natural Language Toolkit (NLTK) Package
  5. Derive sentiment score for reviews using Sentiment Intensity Analyzer from NLTK package

Execution

1. Python Fundamentals

Get Started with the basics of Python (Set Up, Basic Programming Knowledge, Functions, Libraries) with our own BACT Python Tutorial Series:
https://nusbact.com/tutorial-series/python-tutorial-series/

If you prefer to learn by watching videos, here is kjdElectronic’s Python Beginner Tutorial playlist (focus on Tutorials 1 to 8).

2. Pandas DataFrame

DataFrames are simply tables, made up of rows and named columns, that store data within Python. Datacamp.com has a concise tutorial on DataFrames (focus on Sections 1 to 9).
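To make the idea concrete, here is a minimal sketch of storing reviews in a DataFrame. The review records, column names and the rating threshold are made up for illustration; in the project they would come from your scraper.

```python
import pandas as pd

# Hypothetical scraped records; real ones would come from the scraping step.
reviews = [
    {"title": "Great stay", "rating": 5, "text": "Lovely room and friendly staff."},
    {"title": "Disappointing", "rating": 2, "text": "The breakfast was cold."},
    {"title": "Good location", "rating": 4, "text": "Close to the train station."},
]

# A list of dictionaries becomes one row per review and one column per key.
df = pd.DataFrame(reviews)

# Boolean filtering keeps only the rows that satisfy the condition.
low_rated = df[df["rating"] <= 2]

# Column-wise aggregates work directly on a column.
average_rating = df["rating"].mean()

print(df)
print("Low-rated reviews:", len(low_rated))
print("Average rating:", round(average_rating, 2))
```

Once the data is in a DataFrame, a call such as `df.to_csv("reviews.csv", index=False)` saves it to a file so you do not need to scrape again.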

3. Web Scraping with Beautiful Soup & Selenium

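BeautifulSoup & Selenium are 2 different Python libraries commonly used for web scraping. BeautifulSoup parses HTML & XML files that you have already downloaded and pulls data out of them. Selenium controls a real web browser, so it can also click buttons and read content that a page loads dynamically with JavaScript.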

Check out the following YouTube tutorials on scraping using BeautifulSoup & Selenium. You can experiment with both, as each video is only 10 minutes long.

Get Set Python YouTube Channel:
Web Scraping – Introduction
Dynamic Web Scraping – 1: Covering BeautifulSoup
Dynamic Web Scraping – 2: Covering Selenium

Learn to scrape multiple web pages from SAF Business Analytics’ video.
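Scraping multiple pages with Selenium usually boils down to a loop: read the reviews on the current page, click "next", repeat. The sketch below shows that loop under some assumptions: the function names `paginate` and `scrape_reviews`, the CSS selectors and the page limit are all hypothetical, and you would need to inspect the real page to find selectors that work. It also assumes Chrome is installed.

```python
def paginate(read_page, go_next, max_pages):
    """Collect items page by page until go_next() fails or max_pages is reached."""
    rows = []
    for page in range(max_pages):
        rows.extend(read_page())
        # Stop on the last allowed page, or when there is no next page.
        if page == max_pages - 1 or not go_next():
            break
    return rows


def scrape_reviews(url, review_css, next_css, max_pages=3):
    """Open url in Chrome and return review texts from up to max_pages pages."""
    # Imported here so that paginate() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear
    try:
        driver.get(url)

        def read_page():
            elements = driver.find_elements(By.CSS_SELECTOR, review_css)
            return [element.text for element in elements]

        def go_next():
            try:
                driver.find_element(By.CSS_SELECTOR, next_css).click()
                return True
            except NoSuchElementException:
                return False  # no "next" button means this was the last page

        # A real scraper may also need an explicit wait after each click.
        return paginate(read_page, go_next, max_pages)
    finally:
        driver.quit()  # always close the browser, even if something fails
```

A call might look like `scrape_reviews("https://example.com/hotel-reviews", "div.review-text", "a.next")`, where the URL and both selectors are placeholders. Keeping the loop in `paginate` separate from the browser code lets you check the paging logic with simple stand-in functions before opening a browser.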

4. Sorting Reviews by Category 

You will be using the NLTK package to break the reviews down into individual words (also known as tokens). You will also be using for loops, lists & dictionaries for this part of the project.

Learn how to break sentences into tokens from this pythonspot.com tutorial:
https://pythonspot.com/tokenizing-words-and-sentences-with-nltk/

Learn how to sort reviews into your own custom defined categories from a tutorial by pythonprogramming.net:
https://pythonprogramming.net/text-classification-nltk-tutorial/
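As a starting point, the sketch below sorts reviews with simple keyword matching, using a dictionary of categories and a for loop. This is a deliberately simpler approach than the classifier in the tutorial above. The keyword lists and the function name `categorize` are assumptions for illustration. It uses NLTK's `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data downloads, and falls back to an equivalent regular expression if NLTK is not installed.

```python
try:
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback with the same pattern, so the sketch still runs without NLTK.
    import re

    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Hypothetical keyword lists; extend them after reading a sample of real reviews.
CATEGORY_KEYWORDS = {
    "Food": {"breakfast", "buffet", "restaurant", "food"},
    "Location": {"location", "station", "airport", "walk"},
    "Rooms": {"room", "bed", "bathroom", "clean"},
    "Services": {"staff", "service", "reception", "check"},
}


def categorize(review):
    """Return every category whose keywords appear among the review's tokens."""
    tokens = {token.lower() for token in wordpunct_tokenize(review)}
    matched = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:  # at least one keyword is present
            matched.append(category)
    return matched


print(categorize("The breakfast buffet was great but the room was small."))
```

A review can land in several categories at once, which suits hotel reviews that mention more than one aspect. Exact keyword matching misses plurals such as "rooms", so you may want to add variants to the lists or normalise the tokens first.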

5. Deriving Sentiment Score for Reviews 

Learn how to derive sentiment scores using the VADER Sentiment Analyzer within NLTK. It rates a piece of text as positive, negative or neutral using a pre-defined lexicon of words that are each scored as positive or negative.

Here’s a tutorial by learndatasci.com on deriving sentiment scores for a given text. You can skip the part on extracting data from Reddit through its API. Start from the section titled “Labeling our Data”.
https://www.learndatasci.com/tutorials/sentiment-analysis-reddit-headlines-pythons-nltk/
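The sketch below shows how the scoring step might look for hotel reviews. The function names `label_from_compound` and `score_reviews` are made up for illustration, and the cut-off of 0.05 on VADER's compound score is a commonly used default that you may want to tune. It is not necessarily the threshold used in the tutorial above.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn VADER's compound score (between -1 and 1) into a label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_reviews(reviews):
    """Return a (text, compound score, label) tuple for each review."""
    # Imported here so label_from_compound() can be tried without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # The VADER lexicon is downloaded once and cached; this needs internet access.
    nltk.download("vader_lexicon", quiet=True)
    analyzer = SentimentIntensityAnalyzer()

    results = []
    for text in reviews:
        # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'.
        compound = analyzer.polarity_scores(text)["compound"]
        results.append((text, compound, label_from_compound(compound)))
    return results
```

Calling `score_reviews(["The staff were wonderful.", "The room was dirty."])` returns one tuple per review, which you can load into a DataFrame to sort or filter reviews by sentiment.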

Source Codes for Project

Click here to access the source codes for this project if you are stuck.

Facing Difficulties & need help?

We will be more than happy to help you! Drop us a message on our Facebook page, https://www.facebook.com/nusbact/