Project Series is an initiative by BACT AY18/19. It aims to excite students about learning data analytics tools through practical projects. The series is open to all students, regardless of prior programming knowledge.
Working for a hotel and want to determine which aspects of the hotel can be improved? Check this project guide out! Learn to scrape all reviews for a hotel from Trip Advisor and analyse the scraped data to glean meaningful insights.
- Learn the fundamentals of Python programming
- Scrape data from an online website using Selenium (Web Scraping Package in Python)
- Store scraped data in DataFrames using Pandas (Data manipulation package in Python)
- Categorize reviews in self-defined categories (e.g. Food, Location, Rooms, Services) using Natural Language Toolkit (NLTK) Package
- Derive sentiment score for reviews using Sentiment Intensity Analyzer from NLTK package
1. Python Fundamentals
Get Started with the basics of Python (Set Up, Basic Programming Knowledge, Functions, Libraries) with our own BACT Python Tutorial Series:
If you prefer to learn by watching videos, here is kjdElectronic’s Python Beginner Tutorial playlist (focus on Tutorials 1 to 8).
2. Pandas DataFrame
DataFrames are simply tables, with rows and named columns, that store data within Python. Click here for a concise tutorial on DataFrames by Datacamp.com (focus on Sections 1 to 9).
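To make the idea concrete, here is a minimal sketch of storing a few reviews in a DataFrame. The reviewer names, ratings and column names are made up for illustration; your real rows would come from the scraping step.

```python
import pandas as pd

# Hypothetical rows standing in for scraped reviews.
rows = [
    {"reviewer": "alice", "rating": 5, "review": "Great location and friendly staff."},
    {"reviewer": "bob", "rating": 2, "review": "The room was small and noisy."},
    {"reviewer": "carol", "rating": 4, "review": "Breakfast was tasty."},
]

# Each dictionary becomes one row; the keys become the column names.
reviews_df = pd.DataFrame(rows)

# Filter rows with a boolean condition: keep only ratings below 3.
low_rated = reviews_df[reviews_df["rating"] < 3]

# Add a derived column holding the character length of each review.
reviews_df["review_length"] = reviews_df["review"].str.len()

print(reviews_df.head())
```

Building the table from a list of dictionaries is convenient for scraping, because you can append one dictionary per review as you go and convert the list once at the end.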
3. Web Scraping with Beautiful Soup & Selenium
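BeautifulSoup & Selenium are 2 different Python libraries used for getting data out of web pages. BeautifulSoup parses HTML & XML files that you have already downloaded, while Selenium drives a real web browser, so it can also click buttons and load content that only appears after JavaScript runs. Check out the differences between them here.
Check out the following YouTube tutorials on scraping using BeautifulSoup & Selenium. You can experiment with both as each video is only 10 minutes long.
Learn to scrape multiple web pages from SAF Business Analytics’ video here.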
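As a rough sketch of multi-page scraping with Selenium (version 4 style calls), the function below collects review text from a page and then clicks a "next" link until it runs out of pages. The function names, the `max_pages` limit and both CSS selectors are assumptions for illustration: you will need to inspect the actual page to find selectors that match its reviews and its pagination link.

```python
def clean_review_text(raw):
    """Collapse runs of spaces and newlines into single spaces."""
    return " ".join(raw.split())


def scrape_reviews(url, review_selector, next_selector, max_pages=3):
    """Collect review text from up to max_pages pages starting at url."""
    # Imported here so clean_review_text can be used without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome is installed on your machine
    reviews = []
    try:
        driver.get(url)
        for _ in range(max_pages):
            # Grab every element on the current page that matches the selector.
            for element in driver.find_elements(By.CSS_SELECTOR, review_selector):
                reviews.append(clean_review_text(element.text))
            try:
                driver.find_element(By.CSS_SELECTOR, next_selector).click()
            except NoSuchElementException:
                break  # no "next" link found, so this was the last page
    finally:
        driver.quit()  # always close the browser, even if an error occurred
    return reviews
```

On a real site the next page may take a moment to load after the click, so you will probably need to add a wait before reading the reviews again. The returned list can be passed straight to Pandas to build a DataFrame.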
4. Sorting Reviews by Category
You will be using the NLTK package to break the reviews down into individual words (also known as tokens). You will also be using for loops, lists & dictionaries for this part of the project.
Learn how to break sentences into tokens from this pythonspot.com tutorial:
Learn how to sort reviews into your own custom defined categories from a tutorial by pythonprogramming.net:
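Independently of those tutorials, here is a minimal sketch of the overall idea: tokenize each review, then use a dictionary of keywords and a for loop to decide which categories the review belongs to. The keyword lists are hypothetical, and a simple regular expression stands in for NLTK's `word_tokenize` so that the sketch runs without any downloads; you can swap the NLTK tokenizer in once you have it set up.

```python
import re

# Hypothetical keyword sets; extend them to suit your own categories.
CATEGORY_KEYWORDS = {
    "Food": {"breakfast", "food", "restaurant", "buffet"},
    "Location": {"location", "station", "airport", "nearby"},
    "Rooms": {"room", "bed", "bathroom", "shower"},
    "Services": {"staff", "service", "reception", "housekeeping"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def categorize(review):
    """Return every category whose keywords appear among the review's tokens."""
    tokens = set(tokenize(review))
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if tokens & keywords  # non-empty intersection means a keyword matched
    ]


reviews = [
    "The breakfast was great and the staff were friendly.",
    "Our room was close to the station.",
]

# Group the reviews under each category they mention.
reviews_by_category = {category: [] for category in CATEGORY_KEYWORDS}
for review in reviews:
    for category in categorize(review):
        reviews_by_category[category].append(review)

print(reviews_by_category)
```

A review can land in more than one category, which suits hotel reviews since guests often comment on several aspects at once. This sketch matches exact words only, so "rooms" would not match "room" unless you add it to the keyword set or reduce words to their stems first.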
5. Deriving Sentiment Score for Reviews
Learn how to derive sentiment scores using the VADER Sentiment Intensity Analyzer within NLTK. It scores a piece of text as positive, negative or neutral using a pre-defined list of words, each rated for how positive or negative it is.
Here’s a tutorial by learndatasci.com on deriving sentiment scores for a given text. You can skip the part on extracting data from Reddit through its API. Start from the section titled “Labeling our Data”.
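As a minimal sketch of how this might look in code, the example below scores each review with NLTK's `SentimentIntensityAnalyzer` and turns the compound score into a label. The function names are made up for illustration, and the 0.05 cutoff is a commonly used convention rather than a fixed rule, so treat it as an adjustable assumption.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (between -1 and 1) to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_reviews(reviews):
    """Return (review, compound score, label) for each review."""
    # Imported here so label_from_compound can be used without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # Fetch the word list VADER relies on (skipped if already up to date).
    nltk.download("vader_lexicon", quiet=True)

    analyzer = SentimentIntensityAnalyzer()
    results = []
    for review in reviews:
        # polarity_scores returns a dict with keys neg, neu, pos and compound.
        scores = analyzer.polarity_scores(review)
        compound = scores["compound"]
        results.append((review, compound, label_from_compound(compound)))
    return results
```

You could call `score_reviews` on the list of reviews in each category to compare how guests feel about food, location, rooms and services.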
Source Codes for Project
Click here to access the source code for this project if you are stuck and not getting anywhere.
Facing Difficulties & need help?
We will be more than happy to help you! Drop us a message on our Facebook page, https://www.facebook.com/nusbact/