Topics covered in this Tutorial:
- Inspecting a Web Page
- Scraping Data from a Web Page
Inspecting a Web Page
Let’s take a look at the Car Listings on Carousell: https://carousell.com/categories/cars-32/cars-for-sale-1173/
1) To view the HTML code of a part of the website, Right Click on the first listing and select “Inspect”
2) Try to move your cursor over the higher level tags and find the line which covers the entire listing
3) Now you can observe that the line which encompasses one listing is:
div class="col-lg-3 col-md-4 col-sm-4 col-xs-6"
Tag Name: div
Class Name: col-lg-3 col-md-4 col-sm-4 col-xs-6
4) Find the exact location where the information we need (Car Name, Price, Reg Year, Depreciation) can be found. Right-click on the info and select “Inspect”.
The listing name, price, registration year and depreciation can be found in this block of code:
Tag Name: figcaption
Class Name: A-N
It is important to take note of the Tag Name & Class Name as we will use them to identify which parts of the page to scrape. Now that we have inspected the page and know what to look for we are all set to scrape data!
Scraping Data
1) Loop through each Car listing on the Website
Sub ScrapeCarDataPt1() | |
Dim IE As InternetExplorer | |
Set IE = CreateObject("InternetExplorer.Application") | |
IE.Visible = True | |
IE.Navigate "https://carousell.com/categories/cars-32/cars-for-sale-1173/" | |
Do While IE.ReadyState <> READYSTATE_COMPLETE: Loop | |
Dim CarListings As IHTMLElementCollection | |
Dim CarListing As IHTMLElement | |
Set CarListings = IE.Document.body.getElementsByClassName("col-lg-3 col-md-4 col-sm-4 col-xs-6") | |
For Each CarListing In CarListings | |
Debug.Print CarListing.innerText | |
Next CarListing | |
IE.Quit | |
End Sub |
The first 9 lines are from Section 1 of this Tutorial. Let’s explore the remaining lines.
Dim CarListings As IHTMLElementCollection
This line introduces CarListings as an object that can hold multiple HTML elements (all the car listings on the website).
Dim CarListing As IHTMLElement
This line introduces CarListing as an object that holds just 1 HTML element.
Set CarListings = IE.Document.body.getElementsByClassName("col-lg-3 col-md-4 col-sm-4 col-xs-6")
We observed that each listing is found under the class name “col-lg-3 col-md-4 col-sm-4 col-xs-6”. Therefore we assign all the elements (all listings) with this class name to CarListings. CarListings now has all 40 cars assigned to it.
For Each CarListing In CarListings Debug.Print CarListing.innerText Next CarListing
With this chunk of code, we are looping through each listing in the collection and printing out the innerText of each listing.
Try this out! Change line 15 in the code to "Debug.Print CarListing.InnerHTML" Compare the results for .innerHTML with .innerText .innerHTML will be print the HTML tags you see in the Inspect Window.
2) Get the Car Name from each Car Listing
Now that we have gotten Text from all the Car listings, we need to filter out just the information we need.
The Listing Names are found under Class=”A-i” under each listing. The remaining information are found under tag <dl>. Let’s filter out the Listing Names first.
Sub ScrapeCarDataPt2() | |
Dim IE As InternetExplorer | |
Set IE = CreateObject("InternetExplorer.Application") | |
IE.Visible = True | |
IE.Navigate "https://carousell.com/categories/cars-32/cars-for-sale-1173/" | |
Do While IE.ReadyState <> READYSTATE_COMPLETE: Loop | |
Dim CarListings As IHTMLElementCollection | |
Dim CarListing As IHTMLElement | |
Dim ItemName As IHTMLElementCollection | |
Set CarListings = IE.Document.body.getElementsByClassName("col-lg-3 col-md-4 col-sm-4 col-xs-6") | |
For Each CarListing In CarListings | |
Set ItemName = CarListing.getElementsByClassName("A-i") | |
Debug.Print ItemName(0).innerText | |
Next CarListing | |
IE.Quit | |
End Sub |
Dim ItemName As IHTMLElementCollection
This line introduces ItemName as an object that will hold a collection of elements.
Set ItemName = CarListing.getElementsByClassName("A-i") Debug.Print ItemName(0).innerText
Notice that instead of IE.Document.Body.getElementsByClassName we used CarListing.getElementsbyClassName to find out the Item Name. This is because the Item Name we want to find is already inside CarListing and we don’t need to search the entire Document again.
We use ItemName(0).innerText to print out the Car Name. ItemName(0) refers to the 1st element in the ItemName collection. As we know that there is only 1 Car Name per listing, we use ItemName(0). ItemName(2) refers to the 3rd element in the ItemName collection.
3) Get the Car Price, Car Registration Year and Depreciation from each Car Listing
Sub ScrapeCarDataPt3() | |
Dim IE As InternetExplorer | |
Set IE = CreateObject("InternetExplorer.Application") | |
IE.Visible = True | |
IE.Navigate "https://carousell.com/categories/cars-32/cars-for-sale-1173/" | |
Do While IE.ReadyState <> READYSTATE_COMPLETE: Loop | |
Dim CarListings As IHTMLElementCollection | |
Dim CarListing As IHTMLElement | |
Dim ItemName As IHTMLElementCollection | |
Dim ItemDetails As IHTMLElementCollection | |
Dim ItemDetail As IHTMLElement | |
Set CarListings = IE.Document.body.getElementsByClassName("col-lg-3 col-md-4 col-sm-4 col-xs-6") | |
For Each CarListing In CarListings | |
Set ItemName = CarListing.getElementsByClassName("A-i") | |
Debug.Print ItemName(0).innerText | |
Set ItemDetails = CarListing.getElementsByTagName("dd") | |
For Each ItemDetail In ItemDetails | |
Debug.Print ItemDetail.innerText | |
Next ItemDetail | |
Next CarListing | |
IE.Quit | |
End Sub |
Dim ItemDetails As IHTMLElementCollection Dim ItemDetail As IHTMLElement
Similar to CarListings and CarListing, we introduce ItemDetails to hold a collection of elements (Price, Reg Year & Depreciation) and ItemDetail to hold just one element.
Set ItemDetails = CarListing.getElementsByTagName("dd")
Notice that the Item Details (boxed up in red) are in “<dd>” tags and do not have Class Names. Hence we use .get ElementsbyTagName(“dd”) instead.
For Each ItemDetail In ItemDetails Debug.Print ItemDetail.innerText Next ItemDetail
Similar to CarListing in CarListings, we print each Item Detail in the immediate window.
Your output should look like this:
4) Instead of Debug.Print, write the values to Excel Workbook
Sub ScrapeCarDataPt4() | |
Dim IE As InternetExplorer | |
Set IE = CreateObject("InternetExplorer.Application") | |
IE.Visible = True | |
IE.Navigate "https://carousell.com/categories/cars-32/cars-for-sale-1173/" | |
Do While IE.ReadyState <> READYSTATE_COMPLETE: Loop | |
Dim CarListings As IHTMLElementCollection | |
Dim CarListing As IHTMLElement | |
Dim ItemName As IHTMLElementCollection | |
Dim ItemDetails As IHTMLElementCollection | |
Dim ItemDetail As IHTMLElement | |
Dim wb As Workbook | |
Dim ws As Worksheet | |
Set wb = ThisWorkbook | |
Set ws = wb.Sheets("Sheet1") | |
Dim rowCnt, colmnCnt As Integer | |
rowCnt = 1 | |
Set CarListings = IE.Document.body.getElementsByClassName("col-lg-3 col-md-4 col-sm-4 col-xs-6") | |
For Each CarListing In CarListings | |
Set ItemName = CarListing.getElementsByClassName("r-k") | |
ws.Cells(rowCnt, 1).Value = ItemName(0).innerText | |
colmnCnt = 2 | |
Set ItemDetails = CarListing.getElementsByTagName("dd") | |
For Each ItemDetail In ItemDetails | |
ws.Cells(rowCnt, colmnCnt).Value = ItemDetail.innerText | |
colmnCnt = colmnCnt + 1 | |
Next ItemDetail | |
rowCnt = rowCnt + 1 | |
Next CarListing | |
IE.Quit | |
End Sub |
Writing values to cells was covered in Tutorial 2. Transferring data from the immediate window to the Excel file is extremely important. Not everyone knows Excel VBA and thus the data must be displayed on the Excel surface.
Want simple, easy-to-follow & streamlined tutorials for any other projects in VBA, Python or R? Leave us a comment below!