TDM 20200: Project 03 — 2024

Motivation: Web scraping is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. In general, scraping data from websites has always been a popular topic in The Data Mine. We will continue to use the website of "books.toscrape.com" to practice scraping skills

Context: In the previous projects we gently introduced XML and XPath to parse a XML document, also introduced some basic web scraping skills using ""BeautifulSoup". In this project, we will learn some basic skills to scrap with another library "selenium"

Scope: Python, web scraping, BeautifulSoup, selenium

Learning Objectives
  • Understand webpage structures

  • Use selenium to scrape data from web pages

Readings and Resources

Questions

In this project, we will re-create the results from Project 2, BUT in this project, we will use Selenium instead of Beautiful Soup.

Question 1 (2 points)

  1. Please use selenium to get and display the website’s HTML source code https://books.toscrape.com

  2. Review the website’s HTML source code. What is the title for that webpage?

  • You may import the following libraries/ modules. For this project, we assume that you are using the Firefox browser. (If you are not using Firefox, please download and use Firefox on this project.)

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.service import Service
    from webdriver_manager.firefox import GeckoDriverManager
    from selenium.webdriver.firefox.options import Options
  • You may refer to the following options, to configure a Firefox instance, to run efficiently.

    firefox_options = Options()
    firefox_options.add_argument("--headless")
    firefox_options.add_argument("--disable-extensions")
    firefox_options.add_argument("--no-sandbox")
    firefox_options.add_argument("--disable-dev-shm-usage")
  • Use this code to create a selenium driver

    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()),options=firefox_options)

Question 2 (2 points)

  1. Please use the selenium library to get and display all categories' names from the homepage of the website.

  • Similar to project 2, you will need to explore the webpage source code, to find the information that you need.

  • You may use find_elements() with By.CSS_SELECTOR, to explore the page source, the category information located under ".nav-list>li>ul>li>a", like (for instance) the following sample code, or you may also use other ways to get information. Selenium gives you several ways to find the content that you need.

    driver.find_elements(By.CSS_SELECTOR,".nav-list>li>ul>li>a" )
  • You may use a for loop to iterate over all of the categories, and print only the category names.

Question 3 (2 points)

  1. Now, instead of only getting the names of the categories, get all of the category links from the homepage as well.

  2. Update the code from question 3a to get (only) the links for books with the category "Romance".

  • The output will be like:

romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html
  • Continuing the work from Question 2, get additional information about the links, including (of course) the . ".get_attribute('href')" method may be useful

  • Use a "if" statement to check whether the category name "Romance", and if so, then get the "Romance" category link

Question 4 (2 points)

  1. Starting from the homepage of Romance category "https://books.toscrape.com/catalogue/category/books/romance_8/index.html", please get the titles of all of the books from the "Romance" category’s first webpage.

  2. Find the next pagination link from the "Romance" category of the first webpage. Next, get all of the book titles, from the second page of the "Romance" category.

  • For next page link, look for an html element with a class usually like 'pager', 'pagination' or similar for a pagination, in the Romance webpage. It will be <div> <ul class="pager"> …​ <li class="next"><a href="page-2.html">next</a></li> …​ </ul> </div>

You will need to extract the "href" attribute of a tag

  • Selenium has a "click()" method, to allow you to dynamically click a link in a webpage.

Question 5 (2 points)

  1. In project 2 and Project 3 we used two different libraries, "BeautifulSoup" and "Selenium", to accomplish the very similar tasks. Please briefly outline the similarities and differences between these two libraries (BeautifulSoup versus Selenium).

Project 03 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project03.ipynb

  • Python file with code and comments for the assignment

    • firstname-lastname-project03.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.