diy – paulvanderlaken.com

Over the course of last week, I built a Python program that scrapes quotes from Goodreads.com in a tidy format. For instance, these are the first three results my program returns when scraping for the tag robot:

Quote	author	source	likes	tags
Goodbye, Hari, my love. Remember always–all you did for me.	Isaac Asimov	Forward the Foundation	33	[‘asimov’, ‘foundation’, ‘human’, ‘robot’]
Unfortunately this Electric Monk had developed a fault, and had started to believe all kinds of things, more or less at random. It was even beginning to believe things they’d have difficulty believing in Salt Lake City.	Douglas Adams	Dirk Gently’s Holistic Detective Agency	25	[‘belief’, ‘humor’, ‘mormonism’, ‘religion’, ‘robot’]
It’s hard to wipe your eyes when you have whirring buzzsaws for hands.	Daniel H. Wilson	How to Survive a Robot Uprising: Tips on Defending Yourself Against the Coming Rebellion	20	[‘buzzaw’, ‘robot’, ‘survive’, ‘uprising’]

The first three quotes on Goodreads.com tagged ‘robot’

“Paul, why the hell are you building a Python API for Goodreads quotes?” I hear you asking. Well, let me provide you with some context.

A while back, I created a twitter bot called ArtificialStupidity.

As it’s bio reads, ArtificialStupidity is a highly sentient AI intelligently matching quotes and comics through state-of-the-art robotics, sophisticated machine learning, and blockchain technology.

Basically, every 15 minutes, a Python script is triggered on my computer (soon on my Raspberry Pi 4). Each time it triggers, this script generates a random number to determine whether it should post something. If so, the script subsequently generates another random number to determine what is should post: a quote, a comic, or both. Behind the scenes, some other functions add hastags and — voila — a tweet is born!

(An upcoming post will elaborate on the inner workings of my ArtificialStupidity Python script)

More often than not, ArtificialStupidity produces some random, boring tweet:

"For every $20 you spend on web analytics tools, you should spend $80 on the brains to make sense of the data." – Jeff Sauer#xkcd #deeplearning #dataviz #data #analytics #webdev #ArtificialStupidity no.147 pic.twitter.com/Ink2TOhu9G
— ArtificialStupidity (@ArtificialStup5) November 30, 2019

However, every now and then, the bot actually manages to combine a quote with a comic in a way that gets some laughs:

"Aim for simplicity in Data Science. Real creativity won't make things more complex. Instead, it will simplify them." – Damian Duffy Mingle#datascience #data #science #rstats #statistics #ArtificialStupidity no.195 pic.twitter.com/BOgwsJeLRP
— ArtificialStupidity (@ArtificialStup5) December 17, 2019

Now, in order to compile these tweets, my computer hosts two databases. One containing data- and tech- related comics; the other a variety of inspirational quotes. Each time the ArtificialStupidity bot posts a tweet, it draws from one or both of these datasets randomly. With, on average, one post every couple hours, I thus need several hundreds of items in these databases in order to prevent repetition — which is definitely not entertaining.

Up until last week, I manually expanded these databases every week or so. Adding new comics and quotes as I encountered them online. However, this proved a tedious task. Particularly for the quotes, as I set up the database in a specific format (“quote” – author). In contrast, websites like Goodreads.com display their quotes in a different format (e.g., “quote” ― author, source \n tags \n likes). Apart from the different format, the apostrophes and long slash also cause UTF-8 issues in my Python script. Hence, weekly reformatting of quotes proved an annoying task.

Up until this week!

While reformatting some bias-related quotes, I decided I’d rather invest 10 times more time developing my Python skills, than mindlessly reformatting quotes for a minute longer. So I started coding.

I am proud to say that, some six hours later, I have compiled the script below.

I’ll walk you through it’s functions.

So first, I import the modules/packages I need. Note that you will probably first have to pip install package-name on your own computer!

argparse for the command-line interface arguments
re for the regular expressions to clean quotes
bs4 for its BeautifulSoup for scraping website content
urllib.request for opening urls
csv to save csv files
os for directory pathing

import argparse
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
import os

Next, I set up the argparse.ArgumentParser so that I can use my API using the command line. Now you can call the Python script using the command line (e.g., goodreads-scraper.py -t 'bias' -p 3 -q 80), and provide it with some arguments. No arguments are necessary. Most have sensible defaults. If you forget to provide a tag you will be prompted to provide one as the script runs (see later).

ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')

ap.add_argument("-t", "--tag",
                required=False, type=str, default=None,
                help="tag (topic/theme) of quotes to scrape")
ap.add_argument("-p", "--max_pages",
                required=False, type=int, default=10,
                help="maximum number of webpages to scrape")
ap.add_argument("-q", "--max_quotes",
                required=False, type=int, default=100,
                help="maximum number of quotes to scrape")

args = vars(ap.parse_args())

Now, the main function for this script is download_goodreads_quotes. This function contains many other functions within. You will see I set my functions up in a nested fashion, so that functions which are only used inside a certain scope, are instantiated there. In regular words, I create the functions where I use them.

First, download_goodreads_quotes creates download_quotes_from_page. In turn, download_quotes_from_page creates and calls compile_url — to create the url — get_soup — to download url contents — extract_quotes_elements_from_soup — to do just that — and extract_quote_dict. This latter function is the workhorse, as it takes each scraped quote element block of HTML and extracts the quote, author, source, and number of likes. It cleans each of these data points and returns them as a dictionary. In the end, download_quotes_from_page returns a list of dictionaries for every quote element block on a page.

Second, download_goodreads_quotes creates and calls download_all_pages which calls download_quotes_from_page for all pages up to max_pages, or up to the page that no longer returns quote data, or up to the number of max_quotes has been reached. All gathered quote dictionaries are added to a results list.

def download_goodreads_quotes(tag, max_pages=1, max_quotes=50):

    def download_quotes_from_page(tag, page):

        def compile_url(tag, page):
            return f'https://www.goodreads.com/quotes/tag/{tag}?page={page}'

        def get_soup(url):
            response = urlopen(Request(url))
            return BeautifulSoup(response, 'html.parser')

        def extract_quotes_elements_from_soup(soup):
            elements_quotes = soup.find_all("div", {"class": "quote mediumText"})
            return elements_quotes

        def extract_quote_dict(quote_element):

            def extract_quote(quote_element):
                try:
                    quote = quote_element.find('div', {'class': 'quoteText'}).get_text("|", strip=True)
                    # first element is always the quote
                    quote = quote.split('|')[0]
                    quote = re.sub('^“', '', quote)
                    quote = re.sub('”\s?$', '', quote)
                    return quote
                except:
                    return None

            def extract_author(quote_element):
                try:
                    author = quote_element.find('span', {'class': 'authorOrTitle'}).get_text()
                    author = author.strip()
                    author = author.rstrip(',')
                    return author
                except:
                    return None

            def extract_source(quote_element):
                try:
                    source = quote_element.find('a', {'class': 'authorOrTitle'}).get_text()
                    return source
                except:
                    return None

            def extract_tags(quote_element):
                try:
                    tags = quote_element.find('div', {'class': 'greyText smallText left'}).get_text(strip=True)
                    tags = re.sub('^tags:', '', tags)
                    tags = tags.split(',')
                    return tags
                except:
                    return None

            def extract_likes(quote_element):
                try:
                    likes = quote_element.find('a', {'class': 'smallText', 'title': 'View this quote'}).get_text(strip=True)
                    likes = re.sub('likes$', '', likes)
                    likes = likes.strip()
                    return int(likes)
                except:
                    return None

            quote_data = {'quote': extract_quote(quote_element),
                          'author': extract_author(quote_element),
                          'source': extract_source(quote_element),
                          'likes': extract_likes(quote_element),
                          'tags': extract_tags(quote_element)}

            return quote_data

        url = compile_url(tag, page)
        print(f'Retrieving {url}...')
        soup = get_soup(url)
        quote_elements = extract_quotes_elements_from_soup(soup)

        return [extract_quote_dict(e) for e in quote_elements]

    def download_all_pages(tag, max_pages, max_quotes):
        results = []
        p = 1
        while p <= max_pages:
            res = download_quotes_from_page(tag, p)
            if len(res) == 0:
                print(f'No results found on page {p}.\nTerminating search.')
                return results

            results = results + res

            if len(results) >= max_quotes:
                print(f'Hit quote maximum ({max_quotes}) on page {p}.\nDiscontinuing search.')
                return results[0:max_quotes]
            else:
                p += 1

        return results

    return download_all_pages(tag, max_pages, max_quotes)

Additionally, I use two functions to actually store the scraped quotes: recreate_quote turns a quote dictionary into a quote (I actually do not use the source and likes, but maybe others want to do so); save_quotes calls this recreate quote for the list of quote dictionaires it’s given, and stores them in a csv file in the current directory.

Update 2020/04/05: added UTF-8 encoding based on infoguild‘s comment.

def recreate_quote(dict):
    return f'"{dict.get("quote")}" - {dict.get("author")}'

def save_quotes(quote_data, tag):
    save_path = os.path.join(os.getcwd(), 'scraped' + '-' + tag + '.txt')
    print('saving file')
    with open(save_path, 'w', encoding='utf-8') as f:
        quotes = [recreate_quote(q) for q in quote_data]
        for q in quotes:
            f.write(q + '\n')

Finally, I need to call all these functions when the user runs this script via the command line. That’s what the following code does. If looks at the provided (default) arguments, and if no tag is provided, the user is prompted for one. Next Goodreads.com is scraped using the earlier specified download_goodreads_quotes function, and the results are saved to a csv file.

if __name__ == '__main__':
    tag = args['tag'] if args['tag'] != None else input('Provide tag to search quotes for: ')
    mp = args['max_pages']
    mq = args['max_quotes']
    result = download_goodreads_quotes(tag, max_pages=mp, max_quotes=mq)
    save_quotes(result, tag)

Use

If you paste these script pieces sequentially in a Python script / text file, and save this file as goodreads-scraper.py. You can then run this script using your command line, like so goodreads-scraper.py -t 'bias' -p 3 -q 80 where the text after -t is the tag you are searching for, -p is the number of pages you want to scrape, and -q is the maximum number of quotes you want the program to scrape.

Let me know what your favorite quote is once you get it running!

To-do

So this is definitely still work in progress. Some potential improvements I want to integrate come directly to mind:

Avoid errors for quotes including newlines, or
Write code to extract only the text of the quote, instead of the whole text of the quote element.
Build in concurrency using futures (but take care that quotes are still added the results sequentially. Maybe we can already download the soups of all pages, as this takes the longest.
Write a function to return a random quote
Write a function to return a random quote within a tag
Implement a lower limit for the number of likes of quotes
Refactor the download_all_pages bit.
Add comments and docstrings.

Feedback or tips?

I have been programming in R for quite a while now, but Python and software development in general are still new to me. This will probably be visible in the way I program, my syntax, the functions I use, or other things. Please provide any feedback you may have as I’d love to get better!