Category: application

Python Web Scraping: Quotes from Goodreads.com


Over the course of last week, I built a Python program that scrapes quotes from Goodreads.com and returns them in a tidy format. For instance, these are the first three results my program returns when scraping for the tag robot:

Quote | Author | Source | Likes | Tags
"Goodbye, Hari, my love. Remember always–all you did for me." | Isaac Asimov | Forward the Foundation | 33 | ['asimov', 'foundation', 'human', 'robot']
"Unfortunately this Electric Monk had developed a fault, and had started to believe all kinds of things, more or less at random. It was even beginning to believe things they'd have difficulty believing in Salt Lake City." | Douglas Adams | Dirk Gently's Holistic Detective Agency | 25 | ['belief', 'humor', 'mormonism', 'religion', 'robot']
"It's hard to wipe your eyes when you have whirring buzzsaws for hands." | Daniel H. Wilson | How to Survive a Robot Uprising: Tips on Defending Yourself Against the Coming Rebellion | 20 | ['buzzaw', 'robot', 'survive', 'uprising']
The first three quotes on Goodreads.com tagged ‘robot’

“Paul, why the hell are you building a Python API for Goodreads quotes?” I hear you asking. Well, let me provide you with some context.


A while back, I created a Twitter bot called ArtificialStupidity.

As its bio reads, ArtificialStupidity is a highly sentient AI intelligently matching quotes and comics through state-of-the-art robotics, sophisticated machine learning, and blockchain technology.

Basically, every 15 minutes, a Python script is triggered on my computer (soon on my Raspberry Pi 4). Each time it triggers, this script generates a random number to determine whether it should post something. If so, the script generates another random number to determine what it should post: a quote, a comic, or both. Behind the scenes, some other functions add hashtags and, voilà, a tweet is born!
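
In pseudo-Python, that posting logic boils down to something like the sketch below. The helpers (load_quotes, load_comics, add_hashtags, post_tweet) are hypothetical stand-ins for the real functions, and the 20% posting probability is made up; the upcoming post covers the actual script.

import random

def maybe_tweet(post_probability=0.2):    # the 20% chance is a made-up number
    # First random number: most runs, the bot stays silent
    if random.random() > post_probability:
        return

    # Second random number: decide what to post
    content_type = random.choice(['quote', 'comic', 'both'])

    parts = []
    if content_type in ('quote', 'both'):
        parts.append(random.choice(load_quotes()))    # hypothetical helper reading the quote database
    if content_type in ('comic', 'both'):
        parts.append(random.choice(load_comics()))    # hypothetical helper reading the comic database

    status = add_hashtags(' '.join(parts))            # hypothetical helper that appends hashtags
    post_tweet(status)                                # hypothetical helper wrapping the Twitter API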

(An upcoming post will elaborate on the inner workings of my ArtificialStupidity Python script)

More often than not, ArtificialStupidity produces some random, boring tweet:

However, every now and then, the bot actually manages to combine a quote with a comic in a way that gets some laughs:

Now, in order to compile these tweets, my computer hosts two databases: one containing data- and tech-related comics, the other a variety of inspirational quotes. Each time the ArtificialStupidity bot posts a tweet, it draws randomly from one or both of these datasets. With, on average, one post every couple of hours, I need several hundred items in these databases to prevent repetition, which is definitely not entertaining.

Up until last week, I manually expanded these databases every week or so, adding new comics and quotes as I encountered them online. However, this proved a tedious task, particularly for the quotes, as I had set up the database in a specific format (“quote” – author). In contrast, websites like Goodreads.com display their quotes in a different format (e.g., “quote” ― author, source \n tags \n likes). Apart from the different format, the apostrophes and long dash also cause UTF-8 issues in my Python script. Hence, reformatting quotes every week proved an annoying task.

Up until this week!

While reformatting some bias-related quotes, I decided I'd rather invest ten times as much time in developing my Python skills than mindlessly reformat quotes for a minute longer. So I started coding.

I am proud to say that, some six hours later, I have compiled the script below.

I'll walk you through its functions.

So first, I import the modules/packages I need. Note that you will probably first have to pip install package-name on your own computer!

  • argparse for the command-line interface arguments
  • re for the regular expressions to clean quotes
  • bs4 for its BeautifulSoup, to parse the scraped website content
  • urllib.request for opening URLs
  • csv to save csv files (the final script below actually writes a plain text file, so this import is optional)
  • os for directory pathing
import argparse
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
import os

Next, I set up the argparse.ArgumentParser so that the script can be called from the command line (e.g., goodreads-scraper.py -t 'bias' -p 3 -q 80) and provided with arguments. No arguments are required: most have sensible defaults, and if you forget to provide a tag, you will be prompted for one when the script runs (see later).

ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')

ap.add_argument("-t", "--tag",
                required=False, type=str, default=None,
                help="tag (topic/theme) of quotes to scrape")
ap.add_argument("-p", "--max_pages",
                required=False, type=int, default=10,
                help="maximum number of webpages to scrape")
ap.add_argument("-q", "--max_quotes",
                required=False, type=int, default=100,
                help="maximum number of quotes to scrape")

args = vars(ap.parse_args())

Now, the main function of this script is download_goodreads_quotes. This function contains many other functions within it. You will see that I set my functions up in a nested fashion, so that functions which are only used inside a certain scope are defined there. In plain words: I create the functions where I use them.

First, download_goodreads_quotes defines download_quotes_from_page. In turn, download_quotes_from_page defines and calls compile_url (to create the URL), get_soup (to download the URL's contents), extract_quotes_elements_from_soup (to do just that), and extract_quote_dict. This latter function is the workhorse: it takes each scraped quote element block of HTML and extracts the quote, author, source, tags, and number of likes. It cleans each of these data points and returns them as a dictionary. In the end, download_quotes_from_page returns a list of dictionaries, one for every quote element block on a page.

Second, download_goodreads_quotes defines and calls download_all_pages, which calls download_quotes_from_page for all pages up to max_pages, until a page no longer returns quote data, or until max_quotes has been reached. All gathered quote dictionaries are added to a results list.

def download_goodreads_quotes(tag, max_pages=1, max_quotes=50):

    def download_quotes_from_page(tag, page):

        def compile_url(tag, page):
            return f'https://www.goodreads.com/quotes/tag/{tag}?page={page}'

        def get_soup(url):
            response = urlopen(Request(url))
            return BeautifulSoup(response, 'html.parser')

        def extract_quotes_elements_from_soup(soup):
            elements_quotes = soup.find_all("div", {"class": "quote mediumText"})
            return elements_quotes

        def extract_quote_dict(quote_element):

            def extract_quote(quote_element):
                try:
                    quote = quote_element.find('div', {'class': 'quoteText'}).get_text("|", strip=True)
                    # first element is always the quote
                    quote = quote.split('|')[0]
                    quote = re.sub('^“', '', quote)
                    quote = re.sub(r'”\s?$', '', quote)
                    return quote
                except:
                    return None

            def extract_author(quote_element):
                try:
                    author = quote_element.find('span', {'class': 'authorOrTitle'}).get_text()
                    author = author.strip()
                    author = author.rstrip(',')
                    return author
                except:
                    return None

            def extract_source(quote_element):
                try:
                    source = quote_element.find('a', {'class': 'authorOrTitle'}).get_text()
                    return source
                except:
                    return None

            def extract_tags(quote_element):
                try:
                    tags = quote_element.find('div', {'class': 'greyText smallText left'}).get_text(strip=True)
                    tags = re.sub('^tags:', '', tags)
                    tags = tags.split(',')
                    return tags
                except:
                    return None

            def extract_likes(quote_element):
                try:
                    likes = quote_element.find('a', {'class': 'smallText', 'title': 'View this quote'}).get_text(strip=True)
                    likes = re.sub('likes$', '', likes)
                    likes = likes.strip()
                    return int(likes)
                except:
                    return None

            quote_data = {'quote': extract_quote(quote_element),
                          'author': extract_author(quote_element),
                          'source': extract_source(quote_element),
                          'likes': extract_likes(quote_element),
                          'tags': extract_tags(quote_element)}

            return quote_data

        url = compile_url(tag, page)
        print(f'Retrieving {url}...')
        soup = get_soup(url)
        quote_elements = extract_quotes_elements_from_soup(soup)

        return [extract_quote_dict(e) for e in quote_elements]

    def download_all_pages(tag, max_pages, max_quotes):
        results = []
        p = 1
        while p <= max_pages:
            res = download_quotes_from_page(tag, p)
            if len(res) == 0:
                print(f'No results found on page {p}.\nTerminating search.')
                return results

            results = results + res

            if len(results) >= max_quotes:
                print(f'Hit quote maximum ({max_quotes}) on page {p}.\nDiscontinuing search.')
                return results[0:max_quotes]
            else:
                p += 1

        return results

    return download_all_pages(tag, max_pages, max_quotes)

Additionally, I use two functions to actually store the scraped quotes: recreate_quote turns a quote dictionary back into a quote (I do not actually use the source and likes, but maybe others want to); save_quotes calls this recreate_quote for the list of quote dictionaries it is given, and stores them in a text file in the current directory.

Update 2020/04/05: added UTF-8 encoding based on infoguild's comment.

def recreate_quote(quote_dict):
    return f'"{quote_dict.get("quote")}" - {quote_dict.get("author")}'

def save_quotes(quote_data, tag):
    save_path = os.path.join(os.getcwd(), 'scraped' + '-' + tag + '.txt')
    print('saving file')
    with open(save_path, 'w', encoding='utf-8') as f:
        quotes = [recreate_quote(q) for q in quote_data]
        for q in quotes:
            f.write(q + '\n')

Finally, I need to call all these functions when the user runs this script from the command line. That's what the following code does: it looks at the provided (default) arguments, and if no tag is provided, the user is prompted for one. Next, Goodreads.com is scraped using the download_goodreads_quotes function specified earlier, and the results are saved to a text file.

if __name__ == '__main__':
    tag = args['tag'] if args['tag'] is not None else input('Provide tag to search quotes for: ')
    mp = args['max_pages']
    mq = args['max_quotes']
    result = download_goodreads_quotes(tag, max_pages=mp, max_quotes=mq)
    save_quotes(result, tag)

Use

If you paste these script pieces sequentially into a Python script / text file and save that file as goodreads-scraper.py, you can then run the script from your command line, like so: goodreads-scraper.py -t 'bias' -p 3 -q 80. Here, the text after -t is the tag you are searching for, -p is the maximum number of pages you want to scrape, and -q is the maximum number of quotes you want the program to scrape.
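
For the robot example at the top of this post, the first line of the resulting scraped-robot.txt file would read:

"Goodbye, Hari, my love. Remember always–all you did for me." - Isaac Asimov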

Let me know what your favorite quote is once you get it running!

To-do

So this is definitely still a work in progress. Some potential improvements I want to integrate come directly to mind:

  • Avoid errors for quotes that include newlines, or write code to extract only the text of the quote instead of the whole text of the quote element.
  • Build in concurrency using futures (but take care that quotes are still added to the results sequentially). Maybe we can already download the soups of all pages, as this takes the longest.
  • Write a function to return a random quote (see the sketch after this list)
  • Write a function to return a random quote within a tag
  • Implement a lower limit for the number of likes of quotes
  • Refactor the download_all_pages bit.
  • Add comments and docstrings.
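
As a rough start on the random-quote and likes-limit items, such a helper could build directly on the quote dictionaries the scraper already returns. This is just a sketch, not part of the script above, reusing download_goodreads_quotes and recreate_quote:

import random

def random_quote(quote_data, min_likes=0):
    # quote_data is the list of dictionaries returned by download_goodreads_quotes
    candidates = [q for q in quote_data if (q.get('likes') or 0) >= min_likes]
    return recreate_quote(random.choice(candidates)) if candidates else None

# For example, a random quote tagged 'robot' with at least 20 likes:
# quotes = download_goodreads_quotes('robot', max_pages=3, max_quotes=100)
# print(random_quote(quotes, min_likes=20))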

Feedback or tips?

I have been programming in R for quite a while now, but Python and software development in general are still new to me. This will probably be visible in the way I program, my syntax, the functions I use, or other things. Please provide any feedback you may have as I’d love to get better!

ArchiGAN: Designing buildings with reinforcement learning


I’ve seen some uses of reinforcement learning and generative algorithms for architectural purposes already, like these evolving blueprints for school floorplans. However, this new application called ArchiGAN blew me away!

ArchiGAN (try here) was made by Stanislas Chaillou as a Harvard master’s thesis project. The program functions in three steps:

  1. building footprint massing
  2. program repartition
  3. furniture layout
Stanislas’ three generation steps

Each of these three steps uses a TensorFlow Pix2Pix GAN model (Christopher Hesse's implementation) in the back end, and their combination makes for an entire apartment building "generation stack", as Stanislas calls it, which also allows for user input at each step.
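
Conceptually, the stack chains three image-to-image translations with room for user edits in between. A very rough sketch of that idea; run_pix2pix and user_edit are hypothetical placeholders, not functions from Chaillou's or Hesse's actual code:

def generation_stack(parcel_image):
    # Step 1: parcel shape -> building footprint massing
    footprint = user_edit(run_pix2pix('footprint_model', parcel_image))

    # Step 2: footprint (+ entrance and window positions) -> program repartition and fenestration
    layout = user_edit(run_pix2pix('repartition_model', footprint))

    # Step 3: room layout -> furniture placement
    return run_pix2pix('furnishing_model', layout)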

The design of a building can be inferred from the piece of land it stands on. Hence, Stanislas fed his first model with GIS (Geographic Information System) data from the city of Boston in order to generate typical footprints based on parcel shapes.

The inputs and outputs of model I

Stanislas’ second model was responsible for repartition and fenestration (the placement of windows and doors). This GAN took the footprint of the building (the output of model I) as input, along with the position of the entrance door (green square), and the positions of the user-specified windows.

Stanislas used a database of 800+ plans of apartments for training. To visualize the output, rooms are color-coded and walls and fenestration are blackened.

The inputs and outputs of model II

Finally, in the third model, the rooms are filled with appropriate furniture. Stanislas did not specify in the original blog what training data he used here.

The inputs and outputs of model III

Now, to put all things together, Stanislas created a great interactive tool you can play with yourself. The original NVIDIA blog contains some great GIFs of the tool in action.


Stanislas' GAN models progressively learned to design rooms and realistically position doors and windows. It took about 250 iterations to get some realistic floorplans out of the algorithm. Here's what an example learning sequence looked like:

Visualization of the training process

Now, Stanislas was not done yet. He also scaled up the use of GANs to design whole apartment buildings. Here, he chains the models and processes multiple units as single images at each step.

Generating whole apartment blocks using ArchiGAN

Stanislas did other cool things to improve the flexibility of his ArchiGAN models, about which you can read more in the original blog. Let these visuals entice you to read more:

ArchiGAN scaled to handle whole apartment blocks and neighborhoods.

I believe a statistical approach to design conception will shape AI’s potential for Architecture. This approach is less deterministic and more holistic in character. Rather than using machines to optimize a set of variables, relying on them to extract significant qualities and mimicking them all along the design process represents a paradigm shift.

Stanislas Chaillou (via)

I am so psyched about these innovative applications of machine learning, so please help me give Stanislas the attention and credit he deserves.

Currently, Stanislas is a Data Scientist & Architect at Spacemaker.ai. Read more about him in his NVIDIA developer bio here. He recently published a series of articles laying down the premise of AI's intersection with architecture. Read here about the historical background behind this significant evolution, to be followed by articles on AI's potential for floor plan design and for architectural style analysis & generation.

Learn Programming Project-Based: Build-Your-Own-X


Last week, this interesting Reddit thread was filled with overviews of cool projects that may help you learn a programming language. The top entries are:

There’s a wide range of projects you can get started on building:

If you want to focus on building stuff in a specific programming language, you can follow these links:

If you’re really into C, then follow these links to build your own:

Causal Random Forests, by Mark White


I stumbled across this incredibly interesting read by Mark White, who discusses the (academic) theory behind, the inner workings of, and example (R) applications of causal random forests:

Explicitly Optimizing on Causal Effects via the Causal Random Forest: A Practical Introduction and Tutorial (by Mark White)

These so-called "honest" forests seem like a great technique to identify opportunities for personalized actions: think of marketing, HR, medicine, healthcare, and other personalized recommendations. Note that an experimental setup for data collection is still necessary to gather the right data for these techniques.

https://www.markhw.com/blog/causalforestintro
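
Mark's tutorial works in R, but for readers who live in Python, roughly the same analysis can be sketched with the econml package. This is my own illustration with made-up data, not Mark's code:

import numpy as np
from econml.dml import CausalForestDML

# Toy experimental data: randomized treatment T, covariates X, outcome Y
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                 # person-level covariates
T = rng.integers(0, 2, size=n)              # randomly assigned treatment (e.g., received the campaign)
Y = X[:, 0] * T + rng.normal(size=n)        # outcome with a heterogeneous treatment effect

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)

# Estimated individual-level treatment effects: who benefits most from the action?
effects = est.effect(X)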

Artificial Stupidity – by Vincent Warmerdam @PyData 2019 London


PyData is famous for its great talks on machine learning topics. At this 2019 London edition, Vincent Warmerdam again managed to give a super inspiring presentation, in which he covers what he dubs Artificial Stupidity™. You should definitely watch the talk, which includes some great visual aids, but here are my main takeaways:

Vincent speaks of Artificial Stupidity: machine learning gone HorriblyWrong™ (an example of which is shown below), for which he elaborates on three potential fixes:

Example of a model that goes HorriblyWrong™, according to Vincent’s talk.

1. Predict Less, but Carefully

Vincent argues you shouldn't extrapolate your predictions outside of your observed sampling space. Even better: “Not predicting given uncertainty is a great idea.” As an alternative, we could for instance design a fallback mechanism, by including an outlier detection model as the first step of our machine learning pipeline and only predicting for non-outliers.
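
A minimal sketch of such a fallback mechanism, assuming scikit-learn; this is my own interpretation of the idea, not code from the talk:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

X_train, y_train = make_regression(n_samples=500, n_features=4, noise=5.0, random_state=0)

# Train an outlier detector on the inputs, next to the actual prediction model
gate = IsolationForest(random_state=0).fit(X_train)
model = LinearRegression().fit(X_train, y_train)

def predict_carefully(X_new):
    # Only predict for points that resemble the training data; abstain otherwise
    inlier = gate.predict(X_new) == 1
    return np.where(inlier, model.predict(X_new), np.nan)   # np.nan = no automated decision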

I definitely recommend you watch this specific section of Vincent's talk, because he gives some very visual and intuitive explanations of how extrapolation may go HorriblyWrong™.

Be careful! One thing we should maybe start talking about to our bosses: Algorithms merely automate, approximate, and interpolate. It’s the extrapolation that is actually kind of dangerous.

Vincent Warmerdam @ Pydata 2019 London

Basically, we can choose to not make automated decisions sometimes.

2. Constrain thy Features

What we feed to our models really matters. […] You should probably do something to the data going into your model if you want your model to have any sort of fairness guarantees.

Vincent Warmerdam @ Pydata 2019 London

Often, simply removing biased features from your data does not reduce bias to the extent we may have hoped. Fortunately, Vincent demonstrates how to remove biased information from your variables by applying some cool math tricks.
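
One common trick of this kind (not necessarily the exact one Vincent shows) is to regress every feature on the sensitive attribute and keep only the residuals, so the remaining columns are linearly uncorrelated with it. A minimal sketch with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

def remove_sensitive_information(X, sensitive):
    # Project the sensitive attribute out of each feature column
    sensitive = np.asarray(sensitive, dtype=float).reshape(-1, 1)
    reg = LinearRegression().fit(sensitive, X)   # multi-output regression, one fit per column
    return X - reg.predict(sensitive)            # residuals: the 'debiased' features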

Unfortunately, doing so will often result in lower predictive accuracy. That is unsurprising, as you are no longer closely fitting the biased data. What makes matters more problematic, Vincent rightfully mentions, is that corporate incentives often do not really align here: it can feel like you need to pick either more accuracy or more fairness.

However, there's a nice solution that builds on point 1. We can take the highly accurate model and the highly fair model, make predictions with both, and where these predictions differ, that's a very good proxy for where we potentially don't want to make a prediction. Hence, there may be observations where we are comfortable making a fair prediction, whereas in other situations we may say: "Right, this prediction seems unfair. We need a fallback mechanism; a human being should look at this and we should not automate this decision."
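
In code, that decision rule amounts to little more than comparing the two models' predictions and flagging the cases where they disagree (a sketch, with a made-up tolerance):

import numpy as np

def predict_with_fallback(accurate_model, fair_model, X, tolerance=0.1):
    pred_accurate = accurate_model.predict(X)
    pred_fair = fair_model.predict(X)

    # Where the two models disagree beyond the tolerance, route the case to a human
    disagree = np.abs(pred_accurate - pred_fair) > tolerance
    return np.where(disagree, np.nan, pred_fair)   # np.nan = don't automate this decision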

Vincent does note that this is only one trick to constrain your model for fairness, and that fairness may often only be fair in the eyes of the beholder. Moreover, in order to correct for these unfair biases, you need to know about them in the first place. Although outside the scope of this specific talk, Vincent proposes that this introduces new ethical issues of its own.

Basically, we can choose to put our models on a controlled diet.

3. Constrain thy Model

Vincent argues that we should include constraints (based on domain knowledge, or common sense) into our models. In his presentation, he names a few. For instance, monotonicity, which implies that the relationship between X and Y should always be either entirely non-increasing, or entirely non-decreasing. Incorporating the previously discussed fairness principles would be a second example, and there are many more.

If we ever come up with a model where more smoking leads to better health, that’s bad. I have enough domain knowledge to say that that should never happen. So maybe I should just make a system where I can say “look this one column with relationship to Y should always be strictly negative”.

Vincent Warmerdam @ Pydata 2019 London
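
Such monotonicity constraints are available out of the box in, for example, scikit-learn's gradient boosting. A small sketch with made-up smoking/health data, where monotonic_cst=-1 forces the prediction to be non-increasing in the first feature:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
smoking = rng.uniform(0, 40, size=500)                    # cigarettes per day
other = rng.normal(size=500)                              # some other covariate
health = -0.5 * smoking + other + rng.normal(size=500)    # toy health outcome

X = np.column_stack([smoking, other])

# -1: prediction may only decrease with smoking; 0: no constraint on the other feature
model = HistGradientBoostingRegressor(monotonic_cst=[-1, 0], random_state=0)
model.fit(X, health)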

Basically, we can integrate domain knowledge or preferences into our models.

Conclusion: Watch the talk!

Putting R in Production, by Heather Nolis & Mark Sellors


It is often said that R is hard to put into production. Fortunately, there are numerous talks demonstrating the contrary.

Here's one by Heather Nolis, who productionizes R models at T-Mobile. Her team even shares an open-source version of one of their productionized TensorFlow models on GitHub. Read more about that model here.

There’s another great talk on the RStudio website. In this talk, Mark Sellors discusses some of the misinformation around the idea of what “putting something into production” actually means, and provides some tips on overcoming obstacles.

Cover image via Fotolia.