Over the course of last week, I built a Python program that scrapes quotes from Goodreads.com in a tidy format. For instance, these are the first three results my program returns when scraping for the tag robot:
Quote | Author | Source | Likes | Tags
---|---|---|---|---
Goodbye, Hari, my love. Remember always–all you did for me. | Isaac Asimov | Forward the Foundation | 33 | ['asimov', 'foundation', 'human', 'robot']
Unfortunately this Electric Monk had developed a fault, and had started to believe all kinds of things, more or less at random. It was even beginning to believe things they’d have difficulty believing in Salt Lake City. | Douglas Adams | Dirk Gently’s Holistic Detective Agency | 25 | ['belief', 'humor', 'mormonism', 'religion', 'robot']
It’s hard to wipe your eyes when you have whirring buzzsaws for hands. | Daniel H. Wilson | How to Survive a Robot Uprising: Tips on Defending Yourself Against the Coming Rebellion | 20 | ['buzzaw', 'robot', 'survive', 'uprising']
“Paul, why the hell are you building a Python API for Goodreads quotes?” I hear you asking. Well, let me provide you with some context.
A while back, I created a Twitter bot called ArtificialStupidity.
As its bio reads, ArtificialStupidity is a highly sentient AI intelligently matching quotes and comics through state-of-the-art robotics, sophisticated machine learning, and blockchain technology.
Basically, every 15 minutes, a Python script is triggered on my computer (soon on my Raspberry Pi 4). Each time it triggers, this script generates a random number to determine whether it should post something. If so, the script subsequently generates another random number to determine what it should post: a quote, a comic, or both. Behind the scenes, some other functions add hashtags and — voila — a tweet is born!
(An upcoming post will elaborate on the inner workings of my ArtificialStupidity Python script)
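In the meantime, here is a minimal sketch of that decision logic. The probabilities, function names, and the tweet-composing helper are placeholders I made up for this post, not the actual bot code:

import random

# placeholder posting probability and options; the real bot's values may differ
POST_PROBABILITY = 0.25
OPTIONS = ['quote', 'comic', 'both']

def decide_post():
    # decide whether, and what, to post on a single 15-minute trigger
    if random.random() > POST_PROBABILITY:
        return None  # stay silent this round
    return random.choice(OPTIONS)

def compose_tweet(kind):
    # placeholder: draw from the quote/comic databases and add hashtags
    content = f'<{kind} drawn at random from my databases>'
    return f'{content} #ArtificialStupidity'

if __name__ == '__main__':
    kind = decide_post()
    if kind:
        print(compose_tweet(kind))  # the real script would tweet this instead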
More often than not, ArtificialStupidity produces some random, boring tweet:
However, every now and then, the bot actually manages to combine a quote with a comic in a way that gets some laughs:
Now, in order to compile these tweets, my computer hosts two databases: one containing data- and tech-related comics, the other a variety of inspirational quotes. Each time the ArtificialStupidity bot posts a tweet, it draws from one or both of these datasets at random. With, on average, one post every couple of hours, I thus need several hundred items in these databases in order to prevent repetition — which is definitely not entertaining.
Up until last week, I manually expanded these databases every week or so, adding new comics and quotes as I encountered them online. However, this proved a tedious task, particularly for the quotes, as I set up the database in a specific format (“quote” – author). In contrast, websites like Goodreads.com display their quotes in a different format (e.g., “quote” ― author, source \n tags \n likes). Apart from the different format, the apostrophes and the long dash also cause UTF-8 issues in my Python script. Hence, weekly reformatting of quotes proved an annoying task.
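For illustration, here is a minimal sketch of the kind of reformatting this involved: taking a pasted Goodreads-style quote block and normalizing it into the "quote" - author format. The helper name and the exact cleaning rules are just assumptions for this example (and the scraper below makes the manual step obsolete):

def normalize_pasted_quote(raw):
    # keep only the first line ('quote' ― author, source); drop the tags and likes lines
    first_line = raw.strip().splitlines()[0]
    # split on the long dash Goodreads places between quote and author
    quote, _, author = first_line.partition('―')
    # strip curly quotation marks and whitespace, and drop a trailing ', source'
    quote = quote.strip().strip('“”"')
    author = author.split(',')[0].strip()
    return f'"{quote}" - {author}'

example = '“It’s hard to wipe your eyes when you have whirring buzzsaws for hands.” ― Daniel H. Wilson, How to Survive a Robot Uprising\ntags: robot, uprising\n20 likes'
print(normalize_pasted_quote(example))
# "It’s hard to wipe your eyes when you have whirring buzzsaws for hands." - Daniel H. Wilson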
Up until this week!
While reformatting some bias-related quotes, I decided I’d rather invest ten times as much time developing my Python skills than mindlessly reformat quotes for a minute longer. So I started coding.
I am proud to say that, some six hours later, I have compiled the script below.
I’ll walk you through its functions.
So first, I import the modules/packages I need. Note that you will probably first have to pip install package-name on your own computer!

- argparse for the command-line interface arguments
- re for the regular expressions to clean quotes
- bs4 for its BeautifulSoup, for scraping website content
- urllib.request for opening URLs
- csv to save CSV files
- os for directory pathing
import argparse
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
import os
Next, I set up the argparse.ArgumentParser so that I can use the script from the command line (e.g., goodreads-scraper.py -t 'bias' -p 3 -q 80) and provide it with some arguments. No arguments are required: most have sensible defaults, and if you forget to provide a tag you will be prompted to provide one as the script runs (see later).
ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')
ap.add_argument("-t", "--tag",
required=False, type=str, default=None,
help="tag (topic/theme) of quotes to scrape")
ap.add_argument("-p", "--max_pages",
required=False, type=int, default=10,
help="maximum number of webpages to scrape")
ap.add_argument("-q", "--max_quotes",
required=False, type=int, default=100,
help="maximum number of quotes to scrape")
args = vars(ap.parse_args())
Now, the main function for this script is download_goodreads_quotes. This function contains many other functions within it. You will see that I set my functions up in a nested fashion, so that functions which are only used inside a certain scope are defined there. In regular words, I create the functions where I use them.

First, download_goodreads_quotes creates download_quotes_from_page. In turn, download_quotes_from_page creates and calls compile_url — to create the URL — get_soup — to download the URL contents — extract_quotes_elements_from_soup — to do just that — and extract_quote_dict. This latter function is the workhorse, as it takes each scraped quote element block of HTML and extracts the quote, author, source, and number of likes. It cleans each of these data points and returns them as a dictionary. In the end, download_quotes_from_page returns a list of dictionaries, one for every quote element block on a page.

Second, download_goodreads_quotes creates and calls download_all_pages, which calls download_quotes_from_page for all pages up to max_pages, or up to the page that no longer returns quote data, or until max_quotes has been reached. All gathered quote dictionaries are added to a results list.
def download_goodreads_quotes(tag, max_pages=1, max_quotes=50):

    def download_quotes_from_page(tag, page):

        def compile_url(tag, page):
            return f'https://www.goodreads.com/quotes/tag/{tag}?page={page}'

        def get_soup(url):
            response = urlopen(Request(url))
            return BeautifulSoup(response, 'html.parser')

        def extract_quotes_elements_from_soup(soup):
            elements_quotes = soup.find_all("div", {"class": "quote mediumText"})
            return elements_quotes

        def extract_quote_dict(quote_element):

            def extract_quote(quote_element):
                try:
                    quote = quote_element.find('div', {'class': 'quoteText'}).get_text("|", strip=True)
                    # first element is always the quote
                    quote = quote.split('|')[0]
                    quote = re.sub('^“', '', quote)
                    quote = re.sub('”\s?$', '', quote)
                    return quote
                except:
                    return None

            def extract_author(quote_element):
                try:
                    author = quote_element.find('span', {'class': 'authorOrTitle'}).get_text()
                    author = author.strip()
                    author = author.rstrip(',')
                    return author
                except:
                    return None

            def extract_source(quote_element):
                try:
                    source = quote_element.find('a', {'class': 'authorOrTitle'}).get_text()
                    return source
                except:
                    return None

            def extract_tags(quote_element):
                try:
                    tags = quote_element.find('div', {'class': 'greyText smallText left'}).get_text(strip=True)
                    tags = re.sub('^tags:', '', tags)
                    tags = tags.split(',')
                    return tags
                except:
                    return None

            def extract_likes(quote_element):
                try:
                    likes = quote_element.find('a', {'class': 'smallText', 'title': 'View this quote'}).get_text(strip=True)
                    likes = re.sub('likes$', '', likes)
                    likes = likes.strip()
                    return int(likes)
                except:
                    return None

            quote_data = {'quote': extract_quote(quote_element),
                          'author': extract_author(quote_element),
                          'source': extract_source(quote_element),
                          'likes': extract_likes(quote_element),
                          'tags': extract_tags(quote_element)}
            return quote_data

        url = compile_url(tag, page)
        print(f'Retrieving {url}...')
        soup = get_soup(url)
        quote_elements = extract_quotes_elements_from_soup(soup)
        return [extract_quote_dict(e) for e in quote_elements]

    def download_all_pages(tag, max_pages, max_quotes):
        results = []
        p = 1
        while p <= max_pages:
            res = download_quotes_from_page(tag, p)
            if len(res) == 0:
                print(f'No results found on page {p}.\nTerminating search.')
                return results
            results = results + res
            if len(results) >= max_quotes:
                print(f'Hit quote maximum ({max_quotes}) on page {p}.\nDiscontinuing search.')
                return results[0:max_quotes]
            else:
                p += 1
        return results

    return download_all_pages(tag, max_pages, max_quotes)
Additionally, I use two functions to actually store the scraped quotes: recreate_quote turns a quote dictionary back into a quote string (I actually do not use the source and likes, but maybe others want to do so); save_quotes calls this recreate_quote for the list of quote dictionaries it is given, and stores them in a text file in the current directory.

Update 2020/04/05: added UTF-8 encoding based on infoguild's comment.
def recreate_quote(dict):
    return f'"{dict.get("quote")}" - {dict.get("author")}'

def save_quotes(quote_data, tag):
    save_path = os.path.join(os.getcwd(), 'scraped' + '-' + tag + '.txt')
    print('saving file')
    with open(save_path, 'w', encoding='utf-8') as f:
        quotes = [recreate_quote(q) for q in quote_data]
        for q in quotes:
            f.write(q + '\n')
Finally, I need to call all these functions when the user runs this script from the command line. That’s what the following code does. It looks at the provided (default) arguments, and if no tag is provided, the user is prompted for one. Next, Goodreads.com is scraped using the earlier specified download_goodreads_quotes function, and the results are saved to a text file.
if __name__ == '__main__':
    tag = args['tag'] if args['tag'] != None else input('Provide tag to search quotes for: ')
    mp = args['max_pages']
    mq = args['max_quotes']
    result = download_goodreads_quotes(tag, max_pages=mp, max_quotes=mq)
    save_quotes(result, tag)
Use
If you paste these script pieces sequentially into a Python script / text file and save this file as goodreads-scraper.py, you can then run the script from your command line, like so: goodreads-scraper.py -t 'bias' -p 3 -q 80, where the text after -t is the tag you are searching for, -p is the maximum number of pages you want to scrape, and -q is the maximum number of quotes you want the program to scrape.
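For example, a run for the bias tag would print something along these lines (assuming roughly 30 quotes per Goodreads page, the quote maximum of 80 is hit on the third page):

$ python goodreads-scraper.py -t 'bias' -p 3 -q 80
Retrieving https://www.goodreads.com/quotes/tag/bias?page=1...
Retrieving https://www.goodreads.com/quotes/tag/bias?page=2...
Retrieving https://www.goodreads.com/quotes/tag/bias?page=3...
Hit quote maximum (80) on page 3.
Discontinuing search.
saving file

The resulting scraped-bias.txt then holds one "quote" - author line per scraped quote.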
Let me know what your favorite quote is once you get it running!
To-do
So this is definitely still a work in progress. Some potential improvements I want to integrate come to mind directly:
- Avoid errors for quotes including newlines, or
- Write code to extract only the text of the quote, instead of the whole text of the quote element.
- Build in concurrency using futures (but take care that quotes are still added to the results sequentially). Maybe we can already download the soups of all pages, as this takes the longest.
- Write a function to return a random quote
- Write a function to return a random quote within a tag
- Implement a lower limit for the number of likes of quotes (a first sketch of these last three items follows below)
- Refactor the download_all_pages bit.
- Add comments and docstrings.
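As a starting point for those last three items, here is a rough sketch of what a random-quote helper with a minimum-likes filter could look like, reusing download_goodreads_quotes and recreate_quote from above. The function names and defaults are just my first guesses, not a final design:

import random

def random_quote(quote_data, min_likes=0):
    # return one random quote dictionary, optionally requiring a minimum number of likes
    candidates = [q for q in quote_data if (q.get('likes') or 0) >= min_likes]
    return random.choice(candidates) if candidates else None

def random_quote_for_tag(tag, min_likes=0, max_pages=5, max_quotes=100):
    # scrape a tag on the fly and return a single random quote from the results
    quotes = download_goodreads_quotes(tag, max_pages=max_pages, max_quotes=max_quotes)
    return random_quote(quotes, min_likes=min_likes)

# Example: print a random 'bias' quote with at least 50 likes
# print(recreate_quote(random_quote_for_tag('bias', min_likes=50)))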
Feedback or tips?
I have been programming in R for quite a while now, but Python and software development in general are still new to me. This will probably be visible in the way I program, my syntax, the functions I use, or other things. Please provide any feedback you may have as I’d love to get better!
Thank you so much for this. I was able to pull 3000 quotes within 10 minutes or less (didn’t actually check when I started though).
I had issues using this initially.
Starting with having issues with the BeautifulSoup installation.
Also, you went ahead to import without telling us to first install the packages via "pip".
I tried running the code and it didn’t work.
So, I had to install them first and used this code; it still didn’t work, and I discovered it was a BeautifulSoup error.
I did a lot of googling and I found that "from bs4 import BeautifulSoup" worked for me instead of "from bs4 import BeautifulSoup4".
Later, I had issues with the charset encoding. So, I changed your code to output the quotes in UTF-8.
PS: For a new user to use the code:
1) Copy the code.
2) Save it using any IDE or even your notepad (ensure you select 'save as all file types') and give it a name, e.g. paulscrapper.py.
3) Save it on your desktop for easy access.
4) Open your command prompt and type "cd Desktop".
5) Then paste, for example: paulscrapper.py -t 'relationships' -p 50 -q 2000
I found out that each page holds about 30 quotes on Goodreads, hence 50 pages * 30 quotes = 1500, and I left room for more, thereby making it 2000.
Thanks for replying and taking the time to let me know you ran into some issues.
I’m sorry to have not been clear enough. I will definitely change the blog to state that a pip install is required first.
Fortunately, great to hear it worked out for you in the end!
Smart move regarding the encoding as well!
Thank you, Paul. This was a great script. I am grateful! You rock!
Hi Oluwasegun, can you please provide the py file?
Hi, Thanks for the great article.
But I am having an issue with the indentation. Can you please provide the .py file?
Hi Shekhar, I’ve integrated Oluwasegun's changes in my original article, so you can just copy that Python code there. That indentation also works as it is supposed to.