If you are looking for a project to build a bot or AI application, look no further.
Enter the stage: PyBoy, a Nintendo Game Boy (DMG-01, 1989) emulator written in Python 2.7. The implementation runs in almost pure Python, but with dependencies for drawing graphics and getting user interactions through SDL2 and NumPy.
PyBoy is great for your AI robot projects as it is loadable as an object in Python. This means it can be initialized from another script and then be controlled and probed by that script. You can even run multiple emulators at the same time; just instantiate the class multiple times.
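For illustration, here is a minimal sketch of driving the emulator from a script. It assumes you have installed the pyboy package and have a legally obtained ROM file at a path of your own choosing, and the exact tick/stop API may differ slightly between PyBoy versions.

from pyboy import PyBoy

pyboy = PyBoy('roms/your_game.gb')  # hypothetical path to your ROM file
for _ in range(600):                # advance the emulator frame by frame from your own script
    pyboy.tick()
pyboy.stop()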
The imagery suggests you can play anything from classic Super Mario to Pokémon. I suggest you start with the GitHub repository, the background report, and the PyBoy documentation right away.
I love how people are using data and data science to fight fake news these days (see also Identifying Dirty Twitter Bots), and I recently came across another great example.
Conspirador Norteño (real name unknown) is a member of what they call #TheResistance: a group of data scientists discovering and analyzing so-called botnets – networks of artificial accounts on social media websites like Twitter.
#TheResistance uses quantitative analysis to unveil large groups of fake accounts that spread potential fake news or fake-endorse the (fake) news spread by others.
In a recent Twitter thread, Norteño shows how they discovered that many of the early followers of Dr. Shiva Ayyadurai (self-proclaimed Inventor of Email) are likely bots.
They looked at the dates these accounts started following Shiva, offset against the dates the accounts were created. A remarkable pattern appeared:
Although @va_shiva‘s recent followers look unremarkable, a significant majority of his first 5000 followers appear to have been created in batches and to have subsequently followed @va_shiva in rapid succession.
Looking at those followers in more detail, other suspicious patterns emerge. Their names follow the same pattern, and they have roughly equal numbers of followers, followings, tweets, and (no) likes. Moreover, they were created only seconds apart. Many of them seem to follow each other as well.
If that wasn’t enough proof that something’s off, here’s a variety of their tweets… Not really what everyday folks would tweet, right? Plus, similar patterns appear again across accounts.
At first, I thought: so what? This Shiva guy probably just set up some automated (Python?) scripts to create Twitter accounts and have them follow him. Good for him. It worked out, as his most recent 10k followers followed him organically.
However, it becomes scarier once you notice this Shiva guy is (successfully) promoting the firing of people working for the government:
— Dr.SHIVA Ayyadurai, MIT PhD. Inventor of Email (@va_shiva) April 13, 2020
Anyways, I wanted to share this simple though cool approach to finding bots & fake news networks on social media. I hope you liked it, and I would love to hear your thoughts in the comments!
I’ve had this WordPress domain for several years now, and in the beginning it was very convenient.
WordPress enabled me to set up a fully functional blog in a matter of hours. Everything from HTML markup, external content embedding, databases, and simple analytics was already conveniently set up.
However, after a while, I wanted to do some more advanced stuff. Here, the disadvantages of WordPress hosting became evident fast. Anything beyond the most simple capabilities is locked firmly behind paywalls. Arguably rightfully so. If you want to use WordPress’ add-ins, I feel you should pay for them. That’s their business model after all.
However, what greatly annoys me is that WordPress actively hinders you from arranging matters yourself. Want to incorporate some JavaScript in your page? Upgrade to a paid account. Want to use Google Analytics? Upgrade and buy an add-in. Want to customize your HTML / CSS code? Upgrade or be damned. Even the simplest of tasks — just downloading visitor counts — WordPress made harder than it should be.
You can download visitor statistics manually — day by day, week by week, or year by year. However, there is no way to download your visitor history in batches. If you want to have your daily visiting history, you will manually have to download and store every day’s statistics.
For me, getting historic daily data would entail 1100 times entering a date, scrolling down, clicking a button, specifying a filename, and clicking to save. I did this once, for 36 monthly data snapshots, and the insights were barely worth the hassle, I assure you.
Fortunately, today, after nearly three years of hosting on WordPress, I finally managed to get past this annoyance! Using the Python script detailed below, my computer now autonomously logs in to WordPress and downloads the historic daily visitor statistics for all my blogs and pages!
Let me walk you through the program and code.
Modules & Setup
Before we jump into Python, you need to install Chromedriver. Just download the zip, unpack the executable somewhere you can find it, and make sure to copy its path into Python. You will need it later. Chromedriver allows Python’s selenium webdriver to open up and steer a Chrome browser.
We need another module for browsing: webdriver_manager. The other modules and their functions serve more common purposes: os for directory management, re for regular expressions, datetime for working with dates, and time for letting the computer sleep in between operations.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from datetime import datetime, timedelta
import os
import re
Helper Functions
I try to write my code in functions, so let’s dive into the functions that allow us to download visitor statistics.
To begin, we need to set up a driver (i.e., an automated browser), and this is what get_driver does. Two things are important here. First, the function takes an argument dir_download. You need to give it a path so it knows where to put any downloaded files. This path is stored under preferences in the driver options. Second, you need to specify the path_chromedriver argument. This needs to be the exact location where you unpacked chromedriver.exe. You can change all these paths later in the main program, so don’t worry about them for now. The get_driver function returns a ready-to-go driver object.
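A minimal sketch of what get_driver could look like; the Chrome preference keys below are common selenium settings and an assumption on my part, not necessarily the exact options used in the original script.

def get_driver(dir_download, path_chromedriver):
    # Store the download folder under the driver's preferences
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option('prefs', {
        'download.default_directory': dir_download,
        'download.prompt_for_download': False,
    })
    # Point selenium to the unpacked chromedriver.exe and return a ready-to-go driver
    driver = webdriver.Chrome(executable_path=path_chromedriver, options=chrome_options)
    return driver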
Next, our driver will need to know where to browse to. The function below, compile_traffic_url, uses an f-string to generate the url for the visitor statistics overview of a specific domain and date. Important here is that you will need to change the domain default from paulvanderlaken.com to your own WordPress address. Take a look at the statistics overview in your regular browser to see how you may need to tailor your urls.
Now, in the rest of the program, I work with dates formatted and stored as datetime date objects. By default, the compile_traffic_url function also uses a datetime date argument for today’s date. However, WordPress expects simple string dates in the urls. Hence, I need a way to convert these datetime dates into simpler strings. That’s what the strftime function below does. It formats a datetime date as a date string in the format YYYY-MM-DD.
So we know how to generate the urls for the pages we want to scrape; we compile them using these handy functions.
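A minimal sketch of these two helpers; the exact url pattern is an assumption based on how the wordpress.com statistics pages looked for me, so check the url of your own daily traffic overview.

def strftime(date):
    # Format a datetime date as a YYYY-MM-DD string, as WordPress expects in its urls
    return date.strftime('%Y-%m-%d')

def compile_traffic_url(domain='paulvanderlaken.com', date=datetime.today().date()):
    # Compile the url of the daily visitor statistics overview for a domain and date
    return f'https://wordpress.com/stats/day/{domain}?startDate={strftime(date)}'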
If we let the driver browse directly to one of these compiled traffic urls, we find ourselves redirected to the WordPress login page, like below. That’s a bummer!
Hence, whenever we start our program, we will first need to log in once using our password. That’s what the signing_in function below is for. This function takes in a driver, a username, and a password. It uses the compile_traffic_url function to generate a traffic url (by default for today’s traffic [see above]). Then the driver loads the website using its get method. This will redirect us to the WordPress login page. To give the webpage time to load before our driver starts clicking away, we let our computer sleep a bit using time.sleep.
Now, our automated driver is looking at the WordPress login page. We need to help it find where to input the username and password. If you press CTRL+SHIFT+C while on any webpage, the HTML behind it will show. Now you can just browse over the webpage elements, like the login input fields, and see what their CSS selectors, names, and classes are.
If you press CTRL+SHIFT+C on a webpage, the html behind it will show.
So, next, I order the driver to find the HTML element of the username input field and to send my username keys into it. We ask the driver to find the Continue button and click it. Time for the driver to sleep again, while the page loads the password input field. Afterwards, we ask the driver to find the password input field, input our password, and click the Continue button a second time. While our automatic login completes, we let the computer sleep some more.
Once we have logged in once, we will remain logged in until the Python program ends, which closes the driver.
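A sketch of the signing_in routine; the element selectors below are assumptions, so inspect the WordPress login page yourself (CTRL+SHIFT+C, as described above) to find the current names of the input fields and buttons.

def signing_in(driver, username, password):
    # Browse to today's traffic url; this redirects us to the WordPress login page
    url = compile_traffic_url()
    driver.get(url)
    sleep(5)
    # Fill in the username and click Continue (element names are assumptions)
    driver.find_element_by_name('usernameOrEmail').send_keys(username)
    driver.find_element_by_class_name('button').click()
    sleep(5)
    # Fill in the password and click Continue a second time
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_class_name('button').click()
    sleep(5)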
Okay, so now that we have a function that logs us in, let’s start downloading our visitor statistics!
The download_traffic function takes in a driver, a date, and a list of dates_downloaded (an empty list by default). First, it checks whether the date to download occurs in dates_downloaded. If so, we do not want to waste time downloading statistics we already have. Otherwise, it puts the driver to work downloading the traffic for the specified date following these steps:
Compile url for the specified date
Driver browses to the webpage of that url
Computer sleeps while the webpage loads
Driver executes script, letting it scroll down to the bottom of the webpage
Driver is asked to find the button to download the visitor statistics in csv
Driver clicks said button
Computer sleeps while the csv is downloaded
If anything goes wrong during these steps, an error message is printed and no document is downloaded. With no document downloaded, our program can try again for that link the next time.
def download_traffic(driver, date, dates_downloaded=[]):
    if date in dates_downloaded:
        print(f'Already downloaded {date} traffic')
    else:
        try:
            print(f'Downloading {date} traffic')
            url = compile_traffic_url(date=date)
            driver.get(url)
            sleep(1)
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            button = driver.find_element_by_class_name('stats-download-csv')
            button.click()
            sleep(1)
        except:
            print(f'Error during downloading of {date}')
We need one more function, to generate the dates_downloaded list for download_traffic. The date_from_filename function below takes in a filename (e.g., paulvanderlaken.com_posts_day_12_28_2019_12_28_2019) and searches it for a date using a regular expression. The found match is turned into a datetime date using strptime and returned. This allows us to walk through a directory on our computer and see for which dates we have already downloaded visitor statistics. You will see how this works in the main program below.
def date_from_filename(filename):
    match = re.search(r'\d{2}_\d{2}_\d{4}', filename)
    date = datetime.strptime(match.group(), '%m_%d_%Y').date()
    return date
Main program
In the end, we combine all of the above functions in our main program. Here you will need to change five things to make it work on your computer:
path_data – enter a folder path where you want to store the retrieved visitor statistics csv’s
path_chromedriver – enter the path to the chromedriver.exe you unpacked
first_date – enter the date from which you want to start scraping (by default up to today)
username – enter your WordPress username or email address
password – enter your WordPress password
if __name__ == '__main__':
    path_data = 'C:\\Users\\paulv\\stack\\projects\\2019_paulvanderlaken.com-anniversary\\traffic-day\\'
    path_chromedriver = 'C:\\Users\\paulv\\chromedriver.exe'
    first_date = datetime(2017, 1, 18).date()
    last_date = datetime.today().date()
    username = "insert_username"
    password = "insert_password"
    driver = get_driver(dir_download=path_data, path_chromedriver=path_chromedriver)
    days_delta = last_date - first_date
    days = [first_date + timedelta(days) for days in range(days_delta.days + 1)]
    dates_downloaded = [date_from_filename(file) for _, _, f in os.walk(path_data) for file in f]
    signing_in(driver, username=username, password=password)
    for d in days:
        download_traffic(driver, d, dates_downloaded)
    driver.close()
If you have downloaded Chromedriver, copied all the code blocks from this blog into a Python script, and added your personal paths, username, and password, this Python program should work like a charm on your computer as well. By default, the program will scrape statistics for all days from first_date up to the day you run the program, but you can obviously change this.
Results
For me, the program took about 10 seconds to download one csv consisting of statistics for one day. So three years of WordPress blogging, or 1095 daily datasets of statistics, were extracted in about 3 hours. I did some nice cooking and wrote this blog in the meantime : )
The result after 3 hours of scraping
Compare that to the horror of having to surf, scroll, and click that godforsaken Download data as CSV button ~1100 times!!
The horror button (in Dutch)
Final notes
The main goal of this blog was to share the basic inner workings of this scraper with you, and to give you the same tool to scrape your own visitor statistics.
Now, this project can still be improved tremendously and in many ways. For instance, with very little effort you could add some command line arguments (with argparse) so you can run this program directly or schedule it daily. My next step is to set it up to run daily on my Raspberry Pi.
An additional potential improvement: when the current script encounters no statistics to download for a specific day, no csv is saved. This makes the program try again the next time it is run, as the dates_downloaded list will not include that date. Some minor tweaks will probably solve this issue.
Moreover, there are many more statistics you could scrape from your WordPress account, like external clicks, the visitors’ home countries, search terms, et cetera.
The above are improvement points you can further develop yourself, and if you do please share them with the greater public so we can all benefit!
For now, I am happy with these data, and will start on building some basic dashboards and visualizations to derive some insights from my visitor patterns. If you have any ideas or experiences please let me know!
I hope this walkthrough and code may help you get in control of your WordPress website as well, or that you learned a thing or two about basic web scraping with Python. I am still just starting out with Python myself, so if you have any tips, tricks, feedback, or general remarks, please do let me know! I am always happy to talk code and love to start pet projects to improve my programming skills, so do reach out if you have any ideas!
Over the course of last week, I built a Python program that scrapes quotes from Goodreads.com in a tidy format. For instance, these are the first three results my program returns when scraping for the tag robot:
quote | author | source | likes | tags
Goodbye, Hari, my love. Remember always–all you did for me. | Isaac Asimov | Forward the Foundation | 33 | ['asimov', 'foundation', 'human', 'robot']
Unfortunately this Electric Monk had developed a fault, and had started to believe all kinds of things, more or less at random. It was even beginning to believe things they’d have difficulty believing in Salt Lake City. | | | |
As its bio reads, ArtificialStupidity is a highly sentient AI intelligently matching quotes and comics through state-of-the-art robotics, sophisticated machine learning, and blockchain technology.
Basically, every 15 minutes, a Python script is triggered on my computer (soon on my Raspberry Pi 4). Each time it triggers, this script generates a random number to determine whether it should post something. If so, the script subsequently generates another random number to determine what it should post: a quote, a comic, or both. Behind the scenes, some other functions add hashtags and — voila — a tweet is born!
(An upcoming post will elaborate on the inner workings of my ArtificialStupidity Python script)
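As a teaser, here is a minimal sketch of that posting logic; the probability and the two helper functions are hypothetical placeholders, not the actual bot code.

import random

def post_quote():
    print('posting a random quote')     # placeholder for the real quote-tweeting logic

def post_comic():
    print('posting a random comic')     # placeholder for the real comic-tweeting logic

def maybe_post(post_probability=0.25):  # the probability is an assumed value
    # First random draw: should the bot post at all this time?
    if random.random() < post_probability:
        # Second random draw: what should it post?
        choice = random.choice(['quote', 'comic', 'both'])
        if choice in ('quote', 'both'):
            post_quote()
        if choice in ('comic', 'both'):
            post_comic()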
More often than not, ArtificialStupidity produces some random, boring tweet:
Now, in order to compile these tweets, my computer hosts two databases: one containing data- and tech-related comics; the other a variety of inspirational quotes. Each time the ArtificialStupidity bot posts a tweet, it draws randomly from one or both of these datasets. With, on average, one post every couple of hours, I need several hundred items in these databases to prevent repetition — which is definitely not entertaining.
Up until last week, I manually expanded these databases every week or so, adding new comics and quotes as I encountered them online. However, this proved a tedious task. Particularly for the quotes, as I set up the database in a specific format (“quote” – author). In contrast, websites like Goodreads.com display their quotes in a different format (e.g., “quote” ― author, source \n tags \n likes). Apart from the different format, the apostrophes and long dash also cause UTF-8 issues in my Python script. Hence, weekly reformatting of quotes proved an annoying task.
Up until this week!
While reformatting some bias-related quotes, I decided I’d rather invest ten times more time developing my Python skills than mindlessly reformat quotes for a minute longer. So I started coding.
I am proud to say that, some six hours later, I have compiled the script below.
I’ll walk you through its functions.
So first, I import the modules/packages I need. Note that you will probably have to pip install package-name on your own computer first!
argparse for the command-line interface arguments
re for the regular expressions to clean quotes
bs4 for its BeautifulSoup for scraping website content
urllib.request for opening urls
csv to save csv files
os for directory pathing
import argparse
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
import os
Next, I set up the argparse.ArgumentParser so that I can use the script from the command line. Now you can call the Python script using the command line (e.g., goodreads-scraper.py -t 'bias' -p 3 -q 80) and provide it with some arguments. No arguments are required; most have sensible defaults. If you forget to provide a tag, you will be prompted to provide one as the script runs (see later).
ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')
ap.add_argument("-t", "--tag",
                required=False, type=str, default=None,
                help="tag (topic/theme) of quotes to scrape")
ap.add_argument("-p", "--max_pages",
                required=False, type=int, default=10,
                help="maximum number of webpages to scrape")
ap.add_argument("-q", "--max_quotes",
                required=False, type=int, default=100,
                help="maximum number of quotes to scrape")
args = vars(ap.parse_args())
Now, the main function of this script is download_goodreads_quotes. This function contains many other functions within it. You will see I set my functions up in a nested fashion, so that functions which are only used inside a certain scope are instantiated there. In regular words, I create the functions where I use them.
First, download_goodreads_quotes creates download_quotes_from_page. In turn, download_quotes_from_page creates and calls compile_url — to create the url — get_soup — to download the url contents — extract_quotes_elements_from_soup — to do just that — and extract_quote_dict. This latter function is the workhorse, as it takes each scraped quote element block of HTML and extracts the quote, author, source, and number of likes. It cleans each of these data points and returns them as a dictionary. In the end, download_quotes_from_page returns a list of dictionaries, one for every quote element block on a page.
Second, download_goodreads_quotes creates and calls download_all_pages, which calls download_quotes_from_page for all pages up to max_pages, or up to the page that no longer returns quote data, or until the number of max_quotes has been reached. All gathered quote dictionaries are added to a results list.
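Below is a condensed sketch of how these nested functions could fit together. The url pattern and the CSS classes ('quote', 'quoteText', 'authorOrTitle', and so on) are assumptions about Goodreads’ page layout at the time of writing, so they may need updating if the site changes.

def download_goodreads_quotes(tag, max_pages=10, max_quotes=100):

    def download_quotes_from_page(tag, page):

        def compile_url(tag, page):
            return f'https://www.goodreads.com/quotes/tag/{tag}?page={page}'

        def get_soup(url):
            # Pretend to be a regular browser, then parse the html
            request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
            return BeautifulSoup(urlopen(request).read(), 'html.parser')

        def extract_quotes_elements_from_soup(soup):
            # Every quote sits in its own div with class 'quote' (assumption)
            return soup.find_all('div', class_='quote')

        def extract_quote_dict(element):
            # Pull the quote, author, source, likes, and tags out of one element block
            try:
                text = element.find('div', class_='quoteText').get_text('|').split('|')[0]
                quote = re.sub(r'^[\s"“]+|[\s"”]+$', '', text)
                author = element.find('span', class_='authorOrTitle').get_text().strip().rstrip(',')
                source = element.find('a', class_='authorOrTitle')
                source = source.get_text().strip() if source else None
                likes = int(re.search(r'\d+', element.find('a', class_='smallText').get_text()).group())
                tags_div = element.find('div', class_='greyText')
                tags = [a.get_text() for a in tags_div.find_all('a')] if tags_div else []
                return {'quote': quote, 'author': author, 'source': source,
                        'likes': likes, 'tags': tags}
            except AttributeError:
                return None  # skip blocks that miss one of the expected elements

        soup = get_soup(compile_url(tag, page))
        elements = extract_quotes_elements_from_soup(soup)
        return [d for d in map(extract_quote_dict, elements) if d is not None]

    def download_all_pages(tag, max_pages, max_quotes):
        results = []
        for page in range(1, max_pages + 1):
            new_quotes = download_quotes_from_page(tag, page)
            if not new_quotes:              # this page no longer returns quote data
                break
            results.extend(new_quotes)
            if len(results) >= max_quotes:  # stop once we have enough quotes
                return results[:max_quotes]
        return results

    return download_all_pages(tag, max_pages, max_quotes)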
Additionally, I use two functions to actually store the scraped quotes: recreate_quote turns a quote dictionary into a quote (I do not actually use the source and likes, but maybe others want to); save_quotes calls recreate_quote for the list of quote dictionaries it is given, and stores them in a text file in the current directory.
Update 2020/04/05: added UTF-8 encoding based on infoguild‘s comment.
def recreate_quote(dict):
    return f'"{dict.get("quote")}" - {dict.get("author")}'

def save_quotes(quote_data, tag):
    save_path = os.path.join(os.getcwd(), 'scraped' + '-' + tag + '.txt')
    print('saving file')
    with open(save_path, 'w', encoding='utf-8') as f:
        quotes = [recreate_quote(q) for q in quote_data]
        for q in quotes:
            f.write(q + '\n')
Finally, I need to call all these functions when the user runs this script via the command line. That’s what the following code does. It looks at the provided (default) arguments, and if no tag is provided, the user is prompted for one. Next, Goodreads.com is scraped using the earlier specified download_goodreads_quotes function, and the results are saved to a text file.
if __name__ == '__main__':
    tag = args['tag'] if args['tag'] != None else input('Provide tag to search quotes for: ')
    mp = args['max_pages']
    mq = args['max_quotes']
    result = download_goodreads_quotes(tag, max_pages=mp, max_quotes=mq)
    save_quotes(result, tag)
Use
If you paste these script pieces sequentially into a Python script / text file and save this file as goodreads-scraper.py, you can then run the script from your command line, like so: goodreads-scraper.py -t 'bias' -p 3 -q 80, where the text after -t is the tag you are searching for, -p is the maximum number of pages you want to scrape, and -q is the maximum number of quotes you want the program to scrape.
Let me know what your favorite quote is once you get it running!
To-do
So this is definitely still work in progress. Some potential improvements I want to integrate come directly to mind:
Avoid errors for quotes including newlines, or
Write code to extract only the text of the quote, instead of the whole text of the quote element.
Build in concurrency using futures (but take care that quotes are still added to the results sequentially). Maybe we can already download the soups of all pages up front, as this takes the longest; see the sketch after this list.
Write a function to return a random quote
Write a function to return a random quote within a tag
Implement a lower limit for the number of likes of quotes
Refactor the download_all_pages bit.
Add comments and docstrings.
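For the concurrency item above, here is a rough sketch of what downloading all soups up front could look like. The get_soup helper mirrors the one sketched earlier (so the same layout assumptions apply), and executor.map preserves the input order, so the quotes would still be processed in page order.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def get_soup(url):
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return BeautifulSoup(urlopen(request).read(), 'html.parser')

def download_all_soups(tag, max_pages):
    urls = [f'https://www.goodreads.com/quotes/tag/{tag}?page={p}'
            for p in range(1, max_pages + 1)]
    with ThreadPoolExecutor(max_workers=5) as executor:
        # map keeps the results in the same order as the urls
        return list(executor.map(get_soup, urls))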
Feedback or tips?
I have been programming in R for quite a while now, but Python and software development in general are still new to me. This will probably be visible in the way I program, my syntax, the functions I use, or other things. Please provide any feedback you may have as I’d love to get better!
After several years of procrastinating, the inevitable finally happened: three months ago, I committed to learning Python!
I must say that getting started was not easy. One afternoon three months ago, I sat down, motivated to get started. Obviously, the first step was to download and install Python, as well as something to write actual Python code in. Coming from R, I had expected to be coding in a handy IDE within an hour or so. Oh boy, was I wrong.
Apparently, there were already a couple of versions of Python present on my computer. And apparently, they were in grave conflict. I had one for the R reticulate package; one had come with Anaconda; another one from messing around with TensorFlow; and some more even. I was getting all kinds of error, warning, and conflict messages, only 10 minutes in. Nothing I couldn’t handle in the end, but my good spirits had dropped slightly.
With Python installed, the obvious next step was to find the RStudio among the Python IDEs and get working in that new environment. As a rational consumer, I went online to read about what people recommend as a good IDE. PyCharm seemed to be quite fancy for data science. However, what’s this Spyder alternative other people keep talking about? Come again, there are also Rodeo, Thonny, PyDev, and Wing? What about those then? A whole other group of Pythonistas said that, as I work in data science, I should get Anaconda and work solely in Jupyter Notebooks! Okay…? But I want to learn Python to broaden my skills and do more regular software development as well. Maybe I should start simple, in a (code) editor? However, here we have Atom, Sublime Text, Vim, and Eclipse? All these decisions. And I personally really dislike making regrettable decisions or committing to something suboptimal. This was already taking much, much longer than the few hours I had planned for setup.
This whole process demotivated me so much that I reverted back to programming in R and RStudio the week after. However, I had not given up. Over the course of the week, I narrowed the selection down to Anaconda Jupyter Notebooks, PyCharm, and Atom, and I was ready to pick one. But wait… what’s this Visual Studio Code (VSC) thing by Microsoft? This looks fancy. And it’s still being developed and expanded. I had already been working in Visual Studio while learning C++, and my experiences had been good so far. Moreover, Microsoft seems a reliable software development company; surely they must be able to build a good IDE? I decided to do one last deep-dive.
The more I read about VSC and its features for Python, the more excited I got. Hey, VSC’s Python extension automatically detects Python interpreters, so it solves my conflicts problem. Linting, you say? Never heard of it, but I’ll have it. Okay, able to run notebooks, nice! Machine learning-based IntelliSense autocompletes your Python code – that sounds like something I’d like. Easy debugging, testing, and handy snippets… Okay! A shit-ton of extensions? Yes please! Multi-language support – even tools for R programming? Say no more! I’ll take it. I’ll take it all!
Linting in VSC provides code suggestions
My good friends at Microsoft were not done yet, though. To top it all off, they have documented everything so well. It’s super easy to get started! There are numerous ordered pages dedicated to helping you set up and discover your new Python environment in VSC:
The Microsoft VSC pages also link to some more specific resources:
Editing Python in VS Code: Learn more about how to take advantage of VS Code’s autocomplete and IntelliSense support for Python, including how to customize their behavior… or just turn them off.
Linting Python: Linting is the process of running a program that will analyse code for potential errors. Learn about the different forms of linting support VS Code provides for Python and how to set it up.
Debugging Python: Debugging is the process of identifying and removing errors from a computer program. This article covers how to initialize and configure debugging for Python with VS Code, how to set and validate breakpoints, attach a local script, perform debugging for different app types or on a remote computer, and some basic troubleshooting.
Unit testing Python: Covers some background explaining what unit testing means, an example walkthrough, enabling a test framework, creating and running your tests, debugging tests, and test configuration settings.
Python IntelliSense in VSC makes real-time code autocomplete suggestions
My Own Python Journey
So, three months in, I am completely blown away by how easy, fun, and versatile the language is. Nearly anything is possible, most of the language is intuitive and straightforward, and there’s a package for anything you can think of. Although I have spent many hours, I am very happy with the results. I did not get this far, this quickly, in any other language. Let me share some of the stuff I’ve done the past three months.
I’ve mainly been building stuff. Some things from scratch, others by tweaking and recycling other people’s code. In my opinion, reusing other people’s code is not necessarily bad, as long as you understand what the code does. Moreover, I’ve combed through lists and lists of build-it-yourself projects to get inspiration for projects and used stuff from my daily work and personal life as further reasons to code. I ended up building:
solutions to the first 31 problems of Project Euler, which I highly recommend you try to solve yourself!
solutions to the first dozen problems posed in Automate the Boring Stuff with Python. This book and online tutorial forces you to get your hands dirty right from the start. Simply amazing content, and the learning curve is just right
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning." – Rick Cook#programming#coding#ArtificialStupidity no.20 pic.twitter.com/cBiR1HQszn
all Socratica Python Youtube videos. They are simply a fantastic introduction to the language and amazingly amusing. You can sponsor them here
hours and hours of Corey Schafer’s YouTube channel. Seriously good quality content, and more in-depth than Socratica. Corey covers the versatile functionality included in the standard Python libraries and then some
Although it is no longer maintained, you might find some more interesting links on my Python resources page, or here for those transitioning from R, if only for the links to more up-to-date resources pages. Anyway, I hope this current blog helps you on your Python journey or with getting Python and Visual Studio Code working on your computer. Please feel free to share any of the stories, struggles, or successes you experience!
Last week, this interesting Reddit thread was filled with overviews of cool projects that may help you learn a programming language. The top entries are: