Tag: webscraping

Learn to style HTML using CSS — Tutorials by Mozilla

Cascading Stylesheets — or CSS — is the first technology you should start learning after HTML. While HTML is used to define the structure and semantics of your content, CSS is used to style it and lay it out. For example, you can use CSS to alter the font, color, size, and spacing of your content, split it into multiple columns, or add animations and other decorative features.
https://developer.mozilla.org/en-US/docs/Learn/CSS

I have personally encountered CSS at multiple stages of my Data Science career:

When I started using (R) markdown (see here, or here), I could present my data science projects as HTML pages, styled through CSS.
When I got more accustomed to building web applications (e.g., Shiny) on top of my data science models, I had to use CSS to build more beautiful dashboard layouts.
When I was scraping data from Ebay, Amazon, WordPress, and Goodreads, my prior experiences with CSS & HTML helped greatly to identify and interpret the elements when you look under the hood of a webpage (try pressing CTRL + SHIFT + C).

I know others agree with me when I say that the small investment in learning the basics behind HTML & CSS pays off big time:

ok listen……. i finally took a few hours to learn some CSS basics and big time recommend to any and all #rstats people who have always felt absolutely clueless looking up CSS stuff on stack overflow
— Sharla Gelfand (@sharlagelfand) June 30, 2020

I read that Mozilla offers some great tutorials for those interested in learning more about “the web”, so here are some quicklinks to their free tutorials:

Screenshot via developer.mozilla.org/en-US/docs/Learn/CSS/CSS_layout/Introduction

Python Web Scraping: WordPress Visitor Statistics

I’ve had this WordPress domain for several years now, and in the beginning it was very convenient.

WordPress enabled me to set up a fully functional blog in a matter of hours. Everything from HTML markup, external content embedding, databases, and simple analytics was already conveniently set up.

However, after a while, I wanted to do some more advanced stuff. Here, the disadvantages of WordPress hosting became evident fast. Anything beyond the most simple capabilities is locked firmly behind paywalls. Arguably rightfully so. If you want to use WordPress’ add-ins, I feel you should pay for them. That’s their business model after all.

However, what greatly annoys me is that WordPress actively hinders you from arranging matters yourself. Want to incorporate some JavaScript in your page? Upgrade to a paid account. Want to use Google Analytics? Upgrade and buy an add-in. Want to customize your HTML / CSS code? Upgrade or be damned. Even the simplest of tasks — just downloading visitor counts — WordPress made harder than it should be.

You can download visitor statistics manually — day by day, week by week, or year by year. However, there is no way to download your visitor history in batches. If you want to have your daily visiting history, you will manually have to download and store every day’s statistics.

For me, getting historic daily data would entail 1100 times entering a date, scrolling down, clicking a button, specifying a filename, and clicking to save. I did this once, for 36 monthly data snapshots, and the insights were barely worth the hassle, I assure you.

Fortunately, today, after nearly three years of hosting on WordPress, I finally managed to circumvent this annoyance! Using the Python script detailed below, my computer now autonomously logs in to WordPress and downloads the historic daily visitor statistics for all my blogs and pages!

Let me walk you through the program and code.

Modules & Setup

Before we jump into Python, you need to install Chromedriver. Just download the zip and unpack the executable somewhere you can find it, and make sure to copy the path into Python. You will need it later. Chromedriver allows Python’s selenium webdriver to open up and steer a Chrome browser.

We need another module for browsing: webdriver_manager. The other modules and their functions are for more common purposes: os for directory management, re for regular expressions, datetime for working with dates, and time for letting the computer sleep in between operations.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from datetime import datetime, timedelta
import os
import re

Helper Functions

I try to write my code in functions, so let’s dive into the functions that allow us to download visitor statistics.

To begin, we need to set up a driver (i.e., automated browser) and this is what get_driver does. Two things are important here. Firstly, the function takes an argument dir_download. You need to give it a path so it knows where to put any downloaded files. This path is stored under preferences in the driver options. Secondly, you need to specify the path_chromedriver argument. This needs to be the exact location you unpacked the chromedriver.exe. All these paths you can change later in the main program, so don’t worry about them for now. The get_driver function returns a ready-to-go driver object.

def get_driver(dir_download, path_chromedriver):
    chrome_options = webdriver.ChromeOptions()
    # Tell Chrome where to store downloaded files
    prefs = {'download.default_directory': dir_download}
    chrome_options.add_experimental_option('prefs', prefs)
    # Selenium 3 style; newer Selenium versions expect the path via a Service object
    driver = webdriver.Chrome(executable_path=path_chromedriver, options=chrome_options)
    return driver

Next, our driver will need to know where to browse to. So the function below, compile_traffic_url, uses an f-string to generate the url for the visitor statistics overview of a specific domain and date. Important here is that you will need to change the domain default from paulvanderlaken.com to your own WordPress address. Take a look at the statistics overview in your regular browser to see how you may tailor your urls.

Now, in the rest of the program, I work with dates stored as datetime.date objects. By default, the compile_traffic_url function also uses such a date, namely today’s date. However, WordPress expects simple string dates in the urls. Hence, I need a way to convert these datetime dates into simpler strings. That’s what the strftime method below does. It formats a datetime date as a date_string, in the format YYYY-MM-DD.

def compile_traffic_url(domain='paulvanderlaken.com', date=datetime.today().date()):
    date_string = date.strftime('%Y-%m-%d')
    return f'https://wordpress.com/stats/day/posts/{domain}?startDate={date_string}'
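
One caveat worth knowing: Python evaluates default arguments once, when the function is defined, so the default date above is fixed at the moment the script starts. For a script that runs for a few hours this rarely matters. As a minimal sketch of an alternative (my own variation, not needed for the program to work), the default can be resolved at call time instead:

```python
from datetime import datetime


def compile_traffic_url(domain='paulvanderlaken.com', date=None):
    # Resolve "today" when the function is called, not when it is defined
    if date is None:
        date = datetime.today().date()
    date_string = date.strftime('%Y-%m-%d')  # e.g. 2019-12-28
    return f'https://wordpress.com/stats/day/posts/{domain}?startDate={date_string}'


example_url = compile_traffic_url(date=datetime(2019, 12, 28).date())
print(example_url)
```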

So we know how to generate the urls for the pages we want to scrape. We compile them using this handy function.

If we let the driver browse directly to one of these compiled traffic urls, it will be redirected to the WordPress login page, like below. That’s a bummer!

Hence, whenever we start our program, we will first need to log in once using our password. That’s what the signing_in function below is for. This function takes in a driver, a username, and a password. It uses the compile_traffic_url function to generate a traffic url (by default of today’s traffic [see above]). Then the driver loads the website using its get method. This will redirect us to the WordPress login page. In order for the webpages to load before our driver starts clicking away, we let our computer sleep a bit, using time.sleep.

def signing_in(driver, username, password):
    print('Sign in routine')

    url = compile_traffic_url()

    driver.get(url)
    sleep(1)

    field_email = driver.find_element_by_css_selector('#usernameOrEmail')
    field_email.send_keys(username)

    button_submit = driver.find_element_by_class_name('button')
    button_submit.click()

    sleep(1)

    field_password = driver.find_element_by_css_selector('#password')
    field_password.send_keys(password)

    button_submit = driver.find_element_by_class_name('button')
    button_submit.click()

    sleep(2)

Now, our automated driver is looking at the WordPress login page. We need to help it find where to input the username and password. If you press CTRL+SHIFT+C while on any webpage, the HTML behind it will show. Now you can just browse over the webpage elements, like the login input fields, and see what their CSS selectors, names, and classes are.

If you press `CTRL`+`SHIFT`+`C` on a webpage, the html behind it will show.

So, next, I order the driver to find the HTML element of the username-input field and input my username keys into it. We ask the driver to find the Continue-button and click it. Time for the driver to sleep again, while the page loads the password input field. Afterwards, we ask the driver to find the password input field, input our password, and click the Continue-button a second time. While our automatic login completes, we let the computer sleep some more.

Once we have logged in once, we will remain logged in until the Python program ends, which closes the driver.

Okay, so now that we have a function that logs us in, let’s start downloading our visitor statistics!

The download_traffic function takes in a driver, a date, and a list of dates_downloaded (an empty list by default). First, it checks whether the date to download occurs in dates_downloaded. If so, we do not want to waste time downloading statistics we already have. Otherwise, it puts the driver to work downloading the traffic for the specified date following these steps:

Compile url for the specified date
Driver browses to the webpage of that url
Computer sleeps while the webpage loads
Driver executes script, letting it scroll down to the bottom of the webpage
Driver is asked to find the button to download the visitor statistics in csv
Driver clicks said button
Computer sleeps while the csv is downloaded

If anything goes wrong during these steps, an error message is printed and no document is downloaded. With no document downloaded, our program can try again for that link the next time.

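def download_traffic(driver, date, dates_downloaded=[]):
    # Skip dates for which a csv is already on disk
    if date in dates_downloaded:
        print(f'Already downloaded {date} traffic')
    else:
        try:
            print(f'Downloading {date} traffic')
            url = compile_traffic_url(date=date)
            driver.get(url)
            sleep(1)
            # Scroll to the bottom of the page, where the download button sits
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Selenium 3 style lookup; Selenium 4 uses driver.find_element(By.CLASS_NAME, ...)
            button = driver.find_element_by_class_name('stats-download-csv')
            button.click()
            sleep(1)
        except Exception as e:
            # Report the failure and move on; the date is retried on the next run
            print(f'Error during downloading of {date}: {e}')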
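
The fixed sleep(1) after the click is a guess at how long the download takes. A possible refinement, sketched below under my own assumptions, is to poll the download folder until a new csv file shows up. The helper names list_csvs and wait_for_new_csv and the timeout values are hypothetical; they are not part of selenium or of the script above.

```python
import os
import tempfile
from time import sleep, monotonic


def list_csvs(directory):
    # Names of all csv files currently in the folder
    return {f for f in os.listdir(directory) if f.lower().endswith('.csv')}


def wait_for_new_csv(directory, before, timeout=10.0, interval=0.25):
    # Poll until a csv appears that was not in `before`; return its name,
    # or None if nothing new shows up within `timeout` seconds
    deadline = monotonic() + timeout
    while monotonic() < deadline:
        new_files = list_csvs(directory) - before
        if new_files:
            return sorted(new_files)[0]
        sleep(interval)
    return None


# Small demonstration with a temporary folder instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    before = list_csvs(tmp)
    open(os.path.join(tmp, 'example_12_28_2019.csv'), 'w').close()
    found = wait_for_new_csv(tmp, before, timeout=1.0)
    print(found)
```

In download_traffic, one would take the `before` snapshot just ahead of button.click() and call the helper in place of the final sleep.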

We need one more function to generate the dates_downloaded list of download_traffic. The date_from_filename function below takes in a filename (e.g., paulvanderlaken.com_posts_day_12_28_2019_12_28_2019) and searches for a regular expression date format. The found match is turned into a datetime date using strptime and returned. This allows us to walk through a directory on our computer and see for which dates we have already downloaded visitor statistics. You will see how this works in the main program below.

def date_from_filename(filename):
    match = re.search(r'\d{2}_\d{2}_\d{4}', filename)
    date = datetime.strptime(match.group(), '%m_%d_%Y').date()
    return date
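
As a quick check, the example filename from above should yield 28 December 2019. The sketch below also adds a guard for filenames without a date (an assumption on my part about what might end up in the folder); the function as written above would raise an AttributeError on those.

```python
import re
from datetime import datetime


def date_from_filename(filename):
    # Look for a MM_DD_YYYY pattern anywhere in the filename
    match = re.search(r'\d{2}_\d{2}_\d{4}', filename)
    if match is None:
        return None  # no date found, e.g. an unrelated file
    return datetime.strptime(match.group(), '%m_%d_%Y').date()


print(date_from_filename('paulvanderlaken.com_posts_day_12_28_2019_12_28_2019'))  # 2019-12-28
print(date_from_filename('notes.txt'))  # None
```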

Main program

In the end, we combine all the functions above in our main program. Here you will need to change five things to make it work on your computer:

path_data – enter a folder path where you want to store the retrieved visitor statistics csv’s
path_chromedriver – enter the path to the chromedriver.exe you unpacked
first_date – enter the date from which you want to start scraping (by default up to today)
username – enter your WordPress username or email address
password – enter your WordPress password

if __name__ == '__main__':
    path_data = 'C:\\Users\\paulv\\stack\\projects\\2019_paulvanderlaken.com-anniversary\\traffic-day\\'
    path_chromedriver = 'C:\\Users\\paulv\\chromedriver.exe'

    first_date = datetime(2017, 1, 18).date()
    last_date = datetime.today().date()

    username = "insert_username"
    password = "insert_password"

    driver = get_driver(dir_download=path_data, path_chromedriver=path_chromedriver)

    days_delta = last_date - first_date
    days = [first_date + timedelta(days=n) for n in range(days_delta.days + 1)]
    dates_downloaded = [date_from_filename(file) for _, _, f in os.walk(path_data) for file in f]

    signing_in(driver, username=username, password=password)

    for d in days:
        download_traffic(driver, d, dates_downloaded)
    driver.close()
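
To make the date bookkeeping in the main program concrete, here is a small standalone sketch with a four-day window and one day already downloaded. The dates are made up for illustration.

```python
from datetime import datetime, timedelta

first_date = datetime(2019, 12, 27).date()
last_date = datetime(2019, 12, 30).date()

# Every day from first_date up to and including last_date
days_delta = last_date - first_date
days = [first_date + timedelta(days=n) for n in range(days_delta.days + 1)]

# Pretend one csv is already on disk
dates_downloaded = {datetime(2019, 12, 28).date()}

# Days that still need downloading
pending = [d for d in days if d not in dates_downloaded]
print(len(days), len(pending))  # 4 3
```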

If you have downloaded Chromedriver, have copied all the code blocks from this blog into a Python script, and have added in your personal paths, usernames, and passwords, this Python program should work like a charm on your computer as well. By default, the program will scrape statistics from all days from the first_date up to the day you run the program, but this you can change obviously.

Results

For me, the program took about 10 seconds to download one csv consisting of statistics for one day. So three years of WordPress blogging, or 1095 daily datasets of statistics, were extracted in about 3 hours. I did some nice cooking and wrote this blog in the meantime : )

Compare that to the horror of having to surf, scroll, and click that godforsaken Download data as CSV button ~1100 times!!

The horror button (in Dutch)

Final notes

The main goal of this blog was to share the basic inner workings of this scraper with you, and to give you the same tool to scrape your own visitor statistics.

Now, this project can still be improved tremendously and in many ways. For instance, with very little effort you could add some command line arguments (with argparse) so you can run this program directly or schedule it daily. My next step is to set it up to run daily on my Raspberry Pi.
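
A minimal sketch of what such a command-line interface could look like. The flag names (--first-date, --data-dir) are my own choices for illustration, not an existing interface.

```python
import argparse
from datetime import datetime


def parse_date(text):
    # Convert a YYYY-MM-DD string from the command line into a date
    return datetime.strptime(text, '%Y-%m-%d').date()


def build_parser():
    ap = argparse.ArgumentParser(description='Download WordPress visitor statistics')
    ap.add_argument('--first-date', type=parse_date, default=None,
                    help='first day to download, as YYYY-MM-DD')
    ap.add_argument('--data-dir', type=str, default='.',
                    help='folder in which the csv files are stored')
    return ap


# An explicit list is parsed here; a real script would call parse_args() without arguments
args = build_parser().parse_args(['--first-date', '2017-01-18', '--data-dir', 'traffic-day'])
print(args.first_date, args.data_dir)
```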

An additional potential improvement: when the current script encounters no statistics to download for a specific day, no csv is saved. This makes the program try again the next time it is run, as the dates_downloaded list will not include that date. Some minor smart tweaks will probably solve this issue.

Moreover, there are many more statistics you could scrape from your WordPress account, like external clicks, the visitors’ home countries, search terms, et cetera.

The above are improvement points you can further develop yourself, and if you do please share them with the greater public so we can all benefit!

For now, I am happy with these data, and will start on building some basic dashboards and visualizations to derive some insights from my visitor patterns. If you have any ideas or experiences please let me know!

I hope this walkthrough and code have helped you take control of your WordPress website as well. Or that you learned a thing or two about basic web scraping with Python. I am still in the midst of starting with Python myself, so if you have any tips, tricks, feedback, or general remarks, please do let me know! I am always happy to talk code and love to start pet projects to improve my programming skills, so do reach out if you have any ideas!


Python Web Scraping: Quotes from Goodreads.com

Over the course of last week, I built a Python program that scrapes quotes from Goodreads.com in a tidy format. For instance, these are the first three results my program returns when scraping for the tag robot:

Quote	author	source	likes	tags
Goodbye, Hari, my love. Remember always–all you did for me.	Isaac Asimov	Forward the Foundation	33	[‘asimov’, ‘foundation’, ‘human’, ‘robot’]
Unfortunately this Electric Monk had developed a fault, and had started to believe all kinds of things, more or less at random. It was even beginning to believe things they’d have difficulty believing in Salt Lake City.	Douglas Adams	Dirk Gently’s Holistic Detective Agency	25	[‘belief’, ‘humor’, ‘mormonism’, ‘religion’, ‘robot’]
It’s hard to wipe your eyes when you have whirring buzzsaws for hands.	Daniel H. Wilson	How to Survive a Robot Uprising: Tips on Defending Yourself Against the Coming Rebellion	20	[‘buzzaw’, ‘robot’, ‘survive’, ‘uprising’]

The first three quotes on Goodreads.com tagged ‘robot’

“Paul, why the hell are you building a Python API for Goodreads quotes?” I hear you asking. Well, let me provide you with some context.

A while back, I created a twitter bot called ArtificialStupidity.

As its bio reads, ArtificialStupidity is a highly sentient AI intelligently matching quotes and comics through state-of-the-art robotics, sophisticated machine learning, and blockchain technology.

Basically, every 15 minutes, a Python script is triggered on my computer (soon on my Raspberry Pi 4). Each time it triggers, this script generates a random number to determine whether it should post something. If so, the script subsequently generates another random number to determine what it should post: a quote, a comic, or both. Behind the scenes, some other functions add hashtags and, voila, a tweet is born!
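
In simplified form, that decision logic could look something like the sketch below. The probability and the function name decide_post are placeholders I made up for illustration; the actual script uses its own values.

```python
import random


def decide_post(rng, p_post=0.25):
    # First draw: should the bot post anything at all this round?
    if rng.random() >= p_post:
        return None
    # Second draw: what should it post?
    return rng.choice(['quote', 'comic', 'both'])


rng = random.Random(42)  # seeded so the run is repeatable
decisions = [decide_post(rng) for _ in range(8)]
print(decisions)
```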

(An upcoming post will elaborate on the inner workings of my ArtificialStupidity Python script)

More often than not, ArtificialStupidity produces some random, boring tweet:

"For every $20 you spend on web analytics tools, you should spend $80 on the brains to make sense of the data." – Jeff Sauer #xkcd #deeplearning #dataviz #data #analytics #webdev #ArtificialStupidity no.147 pic.twitter.com/Ink2TOhu9G
— ArtificialStupidity (@ArtificialStup5) November 30, 2019

However, every now and then, the bot actually manages to combine a quote with a comic in a way that gets some laughs:

"Aim for simplicity in Data Science. Real creativity won't make things more complex. Instead, it will simplify them." – Damian Duffy Mingle #datascience #data #science #rstats #statistics #ArtificialStupidity no.195 pic.twitter.com/BOgwsJeLRP
— ArtificialStupidity (@ArtificialStup5) December 17, 2019

Now, in order to compile these tweets, my computer hosts two databases. One containing data- and tech- related comics; the other a variety of inspirational quotes. Each time the ArtificialStupidity bot posts a tweet, it draws from one or both of these datasets randomly. With, on average, one post every couple hours, I thus need several hundreds of items in these databases in order to prevent repetition — which is definitely not entertaining.

Up until last week, I manually expanded these databases every week or so, adding new comics and quotes as I encountered them online. However, this proved a tedious task. Particularly for the quotes, as I set up the database in a specific format (“quote” – author). In contrast, websites like Goodreads.com display their quotes in a different format (e.g., “quote” ― author, source \n tags \n likes). Apart from the different format, the curly apostrophes and long dash also cause UTF-8 issues in my Python script. Hence, weekly reformatting of quotes proved an annoying task.
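
The character clean-up itself is simple to sketch. The replacement table below covers the curly quotes and dashes mentioned above; the exact mapping is my assumption of what the target format needs, and to_plain_text is a hypothetical helper name.

```python
# Map typographic characters to plain ASCII equivalents
REPLACEMENTS = {
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u2015': '-',  # horizontal bar, as in the Goodreads format shown above
    '\u2013': '-',  # en dash
}


def to_plain_text(text):
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


print(to_plain_text('\u201cquote\u201d \u2015 author'))  # "quote" - author
```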

Up until this week!

While reformatting some bias-related quotes, I decided I’d rather invest 10 times more time developing my Python skills, than mindlessly reformatting quotes for a minute longer. So I started coding.

I am proud to say that, some six hours later, I have compiled the script below.

I’ll walk you through its functions.

So first, I import the modules/packages I need. Note that you will probably first have to pip install package-name on your own computer!

argparse for the command-line interface arguments
re for the regular expressions to clean quotes
bs4 for its BeautifulSoup for scraping website content
urllib.request for opening urls
csv to save csv files
os for directory pathing

import argparse
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
import os

Next, I set up the argparse.ArgumentParser so that I can use my API using the command line. Now you can call the Python script using the command line (e.g., goodreads-scraper.py -t 'bias' -p 3 -q 80), and provide it with some arguments. No arguments are necessary. Most have sensible defaults. If you forget to provide a tag you will be prompted to provide one as the script runs (see later).

ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')

ap.add_argument("-t", "--tag",
                required=False, type=str, default=None,
                help="tag (topic/theme) of quotes to scrape")
ap.add_argument("-p", "--max_pages",
                required=False, type=int, default=10,
                help="maximum number of webpages to scrape")
ap.add_argument("-q", "--max_quotes",
                required=False, type=int, default=100,
                help="maximum number of quotes to scrape")

args = vars(ap.parse_args())
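
To see what the parser produces, the example command can be fed to parse_args as an explicit list. This standalone sketch repeats a trimmed version of the parser (help texts omitted) so it can run on its own:

```python
import argparse

ap = argparse.ArgumentParser(description='Scrape quotes from Goodreads.com')
ap.add_argument("-t", "--tag", type=str, default=None)
ap.add_argument("-p", "--max_pages", type=int, default=10)
ap.add_argument("-q", "--max_quotes", type=int, default=100)

# Same arguments as the example command, passed as a list
args = vars(ap.parse_args(['-t', 'bias', '-p', '3', '-q', '80']))
print(args)  # {'tag': 'bias', 'max_pages': 3, 'max_quotes': 80}

# Without any arguments the defaults apply
defaults = vars(ap.parse_args([]))
print(defaults)  # {'tag': None, 'max_pages': 10, 'max_quotes': 100}
```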

Now, the main function for this script is download_goodreads_quotes. This function contains many other functions within. You will see I set my functions up in a nested fashion, so that functions which are only used inside a certain scope, are instantiated there. In regular words, I create the functions where I use them.

First, download_goodreads_quotes creates download_quotes_from_page. In turn, download_quotes_from_page creates and calls compile_url — to create the url — get_soup — to download url contents — extract_quotes_elements_from_soup — to do just that — and extract_quote_dict. This latter function is the workhorse, as it takes each scraped quote element block of HTML and extracts the quote, author, source, and number of likes. It cleans each of these data points and returns them as a dictionary. In the end, download_quotes_from_page returns a list of dictionaries for every quote element block on a page.

Second, download_goodreads_quotes creates and calls download_all_pages, which calls download_quotes_from_page for all pages up to max_pages, or up to the page that no longer returns quote data, or until max_quotes has been reached. All gathered quote dictionaries are added to a results list.

def download_goodreads_quotes(tag, max_pages=1, max_quotes=50):

    def download_quotes_from_page(tag, page):

        def compile_url(tag, page):
            return f'https://www.goodreads.com/quotes/tag/{tag}?page={page}'

        def get_soup(url):
            response = urlopen(Request(url))
            return BeautifulSoup(response, 'html.parser')

        def extract_quotes_elements_from_soup(soup):
            elements_quotes = soup.find_all("div", {"class": "quote mediumText"})
            return elements_quotes

        def extract_quote_dict(quote_element):

            def extract_quote(quote_element):
                try:
                    quote = quote_element.find('div', {'class': 'quoteText'}).get_text("|", strip=True)
                    # first element is always the quote
                    quote = quote.split('|')[0]
                    quote = re.sub('^“', '', quote)
                    quote = re.sub(r'”\s?$', '', quote)
                    return quote
                except:
                    return None

            def extract_author(quote_element):
                try:
                    author = quote_element.find('span', {'class': 'authorOrTitle'}).get_text()
                    author = author.strip()
                    author = author.rstrip(',')
                    return author
                except:
                    return None

            def extract_source(quote_element):
                try:
                    source = quote_element.find('a', {'class': 'authorOrTitle'}).get_text()
                    return source
                except:
                    return None

            def extract_tags(quote_element):
                try:
                    tags = quote_element.find('div', {'class': 'greyText smallText left'}).get_text(strip=True)
                    tags = re.sub('^tags:', '', tags)
                    tags = tags.split(',')
                    return tags
                except:
                    return None

            def extract_likes(quote_element):
                try:
                    likes = quote_element.find('a', {'class': 'smallText', 'title': 'View this quote'}).get_text(strip=True)
                    likes = re.sub('likes$', '', likes)
                    likes = likes.strip()
                    return int(likes)
                except:
                    return None

            quote_data = {'quote': extract_quote(quote_element),
                          'author': extract_author(quote_element),
                          'source': extract_source(quote_element),
                          'likes': extract_likes(quote_element),
                          'tags': extract_tags(quote_element)}

            return quote_data

        url = compile_url(tag, page)
        print(f'Retrieving {url}...')
        soup = get_soup(url)
        quote_elements = extract_quotes_elements_from_soup(soup)

        return [extract_quote_dict(e) for e in quote_elements]

    def download_all_pages(tag, max_pages, max_quotes):
        results = []
        p = 1
        while p <= max_pages:
            res = download_quotes_from_page(tag, p)
            if len(res) == 0:
                print(f'No results found on page {p}.\nTerminating search.')
                return results

            results = results + res

            if len(results) >= max_quotes:
                print(f'Hit quote maximum ({max_quotes}) on page {p}.\nDiscontinuing search.')
                return results[0:max_quotes]
            else:
                p += 1

        return results

    return download_all_pages(tag, max_pages, max_quotes)
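
The stopping rules of download_all_pages are easiest to verify without touching the network. The sketch below mirrors that loop but takes the page-fetching function as an argument, so a stub can stand in for Goodreads. The names collect_pages and fake_page are hypothetical.

```python
def collect_pages(fetch_page, max_pages, max_quotes):
    results = []
    for p in range(1, max_pages + 1):
        res = fetch_page(p)
        if not res:
            break  # an empty page means there is nothing more to fetch
        results.extend(res)
        if len(results) >= max_quotes:
            return results[:max_quotes]  # trim any overshoot
    return results


def fake_page(page):
    # Stub: pages 1 to 3 hold 30 items each, later pages are empty
    return [f'quote {page}-{i}' for i in range(30)] if page <= 3 else []


print(len(collect_pages(fake_page, max_pages=10, max_quotes=100)))  # 90: ran out of pages
print(len(collect_pages(fake_page, max_pages=10, max_quotes=80)))   # 80: hit the quote maximum
print(len(collect_pages(fake_page, max_pages=2, max_quotes=100)))   # 60: hit the page maximum
```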

Additionally, I use two functions to actually store the scraped quotes: recreate_quote turns a quote dictionary into a quote string (I do not actually use the source and likes, but maybe others want to); save_quotes calls recreate_quote for each dictionary in the list it is given, and stores the results in a text file in the current directory.

Update 2020/04/05: added UTF-8 encoding based on infoguild’s comment.

def recreate_quote(quote_dict):
    # Avoid naming the parameter `dict`, which would shadow the built-in type
    return f'"{quote_dict.get("quote")}" - {quote_dict.get("author")}'

def save_quotes(quote_data, tag):
    save_path = os.path.join(os.getcwd(), 'scraped' + '-' + tag + '.txt')
    print('saving file')
    with open(save_path, 'w', encoding='utf-8') as f:
        quotes = [recreate_quote(q) for q in quote_data]
        for q in quotes:
            f.write(q + '\n')
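
Since the csv module is imported while the function above writes plain text lines, here is a hedged sketch of how the full dictionaries (including source, likes, and tags) could be written to an actual csv file. save_quotes_csv is a hypothetical name, not part of the script above, and the example row is made up.

```python
import csv
import os
import tempfile

FIELDS = ['quote', 'author', 'source', 'likes', 'tags']


def save_quotes_csv(quote_data, path):
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for q in quote_data:
            row = dict(q)
            # Store the list of tags as one comma-separated cell
            row['tags'] = ','.join(q.get('tags') or [])
            writer.writerow(row)


example = [{'quote': 'An example quote', 'author': 'Some Author',
            'source': 'Some Book', 'likes': 5, 'tags': ['a', 'b']}]

path = os.path.join(tempfile.mkdtemp(), 'scraped-example.csv')
save_quotes_csv(example, path)

# Read the file back to check the round trip
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['author'], rows[0]['tags'])
```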

Finally, I need to call all these functions when the user runs this script via the command line. That’s what the following code does. It looks at the provided (or default) arguments, and if no tag is provided, the user is prompted for one. Next, Goodreads.com is scraped using the earlier specified download_goodreads_quotes function, and the results are saved to a text file.

if __name__ == '__main__':
    tag = args['tag'] if args['tag'] is not None else input('Provide tag to search quotes for: ')
    mp = args['max_pages']
    mq = args['max_quotes']
    result = download_goodreads_quotes(tag, max_pages=mp, max_quotes=mq)
    save_quotes(result, tag)

Use

If you paste these script pieces sequentially into a Python script / text file and save this file as goodreads-scraper.py, you can then run the script from your command line, like so: goodreads-scraper.py -t 'bias' -p 3 -q 80, where the text after -t is the tag you are searching for, -p is the maximum number of pages you want to scrape, and -q is the maximum number of quotes you want the program to scrape.

Let me know what your favorite quote is once you get it running!

To-do

So this is definitely still work in progress. Some potential improvements I want to integrate come directly to mind:

Avoid errors for quotes including newlines, or write code to extract only the text of the quote, instead of the whole text of the quote element.
Build in concurrency using futures (but take care that quotes are still added to the results sequentially). Maybe we can already download the soups of all pages, as this takes the longest.
Write a function to return a random quote
Write a function to return a random quote within a tag
Implement a lower limit for the number of likes of quotes
Refactor the download_all_pages bit.
Add comments and docstrings.
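
Two of these to-dos are small enough to sketch right away. The block below is only an outline under my own assumptions: fetch_all, filter_by_likes, and the stub fake_fetch are hypothetical names, and the stub stands in for the real page download.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_page, pages, max_workers=4):
    # executor.map yields results in the order of `pages`,
    # regardless of which download finishes first
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_page, pages))


def filter_by_likes(quotes, min_likes):
    # Keep quotes with a known like count at or above the limit
    return [q for q in quotes if (q.get('likes') or 0) >= min_likes]


def fake_fetch(page):
    # Stub standing in for a real page download
    return [{'quote': f'quote from page {page}', 'likes': page * 10}]


pages = fetch_all(fake_fetch, range(1, 5))
quotes = [q for page in pages for q in page]
popular = filter_by_likes(quotes, min_likes=25)
print([q['likes'] for q in popular])  # [30, 40]
```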

Feedback or tips?

I have been programming in R for quite a while now, but Python and software development in general are still new to me. This will probably be visible in the way I program, my syntax, the functions I use, or other things. Please provide any feedback you may have as I’d love to get better!

Identifying “Dirty” Twitter Bots with R and Python

This past week, I came across two programming initiatives to uncover Twitter bots and one attempt to identify fake Instagram accounts.

Mike Kearney developed the R package botornot, which applies machine learning to estimate the probability that a Twitter user is a bot. His default model is a gradient boosted model trained using both user-level (bio, location, number of followers and friends, etc.) and tweet-level information (number of hashtags, mentions, capital letters, etc.). This model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. His faster model uses only the user-level data and is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots. Unfortunately, the models did not classify my account correctly (see below), but you should definitely test yourself and your friends via this Shiny application.

Fun fact: botornot can be integrated with Mike’s rtweet package

Scraping Dirty Bots

At around the same time, I read this very interesting blog by Andy Patel. Annoyed by the fake Twitter accounts that kept liking and sharing his tweets, Andy wrote a Python script called pronbot_search. It’s an iterative search algorithm which Andy seeded with the dozen fake Twitter accounts that he identified originally. Subsequently, the program iterated over the friends and followers of each of these fake users, looking for other accounts displaying similar traits (e.g., similar description, including a URL to a sex-website called “Dirty Tinder”).

Whenever a new account was discovered, it was added to the query list, and the process continued. Because of the Twitter API restrictions, the whole crawling process took literal days before Andy manually terminated it. The results are just amazing:
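
The seed-and-expand idea can be sketched in a few lines. This is not Andy’s code: the toy graph, the looks_suspicious rule, and all names are invented for illustration, and a real crawler would call the Twitter API instead of reading a dictionary.

```python
from collections import deque

# Toy graph: account -> connected accounts (friends and followers)
GRAPH = {
    'seed_bot': ['bot_a', 'human_1'],
    'bot_a': ['bot_b', 'seed_bot'],
    'bot_b': ['human_2'],
    'human_1': ['bot_c'],
    'human_2': [],
    'bot_c': [],
}


def looks_suspicious(account):
    # Placeholder for a check on description, links, and so on
    return account.startswith('bot_')


def crawl(seeds):
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour in GRAPH.get(current, []):
            # Only matching, not-yet-seen accounts are added to the query list
            if neighbour not in found and looks_suspicious(neighbour):
                found.add(neighbour)
                queue.append(neighbour)
    return found


print(sorted(crawl(['seed_bot'])))  # ['bot_a', 'bot_b', 'seed_bot']
```

Note that bot_c is never reached in this toy run, because its only connection runs through an account that does not match the rule.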

After a day, the results looked like so. Notice the weird clusters of relationships in this network. [original]

The full bot network uncovered by Andy included 22,000 fake Twitter accounts:

At the end of the weekend of March 10th, Andy had to stop the scraper after it had run for several days, even though he had only processed 18% of the networks of the 22,000 included Twitter bots [original]

The bot network on Twitter is probably enormous! Zooming in on the network, Andy notes that:

Pretty much the same pattern I’d seen after one day of crawling still existed after one week. Just a few of the clusters weren’t “flower” shaped.

Andy Patel, March 2018, link

Zoomed in to a specific part of the network, you can see the separate clusters of bots doing little more than liking each other’s messages. [original]

In his blog, Andy continues to look at all kinds of data on these fake accounts. I found it most striking that many of these accounts are years and years old already. Potentially, Twitter can use Mike Kearney’s botornot application to spot and remove them!

Most of the bots in the Dirty Tinder network found by Andy Patel were 3 to 8 years old already. [original]

Andy was nice enough to share the data on these bot accounts here, for you to play with. His Python code is stored in the same github repo and more details around this project you can read in his original blog.

Fake Instagram Accounts

Finally, SRFdata (Timo Grossenbacher) attempted to uncover, in R, fake Instagram followers among the 7 million followers in the network of 115 important Swiss Instagram influencers. Magi Metrics was used to retrieve information for public Instagram accounts and rvest for private accounts. Next, clear fake accounts (e.g., few followers, following many, no posts, no profile picture, numbers in name) were labelled manually, and approximately 10% of the inspected 1000 accounts appeared fake. Finally, they trained a random forest model to classify fake accounts with a sensitivity (true positive rate) of 77.4% and an overall accuracy of around 94%.
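
Because sensitivity and accuracy are easy to mix up, here is how both follow from a confusion matrix. The counts below are toy numbers chosen for illustration, not the SRF data.

```python
def sensitivity(tp, fn):
    # Share of actual fakes that the model flags (true positive rate)
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    # Share of all accounts classified correctly
    return (tp + tn) / (tp + tn + fp + fn)


# Toy counts: 100 fake and 900 real accounts
tp, fn = 80, 20   # fakes caught / fakes missed
tn, fp = 870, 30  # real accounts passed / real accounts wrongly flagged

print(sensitivity(tp, fn))       # 0.8
print(accuracy(tp, tn, fp, fn))  # 0.95
```

With classes this imbalanced, accuracy can look high while a fair share of the fakes is still missed, which is why both numbers are worth reporting.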

Datasets to practice and learn Programming, Machine Learning, and Data Science

Many requests have come in regarding “training datasets” – to practice programming. Fortunately, the internet is full of open-source datasets! I compiled a selected list of datasets and repositories below. If you have any additions, please comment or contact me! For information on programming languages or algorithms, visit the overviews for R, Python, SQL, or Data Science, Machine Learning, & Statistics resources.

This list is no longer being maintained. There are other, more frequently updated repositories of useful datasets included in bold below:

LAST UPDATED: 2019-12-23

A Million News Headlines: News headlines published over a period of 14 years.

AggData | Datasets

Aligned Hansards of the 36th Parliament of Canada

Amazon Web Services: Public Datasets

American Community Survey

ArcGIS Hub Open Data

arXiv.org help – arXiv Bulk Data Access – Amazon S3

Asset Macro: Financial & Macroeconomic Historical Data

Awesome JSON Datasets

Awesome Public Datasets

Behavioral Risk Factor Surveillance System

British Oceanographic Data Center

Bureau of Justice

Canada

Causality | Data Repository

CDC Wonder Online Database

Census Bureau Home Page

Center for Disease Control

ChEMBLdb

ChemDB

City of Chicago

Click Dataset | Center for Complex Networks and Systems Research

CommonCrawl 2013 Web Crawl

Consumer Finance: Mortgage Database

CRCNS – Collaborative Research in Computational Neuroscience

Data is Plural

Data.Seattle.Gov | Seattle’s Data Site

Data.world

Data.World datasets

DataHub

Datasets for Data Mining

DataSF

Dataverse

DELVE datasets

DMOZ open directory (mirror)

DRYAD

Enigma Public

Enron Email Dataset

European Environment Agency (EEA) | Data and maps

Eurostat

Eurostat Database

Eurovision YouTube Comments: YouTube comments on entries from the 2003-2008 Eurovision Song Contests

FAA Data

Face Recognition Homepage – Databases

FAOSTAT Data

FBI Crime Data Explorer

FEMA Data Feeds

Figshare

FiveThirtyEight.com

Flickr personal taxonomies

FlowingData

Fraudulent E-mail Corpus: CLAIR collection of “Nigerian” fraud emails

Freebase (last datadump)

Gapminder.org

Gene Expression Omnibus (GEO) Main page

GeoJSON files for real-time Virginia transportation data.

Golem Dataset

Google Books n-gram dataset

Google Public Data Explorer

Google Research: A Web Research Corpus Annotated with Freebase Concepts

Health Intelligence

Healthcare Cost and Utilization Project

HealthData.gov

Human Fertility Database

Human Mortality Database

ICPSR Social Science Studies

ICWSM Spinnr Challenge 2011 dataset

IIE.org Open Doors Data Portal

ImageNet

IMDB dataset

IMF Data and Statistics

Informatics Lab Open Data

Inside AirBnB

Internet Archive: Digital Library

IPUMS

Ironic Corpus: 1950 sentences labeled for ironic content

Kaggle Datasets

KAPSARC Energy Data Portal

KDNuggets Datasets

Knoema

Lahman’s Baseball Database

Lending Club Loan Data

Linking Open Data

London Datastore

Makeover Monday

Medical Expenditure Panel Survey

Million Song Dataset | scaling MIR research

MLDATA | Machine Learning Dataset Repository

MLvis Scientific Data Repository

MovieLens Data Sets | GroupLens Research

NASA

NASA Earth Data

National Health and Nutrition Examination Survey

National Hospital Ambulatory Medical Care Survey Data

New York State

NYPD Crash Data Band-Aid

ODI Leeds

OECD Data

OECD.Stat

Office for National Statistics

Old Newspapers: A cleaned subset of HC Corpora newspapers

Open Data Inception Portals

Open Data Nederland

Open Data Network

OpenDataSoft Repository

Our World in Data

Pajek datasets

PermID from Thomson Reuters

Pew Research Center

Plenar.io

PolicyMap

Princeton University Library

Registry of Research Data Repositories

Retrosheet.org

Satori OpenData

SCOTUS Opinions Corpus: Lots of Big, Important Words

Sharing PyPi/Maven dependency data « RTFB

SMS Spam Collection

Socrata

St. Louis Federal Reserve

Stanford Large Network Dataset Collection

State of the Nation Corpus (1990 – 2017): Full texts of the South African State of the Nation addresses

Statista

Substance Abuse and Mental Health Services Administration

Swiss Open Government Data

Tableau Public

The Association of Religious Data Archives

The Economist

The General Social Survey

The Huntington’s Early California Population Project

The World Bank | Data

The World Bank Data Catalog

Toronto Open Data

Translation Task Data

Transport for London

Twitter Data 2010

Ubuntu Dialogue Corpus: 26 million turns from natural two-person dialogues

UC Irvine Knowledge Discovery in Databases Archive

UC Irvine Machine Learning Repository

UC Irvine Network Data Repository

UN Comtrade Database

UN General Debates: Transcriptions of general debates at the UN from 1970 to 2016

UNdata

Uniform Crime Reporting

UniGene

United States Exam Data

University of Michigan ICPSR

University of Rochester LibGuide “Data-Stats”

US Bureau of Labor Statistics

US Census Bureau Data

US Energy Information Administration

US Government Web Services and XML Data Sources

USA Facts

USENET corpus (2005-2011)

Utah Open Data

Varieties of Democracy

Western Pennsylvania Regional Data Center

WHO Data Repository

Wikipedia List of Datasets for Machine Learning

WordNet

World Values Survey

World Wealth & Income Database

World Wide Web: 3.5 billion web pages and their relations

Yahoo Data for Researchers

YouTube Network 2007-2008

Where to look for your next job? An Interactive Map of the US Job Market

The people at Predictive Talent, Inc. took a sample of 23.4 million job postings from 5,200+ job boards and 1,800+ cities around the US. They classified these jobs using the BLS Standard Occupational Classification tree and identified their primary work locations, primary job roles, estimated salaries, and 17 other job search-related characteristics. Next, they calculated five metrics for each role and city in order to identify the 123 biggest job shortages in the US:

Monthly Demand (#): How many people are companies hiring every month? This is simply the number of unique jobs posted every month.
Unmet Demand (%): What percentage of jobs did companies have a hard time filling? Details aside, basically, if a company re-posts the same job every week for 6 weeks, one may assume that they couldn’t find someone for the first 5 weeks.
Salary ($): What’s the estimated salary for these jobs near this city? Using 145,000+ data points from the federal government and proprietary sources, along with salary information parsed from jobs themselves, they estimated the median salary for similar jobs within 100 miles of the city.
Delight (#): On a scale of 1 (least) to 10 (most delight), how easy should the job search be for the average job-seeker? This is basically the opposite of Agony.
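To make the first three metrics a bit more concrete, here is a rough sketch of how they could be computed for one city from a table of job postings. This is only my reading of the descriptions above, not Predictive Talent's actual method: the data, the function name `city_metrics`, the six-week re-posting threshold, and the use of a plain median over posted salaries are all assumptions, and the whole sample is treated as a single month.

```python
from collections import defaultdict
from statistics import median


def city_metrics(postings, repost_weeks=6):
    """Summarise postings given as (job_id, week_posted, salary) tuples."""
    weeks_by_job = defaultdict(set)
    salary_by_job = {}
    for job_id, week, salary in postings:
        weeks_by_job[job_id].add(week)
        salary_by_job[job_id] = salary

    # Demand: the number of unique jobs posted in the sample.
    demand = len(weeks_by_job)
    # Unmet demand: jobs that were re-posted in at least `repost_weeks` weeks.
    hard_to_fill = sum(
        1 for weeks in weeks_by_job.values() if len(weeks) >= repost_weeks
    )
    return {
        "demand": demand,
        "unmet_demand_pct": 100 * hard_to_fill / demand,
        "median_salary": median(salary_by_job.values()),
    }


# Hypothetical postings: job "a" is re-posted every week for six weeks.
postings = [("a", week, 80000) for week in range(1, 7)] + [
    ("b", 1, 60000),
    ("c", 2, 100000),
    ("c", 3, 100000),
    ("d", 4, 70000),
]
result = city_metrics(postings)
print(result)
```

With these toy numbers, one of the four unique jobs counts as hard to fill, so unmet demand comes out at 25%.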

The end result is this amazing map of the job market in the U.S., which you can interactively explore here to see where you could best start your next job hunt.