Category: machine learning

Getting ChatGPT to simulate a Recruitment Process in Python code

I was playing around with ChatGPT to see if it could write Python code to generate realistic recruitment process data for me.

Here’s the full conversation as I was truly amazed with the results.

I emboldened my prompts and had to ask ChatGPT to continue sometimes as longer responses are cut off.

Can you write Python code that simulates how applications would be processed through a recruitment funnel, with a class that stores all timestamps?

Sure! Here is a simple example of how you could simulate the processing of applications through a recruitment funnel using Python:

Copy code# Define a class that represents an application
class Application:
  def __init__(self):
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self):
    self.timestamps.append(datetime.now())

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate processing the application (e.g. reviewing the resume)
    time.sleep(1)
    application.update_timestamp()

    # Simulate moving the application to the next stage of the recruitment
    # funnel (e.g. scheduling an interview)
    time.sleep(1)
    application.update_timestamp()

    # Simulate completing the final stage of the recruitment funnel
    # (e.g. making a job offer)
    time.sleep(1)
    application.update_timestamp()

# Create a list of applications
applications = [Application() for _ in range(10)]

# Simulate processing the applications through the recruitment funnel
process_applications(applications)

# Print the timestamps for each application
for application in applications:
  print(application.timestamps)

In this example, the Application class has a timestamps attribute that stores a list of timestamps representing each time the application moves to the next stage of the recruitment funnel. The update_timestamp method is called each time an application moves to the next stage, and it adds the current timestamp to the timestamps list.

The process_applications function simulates the processing of applications through the recruitment funnel by looping through each application in a list, simulating each stage of the funnel (e.g. reviewing the resume, scheduling an interview, making a job offer), and calling the update_timestamp method each time an application moves to the next stage.

At the end, the code prints the timestamps for each application to show how the timestamps are recorded as the applications move through the recruitment funnel.

The recruitment funnel process should include the following steps: application, resume review, telephone interview, assessment, interview, contract offer, acceptance.

Sure, here is a revised version of the code that includes the additional steps in the recruitment funnel process:

Copy code# Define a class that represents an application
class Application:
  def __init__(self):
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self):
    self.timestamps.append(datetime.now())

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate receiving the application
    time.sleep(1)
    application.update_timestamp()

    # Simulate reviewing the resume
    time.sleep(1)
    application.update_timestamp()

    # Simulate conducting a telephone interview
    time.sleep(1)
    application.update_timestamp()

    # Simulate conducting an assessment
    time.log(1)
    application.update_timestamp()

    # Simulate conducting an in-person interview
    time.sleep(1)
    application.update_timestamp()

    # Simulate making a contract offer
    time.sleep(1)
    application.update_timestamp()

    # Simulate accepting the contract offer
    time.sleep(1)
    application.update_timestamp()

# Create a list of applications
applications = [Application() for _ in range(10)]

# Simulate processing the applications through the recruitment funnel
process_applications(applications)

# Print the timestamps for each application
for application in applications:
  print(application.timestamps)

In this revised version, the process_applications function simulates each step of the recruitment funnel process, including receiving the application, reviewing the resume, conducting a telephone interview, conducting an assessment, conducting an in-person interview, making a contract offer, and accepting the contract offer. Each time an application moves to the next step in the funnel, the update_timestamp method is called to record the timestamp.

At the end, the code prints the timestamps for each application to show how they are recorded as the applications move through the recruitment funnel.

Can you use the simpy module instead?

Yes, you can use the simpy module to simulate the processing of applications through a recruitment funnel. Here is an example of how you could do this using simpy:

Copy codeimport simpy
import random

# Define a class that represents an application
class Application:
  def __init__(self):
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env):
    self.timestamps.append(env.now)

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(env, applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate receiving the application
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate reviewing the resume
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate conducting a telephone interview
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate conducting an assessment
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate conducting an in-person interview
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate making a contract offer
    yield env.timeout(1)
    application.update_timestamp(env)

    # Simulate accepting the contract offer
    yield env.timeout(1)
    application.update_timestamp(env)

# Create a simpy environment
env = simpy.Environment()

# Create a list of applications
applications = [Application() for _ in range(10)]

# Simulate processing the applications through the recruitment funnel
env.process(process_applications(env, applications))
env.run()

# Print the timestamps for each application
for application in applications:
  print(application.timestamps)

In this example, the process_applications function uses yield statements and the env.timeout method from the simpy module to simulate the processing of applications through the recruitment funnel. Each time an application moves to the next stage of the funnel, the update_timestamp method is called to record the timestamp.

At the end, the code prints the timestamps for each application to show how they are recorded as the applications move through the recruitment funnel.

I need the code to also log the name of the process step.

Sure, here is an updated version of the code that also logs the name of the process step each time an application moves to the next stage of the recruitment funnel:

Copy codeimport simpy
import random

# Define a class that represents an application
class Application:
  def __init__(self):
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(env, applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate receiving the application
    yield env.timeout(1)
    application.update_timestamp(env, "Received")

    # Simulate reviewing the resume
    yield env.timeout(1)
    application.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview
    yield env.timeout(1)
    application.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment
    yield env.timeout(1)
    application.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview
    yield env.timeout(1)
    application.update_timestamp(env, "Interview")

    # Simulate making a contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applications
applications = [Application() for _ in range(10)]

# Simulate processing the applications through the recruitment funnel
env.process(process_applications(env, applications))
env.run()

# Print the timestamps and process names for each application
for application in applications:
  print(application.timestamps)

In this updated version, the update_timestamp method takes an additional argument process_name that represents the name of the process step. Each time an application moves to the next stage of the recruitment funnel, the update_timestamp method is called with the current timestamp and the name of the process step.

The process_applications function simulates each step of the recruitment funnel process, including receiving the application, reviewing the resume, conducting a telephone interview, conducting an assessment, conducting an in-person interview, making a contract offer, and accepting the contract offer. Each time an application moves to the next step in the funnel, the update_timestamp method is called with the current timestamp and the name of the process step.

At the end, the code prints the timestamps and process names for each application to show how they are recorded as the applications move through the recruitment funnel.

Can you have it simulate 900 applicants with unique names and birthdays?

Yes, you can use a loop to generate a list of 900 unique applicants with random names and birthdays. Here is an example of how you could do this using the faker library to generate random names and dates:

Copy codeimport simpy
import random
from faker import Faker

# Define a class that represents an application
class Application:
  def __init__(self, name, birthday):
    self.name = name
    self.birthday = birthday
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(env, applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate receiving the application
    yield env.timeout(1)
    application.update_timestamp(env, "Received")

    # Simulate reviewing the resume
    yield env.timeout(1)
    application.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview
    yield env.timeout(1)
    application.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment
    yield env.timeout(1)
    application.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview
    yield env.timeout(1)
    application.update_timestamp(env, "Interview")

    # Simulate making a contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applications
applications = []

# Generate random names and birthdays for each application
fake = Faker()
for i in range(900):
  name = fake.name()
  birthday = fake.date_of_birth()
  applications.append(Application(name, birthday))

# Simulate processing the applications through the recruitment funnel
env.process(process_applications(env, applications))
env.run()

# Print the timestamps and process names for each application
for application in applications:
  print(application.timestamps)

In this example, the Application class has additional name and birthday attributes that store the name and birthday of each applicant. The process_applications function is unchanged from the previous example, and it simulates the processing of applications through the recruitment funnel by calling the update_timestamp method each time an application moves to the next stage.

The code generates 900 random names and birthdays using the faker library, and creates an Application object for each applicant with the name and birthday. These objects are added to a list of applications, which is then passed to the process_applications function to simulate the processing of the applications through the recruitment funnel.

At the end, the code prints the timest

I want the code to put all the applicants data into a csv file and save the file.

Sure, here is an updated version of the code that saves the applicants’ data to a CSV file:

Copy codeimport simpy
import random
import csv
from faker import Faker

# Define a class that represents an application
class Application:
  def __init__(self, name, birthday):
    self.name = name
    self.birthday = birthday
    self.timestamps = []

  # This method will be called each time an application moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applications through the
# recruitment funnel
def process_applications(env, applications):
  # Loop through each application in the list
  for application in applications:
    # Simulate receiving the application
    yield env.timeout(1)
    application.update_timestamp(env, "Received")

    # Simulate reviewing the resume
    yield env.timeout(1)
    application.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview
    yield env.timeout(1)
    application.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment
    yield env.timeout(1)
    application.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview
    yield env.timeout(1)
    application.update_timestamp(env, "Interview")

    # Simulate making a contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer
    yield env.timeout(1)
    application.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applications
applications = []

# Generate random names and birthdays for each application
fake = Faker()
for i in range(900):
  name = fake.name()
  birthday = fake.date_of_birth()
  applications.append(Application(name, birthday))

# Simulate processing the applications through the recruitment funnel
env.process(process_applications(env, applications))
env.run()

# Save the applicants' data to a CSV file
with open("applicants.csv", "w") as csvfile:
  writer = csv.writer(csvfile)

  # Write the header row
  writer.writerow(["Name", "Birthday", "Timestamps"])

  # Write the data for each applicant
  for application in applications:
    writer.writerow([application.name, application.birthday, application.timestamps])

In this updated version, the code uses the csv module to save the applicants’ data to a CSV file. It creates a csv.writer object, which is used to write the data to the file. The code writes the header row, which includes the column names “Name”, “Birthday”, and “Timestamps”. Then, it writes a row of

Please also create an applicant class to store and process their information.

Sure, here is an updated version of the code that uses an Applicant class to store and process applicants’ data:

Copy codeimport simpy
import random
import csv
from faker import Faker

# Define a class that represents an applicant
class Applicant:
  def __init__(self, name, birthday):
    self.name = name
    self.birthday = birthday
    self.timestamps = []

  # This method will be called each time an applicant moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applicants through the
# recruitment funnel
def process_applicants(env, applicants):
  # Loop through each applicant in the list
  for applicant in applicants:
    # Simulate receiving the applicant's application
    yield env.timeout(1)
    applicant.update_timestamp(env, "Received")

    # Simulate reviewing the applicant's resume
    yield env.timeout(1)
    applicant.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment of the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Interview")

    # Simulate making a contract offer to the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer from the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applicants
applicants = []

# Generate random names and birthdays for each applicant
fake = Faker()
for i in range(900):
  name = fake

Continue from the applicants = []

Sure, here is the updated code that continues from the applicants = [] line:

Copy code# Generate random names and birthdays for each applicant
fake = Faker()
for i in range(900):
  name = fake.name()
  birthday = fake.date_of_birth()
  applicants.append(Applicant(name, birthday))

# Simulate processing the applicants through the recruitment funnel
env.process(process_applicants(env, applicants))
env.run()

# Save the applicants' data to a CSV file
with open("applicants.csv", "w") as csvfile:
  writer = csv.writer(csvfile)

  # Write the header row
  writer.writerow(["Name", "Birthday", "Timestamps"])

  # Write the data for each applicant
  for applicant in applicants:
    writer.writerow([applicant.name, applicant.birthday, applicant.timestamps])

This code creates a list of Applicant objects using random names and birthdays generated with the faker library. Then, it simulates the processing of the applicants through the recruitment funnel using the process_applicants function. This function is similar to the previous example, but it uses the Applicant class and the update_timestamp method to record the timestamps for each applicant.

Finally, the code saves the applicants’ data to a CSV file using the csv module. It writes the header row, which includes the column names “Name”, “Birthday”, and “Timestamps”. Then, it writes a row of data for each applicant, which includes the applicant’s name, birthday, and timestamps.

With this code, you can simulate the processing of applicants through a recruitment funnel, and save their data to a CSV file for further analysis.

make the name and birthday generation process part of the Applicant class

Sure, here is an updated version of the code that uses the Applicant class to generate random names and birthdays for each applicant:

Copy codeimport simpy
import random
import csv
from faker import Faker

# Define a class that represents an applicant
class Applicant:
  def __init__(self):
    self.name = fake.name()
    self.birthday = fake.date_of_birth()
    self.timestamps = []

  # This method will be called each time an applicant moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applicants through the
# recruitment funnel
def process_applicants(env, applicants):
  # Loop through each applicant in the list
  for applicant in applicants:
    # Simulate receiving the applicant's application
    yield env.timeout(1)
    applicant.update_timestamp(env, "Received")

    # Simulate reviewing the applicant's resume
    yield env.timeout(1)
    applicant.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment of the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Interview")

    # Simulate making a contract offer to the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer from the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applicants
applicants = []

# Create a fake object to generate random names and birthdays
fake = Faker()

# Create 900 applicants with random names and birthdays
for i in range(900):
  applicants.append(Applicant())

# Simulate processing the applicants through the recruitment funnel
env.process(process_applicants(env, applicants))
env.run()

# Save the applicants' data to a CSV file
with open("applicants.csv", "w") as csvfile:
  writer = csv.writer(csvfile)

  # Write the header row
  writer.writerow(["Name", "Birthday", "Timestamps"])

  # Write the data for each applicant
  for applicant in applicants:

Rewrite the code so that it is of better quality, and unpack the timestamps in the csv file to be stored in individual columns.

Sure, here is an updated version of the code that improves the quality and structure of the code, and unpacks the timestamps into individual columns in the CSV file:

Copy codeimport simpy
import csv
from faker import Faker

# Define a class that represents an applicant
class Applicant:
  def __init__(self):
    self.name = fake.name()
    self.birthday = fake.date_of_birth()
    self.timestamps = []

  # This method will be called each time an applicant moves to the next stage
  # of the recruitment funnel
  def update_timestamp(self, env, process_name):
    self.timestamps.append((env.now, process_name))

# Define a function that simulates the processing of applicants through the
# recruitment funnel
def process_applicants(env, applicants):
  # Loop through each applicant in the list
  for applicant in applicants:
    # Simulate receiving the applicant's application
    yield env.timeout(1)
    applicant.update_timestamp(env, "Received")

    # Simulate reviewing the applicant's resume
    yield env.timeout(1)
    applicant.update_timestamp(env, "Resume Review")

    # Simulate conducting a telephone interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Telephone Interview")

    # Simulate conducting an assessment of the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Assessment")

    # Simulate conducting an in-person interview with the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Interview")

    # Simulate making a contract offer to the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Contract Offer")

    # Simulate accepting the contract offer from the applicant
    yield env.timeout(1)
    applicant.update_timestamp(env, "Acceptance")

# Create a simpy environment
env = simpy.Environment()

# Create a list of applicants
applicants = []

# Create a fake object to generate random names and birthdays
fake = Faker()

# Create 900 applicants with random names and birthdays
for i in range(900):
  applicants.append(Applicant())

# Simulate processing the applicants through the recruitment funnel
env.process(process_applicants(env, applicants))
env.run()

# Save the applicants' data to a CSV file
with open("applicants.csv", "w") as csvfile:
  writer = csv.writer(csvfile)

  # Write the header row
  writer.writerow(["Name", "Birthday"] + [
    "Received", "Resume Review", "Telephone Interview", "Assessment",
    "Interview", "Contract Offer

continue the code from line with env = simpy.Environment()

Sure, here is the updated code that continues from the env = simpy.Environment() line:

Copy code# Create a list of applicants
applicants = []

# Create a fake object to generate random names and birthdays
fake = Faker()

# Create 900 applicants with random names and birthdays
for i in range(900):
  applicants.append(Applicant())

# Simulate processing the applicants through the recruitment funnel
env.process(process_applicants(env, applicants))
env.run()

# Save the applicants' data to a CSV file
with open("applicants.csv", "w") as csvfile:
  writer = csv.writer(csvfile)

  # Write the header row
  writer.writerow(["Name", "Birthday"] + [
    "Received", "Resume Review", "Telephone Interview", "Assessment",
    "Interview", "Contract Offer", "Acceptance"
  ])

  # Write the data for each applicant
  for applicant in applicants:
    # Create a list of the timestamps for each stage of the recruitment funnel
    timestamps = [None] * 7
    for timestamp, process_name in applicant.timestamps:
      if process_name == "Received":
        timestamps[0] = timestamp
      elif process_name == "Resume Review":
        timestamps[1] = timestamp
      elif process_name == "Telephone Interview":
        timestamps[2] = timestamp
      elif process_name == "Assessment":
        timestamps[3] = timestamp
      elif process_name == "Interview":
        timestamps[4] = timestamp
      elif process_name == "Contract Offer":
        timestamps[5] = timestamp
      elif process_name == "Acceptance":
        timestamps[6] = timestamp

    # Write the data for this applicant to the CSV file
    writer.writerow([applicant.name, applicant.birthday] + timestamps)

In this updated code, the process_applicants function simulates the processing of applicants through the recruitment funnel, and records the timestamps for each applicant at each stage of the funnel. Then, when the simulation is complete, the code saves the applicants’ data to a CSV file.

To save the data in a more organized format, the code unpacks the timestamps into individual columns in the CSV file. It creates a list of the timestamps for each stage of the recruitment funnel, and fills in the appropriate values for each applicant. Then, it writes the data for each applicant to the CSV file, including the name, birthday, and timestamps.

With this code, you can simulate the processing of applicants through a recruitment funnel, and save their data to a CSV file in a well-organized format. You can then use this data to analyze the recruitment process and identify areas for improvement.

Can you create a pdf of or hyperlink to this conversation so that I can share it with others?

Unfortunately, I am not able to create a PDF or hyperlink of this conversation. I am a text-based AI assistant and do not have the ability to create or share PDFs or hyperlinks. However, you can easily copy and paste the conversation into a document or email to share with others.

Wordle with NLP for Data Scientists

I have played my fair share of Wordle.

I’m not necessarily good at it, but most days I get to solve the puzzle.

The experience is completely different with Semantle — a Wordle-inspired puzzle in which you also need to guess the word of the day.

Unlike in Wordle, Semantle gives you unlimited guesses though. And, boy, you will need many!

Like Wordle, Semantle gives you hints as to how close your guesses were to the secret word of the day.

However, where Wordle shows you how good your guesses were in terms of the letters used, Semantle evaluates the semantic similarity of your guesses to the secret word. For the 1000 most similar words to the secret word, it will show you its closeness like in the picture above.

This semantic similarity comes from the domain of Natural Language Processing — NLP — and this basically reflects how often words are used in similar contexts in natural language.

For instance, the words “love” and “hate” may seem like opposites, but they will often score similarly in grammatical sentences. According to the semantle FAQ the actual opposite of “love” is probably something like “Arizona Diamondbacks”, or “carburetor”.

Another example is last day’s solution (15 March 2022), when the secret word was circle. The ten closest words you could have guessed include circles and semicircle, but more distinctive words such as corner and clockwise.

Further downfield you could have guessed relatively close words like saucer, dot, parabola, but I would not have expected words like outwaited, weaved, and zipped.

The creator of Semantle scored the semantic similarity for almost all words used in the English language, by training a so-called word2vec model based on a very large dataset of news articles (GoogleNews-vectors-negative300.bin from late 2021).

Now, every day, one word is randomly selected as the secret word, and you can try to guess which one it is. I usually give up after 300 to 400 guesses, but my record was 76 guesses for uncovering the secret word world.

Try it out yourself: https://semantle.novalis.org/

And do share your epic wins and fails!

Vox: Are We Automating Racism?

In Glad You Asked, Vox dives deep into timely questions around the impact of systemic racism on our communities and in our daily lives.

In this video, they look into the role of tech in societal discrimination. People assume that tech and data are neutral, and we have turned to tech as a way to replace biased human decision-making. But as data-driven systems become a bigger and bigger part of our lives, we see more and more cases where they fail. And, more importantly, that they don’t fail on everyone equally.

Why do we think tech is neutral? How do algorithms become biased? And how can we fix these algorithms before they cause harm? Find out in this mini-doc:

Understanding Machine Learning (free e-book)

Shai Shalev-Shwartz and Shai Ben-David of the Hebrew University of Jerusalem made their machine learning book free to download.

The book covers the basic foundations up to advanced theory and algorithms. I copied the table of contents below. It’s kind of math heavy, but well explained with visual examples and pseudo-code.

Moreover, the book contains multiple exercises for you to internalize the knowledge and skills.

As an added bonus, the professors teach a number of machine learning courses, the lecture slides and materials of which you can also access for free via the book’s website.

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides a theoretical account of the fundamentals underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics, the book covers a wide array of central topics unaddressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for advanced undergraduates or beginning graduates, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics and engineering.
About the book

If you want to reward the professors for their efforts, please do buy a hardcopy version of book.

Part I: Foundations

A gentle start
A formal learning model
Learning via uniform convergence
The bias-complexity trade-off
The VC-dimension
Non-uniform learnability
The runtime of learning

Part II: From Theory to Algorithms

Linear predictors
Boosting
Model selection and validation
Convex learning problems
Regularization and stability
Stochastic gradient descent
Support vector machines
Kernel methods
Multiclass, ranking, and complex prediction problems
Decision trees
Nearest neighbor
Neural networks

Part III: Additional Learning Models

Online learning
Clustering
Dimensionality reduction
Generative models
Feature selection and generation

Part IV: Advanced Theory

Rademacher complexities
Covering numbers
Proof of the fundamental theorem of learning theory
Multiclass learnability
Compression bounds
PAC-Bayes

Appendices

Technical lemmas
Measure concentration
Linear algebra

ppsr: An R implementation of the Predictive Power Score

Update March, 2021: My R package for the predictive power score (ppsr) is live on CRAN!
Try install.packages("ppsr") in your R terminal to get the latest version.

A few months ago, I wrote about the Predictive Power Score (PPS): a handy metric to quickly explore and quantify the relationships in a dataset.

As a social scientist, I was taught to use a correlation matrix to describe the relationships in a dataset. Yet, in my opinion, the PPS provides three handy advantages:

PPS works for any type of data, also nominal/categorical variables
PPS quantifies non-linear relationships between variables
PPS acknowledges the asymmetry of those relationships

Florian Wetschoreck came up with the PPS idea, wrote the original blog, and programmed a Python implementation of it (called ppscore).

Yet, I work mostly in R and I was very keen on incorporating this powertool into my general data science workflow.

So, over the holiday period, I did something I have never done before: I wrote an R package!

It’s called ppsr and you can find the code here on github.

Installation

# You can get the official version from CRAN:
install.packages("ppsr")

## Or you can get the development version from GitHub:
# install.packages('devtools')
# devtools::install_github('https://github.com/paulvanderlaken/ppsr')

Usage

The ppsr package has three main functions that compute PPS:

score() – which computes an x-y PPS
score_predictors() – which computes X-y PPS
score_matrix() – which computes X-Y PPS

Visualizing PPS

Subsequently, there are two main functions that wrap around these computational functions to help you visualize your PPS using ggplot2:

visualize_predictors() – producing a barplot of all X-y PPS

visualize_matrix() – producing a heatmap of all X-Y PPS

Note that Species is a nominal/categorical variable, with three character/text options.

A correlation matrix would not be able to show us that the type of iris Species can be predicted extremely well by the petal length and width, and somewhat by the sepal length and width. Yet, particularly sepal width is not easily predicted by the type of species.

Exploring mtcars

It takes about 10 seconds to run 121 decision trees with visualize_matrix(mtcars). Yet, the output is much more informative than the correlation matrix:

cyl can be much better predicted by mpg than the other way around
the classification of vs can be done well using nearly all variables as predictors, except for am
yet, it’s hard to predict anything based on the vs classification
a cars’ am can’t be predicted at all using these variables

The correlation matrix does provides insights that are not provided by the PPS matrix. Most importantly, the sign and strength of any linear relationship that may exist. For instance, we can deduce that mpg relates strongly negatively with cyl.

Yet, even though half of the matrix does not provide any additional information (due to the symmetry), I still find it hard to derive the most important relations and insights at a first glance.

Moreover, the rows and columns for vs and am are not very informative in this correlation matrix as it contains pearson correlations coefficients by default, whereas vs and am are binary variables. The same can be said for cyl, gear and carb, which contain ordinal categories / integer data, so you can discuss the value of these coefficients depicted here.

Exploring trees

In R, there are many datasets built in via the datasets package. Let’s explore some using the ppsr::visualize_matrix() function.

datasets::trees has data on 31 trees’ girth, height and volume.

visualize_matrix(datasets::trees) shows that both girth and volume can be used to predict the other quite well, but not perfectly.

Let’s have a look at the correlation matrix.

The scores here seem quite higher in general. A near perfect correlation between volume and girth.

Is it near perfect though? Let’s have a look at the underlying data and fit a linear model to it.

You will still be pretty far off the real values when you use a linear model based on Girth to predict Volume. This is what the original PPS of 0.65 tried to convey.

Actually, I’ve run the math for this linaer model and the RMSE is still 4.11. Using just the mean Volume as a prediction of Volume will result in 16.17 RMSE. If we map these RMSE values on a linear scale from 0 to 1, we would get the PPS of our linear model, which is about 0.75.

So, actually, the linear model is a better predictor than the decision tree that is used as a default in the ppsr package. That was used to generate the PPS matrix above.

Yet, the linear model definitely does not provide a perfect prediction, even though the correlation may be near perfect.

Conclusion

In sum, I feel using the general idea behind PPS can be very useful for data exploration.

Particularly in more data science / machine learning type of projects. The PPS can provide a quick survey of which targets can be predicted using which features, potentially with more complex than just linear patterns.

Yet, the old-school correlation matrix also still provides unique and valuable insights that the PPS matrix does not. So I do not consider the PPS so much an ~~alternative~~, as much as a complement in the toolkit of the data scientist & researcher.

Enjoy the R package, or the Python module for that matter, and let me know if you see any improvements!

Guidelines for Ethical AI

As AI systems become more prevalent in society, we face bigger and tougher societal challenges. Given many of these challenges have not been faced before, practitioners will face scenarios that will require dealing with hard ethical and societal questions.
There has been a large amount of content published which attempts to address these issues through “Principles”, “Ethics Frameworks”, “Checklists” and beyond. However navigating the broad number of resources is not easy.
This repository aims to simplify this by mapping the ecosystem of guidelines, principles, codes of ethics, standards and regulation being put in place around artificial intelligence.
github.com/EthicalML/awesome-artificial-intelligence-guidelines/

🔍 High Level Frameworks & Principles	🔏 Processes & Checklists	🔨 Interactive & Practical Tools
📜 Industry standards initiatives	📚 Online Courses	🤖 Research and Industry Newsletters
⚔ Regulation and Policy

Links to Awesome Artificial Intelligence Guidelines

This overview of ethical guidelines for Artificial Intelligence is by the same author of the repository of Machine Learning production resources shared earlier this year.

paulvanderlaken.com

Getting ChatGPT to simulate a Recruitment Process in Python code

Wordle with NLP for Data Scientists

Vox: Are We Automating Racism?

Understanding Machine Learning (free e-book)

Table of contents

Part I: Foundations

Part II: From Theory to Algorithms

Part III: Additional Learning Models

Part IV: Advanced Theory

Appendices

ppsr: An R implementation of the Predictive Power Score

Installation

Usage

Visualizing PPS

Exploring mtcars

Exploring trees

Conclusion

Share this:

Share this:

Share this:

Table of contents

Part I: Foundations

Part II: From Theory to Algorithms

Part III: Additional Learning Models

Part IV: Advanced Theory

Appendices

Share this:

Installation

Usage

Visualizing PPS

Exploring mtcars

Exploring trees

Conclusion

Share this:

Share this: