100 Python pandas tips and tricks

Working with Python’s pandas library often?

This resource will be worth its weight in gold!

Kevin Markham shares his tips and tricks for the most common data handling tasks on Twitter. He compiled the top 100 in one amazing overview page. Find the hyperlinks to specific sections below!

Quicklinks to categories

Kevin even made a video demonstrating his 25 most useful tricks:

Tutorial: Demystifying Deep Learning for Data Scientists

In this great tutorial for PyCon 2020, Eric Ma proposes a very simple framework for machine learning, consisting of only three elements:

  1. Model
  2. Loss function
  3. Optimizer

By adjusting the three elements in this simple framework, you can build any type of machine learning program.

In the tutorial, Eric shows you how to implement this framework in Python (using JAX) and how to build linear regression, logistic regression, and artificial neural networks all in the same way (using gradient descent).
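
To make the three elements concrete, here is a minimal sketch in plain Python. It is not Eric's code: the tutorial uses JAX to compute gradients automatically, whereas this sketch derives the gradient of the mean squared error by hand. The names `model`, `loss`, `gradients`, and `fit`, the learning rate, and the toy data are all assumptions made for illustration.

```python
# Minimal sketch of the model / loss / optimizer framework for linear regression.
# Assumption: gradients are written out by hand; the tutorial uses JAX for this step.

def model(params, x):
    """Model: a straight line y = w * x + b."""
    w, b = params
    return w * x + b


def loss(params, xs, ys):
    """Loss function: mean squared error between predictions and targets."""
    errors = [model(params, x) - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def gradients(params, xs, ys):
    """Partial derivatives of the mean squared error with respect to w and b."""
    n = len(xs)
    errors = [model(params, x) - y for x, y in zip(xs, ys)]
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e for e in errors) / n
    return grad_w, grad_b


def fit(xs, ys, learning_rate=0.05, steps=2000):
    """Optimizer: plain gradient descent starting from w = 0, b = 0."""
    params = (0.0, 0.0)
    for _ in range(steps):
        grad_w, grad_b = gradients(params, xs, ys)
        params = (params[0] - learning_rate * grad_w,
                  params[1] - learning_rate * grad_b)
    return params


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fit should recover values very close to w = 2 and b = 1. Swapping in a different `model` and `loss` while keeping the same optimizer loop is the idea behind treating logistic regression and neural networks "all in the same way".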

I can’t even begin to explain it as well as Eric does himself, so I highly recommend you watch and code along with the YouTube tutorial (~1 hour):

If you want to code along, here’s the GitHub repository: github.com/ericmjl/dl-workshop

Have you ever wondered what goes on behind the scenes of a deep learning framework? Or what is going on behind that pre-trained model that you took from Kaggle? Then this tutorial is for you! In this tutorial, we will demystify the internals of deep learning frameworks – in the process equipping us with foundational knowledge that lets us understand what is going on when we train and fit a deep learning model. By learning the foundations without a deep learning framework as a pedagogical crutch, you will walk away with foundational knowledge that will give you the confidence to implement any model you want in any framework you choose.

https://www.youtube.com/watch?v=gGu3pPC_fBM

Using data science to uncover botnets on Twitter

I love how people are using data and data science to fight fake news these days (see also Identifying Dirty Twitter Bots), and I recently came across another great example.

Conspirador Norteño (real name unknown) is a member of what they call #TheResistance, a group of data scientists discovering and analyzing so-called botnets: networks of artificial accounts on social media websites like Twitter.

#TheResistance uses quantitative analysis to unveil large groups of fake accounts that spread potential fake news or fake-endorse the (fake) news spread by others.

In a recent Twitter thread, Norteño shows how they discovered that many of the early followers of Dr. Shiva Ayyadurai (self-proclaimed inventor of email) are likely bots.

They looked at the date on which each of these accounts started following Shiva, set against the date on which the account was created. A remarkable pattern appeared:

Image
Via https://twitter.com/conspirator0/status/1244411551546847233/photo/1

Although @va_shiva’s recent followers look unremarkable, a significant majority of his first 5000 followers appear to have been created in batches and to have subsequently followed @va_shiva in rapid succession.

Looking at those followers in more detail, other suspicious patterns emerge. Their names follow the same pattern, and they have roughly equal numbers of followers, followings, tweets, and (no) likes. Moreover, they were created only seconds apart. Many of them seem to follow each other as well.

Image
Via https://twitter.com/conspirator0/status/1244411636410187782/photo/1
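
The "created only seconds apart" pattern can be sketched in a few lines of Python. This is not Conspirador Norteño's actual code: the account names and timestamps are invented, and the `max_gap_seconds` and `min_batch_size` thresholds are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta


def find_creation_batches(accounts, max_gap_seconds=10, min_batch_size=3):
    """Group accounts created within max_gap_seconds of the previous account.

    accounts: list of (name, created_at) tuples, with created_at a datetime.
    Returns a list of batches, each a list of account names.
    """
    ordered = sorted(accounts, key=lambda account: account[1])
    batches, current = [], []
    previous_time = None
    for name, created_at in ordered:
        close_to_previous = (
            previous_time is not None
            and created_at - previous_time <= timedelta(seconds=max_gap_seconds)
        )
        if current and not close_to_previous:
            # Gap too large: close the running group and start a new one.
            if len(current) >= min_batch_size:
                batches.append(current)
            current = []
        current.append(name)
        previous_time = created_at
    # Do not forget the group that is still open after the last account.
    if len(current) >= min_batch_size:
        batches.append(current)
    return batches


# Hypothetical follower data, invented for this example.
followers = [
    ("user_a1", datetime(2018, 3, 1, 12, 0, 0)),
    ("user_a2", datetime(2018, 3, 1, 12, 0, 4)),
    ("user_a3", datetime(2018, 3, 1, 12, 0, 9)),
    ("organic_1", datetime(2015, 7, 14, 8, 30, 0)),
    ("organic_2", datetime(2019, 11, 2, 21, 15, 42)),
]
print(find_creation_batches(followers))
```

Here only the three `user_a*` accounts end up in a batch, because they were created within seconds of each other, while the two accounts created years apart are left out. A real analysis would also look at naming patterns, follower counts, and who follows whom.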

If that wasn’t enough proof that something’s off, here’s a variety of their tweets… Not really what everyday folks would tweet, right? Plus, similar patterns appear again across accounts.

Image
Via https://twitter.com/conspirator0/status/1244411760129515522/photo/1

At first, I thought, so what? This Shiva guy probably just set up some automated (Python?) scripts to create Twitter accounts and follow him. Good for him. It worked out, as his most recent 10k followers followed him organically.

However, it becomes scarier once you notice that this Shiva guy is (successfully) promoting the firing of people working for the government:

Anyway, I wanted to share this simple yet cool approach to finding bots & fake news networks on social media. I hope you liked it, and would love to hear your thoughts in the comments!

Learn Julia for Data Science

Most data scientists favor Python as a programming language these days. However, there’s also still a large group of data scientists coming from a statistics, econometrics, or social science background and therefore favoring R, the programming language they learned in university. Now there’s a new kid on the block: Julia.

Image: Julia programming
Via Medium

Advantages & Disadvantages

According to some, you can think of Julia as a mixture of R and Python, but faster. As a programming language for data science, Julia has some major advantages:

  1. Julia is lightweight and efficient and will run on the tiniest of computers
  2. Julia is just-in-time (JIT) compiled, and can approach or match the speed of C
  3. Julia is a functional language at its core
  4. Julia supports metaprogramming: Julia programs can generate other Julia programs
  5. Julia has a math-friendly syntax
  6. Julia has refined parallelization compared to other data science languages
  7. Julia can call C, Fortran, Python or R packages

However, others also argue that Julia comes with some disadvantages for data science, like data frame printing, 1-indexing, and its external package management.

Comparing Julia to Python and R

Open Risk Manual published this side-by-side review of the main open source Data Science languages: Julia, Python, R.

You can click the links below to jump directly to the section you’re interested in. Once there, you can compare the packages and functions that allow you to perform Data Science tasks in the three languages.

General | Development | Algorithms & Datascience
History and Community | Development Environment | General Purpose Mathematical Libraries
Devices and Operating Systems | Files, Databases and Data Manipulation | Core Statistics Libraries
Package Management | Web, Desktop and Mobile Deployment | Econometrics / Timeseries Libraries
Package Documentation | Semantic Web / Semantic Data | Machine Learning Libraries
Language Characteristics | High Performance Computing | GeoSpatial Libraries
Using R, Python and Julia together | | Visualization
Via openriskmanual.org/wiki/Overview_of_the_Julia-Python-R_Universe

Starting with Julia for Data Science

Here’s a very well-written Medium article that guides you through installing Julia and starting with some simple Data Science tasks. If nothing else, this is what Julia’s plots look like:

Via Medium

Free course: SQL for Data Science

Kristen Kehrer from datamovesme.com does all kinds of super valuable stuff in SQL, and at the end of 2019 she decided to share it with the world via a free SQL course.

Screenshot from youtube.com/watch?v=W5AiLYR02l8

The course is focused on data science and has 5 modules with video lectures:

Kristen advises taking the course with dual monitors, as she also provides an online SQL query builder environment where you can write your queries during the videos.

https://kristenkehrer.github.io/datamovesme-sqlcourse/

Moreover, Kristen also published the slides and the code to go with the course, so you can really learn along:

A nice touch is that Kristen simulated some data for a fictitious e-commerce company, which really lets you get a feel for the type of data you’d be working with in practice:

Screenshot from youtube.com/watch?v=itpvD0Eb_s4

Why Gordon Shotwell uses R

This blog post by Gordon Shotwell has come across my Twitter feed a couple of times now and I thought I’d share it here: blog.shotwell.ca/posts/why_i_use_r

In it, Gordon presents his reasons for using R, describes R’s four unique selling points, and lays out a discussion full of perfectly quotable thoughts and opinions.

Do have a look at the original blog as well, but here’s my 3-minute summary:

Gordon finds that there are four main features of the R programming language that are essential to his work and in a sense unique to the R language. Here they are, along with quotes by Gordon explaining R’s unique selling points in his words:

(1) Native data science structures

It’s relatively easy to do data science in R without any external libraries. You can read data from a csv into a data frame, plot and clean that data, and analyse it using built-in statistical models.

(2) Non-standard evaluation

Non-standard evaluation lets you do things like use a variable name in a plot title, or evaluate a user-supplied expression in a different environment.

[…]

For example, R lets you specify models with a formula interface like this: lm(mpg ~ cyl, data = mtcars). This is a natural way for statisticians to specify statistical models because they’re usually familiar with the syntax, but without NSE there’s no way to make that function work as written because mpg and cyl are not objects in the calling environment.
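
For readers coming from Python, here is a rough analogue of the idea, not NSE itself. Python evaluates arguments before a function is called, so names like mpg and cyl cannot be passed unevaluated; a common workaround is to pass the expression as a string and resolve the names against the data. The helper `evaluate_in` and the values in `cars` are hypothetical and exist only for this sketch.

```python
def evaluate_in(columns, expression):
    """Evaluate an expression string using the column names as variables.

    Assumption: `expression` comes from a trusted source; eval must never
    be run on untrusted input.
    """
    # An empty __builtins__ keeps name lookup limited to the supplied columns.
    return eval(expression, {"__builtins__": {}}, dict(columns))


# Made-up values, named after two mtcars columns for illustration only.
cars = {"mpg": 21.0, "cyl": 6}

# Neither mpg nor cyl exists as a variable in the calling code;
# the names are resolved inside the `cars` mapping instead.
print(evaluate_in(cars, "mpg / cyl"))  # 3.5
```

The difference from R is that the caller has to quote the expression, whereas NSE lets a function capture it unevaluated.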

(3) Packaging consensus

R let me get up and running, installing packages, filtering data, and printing plots in under 20 minutes, which meant that I stayed interested in the language and eventually started using it professionally. I had actually started to learn Python at around the same time but just found it too difficult. 
[…]

The user that I care the most about only has 20 minutes of attention and no real programming skill, so the only thing they can “just” do is copy and paste one line of code into a console. If that doesn’t work, I’ve lost them, and they’ll spend another lonely year renewing their SPSS licenses.

(4) Functional programming

I really like this pattern of [functional] programming because breaking complicated jobs down into small functional bricks gives me confidence that the overall solution is correct. I can work on the small functions, verify that they’re correct through tests, and then know that combining those building blocks together won’t change their behaviour.

Although I personally do not fully agree with all four points (e.g., I very much like to leverage functional programming in Python, and it works like a charm!), I very much like the outline Gordon provides. I’d love to hear your thoughts as well, so do share them in the comments.
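
As a minimal sketch of what the "small functional bricks" pattern can look like in Python: the function names and the toy input below are illustrative assumptions, not taken from Gordon's post.

```python
from functools import reduce


def strip_whitespace(values):
    """Remove leading and trailing whitespace from every string."""
    return [value.strip() for value in values]


def drop_empty(values):
    """Discard empty strings."""
    return [value for value in values if value]


def to_floats(values):
    """Convert every string to a float."""
    return [float(value) for value in values]


def compose(*functions):
    """Chain functions left to right into a single function."""
    return lambda data: reduce(lambda result, function: function(result), functions, data)


# Each brick can be tested on its own...
assert drop_empty(["a", "", "b"]) == ["a", "b"]

# ...and then combined without changing its behaviour.
clean = compose(strip_whitespace, drop_empty, to_floats)
print(clean([" 1.5", "", "2 ", "   "]))  # [1.5, 2.0]
```

Because none of the functions modifies its input or relies on shared state, verifying each one separately says something meaningful about the combined pipeline.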

For now, let’s end with some other lovely quotes by Gordon:

The thing is, I don’t use R out of some blind brand loyalty but because I don’t like working hard. 

I came to R from an Excel background, and for a long time I had internalized the feeling that serious engineers used Python, while analysts or researchers could use languages like R. Over time I’ve realized that the people making that statement often aren’t really informed. They rarely know anything about R, and often don’t really write production-quality code themselves.

In contrast, most of the very senior engineers I’ve met understand that all programming languages are basically just bundles of trade-offs, and so no single language is going to be globally superior to another. There really are no production languages – only production engineers.

https://blog.shotwell.ca/posts/why_i_use_r/