Most data scientists favor Python as a programming language these days. However, there’s also still a large group of data scientists coming from a statistics, econometrics, or social science and therefore favoring R, the programming language they learned in university. Now there’s a new kid on the block: Julia.
Advantages & Disadvantages
According to some, you can think of Julia as a mixture of R and Python, but faster. As a programming language for data science, Julia has some major advantages:
Julia is light-weight and efficient and will run on the tiniest of computers
Julia is just-in-time (JIT) compiled, and can approach or match the speed of C
Julia is a functional language at its core
Julia support metaprogramming: Julia programs can generate other Julia programs
Julia has a math-friendly syntax
Julia has refined parallelization compared to other data science languages
Julia can call C, Fortran, Python or R packages
However, others also argue that Julia comes with some disadvantages for data science, like data frame printing, 1-indexing, and its external package management.
You can click the links below to jump directly to the section you’re interested in. Once there, you can compare the packages and functions that allow you to perform Data Science tasks in the three languages.
Lisa Charlotte Rost of DataWrapper often writes about data visualization and lately she has focused on the (im)proper use of color in visualization. In this recent blog, she gives a bunch of great tips and best practices, some of which I copied below:
Do you have a bunch of data but you can’t seem to figure out how to display it? Or looking for that one specific visualization of which you can’t remember the name?
www.datavizproject.com provides a most comprehensive overview of all the different ways to visualize your data. You can sort all options by Family, Input, Function, and Shape to find that one dataviz that best conveys your message.
Update: look at some of these other repositories here or here.
XGBOOST stands for eXtreme Gradient Boosting. A big brother of the earlier AdaBoost, XGB is a supervised learning algorithm that uses an ensemble of adaptively boosted decision trees. For those unfamiliar with adaptive boosting algorithms, here’s a 2-minute explanation video and a written tutorial. Although XGBOOST often performs well in predictive tasks, the training process can be quite time-consuming (similar to other bagging/boosting algorithms (e.g., random forest)).
In a recent blog, Analytics Vidhya compares the inner workings as well as the predictive accuracy of the XGBOOST algorithm to an upcoming boosting algorithm: Light GBM. The blog demonstrates a stepwise implementation of both algorithms in Python. The table below reflects the main conclusion of the comparison: Although the algorithms are comparable in terms of their predictive performance, light GBM is much faster to train. With continuously increasing data volumes, light GBM, therefore, seems the way forward.
Laurae also benchmarked lightGBM against xgboost on a Bosch dataset and her results show that, on average, LightGBM (binning) is between 11x to 15x faster than xgboost (without binning):
However, the differences get smaller as more threads are used due to thread inefficiencies (idle-time increases because threads are not scheduled a next task fast enough).
Neil Schneider tested the three algorithms for gradient boosting in R (GBM, xgboost, and lightGBM) and sums up their (dis)advantages:
GBM has no specific advantages but its disadvantages include no early stopping, slower training and decreased accuracy,
xgboost has demonstrated successful on kaggle and though traditionally slower than lightGBM, tree_method = 'hist' (histogram binning) provides a significant improvement.
lightGBM has the advantages of training efficiency, low memory usage, high accuracy, parallel learning, corporate support, and scale-ability. However, its’ newness is its main disadvantage because there is little community support.