Version control is an essential tool for any software developer. Hence, any respectable data scientist has to make sure his/her analysis programs and machine learning pipelines are reproducible and maintainable through version control.
Often, we use git for version control. If you don’t know what git is yet, I advise you begin here. If you work in R, start here and here. If you work in Python, start here.
This blog is intended for those already familiar working with git, but who want to learn how to write better, more informative git commit messages. Actually, this blog is just a summary fragment of this original blog by Chris Beams, which I thought deserved a wider audience.
Summarize changes in around 50 characters or less
More detailed explanatory text, if necessary. Wrap it to about 72
characters or so. In some contexts, the first line is treated as the
subject of the commit and the rest of the text as the body. The
blank line separating the summary from the body is critical (unless
you omit the body entirely); various tools like `log`, `shortlog`
and `rebase` can get confused if you run the two together.
Explain the problem that this commit is solving. Focus on why you
are making this change as opposed to how (the code explains that).
Are there side effects or other unintuitive consequences of this
change? Here's the place to explain them.
Further paragraphs come after blank lines.
- Bullet points are okay, too
- Typically a hyphen or asterisk is used for the bullet, preceded
by a single space, with blank lines in between, but conventions
If you use an issue tracker, put references to them at the bottom,
See also: #456, #789
If you’re having a hard time summarizing your commits in a single line or message, you might be committing too many changes at once. Instead, you should try to aim for what’s called atomic commits.
In this tutorial, you will be introduced to the command line. We have selected a set of commands we think will be useful in general to a wide range of audience. […] after completing this tutorial, readers should be able to use the shell for version control, managing cloud services (like deploying your own shiny server etc.), execute commands in R & RMarkdown and execute R scripts in the shell.
If you want a deeper understanding of using command line for data science, the original authors suggest you read Data Science at the Command Line. Moreover, Software Carpentry has a lesson on shell. More references are listed at the end of the original tutorial. Use the clickable table of contents to quickly browse to the topic of your interest:
On your machine, you’re going to need two things: images, and containers. Images are the definition of the OS, while the containers are the actual running instances of the images. […] To compare with R, this is the same principle as installing vs loading a package: a package is to be downloaded once, while it has to be launched every time you need it. And a package can be launched in several R sessions at the same time easily.
Peter Cottle built this great interactive Git tutorial that teaches you all vital branching skills right in your browser. It’s interactive, beautiful, and very informative, introducing every concept and Git command in a step-by-step fashion.
The tutorial includes many levels that progressively teach you the Git commands you’ll need to apply version control on a daily basis:
There’s also a sandbox mode where you can interactively explore and build your own Git tree.
LearnGitBranching is a git repository visualizer, sandbox, and a series of educational tutorials and challenges. Its primary purpose is to help developers understand git through the power of visualization (something that’s absent when working on the command line). This is achieved through a game with different levels to get acquainted with the different git commands.
You can input a variety of commands into LearnGitBranching (LGB) — as commands are processed, the nearby commit tree will dynamically update to reflect the effects of each command.