Get started with Git and work the way professional software and machine learning engineers do: with a version control system.
Are you already using version control with Git for your data science and machine learning projects? If you haven’t thought about this yet, let me ask you a second question: where are you on your data science or machine learning journey?
… tinkering in Jupyter notebooks doesn’t seem to be the right choice anymore. — Yves Boutellier
I asked myself this question two months ago, after I finished my Bachelor in Biology and Computational Sciences. During my studies I took a total of six courses in data science and machine learning. Even though I completed a couple of projects for those courses, we students mostly stuck to the simplest way of working: tinkering in Jupyter notebooks until the code and the results were satisfying, and then submitting them.
However, coming back to the intro question: I asked myself whether I want to work with the data science and machine learning knowledge I gained during my studies. I answered this question with yes, and I decided to pause my studies and work part-time on my data science / MLE projects, either to get a job or to turn this or another project into a startup. But now tinkering in Jupyter notebooks doesn’t seem to be the right choice anymore.
I wanted to emulate the way a machine learning engineer works, and that’s why I decided to learn the basics of version control with Git and apply them to my projects.
This article will give you the basics to follow this idea and upgrade your portfolio with another skill: version control with Git.
Why Git?
I used to work like a hobbyist, tinkering on my code. What’s bad about that? Well, if you want to create a project for your resume to get a job, for example, you want to make sure it’s a decent project and not one completed in a short sprint. Employers want to see that you can work on something bigger than a toy project because, if they hired you, you would work on their products, and those are not small projects either. You should show that you code in an organised manner and that you can plan, test and deploy distinct features while still developing other parts of the code independently.
Git enables you to do exactly this: maintain multiple versions and keep track of the changes you made. We will 1) look at the basics, 2) move on to remote repositories and then 3) touch on branching.
After each topic, I present command-line exercises so you can practise what that section covered.
1. Git basics
How does Git think about files?
Git thinks about your files as a series of snapshots. When you ask the version control system, i.e. Git, to save your most recent changes, it basically takes a snapshot of what all your files look like at that moment and stores a reference to this snapshot. Git is highly efficient in doing so: if a file hasn’t changed, Git doesn’t store it again but only keeps a link to the previous, identical version.
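If you would like to see this for yourself, here is a small sketch you could run in a throwaway directory. The file names, commit messages and the "Example User" identity are placeholders I made up for illustration. It makes two commits, changes only one file in between, and then compares the IDs under which Git stored each file in both snapshots.

```shell
# Throwaway repository in a temporary directory (all names are placeholders)
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"
git config user.email "user@example.com"

# First snapshot: two files
echo "print('hello')" > HelloWorld.py
echo "notes" > notes.txt
git add HelloWorld.py notes.txt
git commit --quiet -m "Add two files"

# Second snapshot: only notes.txt changes
echo "more notes" >> notes.txt
git add notes.txt
git commit --quiet -m "Update notes"

# IDs of each file in the previous (HEAD~1) and the current (HEAD) snapshot
old_hello=$(git rev-parse HEAD~1:HelloWorld.py)
new_hello=$(git rev-parse HEAD:HelloWorld.py)
old_notes=$(git rev-parse HEAD~1:notes.txt)
new_notes=$(git rev-parse HEAD:notes.txt)

echo "HelloWorld.py: $old_hello -> $new_hello"   # same ID: stored once, linked twice
echo "notes.txt:     $old_notes -> $new_notes"   # different IDs: a new version was stored
```

The unchanged HelloWorld.py keeps the same ID in both snapshots, while notes.txt gets a new one.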
Please remember three states
A file in Git can be in one of three states:
- modified: file is modified but not committed
- staged: file is modified, marked to be part of your next commit snapshot
- committed: file is safely stored in your local database
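One way to watch a file move through these three states is git status with the --short flag, which prints a two-letter code in front of each file name. The following is only a sketch in a temporary directory; the file name, messages and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "print('hello')" > HelloWorld.py
git add HelloWorld.py
git commit --quiet -m "Add HelloWorld.py"

# committed: nothing to report, so the output is empty
state_committed=$(git status --short)

# modified: changed on disk, but not staged
echo "print('hello again')" >> HelloWorld.py
state_modified=$(git status --short)       # " M HelloWorld.py"

# staged: marked to be part of the next commit
git add HelloWorld.py
state_staged=$(git status --short)         # "M  HelloWorld.py"

echo "committed: [$state_committed]"
echo "modified:  [$state_modified]"
echo "staged:    [$state_staged]"
```

The position of the M tells you where the change sits: in the right column it is only in your working tree, in the left column it is staged.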
We can picture this knowledge as three areas: the working tree, the staging area and the Git directory.
The working tree (working directory) is a single version of your project, placed on disk for you to modify. Either it’s an older version pulled out of the .git directory, or it reflects the most recent version, possibly with modifications that are not yet staged or committed.
The staging area is a file (Git calls it the index) that records which changes will go into your next commit.
Finally, the Git directory is where Git saves the object database and metadata for your project. It is what gets copied when the project is shared with other computers and people.
Now that you know these essentials, it’s time to look at the Git workflow.
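To make the three areas a bit more tangible, you could peek at each of them in a fresh repository. This is just a sketch: the temporary directory and the file name are placeholders, and the exact contents of .git vary between Git versions.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
echo "print('Hello World')" > HelloWorld.py
git add HelloWorld.py

ls -a                  # working tree: HelloWorld.py plus the hidden .git directory
git ls-files --stage   # staging area: one entry per file recorded for the next commit
ls .git                # Git directory: HEAD, config, index, objects, refs, ...
```

Note that the staging area itself lives inside the Git directory, as the file .git/index.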
Basic Git workflow
- You modify your files
- You add files with changes to your staging area
- You commit, which takes the files as they are in the staging area and stores that snapshot permanently in your Git directory. We now call those files, with those changes (i.e. those versions), committed.
The next section introduces an exercise so you can practise this yourself.
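The detail that a commit saves the files as they are in the staging area, and not as they are on disk, is easy to overlook. Here is a minimal sketch that shows the difference, assuming a throwaway directory and placeholder names: a change made after git add stays out of the commit.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "print('first version')" > HelloWorld.py
git add HelloWorld.py
git commit --quiet -m "Add HelloWorld.py"

# 1. Modify the file
echo "print('staged change')" >> HelloWorld.py
# 2. Add it to the staging area
git add HelloWorld.py
# A later edit that is NOT added to the staging area
echo "print('unstaged change')" >> HelloWorld.py
# 3. Commit: only the staged version becomes part of the snapshot
git commit --quiet -m "Add staged change"

committed_lines=$(git show HEAD:HelloWorld.py | grep -c "print")
worktree_lines=$(grep -c "print" HelloWorld.py)
leftover=$(git status --short)

echo "lines in the commit:       $committed_lines"   # 2
echo "lines in the working tree: $worktree_lines"    # 3
echo "still modified:            [$leftover]"        # " M HelloWorld.py"
```

The third line exists only in the working tree until you run git add and git commit again.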
Exercise 1 - Practice Git Basics
First of all, we want to set up a Git repository together with its staging area and working tree. You have two possibilities: either you create a new project, or you work on a project that already exists.
1.1 Creating a new project
To create a new project, create a directory on your local computer where you want your project to be and go to it, either manually or with cd (which means change directory).
$ cd /Users/user/my_project (macOS or Linux)
$ cd C:/Users/user/my_project (Windows)
Next you type in:
$ git init
This creates a local Git repository for you: a .git folder. More specifically, it is a subdirectory of your my_project directory.
1.2 Cloning an existing project
Another way is to obtain a project that already exists. To do this, copy the URL of the remote repository, denoted here as <url>, and type:
$ git clone <url>
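If you don’t have a URL at hand, you can still try cloning locally, because git clone also accepts a path to another repository on your computer. In the sketch below, existing_project stands in for the repository behind <url>; all names are placeholders I chose for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the already existing project (normally this lives on a server)
git init --quiet existing_project
cd existing_project
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "print('Hello World')" > HelloWorld.py
git add HelloWorld.py
git commit --quiet -m "Add HelloWorld.py"
cd ..

# Clone it: a local path takes the place of <url> here
git clone --quiet existing_project my_copy
cd my_copy

ls                  # the files of the project
git log --oneline   # the history came along as well
git remote -v       # the clone remembers its source under the name origin
```

Remotes such as origin are covered in section 2.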
Now remember, we have three states: modified, staged and committed.
1.3 Stage and Commit
We can create files and write something in them. A brand-new file starts out untracked, while a changed file that Git already knows about is in the state “modified”. In both cases, the command add transfers it into the state “staged”:
$ git add HelloWorld.py
Finally, to commit it to the local git repository we write:
$ git commit -m 'This message describes what changes I made'
Every commit needs a message; -m is a flag that says the following string is the message.
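Putting Exercise 1 together, a complete run could look like the sketch below. It assumes a throwaway directory. It also sets a name and email first, because Git records an author with every commit and a fresh installation may not have one configured yet; the values shown are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Git records an author with every commit. These values are placeholders.
git config user.name "Example User"
git config user.email "user@example.com"

echo "print('Hello World')" > HelloWorld.py
git status --short      # "?? HelloWorld.py": a new file Git is not tracking yet
git add HelloWorld.py
git status --short      # "A  HelloWorld.py": added to the staging area
git commit --quiet -m 'Add HelloWorld.py'
git log --oneline       # one line per commit: abbreviated ID and message
```

git config without further flags only applies to the current repository; with --global the identity is used for all your repositories.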
You now may continue to remote places :)
2. Remote basics
Most operations only need local files. However, if you plan to work on your project with colleagues, or you want to show your work on, for example, GitHub, you need to learn a few more tools.
Remote repositories are hosted on a server; one very popular host is GitHub. Each remote repository is associated with a name and a URL. It’s very straightforward to link a remote repository once you have a local Git repository.
Exercise 2 - Create Remote Repository
Once you have run git init and set up a local repository, you can associate a remote repository with it. Simply choose a name for your remote (origin is the conventional choice; main, by contrast, is the usual name of a branch, not of a remote) and copy the URL from, e.g., GitHub.
$ git remote add <remoteName> <url>
It could look like this, for example:
$ git remote add origin https://github.com/myusername/ourproject
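To check that the remote was registered, you could list the remotes of your repository. The sketch below uses a temporary directory and the example URL from above, which is a placeholder. git remote add only stores the name and the URL locally and does not contact the server, so this works even if the repository does not exist on GitHub.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Placeholder URL taken from the example above
git remote add origin https://github.com/myusername/ourproject

git remote -v                            # lists origin with its fetch and push URL
remote_url=$(git config --get remote.origin.url)
echo "origin points to: $remote_url"
```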
Once you are happy with some code in your project, you add one more step to the workflow you learned in Exercise 1. Here <localName> is the name of the local branch you want to push:
$ git push <remoteName> <localName>
In total, you will write something similar to this:
$ git add HelloWorld.py
$ git commit -m 'This message describes what changes I made'
$ git push origin main
And now HelloWorld.py is part of the remote repository on GitHub.
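If you want to rehearse the push without a GitHub account, a local bare repository can play the role of the server. This is only a sketch under that assumption: server.git, my_project and the identity are placeholders. Because a fresh repository may call its first branch master or main depending on your Git version and settings, the sketch renames the branch to main before pushing.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the remote repository on GitHub: a bare repository has no working tree
git init --quiet --bare server.git

git init --quiet my_project
cd my_project
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git remote add origin "$workdir/server.git"

echo "print('Hello World')" > HelloWorld.py
git add HelloWorld.py
git commit --quiet -m 'Add HelloWorld.py'

git branch -M main              # make sure the local branch is called main
git push --quiet origin main

git ls-remote origin            # lists the branches the remote has now
```

With a real GitHub URL the commands stay the same, although GitHub will also ask you to authenticate.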
Colleagues can clone this repository as described in section 1 and work along with you. But what if you each want to try out different things? That’s when you probably need branching.
3. Branching
Branching would blow up the size of this article, so it is covered in a separate article.
Summary
You came with the urge to learn version control but didn’t know how to get started. You walked through the introductory explanations and applied them in the small exercises. You can now create a repository, and add and commit files to it. Remote repositories have become familiar too: remember, they are how you share your work with your team, for example via GitHub. If your mind is still hungry for more, the article on branching takes your version control knowledge to another level.