Subrat's Technical Blog: Learn Basic Git Commands for Your Data Science Works

Git as a Version Control System

According to the official git site, git is a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Development of git was started on 3 April 2005 by Linus Torvalds, the creator of Linux, after many developers of the Linux kernel gave up access to BitKeeper.

The illustration above shows how git works as a versioning tool. It lets us work with a single file instead of keeping many copies, each representing its own version. Basic versioning with many copies quickly becomes unstructured and hard to track, and we cannot see what changed between revisions. With git, every change we commit is recorded, and we can move back to any recorded version of a file whenever we want. From the git logs, we can trace the revisions of our work from the very beginning.

As a collaboration tool, git replaces ad hoc collaboration over email, SMS, chat, and so on with one shared system. Ad hoc collaboration tends to be poorly integrated, prone to misunderstanding, badly logged, and inefficient for a big project with many people working on it.

Git is designed for text-based data, for instance code, books, papers, and articles.

Many companies and projects have selected git as their VCS, such as Google, Facebook, Microsoft, Twitter, LinkedIn, and others. This is why we need to learn the basic git commands for our future work and career in a tech company.

On a very basic level, there are two awesome things git allows us to do:

We can track changes in our files
It simplifies working on files and projects with multiple people

How to Install Git

Installing git is usually easy, but the steps depend on your operating system. Windows users can head over to the official git site and download the installer. On Debian-based Linux distributions such as Ubuntu, just open your favourite terminal and run the following commands:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install git

After the installation finishes, check the git version with the following command.

# Check our git version
git --version

Configuration

Before going deeper into git and its environment, we need to configure it by providing a username and an email address.

# Setup the git configuration
git config --global user.name "your-username"
git config --global user.email "your-email"

Git as Versioning Tools

For our first touch with git, let’s create a directory (git-stk in my case) and move into it. Then we begin our practice with git init. Note that git status prints the state of our work, such as whether any files in the directory have changed. As long as we have not committed anything, git status reports “No commits yet”.

So, what does git init mean? Simply, this command creates a .git folder that contains several folders and files, such as hooks, info, objects, refs, config, description, and HEAD, which git uses to track our work. The history of our changes is recorded there.
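
As a rough sketch of this first step, the commands below create the project folder, initialise it, and look at the result. The temporary parent directory is an assumption made only to keep the example away from real projects; git-stk is the folder name used in this tutorial.

```shell
# Create the project folder inside a throwaway location (assumption for this sketch)
project="$(mktemp -d)/git-stk"
mkdir -p "$project"
cd "$project"

# Initialise the repository; -q only silences the confirmation message
git init -q

# List git's bookkeeping files (hooks, objects, refs, HEAD, ...)
ls .git

# Show the state of the working directory
git status
```

The exact contents of .git can differ slightly between git versions, so treat the listing as a guide rather than a fixed inventory.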

To try out more git commands, let’s create a new markdown file named README.md that contains the sentence “First line of README.md file”.

Let’s check the current status with the git status command. It prints README.md in red, indicating that there are changes to folders or files within git-stk that have not been marked (staged) yet. Git lists a brand-new file as untracked; in terms of the git lifecycle, we are in the modified state.

To mark our change to README.md, run the git add command. To see the difference between the modified and staged states, run git status again. The green colour indicates that our changes have been marked and we are now in the staged state.

Our README.md file has been marked but not recorded yet. Record the changes, much like saving a document in Microsoft Word or other tools, with git commit followed by a commit message.
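
The steps above can be sketched as one short session. The scratch directory, the identity values, and the commit message are placeholders chosen for this example, not part of the original walkthrough.

```shell
# Scratch repository with a placeholder identity (both are assumptions)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"

# Create the file: git status lists it under "Untracked files"
echo "First line of README.md file" > README.md
git status

# Mark (stage) the change: git status now lists it under "Changes to be committed"
git add README.md
git status

# Record the staged change together with a commit message
git commit -m "Add README.md with its first line"
```

The identity is set without --global here so that the sketch does not overwrite an existing configuration.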

Git Lifecycle

So far, we have experimented within a git environment and run several git commands. Now we will look deeper into the git lifecycle, which is exactly what we walked through above. The git lifecycle is divided into three states, namely:

Modified: changes that have not been marked yet. We can do anything here: edit files, create or delete folders, and so on
Staged: our changes have been marked but not recorded yet
Committed: folders or files have been successfully recorded in our .git folder
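
One way to watch a tracked file pass through these states is the short status format, where the two-letter code in front of the file name changes at each step. This is a minimal sketch in a scratch repository; the identity and commit messages are placeholders.

```shell
# Scratch repository with one commit already in place (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"

# Modified: the file changed but the change is not marked yet
echo "Another line" >> README.md
git status --short      # " M README.md"

# Staged: the change is marked for the next commit
git add README.md
git status --short      # "M  README.md"

# Committed: the change is recorded, nothing is left to report
git commit -q -m "Extend README.md"
git status --short      # prints nothing
```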

Keep Tracking our Logs

Can you imagine making a mistake within our folders? Can we move back to our last commit, or the one before it? Of course we can. Git keeps our changes (commits), and we can go back to any of them when our work requires it. Here is a new git command you must know: git log. It lists all of our commits, so we can easily track our work.

In the previous commit, we created a file named README.md. Now we will modify that file and inspect its history using the git log command. First, add the sentence “Second line of README.md file” to README.md and then commit it. Just follow and run the commands shown in the following figure.

Check the status until it prints “nothing to commit”, which means that all of our changes have been committed. To track our work (commits), we run git log, which prints the author (username and email), the date and time of each change, the commit message, and the SHA.

git log --oneline : print each commit on a single line
git log --graph : draw a text-based graph of the commit history alongside the log
git log --author=<username> : show commits by a certain author (useful when many users work on the project)
git log <filename> : show commits that touched a certain file

We should write commit messages that are as clear and simple as possible because they help us track our work
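
The variants listed above can be tried on a small two-commit history. The sketch below builds that history in a scratch repository; the identity and commit messages are placeholders chosen for illustration.

```shell
# Build a scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line to README.md"

git log                             # full entries: SHA, author, date, message
git log --oneline                   # one line per commit
git log --graph --oneline           # history drawn as a text graph
git log --author="your-username"    # only commits by this author
git log -- README.md                # only commits that touched README.md
```

The `--` before the file name is optional here; it simply tells git that what follows is a path and not a branch name.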

Another must-know git command is git diff. It lets us compare commits, branches, specific files, or an entire directory. For instance, we will compare our two commits.

The pattern is git diff <SHA-before> <SHA-after>. Replace <SHA-before> and <SHA-after> with the SHAs of our commits.

According to the output, we get the differences in README.md between our first commit and second commit. It is useful, right?
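
Here is a small sketch of that pattern. Instead of copying SHAs by hand from git log, it asks git rev-parse for them; the variable names sha_before and sha_after, the identity, and the commit messages are placeholders for this example.

```shell
# Scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line to README.md"

# Look up the SHAs of the first and second commit
sha_before="$(git rev-parse HEAD~1)"
sha_after="$(git rev-parse HEAD)"

# Lines starting with "+" were added between the two commits
git diff "$sha_before" "$sha_after"
```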

Move Backward to the Previous States

As humans, we’re not perfect, right? Sometimes we make mistakes that affect other things. This applies to tech companies too. For instance, as a data science team, we are developing new features for our machine learning flows and then find a bug. Without version control, we would have to trace back from the beginning of our work, reading the code line by line. Git simplifies this with its ability to move backwards: instead of starting from the beginning, we can inspect the commits made just before the bug appeared.

There are three commands for moving back in git. Which one we use depends on our needs.

git checkout: it is like a time machine that restores the project files to their condition at a designated commit. However, this is temporary; the visit itself is not stored in the git database
git reset: moves the branch back to an earlier commit, so we cannot simply go back to the future. The later commits disappear from git log, and we need to write new commits from that point
git revert: creates a new commit on top of the last commit that undoes the changes made by a past commit
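
Since git revert is not demonstrated elsewhere in this tutorial, here is a minimal sketch of it in a scratch repository. The identity and commit messages are placeholders.

```shell
# Scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line to README.md"

# Undo the last commit by adding a new commit; --no-edit keeps the default message
git revert --no-edit HEAD

cat README.md        # the second line is gone again
git log --oneline    # three commits: the history is kept, not rewritten
```

Unlike git reset, nothing disappears from the log here, which makes revert the gentler choice on branches that other people already use.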

We can call git checkout with a commit to check the condition of the files at that commit. To return from the past, use the command git checkout master.
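
A short sketch of such a round trip follows. It stores the current branch name in a variable (branch, a name chosen for this example) rather than typing master, because some setups name the default branch differently; the identity and commit messages are placeholders.

```shell
# Scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line to README.md"

# Remember which branch we are on (master in this tutorial)
branch="$(git rev-parse --abbrev-ref HEAD)"

# Travel to the previous commit and look at the file as it was then
git checkout -q HEAD~1
cat README.md        # only the first line

# Return from the past
git checkout -q "$branch"
cat README.md        # both lines again
```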

When we decide to use git reset, there are three options to choose from, depending on our problem and on which state we want to return to. Let’s check them out and try them on your computer!

git reset --soft: moves back to the given commit and leaves the undone changes in the staged state
git reset --mixed: moves back to the given commit and leaves the undone changes in the modified state (this is the default)
git reset --hard: moves back to the given commit and discards the undone changes, so the files match the committed state
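
To compare the three options side by side, the sketch below undoes the same commit three times in a scratch repository and checks the short status after each reset. The identity and commit messages are placeholders.

```shell
# Scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add first line"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line"

# --soft: the commit is undone, its change stays staged
git reset -q --soft HEAD~1
git status --short      # "M  README.md"
git commit -q -m "Add second line"

# --mixed: the commit is undone, its change stays in the file but unstaged
git reset -q --mixed HEAD~1
git status --short      # " M README.md"
git commit -q -am "Add second line"

# --hard: the commit is undone and its change is thrown away
git reset -q --hard HEAD~1
git status --short      # prints nothing
cat README.md           # only the first line remains
```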

The git reflog is an amazing resource for recovering project history. Via the reflog, we can recover almost anything we have committed.

We’re probably familiar with the git log command, which shows a list of commits. The git reflog is similar, but instead shows a list of times when HEAD changed.
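
As an illustration, the following sketch throws away a commit with git reset --hard and then uses the reflog to bring it back. It runs in a scratch repository with a placeholder identity and placeholder commit messages.

```shell
# Scratch repository with two commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add first line"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line"

# Drop the second commit: git log no longer shows it
git reset -q --hard HEAD~1
git log --oneline

# The reflog still remembers every position HEAD has had
git reflog

# HEAD@{1} is where HEAD pointed just before the reset, so move back there
git reset -q --hard "HEAD@{1}"
git log --oneline       # both commits are listed again
```

Reflog entries are kept locally and expire after a while, so this is a safety net rather than a permanent archive.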

The HEAD in Git is the pointer to the current branch reference, which is, in turn, a pointer to the last commit you made or the last commit that was checked out into your working directory.
That also means it will be the parent of the next commit you make. It’s generally simplest to think of HEAD as the snapshot of your last commit.

Let’s modify our README.md file again to demonstrate the git checkout command. We add a new line, “Third line of README.md file”, and then, as in the previous steps, stage and commit it.

Remember that we restored our file with git reset earlier, which removed our last commit. Now, using git checkout, we will check the condition of the file at each commit.

Remember: up to this step, we have just three commits

To check our first commit, we use the HEAD pointer. Follow the instructions to understand what git checkout HEAD means. It helps us track the content of README.md at each commit: are there any changes, and what kind of changes?

To understand how the tilde (~) and caret (^) work on HEAD, please look at the following figure. They let us reach any commit on any branch of our work.
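
In text form, the idea can be sketched on a three-commit history: HEAD~N walks N steps back along the first parent, while HEAD^ names the parent of HEAD (and HEAD^2 would name the second parent of a merge commit). The identity and commit messages below are placeholders.

```shell
# Scratch repository with three commits (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add first line"
echo "Second line of README.md file" >> README.md
git commit -q -am "Add second line"
echo "Third line of README.md file" >> README.md
git commit -q -am "Add third line"

# Print only the commit message that each expression points to
git log -1 --format=%s HEAD       # Add third line
git log -1 --format=%s HEAD~1     # Add second line
git log -1 --format=%s HEAD~2     # Add first line
git log -1 --format=%s "HEAD^"    # Add second line (same commit as HEAD~1)
```

Quoting "HEAD^" is a precaution, since some shells treat the caret as a special character.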

Git as Collaboration Tools — Branching

When talking about git as a versioning tool, we only covered how git helps us keep versions of our files. Now we will talk about git as a collaboration tool.

The default branch name in Git is master (newer installations and hosting services may be configured to use main instead)

When developing new features in our code, it is recommended to create a new branch and keep the master branch as our main code. Once a feature is done and no bugs are found, we merge it into the master branch.

In this case, we will create a new branch named new_branch that adds two new lines to the README.md file.

The git checkout command is also used to move and create branches

Stage and commit a modification on the new_branch branch, then return to the master branch and check the README.md file. Our modification on new_branch is not applied to master because we worked on a different branch. This keeps our main code clean and safe.

Look at the git logs and you will now find the new branch marked in red. This indicates that there are commits on another branch.

We have just created a new branch with commits on it. Now we must merge that branch into master and resolve any conflicts.
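
A compact sketch of this branching round trip is shown below. The two added lines, the identity, and the commit messages are made up for the example, and the variable main_branch stands in for master so that the sketch also works where the default branch has another name.

```shell
# Scratch repository with one commit on the main branch (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# Create new_branch and switch to it in one step
git checkout -q -b new_branch

# Add two lines on new_branch and commit them
printf 'Line added on new_branch\nAnother line added on new_branch\n' >> README.md
git commit -q -am "Add two lines on new_branch"

# Back on the main branch, README.md still has only its first line
git checkout -q "$main_branch"
cat README.md

# List the branches; the current one is marked with an asterisk
git branch
```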

Git as Collaboration Tools — Merging

When we work in a team, there will be many branches, each with its own feature. Our next task is to merge these branches and manage the conflicts between them, which is what the git merge command is for. Since the files on master serve as the main code, we merge the other branches into master, so we first need to switch to master! If our code conflicts, it is our task, as the main developer or project leader, to choose which code is removed and which is kept in master.

In a conflicted file, the code from the two branches is separated by a line of equals signs (=======), between the <<<<<<< and >>>>>>> markers

After the merge is done properly, check the git logs. Voila! We have merged new_branch into master. Our master is clean now! This is a simple merging task in git.
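
To see those markers in practice, the sketch below deliberately creates a conflict by adding a different second line on each branch, then resolves it by keeping both lines. The line contents, identity, and commit messages are invented for this example, and keeping both lines is only one possible resolution.

```shell
# Scratch repository with one shared commit (placeholder identity)
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# Each branch adds a different second line
git checkout -q -b new_branch
echo "Second line from new_branch" >> README.md
git commit -q -am "Add second line on new_branch"
git checkout -q "$main_branch"
echo "Second line from the main branch" >> README.md
git commit -q -am "Add second line on the main branch"

# The merge stops with a conflict, so its non-zero exit status is expected here
git merge new_branch || true
cat README.md       # shows the <<<<<<<, ======= and >>>>>>> markers

# Resolve by writing the version we want, then stage and commit to finish the merge
printf 'First line of README.md file\nSecond line from the main branch\nSecond line from new_branch\n' > README.md
git add README.md
git commit -q -m "Merge new_branch"
git log --oneline --graph
```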

To show which revision and author last modified each line of a file, we can use the git blame command.

git blame -L <start>,<end> <file>

annotate only the given line range
may be specified multiple times
overlapping ranges are allowed
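
A brief sketch of both forms, a single range and several ranges at once, on a three-line README.md. The repository is a scratch one with a placeholder identity and placeholder commit messages.

```shell
# Scratch repository: line 1 comes from the first commit, lines 2-3 from the second
cd "$(mktemp -d)"
git init -q
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add first line"
printf 'Second line of README.md file\nThird line of README.md file\n' >> README.md
git commit -q -am "Add second and third lines"

# Annotate only lines 2 to 3
git blame -L 2,3 README.md

# -L may be given more than once: annotate line 1 and line 3
git blame -L 1,1 -L 3,3 README.md
```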

Git as Collaboration Tools — Remote Repository

After talking a lot about git as a versioning tool, we now turn to git as a collaboration tool via a remote repository. Many platforms provide services for sharing our scripts with other people or our team, for instance GitHub, GitLab, and Bitbucket. For this tutorial, we use GitHub, so you might need to register an account and follow the instructions!

Please log in to GitHub and create a new repository. Choose your own repo name and click create repository.

Our aim is to upload our local repository to the remote repository on GitHub. After the new repository is created, we copy its link (SSH or HTTPS) into the terminal and run the git remote add command, which creates a new connection to the remote repository.

Run git push to upload our local repository to the remote repository. This command takes two arguments, <remote-name> and <branch-name>.

<remote-name> : a remote name, for instance, origin
<branch-name> : a branch name, for instance, master (default)
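
The same two commands can be rehearsed without a GitHub account. In the sketch below, a bare repository on the local disk stands in for the GitHub repository; that stand-in, the directory names, and the identity are all assumptions of this example. With a real GitHub repository, the path given to git remote add would be the SSH or HTTPS link copied from the repository page.

```shell
# A bare repository on disk plays the role of the remote (assumption for this sketch)
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"

# Local repository with one commit (placeholder identity)
git init -q "$work/local"
cd "$work/local"
git config user.name "your-username"
git config user.email "your-email@example.com"
echo "First line of README.md file" > README.md
git add README.md
git commit -q -m "Add README.md"
branch="$(git rev-parse --abbrev-ref HEAD)"

# Connect the local repository to the remote under the name origin
git remote add origin "$work/remote.git"
git remote -v

# Upload the current branch: git push <remote-name> <branch-name>
git push -q origin "$branch"
```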

Using the SSH protocol, we can connect and authenticate to remote servers and services. With SSH keys, we can connect to GitHub without supplying our username or password at each visit

Summary

We have made it, coders! To summarize our tutorial, I created this simple cheat sheet to help you recall the basic git commands we have learned.

git config: get and set repository or global options
git init: create an empty git repository or reinitialize an existing one
git status: show the working tree status
git add: add file contents to the staging area
git commit: record changes to the repository
git log: show commit logs
git checkout: switch branches or restore working tree files
git branch: list, create or delete branches
git merge: join two or more development histories together
git blame: show what revision and author last modified each line of a file
git remote: manage set of tracked repositories
git push: update remote refs along with associated objects

Notes

If you would like a summary of this basic git tutorial, just visit my GitHub page https://audhiaprilliant.github.io/git-version-control-system/. It would be great to share and learn together with everyone across the world! Thanks!

Thursday, October 22, 2020

Learn Basic Git Commands for Your Data Science Works