Git commands data scientists use on a day-to-day basis
Before you ask why on earth I am still using command-line Git in 2021, wait, I can explain! Yours truly is not the biggest fan of the command line, especially when so many perfectly good GUIs exist out there.
Lately, much of my work has shifted to Linux machines and I was looking for a Git extension or GUI that supports a remote Jupyterlab server. Having wasted many a precious hour following this issue here, I finally decided to buckle up and learn a few commands that might be useful for any given scenario as a data scientist. I am also hoping these will come in handy in server environments where a GUI is not an option and the Git command line is the only choice.
More than a cheat sheet, I would like to think of this article as a list of hypothetical what-would-you-do scenarios, one that I am probably going to use more than any of you (although if it does help you, well … the more the merrier! :) ). Some scenarios are basic and I am sure you know how to git init your way through them; others are more context-specific, although I am sure you can relate to them.
This is by no means an exhaustive list, just some of the more recurring themes in my day-to-day work environment.
Disclaimer: The article is written for an audience with some basic knowledge of Git. By basic I mean, if you own a Github account AND don’t think I am speaking some alien language when I say things like ‘check out a branch’ and ‘commit a change’, you are all set!
Terms and Terminology
Throughout this article, whenever I refer to local changes I mean changes you are making locally on the copy of the project you have on your computer. Remote changes mean you are making changes to the copy of the project that exists on the Github server.
As a rule of thumb, play all you want with the local copy but be very careful when modifying the remote copy with your local changes.
With all that taken care of, let’s navigate through some common scenarios beginners might face and how to resolve them.
P.S: These scenarios are in no particular order.
Scenario 1
The company started a new project and you want to be in on it so that you can start contributing to it.
Once you have been added as a collaborator by one of the team members, all you have to do is open Git Bash or the Mac terminal, do a cd Desktop or cd Downloads depending on whether you want to create this local repo in your Desktop or Downloads folder, and then do:
git clone https://github.com/V-Sher/medium_demo.git
Here I am trying to clone one of the (private) repos I created for this article but feel free to replace it with your own remote URL. The remote URL to the repo of your interest can be found on the project’s Github page (see the text highlighted in blue in the image below).
Scenario 2
Create your own branch and start working on it and sharing with colleagues.
It’s always good practice to create a branch and start work on it rather than messing with the main branch. Before doing so, better check all the available local branches first:
git branch
The branch you are currently on will have a * next to it. In my case, it seems I am on main and I would like to create a branch called agenda off of this branch.
git checkout -b agenda
Here agenda is the name of my branch, but feel free to pick any name that you like. Running this command will make agenda your current working branch, and if you run the git branch command again you will see that the * has shifted.
From here, I am going to add two files to the medium_demo folder on my local machine, called agenda.py and fake_agenda.py, then do a git add . to add both files to the staging area, and then commit these changes using git commit -m "Adding a real and fake agenda".
Now my local copy of the repo has two extra files which I must push to remote using:
git push -u origin agenda
Here origin is the name Git uses for the remote repo on Github (P.S. we will be discussing this in more detail towards the end of the article) and -u means a tracking connection is established between the ‘local’ agenda branch and the ‘remote’ agenda branch.
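If you would like to rehearse this flow without touching a real Github repo, here is a minimal sketch. It assumes a throwaway temp folder, with a local bare repo standing in for Github; the sandbox paths and the Demo User identity are placeholders made up for this sketch.

```shell
set -e
# Throwaway sandbox: a local bare repo plays the role of the Github remote (assumption).
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git symbolic-ref HEAD refs/heads/main     # make sure the first branch is called main
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"

# Give main one commit so there is something to branch from.
echo "# medium_demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q -u origin main

# The Scenario 2 workflow itself.
git checkout -q -b agenda
echo 'print("real agenda")' > agenda.py
echo 'print("fake agenda")' > fake_agenda.py
git add .
git commit -q -m "Adding a real and fake agenda"
git push -q -u origin agenda

# Ask Git which remote branch the local agenda branch is linked to.
upstream="$(git rev-parse --abbrev-ref 'agenda@{upstream}')"
echo "agenda tracks: $upstream"
```

Because of the -u flag, later pushes from agenda can be a plain git push.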
Now going back to the Github page for your repo, you can see the changes there which are ready to become part of a pull request (PR):
You can go ahead and click on the Compare & pull request button to create a PR and assign your colleague/senior as a Reviewer to it.
Once your colleague has reviewed it, they might leave their own comments on this PR, something like this:
As you can see, the reviewer is telling us we must remove the unnecessary fake_agenda.py file. Let’s see how to do this in Scenario 3.
Scenario 3
Remove a file you added by mistake as part of the pull request.
Clearly, we didn’t want the manager to see the fake agenda we were creating for this project. To fix this we need to remove the file fake_agenda.py from the PR.
If we do this:
git rm fake_agenda.py
you’ll notice it completely removes the file from your working directory. That’s not what we want (or at least not what I would want). We wanted to delete it from PR but NOT from our local repo.
Instead, let’s bring the file back first:
git reset -- fake_agenda.py
git checkout -- fake_agenda.py
Now to remove it as part of our PR and make it an untracked file:
git rm --cached fake_agenda.py
To commit this change:
git commit -m "the file fake_agenda is gone from the repository"
Finally, push this commit just like last time:
git push -u origin agenda
If you go back to the Github page of the PR, you’ll see that another commit was added describing how the unnecessary file was removed.
Also, to confirm this file is no longer a part of the PR, see the Files changed tab and it should say 1 (previously it was 2 because of the additional fake_agenda.py file).
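To convince yourself that git rm --cached only stops tracking the file and leaves it on disk, here is a small sketch you can run in a scratch folder. The temp repo and the Demo User identity are assumptions for the demo; the file names and commit messages mirror the ones above.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

echo 'print("real agenda")' > agenda.py
echo 'print("fake agenda")' > fake_agenda.py
git add .
git commit -q -m "Adding a real and fake agenda"

# Stop tracking the file but keep the copy in the working directory.
git rm -q --cached fake_agenda.py
git commit -q -m "the file fake_agenda is gone from the repository"

tracked="$(git ls-files)"
echo "tracked files: $tracked"
git status --short    # fake_agenda.py now shows up as untracked (??)
```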
Scenario 4
Get the remote changes onto your local machine while you are working on some other branch.
While you are busy working on your own branch (branched off of the main branch) locally, things might change on the main branch in the remote repo. This happens quite often as more and more people work on a project, each on their own branch: after a while their branches get merged into the main branch (once they have passed all the checks and reviews, of course).
For instance, the senior data scientist in your team decided to update the Readme file in the main ‘remote’ branch.
But these changes will not be present in your ‘local’ main branch. Hence, you should get hold of those changes to keep your local main branch up to date (after all, this is the branch from where you will be creating new branches in the future). To do so, we first check out the main branch and then pull in the changes.
git checkout main
git pull
Now if you open the Readme file, you will see the additional line there. This means your local main branch is all caught up.
However, if you go back to the branch you were working on locally, i.e. agenda:
git checkout agenda
and open the Readme file, it will still be the same as before, i.e. no changes. Ideally, I want to have all changes in local main incorporated as part of the branch I am working on so I know I have the most up-to-date files for the project.
More generally speaking, to bring a local branch X up to speed with the local branch Y, make sure you have checked out X first (in our case X would be the local agenda branch and Y would be main) and then do a merge:
git checkout agenda
git merge main
All your previous files in the agenda branch would still be intact (to confirm go and manually check the files), with some new additions w.r.t. changes introduced by the senior data scientist.
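Here is a rough, self-contained sketch of that merge. One assumption to flag: the senior’s Readme update is simulated with a local commit on main instead of a git pull from Github, and --no-edit is added only so the merge does not stop to open an editor. Paths and identity are placeholders.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

echo "# medium_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Your own work on the agenda branch.
git checkout -q -b agenda
echo 'print("agenda")' > agenda.py
git add agenda.py
git commit -q -m "Adding the agenda"

# Stand-in for git pull on main: the senior's Readme update arrives as a new commit.
git checkout -q main
echo "An extra line from the senior data scientist" >> README.md
git commit -q -a -m "Update README"

# Bring agenda (X) up to speed with main (Y).
git checkout -q agenda
git merge -q --no-edit main

cat README.md    # now contains the senior's extra line
ls               # agenda.py is still there
```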
At times, the merging process may not be as smooth as you would have hoped for. What does that even mean? It means there might be a few issues to resolve (caused by two people changing the same file on the exact same line) before the merging process can be completed. Let’s see how to fix them in the next scenario.
Scenario 5
How do I resolve merge conflicts?
Continuing from Scenario 4, let’s assume you and the senior data scientist both changed the same file at the exact same line. In this case, merging from the last scenario will prompt an error like this:
The message is quite self-explanatory and tells us it couldn’t merge README.md because both my senior and I were trying to modify the exact same line. Hence, I must resolve the conflict before the merge can complete. To do this, open the file in an editor (my personal preference is VScode) and you will see the issues that need to be resolved between <<<<<<< and >>>>>>>.
It appears both of us were trying to modify the last line (I wanna say apple and she wants to say orange). At this point, you can either accept the incoming changes, stick with the current changes, or do something else. Once you are happy with the result (and the conflict markers are gone), save and close the file. Go back to the terminal, add and commit this change and finally push it:
git add README.md
git commit -m "Apples and oranges issue averted"
git push
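If you want to see a conflict up close before it happens on a real project, the sketch below manufactures one in a scratch repo and then resolves it. It is only an illustration: the sentences in README.md are made up, there is no remote (so no git push at the end), and the "resolution" simply keeps both fruits.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

printf 'I like fruit\n' > README.md
git add README.md
git commit -q -m "Initial commit"

# Two branches change the very same line.
git checkout -q -b agenda
printf 'I like apples\n' > README.md
git commit -q -a -m "Apples"
git checkout -q main
printf 'I like oranges\n' > README.md
git commit -q -a -m "Oranges"

# The merge stops with a conflict, so a non-zero exit status is expected here.
git checkout -q agenda
git merge main || echo "merge stopped: conflict in README.md"
markers="$(grep -c '^<<<<<<<' README.md)"
cat README.md    # shows both versions between the conflict markers

# Resolve by writing the final text, then add and commit as usual.
printf 'I like apples and oranges\n' > README.md
git add README.md
git commit -q -m "Apples and oranges issue averted"
```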
Going back to the Github page of the PR, you can see the new commits have been added as part of the PR.
Once your reviewer is happy with all the changes, she will merge the remote agenda branch into the main branch on remote and you can see it on the Github page.
Once merged, you no longer need the agenda branch; you can even delete it for all you care. From here on, I strongly recommend that anytime you want to work on some other aspect of the project, you create a new branch from the local main branch after making sure you have pulled the latest version of the remote main branch onto your local main branch.
Scenario 6
Bring back a colleague’s file you messed up, back to its original form.
Sometimes while working in your branch you end up messing up an existing file that perhaps you or your colleague created. For instance, let’s say I mistakenly changed the print statement in the agenda.py file from this
print("The agenda for this demo is as follows: Do A then B and finally C.")
to this
print("The agenda for this demo is as follows: Do A and finally C.")
and also committed these changes using
git add agenda.py
git commit -m "Messing up the file by removing B's info"
To bring it back to its original form (i.e. how it looked before the bad commit), simply do git checkout <commit_id> -- filename. Here the commit_id can be retrieved quickly if you go to the relevant file on the Github Web UI and pick its latest commit number (the bad commit has not been pushed, so the latest commit shown there is still the good one).
In our case the commit_id for the agenda.py file is 3c8ebf0. Hence to make it look just like it did in that commit, simply do
git checkout 3c8ebf0 -- agenda.py
followed by:
git add agenda.py
git commit -m "Restoring B's info in the agenda"
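The same restore can be tried end to end in a scratch repo. In this sketch the good commit id is captured with git rev-parse rather than copied from the Github Web UI, so good_commit plays the role of 3c8ebf0; the temp folder and identity are placeholders.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

echo 'print("The agenda for this demo is as follows: Do A then B and finally C.")' > agenda.py
git add agenda.py
git commit -q -m "Adding the agenda"
good_commit="$(git rev-parse --short HEAD)"    # stands in for 3c8ebf0

# Mess the file up and commit the mistake.
echo 'print("The agenda for this demo is as follows: Do A and finally C.")' > agenda.py
git add agenda.py
git commit -q -m "Messing up the file by removing B's info"

# Bring back the version of the file stored in the good commit.
git checkout "$good_commit" -- agenda.py
git add agenda.py
git commit -q -m "Restoring B's info in the agenda"

cat agenda.py
git log --oneline -- agenda.py    # a local way to look up commit ids for one file
```

As far as I can tell, git checkout <commit_id> -- filename already stages the restored file, so the git add is harmless rather than strictly needed.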
Scenario 7
A colleague invited you to work on her existing remote branch.
Very often, one of your colleagues who is already working on a branch, say featureB, will invite you to work on it. Let’s just assume for now this branch contains a single file featureB.py, in addition to the Readme file.
To start collaborating, you need to create a local featureB branch and establish some sort of tracking connection between the remote and local featureB branches so that you can push commits and make changes to them.
Firstly, make sure the colleague’s branch is present on the remote server (meaning your colleague has pushed her branch onto origin) and that you have access to it. To check which remote branches you have access to, do a git fetch followed by:
git branch -r
Seeing the branch my colleague wants me to chime in on, i.e. origin/featureB, we will simply do:
git checkout --track origin/featureB
This will create a local branch called featureB and you should now see the file featureB.py in your local repo. Next, I am going to go ahead and make some simple changes to this file (basically add my thoughts as a junior data scientist) and push it to the remote featureB branch using git push origin featureB. This is what the file looks like on remote after my changes were added:
Because you both could be working on the file simultaneously and making changes to it, I would highly recommend doing git fetch followed by git pull every time before you start working on the featureB branch (git pull runs a fetch itself, so the separate fetch is mostly for peace of mind). As always there might be some conflicts which you need to resolve and once you are done:
git add featureB.py
git commit -m "Resolved conflict"
git push origin featureB
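For a dry run of the whole scenario, the sketch below plays both roles on one machine. Everything about the setup is an assumption for the demo: a local bare repo stands in for Github, and the colleague and me folders are two separate clones of it.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git -C "$sandbox/remote.git" symbolic-ref HEAD refs/heads/main

# The colleague's side: she creates featureB and pushes it to the remote.
git init -q "$sandbox/colleague"
cd "$sandbox/colleague"
git symbolic-ref HEAD refs/heads/main
git config user.name "Colleague"          # placeholder identity
git config user.email "colleague@example.com"
git remote add origin "$sandbox/remote.git"
echo "# medium_demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin main
git checkout -q -b featureB
echo 'print("feature B")' > featureB.py
git add featureB.py
git commit -q -m "Start feature B"
git push -q origin featureB

# Your side: clone, find her branch, and track it locally.
git clone -q "$sandbox/remote.git" "$sandbox/me"
cd "$sandbox/me"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
git fetch -q
git branch -r                             # the list includes origin/featureB
git checkout -q --track origin/featureB

echo '# my thoughts as a junior data scientist' >> featureB.py
git commit -q -a -m "Add my thoughts"
git push -q origin featureB
```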
Scenario 8
How to fetch a PR into the local repo for reviewing?
There will be instances when you are assigned as a ‘reviewer’ for a PR, meaning your job is to ensure everything works by analyzing a colleague’s branch. As part of the reviewing process, one can simply scan through the code on the Github Web UI itself but I generally prefer to make a local read-only copy of the PR and run it at my end.
P.S.: A read-only copy means you cannot push changes to this PR, which is ideal because we should not be pushing changes to a PR created by someone else, only the author of the PR should be making changes.
To do so, use git fetch origin pull/PULL_REQUEST_ID/head:NEW_BRANCH_NAME, followed by git checkout NEW_BRANCH_NAME.
P.S. Everything you see in capitals in the above commands is something you need to provide:
- PULL_REQUEST_ID can be retrieved from the Github page of the PR. It is usually a number following the # symbol. For us, it is 4.
- NEW_BRANCH_NAME can be anything you want to call the read-only copy of the PR. I like to call mine review_***.
Once you have both these things in your kitty, go ahead and do:
git fetch origin pull/4/head:review_featureb1
git checkout review_featureb1
On the review_featureb1 branch, you can review as much as you would like. Once you are done, you can either tell the author what changes have to be made by leaving reviews on the Github page of the PR, or you can make those changes yourself and push them to a newly created branch using git push -u origin patch-test_featureb1.
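There is no real PR in a scratch setup, so the sketch below fakes one: Github exposes each PR under a ref named refs/pull/<id>/head, and here that ref is created by hand on a local bare repo with git update-ref. Treat this purely as a simulation under that assumption; the folder names and identity are placeholders.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git -C "$sandbox/remote.git" symbolic-ref HEAD refs/heads/main

# The PR author's side: push main and the featureb1 branch.
git init -q "$sandbox/author"
cd "$sandbox/author"
git symbolic-ref HEAD refs/heads/main
git config user.name "PR Author"          # placeholder identity
git config user.email "author@example.com"
git remote add origin "$sandbox/remote.git"
echo "# medium_demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin main
git checkout -q -b featureb1
echo 'print("feature B")' > featureB.py
git add featureB.py
git commit -q -m "Work for the PR"
git push -q origin featureb1
pr_sha="$(git rev-parse HEAD)"

# Simulation only: create the ref that Github would publish for PR #4.
git -C "$sandbox/remote.git" update-ref refs/pull/4/head "$pr_sha"

# The reviewer's side: fetch the PR into a local review branch.
git clone -q "$sandbox/remote.git" "$sandbox/reviewer"
cd "$sandbox/reviewer"
git fetch -q origin pull/4/head:review_featureb1
git checkout -q review_featureb1
git log --oneline -n 1
```

The review branch has no upstream configured, which is why a plain git push from it does not land on the PR.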
Even though I mentioned that no one other than the PR author should be making changes to the PR, there will be times when you would like to change someone’s PR directly (for instance, a very small typo needs fixing) and make those changes part of the PR.
In that case, you need to first get the name of the branch from which the PR was created (let’s call it <PR-branch-name>). It will usually be available on the PR page on the Github Web UI and will read like this: <repo-author> wants to merge X commits into main from <PR-branch-name> (see Figure 2 above).
In our case, it seems the PR was created from featureb1. Let’s check out this branch using one of the following methods:
(a) git checkout featureb1 (if you are tracking featureb1 locally)
OR
(b) steps described in Scenario 7 (if you are not tracking that branch locally).
Afterward, you make the desired changes and finally push to the branch of the same name on the remote using git push origin featureb1 so that it becomes part of the PR.
Note: Having explained both methods of reviewing, I would strongly recommend making a new branch (rather than modifying existing PR) because it gives someone a chance to review your work (before merging), and this way you are not messing up someone’s PR unknowingly.
Scenario 9
How can I push my local changes to remote repo?
The short answer is — it depends on your scenario.
If you are in something like Scenario 2 i.e. working on a branch that you created locally, (meaning it doesn’t exist yet on remote), you can do
git push -u origin <my_branch_name>
which automatically creates a branch of the same name on the remote, and where <my_branch_name> should be replaced by the name of the branch you created locally.
Alternatively, if you are in Scenario 7, i.e. working locally on a branch that is being tracked remotely, then you can do:
git push origin <remote_branch_name>
where <remote_branch_name> should be replaced by a branch name that exists on the remote repo and is visible when you run the command git branch -r.
Scenario 10
How to push an empty directory on Github?
To see how Git handles empty directories, let’s do a quick setup.
Check out any branch using git checkout and try creating an empty folder e_folder in your local repo.
Next, in the terminal try:
git status
In an ideal world, git should have prompted you to add this newly created folder to the staging area but nothing like that happens. Now go ahead and add any file (say, img.jpeg) to e_folder so that it’s no longer empty:
If you were to do a git status now, you would see a prompt telling you to add the folder using the git add command.
In a nutshell, Git will only track non-empty directories by default — which on most occasions poses no problem for us. However, there will be times when we would like to push an empty directory. For instance, I like to have an empty directory as a placeholder to tell people where they can upload their custom images when using my model.
To explicitly tell Git to track our empty directory e_folder, first, let’s go ahead and manually remove the img.jpeg file from it. Then go to the Git command line:
cd e_folder
touch .gitignore
vi .gitignore
Once you are in the vi editor:
- Press i on your keyboard
- Paste (or Ctrl+V) the following lines in there [Credits: Stackoverflow]
# Ignore everything in this directory
*
# Except this file
!.gitignore
- Press esc on your keyboard
- Type :wq and press Enter
On the git terminal, go back one step to the medium_demo directory using cd .., followed by
git status
You will see that git is now ready to add your empty folder to the staging area.
Go ahead and add it, commit it and push it:
git add e_folder
git commit -m "placeholder for images"
git push
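If vi is not your thing, the same .gitignore can be written straight from the shell. Below is a minimal sketch in a scratch repo (no remote, so no push); it uses printf in place of the editor, which is my substitution, and the temp folder and identity are placeholders.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

mkdir e_folder
before="$(git status --porcelain)"        # empty: Git does not see the empty folder

# Same four lines as above, written without opening an editor.
printf '%s\n' \
  '# Ignore everything in this directory' \
  '*' \
  '# Except this file' \
  '!.gitignore' > e_folder/.gitignore

git add e_folder
git commit -q -m "placeholder for images"

# Anything dropped into the folder later stays ignored.
touch e_folder/img.jpeg
after="$(git status --porcelain)"
git ls-files
```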
Bonus Scenario 1
I checked out a branch X, made some changes in there, and saved them using File -> Save. Then I checked out some other branch Y and I can see the same changes I made in X. Why is it that changes made on one branch show up on every other branch?
Check out the long answer on Stackoverflow.
In a nutshell, if you do not commit the changes you make in a branch using git commit, the changes come along with you when you switch branches. This means you are at risk of unintentionally mixing them into your other branches when you check them out. So commit as much as you can, as often as you can, especially when making some pretty major modifications to the code!
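You can watch this happen in a scratch repo. The sketch below assumes two branches X and Y that start from the same commit and a made-up file notes.txt; the edit follows you to Y while it is uncommitted and stays on X once it is committed.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"

echo "original" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
git branch Y
git checkout -q -b X

# Saved (File -> Save) but not committed.
echo "edited on X" >> notes.txt
git checkout -q Y
carried="$(git status --porcelain)"       # the edit came along to Y
echo "status on Y: $carried"

# Go back, commit on X, and switch again: now Y is clean.
git checkout -q X
git commit -q -a -m "Edit made on X"
git checkout -q Y
cat notes.txt
```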
Bonus Scenario 2
My coworker is working on a Git repo and I want to push/pull/review some of the files from their repo.
Remember what I said about the origin? It is the name of our remote Github repo. To check which remote Github repos you have access to:
git remote -v
So far we have worked mainly on one repo which was named origin, but let’s say I also wanted to work on some other remote repositories hosted on Github. How about adding the Github repo that my coworker ABC created for one of her medium articles? I am sure I can help review some of her content. Its remote URL is https://github.com/ABC/coworker_medium_repo.git.
The first thing to do is add this remote repository, i.e. establish some sort of connection, and set a name for it. I cannot name my coworker’s repo origin as well because a remote with that name already exists and Git will refuse to add it with a 'remote origin already exists' error. Hence, it’s best to use a name corresponding to the repo. I tend to name them origin_***.
git remote add origin_medium https://github.com/ABC/coworker_medium_repo.git
Now if you run git remote -v in the terminal, you will see all the remotes you have access to:
Similarly, you can keep on creating references to coworkers’ repos using their repo URL.
Note: Take a look at Figure 1 if you need a refresher on how to locate the repo URL.
git remote add project1_origin http://github.com/XYZ/project1.git
git remote add project2_origin http://github.com/MNO/project2.git
git remote add project3_origin http://github.com/PQR/project3.git
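To get a feel for how remote names behave, here is a small sketch that needs no network access. Two local bare repos stand in for my repo and the coworker’s repo (an assumption for the demo), and the second git remote add origin shows why a name cannot be reused.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/mine.git"                  # stand-in for my Github repo
git init -q --bare "$sandbox/coworker_medium_repo.git"  # stand-in for the coworker's repo
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"

git remote add origin "$sandbox/mine.git"
git remote add origin_medium "$sandbox/coworker_medium_repo.git"
git remote -v

# Reusing a name that is already taken is rejected by Git.
if git remote add origin "$sandbox/coworker_medium_repo.git" 2>/dev/null; then
  clash="accepted"
else
  clash="rejected"
fi
echo "second remote named origin was $clash"
remote_count="$(git remote | wc -l | tr -d ' ')"
```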
Now let’s say your coworker has made a PR on Github for project 2 and you must review it. The steps to do so have been discussed in Scenario 8 (make sure you have the appropriate PULL_REQUEST_ID). The only change you need to make this time is pointing to the correct remote, i.e. the name you gave that repo:
git fetch origin_medium pull/PULL_REQUEST_ID/head:REVIEW_FEATURENAME
git checkout REVIEW_FEATURENAME
You can even fetch one of their branches and track them locally using:
git fetch origin_medium new_agenda
git checkout new_agenda
where new_agenda is one of the remote branches on my coworker’s repo. Next, make some changes, then add, commit and push. You know the drill now:
git commit -a -m "Some input from V-Sher's end as well"
git push
Note: Keep in mind that in our example there exists a PR made by the coworker from the new_agenda branch, and any changes we just pushed to this branch will also be a part of the PR (see figure below). If this is not intentional, please make a local copy of the PR, create a new branch from it, and push that to the coworker’s repo.
You can even pull one of the coworker’s remote branches into your own local branch. That is, I am going to check out my local branch agenda and try to pull my coworker’s second_agenda branch into it:
Note: Frankly speaking, I do not know when I will ever use this since the commit history of the two projects will be unrelated. If you have ideas, let me know!
git checkout agenda
git pull origin_medium second_agenda --allow-unrelated-histories
There are some conflicts but nothing that we cannot resolve using steps from Scenario 5.
Parting words of wisdom
- You can never do enough git fetch and git status: the former informs you of new developments in the remote repo and the latter alerts you of uncommitted files in your local repo.
- Make it a habit to git commit as soon as you make some significant changes to your code. (Nope, File -> Save is not sufficient.)
- If remote tracking is enabled*, git push (and git pull) will push (and pull) changes from the currently checked out branch to the corresponding branch on the remote. If it is not enabled, an error will be thrown and you must fix it by explicitly stating the repo name and the branch name to which you wish to pull/push.
Note: I am a bit more paranoid than most people so you will always find me explicitly stating the repo name and branch name. For instance, git push origin_name branch_name.
*: to check if remote tracking is enabled for a branch, try git remote show origin. This will show each local branch and its linked remote branch.
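Another way to check tracking for a single branch, sketched below in a scratch setup, is to ask Git for the branch’s upstream directly. The bare repo standing in for Github and the branch name no_upstream_yet are made up for this example.

```shell
set -e
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/medium_demo"
cd "$sandbox/medium_demo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity for the sandbox only
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"

echo "# medium_demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q -u origin main                # -u links local main to origin/main

git checkout -q -b no_upstream_yet        # never pushed, so nothing is linked

git branch -vv                            # linked branches show [origin/...] next to them
main_upstream="$(git rev-parse --abbrev-ref 'main@{upstream}')"
if git rev-parse --abbrev-ref 'no_upstream_yet@{upstream}' >/dev/null 2>&1; then
  tracked="yes"
else
  tracked="no"
fi
echo "main tracks $main_upstream; no_upstream_yet tracked: $tracked"
```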
Congrats on making it this far…
I truly believe you’ll be able to get by with these Git commands in your day-to-day. For those of you who want to learn much more complex commands (for scenarios that might someday arise at your workplace), I would say set up two dummy Github accounts — you need to play the role of both the collaborators — and start practicing your crazy once-in-a-lifetime scenarios.
As always, if there’s an easier way to do some of the things I mentioned in this article, please do let me know.
Until next time :)