Arguably one of the best things you can do before starting a PhD is invest time in learning how to properly use version control. With version control, you can track, save, and revert changes to any kind of project. There are several options available, but I’m partial to Git & GitHub. Even if you never touch a piece of code, version control is very helpful.
I found a lot of information about Git & GitHub confusing. The documentation is written for software engineers and people who are immersed in writing code. I am not that person, and a lot of the information isn’t relevant to my situation. This guide is written specifically for PhDs. I doubt I am leveraging these tools’ full functionality, but this is a good place to start if you’ve never heard of version control before or were overwhelmed by the documentation.
Version control is important because projects never work in a linear format. Regardless of whether you’re writing a paper or writing code, things change. Version control allows you to track those changes and keep notes on the logic behind why you did the things you did.
If you’re writing a paper, you might sometimes find yourself deleting paragraphs that you later need. With version control, you can save these paragraphs and revert to earlier writing if necessary.
With code, you can add or remove features without worrying about damaging existing, working code. I did not understand the full functionality of version control until I started working on cartography. A lot of cartography is trial and error (or at least it has been for me) and I got tired of having files named “Code_final” and “Code_final_final” or worse: “Really_final_version.” Version control lets me branch off sections of my code, add or remove functionality, without damaging my main code.
Repo/Repository: These are like directories or folders on your computer. It’s where all the information about a project is stored.
Git: is a version control system that allows you to track and revert changes incrementally. This means that as you make changes on a project, you can describe the changes and save them. If you make a change that you don’t like or that broke something, you can revert to a previous version. Git also allows you to create branches or forks, which are helpful for trying out changes without affecting the main code.
GitHub: is a cloud-based Git service. It allows you to access your Git repositories from anywhere. You can use Git without GitHub, but you can’t use GitHub without Git.
I might use Git & GitHub interchangeably which drives software engineers bonkers.
There are three main ways of using Git and GitHub:
1) GitHub Website
2) GitHub Desktop
3) Command line
Purists will tell you that you should always use the command line while others will tell you it doesn’t matter. Honestly, I use all three ways based solely on whatever I’m feeling in the moment. I’m going to focus on command line usage because it will help illuminate what’s happening behind the scenes when you use either the website or the Desktop app.
I’ve also written about using GitHub Desktop and the GitHub website. The information across all three is the same. My suggestion is to choose whichever one you’re most comfortable with and ignore anyone who makes you feel bad because you prefer to use an app over using the command line. There are more important things to worry about.
Repos are just directories. If you’re even mildly organized, you probably create a new folder on your hard drive for each project. If you don’t, I highly recommend it because it does help keep things neat and orderly. When you use either Git or GitHub, your repos will be stored in their own folders. This can get unwieldy very quickly. I like to store all my repos in one folder located in My Documents on my hard drive. I just created a folder called GitHub and anytime I initialize or clone a repo, I make sure I do so from this folder. Here is what my GitHub folder looks like:
As you can see, it’s located in the Documents folder on my PC. Each folder you see listed here is a repo in my GitHub and I don’t save anything else to this folder except initialized or cloned repos.
To use GitHub from the command line you can use the Command Prompt or PowerShell on Windows, Terminal on Mac, or a console emulator like Cmder or Cygwin. When you download Git it will also install Git CMD, which does the same thing as any other console or terminal app. They all operate in the same way and do the same things, so pick whichever makes you happy.
Before getting into the important commands, the first thing you’ll want to do is navigate to the folder you’re keeping your repos in. For me it’s the GitHub folder I created in my Documents above. There are two ways you can do this.
1) On Windows, open the folder in File Explorer, type cmd in the navigation bar, and hit enter. It will open a command prompt in that folder.
2) Type cd followed by the path to the folder you want to open. cd stands for change directory.
Your computer should allow tab complete. This just means you can start typing Doc, hit the tab key on your keyboard, and it should autocomplete. Sometimes you’ll need to go up a directory or two. Say, for example, I’m in the GitHub folder, but I want to go to the Documents folder using the command line. To do so, I would put cd .. (two dots), which moves up one directory.
One dot stands for the folder I’m currently in. If I were to put only one dot, as in cd ./ followed by a folder name, I could navigate from the Documents folder to another folder inside the Documents folder.
These are just helpful navigation options so you don’t always have to put the entire path to a folder.
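Here is a small sketch of those navigation shortcuts that you can paste into a terminal to see them in action. It assumes a Bash-style terminal (Terminal on Mac, or Git Bash on Windows), and the folder names are made up for the demo. Everything happens inside a throwaway temporary folder, so nothing in your real Documents folder is touched.

```shell
# Hypothetical scratch folders, created only for this demo
demo="$(mktemp -d)"
mkdir -p "$demo/Documents/GitHub" "$demo/Documents/Papers"

cd "$demo/Documents/GitHub"   # jump straight to a folder using its full path
pwd                           # print the folder you are in now

cd ..                         # two dots: go up one level, into Documents
pwd

cd ./Papers                   # one dot: start from the folder you are in
pwd
```

Running pwd after each step is a cheap way to confirm you ended up where you expected before you run any Git commands.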
Once you’re in the folder where you want to keep the repo, you’re ready to use Git for version control.
The Git documentation is full of commands, options, flags, and various other usage information. I’d say a solid 99% of it, I’ve never used. I’m only going to go over the commands I use on a daily basis because they’re all you really need to get started. I’m going to cover what each term means, then add basic workflows at the end.
Note: Anything in [brackets] is where you would enter information. You would omit the brackets. For example, git init using-github would initialize a new repo called using-github.
git init [project-directory-name]: You only need to initialize a repo once. Each repo in your GitHub has to have a unique-to-you name. That is, you can't have two repos named Project. Good repo names are short and descriptive. If I were to initialize a Git repo for this project I would name it using-github because that is what this guide is about.
git clone [project-url]: If you have previously initialized a repo (either on the website or through GitHub Desktop) and want to add it to your current hard drive you will need to clone (copy) it. This is useful if you use a desktop and a laptop. I often initialize repos on my desktop using git init and then, once they are pushed to GitHub, clone them to my laptop using git clone. You can also use git clone to clone repos created by other people. If you find a cool project on GitHub.com that you want to modify for your own use, you can clone it to your hard drive using git clone [project-url].
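To see what these two commands actually do on disk, here is a minimal sketch you can run in a throwaway folder. It assumes a Bash-style terminal with Git installed. One stand-in to be aware of: instead of a GitHub URL, the clone step points at a local folder, which Git also accepts. With a real project you would put the repo's URL there.

```shell
scratch="$(mktemp -d)"     # throwaway folder for the demo
cd "$scratch"

# Create a new, empty repo in a folder called using-github
git init using-github
ls -a using-github         # the hidden .git folder is where Git keeps its records

# Copy an existing repo. A local path stands in for [project-url] here.
git clone using-github using-github-copy
ls -a using-github-copy
```

Git will warn that you cloned an empty repository, which is expected here because nothing has been committed yet.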
After you initialize or clone a repo, you can work on your project the same way you normally would. Just navigate to the folder where your repo is located and create or modify any files you need to. When you’re done working for the day, you’re ready to stage and commit your changes.
There are four steps you’ll always follow when working with a repo:
First you create or change a file, then stage it, commit it, and push it.
git add [filename or directory name]: You'll use git add to stage individual files or directories. You just add the path to the file or directory after git add, then hit enter. Once staged, you'll add a message, then commit.
git add -A: will stage every change in the repo at once, including new, modified, and deleted files.
git commit -m "A useful message here": Before you can push your files, you have to commit them. Committing just records a change on the local hard drive. It's like taking a snapshot of a project in its current state. Messages allow you to describe the change and its justification. Git won't let you commit without adding a message so here is a good guide on writing good commit messages.
git push origin [branch-name]: Usually, [branch-name] will be main. Sometimes it will be a different branch. Origin is a way of referencing a specific repo (by default, the one you cloned from). This way you don't have to constantly refer to its URL. Pushing sends your commits to GitHub. You can then access those changes on the website or on another computer by pulling the changes to the machine.
git pull origin [branch-name]: Again, [branch-name] will usually be main. Pull brings changes you pushed to GitHub onto the local machine. Think of this as syncing changes. It's especially useful when using more than one computer.
One of the best things about Git & GitHub is that it allows branching and forking. Branching is the most useful because it creates a temporary space for you to work without threatening the integrity of the main project.
git checkout -b [branch name]: will allow you to create and switch to a new branch.
git branch [branch name]: is how you create a branch without switching to it.
git checkout [branch name]: allows you to switch to an existing branch.
If you like what you did in a branch and want to merge it with main so that you can keep the updated version of the project, you’ll need to switch to the main branch and then merge.
git checkout main
git merge [branch name]
If you don’t like what you did in a branch and want to delete it entirely, here’s how:
## to delete the branch on your local machine
## (-d refuses if the branch has unmerged commits; use -D to force the delete)
git branch -d localBranchName
## to delete it from the remote repo
git push origin --delete remoteBranchName
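One thing that can trip you up when throwing away an experiment: the lowercase -d is a safe delete, so Git refuses if the branch contains commits that were never merged. The sketch below shows that refusal and the capital -D that forces the delete. It assumes a Bash-style terminal and Git 2.28 or newer (for --initial-branch). The repo, branch, and file names are hypothetical, and so is the demo identity, which is set for this throwaway repo only.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
git init --initial-branch=main delete-demo      # hypothetical repo name
cd delete-demo
git config user.name "Demo User"                # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "working version" > analysis.txt
git add -A
git commit -m "Add working analysis"

git checkout -b bad-idea                        # branch off to experiment
echo "experiment that did not pan out" >> analysis.txt
git add -A
git commit -m "Try an experiment"

git checkout main
# Safe delete: fails because bad-idea has a commit that main does not
git branch -d bad-idea || echo "Git refused: bad-idea has unmerged work"
# Forced delete: removes the branch and abandons its unmerged commit
git branch -D bad-idea
cat analysis.txt                                # main was never affected
```

Because -D discards work that exists nowhere else, it is worth double-checking the branch name before hitting enter.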
These are basic workflows that you can use.
Initializing a repo:
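git init [repo name]
cd [repo name]
## do the work you need to
git add -A
git commit -m "a useful message"
## make sure the branch is called main (older Git versions start on master)
git branch -M main
## connect to an empty repo you created on GitHub (only needed once)
git remote add origin [project-url]
git push origin main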
Cloning an existing repo:
git clone https://github.com/liz-muehlmann/Election_Guides.git
cd Election_Guides
## do the work you need to
git add -A
git commit -m "a useful message"
git push origin main
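If you want to rehearse the two workflows above without a GitHub account, here is a sketch that fakes the whole round trip on one machine. The assumptions: a Bash-style terminal, Git 2.28 or newer (for --initial-branch), and a local "bare" repo standing in for GitHub. The folder names desktop and laptop, the file name, and the demo identity are all made up. If you have already set your name and email in Git, the config lines are unnecessary.

```shell
scratch="$(mktemp -d)"
cd "$scratch"

# Stand-in for GitHub: a bare repo holds history but no working files
git init --bare --initial-branch=main fake-github.git

# "Desktop": initialize, work, stage, commit, push
git init --initial-branch=main desktop
cd desktop
git config user.name "Demo User"                 # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$scratch/fake-github.git" # origin now means the stand-in
echo "First draft" > notes.txt
git add -A
git commit -m "Add first draft of notes"
git push origin main

# "Laptop": clone, work, stage, commit, push
cd "$scratch"
git clone fake-github.git laptop
cd laptop
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "Second paragraph" >> notes.txt
git add -A
git commit -m "Add second paragraph to notes"
git push origin main

# Back on the "desktop": pull to sync the laptop's change
cd "$scratch/desktop"
git pull origin main
cat notes.txt
```

The only difference with real GitHub is what origin points to: a URL on github.com instead of a folder on your hard drive.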
Create a branch and merge it with main:
git checkout -b [branch name]
## work on the branch until you are happy, then commit
git add -A
git commit -m "a useful message"
git checkout main
git merge [branch name]
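The branch workflow is easier to trust once you have watched main stay untouched while you experiment. This is a minimal sketch, assuming a Bash-style terminal and Git 2.28 or newer (for --initial-branch). The repo name, branch name, file contents, and demo identity are all hypothetical.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
git init --initial-branch=main map-project    # hypothetical repo name
cd map-project
git config user.name "Demo User"              # placeholder identity, this repo only
git config user.email "demo@example.com"

echo "base map" > map.txt
git add -A
git commit -m "Add base map"

git checkout -b try-new-colors                # create the branch and switch to it
echo "new color palette" >> map.txt
git add -A
git commit -m "Try a new color palette"

git checkout main                             # switch back to main
cat map.txt                                   # main still has only the base map

git merge try-new-colors                      # bring the branch's commit into main
cat map.txt                                   # now the palette line is here too

git branch -d try-new-colors                  # merged, so the safe delete works
git branch                                    # list the branches that remain
```

Deleting the branch after merging is optional. It just keeps the branch list short.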
Those are really the only commands you need to know to use Git & GitHub. I can’t stress enough how important version control is when programming - especially when working on cartography projects. One problem you’ll inevitably run into is file size. GitHub will warn you if your file is over 50MB and it will reject your push if any of your files are over 100MB.
There are two ways around GitHub’s file limits which I go over in my post about DVC (Data Version Control). If you’re only using Git & GitHub to version control your writing, you don’t really need to worry about large file sizes. However, once you start working with datasets the file size limit gets in the way quickly.
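Since a rejected push is annoying to untangle after the fact, it can help to look for oversized files before you commit. The sketch below is one way to do that with the standard find command, assuming a Bash-style terminal. It builds a throwaway folder with a hypothetical 51 MB file so there is something to find. In a real repo you would only need the find line, run from the repo's top folder.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
mkdir -p data
echo "small file" > notes.txt
# Make a 51 MB dummy file (51 blocks of 1,048,576 bytes) for the demo
dd if=/dev/zero of=data/big-dataset.bin bs=1048576 count=51 2>/dev/null

# List every file over 50 MB, skipping Git's own .git folder
big_files="$(find . -path ./.git -prune -o -type f -size +50M -print)"
echo "$big_files"

rm data/big-dataset.bin                       # clean up the dummy file
```

Changing +50M to +100M would list only the files GitHub rejects outright.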
If you’re uncomfortable with the command line, GitHub has a desktop application that you can use which is very user friendly. You can learn about GitHub Desktop and the GitHub Website through my other posts.
Liz | 17 Sep 2022