Notes

dvc (data version control)


Version control is helpful when you want to track your project’s changes. However, GitHub has one major (yet understandable) shortcoming: file size limits. The free version of GitHub will warn you if a file is over 50MB and will reject your push outright if a file is over 100MB. This is a huge problem when you’re working with shapefiles (.shp), which contain the geographic coordinates necessary for cartography. “Officially” there are three ways around GitHub’s file size limits, but I have a clear favorite.

  1. Pay for GitHub Pro
  2. LFS (Large File Storage)
  3. DVC (Data Version Control)

My favorite is using DVC, so I’m only going to go over #1 and #2 briefly.
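Before choosing an option, it can help to see which files in a repo would actually trip those limits. The following is a minimal sketch, assuming a POSIX-style shell (Git Bash on Windows, or the terminal on Mac or Linux). The function name list_large_files is only an illustrative name, not part of Git or GitHub, and it treats 1MB as 1024KB.

```shell
# List files larger than a threshold (in MB) under a directory,
# skipping Git's own internal files in .git.
list_large_files() {
    dir="${1:-.}"
    threshold_mb="${2:-50}"
    # With the "k" suffix, find measures -size in 1024-byte blocks.
    find "$dir" -type f ! -path '*/.git/*' -size +"$((threshold_mb * 1024))"k
}

# Files that would trigger the 50MB warning in the current folder
list_large_files . 50
```

Calling it with 100 instead of 50 lists only the files that GitHub would reject outright.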

GitHub Pro

GitHub has a paid version, which is available here. It starts at $48 a year. Both the free and Pro versions have a repo limit of 2GB, the Team version limits it to 4GB, and Enterprise limits files to 5GB.

You still can’t push files larger than 100MB, so unless you’re doing a lot of programming I don’t think the Pro version is worth it. I would suggest checking if your school or employer offers an enterprise account. My school does not.

LFS [Large File Storage]

LFS is a way around GitHub’s file size limits when pushing changes. To install LFS just go to the website and download it.

You will need to use the command line to use LFS, but it’s fairly straightforward. Once you’ve downloaded and installed LFS, you’ll need to open a console or terminal. PowerShell on Windows, Terminal on Mac, the cmder console emulator, or any other command line interface will work.

First you’ll need to navigate to wherever you store your repos and install LFS:

    cd Documents\Github
    git lfs install

The basic logic of LFS is to track certain file extensions so that Git stores only a small pointer for each matching file, while LFS stores the actual file contents.

For example, I know that .shp files are very large. I would track all .shp files using LFS to get around the file size limit. To track an extension with LFS you use:

    git lfs track "*.shp"

The * before .shp tells LFS to track any file, regardless of what it’s called, that has the .shp extension.

The final step is to stage the .gitattributes file, which the track command creates (or updates) for you.

    git add .gitattributes
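To double-check what LFS is tracking, you can read the patterns back out of .gitattributes. The track command writes one line per pattern, of the form `*.shp filter=lfs diff=lfs merge=lfs -text`. Here is a small sketch, assuming a POSIX-style shell; lfs_patterns is a hypothetical helper name, and running `git lfs track` with no arguments reports similar information.

```shell
# Print the patterns that a .gitattributes file routes through LFS.
lfs_patterns() {
    attributes_file="${1:-.gitattributes}"
    # Lines written by `git lfs track` look like:
    #   *.shp filter=lfs diff=lfs merge=lfs -text
    # so the pattern is the first field of every line mentioning filter=lfs.
    awk '/filter=lfs/ { print $1 }' "$attributes_file"
}

# Only run the check if the current folder has a .gitattributes file
if [ -f .gitattributes ]; then
    lfs_patterns .gitattributes
fi
```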

Once you follow those steps, you can use Git and GitHub as you normally would.

The problem with this option is that it does not get around GitHub’s limit on total repo size. You’ll still be limited to repos no larger than 2GB. This is a problem if you’re doing a complex cartography project with multiple .shp files or you’re mapping data from something like the CES. The only way around this that I’ve found is to use DVC [data version control].

DVC [Data Version Control]

DVC is an open-source version control system for data. It works in parallel with Git.

To start, you need to download and install DVC from here. Once you do, navigate to the repo that has large files and initialize DVC:

    cd Documents\Github\[repo-name]
    dvc init

Once you initialize DVC, several files are created. You’ll need to commit the new files and push them to GitHub:

    git add -A
    git commit -m "Initialize DVC"
    git push origin main

To actually track data, you’ll need to add it to DVC. It’s done in much the same way as using Git on the command line.

To add an individual file you’d use:

    dvc add data/filename.ext

where data/filename.ext is the path to the file or directory you want to track.
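Note that `dvc add` does not put the data file itself under Git. Instead it writes a small pointer file next to it (here data/filename.ext.dvc) and adds the data file to a .gitignore in the same folder. Those two small files are what Git should track. Here is a minimal sketch of that step, assuming a POSIX-style shell, a repo where `dvc init` has already been run, and a target that is a single file; track_with_dvc is a hypothetical helper name, not a DVC command.

```shell
# Hand a data file to DVC, then commit the pointer files that DVC generates.
track_with_dvc() {
    target="$1"
    # DVC writes "$target.dvc" and lists the data file in a .gitignore beside it
    dvc add "$target"
    # Git tracks only the small pointer file and the updated .gitignore
    git add "$target.dvc" "$(dirname "$target")/.gitignore"
    git commit -m "Track $target with DVC"
}

# Example call (not run here):
#   track_with_dvc data/filename.ext
```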

Before you can push changes to DVC, you’ll need to add a remote storage site. My school gives us a lot of storage through Google and Microsoft, so I connected my DVC to my Google Drive.

If, like me, you want to use Google Drive I suggest making a folder called “DVC” on your Google Drive and using it to store your DVC files.

To use the folder you created, you’ll need to add the remote using the command line.

    dvc remote add -d [remote-name] gdrive://[folder-ID]
    dvc push

[remote-name] is the name you want to give the remote, which DVC uses as a shortcut for the drive URL. It’s easiest just to name it “myremote” but it’s up to you.

The [folder-ID] is the jumble of letters and numbers at the end of the URL. To get it, you’ll need to navigate to the folder you want to use on Google Drive in your browser. At the top you’ll see the URL, and you’ll need to copy the bit after the last /.
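If you would rather not pick the ID out of the URL by hand, the same "bit after the last /" rule can be sketched in a POSIX-style shell. This is only an illustration: folder_id_from_url is a hypothetical helper name, the URL below is a placeholder rather than a real folder, and the sketch assumes the copied URL may carry a trailing query string such as ?usp=sharing.

```shell
# Extract the Google Drive folder ID: the last path segment of the folder URL.
folder_id_from_url() {
    url="$1"
    url="${url%%\?*}"   # drop any query string, e.g. ?usp=sharing
    url="${url%/}"      # drop a trailing slash, if present
    printf '%s\n' "${url##*/}"   # keep everything after the last /
}

folder_id=$(folder_id_from_url "https://drive.google.com/drive/folders/EXAMPLE_FOLDER_ID?usp=sharing")
echo "$folder_id"   # prints EXAMPLE_FOLDER_ID

# The result is what goes after gdrive:// when adding the remote:
#   dvc remote add -d myremote "gdrive://$folder_id"
```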

Repo Options

You only need to do this step once.

When you’re ready to use DVC all you do is add the file you want and push it.

    dvc add [filename]
    dvc push

Conclusion

DVC is feature-rich and useful. It lets you version your data alongside your code without running into GitHub’s file size limits: the repo only holds small pointer files, while the data itself lives in the remote storage. The documentation for DVC is well written, so I suggest looking at it if you want to use another storage site like Microsoft OneDrive or Dropbox.

With all the data management and version control sorted you can turn, in earnest, to cartography in R!

Liz | 17 Sep 2022

tags: dvc, tutorial, version control