Reproducible code is about transparency of research. Transparency not only enables others to understand and replicate your research; it also helps you understand your own work. Large projects (such as a dissertation) require many scripts that merge, clean, visualize, and analyze data. Organization, version control, README files, and other tools will help you remember which files and scripts you used, which plots you included, and so on, without requiring a forensic analysis of your folders. So here are my current tips and tricks for making this possible:
N.B., Some of my examples below are in R; however, I am confident that they can be replicated in Python, STATA, or your program of choice.
Tip #1: Organize files in folders that make sense to you and others. For inspiration, here’s a peek into my organizing system:
For me, it makes sense to organize folders into stages of the research process.
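As a sketch, a layout along these lines might look like the following (folder and file names are illustrative, not my exact setup):

```text
dissertation/
├── construct/
│   ├── 0-README.txt
│   ├── code/
│   │   ├── 1-county-level/    scripts, ordered by construction stage
│   │   └── 2-state-level/
│   └── tmp/                   intermediary (temporary) files
└── analysis/
    ├── 0-README.txt
    ├── code/
    └── drafts/                a PDF of every manuscript version
```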
- In the above example, the “construct” folder includes all the necessary scripts and files to construct the dataset(s) to be used in the analysis.
- Within the “code” folder, I have files organized by the stage of construction. For instance, I create two datasets that differ in their level of aggregation, and each level gets its own folder. The numbered labeling indicates the order of the construction process.
- I have a “tmp” folder to organize the intermediary (or temporary) files needed to construct the dataset.
- In the “analysis” folder, I have all the files that I need for the analysis and drafting. Using RMarkdown makes this a little easier, but I do have an additional file for code that hasn’t made it into the draft. I save a PDF of every version of the manuscript in the “drafts” folder.
- The trick is to balance the length of file paths against the intelligibility of your organization, two goals that can at times be at odds with each other.
Tip #2: Include README files for your folders. README.txt files will help anyone else accessing your files (or even your future self) understand your organization system. What I include in this file depends on the folder it describes: a description of the folder contents, any notes about labeling that might be relevant, which data/script files are the most important, etc. I have found this helpful even for myself when I’m trying to make sense of what I did a year ago. (I include a “0-” prefix in the file name because I like having these README files at the top of the folder.)
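As a sketch, a README for a hypothetical “construct” folder might read something like this (contents invented):

```text
0-README.txt

Contents: scripts and intermediary files used to construct the
  analysis dataset(s).
Labeling: numbered prefixes ("1-", "2-", ...) give the order in
  which the scripts must be run.
Key files: code/ holds the construction scripts; tmp/ holds
  intermediary files that can be regenerated from code/.
```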
Tip #3: Use Master Scripts that link to subscripts. With larger projects, one long master script can be unwieldy and make changing the file annoying (especially with a lot of merges, cleaning, and data manipulation). Here is what I suggest:
- Break down your scripts into specific tasks.
- Have separate scripts to prep, merge, clean, and manipulate your data.
- Note the purpose of the file and any relevant information at the top of the file. You may also consider noting the date of the last modification and the name of the last person to modify it (if working with others).
- Write a Master Script to allow you to run all of your subscripts in one shot when compiling your data.
- In R, all this requires is the source() function, which runs one script from inside another.
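The same pattern works in Python. Below is a self-contained sketch: it first writes two tiny hypothetical subscripts (in a real project these would be your existing prep/merge/clean files), and the master script itself is just an ordered list plus a loop:

```python
# Master-script pattern in Python (the R version uses source()).
import pathlib
import runpy

# Demo scaffolding only: create two tiny hypothetical subscripts.
pathlib.Path("1-prep.py").write_text("print('prepping data')\n")
pathlib.Path("2-merge.py").write_text("print('merging data')\n")

# The master script: list the subscripts in the order they must run.
subscripts = ["1-prep.py", "2-merge.py"]
for script in subscripts:
    print(f"Running {script} ...")
    runpy.run_path(script, run_name="__main__")
```

When you change one stage of the construction, you re-run only that subscript; when you need a clean rebuild, you run the master once.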
Tip #4: Set a working directory, especially when working with others. There is nothing more frustrating than having to change every file path in a script because someone didn’t set a working directory. For readers who have never done this: setting a working directory lets you truncate all of the file paths that follow, so other users only need to change one line of code and the rest of the code will work without a hitch. It also makes the script more readable.
- In R: setwd("path/to/project")
- In Python: import os, then os.chdir("path/to/project")
- In STATA: cd "path/to/project"
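A minimal Python sketch of the pattern (the project path here is a stand-in; a temporary directory keeps the demo self-contained):

```python
# Set the working directory once at the top of a script, then use
# relative paths everywhere else.
import os
import tempfile

project = tempfile.mkdtemp()                 # stand-in for the project folder
os.makedirs(os.path.join(project, "construct", "tmp"))

os.chdir(project)                            # the one line collaborators change

# Every path below can now be relative to the project root:
with open(os.path.join("construct", "tmp", "merged.csv"), "w") as f:
    f.write("id,value\n")

print(os.path.exists("construct/tmp/merged.csv"))  # True
```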
Tip #5: Version Control. I have a few suggestions here, some of which are deliberately redundant.
- Back up regularly, and back up in a way that saves snapshots of your work over time. Don’t completely overwrite your files, in case you need to revert to a previous version. I back up in two ways: a) quarterly, on an external hard drive, and b) continuously, using Backblaze, a cloud-based backup service for $5 a month. When connected to the internet, Backblaze constantly syncs your files to the cloud, creating timestamped versions of every file as it changes. Whatever you use to back up, creating timestamped snapshots is crucial.
- Use GitHub. GitHub not only allows you to see the changes in your code but also has functionality to reconcile those differences. A tutorial to get started with GitHub is forthcoming!
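Until that tutorial arrives, here is a minimal command-line sketch of putting a project under Git, the version-control tool that GitHub hosts remotely (folder and file names are invented, and the demo creates its own scratch folder):

```shell
# Create a hypothetical project folder with one script in it.
mkdir -p demo-project/construct
echo 'print("hello")' > demo-project/construct/1-prep.py

# Initialize a repository and record a first snapshot of the code.
git init -q demo-project
git -C demo-project config user.email "you@example.com"  # needed on a fresh setup
git -C demo-project config user.name "Your Name"
git -C demo-project add construct/
git -C demo-project commit -q -m "Initial commit of construction scripts"

# Later, review history and pending changes before committing again:
git -C demo-project log --oneline
git -C demo-project status --short
```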
Bonus Tip: Spring Cleaning. Every once in a while, prune your folders to make sure they are not filled with old files that don’t reflect the present state of the project. Trying to find the right file in a sea of outdated files can be time-consuming. In cases where I think I need to actually delete a file, I find that moving it to a folder labeled “delete” helps this process. The file sits in the “delete” folder for a period of time until I feel confident that there will be no negative consequences to my housekeeping. Also, regular backups will keep you from ever completely losing a file, while not cluttering up your workflow.
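The “delete” staging trick is just a move; for example (file names hypothetical, with the stale file created here so the demo is self-contained):

```shell
# Park a stale file in a "delete" folder instead of removing it outright.
touch 2019-old-analysis.R          # stand-in for an outdated script
mkdir -p delete
mv 2019-old-analysis.R delete/
ls delete/                         # the file waits here until you're sure
```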