Do we want a ‘GitHub’ for Data Management?
Denver, Colorado. Recently, sitting in a session at the #DLFforum on how researchers should do data management, a researcher (turned data librarian) said that the easiest way to get researchers on the path to doing data management (i.e. enabling their data to be shared with other researchers) was to get them to create a ReadMe file for their dataset (e.g. an Excel spreadsheet). What I found interesting is that this is exactly what I tell most first-time open source developers when reviewing their code: create a ReadMe.md file that acts as the table of contents for your code.

This got me thinking… In a way, ‘code’ is very data-like. That is to say, most code is made up of ‘key:value’ pairs, where methods are assigned actions (or libraries assigned objects, etc.). So why shouldn’t the data management community borrow these transferable data management skills from open source code developers? My ‘low-hanging fruit’ list of things I look for in open source code (which could easily be applied to data) is:
- 1.) A ReadMe.md markdown file that lists a basic installation guide as well as environment variables => why not get researchers to provide a basic ReadMe file for their data, describing what each of their columns means, ideally with a link explaining each column header? E.g. if you have a column of data that lists salinity in parts per million, then link to the Wikipedia page on that measurement in your ReadMe file.
- 2.) Inline code comments => why not get researchers to use the ‘add comment’ feature on individual cells (rather than putting comments off in some randomly chosen cell within the spreadsheet)? E.g. in Google Sheets you simply highlight the cell, press Ctrl+Alt+M, and you can add as many comments as you like to each cell, i.e. “this measurement is an outlier and we think this is due to someone sneezing on the instrument”.
- 3.) Use of version control branch ‘pulling’ => when other researchers want to reuse the data, they first create a ‘branch’ of the data which they can work on in their own ‘data repository branch’; once they have worked on their branch of the data, they can submit a ‘pull request’ back to the originating researcher, which would enable the original researcher to work with the secondary dataset while still maintaining a version of the original dataset. If the original researcher doesn’t want to pull that branch into the main data trunk, then the branch just remains a branch (there are plenty of examples of the latter on GitHub).
- 4.) The ability to cite and show who has worked on the code, e.g. GitHub has a great feature where you can see who has worked on the code via various visualisations of commits over time. This, in turn, can be cited to show how mature the code is and how actively it is being used.
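As a sketch of item 1.), a dataset ReadMe might look something like this (the dataset, column names and link are hypothetical examples, not from any real study):

```shell
# Sketch: writing a minimal ReadMe.md for a dataset (contents hypothetical).
cat > README.md <<'EOF'
# Lake Salinity Samples

One row per water sample.

## Columns
- sample_id: unique identifier for the sample
- date: collection date (YYYY-MM-DD)
- salinity_ppm: salinity in parts per million
  (see https://en.wikipedia.org/wiki/Parts-per_notation)
EOF
```

A few lines like these are often enough for a stranger to make sense of a spreadsheet without emailing its author.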
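The branching workflow in item 3.) and the contributor history in item 4.) can both be sketched with plain git on a data file. Everything here (repository name, file names, values, researcher identity) is a made-up example, and it assumes a reasonably recent git (2.28+ for `init -b`):

```shell
# Sketch: a pull-request-style flow on a CSV file (all names hypothetical).
git init -q -b main data-repo && cd data-repo
git config user.email "jo@example.com" && git config user.name "Jo Researcher"

echo "sample_id,salinity_ppm" >  samples.csv
echo "1,35000"                >> samples.csv
git add samples.csv && git commit -qm "Original dataset"

# A second researcher branches the data and works on their own copy...
git checkout -q -b reanalysis
echo "2,34200" >> samples.csv
git commit -qam "Add re-measured samples"

# ...and the originating researcher can accept the 'pull request' (a merge),
# or simply leave the branch alone.
git checkout -q main
git merge -q reanalysis

# Item 4.): who has worked on this data, and how often -- the raw history
# behind GitHub's contributor visualisations.
git shortlog -s -n --all
```

The original dataset survives untouched in the history even after the merge, which is exactly the property researchers want when someone else reworks their data.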
There are about a dozen other key tools, skills and methods that developers use, but what about the above as a basic ‘minimum data management plan’ (MDMP) set of skills we could recommend to any researcher starting down the path of research data management?
In short, do we want a GitHub for Data Management?
Or, can researchers learn the code sharing skills of developers to make their research data more reusable?