Do we want a ‘GitHub’ for Data Management?

Denver, Colorado. Sitting recently in a session at the #DLFforum on how researchers should do data management, I heard a researcher (turned data librarian) say that the easiest way to get researchers on the path to doing data management (e.g. enabling their data to be shared with other researchers) is to get them to create a ReadMe file for their dataset (e.g. an Excel spreadsheet). What I found interesting is that this is what I tell most first-time open source developers when reviewing their code: create a file which acts as the table of contents to their code. This got me thinking… In a way, ‘code’ is very data-like. That is to say, most code is made up of ‘key:value’ pairs, where methods are assigned actions (or libraries assigned objects, etc.). So why shouldn’t the data management community be borrowing these transferable skills from open source developers? My ‘low-hanging’ list of things I look for in open source code (which could easily be applied to data) is:

  • 1.) A markdown file that lists the basic installation guide as well as environment variables => why not get researchers to provide a basic ReadMe file on their data explaining what each of their columns means, ideally with a link defining each column header, e.g. if you have a column of data that lists salinity in water in parts per million, then link to the Wikipedia page on that measurement in your ReadMe file.
  • 2.) Inline code comments => why not get researchers to use the ‘add comment’ feature on individual cells (rather than putting comments off in some randomly chosen cell within the spreadsheet), e.g. in Google Docs spreadsheets you simply highlight the cell and press Ctrl+Alt+M, and you can add as many comments as you like to each cell, e.g. “this measurement is an outlier and we think this is due to someone sneezing on the instrument”.
  • 3.) Use of version control ‘branching’ => when another researcher wants to reuse the data, they first create a ‘branch’ of the data which they can work on in their own ‘data repository branch’; once they have worked on their branch of the data, they can then submit a ‘pull request’ back to the originating researcher, which would enable the original researcher to work with the secondary dataset while still maintaining a version of the original dataset. If the original researcher doesn’t want to pull that branch into the main data trunk, then the branch simply remains a branch (there are plenty of examples of the latter on GitHub).
  • 4.) Ability to cite and show who has worked on the code => GitHub has a great feature where you can see who has worked on the code via various visualisations of commits over time; this in turn can be cited to show how mature the code is, e.g. developers can see how actively the code is being used and maintained.
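To make point 3 concrete, the branch-and-merge workflow can be sketched with plain git, treating a dataset exactly like source code. This is a minimal sketch; the repository name, file names and salinity values below are all made up for illustration:

```shell
# A hypothetical dataset under version control: a CSV plus its ReadMe.
mkdir water-quality && cd water-quality
git init -q
git config user.email "researcher@example.org"
git config user.name "A Researcher"

# Point 1: a ReadMe acts as the table of contents for the data.
printf 'salinity_ppm: salinity in parts per million\n' > README.md
printf 'site,salinity_ppm\nA,35.2\n' > samples.csv
git add README.md samples.csv
git commit -q -m "Initial dataset plus ReadMe"
git branch -m main   # normalise the default branch name

# Point 3: a second researcher branches the data and works on it...
git checkout -q -b site-b-additions
printf 'B,34.9\n' >> samples.csv
git commit -q -am "Add site B salinity measurements"

# ...then the original researcher can merge the 'pull request' back
# into the main data trunk, or simply leave the branch as a branch.
git checkout -q main
git merge -q site-b-additions
```

After the merge, `git log` records who contributed which measurements and when, which is essentially the attribution trail point 4 asks for.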

There are about a dozen other key tools, skills and methods that developers use, but what about the above as a basic ‘minimum data management plan’ (MDMP) set of skills we could recommend to any researcher starting down the path of research data management?

In short, do we want a GitHub for Data Management?

Or, can researchers learn the code sharing skills of developers to make their research data more reusable?


~ by dfflanders on November 5, 2012.

5 Responses to “Do we want a ‘GitHub’ for Data Management?”

  1. A couple of other things for the workflow/git repo:
    – use Google Refine or Stanford Data Wrangler in data cleaning processes and add the project file and cleaned data sets via commits, leaving the original datafile as is?
    – use tools such as R/Studio for creating analyses, charts etc /as code/ that can be added to the repo, along with outputs (rendered charts; PDF/markdown created HTML files etc); if you can find a way of identifying the version of the data used as input to the report/analysis generation, so much the better.

    I think that there is a great opportunity for creating robust workflows that can be of benefit not just to academic community, but also in “open data” and policy development work, as well as “data driven journalism”.

  2. David,
    As usual, quite provocative and entertaining. How would you respond to researchers asking what’s in it for me? Especially if this means they will have to take on new tasks, taking time away from their primary task, namely research.

  3. Interesting – you’re basically arguing for structured data and unstructured metadata. Which is kind of the opposite of repositories/registries like Research Data Australia, AEKOS etc., which emphasise high-quality metadata but treat the actual data as an opaque blob.

    I think my next question is not “is this approach correct” but, “in which subdomains of eResearch will this approach work well”? (I definitely think the “link to a Wikipedia page” approach is more practical than “link to a semantic web ontology”, at least in the short term.)

  4. Reblogged this on Sutoprise Avenue, A SutoCom Source.

  5. +1 to Tony’s comment re R et al. The R notebook idea takes the ReadMe one step further: it can provide a literate programming commentary on how raw data turns into clean data and gets processed. If this is done well, it will provide a form of structured metadata.
