Software Carpentry: What to Say at a Bootcamp, After It's All Said and Done

I'm going to be teaching my first-ever bootcamp in the coming days. My biggest fear (and there are many!) is that the participants will walk out of the room on the final afternoon and go straight back to their old habits, never taking the time to incorporate what they've learned into their daily workflow. To avoid this, I've planned a rousing concluding address to explain why the content taught at a Software Carpentry bootcamp is so important. It goes something like this...

Consider the workflow of Nigel Nobootcamp, which is typical of many researchers today:

Nigel collects some data and stores it on a machine that is occasionally backed up by his department.
He then writes or modifies a few small programs (which also reside on his machine) to analyse that data, with assistance from Google, Stack Overflow and a helpful colleague.
Once he has some results that look right (which alleviates his fear that there might be bugs in his code), he writes them up and submits the paper. He includes his data—a growing number of journals require this—but doesn't include his code.
Time passes.
The journal sends him reviews written anonymously by a handful of other people in his field. He revises the paper to satisfy them, during which time he also modifies the scripts he wrote earlier, and resubmits.
More time passes.
The paper is eventually published. It includes a link to an online copy of his data, but the paper itself is behind a paywall: only people who have personal or institutional access are able to read it.

If you read any research-related blogs or scan the editorial sections of prominent journals like Science or Nature [e.g. 1, 2, 3], you'll know that "open" and "web" are two of the major buzzwords right now. This is because there is now a whole range of web-based tools that make it possible for this typical workflow to be more collaborative, transparent, reproducible and reliable. All scientists would agree that these are desirable improvements, but most lack the computing skills needed to fully participate in this open science revolution. It is for precisely this reason that the Mozilla Science Lab, whose mission is to help researchers use the open web to shape science's future, is now the organisational home for Software Carpentry. They recognise that in order for these tools and practices to make it out of buzzword editorials and into the default workflow of everyday scientists like Nigel, the entire profession needs to upskill. Graduates of a two-day Software Carpentry bootcamp, whether they immediately realise it or not, have all the basic skills and knowledge needed to transform the way they do research.

In comparison to Nigel, consider the workflow of Betty Bootcamp:

The data that Betty collects are stored in an open access repository called Dryad as soon as they're collected, and given their own DOI.
She creates a new version control repository on GitHub to hold her work.
As she does her analysis, she tracks changes to her scripts (and some output files) within that repository. She also uses the repository for her paper; it's then the hub for collaboration with her colleagues.
With respect to data analysis, Betty regularly attends her nearest SciPy conference and is on the mailing lists of her favourite data analysis packages. She is also using best practices like unit testing and defensive programming, which give her confidence that her results are reliable.
When she's happy with the state of her paper, she posts a version to a preprint server called arXiv to invite feedback from peers.
Based on that feedback, she posts several revisions before finally submitting her paper to a journal.
The published paper includes links to her preprint and to her code and data repositories.

At an individual level, Betty's workflow is likely to produce higher-quality research. Her use of unit testing (as opposed to Nigel's "looks right" approach) reduces the chance of errors, while the ease with which others can collaborate on her work (GitHub) and comment on it (arXiv) means it is likely to have been exposed to a higher level of scrutiny before journal submission. The data analysis process was probably also far less painful for Betty. Aside from the best practices she learned to make her code more readable and easier to manage, her bootcamp experience has given her the confidence to interact with the programming community and further develop her skills (SciPy, mailing lists). When she runs into trouble, which still happens regularly, her support network goes far beyond random Google searches and friendly colleagues down the hall.
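To make the difference between "looks right" and "is tested" concrete, here is a minimal Python sketch of what unit testing and defensive programming might look like in one of Betty's scripts. The function mean_anomaly, its arguments and the numbers in the tests are hypothetical, invented purely for illustration; nothing here is taken from a real analysis.

```python
import math


def mean_anomaly(values, baseline):
    """Return the mean of `values` minus `baseline` (a hypothetical analysis step)."""
    # Defensive programming: fail loudly on inputs that would otherwise
    # produce a silently wrong answer.
    if len(values) == 0:
        raise ValueError("values must not be empty")
    if any(math.isnan(v) for v in values):
        raise ValueError("values must not contain NaN")
    return sum(values) / len(values) - baseline


def test_mean_anomaly_known_answer():
    # Unit test: a small case whose answer can be checked by hand.
    # The mean of 1, 2, 3 is 2, and 2 - 1.5 = 0.5.
    assert math.isclose(mean_anomaly([1.0, 2.0, 3.0], 1.5), 0.5)


def test_mean_anomaly_rejects_empty_input():
    # Unit test: the defensive check should reject an empty data set.
    try:
        mean_anomaly([], 0.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    test_mean_anomaly_known_answer()
    test_mean_anomaly_rejects_empty_input()
    print("all tests passed")
```

Run as a script, this executes both tests directly; a test runner such as pytest would also discover the test_ functions if the file name follows its conventions. The point is not the arithmetic but the habit: every time Betty changes the function, a few seconds of testing tell her whether it still gives the hand-checked answer.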

While improved research quality is a noble pursuit, in many cases the academic world places more emphasis on quantity. It's therefore noteworthy that Betty's workflow would probably also be more efficient. The key is that in both her programming (via unit testing and defensive programming) and her manuscript preparation (via reviews on GitHub and comments on the arXiv preprint) she identifies errors earlier than Nigel does. Anyone who has spent days tracking down a bug, only to eventually find it in a section of code written weeks or months ago, will know that catching errors early is the key to enhancing efficiency. Betty's programming skills are also superior to Nigel's (thanks to the bootcamp and her regular interaction with the programming community), which means she performs simple tasks faster and doesn't reinvent as many wheels.

Finally, at a broader community level it's important to note that Betty's research is more transparent than Nigel's. If other researchers want to reproduce and ultimately build on her work, they have open access to a description of her methodology (arXiv), her data (Dryad) and her code (GitHub). This will surely accelerate the discovery process in her field of research, and is probably pretty good for her citation statistics too!

Originally posted 2013-11-19 by Damien Irving in Bootcamps.