Software Carpentry: Links for Summer Interns

Our summer interns started today. Our first job is to define exactly what they'll be working on, so it seems like a good time to round up a few links on interesting topics. My apologies for those hidden behind paywalls...

Steve's Project Ideas

Social network analysis for scientists
Electronic lab notebooks

Reproducible Research

If I said, "I just got a really interesting result in the lab, but I didn't record the steps I took or the settings on the machine," no reputable journal would publish my paper. If I said, "I just got a really interesting computational result," most reviewers and editors wouldn't even ask whether I'd archived my code and the parameters I used, or whether that code would run on someone else's machine. Reproducible research (RR) is the idea of making computational science as trustworthy as experimental science by creating tools and working practices that allow scientists to re-create past results.

WaveLab and Reproducible Research
The Madagascar project
The Sweave project
Special issue of Computing in Science & Engineering on reproducibility
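
As a rough illustration of what "archiving the code and the parameters" could look like in practice, here is a minimal Python sketch. It is not taken from any of the projects linked above: the function names (make_run_record, simulate), the parameter names, and the record layout are all assumptions made up for this example.

```python
import hashlib
import json
import platform
import sys


def make_run_record(script_source: str, parameters: dict) -> dict:
    """Bundle the details someone would need to try to re-create a result.

    Hypothetical layout: a hash of the code that ran, the parameters
    it was given, and a little information about the machine.
    """
    return {
        "code_sha256": hashlib.sha256(script_source.encode("utf-8")).hexdigest(),
        "parameters": dict(parameters),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


def simulate(parameters: dict) -> float:
    """Stand-in for a real computation (an assumption for this sketch)."""
    return parameters["rate"] * parameters["steps"]


# In real use the source text would be read from the script file itself.
source_text = "def simulate(parameters): return parameters['rate'] * parameters['steps']"
params = {"rate": 0.5, "steps": 10}

record = make_run_record(source_text, params)
record["result"] = simulate(params)

# Store this next to the result so the two never get separated.
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

A record like this doesn't guarantee that the code will run on someone else's machine, but it does answer the first questions a reviewer could ask: which code, which parameters, and which environment.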

Data Provenance

The "provenance" of an object is the history of where it came from, and how it got here. The provenance of a piece of data is similar: what raw values is it derived from, and what processing was done to create it? Ideally, every piece of scientific software should track this automatically; in practice, very few do, and most scientists don't take advantage of the capability when it's there. That's changing, though, particularly as emphasis on reproducibility grows.

The Provenance Challenge: a series of competitions to benchmark provenance tools against one another.
Special issue of Concurrency and Computation: Practice & Experience reporting the results of the first challenge
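
To make the idea concrete, here is a small Python sketch of values that carry their own provenance: each one remembers where its raw inputs came from and what processing was applied. This is only an illustration, not how any of the Provenance Challenge tools work. The names Tracked, load, and apply_step are hypothetical, and so are the file names (nothing is actually read from disk).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Tracked:
    """A value plus the ordered history of how it was produced."""
    value: float
    history: Tuple[str, ...] = ()


def load(value: float, source: str) -> Tracked:
    """Wrap a raw value and note where it came from."""
    return Tracked(value, (f"loaded from {source}",))


def apply_step(name: str, func: Callable[..., float], *inputs: Tracked) -> Tracked:
    """Run func on the input values and extend the combined history."""
    combined = tuple(entry for item in inputs for entry in item.history)
    result = func(*(item.value for item in inputs))
    return Tracked(result, combined + (f"{name} -> {result}",))


# Hypothetical raw inputs; the file names are placeholders.
a = load(2.0, "sensor_a.csv")
b = load(3.0, "sensor_b.csv")

total = apply_step("sum", lambda x, y: x + y, a, b)
scaled = apply_step("scale by 10", lambda x: x * 10, total)

for entry in scaled.history:
    print(entry)
```

The final value can then answer both questions from the paragraph above: which raw values it is derived from, and what processing was done to create it.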

Science 2.0

Also called "computer-supported collaborative science", this is the idea of leveraging modern web-based collaboration tools to better connect scientists, their experiments, and their results. It encompasses a broad range of ideas, but "social networking for scientists" based on their interests is near the core, as is "open science" (the idea of making scientific results public in the same way as open source software or Creative Commons publications).

Overview article in Scientific American
Jon Udell's Internet Groupware for Scientific Collaboration may be several years old, but it's still prescient
Jean-Claude Bradley's blog
Cameron Neylon's personal blog (see, for example, his post on "FriendFeed for Scientists") and lab blog

Scientific Programming Environments

Compared to professional software developers, most scientists use fairly primitive programming environments, in part because they've been too busy learning quantum chemistry to learn distributed version control, and in part because software developers seem to go out of their way to make tools hard to set up and learn. Lots of people have tackled this from a variety of angles. Unfortunately, a lot of work to date has focused on supercomputing, which is sort of like studying modern medicine by focusing on heart surgeons...

Greg Wilson's "Where's the Real Bottleneck in Scientific Computing?"
Carver, Kendall, Squires, and Post's "Software Development Environments for Scientific and Engineering Software: A Series of Case Studies"
Matthews, Wilson, and Easterbrook's "Configuration Management for Large-Scale Scientific Computing at the UK Met Office" is an example of tools done right

Originally posted 2009-05-11 by Greg Wilson in Content.