Software Carpentry: Key Points

On the flight back from Vancouver yesterday, I finally did what I should have done eight months ago and compiled the key points from our core lesson content. The results are presented below, broken down by lesson and topic; going forward, we're going to use something like this as a basis for defining what Software Carpentry is, and what workshop attendees can expect to learn.

The Shell

What and Why

The shell is a program whose primary purpose is to read commands, run programs, and display results.

Files and Directories

The file system is responsible for managing information on disk.
Information is stored in files, which are stored in directories (folders).
Directories can also store other directories, which forms a directory tree.
/ on its own is the root directory of the whole filesystem.
A relative path specifies a location starting from the current location.
An absolute path specifies a location from the root of the filesystem.
Directory names in a path are separated with '/' on Unix, but '\' on Windows.
'..' means "the directory above the current one"; '.' on its own means "the current directory".
Most files' names are something.extension; the extension isn't required, and doesn't guarantee anything, but is normally used to indicate the type of data in the file.
cd path changes the current working directory.
ls path prints a listing of a specific file or directory; ls on its own lists the current working directory.
pwd prints the user's current working directory (current default location in the filesystem).
whoami shows the user's current identity.
Most commands take options (flags) which begin with a '-'.

Creating Things

Unix documentation uses '^A' to mean "control-A".
The shell does not have a trash bin: once something is deleted, it's really gone.
mkdir path creates a new directory.
cp old new copies a file.
mv old new moves (renames) a file or directory.
nano is a very simple text editor; please use something else for real work.
rm path removes (deletes) a file.
rmdir path removes (deletes) an empty directory.

Pipes and Filters

'*' is a wildcard pattern that matches zero or more characters in a pathname.
'?' is a wildcard pattern that matches any single character.
The shell matches wildcards before running commands.
command > file redirects a command's output to a file.
first | second is a pipeline: the output of the first command is used as the input to the second.
The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
cat displays the contents of its inputs.
head displays the first few lines of its input.
sort sorts its inputs.
tail displays the last few lines of its input.
wc counts lines, words, and characters in its inputs.

Loops

Use a for loop to repeat commands once for every thing in a list.
Every for loop needs a variable to refer to the current "thing".
Use $name to expand a variable (i.e., get its value).
Do not use spaces, quotes, or wildcard characters such as '*' or '?' in filenames, as they complicate variable expansion.
Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
Use the up-arrow key to scroll up through previous commands to edit and repeat them.
Use history to display recent commands, and !number to repeat a command by number.
Use ^C (control-C) to terminate a running command.

Shell Scripts

Save commands in files (usually called shell scripts) for re-use.
Use bash filename to run saved commands.
$* refers to all of a shell script's command-line arguments.
$1, $2, etc., refer to specified command-line arguments.
Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

Finding Things

Everything is stored as bytes, but the bytes in binary files do not represent characters.
Use nested loops to run commands for every combination of two lists of things.
Use '\' to break one logical line into several physical lines.
Use parentheses '()' to keep things combined.
Use $(command) to insert a command's output in place.
find finds files with specific properties that match patterns.
grep selects lines in files that match patterns.
man command displays the manual page for a given command.

Version Control with Subversion

Version control is a better way to manage shared files than email or shared folders.
The master copy is stored in a repository.
Nobody ever edits the master copy directly: instead, each person edits a local working copy.
People share changes by committing them to the master or updating their local copy from the master.
The version control system prevents people from overwriting each other's work by forcing them to merge concurrent changes before committing.
It also keeps a complete history of changes made to the master so that old versions can be recovered reliably.
Version control systems work best with text files, but can also handle binary files such as images and Word documents.

Basic Use

Every repository is identified by a URL.
Working copies of different repositories may not overlap.
Each change to the master copy is identified by a unique revision number.
Revisions identify snapshots of the entire repository, not changes to individual files.
Each change should be commented to make the history more readable.
Commits are transactions: either all changes are successfully committed, or none are.
The basic workflow for version control is update-change-commit.
svn add things tells Subversion to start managing particular files or directories.
svn checkout url checks out a working copy of a repository.
svn commit -m "message" things sends changes to the repository.
svn diff compares the current state of a working copy to the state after the most recent update.
svn diff -r HEAD compares the current state of a working copy to the state of the master copy.
svn log shows the history of a working copy.
svn status shows the status of a working copy.
svn update updates a working copy from the repository.

Merging Conflicts

Conflicts must be resolved before a commit can be completed.
Subversion puts markers in text files to show regions of conflict.
For each conflicted file, Subversion creates auxiliary files containing the common parent, the master version, and the local version.
svn resolve files tells Subversion that conflicts have been resolved.

Recovering Old Versions

Old versions of files can be recovered by merging their old state with their current state.
Recovering an old version of a file does not erase the intervening changes.
Use branches to support parallel independent development.
svn merge merges two revisions of a file.
svn revert undoes local changes to files.

Setting up a Repository

Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.
svnadmin create name creates a new repository.

Provenance

$Keyword:$ in a file can be filled in with a property value each time the file is committed.
Put version numbers in programs' output to establish provenance for data.
svn propset svn:keywords property files tells Subversion to start filling in property values.

Basic Programming

Basic Operations

Use '=' to assign a value to a variable.
Assigning to one variable does not change the values associated with other variables.
Use print to display values.
Variables are created when values are assigned to them.
Variables cannot be used until they have been created.
Addition ('+'), subtraction ('-'), and multiplication ('*') work as usual in Python.
Use meaningful, descriptive names for variables.

Creating Programs

Store programs in files whose names end in .py and run them with python name.py.

Types

The most commonly used data types in Python are integers (int), floating-point numbers (float), and strings (str).
Strings can start and end with either single quote (') or double quote (").
Division ('/') produces an int result when given int values: one or both arguments must be float to get a float result.
"Adding" strings concatenates them, multiplying strings by numbers repeats them.
Strings and numbers cannot be added because the behavior is ambiguous: convert one to the other type first.
Variables do not have types, but values do.
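
A minimal sketch of the points above, assuming Python 3 rather than the Python 2 these notes describe: in Python 3, '/' on two ints produces a float, and '//' is the operator that gives the whole-number result. All variable names are made up for illustration.

```python
# Values have types; variables are just names attached to values.
count = 7            # int
ratio = 7 / 2        # float in Python 3
whole = 7 // 2       # flooring division keeps an int result
label = 'mass'       # single or double quotes both work

# "Adding" strings concatenates them; multiplying by a number repeats them.
joined = label + "_kg"
repeated = "ab" * 3

# Strings and numbers cannot be added directly: convert one of them first.
as_text = label + str(count)
as_number = int("12") + count

# The same variable can later refer to a value of a different type.
thing = 1
thing = "one"
```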

Reading Files

Data is either in memory, on disk, or far away.
Most things in Python are objects, and have attached functions called methods.
When lines are read from files, Python keeps their end-of-line characters.
Use str.strip to remove leading and trailing whitespace (including end-of-line characters).
Use file(name, mode) to open a file for reading ('r'), writing ('w'), or appending ('a').
Opening a file for writing erases any existing content.
Use file.readline to read a line from a file.
Use file.close to close an open file.
Use print >> file to print to a file.
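
A rough sketch of the file operations above, assuming Python 3: there, open(name, mode) takes the place of file(name, mode), and print(..., file=handle) takes the place of print >> file. The scratch file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "readings.txt")

# Opening with 'w' erases any existing content.
writer = open(path, "w")
print("first line", file=writer)
print("second line", file=writer)
writer.close()

# Opening with 'a' appends instead of erasing.
writer = open(path, "a")
print("third line", file=writer)
writer.close()

reader = open(path, "r")
raw = reader.readline()                    # keeps the end-of-line character
clean = raw.strip()                        # strips surrounding whitespace
rest = [line.strip() for line in reader]   # the remaining lines
reader.close()
```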

Standard Input and Output

The operating system automatically gives every program three open "files" called standard input, standard output, and standard error.
Standard input gets data from the keyboard, from a file when redirected with '<', or from the previous stage in a pipeline with '|'.
Standard output writes data to the screen, to a file when redirected with '>', or to the next stage in a pipeline with '|'.
Standard error also writes data to the screen, and is not redirected by '>' or '|'.
Use import library to import a library.
Use library.thing to refer to something imported from a library.
The sys library provides open "files" called sys.stdin and sys.stdout for standard input and output.

Repeating Things

Use for variable in something: to loop over the parts of something.
The body of a loop must be indented consistently.
The parts of a string are its characters; the parts of a file are its lines.

Making Choices

Use if test to do something only when a condition is true.
Use else to do something when a preceding if test is not true.
The body of an if or else must be indented consistently.
Combine tests using 'and' and 'or'.
Use '<', '<=', '>=', and '>' to compare numbers or strings.
Use '==' to test for equality and '!=' to test for inequality.
Use variable += expression as a shorthand for variable = variable + expression (and similarly for other arithmetic operations).

Flags

The two Boolean values True and False can be assigned to variables like any other values.
Programs often use Boolean values as flags to indicate whether something has happened yet or not.

Reading Data Files

Use str.split() to split a string into pieces on whitespace.
Values can be assigned to any number of variables at once.
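
As a small illustration of both points, here is a sketch that splits one made-up record and unpacks the pieces into three variables at once. The record's layout is an assumption.

```python
# A made-up record from a whitespace-separated data file.
line = "2012-10-23  vancouver   17.5\n"

fields = line.split()            # splits on any run of whitespace
date, site, reading = fields     # assign three variables at once
reading = float(reading)         # the pieces are strings until converted
```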

Provenance Revisited

Put version numbers in programs' output to establish provenance for data.

Lists

Use [value, value, ...] to create a list of values.
for loops process the elements of a list, in order.
len(list) returns the length of a list.
[] is an empty list with no values.

More About Lists

Lists are mutable: they can be changed in place.
Use list.append(value) to append something to the end of a list.
Use list[index] to access a list element by location.
The index of the first element of a list is 0; the index of the last element is len(list)-1.
Negative indices count backward from the end of the list, so list[-1] is the last element.
Trying to access an element with an out-of-bounds index is an error.
range(number) produces the list of numbers [0, 1, ..., number-1].
range(len(list)) produces the list of legal indices for list.
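
The list operations above can be sketched as follows, assuming Python 3. There, range() produces a lazy sequence rather than a list, so list() is used to show its values. The data is invented.

```python
masses = [3.2, 4.1, 5.0]
masses.append(6.3)                 # lists are mutable: changed in place

first = masses[0]
last = masses[len(masses) - 1]
also_last = masses[-1]             # negative indices count from the end

indices = list(range(len(masses)))  # the legal indices for masses

# Indexing one past the end is an error.
try:
    masses[len(masses)]
    out_of_bounds_failed = False
except IndexError:
    out_of_bounds_failed = True
```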

Checking and Smoothing Data

range(start, end) creates the list of numbers from start up to, but not including, end.
range(start, end, stride) creates the list of numbers from start up to, but not including, end in steps of stride.

Nesting Loops

Use nested loops to do things for combinations of things.
Make the range of the inner loop depend on the state of the outer loop to automatically adjust how much data is processed.
Use min(...) and max(...) to find the minimum and maximum of any number of values.
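
One way to picture an inner loop whose range depends on the outer loop: the sketch below compares every pair of values exactly once by starting the inner loop just past the outer index. The values and variable names are hypothetical.

```python
values = [4, 9, 2, 7]

largest_gap = 0
pairs_checked = 0
for i in range(len(values)):
    # Start after i so that each pair is visited only once.
    for j in range(i + 1, len(values)):
        pairs_checked += 1
        gap = max(values[i], values[j]) - min(values[i], values[j])
        largest_gap = max(largest_gap, gap)
```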

Nesting Lists

Use nested lists to store multi-dimensional data or values that have regular internal structure (such as XYZ coordinates).
Use list_of_lists[first] to access an entire sub-list.
Use list_of_lists[first][second] to access a particular element of a sub-list.
Use nested loops to process nested lists.

Aliasing

Several variables can alias the same data.
If that data is mutable (e.g., a list), a change made through one variable is visible through all other aliases.
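
A short sketch of aliasing with a list. Making the copy with list(...) is one of several ways to get an independent list, and is shown here only for contrast.

```python
original = [1, 2, 3]
alias = original            # a second name for the same list
separate = list(original)   # a new list with the same contents

alias.append(4)             # visible through 'original' too
```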

Functions and Libraries

How Functions Work

Define a function using def name(...).
The body of a function must be indented.
Use name(...) to call a function.
Use return to return a value from a function.
The values passed into a function are assigned to its parameters in left-to-right order.
Function calls are recorded on a call stack.
Every function call creates a new stack frame.
The variables in a stack frame are discarded when the function call completes.
Grouping operations in functions makes code easier to understand and re-use.

Global Variables

Every function always has access to variables defined in the global scope.
Programmers often write constants' names in upper case to make their intention easier to recognize.
Functions should not communicate by modifying global variables.

Multiple Arguments

A function may take any number of arguments.
Define default values for parameters to make functions more convenient to use.
Defining default values only makes sense when there are sensible defaults.
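
A sketch of default parameter values, using a hypothetical rescale function for which "0.0 to 1.0" is a sensible default range.

```python
def rescale(values, low=0.0, high=1.0):
    """Linearly rescale values to span low..high (hypothetical helper).

    Assumes the values are not all equal, so the span is non-zero.
    """
    smallest = min(values)
    span = max(values) - smallest
    return [low + (high - low) * (v - smallest) / span for v in values]

unit = rescale([10, 15, 20])                  # both defaults used
percent = rescale([10, 15, 20], 0.0, 100.0)   # assigned left to right
doubled = rescale([10, 15, 20], high=2.0)     # override one default by name
```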

Returning Values

A function may return values at any point.
A function should have zero or more return statements at its start to handle special cases, and then one at the end to handle the general case.
"Accidentally" correct behavior is hard to understand.
If a function ends without an explicit return, it returns None.
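
The "special cases first, general case last" shape might look like this. Both functions are invented for illustration.

```python
def sign(number):
    # Special cases are handled up front...
    if number == 0:
        return 0
    if number < 0:
        return -1
    # ...and the general case comes last.
    return 1

def record(log, message):
    log.append(message)     # no explicit return, so the call yields None

log = []
outcome = record(log, "started")
```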

Aliasing

Values are actually passed into functions by reference, which means that they are aliased.
Aliasing means that changes made to a mutable object like a list inside a function are visible after the function call completes.

Libraries

Any Python file can be imported as a library.
The code in a file is executed when it is imported.
Every Python file is a scope, just like every function.

Standard Libraries

Use from library import something to import something under its own name.
Use from library import something as alias to import something under the name alias.
from library import * imports everything in library under its own name, which is usually a bad idea.
The math library defines common mathematical constants and functions.
The system library sys defines constants and functions used in the interpreter itself.
sys.argv is a list of all the command-line arguments used to run the program.
sys.argv[0] is the program's name.
sys.argv[1:] is everything except the program's name.
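
A brief sketch of the import forms and of slicing an argument list. The list named argv below is a made-up stand-in for sys.argv, so the example does not depend on how it is run.

```python
import math
from math import sqrt
from math import pi as PI

hypotenuse = math.hypot(3.0, 4.0)   # library.thing
root = sqrt(16.0)                   # imported under its own name
circumference = 2 * PI * 1.0        # imported under an alias

# Stand-in for sys.argv from a run like "python analyze.py first.txt second.txt".
argv = ["analyze.py", "first.txt", "second.txt"]
program_name = argv[0]
filenames = argv[1:]
```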

Building Filters

If a program isn't told what files to process, it should process standard input.
Programs that explicitly test values' types are more brittle than ones that rely on those values' common properties.
The variable __name__ is assigned the string '__main__' in a module when that module is the main program, and the module's name when it is imported by something else.
If the first thing in a module or function is a string that isn't assigned to a variable, that string is used as the module or function's documentation.
Use help(name) to display the documentation for something.
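
To illustrate relying on common properties rather than type tests, here is a sketch of a hypothetical count_lines function. It only assumes its argument can be looped over, so it accepts an open file, sys.stdin, or a plain list of strings. The docstrings are what help() would display.

```python
"""Line-counting helpers (a made-up module for illustration)."""
import io

def count_lines(source):
    """Return the number of lines produced by looping over source."""
    count = 0
    for line in source:     # no type test: anything loopable will do
        count += 1
    return count

from_file_like = count_lines(io.StringIO("a\nb\nc\n"))
from_list = count_lines(["a\n", "b\n"])
documentation = count_lines.__doc__

# True when this file is run directly, False when it is imported.
is_main_program = (__name__ == "__main__")
```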

Functions as Objects

A function is just another kind of data.
Defining a function creates a function object and assigns it to a variable.
Functions can be assigned to other variables, put in lists, and passed as parameters.
Writing higher-order functions helps eliminate redundancy in programs.
Use filter to select values from a list.
Use map to apply a function to each element of a list.
Use reduce to combine the elements of a list.
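
A sketch of functions used as data, assuming Python 3: there, filter and map return lazy iterators (hence the list() calls), and reduce is imported from functools. The helper functions are invented.

```python
from functools import reduce

def is_even(n):
    return n % 2 == 0

def square(n):
    return n * n

def add(a, b):
    return a + b

numbers = [1, 2, 3, 4, 5, 6]
evens = list(filter(is_even, numbers))   # select values
squares = list(map(square, numbers))     # transform each value
total = reduce(add, numbers)             # combine all values

# Functions can be stored in lists and passed around like other data.
def apply_all(functions, value):
    return [f(value) for f in functions]

results = apply_all([square, is_even], 4)
```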

Databases

A relational database stores information in tables with fields and records.
A database manager is a program that manipulates a database.
The commands or queries given to a database manager are usually written in a specialized language called SQL.

Selecting

SQL is case insensitive.
The rows and columns of a database table aren't stored in any particular order.
Use SELECT fields FROM table to get all the values for specific fields from a single table.
Use SELECT * FROM table to select everything from a table.

Removing Duplicates

Use SELECT DISTINCT to eliminate duplicates from a query's output.

Calculating New Values

Use expressions in place of field names to calculate per-record values.

Filtering

Use WHERE test in a query to filter records based on logical tests.
Use AND and OR to combine tests in filters.
Use IN to test whether a value is in a set.
Build up queries a bit at a time, and test them against small data sets.

Sorting

Use ORDER BY field ASC (or DESC) to order a query's results in ascending (or descending) order.

Aggregation

Use aggregation functions like SUM and MAX to combine many query results into a single value.
Use the COUNT function to count the number of results.
If some fields are aggregated, and others are not, the database manager displays an arbitrary result for the unaggregated field.
Use GROUP BY to group values before aggregation.
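
The aggregation points above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the Survey table, its fields, and its contents are made up, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Survey(person TEXT, quant TEXT, reading REAL);")
cursor.executemany("INSERT INTO Survey VALUES(?, ?, ?);", [
    ("dyer", "rad", 9.82),
    ("dyer", "sal", 0.13),
    ("pb", "rad", 8.41),
    ("pb", "sal", 0.09),
    ("lake", "sal", 0.05),
])

# Aggregate over all records that pass the filter.
cursor.execute(
    "SELECT COUNT(*), MAX(reading), SUM(reading) FROM Survey WHERE quant = 'sal';")
overall = cursor.fetchone()

# Group first, then aggregate within each group.
cursor.execute(
    "SELECT person, COUNT(*), MAX(reading) FROM Survey "
    "GROUP BY person ORDER BY person ASC;")
per_person = cursor.fetchall()

connection.close()
```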

Database Design

Each field in a database table should store a single atomic value.
No fact in a database should ever be duplicated.

Combining Data

Use JOIN to create all possible combinations of records from two or more tables.
Use JOIN tables ON test to keep only those combinations that pass some test.
Use table.field to specify a particular field of a particular table.
Use aliases to make queries more readable.
Every record in a table should be uniquely identified by the value of its primary key.

Self Join

Use a self join to combine a table with itself.

Missing Data

Use NULL in place of missing information.
Almost every operation involving NULL produces NULL as a result.
Test for nulls using IS NULL and IS NOT NULL.
Most aggregation functions skip nulls when combining values.
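
A small sketch of how NULL behaves, again using Python's sqlite3 module with a made-up Visited table. Python's None is stored as NULL.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Visited(ident INTEGER, dated TEXT);")
cursor.executemany("INSERT INTO Visited VALUES(?, ?);", [
    (1, "1927-02-08"),
    (2, None),              # missing date
    (3, "1930-01-12"),
])

# Comparing with '=' yields NULL, which is not true, so nothing matches.
cursor.execute("SELECT ident FROM Visited WHERE dated = NULL;")
with_equals = cursor.fetchall()

# IS NULL is the test that actually finds missing values.
cursor.execute("SELECT ident FROM Visited WHERE dated IS NULL;")
with_is_null = cursor.fetchall()

# COUNT(field) skips nulls; COUNT(*) counts records.
cursor.execute("SELECT COUNT(dated), COUNT(*) FROM Visited;")
counts = cursor.fetchone()

connection.close()
```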

Nested Queries

Use nested queries to create temporary sets of results for further querying.
Use nested queries to subtract unwanted results from all results to leave desired results.

Creating and Modifying Tables

Use CREATE TABLE name(...) to create a table.
Use DROP TABLE name to erase a table.
Specify field names and types when creating tables.
Specify PRIMARY KEY, NOT NULL, and other constraints when creating tables.
Use INSERT INTO table VALUES(...) to add records to a table.
Use DELETE FROM table WHERE test to erase records from a table.
Maintain referential integrity when creating or deleting information.

Transactions

Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.

Programming With Databases

Most applications that use databases embed SQL in a general-purpose programming language.
Database libraries use connections and cursors to manage interactions.
Programs can fetch all results at once, or a few results at a time.
If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
Dynamically-constructed queries can use SQL's native formatting to safeguard against such attacks.
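
One way to read the last two points, sketched with Python's sqlite3 module: pass user input through '?' placeholders instead of pasting it into the query string. The Person table and the family_name helper are hypothetical, and other database libraries spell their placeholders differently.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Person(ident TEXT PRIMARY KEY, family TEXT NOT NULL);")
cursor.executemany("INSERT INTO Person VALUES(?, ?);",
                   [("dyer", "Dyer"), ("pb", "Pabodie")])
connection.commit()

def family_name(cursor, ident):
    """Look up one family name; ident is treated purely as data."""
    cursor.execute("SELECT family FROM Person WHERE ident = ?;", (ident,))
    row = cursor.fetchone()     # fetch one result at a time
    if row is None:
        return None
    return row[0]

safe = family_name(cursor, "dyer")
# An attempted injection is just an identifier that matches nothing.
hostile = family_name(cursor, "dyer' OR '1'='1")

connection.close()
```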

Number Crunching with NumPy

High-level libraries are usually more efficient for numerical programming than hand-coded loops.
Most such libraries use a data-parallel programming model.
Arrays can be used as matrices, as physical grids, or to store general multi-dimensional data.

Basics

NumPy is a high-level array library for Python.
import numpy to import NumPy into a program.
Use numpy.array(values) to create an array.
Initial values must be provided in a list (or a list of lists).
NumPy arrays store homogeneous values whose type is identified by array.dtype.
Use old.astype(newtype) to create a new array with a different type rather than assigning to dtype.
numpy.zeros creates a new array filled with 0.
numpy.ones creates a new array filled with 1.
numpy.identity creates a new identity matrix.
numpy.empty creates an array but does not initialize its values (which means they are unpredictable).
Assigning an array to a variable creates an alias rather than copying the array.
Use array.copy to create a copy of an array.
Put all array indices in a single set of square brackets, like array[i0, i1].
array.shape is a tuple of the array's size in each dimension.
array.size is the total number of elements in the array.

Storage

Arrays are stored using descriptors and data blocks.
Many operations create a new descriptor, but alias the original data block.
Array elements are stored in row-major order.
array.transpose creates a transposed alias for an array's data.
array.ravel creates a one-dimensional alias for an array's data.
array.reshape creates an arbitrarily-shaped alias for an array's data.
array.resize resizes an array's data in place, filling with zero as necessary.
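
A sketch of descriptors aliasing a shared data block, assuming NumPy is installed. The array contents are arbitrary. For a freshly created array like this one, transpose, ravel, and reshape all return aliases, while copy does not.

```python
import numpy

grid = numpy.array([[1, 2, 3],
                    [4, 5, 6]])

alias = grid                     # same descriptor, same data
independent = grid.copy()        # new data block
flipped = grid.transpose()       # new descriptor, same data
flat = grid.ravel()              # one-dimensional alias
reshaped = grid.reshape(3, 2)    # differently shaped alias

grid[0, 0] = 99                  # visible through every alias
```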

Indexing

Arrays can be sliced using start:end:stride along each axis.
Values can be assigned to slices as well as read from them.
Arrays can be used as subscripts to select items in arbitrary ways.
Masks containing True and False can be used to select subsets of elements from arrays.
Use '&' and '|' (or logical_and and logical_or) to combine tests when subscripting arrays.
Use where, choose, or select to select elements or alternatives in a single step.
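
The indexing styles above, sketched on a small made-up array (assuming NumPy is installed).

```python
import numpy

data = numpy.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

corner = data[0:2, 0:2]            # slice along each axis
every_other = data[:, ::2]         # stride of 2 along the second axis

# Arrays (or lists) as subscripts pick out elements (0, 1) and (2, 3).
picked = data[[0, 2], [1, 3]]

# A Boolean mask built from two tests combined with '&'.
mask = (data > 3) & (data < 10)
selected = data[mask]

# where() chooses between alternatives element by element.
capped = numpy.where(data > 6, 6, data)
```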

Linear Algebra

Addition, multiplication, and other arithmetic operations work on arrays element-by-element.
Operations involving arrays and scalars combine the scalar with each element of the array.
array.dot performs "real" matrix multiplication.
array.sum calculates sums or partial sums of array elements.
array.mean calculates array averages.
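
A quick sketch contrasting element-by-element arithmetic with matrix multiplication, assuming NumPy is installed. The 2×2 matrix is arbitrary.

```python
import numpy

a = numpy.array([[1, 2],
                 [3, 4]])

elementwise = a * a            # each element times itself
scaled = a * 10                # scalar combined with every element
matrix_product = a.dot(a)      # "real" matrix multiplication

total = a.sum()                # sum of all elements
column_sums = a.sum(axis=0)    # partial sums down each column
average = a.mean()
```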

Making Recommendations

Getting data in the right format for processing often requires more code than actually processing it.
Data with many gaps should be stored in sparse arrays.
numpy.cov calculates variances and covariances.

The Game of Life

Padding arrays with fixed elements is an easy way to implement boundary conditions.
scipy.signal.convolve applies a weighted mask to each element of an array.

Quality

Defensive Programming

Design programs to catch both internal errors and usage errors.
Use assertions to check whether things that ought to be true in a program actually are.
Assertions help people understand how programs work.
Fail early, fail often.
When bugs are fixed, add assertions to the program to prevent their reappearance.
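
A sketch of assertions used as checks on inputs and on results. The normalize function is hypothetical; the point is that each assertion also documents what the code expects.

```python
def normalize(weights):
    """Scale weights so that they sum to 1.0 (hypothetical helper)."""
    # Usage errors: fail early, with a message that says why.
    assert len(weights) > 0, "need at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "weights must not all be zero"

    result = [w / total for w in weights]

    # Internal error check: something that ought to be true afterwards.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized weights must sum to 1"
    return result

shares = normalize([1, 1, 2])
```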

Handling Errors

Use raise to raise exceptions.
Raise exceptions to report errors rather than trying to handle them inline.
Use try and except to handle exceptions.
Catch exceptions where something useful can be done about the underlying problem.
An exception raised in a function may be caught anywhere in the active call stack.
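
A sketch of raising an exception deep in the call stack and catching it higher up, where something useful can be done. All three functions are made up for illustration.

```python
def parse_reading(text):
    """Convert one line of text to a number, or raise ValueError."""
    text = text.strip()
    if not text:
        raise ValueError("empty reading")   # report, don't handle inline
    return float(text)

def load_readings(lines):
    # No try/except here: nothing useful can be done at this level.
    return [parse_reading(line) for line in lines]

def summarize(lines):
    # The caller knows what to tell the user, so it catches the error.
    try:
        readings = load_readings(lines)
    except ValueError as error:
        return "skipped: " + str(error)
    return "mean: " + str(sum(readings) / len(readings))

good = summarize(["1.0", "3.0"])
bad = summarize(["1.0", "  "])
```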

Unit Testing

Testing cannot prove that a program is correct, but is still worth doing.
Use a unit testing library like Nose to test short pieces of code.
Write each test as a function that creates a fixture, executes an operation, and checks the result using assertions.
Every test should be able to run independently: tests should not depend on one another.
Focus testing on boundary cases.
Writing tests helps us design better code by clarifying our intentions.
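
A sketch of tests written in the style a tool like Nose can discover: plain functions whose names start with test_, each building its own fixture. The running_total function under test is hypothetical, and the tests here use only Python's built-in assert.

```python
def running_total(values):
    """Return the cumulative sums of values (function under test)."""
    totals = []
    current = 0
    for value in values:
        current += value
        totals.append(current)
    return totals

def test_empty_list():              # boundary case: nothing to add
    fixture = []
    assert running_total(fixture) == []

def test_single_value():            # boundary case: one element
    fixture = [5]
    assert running_total(fixture) == [5]

def test_several_values():
    fixture = [1, 2, 3]
    assert running_total(fixture) == [1, 3, 6]

def test_input_is_not_modified():   # each test creates its own fixture
    fixture = [1, 2, 3]
    running_total(fixture)
    assert fixture == [1, 2, 3]
```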

Numbers

Floating point numbers are approximations to actual values.
Use tolerances rather than exact equality when comparing floating point values.
Use integers to count and floating point numbers to measure.
Most tests should be written in terms of relative error rather than absolute error.
When testing scientific software, compare results to exact analytic solutions, experimental data, or results from simpler or previously-tested programs.
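
A short sketch of why tolerances matter, assuming Python 3.5 or later for math.isclose. The tolerance of 1e-9 is an arbitrary choice for illustration, and relative_error is a hypothetical helper.

```python
import math

total = 0.1 + 0.2

exactly_equal = (total == 0.3)                        # exact comparison
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)  # relative tolerance

def relative_error(actual, expected):
    """Relative error of actual with respect to a non-zero expected value."""
    return abs(actual - expected) / abs(expected)

error = relative_error(total, 0.3)
```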

Coverage

Use a coverage analyzer to see which parts of a program have been tested and which have not.

Debugging

Use an interactive symbolic debugger instead of print statements to diagnose problems.
Set breakpoints to halt the program at interesting points instead of stepping through execution.
Try to get things right the first time.
Make sure you know what the program is supposed to do before trying to debug it.
Make sure the program is actually running the test case you think it is.
Make the program fail reliably.
Simplify the test case or the program in order to localize the problem.
Change one thing at a time.
Be humble.

Designing Testable Code

Separating interface from implementation makes code easier to test and re-use.
Replace some components with simplified versions of themselves in order to simplify testing of other components.
Do not create arbitrary, variable, or random results, as they are extremely hard to test.
Isolate interactions with the outside world when writing tests.

Sets and Dictionaries

Sets

Use sets to store distinct values.
Create sets using set() or {v1, v2, ...}.
Sets are mutable, i.e., they can be updated in place like lists.
A loop over a set produces each element once, in arbitrary order.
Use sets to find unique things.

Storage

Sets are stored in hash tables, which guarantee fast access for arbitrary keys.
The values in sets must be immutable to prevent hash tables misplacing them.
Use tuples to store multi-part elements in sets.
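
A sketch of the points about sets so far: duplicates disappear, tuples are acceptable elements, and a mutable list is rejected. The element values are invented.

```python
seen = set()
for atom in ["He", "Ne", "He", "Ar", "Ne"]:
    seen.add(atom)              # sets are updated in place

literal = {"He", "Ne", "Ar"}    # the same set written as a literal

# Tuples are immutable, so they can be stored in a set.
pairs = {(1, 2), (3, 4), (1, 2)}

# Lists are mutable, so they cannot.
try:
    broken = {[1, 2]}
    list_was_rejected = False
except TypeError:
    list_was_rejected = True
```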

Dictionaries

Use dictionaries to store key-value pairs with distinct keys.
Create dictionaries using {k1:v1, k2:v2, ...}.
Dictionaries are mutable, i.e., they can be updated in place.
Dictionary keys must be immutable, but values can be anything.
Use tuples to store multi-part keys in dictionaries.
dict[key] refers to the dictionary entry with a particular key.
key in dict tests whether a key is in a dictionary.
len(dict) returns the number of entries in a dictionary.
A loop over a dictionary produces each key once, in arbitrary order.
dict.keys() creates a list of the keys in a dictionary.
dict.values() creates a list of the values in a dictionary.
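
A sketch of the basic dictionary operations, assuming Python 3. There, keys() and values() return views rather than lists, so sorted() is used here to get lists in a predictable order. The entries are illustrative.

```python
birthdays = {"Newton": 1642, "Darwin": 1809}
birthdays["Turing"] = 1912              # updated in place

year = birthdays["Darwin"]
has_darwin = "Darwin" in birthdays
has_curie = "Curie" in birthdays
how_many = len(birthdays)

names = sorted(birthdays.keys())
years = sorted(birthdays.values())

looped = []
for name in birthdays:                  # a loop produces each key once
    looped.append(name)
```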

Simple Examples

Use dictionaries to count things.
Initialize values from actual data instead of trying to guess what values could "never" occur.
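
Both points can be sketched with two hypothetical helpers: one counts items with a dictionary, the other finds a maximum by starting from the first actual value rather than a guess such as 0.

```python
def count_items(items):
    """Return a dictionary mapping each item to how often it occurs."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    return counts

def largest(values):
    """Return the largest value in a non-empty list."""
    result = values[0]          # initialize from the data itself
    for value in values[1:]:
        if value > result:
            result = value
    return result

atom_counts = count_items(["H", "O", "H"])
# Starting from 0 would give the wrong answer for all-negative data.
warmest = largest([-40.5, -12.0, -33.1])
```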

Phylogenetic Trees

Problems that are described using matrices can often be solved more efficiently using dictionaries.
When using tuples as multi-part dictionary keys, order the tuple entries to avoid accidental duplication.
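
A sketch of ordering tuple entries so that (a, b) and (b, a) land on the same dictionary key. The pair_key helper, species names, and distance values are all made up.

```python
def pair_key(first, second):
    """Return the two names as a tuple in a consistent (sorted) order."""
    if first <= second:
        return (first, second)
    return (second, first)

distances = {}
distances[pair_key("human", "chimp")] = 1.2
distances[pair_key("chimp", "human")] = 1.2    # same key, no duplicate
distances[pair_key("human", "gorilla")] = 1.6

lookup = distances[pair_key("gorilla", "human")]
```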

Development

The Grid

Get something simple working, then start to add features, rather than putting everything in the program at the start.
Leave FIXME markers in programs as you are developing them to remind yourself what still needs to be done.

Aliasing

Draw pictures of data structures to aid debugging.

Randomness

Use a well-tested random number generation library to generate pseudorandom values.
If a random number generation library is given the same seed, it will produce the same sequence of values.
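
A sketch using Python's standard random module: two generators given the same seed produce the same sequence. The seed value 1234 is arbitrary.

```python
import random

first = random.Random(1234)     # two generators with the same seed
second = random.Random(1234)

sequence_a = [first.random() for _ in range(3)]
sequence_b = [second.random() for _ in range(3)]
```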

Neighbors

'and' and 'or' stop evaluating arguments as soon as they have an answer.
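
A sketch that makes short-circuit evaluation visible by recording which operands were actually evaluated. The check helper is hypothetical. The last lines show a common use: guarding an index test so an out-of-bounds lookup never happens.

```python
calls = []

def check(name, result):
    calls.append(name)      # remember that this operand was evaluated
    return result

outcome_and = check("a", False) and check("b", True)   # "b" is never run
outcome_or = check("c", True) or check("d", False)     # "d" is never run

# The bounds test protects the lookup on its right.
row = [1, 0, 1]
index = 5
neighbor_alive = index < len(row) and row[index] == 1
```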

Bugs

Test programs with successively more complex cases.

Refactoring

Refactor programs as necessary to make testing easier.
Replace randomness with predictability to make testing easier.

Performance

Scientists want faster programs both to handle bigger problems and to handle more problems with available resources.
Before speeding a program up, ask, "Does it need to be faster?" and, "Is it correct?"
Recording start and end times is a simple way to measure performance.
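
Recording start and end times might look like the sketch below, assuming Python 3 for time.perf_counter. The timed helper is made up, and a single measurement like this is noisy, so treat the number as a rough guide.

```python
import time

def timed(function, *args):
    """Call function(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

answer, seconds = timed(sum, range(1000))
```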
Analyze algorithms to predict how a program's performance will change with problem size.

Profiling

Use a profiler to determine which parts of a program are responsible for most of its running time.

A New Beginning

Better algorithms are better than better hardware.

Originally posted 2012-10-23 by Greg Wilson in Content.