Software Carpentry: Introducing Programming a Different Way

Our quick introduction to Python is the module I'm least happy with, so I've been thinking about how to re-design it. I've included a new outline below; comments would be very welcome.

Programming is what you do when you can't find an off-the-shelf tool to do what you want

Scientists' needs are so specialized that there often isn't such a tool
Even if it exists, the cost of finding it may be too high (can't Google by semantics)

Why is programming hard to teach/learn?

Trying to convey three things at once:
1. Here's something interesting that you might actually want to do.
2. Here's the syntax of whatever programming language we're using.
3. Here are some key ideas about programs and programming that will help you do things.
#1 is what engages your interest
#2 is what you'll grapple with (and will need to master in order to do #1)
#3 is what's most useful in the long term
- But it's hard or impossible to learn the general without first learning some specifics
- And you have deadlines: getting that graph plotted is more important right now than the big picture

We will teach basic programming by example

Show you small programs, then explain the things they contain
Also explain a few principles along the way

We will use Python

Widely used for scientific programming, but that's not the main reason
Our experience is that novices find it more readable than alternatives
And it allows us to do useful things before introducing advanced concepts (e.g., OOP)

Will not start with multimedia programming, 3D graphics, etc.

Guzdial et al. have found that it's more engaging for students
But getting those packages installed (and for us, maintaining them) is hard work

We assume that you've done some programming, in some language, at some point

Have at least heard terms like "variable", "loop", "if/else", "array"
Quick test: can you read a list of numbers, one per line, and print the least and greatest?
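
One possible answer to the quick test, as a minimal sketch. It is written for Python 3, so `print` is a function, unlike the Python 2 `print` statements used in the rest of this outline. The function name `least_and_greatest` and the sample input are made up for illustration.

```python
# Sketch of the quick test: find the least and greatest of a list of
# numbers given one per line. The name least_and_greatest is hypothetical.
def least_and_greatest(lines):
    least = None
    greatest = None
    for line in lines:
        text = line.strip()
        if not text:            # skip blank lines
            continue
        value = float(text)     # text must be converted to a number
        if least is None or value < least:
            least = value
        if greatest is None or value > greatest:
            greatest = value
    return least, greatest

# Sample input standing in for lines read from a file or the keyboard.
sample = ["3\n", "-1.5\n", "7\n", "2\n"]
least, greatest = least_and_greatest(sample)
print("least:", least, "greatest:", greatest)
```

If this reads comfortably, the material below should be at about the right level.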

Before we dive in, what is a program?

Cynical answer: "anything that might contain bugs"
Classic answer: "Instructions for a computer"
Our answer adds: "...that a human being can understand"
- Takes a lot of work to turn the things we write into instructions a computer can execute
Time to solution is (time to write code that's correct) + (time for that code to execute)
- The latter halves every 18 months (Moore's Law)
- The former depends on human factors that change on a scale of years (languages) or generations (psychology)

Programs store data and do calculations

Use variables for the first
Write instructions that use those variables to describe the second

Put the following in a text file (not Word) and run it

# Convert temperature in Fahrenheit to Kelvin.
temp_in_f = 98.6
temp_in_k = (temp_in_f - 32.0) * (5.0 / 9.0) + 273.15
print "body temperature in Kelvin:", temp_in_k
body temperature in Kelvin: 310.15

Variable is a name that labels a value (picture)

Created by assignment

[Box] versus declaration (static typing) or create-by-read (and why the latter is bad)

Usual rules of arithmetic: * before +, parentheses

Print displays values

Put text (character strings) in quotes
Print automatically puts a space between values, and ends the line

Need to know it: use "5/9" instead of "5.0/9.0"

# Convert temperature in Fahrenheit to Kelvin.
temp_in_f = 98.6
temp_in_k = (temp_in_f - 32.0) * (5 / 9) + 273.15  # this line is different
print "body temperature in Kelvin:", temp_in_k
body temperature in Kelvin: 273.15

Run interpreter, try 5/9, get 0

Shows that Python can be used interactively

Integer vs. float, and what division does

Automatic conversion: 5.0/9 does the right thing

[Box] Why are so many decimal places shown in 5.0/9
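
The integer-versus-float behaviour can be checked directly. This sketch assumes Python 3, where `/` always produces a float and `//` does the floor division that Python 2 (used in this outline) performs for `5/9`.

```python
# Assumes Python 3: // is floor division, which is what Python 2's
# 5/9 does when both operands are integers.
int_result = 5 // 9          # both integers: the fractional part is discarded
float_result = 5.0 / 9       # one float operand is enough to get a float

print("5 // 9 :", int_result)
print("5.0 / 9:", float_result)

# The long tail of digits appears because 5/9 has no exact binary
# representation; rounding for display hides it.
print("rounded:", round(float_result, 3))
```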

Need to know it: sometimes Python doesn't know what to do

# Try adding numbers and strings.
print "2 + 3:", 2 + 3
print "two + three:", "two" + "three"
print "2 + three:", 2 + "three"
2 + 3: 5
two + three: twothree
2 + three: Traceback (most recent call last):
  File "add-numbers-strings.py", line 5, in <module>
    print "2 + three:", 2 + "three"
TypeError: unsupported operand type(s) for +: 'int' and 'str'

In this case, "2three" would be sensible

But what about "1" + 2?

The character "1" is not the number 1

On your own, try "two" * 3
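
When Python refuses to guess, an explicit conversion says which meaning is intended. This is a minimal sketch in Python 3 `print` syntax; `str` and `int` are built-in functions, and the variable names are only illustrative.

```python
# Make the intent explicit instead of asking Python to guess.
as_text = str(2) + "three"      # treat the number as text
as_number = int("1") + 2        # treat the text as a number

print("str(2) + 'three':", as_text)
print("int('1') + 2:", as_number)

# Mixing the two without converting still raises a TypeError.
try:
    "1" + 2
except TypeError as error:
    print("'1' + 2 fails:", error)
```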

Back to useful things

Computers are useful because they can do lots of calculations on lots of data

Which means we need a concise way to represent multiple values and multiple steps

Writing out a million additions would take longer than doing them

# Find the mean.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
total = 0
number = 0
for value in data:
    total = total + value
    number = number + 1
mean = total / number
print "mean is", mean
mean is 2

Use list to store multiple values

Like a vector in mathematics

Use loop to perform multiple operations

Like Σ in mathematics
But we have to break it down into sequential steps
And since we're doing that, have to be able to update variables' values

Can trace execution step by step manually or in a debugger

An important skill
Want to write programs so that tracing is easy
See what this means when we talk about functions
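
One low-tech way to trace the loop above is to print the variables on every pass. This is a sketch in Python 3 syntax rather than the debugger session the outline has in mind; the `trace` list is an addition so the steps can be inspected afterwards.

```python
# Trace the mean calculation one step at a time on a short list.
data = [1, 4, 2]
total = 0
number = 0
trace = []                      # hypothetical record of (value, total, number)
for value in data:
    total = total + value
    number = number + 1
    trace.append((value, total, number))
    print("value:", value, "total:", total, "number:", number)
```

Each printed line corresponds to one row of the table you would write when tracing by hand.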

Did you notice that the result in the example above is wrong?

25/9 is 2, but 25.0/9.0 is 2.77777777778

Problem is that total starts as an integer, we're adding integers, we wind up doing int/int (again)

Could fix it by initializing total to 0.0

Or use a function to do the conversion explicitly

# Find the mean.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
total = 0
number = 0
for value in data:
    total = total + value
    number = number + 1
mean = float(total) / number   # this line has changed
print "mean is", mean
mean is 2.77777777778

Functions do what they do in mathematics

Values in, values out

Spend a whole chapter on them, since they're key to building large programs

Right now, most important lesson is that just because a program runs, doesn't mean it's correct

Could check the original program by using a smaller data set
- E.g., [1, 4] produces 2 instead of 2.5
Writing programs so that they're checkable is a big idea that we'll return to
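
The small-data check can be written down so that it is repeatable. This sketch assumes Python 3, where `//` stands in for the integer division that Python 2 performs in the original program; the names `broken_mean` and `fixed_mean` are invented for this illustration.

```python
# Check both versions of the calculation on a data set small enough
# to work out by hand: the mean of [1, 4] is 2.5.
data = [1, 4]
total = 0
number = 0
for value in data:
    total = total + value
    number = number + 1

broken_mean = total // number        # mimics Python 2's int/int
fixed_mean = float(total) / number   # explicit conversion

print("integer division gives", broken_mean)
print("float division gives", fixed_mean)
```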

Need to know it: the len function

# Find the mean.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
total = 0
for value in data:
    total = total + value
mean = float(total) / len(data) # this line has changed
print "mean is", mean
mean is 2.77777777778

Need to know it: lists are mutable

# Calculate running sum by creating new list.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
result = []
current = 0
for value in data:
    current = current + value
    result.append(current)
print "running total:", result
running total: [1, 5, 7, 10, 13, 17, 20, 24, 25]

Start with the empty list

result.append is a method

Like a function, but attached to some kind of object like a list
noun and verb
Important enough that we'll spend a whole chapter on this, too

How to double the values in place?

# Try to double the values in place.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
for value in data:
    value = 2 * value
print "doubled data is:", data
doubled data is: [1, 4, 2, 3, 3, 4, 3, 4, 1]

New values are being created, but never assigned to list elements

Easiest to understand with a picture

Need to know it: list indexing

Mathematicians use subscripts, we use square brackets

Index from 0..N-1 rather than 1..N for reasons that made sense in 1970 and have become customary since

Python and Java use 0..N-1 (like C)
Fortran and MATLAB use 1..N (like human beings)

# Try to double the values in place.
data = [1, 4, 2]
data[0] = 2 * data[0]
data[1] = 2 * data[1]
data[2] = 2 * data[2]
print "doubled data is:", data
doubled data is: [2, 8, 4]

Clearly doesn't scale...

Need to get all the indices for a list of length N

The range function produces a list of numbers from 0..N-1

Examples
Exactly the indices for a list
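
A few calls show what `range` produces. This sketch assumes Python 3, where `range` is a lazy sequence and `list(...)` is needed to see its contents; in Python 2, as used in this outline, `range` returns a list directly. The variable names are only for illustration.

```python
first_three = list(range(3))          # 0 up to, but not including, 3
from_two = list(range(2, 6))          # start at 2, stop before 6
by_threes = list(range(0, 10, 3))     # count up in steps of 3

data = [1, 4, 2]
indices = list(range(len(data)))      # exactly the legal indices for data

print("range(3):", first_three)
print("range(2, 6):", from_two)
print("range(0, 10, 3):", by_threes)
print("range(len(data)):", indices)
```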

You will almost never be the first person to need something

It's probably in the language, or in a library
Hard part is finding it...

# Double the values in a list in place
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
length = len(data) # 9
indices = range(length) # [0, 1, 2, 3, 4, 5, 6, 7, 8]
for i in indices:
    data[i] = 2 * data[i]
print "doubled data is:", data
doubled data is: [2, 8, 4, 6, 6, 8, 6, 8, 2]

Fold this together by combining function calls (like \sqrt{\sin(x)})

# Double the values in a list in place.
data = [1, 4, 2, 3, 3, 4, 3, 4, 1]
for i in range(len(data)):
    data[i] = 2 * data[i]
print "doubled data is:", data
doubled data is: [2, 8, 4, 6, 6, 8, 6, 8, 2]

Usually won't type in our data

Store it outside program

[Box] Explain difference between memory, local disk, and remote disk
And why we don't do everything one way or the other: performance vs. cost

# Count the number of lines in a file
reader = open("data.txt", "r")
number = 0
for line in reader:
    number = number + 1
reader.close()
print number, "lines in file"
9 lines in file

What about mean?

# Find the mean.
reader = open("data.txt", "r")
total = 0.0
number = 0
for line in reader:
    total = total + line
    number = number + 1
reader.close()
print "mean is", total / number
Traceback (most recent call last):
  File "mean-read-broken.py", line 7, in <module>
    total = total + line
TypeError: unsupported operand type(s) for +: 'float' and 'str'

Data in file is text, so we need to convert

# Find the mean.
reader = open("data.txt", "r")
total = 0.0
number = 0
for line in reader:
    value = float(line)
    total = total + value
    number = number + 1
reader.close()
print "mean is", total / number
mean is 2.77777777778

Notice that we're using the original program as an oracle

Pro: start simple, make more complex, confidence at each step
Con: bug in original perpetuated indefinitely
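
The oracle idea can be made explicit by computing the answer both ways and comparing. This is a sketch assuming Python 3. The contents of `data.txt` are not shown in this outline, so the sketch writes its own temporary file and assumes it holds the same nine values as the earlier list.

```python
import os
import tempfile

values = [1, 4, 2, 3, 3, 4, 3, 4, 1]

# Oracle: the earlier in-memory calculation.
total = 0
for value in values:
    total = total + value
expected = float(total) / len(values)

# Write the same values to a temporary file, one per line.
handle, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "w") as writer:
    for value in values:
        writer.write(str(value) + "\n")

# New version: read the values back from the file.
reader = open(path, "r")
total = 0.0
number = 0
for line in reader:
    total = total + float(line)
    number = number + 1
reader.close()
os.remove(path)
actual = total / number

print("oracle:", expected, "from file:", actual)
print("agree:", abs(expected - actual) < 1e-9)
```

Agreement only shows that the two versions match each other; if the oracle itself is wrong, both are.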

Real-world data is never clean

Count how many scores were not between 0 and 5

# Count number of values out of range.
data = [0, 3, 2, -1, 1, 4, 4, 6, 5, 5, 6]
num_outliers = 0
for value in data:
    if value < 0:
        num_outliers = num_outliers + 1
    if value > 5:
        num_outliers = num_outliers + 1
print num_outliers, "values out of range"
3 values out of range

Need to know it: combine tests using and and or

# Count number of values out of range.
data = [0, 3, 2, -1, 1, 4, 4, 6, 5, 5, 6]
num_outliers = 0
for value in data:
    if (value < 0) or (value > 5):
        num_outliers = num_outliers + 1
print num_outliers, "values out of range"
3 values out of range
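
Python also allows comparisons to be chained, which reads much like the mathematical notation 0 ≤ x ≤ 5. This sketch, in Python 3 syntax, checks that the chained test counts the same outliers as the `or` version; the counter names are invented for the comparison.

```python
data = [0, 3, 2, -1, 1, 4, 4, 6, 5, 5, 6]

with_or = 0
with_chain = 0
for value in data:
    if (value < 0) or (value > 5):
        with_or = with_or + 1
    if not (0 <= value <= 5):       # same test, written as a range check
        with_chain = with_chain + 1

print(with_or, "values out of range using or")
print(with_chain, "values out of range using a chained comparison")
```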

Need to know it: in-place operators

# Count number of values out of range.
data = [0, 3, 2, -1, 1, 4, 4, 6, 5, 5, 6]
num_outliers = 0
for value in data:
    if (value < 0) or (value > 5):
        num_outliers += 1
print num_outliers, "values out of range"
3 values out of range

Don't actually "need" to know it

But it's a common idiom in many languages

Data cleanup

Values are supposed to be monotonically increasing
Check that they are, report where it fails if they're not

# Report where values are not monotonically increasing.
data = [1, 2, 2, 3, 4, 4, 5, 6, 5, 6, 7, 7, 8]
# Start at 1 so that data[i-1] is always a valid index.
for i in range(1, len(data)):
    if data[i] < data[i-1]:
        print "failure:", i
failure: 8
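
Collecting the failing indices in a list, rather than printing them as they are found, makes the result easier to check and reuse. This is a sketch in Python 3 syntax; the data values and the `failures` list are invented for illustration.

```python
# Record every index where a value is smaller than its predecessor.
data = [1, 2, 2, 3, 1, 4, 4, 3]
failures = []
for i in range(1, len(data)):       # start at 1: element 0 has no predecessor
    if data[i] < data[i-1]:
        failures.append(i)

print("failures at indices:", failures)
```

Equal neighbours (such as the two 2s) are accepted here, as in the check above.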

Group by threes

# Combine successive triples of data.
data = [1, 2, 2, 3, 4, 4, 5, 6, 5, 6, 7, 7, 8]
result = []
for i in range(0, len(data), 3):
    sum = data[i] + data[i+1] + data[i+2]
    result.append(sum)
print "grouped data:", result
Traceback (most recent call last):
  File "group-by-threes-fails.py", line 6, in <module>
    sum = data[i] + data[i+1] + data[i+2]
IndexError: list index out of range

13 values = 4 groups of 3 and 1 left over

First question must be, what's the right thing to do scientifically?

Let's assume, "Add up as many as are there"

# Combine successive triples of data.
data = [1, 2, 2, 3, 4, 4, 5, 6, 5, 6, 7, 7, 8]
result = []
for i in range(0, len(data), 3):
    sum = data[i]
    if (i+1) < len(data):
        sum += data[i+1]
    if (i+2) < len(data):
        sum += data[i+2]
    result.append(sum)
print "grouped data:", result
grouped data: [5, 11, 16, 20, 8]

But this is clumsy

How do we add up the first three, or as many as are there?

Don't want to have to keep modifying the list as we try out ideas

So use a list of lists.

# Add up the first three, or as many as are there.
test_cases = [[],                     # no data at all
              [10],                   # just one value
              [10, 20],               # two values
              [10, 20, 30],           # three
              [10, 20, 30, 40]]       # more than enough

for data in test_cases:
    print data
[]
[10]
[10, 20]
[10, 20, 30]
[10, 20, 30, 40]

Can now try all our tests by running one program

Back to our original problem: sum of at most the first three

# Sum up at most the first three values.
test_cases = [[],                     # no data at all
              [10],                   # just one value
              [10, 20],               # two values
              [10, 20, 30],           # three
              [10, 20, 30, 40]]       # more than enough

for data in test_cases:
    limit = min(3, len(data))
    sum = 0
    for i in range(limit):
        sum += data[i]
    print data, "=>", sum
[] => 0
[10] => 10
[10, 20] => 30
[10, 20, 30] => 60
[10, 20, 30, 40] => 60

That looks right

Though if there were 100 test cases, we would want different output
Come back to this idea later
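
One way to cope with many test cases is to store the expected answer alongside each input and report only the mismatches. This is a sketch of that idea in Python 3 syntax, not necessarily the approach the later material takes; pairing each input with an expected value is an assumption made here.

```python
# Each test is [input list, expected sum of at most the first three].
tests = [[[], 0],
         [[10], 10],
         [[10, 20], 30],
         [[10, 20, 30], 60],
         [[10, 20, 30, 40], 60]]

num_failed = 0
for test in tests:
    data = test[0]
    expected = test[1]
    limit = min(3, len(data))
    total = 0
    for i in range(limit):
        total += data[i]
    if total != expected:           # only speak up when something is wrong
        num_failed += 1
        print("FAILED:", data, "gave", total, "expected", expected)

print(len(tests) - num_failed, "of", len(tests), "tests passed")
```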

Need one more tool: nested loops

# Loops can run inside loops.
for i in range(4):
    for j in range(i):
        print i, j
1 0
2 0
2 1
3 0
3 1
3 2

Easiest to understand with a picture

Final step: instead of starting at zero every time, start at 0, 3, 6, 9, etc.

Need more test cases

Don't need to test everything (which is why we skip from 40 to 60 to 80)

We'll come back to how we decide what is or isn't a useful test case

# Sum up in groups of three.
test_cases = [[],
              [10],
              [10, 20],
              [10, 20, 30],
              [10, 20, 30, 40],
              [10, 20, 30, 40, 50, 60],
              [10, 20, 30, 40, 50, 60, 70, 80]]

for data in test_cases:
    result = []
    for i in range(0, len(data), 3):
        limit = min(i+3, len(data))
        sum = 0
        for j in range(i, limit):
            sum += data[j]
        result.append(sum)
    print data, "=>", result
[] => []
[10] => [10]
[10, 20] => [30]
[10, 20, 30] => [60]
[10, 20, 30, 40] => [60, 40]
[10, 20, 30, 40, 50, 60] => [60, 150]
[10, 20, 30, 40, 50, 60, 70, 80] => [60, 150, 150]

Understand this in pieces

Outer for loop is selecting a test case

So think about its body in terms of one test case

Inner loop is going in strides of three

So think about its body for some arbitrary value of i

limit is as far as we can go toward three values up from i

So range(i, limit) guaranteed to be valid indices for list
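
Printing the index ranges for a single list shows why they are always valid. This is a sketch in Python 3 syntax, using an eight-element list like the last test case; the `spans` list is added only so the ranges can be inspected.

```python
data = [10, 20, 30, 40, 50, 60, 70, 80]
spans = []
for i in range(0, len(data), 3):
    limit = min(i + 3, len(data))       # never go past the end of the list
    indices = list(range(i, limit))
    spans.append(indices)
    print("start", i, "limit", limit, "indices", indices)
```

The last group is shorter because `limit` is capped at `len(data)`.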

Human beings can only keep a few things in working memory at once

"Seven plus or minus two"

How we actually understand this program is:

for data in test_cases:
    result = sum_by_threes(data)
    print data, "=>", result

to sum_by_threes given a list data:
    result = []
    for i in range(0, len(data), 3):
        limit = min(i+3, len(data))
        sum = sum_from(data, i, limit)
        result.append(sum)

to sum_from given a list data, and start and end indices:
    sum = 0
    for i in range(start, end):
        sum += data[i]

The computer doesn't care one way or another

But what we need is a way to write our programs in pieces, then combine the pieces

That's the subject of the next chapter
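
As a preview, the pseudocode above might be written with Python's `def` roughly as follows. This is only a sketch in Python 3 syntax, since functions are the subject of the next chapter. The name `total` replaces `sum` here to avoid hiding Python's built-in function of that name.

```python
def sum_from(data, start, end):
    """Add up data[start] through data[end-1]."""
    total = 0
    for i in range(start, end):
        total += data[i]
    return total

def sum_by_threes(data):
    """Sum successive groups of three, or as many as are there."""
    result = []
    for i in range(0, len(data), 3):
        limit = min(i + 3, len(data))
        result.append(sum_from(data, i, limit))
    return result

test_cases = [[],
              [10],
              [10, 20, 30, 40],
              [10, 20, 30, 40, 50, 60, 70, 80]]
for data in test_cases:
    print(data, "=>", sum_by_threes(data))
```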

Originally posted 2011-08-08 by Greg Wilson in Content.