Contents

Python Dev - Scripts, Modules and Packages

Foreword

ACSE 1 Lecture One - 12th October 2020 - Version 3.0.5


Pre-Sessional material:

  • Introduction to git, Python & the bash shell

In this lecture: Python as a development platform

  • The Python Interpreter :- running code (20 minutes)
    • Notebooks
    • The command line interpreter
    • The IPython console
  • Python Scripts :- reusable code (40 minutes)
    • ways to run a script, VS Code, shebangs
    • Text encoding
    • PEP8 and pylint - code linters
    • Options parsers
    • `matplotlib` in scripts
  • Python Modules :- shareable code (45 minutes)
    • Python docstrings, PEP 257 & numpydoc
    • APIs
    • `import`, `sys.path` & `$PYTHONPATH`
    • Extension Modules
  • Python Packages :- distributable code (1 hour)
    • Directory Structure
    • `setup.py` & `setuptools`
    • `pip` & `conda` installation
Contact details:
  • Dr. James Percival

  • Room 4.85 RSM building (but not much this term)

  • email: j.percival@imperial.ac.uk

  • Teams: @Percival, James R in acse1 or General, or DM me.

By the end of this lecture you should:

  • Be able to write Python using Visual Studio Code.

  • Understand the similarities and differences between:

    • Python scripts

    • Python modules

    • Python packages

  • Know about Python coding standards, PEP8 and linters

  • Be able to make & install your own Python package.


A note on colours in this notebook:

In general, ordinary text looks like this. Much of this will cover the same material presented in the spoken lecture portions. Assessment in this module and the following ones (particularly the miniprojects) will expect you to be familiar with these subjects.

# This cell sets the css styles for the rest of the notebook.
# Unless you are particularly interested in that kind of thing,
# you can run this once, then safely ignore it

%run add_colours.py
css_styling()

Danger boxes

These boxes contain important warnings. Ignoring information in them might be dangerous, or lead to unexpected behaviour from your computer.

Danger boxes will thankfully be rare.

Information boxes

These boxes contain information which is important, but not vital, and which can safely be ignored if you are running short of time.

Exercise: An example Exercise

These boxes contain exercises to test your knowledge and practice important conceptual ideas.

The many ways to use Python

Python code gets used in at least five different ways:

  1. Jupyter notebooks

  2. Hacking in the interpreter/ipython console

  3. Small, frequently modified scripts

  4. Module files, grouping together useful code for reuse

  5. Large, stable project packages

Please follow along as we look at some of these methods.

Jupyter notebooks


These should need no further introduction, since you’re currently reading one. Jupyter notebooks combine data permanence, editable code and text comments in the same place.

When a cell is marked as a code cell, and a python kernel is running, it becomes an editable coding environment.

# we can write and run code here

Jupyter has good points and bad points. On average, data scientists like it a little bit more than computational scientists do.

The Python interpreter


On Windows Anaconda you can type

python

from the Anaconda command prompt to start a basic, no-frills Python interpreter session. On Linux/Mac you may sometimes need to use the command python3 instead.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:14:23) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

This is probably the least user friendly way possible to run an interactive Python session, although it is the best supported. Many Mac and Linux systems come with Python as a default installation (although sometimes quite an old one), so it has a very high probability of being installed on machines you are asked to connect to using ssh. The easiest way to quit is to call exit(). On Mac/Linux you can also use Ctrl+D or on Windows Ctrl+Z then Return.

Warning!

If the first line starts with Python 2.X.Y like

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

rather than Python 3.X.Y as shown above, then you’re running a very old interpreter. You can try typing python3 instead to see if Python 3 is installed on the system. If that still gives you Python 2.7, then something has gone wrong with your machine.
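The same check can also be made from inside a running interpreter or script. The following is a minimal sketch using the standard sys module; the variable name running_python3 is purely illustrative.

```python
import sys

# sys.version_info holds (major, minor, micro, releaselevel, serial)
running_python3 = sys.version_info.major >= 3

print("Running Python %d.%d" % (sys.version_info.major,
                                sys.version_info.minor))
if not running_python3:
    print("This interpreter is too old for this course.")
```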

The IPython console


IPython (or Interactive Python) provides a much more “batteries included” Python experience, with a built-in history editor, tab completion and inline matplotlib support. Anaconda provides a version, QtConsole, which runs in its own Qt window, so that the user experience on Windows, Mac and Linux is virtually identical.

At start-up an IPython interpreter session looks like

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Unlike the vanilla Python interpreter (what you get by typing python in a terminal/command prompt), it contains useful features like tab autocompletion, a richer browsable history (using the arrow keys), additional access to the inbuilt documentation system and the ability to call out easily to the underlying operating system.

Many of the features available to you should be familiar even to those of you who have only used Jupyter notebooks before, since they are also available inside Jupyter notebook code cells. In fact “under the hood” Jupyter is running an IPython console (the Python “kernel”) to process Python3 code.

Exercise One: Running Python Code


Run the following commands

def square_and_cube(n):
    return sorted([i**2 for i in range(n)]+[i**3 for i in range(n)])
print(square_and_cube(3))
print(square_and_cube(10))

and

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 6*np.pi)
plt.plot(x, np.sin(x))

in a notebook, in a vanilla Python interpreter and in a QtConsole/IPython console.

  • In each case, try modifying the square_and_cube function to also include the 4th powers.

  • The sorted function returns a new sorted list from an iterable. Try accessing the Python online help in the ordinary interpreter to invert the order of the list. Note you’ll need to use the help function, since the sorted? syntax is in IPython/Jupyter only.

Tip: for the IPython console, you may find the ipython magic command %paste useful.

Python Scripts

A Python script file is just a regular plain text file containing only valid Python code and comments (i.e. lines starting with the hash/pound character, #), which the Python interpreter transforms into instructions for the computer to perform. Script files are written in the same way you would write Python code in an interactive interpreter.

An Example

An example script, rot13.py might look like

#!/usr/bin/env python3
# -*- coding: ascii -*-

import codecs
import sys

print(codecs.encode(sys.argv[1], 'rot13'))

This file reads a string from the command line and applies the ROT-13 cypher, which cycles letters in the Latin alphabet through to the one 13 places forward/backward (i.e. maps A => N, N => A, g => t and so on). This cypher is its own inverse.
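The self-inverse property is easy to check directly. A short sketch, using the same codecs call as the script (the variable names are illustrative):

```python
import codecs

message = "Hello everybody!"

# Applying ROT-13 once scrambles the letters...
encoded = codecs.encode(message, 'rot13')
# ...and applying it a second time restores the original text.
decoded = codecs.encode(encoded, 'rot13')

print(encoded)
print(decoded)
```

Only the 26 Latin letters are shifted; digits, spaces and punctuation pass through unchanged.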

Warning

ROT-13 is useful to make text hard to read casually, but is not remotely cryptographically secret, let alone secure. Never use it in a situation where it wouldn’t be acceptable to use plain text.

#inside a notebook, the ! allows calls out to the OS shell
!curl https://msc-acse.github.io/ACSE-1/lectures/rot13.py -o rot13.py
!python rot13.py "Uryyb rirelobql!" 

The above commands will only work if your computer is connected to the internet, or if the script is already in the same directory as the notebook (in which case the curl line can be skipped). Inside the IPython console and in notebooks, we can also use the %run magic command:

%run rot13.py "Uryyb rirelobql!"

Warning

The ! command lets Jupyter notebooks run commands in the operating system with the same privileges that the user (i.e. you) has, and similar tricks can be played with %%sh, %%cmd and similar cell magics. Don’t just run random notebooks off the internet unless you understand what they’re doing, or fully trust the person you got them from.

Now that we can run a file, let’s have another look at the contents.

#!/usr/bin/env python3
# -*- coding: ascii -*-

import codecs
import sys

print(codecs.encode(sys.argv[1], 'rot13'))

Reviewing the contents of a script

Shebangs and executable files

The “shebang line”, #!/usr/bin/env python3, tells Linux/macOS systems that this script should be run with Python 3. If it is present, then on those systems we can also turn the script into an executable file and run it straight off:

$ chmod 755 rot13.py
$ ./rot13.py "This works on Linux/Mac systems"

Warning

Note that the shebang line refers to Python 3 explicitly as python3. This is typical behaviour on computer systems with both Python 3 & Python 2 installed, where python may still run Python 2. For those of you running Anaconda on Windows, python means Python 3 there, and the python3 executable may not exist.

Text encoding

The next line # -*- coding: ascii -*- tells python (and possibly your text editor) that the script uses the ASCII (American Standard Code for Information Interchange) text encoding. Text encodings map the numbers that computers are able to store onto the characters that humans can read. If a file is opened using the wrong encoding, then it will either read as nonsense, or contain many blank “unknown” characters.

Figure: ASCII table, by Tom Gibara (CC-BY-SA).
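To see what a wrong encoding does in practice, here is a small sketch using the str.encode and bytes.decode methods. The sample word and the choice of the UTF-8 and Latin-1 encodings are just for illustration.

```python
text = "caf\u00e9"   # the word 'café', with the é written as an escape code

# Encoding turns characters into bytes; in UTF-8 the é takes two bytes.
data = text.encode('utf-8')
print(data)

# Decoding with the right encoding recovers the text...
print(data.decode('utf-8'))

# ...but the wrong one silently produces nonsense.
print(data.decode('latin-1'))

# ASCII has no meaning for bytes above 127, so here decoding fails outright.
try:
    data.decode('ascii')
except UnicodeDecodeError as err:
    print(err)
```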

The file doesn’t have to be in ASCII. In fact the Python 3 default is to use a Unicode encoding (utf-8) if no explicit encoding is given. This gives access to characters from most world languages. You can even use letter-like symbols from the Unicode standard, as well as the more usual Latin characters, in the names of functions and objects. For example, let’s write a more international “Hello World” function.

def 你好(x):
    print('Hello', x)

你好('World!')

Similarly with the default utf-8 encoding you can use any Unicode characters from the standard you like in comments and strings.

def sorry():
    """😊"""
    return "不好意思, 我不会说中文."
print(sorry())
print(sorry.__doc__)

Fortunately, you can’t actually use emoji in function names, so code like

def 😊(x):
   return "This doesn't work"

will raise a SyntaxError exception.
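Whether a particular string would be accepted as a name can be checked without triggering a SyntaxError, using the built-in string method isidentifier. A short sketch (the sample names are made up for illustration):

```python
# str.isidentifier() reports whether a string is a valid Python name
for name in ['hello', '你好', '😊', '2fast']:
    print(repr(name), name.isidentifier())
```

Letters from any script are accepted, while emoji (which Unicode classes as symbols, not letters) and names starting with a digit are rejected.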

Writing a Python script

Since a python script is just a text file, you just need a text editor to write them. Indeed, providing you save it as Plain Text, you could even write it in Microsoft Word (please, please don’t). Your lecture on the shell introduced some console text editors which can be used on remote systems, but this course will use Visual Studio Code, a cross platform lightweight code editor (debatably an IDE, or integrated development environment) distributed by Microsoft, which makes writing, running and understanding Python scripts easier.

Warning

There are many reasons not to write code in Microsoft Word, including the autocorrect tool, which has an annoying tendency to “fix” code keywords like elif in a way which tends to break code. However, the most insidious feature (which also affects many code listings on the web) is “smart quotes”. Using pretty Unicode punctuation like “ and ” or ‘ and ’ instead of the straight ASCII versions " and ' turns Python strings into nonsense.

Some other IDEs/code editors (multilanguage):

  • Spyder another IDE which comes bundled with Anaconda Python installations.

  • Visual Studio (Mostly Windows) Visual Studio Code’s big brother. The package also contains Windows compilers for various languages.

  • Xcode (Mac only) Apple’s IDE equivalent of Visual Studio.

  • Eclipse A cross-platform open source IDE.

(Python only):

  • PyCharm A Python IDE similar to Spyder.

  • and many others…

Generic text editors with code syntax highlighting include:

  • Jupyter - as well as notebooks, it can edit plain text files.

  • Emacs (cross platform) Console/Windowed text editor.

  • Nano (cross platform) Console text editor.

  • Notepad++ (Windows only) GUI text editor

  • Vim (cross platform) Console text editor.

Your choice of editing platform is personal, and each individual should find out what works for them. Don’t be afraid to experiment, but if you have already spent a lot of time writing code using a tool which supports Python, then we recommend you carry on using it.

In your lecture you will open up VS Code and be given a quick introduction to the interface.

If you haven’t already downloaded the Python VS Code extension, it can be found (for free) here on the Visual Studio Code marketplace.

Information

As with many other IDEs, VS Code has a large community producing extensions, covering a wide range of programming and markup languages. Some other ones you might be interested in:

Option parsing in Python

Reading from the command line with sys.argv

The sys.argv variable is a list of the string arguments given when executing the script, with the first element (sys.argv[0]) being the name of the script itself. We can use this to communicate with the script from the command line, so that one file can do many things without needing to edit it. For example, the following script counts the number of uses of the letter ‘e’ in a file:

import sys

e_count = 0
with open(sys.argv[1], 'r') as infile:
    # loop over every line in the file, not just the first one
    for line in infile:
        e_count += line.count('e')
print("There were %d letter e's" % e_count)

Exercise Two: Find the primes


Using VS Code (or your own preferred data entry method), write a Python script to output the first 20 prime numbers. If you answered lecture 2 in the introductory exercises, you can start from the code you wrote there, or start afresh.

Tips:

  • One way of doing this uses an outer loop counting how many primes you have, and then code to find the next prime number.
  • Note that a number cannot be prime if it is divisible by a smaller prime number, and that 1 is not prime.
  • If a number is not prime, it must have at least one factor (other than 1) no larger than its own square root. This can be used to improve the efficiency of your search.
  • If a divides b exactly, then b%a==0, which gives a quick test.

When testing your code, you should expect the output for the first 5 primes to be [2, 3, 5, 7, 11].

Try to convert the script into a routine to calculate all prime numbers smaller than an input, \(n\), read using `sys.argv`.

A model solution for the script is available.

argparse and options parsing

To pass more complicated options to a script, there is the argparse module, part of the Python standard library. This module gives Python scripts the (relatively) simple ability to take flags and process (or parse) other complicated inputs.

For full details, you should read the documentation linked to above, but as a short example, we can write a program which downloads current tube statuses from Transport for London.

status.py:


import argparse
import json
from urllib.request import urlopen

parser = argparse.ArgumentParser()

parser.add_argument("mode", nargs='*',
                    help="transport modes to consider: e.g. tube, bus or dlr.",
                    default=("tube", "overground"))
parser.add_argument("-l", "--lines", nargs='+',
                    help="specific lines/bus routes to list: e.g. Circle, 73.")

args = parser.parse_args()

if args.lines:
    url = "https://api.tfl.gov.uk/line/%s/status" % ','.join(args.lines)
else:
    url = "https://api.tfl.gov.uk/line/mode/%s/status" % ','.join(args.mode)

# download the JSON document and convert it into Python lists and dicts
with urlopen(url) as response:
    status = json.load(response)

short_status = {s['name']: s['lineStatuses'][0]['statusSeverityDescription']
                for s in status}

for name, description in short_status.items():
    print('%s: %s' % (name, description))

This code uses the argparse module to accept multiple positional arguments for modes of transport, e.g.

python status.py tube bus national-rail

as well as flag based options for individual lines

python status.py -l central
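To experiment with argparse without needing a network connection, the parser can be given a list of strings in place of the real command line. The sketch below reuses the two arguments from status.py; the -v/--verbose switch is a hypothetical extra added only to show a simple on/off flag.

```python
import argparse

parser = argparse.ArgumentParser(description="Toy options parser.")
# zero or more positional arguments, with a fallback default
parser.add_argument("mode", nargs='*', default=["tube", "overground"],
                    help="transport modes to consider")
# an optional flag which takes one or more values
parser.add_argument("-l", "--lines", nargs='+',
                    help="specific lines to list")
# a boolean switch (hypothetical, not part of status.py)
parser.add_argument("-v", "--verbose", action="store_true",
                    help="print extra information")

# Passing a list to parse_args stands in for the command line.
for command_line in ([], ["tube", "bus"], ["-l", "central", "circle", "-v"]):
    args = parser.parse_args(command_line)
    print(command_line, '->', args.mode, args.lines, args.verbose)
```

Running a script built this way with -h prints a usage message assembled from the help strings.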

Exercise Three: Find the mean

Write a script to calculate the mean of a sequence of numbers. As an extension, try to make it take extra options (using the `argparse` module) -b, -o and -x to work with binary (i.e. base 2, with 101 == 5 decimal), octal (i.e. base 8, with 31 == 25 decimal) and hexadecimal (i.e. base 16, with 2A == 42 decimal) numbers.

Test your basic script on the following sequences: 1 (mean 1); 1 5 7 13 8 (mean 6.8); and 2.5 4 -3.2 9.3 (mean 3.15).

Also try feeding it no input.

Tips:

For the longer version you can use the 2 argument version of the int function to change the base of numbers. For example int('11',2)==3 and int('3A', 16)==58.

Model answers are provided for the short exercise and for the long version which takes options flags.

Interlude

Let’s take a break from talking about Python scripts to point out a weird way that python behaves that can sometimes catch people out when writing code.

If you haven’t seen this before, try and guess the output produced by repeatedly calling these functions in the cells below.

def f(tmp=[]):
    """Try to default to have tmp as an empty list."""
    for i in range(4):
        tmp.append(i)
    return tmp

def g(tmp=None):
    """Doing the same thing explicitly."""
    tmp = tmp or []
    for i in range(4):
        tmp.append(i)
    return tmp
print('f()', f())
print('g()', g())
print('f()', f())
print('g()', g())

So, what’s going on here?

In the first case Python creates the empty list variable once when the function is first defined, and then reuses it on every subsequent call, since tmp is initialised to refer to it whenever we reenter the function. In the second case tmp is pointed at None each time, then changed to point at a new empty list each time the tmp = tmp or [] line is called.

The x = y or z syntax means x is set to y if y is “truthful”, or z if it isn’t (None, 0 and empty containers like lists are all not truthful).
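One consequence of the tmp = tmp or [] line is that an empty list passed in by the caller is also “not truthful”, so it is silently swapped for a fresh one. A common alternative is to test for None explicitly. The following is a sketch only; the function name h is hypothetical, chosen to follow f and g above.

```python
def h(tmp=None):
    """Append 0 to 3 to tmp, creating a new list only if none was given."""
    if tmp is None:
        tmp = []
    for i in range(4):
        tmp.append(i)
    return tmp

# Each call without arguments gets its own fresh list.
print('h()', h())
print('h()', h())

# A list supplied by the caller is filled in place, even if it starts empty.
mine = []
h(mine)
print('mine', mine)
```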

End of interlude

A reminder on using matplotlib in scripts

In scripts which are run in the terminal, rather than in a notebook or an IPython console, matplotlib may not automatically put interactive plots on screen. In this case, you will need to call matplotlib.pyplot.show() (commonly written plt.show() after import matplotlib.pyplot as plt) to see your figures.

Alternatively, as you learnt last week, you can use a command like pyplot.savefig('mycoolplot.png') to write the images to disk without any human interaction. The output format is guessed from the filename that you give it (e.g. .png, .jpg, .pdf).

Exercise Four: Plots in scripts

Write a script to plot the functions $y=\sin(x)$, $y=\cos(x)$ and $y=\tan(x)$ to screen over the range [0,$2\pi$] and then run it in a terminal/prompt.

Make sure to include labels on your axes.

Change the script to output a .png file to disk.

Next do the same to write a .pdf.

Model answers are available.

PEP8 - The Python style guide

Although, as you saw earlier, non-ASCII function names and comments are allowed in the Python 3 standards, you are strongly discouraged from using them in code which other people are going to see (including the assignments on this Masters course). That is actually one of the recommendations of the Python Style Guide, known as PEP8.

Python Enhancement Proposals (PEPs) are the mechanism through which Python expands and improves, with suggestions discussed and debated before being implemented or rejected. PEP8 describes suggestions for good style in Python code, based on Guido van Rossum (the original Python creator) noting that (with the exception of throw-away scripts) most code is read more often than it is written. As such, the single most important aspect of code is readability, by you and by others.

Note that PEP8 does not cover every single decision necessary in generating Python code in a consistent style. As such, there are many more detailed guides, either at the project level, or for entire organizations. For an example of the former, see the discussion of numpy later in this lecture. For an example of the latter, see the Google Python Style Guide. When choosing what to do on your own projects, you are the boss, but PEP8 is a useful minimum (and will gain/lose you marks during the assessed exercises in this course) and it is useful to consider the thinking behind the choices other projects make.

Code linters, and static code analysis

For Python, as with many other languages, there exist automated tools which check your code against an encoding of a style guide and point out mistakes. These are known as code linters, by analogy with clothes lint and the lint rollers used in laundries. Like the cleaning tool they remove mess and ‘fluff’ from your code to leave things looking neat and tidy.

Figure: a lint roller. By Frank C. Müller, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=636140

There are many tools to perform code linting with python, including the lightweight pycodestyle (formerly known as pep8) package, which simply checks for conformity with the basic PEP8 guidelines. Some tools, such as pyflakes and pylint also perform static code analysis. That is, they parse and examine your code, without actually running it, looking for bad “code smells”, or for syntax which is guaranteed to fail.

An extension to run the pylint tool is offered as an option when using VS Code to edit Python code. You can also elect to use automatic pep8 corrections as you type, as well as running a full pep8 check each time a document is saved, by installing the relevant Python packages and turning on the relevant options in the extension’s settings.

Other hints for writing good Python scripts:

  • Explicit is better than implicit.

  • Don’t duplicate code, use functions.

  • Try to keep things compact enough to read in one go.

  • Make variable names meaningful if used on more than one line.

  • Simple is often better than clever.

  • Practise the principle of least astonishment.

  • Add comments when they add meaning.

For further discussion, see resources such as Google’s Python Style Guide, the Hitchhiker’s Guide to Python and style guides for large open source python projects such as Django which define, discuss and give verdicts for a number of open questions not covered by PEP8 (or where they disagree).

Finally, remember once again that much about code style is a social issue. You certainly don’t have to decide to behave the way any guide tells you if it affects no-one else, and nobody else ever interacts with your code. You should behave the way you and your team mates agree (or how your boss tells you!)

Exercise Five: Fix the script

Copy the following script into your editor/IDE and run the static analysis tool pylint on it. Fix the errors and warnings that it gives you.

value={1:'Ace',11:'Jack',12:'Queen',13:'King'}; 
for _ in range(2,11):
    value[_]=_
suit={0:'Spades',1:'Hearts',2:'Diamonds',3:'Clubs'}
def  the_name_of_your_card(v,s = 0,*args, **kwargs):
    
    
   """Name of card as a string.
   """ 
   if (v < 1  or v > 13 or s not in (0,1,2,3)):
      raise ValueError
      
   
   return """I have read your mind and predict your card is the %s of %s."""%(value[v], suit[ s])
print( the_name_of_your_card(2,  s= 3))
      

Interlude

If you haven’t seen this before, try and guess the output produced by the functions in the cells below. Can you explain what’s going on?

a = [_**2 for _ in range(5)]

for i, k in enumerate(a):
    print('%s: %s'%(i, k))
print('sum:', sum(a))
b = (_**2 for _ in range(5))

for i, k in enumerate(b):
    print('%s: %s'%(i, k))
print('sum:', sum(b))
c = (_**2 for _ in range(5))
print('sum:', sum(c))

So what’s going on here?

While [ a for a in b ] is a list comprehension, making up a list (an iterable) from the elements you ask for, the generator syntax ( a for a in b ) doesn’t make up a tuple (despite looking like it should), but a pure iterator. That means that it creates its elements only when asked for them, but can only be cycled through once.

Iterators and generators can be useful to save system memory when dealing with very large sequences. For example, range(10**6) returns a lazy range object covering the numbers 0 to 999,999, which takes just 48 bytes of memory, while list(range(10**6)) fills upwards of 8 MB.

In general, generators and comprehensions can be very useful ways to code in Python, and are often faster to run than the equivalent for loop construction would be (though not usually as fast as using numpy for numerical operations where that’s possible).
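The memory figures can be checked with sys.getsizeof, as in the sketch below. Exact numbers depend on the Python version and platform, and getsizeof counts only the container itself, not the integer objects a list refers to.

```python
import sys

n = 10**6
lazy = range(n)                   # knows its start, stop and step only
as_list = list(range(n))          # stores a reference to every element
gen = (i**2 for i in range(n))    # computes each square on demand

for name, obj in [('range', lazy), ('list', as_list), ('generator', gen)]:
    print(name, sys.getsizeof(obj), 'bytes')

# Nothing has been squared yet; values appear only when requested.
print(next(gen), next(gen), next(gen))
```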

Python Modules

Python module files denote code which you can use with an import command in your own scripts and programs. That is to say that it describes an external file from which you are using (or reusing) content. In other languages, a very similar concept might be called a library file. A pure Python module file has the same format as a script, except it expects to be imported into other files, or into the interpreter directly, usually without visually interacting with the user. This means that a typical module file contains definitions for functions and classes, but doesn’t produce any output (or try to do any significant extra work) by itself.

The code for a module, code_mod.py:

"""Wrapper for rot13 encoding.""" 

import codecs

def rot13(input):
    """Return the rot13 encoding of an input string."""
    return codecs.encode(str(input), 'rot13')

Once the file is saved, the module can be imported and used:

import code_mod
code_mod.rot13("Uryyb rirelobql!")

The import command

The import search path

After looking in the current directory, Python uses the other directories inside the sys.path variable, in order, when asked to find files via an import command.

import sys
print(sys.path)

This means that this variable can be changed within a Python script itself, or can be influenced when the Python session starts through the PYTHONPATH environment variable.
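As a sketch of changing the search path from within Python, the code below writes a tiny module into a temporary directory and then makes it importable. The module name hello_mod and its contents are invented for this example.

```python
import os
import sys
import tempfile

# Create a throwaway directory holding a one-line module.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'hello_mod.py'), 'w') as outfile:
    outfile.write('GREETING = "Hello from hello_mod"\n')

# Put the directory at the front of the search path...
sys.path.insert(0, module_dir)

# ...so that the import system can now find the file.
import hello_mod

print(hello_mod.GREETING)
print(hello_mod.__file__)
```

Setting the PYTHONPATH environment variable before starting Python has a similar effect, without needing to edit any code.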

The importlib.reload and %reset commands.

The reload function in the importlib module tells the interpreter to update its record of the contents of an individual module. This can be useful during an interactive interpreter session if you update the code in a module or package, whether automatically, or by editing the file by hand.

The IPython/Jupyter magic command %reset clears elements of the interpreter history and resets things back to their original blank state.

Warning

The reload command only updates the contents of the module passed as an argument, not necessarily the contents of modules that are imported inside it. If in doubt, it’s safest to exit the interpreter and restart.

x=7
print(x)
# by itself, %reset asks the user for confirmation.
# %reset -f forces it to proceed.
%reset -f
try:
    print(x)
except NameError as e:
    print(e)

Python docstrings for scripts, modules and packages.

Documentation where it’s needed

As you were told in the introduction to Python course, the text between the """ delimiters is called a docstring. It should appear at the top of scripts & module files (or just below the file encoding, if one is needed) and as the first text lines inside class or function def blocks. Python uses it to generate help information if asked. This information is stored in the object’s __doc__ attribute.

import code_mod
code_mod.rot13?

There is actually a PEP, PEP257, which gives suggestions for a good docstring. In particular it suggests:

  • One line docstrings should look like

    def mod5(a):
        """Return the value of a number modulus 5."""
        return a%5
    

    I.e. the docstring is a full sentence, ending in a period, describing the effect as a command (“Do this”, “Return that”).

  • Multiline docstrings should start with a one line summary with similar syntax and have the terminating """ on its own line.

  • The docstring of a pure script should be a “usage” message.

  • The docstring for a module should list the classes and functions (and any other objects) exported by the module, with a one-line summary of each.
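As an illustration of the multiline layout, and of where the text ends up, here is a short sketch. The function clip is invented for this example; inspect.getdoc is a standard library helper which returns the docstring with its indentation tidied.

```python
import inspect

def clip(x, lower, upper):
    """Return x limited to the closed interval [lower, upper].

    Values below lower are replaced by lower, and values above
    upper are replaced by upper.
    """
    return max(lower, min(x, upper))

# The raw text is stored on the function object itself.
print(clip.__doc__.splitlines()[0])

# inspect.getdoc strips the common leading whitespace.
print(inspect.getdoc(clip))
```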

The numpydoc standard

The numpy package has its own standards, which are well suited to numerical code, especially code which interfaces with the numpy package itself, e.g. by using numpy multidimensional arrays. You have already seen examples of the numpydoc style in previous lectures, but let’s give another one.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as pyplot

def mandelbrot(c, a=2.0, n=20):
    """
    Approximate the local Mandelbrot period of a point.

    Parameters
    ----------
    c : complex
        Point in the complex plane.
    a : float
        A positive bounding length on the horizon of the point z_n.
    n : int
        Maximum number of iterations.

    Returns
    -------
    int or float
        The first i < n such that |z_i| > a, or NaN if there is none.
    
    """
    
    z = c
    for _ in range(n):
        if abs(z)>a:
            return _
        z = z**2 + c
    return np.nan

dx = np.linspace(-2, 1, 300)
dy = np.linspace(-1.5, 1.5, 300)
x, y= np.meshgrid(dx, dy)
z = np.empty(x.shape)

for i in range(len(dx)):
    for j in range(len(dy)):
        # pass the iteration limit by keyword; a bare 100 would set a instead
        z[i, j] = mandelbrot(x[i, j]+1j*y[i, j], n=100)
    
    
pyplot.pcolormesh(x, y, z)
pyplot.xlabel('$x$')
pyplot.ylabel('$y$')
pyplot.get_cmap().set_bad('black')

In the numpydoc style, the Parameters and Returns sections prescribe the data types (int, float, complex, str etc.) of the inputs and outputs of the method. This uses the syntax of a text markup language called reStructuredText. We will revisit this tomorrow when we introduce the documentation generator, sphinx.

A note on types

By default Python practises a form of dynamic typing called “duck typing”, where “as long as it looks like a duck and quacks like a duck, it’s a duck”. This can sometimes cause surprising problems when the names of functions clash.

class Duck(object):
    def quack(self):
        print("Quack!")
        return self
    def fly(self):
        print("Flap, flap, flap")
        return self
        
class Bugs(object):
    def spider(self):
        print("8 legs")
        return None
    def fly(self):
        print("6 legs")
        return None
        
def takeoff(x):
    return x.fly()

#This works
duck = Duck()
takeoff(duck).quack()


try:
    # this won't
    bugs = Bugs()
    takeoff(bugs).quack()
except AttributeError:
    import traceback
    traceback.print_exc()
        

In a statically typed language like C, this kind of error would be caught when the code was compiled. In Python, errors can often only be caught when the branch which holds them is run.

Design by contract

Those of you used to statically typed languages like C will find the numpydoc specification familiar. The numpydoc docstrings are also a weak example of a wider code design philosophy called design by contract, or programming by contract. In that system, the developer explicitly lists all the assumptions that a function makes about its inputs, as well as the guarantees that it makes about its outputs.
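One lightweight way to make such a contract executable is with assert statements. The sketch below is an illustration only, not a full design by contract framework; the function normalise and its particular conditions are invented for the example. Note that assertions are skipped entirely when Python is run with the -O flag.

```python
def normalise(weights):
    """Scale a sequence of weights so that they sum to one.

    Parameters
    ----------
    weights : list of float
        Non-negative values, at least one of which is positive.

    Returns
    -------
    list of float
        Values in the same order, summing to 1.
    """
    # Preconditions: what the function assumes about its inputs.
    assert len(weights) > 0, "weights must not be empty"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "at least one weight must be positive"

    result = [w / total for w in weights]

    # Postcondition: what the function guarantees about its output.
    assert abs(sum(result) - 1.0) < 1e-12
    return result

print(normalise([1, 1, 2]))
```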

Exercise Six: Complex square root

Write a function which accepts a real number and returns the complex square roots of that number.

Your function should include a docstring conforming to the numpydoc standard.

Tips:

  • You can use the `sqrt` function in the `math` module to obtain the square root of a positive real number.
  • Python uses the notation `1j` for a unit length imaginary number (which a mathematician would typically denote $i$), where (loosely) $\sqrt{-1}=\pm 1j$.

Questions: how many complex square roots does each real number have? Is it the same for every real number?

A model answer is available.

interlude: Code Quality

Code quality is often a balance between three things:

  • Maintainability: The code is easy to read and to understand.
  • Performance: The code is as fast and secure to run as we can make it.
  • Resources: This is both the size of the machine and the developer time available to address the problem.

This is frequently a case of “which two do you want?” As such, compromises are necessary when designing code. However, it’s important that they are recognised, and only made when appropriate. To quote Donald Knuth:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

The moment that code is going to be read a second time (including by you in two months’ time), it becomes unacceptable to write it as though it is disposable. Functions need docstrings, and variables should have names which make sense (and not just to you personally right now).

Similarly, when you’ve tested your code, and you know that a specific function takes 90% of the runtime, it may make sense to rewrite it in a faster way, even if that is harder to maintain (more numpy, using numba, writing your own C extension modules, and so on).
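To find out where the run time actually goes before optimising anything, the standard library profiler `cProfile` can be used. The sketch below profiles two made-up functions (`slow_sum_of_squares` and `fast_constant` are illustrative names only) and prints the entries with the largest cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately slow pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_constant():
    """A function that costs almost nothing."""
    return 42


def main():
    fast_constant()
    return slow_sum_of_squares(200_000)


# Collect timing information while main() runs.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Write the five most expensive entries (by cumulative time) to a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the printed table the loop function should dominate the run time, which tells you where any optimisation effort is best spent.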

Combined files

Mixed scripts & modules

A file can be both a script and a module, provided you use a special if test to check how it is being used. This avoids being antisocial: without the test, simply importing the file would trigger all the calculating and printing that the script is set up to do.

rot13m.py:

import codecs

# module definitions

def rot13(input):
    """Return the rot13 encoding of an input"""
    return codecs.encode(str(input), 'rot13')
    
if __name__ == "__main__":
    # Code in this block runs only as a script,
    # not as an import
    import sys
    print(rot13(sys.argv[1]))

Although it started as something of a hack, the if __name__ == "__main__" idiom is now accepted as fully Pythonic, and is something you will see often in modules which also have sensible script-like behaviour.
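A common refinement, sketched below on the assumption that the script takes a single argument, is to move the script behaviour into a function. That way it can also be called (and tested) after an import. The name `main` and the usage message are conventions chosen for this example, not requirements of Python.

```python
import codecs
import sys


def rot13(text):
    """Return the rot13 encoding of text."""
    return codecs.encode(str(text), "rot13")


def main(argv=None):
    """Run the script behaviour; argv defaults to the command line arguments."""
    if argv is None:
        argv = sys.argv[1:]
    if len(argv) != 1:
        print("usage: python rot13m.py TEXT", file=sys.stderr)
        return
    print(rot13(argv[0]))


if __name__ == "__main__":
    # True when the file is run as a script or with python -m.
    # When the file is imported, __name__ is the module name instead,
    # so nothing is printed.
    main()
```

After `import rot13m`, both `rot13m.rot13("Hello")` and `rot13m.main(["Hello"])` remain available to the importing code.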

Exercise Seven: A primes module


Make a copy of your script to calculate prime numbers and:

  1. add the ability to read the number of primes to output from the command line,
  2. turn it into a version which can also be used as a module,
  3. test this by running a copy of the interpreter and `import`ing it, then calling your routine.
  4. Try running it from the terminal/Anaconda command prompt using the following syntax:
python -m rot13m "this runs a python module"

See what happens if you change directories.
A model answer is available

Python Packages

An example package

Python packages bundle multiple modules into one place, to make installing and uninstalling them easier and to simplify usage. A simple python package just consists of python files inside a directory tree.

A typical template for a fairly basic python package called mycoolproject might look like:

mycoolproject
 ├── __init__.py
 ├── cool_module.py
 ├── another_cool_module.py
 └── extras
      ├── __init__.py
      ├── __main__.py
      └── extra_stuff.py
requirements.txt
setup.py
LICENSE
README.md

The __init__.py file is slightly special (as is common in python with double underscore names, or dunders), in that it gets read when you run import mycoolproject (or whatever the name of the directory is). The other files can be imported by themselves as mycoolproject.cool_module, mycoolproject.another_cool_module, etc. Similarly, the __main__.py file acts like a package version of the if __name__=='__main__': block for modules, in that it is run when the package containing it is executed with python -m. In the layout above, where __main__.py lives in the extras subpackage, that would be python -m mycoolproject.extras
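The behaviour of `__main__.py` can be checked with a small experiment. The sketch below builds a throwaway package in a temporary directory (the name `demo_runnable` is made up for this example) and runs it with `python -m` in a subprocess.

```python
import pathlib
import subprocess
import sys
import tempfile

# Build a tiny package in a temporary directory.
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "demo_runnable"      # hypothetical package name
pkg.mkdir()
(pkg / "__init__.py").write_text("GREETING = 'hello from the package'\n")
(pkg / "__main__.py").write_text(
    "from . import GREETING\n"
    "print(GREETING)\n"
)

# Equivalent to typing "python -m demo_runnable" in the directory
# which contains the package.
completed = subprocess.run(
    [sys.executable, "-m", "demo_runnable"],
    cwd=root, capture_output=True, text=True, check=True,
)
output = completed.stdout.strip()
print(output)
```

Here `__init__.py` is read first, because running the package imports it, and then `__main__.py` is executed.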

In a typical package the __init__.py file mostly consists of import commands to load functions and classes from the other modules in the package into the main namespace, as well as possibly defining a few special variables for itself.

mycoolproject/__init__.py:

from .cool_module import my_cool_function, my_cool_class
from .another_cool_module import *

Since, as the author of the package, you are in control of everything, this may be the only time the from modX import * idiom is appropriate.

When importing modules, remember that levels of directories are separated using the . symbol, so a function defined in extra_stuff.py would be imported as

from mycoolproject.extras.extra_stuff import super_cool_function
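To see these pieces working together, the sketch below writes a miniature package into a temporary directory and imports it. The package name `demo_pkg` and the function bodies are invented for the example; the file layout mirrors a cut-down version of the template above.

```python
import importlib
import pathlib
import sys
import tempfile

# Create demo_pkg/ with one module and one subpackage.
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"            # hypothetical package name
(pkg / "extras").mkdir(parents=True)

(pkg / "cool_module.py").write_text(
    "def my_cool_function():\n"
    "    return 'cool'\n"
)
# __init__.py lifts the function into the package's main namespace.
(pkg / "__init__.py").write_text("from .cool_module import my_cool_function\n")
(pkg / "extras" / "__init__.py").write_text("")
(pkg / "extras" / "extra_stuff.py").write_text(
    "from ..cool_module import my_cool_function\n"
    "def super_cool_function():\n"
    "    return 'super ' + my_cool_function()\n"
)

# Make the directory containing the package visible to import.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg
from demo_pkg.extras.extra_stuff import super_cool_function

print(demo_pkg.my_cool_function())
print(super_cool_function())
```

The first call works because `__init__.py` ran during `import demo_pkg`; the second shows the dotted path following the directory structure.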

Exercise Eight: A primes package


Turn your “find the primes” module file into a package called primes by creating a suitable directory structure and an __init__.py so that you can access a function to give you the first \(n\) primes as well as all primes smaller than \(n\).

Try importing your new package from the IPython console. Check that you can call your function.

If you have time, add a function to the package to give you a list of the prime factors of an integer (i.e the prime numbers which divide it with no remainder).

A model answer is available.

Information

You can also use ‘.’ for relative import statements, with a syntax similar to the unix shell. So, in the example above the file /mycoolproject/extras/__init__.py can write:

# one . for the current directory
from . import extra_stuff
# two ..s for its parent
from .. import another_cool_module
# this also works
from ..cool_module import my_cool_function

Licensing


Warning

I am not a lawyer! More specifically, I am not your lawyer. Lawyers spend a lot of money on insurance, so that they are safe to give specific legal advice without the fear of liability. While I will try to be as accurate as possible in the information provided here, don’t plan on using these notes as a defence in court.

Licences grant permissions

As a copyright holder, you can always grant others the ability to use, copy and distribute your software. The easiest and simplest way to do this is to publish a licence (that’s the UK spelling) together with your code. As a user & developer, ensuring that software you use has a licence with terms compatible with what you intend to do with it prevents long, costly and embarrassing legal action further down the line. The best advice is thus to store a licence file in any repository people can see and copy from, and possibly even add it to the header of your source code.

Although in theory you could always write your own licence, few scientists are also lawyers. Because legal text has legal meaning, it is always safer to use one of the well known and well understood existing copyright licences. If in doubt, the MIT License (that’s how the Americans spell it) is popular and well understood. If you feel strongly that your work must always remain in circulation, use the latest version of the GPL.

Note that the legal concept of licensing is almost entirely separate from the academic concept of plagiarism. A licence can give you the legal right to reuse or modify someone else’s work, but you cannot be given the moral right to falsely claim it as your own, and you should identify the original author in an appropriate manner.

Installation and distribution

Setup.py, distutils and setuptools

The setup.py file is a standard name for an install script for Python packages (written in Python itself). Python even comes with a module in its standard library, distutils, to automate this as much as possible. We will use an enhanced version called setuptools, compatible with the Python package manager, pip. For a simple package in plain Python, the setup.py file might look like the following:

from setuptools import setup

setup(
    name='mycoolproject',  # Name of package, required
    version='1.0.0',  # Version number, required
    packages=['mycoolproject'],  # directories to install, required
    # One-line description or tagline of what your project does
    description='A sample implementation of quaternions.',  # Optional
    url='https://www.mycoolproject.com',  # Optional
    author='James Percival',  # Optional
    author_email='j.percival@imperial.ac.uk',  # Optional
)

This script can be called in several modes. For pure Python packages, the most useful is probably

python setup.py install

or

python setup.py install --user

These both copy the files in the package into a directory in the standard search path. The first installs for everybody (and might need admin rights); the second installs just for the current user.

Version Numbers

The version keyword in the setup.py file allows you to specify a version number. There are many formats for version numbers used in software development. These range from the absurdly simple (eg. build 1, build 2, build 3 …) to the complicated (eg. the Linux kernel has versions like 4.15.0-36-generic), to the unusual (eg. the TeX typesetting system is currently on version 3.14159265, with a successive digit of \(\pi\) added with each new version). As is often the case, there is even a PEP about it (PEP440).

Unless you have a good reason to do something different, “semantic versioning” is a convenient standard to stick with. This is just an ordered set of three integers, separated by dots, e.g. 0.2.3 or 13.4.2. The structure is (major version).(minor version).(patch version), where a major version increment (e.g. from 10.2.3 to 11.0.0) implies big changes in the code, which are likely to break things designed for previous versions, while a minor increment means new features added in a way that should remain compatible with code written for earlier releases of the same major version. Incrementing the patch version implies only bug fixes, with no change to external APIs.
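The increment rules can be written down in a few lines. This is a minimal sketch which assumes plain three-integer version strings; the helper names `parse_version` and `bump` are made up for the example and do not come from any packaging library.

```python
def parse_version(text):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump(text, level):
    """Return the version after a 'major', 'minor' or 'patch' increment."""
    major, minor, patch = parse_version(text)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level!r}")


print(bump("10.2.3", "major"))   # 11.0.0
print(bump("10.2.3", "minor"))   # 10.3.0
print(bump("10.2.3", "patch"))   # 10.2.4

# Tuples of integers compare in the intended order; plain strings do not.
print(parse_version("1.10.0") > parse_version("1.9.0"))   # True
print("1.10.0" > "1.9.0")                                 # False
```

The last two lines show why version numbers should be compared as integers rather than as text.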

Because the differences between major versions can prevent people upgrading, it’s common to “backport” fixes and features from the mainline “trunk” of development back to new minor versions of the previous generation of code. A good example is Python itself, where version 2.7.0 was released on July 3rd, 2010 (with patch releases continuing up to 2.7.18 in April 2020), even though Python 3.0 had been released on December 3rd, 2008.

Some communities (e.g. the Linux kernel developers, historically) attach additional meaning to the semantic numbers. For example, a common scheme is that odd minor versions are “development” or “unstable”, whereas even numbers are for general release, or “stable”. That means there are more likely to be bugs (and thus more patches) in the unstable releases of the code base, but new features appear there first.

The pip and conda package managers

Although you can install packages yourself by hand, it is simpler to use a tool, called a “package manager”, to control things. This allows for easier installs, uninstalls and for sandboxing environments to run specific software (this is described in Wednesday’s lecture).

Your Anaconda installation comes with two inbuilt package managers: conda, specially written for Anaconda itself, and pip, which is more widely available on non-Anaconda installs. Since conda understands pip, we will describe pip in more detail here.

Dependencies

An individual Python package typically has its own Python dependencies (i.e. other non-system packages which this package itself imports). A common way to document these is a requirements.txt file consisting of a list of package names (one per line), possibly also indicating a minimum or exact version number to be installed.

requirements.txt

jupyter
numpy >= 1.13.0
scipy == 1.0.0
mpltools

The lines with just the name allow any version, the lines with >= demand a version which is “greater than or equal to” that specified (where e.g. 2.0.0 > 1.9.1 and 1.2.0 > 1.1.9) and the lines with == demand a specific version. The packages listed in the requirements.txt file, or at least suitable versions of them, can then be pip installed in one go, via the compact command:

pip install -r requirements.txt
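The meaning of the three kinds of line can be made concrete with a toy checker. This is a deliberately simplified sketch: the helpers `parse_requirement`, `as_tuple` and `satisfies` are invented here, handle only `>=` and `==`, and assume dotted-integer versions with the same number of components. The real rules followed by pip (PEP 440) are considerably richer.

```python
def parse_requirement(line):
    """Split a line such as 'scipy == 1.0.0' into (name, operator, version)."""
    for operator in (">=", "=="):
        if operator in line:
            name, version = line.split(operator)
            return name.strip(), operator, version.strip()
    # A bare package name places no constraint on the version.
    return line.strip(), None, None


def as_tuple(version):
    """Turn '1.9.1' into (1, 9, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, line):
    """Check an installed version string against one requirement line."""
    _, operator, wanted = parse_requirement(line)
    if operator is None:
        return True
    if operator == ">=":
        return as_tuple(installed) >= as_tuple(wanted)
    return as_tuple(installed) == as_tuple(wanted)


print(satisfies("2.0.0", "scipy >= 1.9.1"))   # True
print(satisfies("1.0.1", "scipy == 1.0.0"))   # False
print(satisfies("0.0.1", "mpltools"))         # True
```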

The conda manager accepts similar files in a format called .yml or .yaml (short for “yet another markup language”, or possibly “YAML ain’t markup language”). YAML formatted files are normally used for software configuration, where data elements mostly consist of named strings and lists. A conda environment.yml file looks like

environment.yml

name: acse
dependencies:
 - jupyter
 - numpy
 - scipy
 - pip:
   - mpltools

Here the pip subsection in the dependencies lists packages for which there isn’t a full conda package produced, and where a straight pip install must be used instead.

As yet another route, you can also include your dependencies in your setup.py file using the install_requires keyword in your setup function. While each method works, we recommend the pip and requirements.txt route as typically more common, repeatable and robust.

Exercise Nine

Make a setup.py script for your module and try installing and uninstalling it using pip. In the directory containing the setup.py file run

pip install .

and

pip uninstall <the name of your module>

From an interpreter console started in another directory, see when you can and can’t import your new module.

PyPI, the Python Package Index

When not given a local setup.py file to work from, pip defaults to scanning PyPI, the Python Package Index. This is a very large repository of python software, and is a very good place to check before naming your projects. It also has a useful tutorial on the packaging process, and points to https://choosealicense.com/ as a resource for picking licenses.

If you would like to test upload packages yourself, then there is an option to register for their test server, then follow the tutorial instructions to archive and upload your packages there.

With that step completed, you now know everything you need to write, package and distribute open source software to the world. All you need to add is your time and creativity.

interlude: Coupling Python and other languages

As you may have realised, although powerful, Python code is not always particularly fast to execute. One way around this (as followed by packages such as numpy) is to write small, frequently called sections in a compiled language such as C. Since the usual Python implementation is itself written as C code, there is a well documented path to do so, called the Python C API.

As a concrete example, consider the following C file:

example.c:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

typedef struct {
    PyObject_HEAD
    char data_name[255];
} exampleExample;

PyObject* exampleExample_NEW(void);
int exampleExample_Check(PyObject*);
PyTypeObject* exampleExample_Type(void);

static PyObject *
my_fun(PyObject *self, PyObject *args)
{
    long l1, l2;

    if (!PyArg_ParseTuple(args, "ll", &l1, &l2))
        return NULL;
    return PyLong_FromLong(2*l1+l2);
}

static PyMethodDef exampleMethods[] = {
    {"my_fun",  my_fun, METH_VARARGS,
     "my_fun(a, b)\n--\n\n Return 2*a+b."}, /* function documentation */
    {NULL, NULL, 0, NULL}        /* Sentinel indicating end of module methods */
};

static struct PyModuleDef examplemodule = {
    PyModuleDef_HEAD_INIT,
    "example",   /* name of module */
    "C based example extension", /* module documentation, may be NULL */
    -1,       /* size of per-interpreter state of the module,
                 or -1 if the module keeps state in global variables. */
    exampleMethods
};

PyMODINIT_FUNC
PyInit_example(void)
{
    /* The function name must be PyInit_<module name>, matching the
       "example" name used in the PyModuleDef above and in setup.py.
       Registering the Example type is left out here, because
       exampleExample_Type() is only declared, not defined, in this file. */
    return PyModule_Create(&examplemodule);
}

Although this looks complicated, as with most C code it mostly follows a standard template. The most important part is the definition of the function my_fun which we are turning into a Python module method.

To actually build the code (on a system with a suitable C compiler), we can use a slightly different form of setup.py file:

setup.py:


#!/usr/bin/env python

from setuptools import setup, Extension

mod1 = Extension('example',
       sources=["example.c"])

setup(name='Example',
      version='1.0',
      description='An example template',
      author='James Percival',
      author_email='j.percival@imperial.ac.uk',
      ext_modules=[mod1]
     )

Now we can (hopefully) just run python3 setup.py build_ext --inplace.

There exist tools such as Cython (for the Python=>C side) and SWIG (for C=>Python) which somewhat simplify these workflows.

More programming exercises.

The website Project Euler contains a large number of computational mathematics problems which can be used as exercises in any programming language to practise thinking algorithmically (warning, some of them use complicated mathematics). We will list a few here:

Exercise: Project Euler Problem 1


If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Exercise: Project Euler Problem 5


2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder.

What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

Exercise: Project Euler Problem 8


Consider the 1000 digit number


73167176531330624919225119674426574742355349194934 96983520312774506326239578318016984801869478851843 85861560789112949495459501737958331952853208805511 12540698747158523863050715693290963295227443043557 66896648950445244523161731856403098711121722383113 62229893423380308135336276614282806444486645238749 30358907296290491560440772390713810515859307960866 70172427121883998797908792274921901699720888093776 65727333001053367881220235421809751254540594752243 52584907711670556013604839586446706324415722155397 53697817977846174064955149290862569321978468622482 83972241375657056057490261407972968652414535100474 82166370484403199890008895243450658541227588666881 16427171479924442928230863465674813919123162824586 17866458359124566529476545682848912883142607690042 24219022671055626321111109370544217506941658960408 07198403850962455444362981230987879927244284909188 84580156166097919133875499200524063689912560717606 05886116467109405077541002256983155200055935729725 71636269561882670428252483600823257530420752963450

The four adjacent digits in this number that have the greatest product are 9 × 9 × 8 × 9 = 5832. Find the thirteen adjacent digits in the 1000-digit number that have the greatest product. What is the value of this product?

Summary

In this lecture we learned:

  • The difference between Python scripts, modules and packages.
  • Code standards and code linters.
  • To make & install your own Python package.

Tomorrow:

  • More on scientific Python packages: scipy, sympy etc.

Further Reading:

  • PEP8, the Python style guide

  • The Google Python Style Guide

  • A tutorial for the Visual Studio Code IDE

  • The python documentation page on modules & packages.

  • PEP257 - docstring conventions.

  • The numpydoc docstring standard

  • Writing C extensions for Python