Code profiling and optimisation¶
….or “My python code is too slow! What do I do now?”
Navjot Kukreja nkukreja@imperial.ac.uk
The following is based on a scipy lecture.
“Premature optimization is the root of all evil” -Donald Knuth
Development workflow¶
Make it work: write the code in a simple legible way.
Make it work reliably: write automated test cases, make really sure that your algorithm is right and that if you break it, the tests will capture the breakage.
Profile-optimise: Optimise the code by profiling simple use-cases to find the bottlenecks and speeding up these bottleneck, finding a better algorithm or implementation. Keep in mind that a trade off should be found between profiling on a realistic example and the simplicity and speed of execution of the code. For efficient work, it is best to work with profiling runs lasting around 10s.
Make it work¶
My personal opinion: Readable (but slow) code is better than fast but complicated code.
import this
Profiling Python code¶
Timeit¶
import numpy as np
a = np.arange(1000)
%timeit a ** 2
%timeit a ** 2.1
%timeit a * a
%timeit a ** (0.5)
Profiler¶
!cat -n demo.py
Note: This is a combination of two unsupervised learning techniques, principal component analysis (PCA) and independent component analysis (ICA). PCA is a technique for dimensionality reduction, i.e. an algorithm to explain the observed variance in your data using less dimensions. ICA is a source separation technique, for example to unmix multiple signals that have been recorded through multiple sensors. Doing a PCA first and then an ICA can be useful if you have more sensors than signals. For more information see: the FastICA example from scikits-learn.
%run -t demo.py
%run -p demo.py
Clearly the svd
(in decomp.py
) is what takes most of our time, a.k.a. the bottleneck. We have to find a way to make this step go faster, or to avoid this step (algorithmic optimization). Spending time on the rest of the code is useless.
Profiling outside of IPython, running cProfile
Similar profiling can be done outside of IPython, simply calling the built-in Python profilers cProfile and profile.
$ python -m cProfile -o demo.prof demo.py
Using the -o
switch will output the profiler results to the file demo.prof
to view with an external tool. This can be useful if you wish to process the profiler output with a visualization tool.
Line-Profiler¶
The profiler tells us which function takes most of the time, but not where it is called.
For this, we use the line_profiler
: in the source file, we decorate a few functions that we want to inspect with @profile
(no need to import it)
!pip install line_profiler
!kernprof -l -v demo.py
Exercise¶
Write a function to calculate the factorial of a given number. Write two versions of this function, one using recursion, and one using loops. Use timeit to decide which one to keep.
Making code go faster¶
Once we have identified the bottlenecks, we need to make the corresponding code go faster.
Algorithmic optimization¶
The first thing to look for is algorithmic optimization: are there ways to compute less, or better? For a high-level view of the problem, a good understanding of the maths behind the algorithm helps. However, it is not uncommon to find simple changes, like moving computation or memory allocation outside a for loop, that bring in big gains.
SVD example¶
In the example above, the SVD - Singular Value Decomposition - is what takes most of the time. Indeed, the computational cost of this algorithm is roughly \(n^3\) in the size of the input matrix.
However, in both of these example, we are not using all the output of the SVD, but only the first few rows of its first return argument. If we use the svd
implementation of scipy, we can ask for an incomplete version of the SVD. Note that implementations of linear algebra in scipy are richer then those in numpy and should be preferred.
data = np.random.random((5000, 100))
%timeit np.linalg.svd(data)
from scipy import linalg
%timeit linalg.svd(data)
%timeit linalg.svd(data, full_matrices=False)
%timeit np.linalg.svd(data, full_matrices=False)
We can then use this insight to optimize the previous code.
import demo
%timeit demo.test()
import demo_opt
%timeit demo_opt.test()
Real incomplete SVDs, e.g. computing only the first 10 eigenvectors, can be computed with arpack, available in scipy.sparse.linalg.eigsh
.
Computational linear algebra¶
For certain algorithms, many of the bottlenecks will be linear algebra computations. In this case, using the right function to solve the right problem is key. For instance, an eigenvalue problem with a symmetric matrix is easier to solve than with a general matrix. Also, most often, you can avoid inverting a matrix and use a less costly (and more numerically stable) operation.
Know your computational linear algebra. When in doubt, explore scipy.linalg
, and use %timeit
to try out different alternatives on your data.
Exercise¶
Can you rewrite the following code (think algorithmic optimisations) to run faster?
xs = np.random.randint(0, 1000, 10000)
ys = np.random.randint(0, 1000, 10000)
def common1(xs, ys):
"""Find the common elements in the lists provided"""
zs = []
for x in xs:
for y in ys:
if x==y and x not in zs:
zs.append(x)
return zs
%timeit -n1 -r1 common1(xs, ys)
Writing faster numerical code¶
A complete discussion on advanced use of numpy is found in chapter Advanced NumPy, or in the article The NumPy array: a structure for efficient numerical computation by van der Walt et al. Here we discuss only some commonly encountered tricks to make code faster.
For loops¶
Find tricks to avoid for loops. When using numpy arrays, masks and indices arrays can be useful.
import csv
def read_postcodes():
with open('postcodes.csv', encoding="utf-8-sig") as rf:
reader = csv.DictReader(rf)
postcodes = [row['Postcode'].strip() for row in reader]
return postcodes
postcodes = read_postcodes()
def upper_for(postcodes):
codes = []
for code in postcodes:
codes.append(code.upper())
return codes
def upper_listcomp(postcodes):
return [code.upper() for code in postcodes]
%timeit upper_for(postcodes)
%timeit upper_listcomp(postcodes)
%timeit map(str.upper, postcodes)
import numpy as np
a = np.arange(100)
def increment_for(a):
for i in range(len(a)):
a[i] = a[i] + 1
def increment_vectorised(a):
a[:] += 1
%timeit increment_for(a)
%timeit increment_vectorised(a)
Memoisation¶
Do not compute the same thing twice
def fibonacci(n):
if n == 0: # There is no 0'th number
return 0
elif n == 1: # We define the first number as 1
return 1
return fibonacci(n - 1) + fibonacci(n-2)
%timeit fibonacci(35)
import functools
@functools.lru_cache(maxsize=128)
def fibonacci(n):
if n == 0:
return 0
elif n == 1:
return 1
return fibonacci(n - 1) + fibonacci(n-2)
%timeit fibonacci(35)
Broadcasting¶
Use broadcasting to do operations on arrays as small as possible before combining them.
def increment_bcast(n):
x = np.arange(n)
x += 1
def increment_no_bcast(n):
x = np.arange(n)
x += np.ones(n, dtype=np.int)
%timeit increment_bcast(10**7)
%timeit increment_no_bcast(10**7)
In-place operations¶
a = np.zeros(int(1e7))
%timeit global a ; a = 0*a
%timeit global a ; a *= 0
Note: we need global a
in the timeit
for functional completeness, as it is assigning to a, and thus considers it as a local variable.
Be easy on the memory: use views, and not copies¶
Copying big arrays is as costly as making simple numerical operations on them:
a = np.zeros(int(1e7))
%timeit a.copy()
%timeit a + 1
Beware of cache effects¶
Memory access is cheaper when it is grouped: accessing a big array in a continuous way is much faster than random access. This implies amongst other things that smaller strides are faster (see CPU cache effects):
c = np.zeros((int(1e4), int(1e4)), order='C')
%timeit c.sum(axis=0)
%timeit c.sum(axis=1)
c.strides
This is the reason why Fortran ordering or C ordering may make a big difference on operations:
a = np.random.rand(20, 2**18)
b = np.random.rand(20, 2**18)
%timeit np.dot(b, a.T)
c = np.ascontiguousarray(a.T)
%timeit np.dot(b, c)
%timeit c = np.ascontiguousarray(a.T)
Using numexpr can be useful to automatically optimise code for such effects.
Use compiled code
The last resort, once you are sure that all the high-level optimisations have been explored, is to transfer the hot spots, i.e. the few lines or functions in which most of the time is spent, to compiled code. A simple first option in this case is numba. For more advanced scenarios, it might be advisable to use Cython: it is easy to transform existing Python code into compiled code, and with a good use of the numpy support yields efficient code on numpy arrays, for instance by unrolling loops.
Profiling memory usage¶
!pip install memory_profiler
%load_ext memory_profiler
def sum_of_lists(N):
total = 0
for i in range(5):
L = [j ^ (j >> i) for j in range(N)]
total += sum(L)
# del L # remove reference to L
return total
%memit sum_of_lists(1000000)
Airspeed velocity¶
For larger projects, it is often sensible to have profiling automated and integrated into CI systems. Here is an example.
My code is still slow. What now?¶
Replace as much of your code with third-party libraries as you can - they are likely better optimised than what you had time for.
Are you using all the cores in your system? Following step 1 above first will likely mean this is true. If you still see only one core being used, you need to parallelise your program (Good luck!). Thread-based parallelism is hard to do in Python because of the GIL. The multiprocessing module provides an easy way to introduce process-based parallelism in python programs.
If you’ve followed all the advice so far and your program is still slow, you probably need to run it on more than one computer, i.e. distributed parallelism. Tomorrow’s lecture will introduce that.
One accurate measurement is worth a thousand expert opinions. -Grace Hopper