Sandboxing & virtual environments

Yesterday you learned how to wrap Python code up into a package with its own name and version number. There are several situations in which it can be useful to “sandbox” code into its own space, so that other package installations cannot interfere with it and it cannot interfere with them.

Contact details:

  • Dr. Rhodri Nelson

  • Room 4.85 RSM building

  • email: rhodri.nelson@imperial.ac.uk

  • Teams: @Nelson, Rhodri B in #acse1 or #General, or DM me.

Environment variables

From Wikipedia (https://en.wikipedia.org/wiki/Environment_variable):

An environment variable is a dynamic-named value that can affect the way running processes will behave on a computer. They are part of the environment in which a process runs. For example, a running process can query the value of the TEMP environment variable to discover a suitable location to store temporary files, or the HOME or USERPROFILE variable to find the directory structure owned by the user running the process.

As an example, one important environment variable is the PATH variable. To view the value of this variable, on a Unix machine you can type

echo $PATH

or on a Windows machine

echo %PATH%

The variable is a list of directory paths. When the user types a command without providing the full path, this list is checked to see whether it contains a path that leads to the command.

In summary, the values of environment variables govern how certain aspects of your environment function, e.g. which executables and libraries will be called/accessed by default, or which options will be used when executing certain commands.
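
From within Python itself, environment variables can be read through the standard os module; a minimal sketch:

import os

# Look up a single variable; .get() returns None rather than raising if it is unset
print(os.environ.get("HOME") or os.environ.get("USERPROFILE"))

# PATH is a list of directories separated by os.pathsep (':' on Unix, ';' on Windows)
for directory in os.environ["PATH"].split(os.pathsep):
    print(directory)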

Why do we need a virtual environment?

You’re also now familiar with the Python package manager pip. Consider the following two ‘dummy’ packages and their requirements:

  • Package A, requires the following packages:

    • a, version >= 1.0

    • b, version 1.2

    • c, version >= 2.2

    • d, version >= 5.0

  • Package B, requires:

    • a, version >= 1.0

    • b, version >= 1.3

    • e, version 1.0

    • f, version >= 7.0.

Reviewing the above, we can see there is a conflict for package b. Clearly, using pip to switch between the two versions of package b every time we want to use A or B is not a good solution. Furthermore, in reality, when working on larger development projects such dependency conflicts may arise for several, or even dozens(!), of packages. A better solution is to have both versions of the software installed, plus an easy way to switch between the appropriate environment variables when using either A or B. This is where virtual environments come in handy.

A virtual environment is a tool that helps to keep dependencies required by different projects separate by creating isolated Python environments for them. It is one of the most important tools that most Python developers use.

When and where to use a virtual environment?

By default, every project on your system will use the same directories (defined via environment variables) to store and retrieve site packages (third party libraries). Why does this matter? In the above example of two projects, you have two versions of package b. This is a real problem for Python since it can’t differentiate between versions in the “site-packages” directory. So both v1.2 and v1.3 would reside in the same directory with the same name. This is where virtual environments come into play. To solve this problem, we just need to create two separate virtual environments, one for each project. The great thing about this is that there are no limits to the number of environments you can have since they’re just directories containing a few scripts.
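
As a sketch of that idea (package b here is the hypothetical package from the example above, not a real PyPI package), on a Unix system one might do:

python -m venv projectA
source projectA/bin/activate
pip install "b==1.2"     # the version required by A
deactivate

python -m venv projectB
source projectB/bin/activate
pip install "b>=1.3"     # the version required by B
deactivate

Each environment holds its own copy of b, and activating one or the other adjusts the relevant environment variables for you.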

Along with the above example, we may also want to make use of virtual environments because

  • Sometimes packages have the same name, but do different things, creating a namespace clash.

  • Sometimes you need a clean environment to test your package dependencies in order to write your requirements.txt file (we will talk more about such files later).

venv

Python comes with a built-in module called venv which can be used to create so-called “virtual environments”. Inside a virtual environment, only the packages and tools you explicitly choose to copy across are available, and only at the version numbers you request. This gives quick, easy access to a “clean” system, be it for testing, or to run mutually incompatible software.

To create a new venv environment you can run a command like

python -m venv foo

or, on systems with both Python 2 and Python 3 available,

python3 -m venv foo

This will create a new directory ./foo/ containing the files relevant to that virtual environment. To activate the environment on Windows run

foo\Scripts\activate.bat

or on Unix-like systems

source foo/bin/activate

To deactivate the environment, on Windows systems run

.\foo\Scripts\deactivate.bat

or, in most Unix-based shells

deactivate

Building a requirements.txt file using venv

Switching to a venv environment is one way to build up a short list of required packages for a new project. You can start up a blank environment and then try to build and run your code. If a dependency is missing, this should fail with an ImportError message, something along the lines of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-ukwnrb23-build/setup.py", line 5, in <module>
    from numpy import get_include
ImportError: No module named 'numpy'

Ideally you will then be able to recognise the missing dependency (in this case numpy) and fix it by running a command like pip install numpy. After repeating as needed to fix any further requirements, you can generate a requirements.txt-compatible list for your Python environment with the command pip freeze (which also pins the currently installed version of each package).
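
For example, once the code finally builds and runs cleanly, something along the lines of

pip freeze > requirements.txt

will write out the installed packages with pinned versions (e.g. numpy==1.26.4; the exact versions will of course reflect your own environment).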

Exercise: Make a `venv`

Create your own venv environment, giving it a name of your choice. Activate it. Note the difference it makes to your command prompt.

Double check the installed package list using pip list. Install a package into the virtual environment (such as matplotlib) using pip. Check that the list of installed packages inside the environment changes.

Some other tools (before we talk about Anaconda)

virtualenv

virtualenv is a popular tool for creating isolated Python environments. It works by installing various files in a directory (e.g. env/) and then modifying the PATH environment variable to prefix it with a custom bin directory (e.g. env/bin/). An exact copy of the python or python3 binary is placed in this directory, and Python is programmed to look for libraries relative to its path first, i.e. in the environment directory. virtualenv is not part of Python’s standard library, but is officially blessed by the PyPA (Python Packaging Authority). Once activated, you can install packages into the virtual environment using pip.
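
Its usage closely mirrors that of venv; for example, on a Unix system:

pip install virtualenv     # virtualenv itself is installed via pip
virtualenv env             # create the environment in ./env/
source env/bin/activate    # activate it, prefixing PATH with env/bin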

Docker

A virtualenv only encapsulates Python dependencies. A Docker container encapsulates an entire operating system (OS). With a Python virtualenv, you can easily switch between Python versions and dependencies, but you’re stuck with your host OS. With a Docker image, you can swap out the entire OS: install and run Python on Ubuntu, Debian, Alpine, even Windows Server Core. There are Docker images out there with every combination of OS and Python version you can think of, ready to pull down and use on any system with Docker installed.
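
For example, assuming Docker is installed, a command along the lines of

docker run --rm -it python:3.12-slim python

pulls an official Python image (a slim Debian-based one here) and drops you into its interpreter, entirely independent of the host’s Python installation.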

Tools such as Docker are excellent for testing software packages across operating systems and hardware. For software development, tools such as Anaconda are generally more convenient.

What is Anaconda?

An open-source distribution of Python that simplifies package management. It comes with applications such as Jupyter Notebook, the Conda environment manager, and pip for package installation and management.

Anaconda also comes with hundreds of Python packages such as matplotlib, NumPy, SymPy and so forth.

It eliminates the need to install common applications separately and will (generally) make installing Python on your computer much easier.

Note that a large range of helpful Anaconda tutorials can be found online.

To learn more about the usage of Anaconda, let us together work through an exercise.

For this we will fork, then clone and play around with a dummy package made for this purpose.

Go to the address https://github.com/rhodrin/environments_acse1 and click on the fork button, as shown in the image below (make sure you’re logged into your GitHub account before doing this):

Then clone the forked package (make sure you’re in an appropriate folder before performing the clone). In a terminal, type

git clone https://github.com/<my github name>/environments_acse1.git

and then check out v1 of the package via

git checkout tags/v1 -b v1

The package can also be cloned via the Visual Studio Code GUI.

In the base folder, notice the presence of both an environment.yml file and a requirements.txt file. The environment.yml file defines the Anaconda virtual environment we wish to create. If we look at its contents, we see (importantly) name: envtest, specifying the name of the environment we’re going to create, and some dependencies. We’ll talk more about them later, but for now let us create a ‘conda’ environment. In the cloned directory, type

conda env create -f environment.yml

When that command is complete, type

conda activate envtest

and following this (making sure that your terminal prompt has now been modified such that (envtest) appears) type

pip install -e .

to install the envtest package (recall that the operations performed by this command are governed by the contents of setup.py). Once that is done, let us view the contents of requirements.txt. Currently, we see that only one dependency is listed, that of numpy (version > 1.16). Let’s install this dependency via

pip install -r requirements.txt

Now, to check everything is correctly set up, from within the base directory run

python scripts/random_number_array.py

You should see output along the lines of

[[0.22330655 0.07439368 0.69014812]
 [0.90354345 0.06734495 0.13096386]
 [0.22487417 0.6394524  0.41603555]]

(noting that the actual numbers you see will differ, since the routine generates a 3x3 array of random numbers between 0 and 1).

Also, let’s now look at the result of echo $PATH (or echo %PATH%) again. Notice the modified value within our environment.

Now, let’s add a few further functions, dependencies and scripts to our repository.

In the file envtest/builtins.py:

  • add the following two lines right after the existing import

from scipy.ndimage import gaussian_filter
from scipy import misc

  • Modify __all__ = ['rand_array'] to __all__ = ['rand_array', 'smooth_image']

  • Add the following function:

def smooth_image(a, sigma=1):
    return gaussian_filter(a, sigma=sigma)

Then, modify the file scripts/smooth_image.py so that it reads (i.e. remove any existing text):

from envtest import smooth_image

from scipy import misc
import matplotlib.pyplot as plt


image = misc.ascent()
sigma = 5

smoothed_image = smooth_image(image, sigma)

f = plt.figure()
f.add_subplot(1, 2, 1)
plt.imshow(image)
f.add_subplot(1, 2, 2)
plt.imshow(smoothed_image)
plt.show(block=True)

Currently, if we try running this script, e.g. python scripts/smooth_image.py, we’ll see an error of the following form:

Traceback (most recent call last):
  File "smooth_image.py", line 1, in <module>
    from envtest import smooth_image
  File "/data/programs/environments_acse1/envtest/__init__.py", line 1, in <module>
    from envtest.builtins import *
  File "/data/programs/environments_acse1/envtest/builtins.py", line 2, in <module>
    from scipy.ndimage import gaussian_filter
ModuleNotFoundError: No module named 'scipy'

This is of course because we have not yet installed the ‘new’ required dependencies. These are scipy and matplotlib, so let’s add them to our requirements.txt file and install them. That is, modify requirements.txt so that it now reads:

numpy>1.16
scipy
matplotlib

and then type

pip install -r requirements.txt

again. (Note that a pip install scipy etc. would also do the job, but since we want to keep our requirements file up to date, it doesn’t hurt to install from it directly.)

Following this, running the script should produce a plot with the original image on the left and the smoothed image on the right.

Let’s go through this exercise once more. To builtins.py add the following function:

def my_mat_solve(A, b):
    # Solve the linear system A*x = b symbolically via the inverse of A
    # (A and b are sympy matrices)
    return A.inv()*b

Remember that we need to make this function visible within the package and hence must modify the __all__ = [...] line appropriately.
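
Following the pattern from the previous step, the line should now read:

__all__ = ['rand_array', 'smooth_image', 'my_mat_solve']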

Then, let’s add a new script to make use of it. In the scripts folder, create a new file called solve_matrix_equation.py and within it paste the following text

from envtest import my_mat_solve

from sympy.matrices import Matrix, MatrixSymbol

# Call function to solve the linear equation A*x=b symbolically

A = Matrix([[2, 1, 3], [4, 7, 1], [2, 6, 8]])
b = Matrix(MatrixSymbol('b', 3, 1))
x = my_mat_solve(A, b)

print(x)

Our new dependency is SymPy. Hence, let’s also add that to requirements.txt (simply add sympy to the end of the file) and then repeat the install command we used previously.

To check that the newly added script runs successfully, execute it; we should see the following output:

Matrix([[b[0, 0]/2 + b[1, 0]/10 - b[2, 0]/5], [-3*b[0, 0]/10 + b[1, 0]/10 + b[2, 0]/10], [b[0, 0]/10 - b[1, 0]/10 + b[2, 0]/10]])

Exercise: Finalizing our repository

We’ll now update our environment.yml file and then rebuild our environment to ensure that within our updated environment everything works correctly ‘out of the box’.

  • First, commit the changes we’ve made via

git commit -a -m "<my commit message>"

  • Following this, let’s check out the master branch (note that the changes we’ve made above have simply brought us up to date with the master branch)

git checkout master

  • Then, add a further function of your choice to builtins.py and an accompanying script to utilize this function. Ensure that this new function requires at least one new additional dependency (remember to modify requirements.txt etc. appropriately). If you’re not sure what new package to use, how about a quick Pandas example? A sketch is given after this list. (You’ll learn more about Pandas later in this course.)

  • When this is done, and you’ve confirmed that the new script is working as intended, modify the environment.yml file and add all new dependencies to it. That is, it should now look along the lines of the following, but with your additional dependencies also added:

name: envtest
channels:
  - defaults
  - conda-forge
dependencies:
  - python>=3.6
  - pip
  - numpy>1.16
  - scipy
  - matplotlib
  - sympy

  • When this is done, commit all these changes to the repository, remembering to add any new files first, e.g.

git add scripts/my_new_script.py

followed by

git commit -a -m "<my commit message>"

  • Push these changes to GitHub

git push
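
If you went with the Pandas suggestion above, a minimal (hypothetical) sketch of the kind of function you might add to builtins.py is:

import pandas as pd

def describe_array(a):
    # Return summary statistics (count, mean, std, quartiles, ...) for each
    # column of a 2D array by wrapping it in a pandas DataFrame
    return pd.DataFrame(a).describe()

with pandas appended to requirements.txt, the new function added to __all__, and a short script in scripts/ that calls it.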

IMPORTANT: Next, ensure your GitHub repository has updated correctly; you can check this by viewing some of the files in your web browser.

Now, as a test, we’ll deactivate and delete our environment and remake it using our updated environment.yml file.

The required commands are as follows:

  • conda deactivate

  • conda remove --name envtest --all

  • conda env create -f environment.yml

Then, once the environment has been created, activate it again via conda activate envtest. Note that if we now look at pip list, we will see that the full list of required packages, along with their dependencies, has already been installed.

(NOTE: In practice we’d generally create a new environment with a different name to test everything is working, but since this is a ‘dummy’ package and to avoid ‘clutter’ we’ll do it this way for the time being).

As a final note, think about why it is important to have both environment.yml and requirements.txt files.

  • The environment.yml file was used only when creating our environment. Remember that it was useful to have the requirements.txt file to install the required packages while developing our package. (Although we could also have continuously updated our environment file and then updated our environment via conda env update -f environment.yml.)

  • In any case, we generally want both for the benefit of people making use of our package who are not using Anaconda. As we saw earlier, we could use such a requirements file with venv.

  • Additionally, what if we want some packages to not be installed automatically when creating our environment? For various reasons, we may wish to have, e.g., a requirements-optional.txt file present (generally the packages listed in environment.yml and requirements.txt should be in sync unless there’s a good reason for them not to be). Any such optional requirements can again be installed via the pip install -r command (here, pip install -r requirements-optional.txt).