[Numpy-discussion] Funded work on Numpy: proposed improvements and request for feedback
Mon Aug 3 21:42:17 CDT 2009
I (David Cournapeau) and the people at Berkeley (Jarrod Millman,
Fernando Perez, Matthew Brett) have been discussing some funded work
that I would do on NumPy/SciPy. Although they are obviously
interested in improvements that help their own projects, they want to
make sure the work benefits numpy/scipy as a whole. As such, we would
like to get some feedback on the proposal.
We discussed several areas, but the main 'vision' is to make more of
the C code in numpy reusable by third parties, in particular purely
computational code (fft, linear algebra, etc.). A first draft of the
proposal is pasted below.
Comments, requests for details, and objections are welcome.
Thank you for your attention,
The Berkeley team, Gael Varoquaux and David Cournapeau
Proposal for improvements to numpy
NumPy is a solid foundation for efficient numerical computation with the python
programming language. It consists of a set of extensions adding a powerful
multi-dimensional array object. SciPy is built upon NumPy to add higher-level
functionality such as numerical integration, linear algebra, statistical
functions, etc. Although the numpy codebase is mature and can be reused both at
the C and Python levels, some limitations in the numpy codebase prevent parts
of its functionality from being reused by third parties. Users of numpy then
either need to reimplement that functionality or resort to workarounds. The
main goal of this proposal is to improve numpy so as to remove those
limitations in a general manner.
Reusable C libraries
A lot of NumPy and SciPy code is written in a compiled language (mostly C and
Fortran). For computational code, it is generally advisable to split it into a
purely computational part and a wrapping part that marshals python
objects/structures back and forth into basic C types. For example, when
computing the exponential of the items in an array, most of NumPy's job is to
extract the data from the array into one of the basic C types (int, double,
etc.), call the C function exp, and marshal the data back into python objects.
Keeping the marshalling and the purely computational code separate has several
advantages:
1. The code is easier to follow
2. The purely computational code could be reused by third parties. For example,
   even for simple C math functions, there is vast variation in
   platform/toolchain support. NumPy makes sure that functions handling
   special float values (NaN, Inf, etc.) work on every supported platform, in
   a consistent manner. Making those functions available to third parties would
   enable developers to reuse these portable functions, and stay consistent even
   on platforms they don't care about.
3. Working on optimizing the computational code is easier.
4. It would enable easier replacement of the purely computational code at
runtime. For example, one could imagine loading SSE-enabled code if the CPU
supports SSE extensions.
5. It would also help with the py3k port, as only the marshalling code would
   need to change.
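The pattern described above can be sketched in Python. This is a hedged illustration of the split, not numpy's actual implementation (in numpy proper the marshalling layer is C code unpacking PyArrayObject data):

```python
import math

# Purely computational "kernel": operates only on basic values
# (here, a list of floats) and knows nothing about array objects.
# This is the part that would be reusable by third parties.
def _exp_kernel(values):
    return [math.exp(v) for v in values]

# Marshalling layer: extracts raw data from a richer container,
# calls the kernel, and repackages the result.
def exp(array_like):
    data = [float(v) for v in array_like]  # extract to basic types
    result = _exp_kernel(data)             # pure computation
    return list(result)                    # repackage for the caller

print(exp([0.0, 1.0]))
```

Because `_exp_kernel` never touches array objects, it could be swapped out (e.g. for an SSE-enabled version) or reused elsewhere without touching the wrapper.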
To make purely computational code available to third parties, two things are needed:
1. the code itself needs to make the split explicit.
2. there needs to be support so that reusing those functionalities is as
painless as possible, from a build point of view (Note: this is almost
done in the upcoming numpy 1.4.0 as long as static linking is OK).
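From the build point of view, the entry point numpy already exposes to third parties is the header directory returned by numpy.get_include(); the portable npy_ math functions mentioned below live in npy_math.h there. A small sketch of locating it (the exact header layout is an assumption about the installed numpy):

```python
import os
import numpy as np

# Directory a third-party extension adds to its include path.
include_dir = np.get_include()

# npy_math.h declares the portable npy_ math functions; its exact
# location under the include directory is assumed here.
header = os.path.join(include_dir, "numpy", "npy_math.h")

print(include_dir)
print(os.path.exists(header))
```

A third-party setup script would pass `include_dir` to its compiler; static linking against the npymath library is what numpy 1.4.0 aims to support.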
Splitting the code
The amount of work is directly proportional to the amount of functions to be
made available. The most obvious candidates are:
1. C99 math functions: a lot of this has already been done. In particular, math
   constants and special-value support are already implemented. Almost every
   real function in numpy has a portable npy_ implementation in C.
2. C99-like complex support: this naturally extends the previous series. The
   main difficulty is supporting platforms without C99 complex support, and
   providing the corresponding C99 complex functions there.
3. FFT code: there is no support for reusing the FFT code at the C level at the
   moment.
4. Random: no such support either.
5. Linalg: idem.
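The value of the portable special-value handling in item 1 is visible even from Python: numpy guarantees consistent IEEE-754 behaviour regardless of how reliable the platform's own isnan/isinf are. A small demonstration:

```python
import numpy as np

# Special float values handled consistently across platforms by
# numpy's portable C layer.
x = np.array([np.inf, -np.inf, np.nan, 0.0])

print(np.isinf(x))  # detects both infinities
print(np.isnan(x))  # NaN detection, independent of the platform C library

# Overflow produces inf rather than platform-dependent behaviour.
with np.errstate(over="ignore"):
    big = np.exp(np.array([1000.0]))
print(big)
```

Exposing the underlying npy_ C functions would give extension authors the same guarantees in their own C code.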
Once the code itself is split, some support is needed so that the code can be
reused by third parties. The following issues need to be solved:
1. Compiling the computational code into a shared or static library
2. Once built, making the libraries available to third parties (distutils
   issues). Ideally, it should work for installed builds, in-place builds, etc.
3. Versioning and ABI/API compatibility issues
Iterators
When dealing with multi-dimensional arrays, the best abstraction for handling
indexing in a dimension-independent way is the iterator. NumPy already has
some iterators to walk over every item of an array, or over every item along
all axes but one. More general iterators are useful for more complicated
cases, where one needs to walk over only a subset of the items of the array.
For example, image processing often requires walking over a neighborhood of
each item. Boundary conditions can be handled automatically, so that padding
is transparent to the user. More elaborate iterators, e.g. with a mask (for
morphological image processing), can be considered as well.
Several packages in scipy implement such iterators themselves (ndimage) or
handle boundary conditions manually in the algorithmic code (scipy.signal).
Implementing these iterators in numpy would enable better code reuse.
A neighborhood iterator is already available in numpy. It can handle zero,
one, constant, and mirror padding. Speed improvements can be considered, in
particular by splitting the areas which need boundary handling from the ones
which do not.
A masked neighborhood iterator is not available. ITK is one toolkit which
implements such an iterator.
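To illustrate what a neighborhood iterator provides, here is a pure-Python sketch for the 1-D constant-padding case. This is only a model of the behaviour, not numpy's C-level iterator, and the function name is made up for the example:

```python
import numpy as np

def iter_neighborhoods(a, radius, fill=0.0):
    """Yield (index, neighborhood) pairs for a 1-D array, padding out-of-
    bounds positions with a constant value (the 'constant' padding mode).
    Pure-Python sketch of a neighborhood iterator."""
    n = len(a)
    for i in range(n):
        window = []
        for j in range(i - radius, i + radius + 1):
            # Boundary handling: substitute the fill value outside the array.
            window.append(a[j] if 0 <= j < n else fill)
        yield i, np.array(window)

a = np.array([1.0, 2.0, 3.0])
for i, w in iter_neighborhoods(a, radius=1):
    print(i, w)
```

The algorithmic code consuming such an iterator never sees the boundary logic, which is exactly the code-reuse benefit argued for above; a mask parameter restricting which neighbors are visited would extend this to the morphological use case.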
C code coverage and static numpy linking
The NumPy community has focused a lot on improving the test suite. We went
from a few hundred unit tests in 2006 to more than 2000 unit tests for numpy
1.4. Although code coverage at the python level is relatively easy to obtain
using nose plugins, C code coverage is not possible ATM.
The traditional GNU tools here are gcov for code coverage and gprof for
profiling. Unfortunately, gprof cannot profile dynamically loaded code, as is
the case for python extensions. One solution is thus to statically link numpy
into the python interpreter. This poses challenges at both the build and code
levels.
Some preliminary work showed that the approach works, but a solution that
could be integrated upstream, making numpy easy to link statically into the
python interpreter, would be better.
Also, some people have expressed interest in distributing a python interpreter
with numpy statically linked (e.g. for easy distribution).