[Numpy-discussion] Funded work on Numpy: proposed improvements and request for feedback

Charles R Harris charlesr.harris@gmail....
Mon Aug 3 23:23:47 CDT 2009


On Mon, Aug 3, 2009 at 8:42 PM, David Cournapeau <
david@ar.media.kyoto-u.ac.jp> wrote:

> Hi All,
>
>    I (David Cournapeau) and the people at Berkeley (Jarrod Millman,
> Fernando Perez, Matthew Brett) have been in discussion so that I could
> do some funded work on NumPy/SciPy. Although they are obviously
> interested in improvements that help their own projects, they are
> willing to make sure the work will impact numpy/scipy as a whole. As
> such we would like to get some feedback about the proposal.
>
> There are several areas we discussed about, but the main 'vision' is to
> make more of the C code in numpy reusable to 3rd parties, in particular
> purely computational (fft, linear algebra, etc...) code. A first draft
> of the proposal is pasted below.
>
> Comments, request for details, objections are welcomed,
>
> Thank you for your attention,
>
> The Berkeley team, Gael Varoquaux and David Cournapeau
>
> ==================================
> Proposal for improvements to numpy
> ==================================
>
> NumPy is a solid foundation for efficient numerical computation with the
> python
> programming language. It consists in a set of extensions to add a powerful
> multi-dimensional array object. SciPy is built upon NumPy to add more high
> level functionalities such as numerical integration, linear algebra,
> statistical functions, etc\.\.\. Although the numpy codebase is mature, and
> can be
> reused both at the C and Python levels, there are some limitations in the
> numpy
> codebase which prevent some functionalities from being reused by third
> parties.
> This means that users of numpy either need to reimplement the
> functionalities,
> or to use workarounds. The main goal of this proposal is to improve numpy
> to circumvent
> those limitations in general manner.
>
> Reusable C libraries
> ====================
>
> A lot of NumPy and SciPy code is in a compiled language (mostly C and
> Fortran).
> For computational code, it is generally advisable to split it into a purely
> computational code and a wrapping part, marshalling back and forth python
> objects/structures into basic C types. For example, when computing the
> exponential of the items in an array, most of NumPy's job is to extract the
> data from the array into one of the basic C type (int, double, etc...),
> call
> the C function exp, and marshall the data back into python objects. Making
> the
> marshalling and purely computational code separate has several advantages:
>
> 1. The code is easier to follow
> 2. The purely computational code could be reused by third parties. For
> example,
>   even for simple C math functions, there is a vast difference in
>   platform/toolchains support. NumPy makes sure that functions to handle
>   special float values (NaN, Inf, etc...) work on every supported platform,
> in
>   a consistent manner. Making those functions available to third parties
> would
>   enable developers to reuse this portable functions, and stay consistent
> even
>   on platforms they don't care about.
> 3. Working on optimizing the computational code is easier.
> 4. It would enable easier replacement of the purely computational code at
>   runtime. For example, one could imagine loading SSE-enabled code if the
> CPU
>   supports SSE extensions.
> 5. It would also helps for py3k porting, as only the marshalling code would
>   need to change
>
> To make purely computational code available to third parties, two things
> are
> needed:
>
> 1. the code itself needs to make the split explicit.
> 2. there needs to be support so that reusing those functionalities is as
>   painless as possible, from a build point of view (Note: this is almost
>   done in the upcoming numpy 1.4.0 as long as static linking is OK).
>

Ah, it itches. This is certainly a worthy goal, but are there third parties
who have expressed an interest in this? I mean, besides trying to avoid
duplicate bits of code in Scipy.


>
> Splitting the code
> ------------------
>
> The amount of work is directly proportional to the amount of functions to
> be
> made available. The most obvious candidates are:
>
> 1. C99 math functions: a lot of this has already been done. In particular
> math
>   constants, and special values support is already implemented. Almost
> every
>   real function in numpy has a portable npy\_ implementation in C.
> 2. C99-like complex support: this naturally extends the previous series.
>  The
>   main difficult is to support platforms without C99 complex support, and
> the
>   corresponding C99 complex functions.
> 3. FFT code: there is no support to reuse FFT at the C level at the moment.
> 4. Random: there is no support either
> 5. Linalg: idem.
>

This is good. I think it should go along with code reorganization. The files
are now broken up but I am not convinced that everything is yet where it
should be.

The complex support could be a major effort in its own right if we need to
rewrite all the current functions. That said, it would be nice if the
complex support was separated out like the current real support.  Test to go
along with it would be helpful. This also ties in with having build support
for many platforms.



> Build support
> -------------
>
> Once the code itself is split, there needs some support so that the code
> can be
> reused by third-parties. The following issues need to be solved:
>
> 1. Compile the computational code into shared or static library
> 2. Once built, making the libraries available to third parties (distutils
>   issues). Ideally, it should work in installed, in-place builds, etc\.\.\.
>   situations.
> 3. Versioning, ABI/API compatibility issues
>
>
Trying to break out the build support itself might be useful. It would be
good to get some feedback here from other projects that might be interested.
But this is a wheel that probably gets reinvented on a regular basis.


>
> Iterators
> =========
>
> When dealing with multi-dimensional arrays, the best abstraction to deal
> with
> indexing in a dimension-independent way is to use iterator. NumPy already
> has
> some iterators to walk into every item of an array, or every item but one
> axis.
> More general iterators are useful for more complicated cases, when one
> needs to
> walk into a subset of every item of the array. For example, for image
> processing, it is often necessary to walk in a neighborhood of an array.
> Boundaries conditions can be handled automatically, so that padding is
> transparant to the user. More elaborate iterators, e.g. with a mask (for
> morphological image processing) can be considered as well.
>
> Several packages in scipy implement those iterators (ndimage), or handle
> boundaries conditions manually in the algorithmic code (scipy.signal).
> Implementing
> iterators in numpy would enable better code reuse.
>
> Possible iterators
> ------------------
>
> A neighborhood iterator is already available in numpy. It can handles zero,
> one, constant, and mirror padding. Potential improvements can be considered
> from a speed POV, in particular by splitting areas which neeed boundaries
> handling from the ones which do not.
>
> A masked neighborhood iterator is not available. Itk is one toolkit which
> implemented such an iterator.
>
>
I think this needs some thought. This would essentially be a c library of
iterator code. C++ is probably an easier language for such things as it
handles the classes and inlining automatically. Which is to say if I had to
deal with a lot of iterators I might choose a different language for
implementation.


>
> C code coverage and static numpy linking
> ========================================
>
> NumPy community has focused a lot on improving the test suite. We went from
> a
> few hundred of unit tests in 2006 to more than 2000 unit tests for numpy
> 1.4.
> Although code coverage at the python level is relatively easy to obtain
> using
> some nose plugins, C code coverage is not possible ATM.
>
> The traditional code coverage tool for C code is gprof, the GNU profiler.
> Unfortunately, gprof cannot profile code which is dynamically linked, as is
> the
> case for python extensions. One solution is thus to statically link numpy
> to
> the python interpreter. This poses challenges both as build and code
> levels.
> Some preliminary work showed that the approach works, but something which
> could
> be integrated upstream, and make numpy easily linkable to the python
> interpreter would be better.
>
> Also, some people have expressed interest in distributing a python
> interpreter
> with numpy statically linked (e.g. for easy distribution).
>

I don't have an opinion here. As a side issue, it would be nice to have some
infrastructure for documenting the c code. That way after I have worked my
way through one of the numpy functions I could document it so that I
wouldn't have to repeat the whole process at some later date.

As to choosing a project, you should pick one that really interests you. How
would you rank your own interest in these various proposals?

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20090803/6d2d9e25/attachment.html 


More information about the NumPy-Discussion mailing list