[SciPy-Dev] GSoC -- numpy interactions with sparse matrices

Pauli Virtanen pav@iki...
Tue Jul 2 15:15:18 CDT 2013


(cc'd the Numpy list --- the discussion is about making np.multiply(A, B) 
et al. do sensible rather than nonsensical things if A or B is a sparse 
matrix)


Hi Blake,

02.07.2013 06:40, Blake Griffith wrote:
> Today marks the first day of the next phase of my GSoC proposal, numpy
> interactions and overriding ufuncs for sparse matrices.
>
> I've been thinking about how I will handle the ufunc interactions. I'm
> still considering how to implement a ufunc override mechanism. However I
> think the existing functionality in numpy, (__array_prepare__,
> __array_finalize__, etc) would be enough for a few cases:

I think using __array_prepare__ and __array_wrap__ et al. for this will 
run into difficulties:

- array_prepare requires that the strides and dimensions stay the same
   as in the "original array".

- Sparse matrices are not subclasses of ndarrays, so I think they are
   considered 0-dim (or NxN?) object arrays by the ufuncs.

   This then wreaks all sorts of havoc.

The issue is sort of that the whole wrapping mechanism is written under 
the assumption that the data is still contained in an ndarray stored in 
memory in a strided layout with the correct shape etc., which is not 
true for sparse matrices.

You can perhaps check the above for yourself, in case I'm mistaken. 
However, as far as I know, the facilities in Numpy don't currently make 
it possible to get sensible behavior out.
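One quick way to check the 0-dim object array point for yourself 
(FakeSparse below is just a hypothetical stand-in for a scipy.sparse 
matrix, i.e. any object that is not an ndarray subclass and does not 
define __array__):

```python
import numpy as np

# Hypothetical stand-in for a sparse matrix: not an ndarray subclass,
# no __array__ method.
class FakeSparse:
    pass

# asarray (which the ufunc machinery applies to unknown operands) wraps
# the object into a 0-dim array of dtype object.
a = np.asarray(FakeSparse())
print(a.shape, a.dtype)
```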

     ***

One option is to try to extend array_prepare so that it allows returning 
arbitrary arrays.

I think this approach has a number of problems:

- It leads to an unnatural way to write the code --- you are trying to
   trick the chosen ufunc to do what you want, rather than coding up
   straightforwardly the operation that should give the result.

- It seems like a somewhat nasty job: the ufunc loop selection
   (which IIRC can do stuff like optimizing iteration order etc)
   runs *before* array_prepare, so you cannot change the memory layout
   any more...

So I think extending this mechanism does not sound very promising...

     ***

The simplest approach seems to me to add a way to completely override 
the ufuncs immediately on call.

The logic could be along these lines --- on entry to a ufunc:

- if an input has __array_priority__, check if it also has
   __ufunc_override__

- if some of the inputs have __ufunc_override__, choose the one with the
   maximum __array_priority__

- call it, passing it the arguments of the ufunc call, and any keywords
   (in a dict?)

- if it returns NotImplemented, try the next-highest priority one
   (or just give up)
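A pure-Python sketch of that lookup logic (the __ufunc_override__ name 
is of course only the proposed attribute, not something Numpy defines 
today):

```python
import numpy as np

def dispatch_ufunc(ufunc, *args, **kwargs):
    # Inputs that advertise an override, highest __array_priority__
    # first.  Plain ndarrays have __array_priority__ but not
    # __ufunc_override__, so they never end up in this list.
    overriders = sorted(
        (a for a in args
         if hasattr(a, '__array_priority__')
         and hasattr(a, '__ufunc_override__')),
        key=lambda a: a.__array_priority__,
        reverse=True)
    for obj in overriders:
        # Pass the ufunc itself plus the original arguments and keywords.
        result = obj.__ufunc_override__(ufunc, *args, **kwargs)
        if result is not NotImplemented:
            return result
    # Nobody claimed the call: fall back to normal ufunc behavior.
    return ufunc(*args, **kwargs)
```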

It's perhaps easiest to prototype the solution first in pure Python, 
following Nathaniel's example:

     https://github.com/njsmith/numpyNEP/blob/master/numpyNEP.py

I.e., monkeypatch all ufuncs in Numpy, and insert them also into the 
numerical operations using set_numeric_ops. This should allow you to try 
out different solutions before diving deeply into the Numpy source code.
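For a single ufunc, such a monkeypatch prototype could look roughly like 
this (again, __ufunc_override__ is the proposed hook, not an existing 
Numpy attribute):

```python
import numpy as np

_np_multiply = np.multiply  # keep a reference to the original ufunc

def multiply(a, b, **kwargs):
    # Offer the call to any operand providing the proposed override hook.
    for operand in (a, b):
        override = getattr(operand, '__ufunc_override__', None)
        if override is not None:
            result = override(_np_multiply, a, b, **kwargs)
            if result is not NotImplemented:
                return result
    # No operand claimed the call: use the original ufunc.
    return _np_multiply(a, b, **kwargs)

# Replace the module-level name; a full prototype would also route the
# '*' operator through this via np.set_numeric_ops(multiply=multiply).
np.multiply = multiply
```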

Another thing to consider: does this add significantly to the runtime 
cost? I think not, because the ufunc machinery makes many checks for 
__array_priority__, so adding one more attribute check will probably not 
significantly hurt performance. But I think this needs to be measured at 
the end.
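The extra cost is essentially one attribute lookup per operand, which 
can be put in scale against a small ufunc call with something like:

```python
import timeit

# Cost of one failing attribute lookup (the common case: no override).
t_lookup = timeit.timeit("getattr(x, '__ufunc_override__', None)",
                         setup="x = 1.0", number=100000)
# Cost of a small ufunc call, for comparison.
t_ufunc = timeit.timeit("np.multiply(a, a)",
                        setup="import numpy as np; a = np.ones(10)",
                        number=100000)
print(t_lookup, t_ufunc)
```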

So I would suggest the following starting procedure: re-check whether 
the array_prepare/array_wrap machinery can be used (I think not), and 
then work out a spec for the ufunc override. Here, prototyping will 
help, so I suggest you pick a simple scheme for how the override should 
work (e.g. as above), and then try to use it in scipy.sparse with a 
pure-Python prototype monkeypatch implementation. This should then show 
how workable the scheme is, and whether we need to consider some other 
option.

The drawback here is that this would add yet another knob for the ufunc 
vs. custom classes interaction for the users to fiddle with. On the 
other hand, it seems that currently there is no way to do it at all.

     ***

Option C would be to add __array__ to sparse matrices, which would make 
np.multiply and other routines simply cast the sparse matrices to dense 
ones. Or perhaps __array__ should raise an error?

This can also be considered if the ufunc override ends up being too radical.
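A minimal sketch of what option C would mean in practice (DemoSparse is 
a hypothetical stand-in; a real sparse matrix would return 
self.toarray() from __array__):

```python
import numpy as np

class DemoSparse:
    def __init__(self, data):
        self._data = np.asarray(data)
    def __array__(self, dtype=None, copy=None):
        # A real sparse matrix would densify here, e.g. self.toarray().
        return self._data if dtype is None else self._data.astype(dtype)

s = DemoSparse([[1, 0], [0, 2]])
# The ufunc casts the operand to dense via __array__, so the result is
# an ordinary (dense) ndarray -- possibly a silent memory blow-up for
# a large sparse matrix.
print(np.multiply(s, 2))
```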

-- 
Pauli Virtanen


