[SciPy-dev] Ideas for scipy.sparse?

Viral Shah vshah@interactivesupercomputing....
Mon Apr 14 13:36:31 CDT 2008

Hi Brian, and others,

I have been following this thread for the past few days, and it's been  
quite interesting. Some of my comments are interspersed below:
> So, I am currently implementing a distributed memory array package  
> for python:
> http://projects.scipy.org/ipython/ipython/browser/ipythondistarray
> The goal is to have distributed/parallel arrays that look and feel
> just like numpy arrays.  Here is an example:
> import ipythondistarray as ipda
> a = ipda.random.rand((10,100,100), dist=(None,'b','c'))
> b = ipda.random.rand((10,100,100), dist=(None,'b','c'))
> c = 0.5*ipda.sin(a) + 0.5*ipda.cos(b)
> print c.sum(), c.mean(), c.std(), c.var()
> This works today on multiple processors.  Don't get too excited
> though, there is still _tons_ of work to be done....
This looks really exciting. I looked at the project, and it seems like  
you are using MPI as the basic infrastructure - but I also see  
references to GASNet. My first question looking at this was: why not  
just use Global Arrays for the basic array infrastructure? It has a  
bunch of features that would let you get quite far quickly.


I believe, looking at the project organization, that you plan to add  
fft, sparse arrays, the works.

I must mention that I work with Star-P (www.interactivesupercomputing.com 
), and we have plugged in our parallel runtime to overload parts of  
numpy and scipy. We have sparse support for M (matlab), but not for  
python yet.

I am guessing that your framework also allows one to run several  
instances of python scripts on each node? That way, one could solve  
several ODEs across nodes. You'd perhaps need some dynamic scheduling  
for workload balancing.
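The dynamic scheduling I have in mind could be sketched roughly as follows. This is a toy illustration only: it uses a thread pool within one process as a stand-in for Python instances on separate nodes, and `euler_decay` is a made-up example problem, not anything from the distarray project.

```python
from concurrent.futures import ThreadPoolExecutor

def euler_decay(k, y0=1.0, dt=1e-3, steps=1000):
    """Forward-Euler integration of dy/dt = -k*y: one independent ODE."""
    y = y0
    for _ in range(steps):
        y += dt * (-k * y)
    return y

# A pool hands the next pending problem to whichever worker frees up
# first -- simple dynamic load balancing. In a distributed setting,
# each worker would instead be a python instance running on a node.
ks = [0.5, 1.0, 2.0, 4.0]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(euler_decay, ks))
```

Each ODE here is independent, so the only coordination needed is handing out work; that is what makes this kind of problem a good fit for pool-style scheduling.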

Have you thought about integrating this with the stackless python  
project, for a nice programming model that abstracts away from MPI and  
allows high-performance parallel programs to be written entirely in  
python?
> Here is the main issue:  I am running into a need for sparse arrays.
> There are two places I am running into this:
> 1) I want to implement sparse distributed arrays and need sparse local
> arrays for this.
> 2) There are other places in the implementation where sparse arrays  
> are needed.
> Obviously, my first though was scipy.sparse.  I am _really_ excited
> about the massive improvements that have been happening in this area
> recently.  Here are the problems I am running into:
Not having followed the numpy/scipy debate for very long, I must say  
that I was a little surprised to find sparse array support in scipy  
rather than numpy.
> 1) I need N-dimensional sparse arrays.  Some of the storage formats in
> scipy.sparse (dok, coo, maybe lil) could be generalized to
> N-dimensions, but some work would have to be done.
 From the discussion, it seemed to me that these formats are written  
in python itself. In general, I like the idea of writing data  
structures in the same language, keeping as much as possible in  
higher-level languages.

Some people pointed out that this tends to be slow when used in for  
loops. I couldn't figure out what the cause of this was. Is it that  
for loops in python are generally slow, or that indexing individual  
elements in sparse data structures is slow, or that python's data  
structures are slow, or some combination of all three?

I can say something about Matlab's sparse matrices, which I am  
familiar with. It's not that Matlab's for loops are the cause of the  
slowness (they may be slow), but that indexing single elements in a  
sparse matrix is inherently slow. Every instance of A(i, j) has to  
perform a binary search. Insertion is even more expensive because it  
requires a memmove for the CSC/CSR formats. Thus you almost never work  
with for loops and single-element indexing for sparse matrices in  
Matlab. That said, one could design a data structure that partially  
solves some of these problems.
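To make the cost concrete, here is a toy sketch of CSR-style storage (illustrative only, not scipy's or Matlab's actual implementation) showing why every single-element read is a binary search over the row's nonzeros:

```python
from bisect import bisect_left

# Minimal CSR-style storage for a 3x3 matrix:
# indptr[i]:indptr[i+1] delimits row i; indices holds the column ids,
# sorted within each row; data holds the corresponding nonzero values.
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data    = [10.0, 20.0, 30.0, 40.0, 50.0]

def getelem(i, j):
    """Single-element lookup A(i, j): a binary search within row i."""
    lo, hi = indptr[i], indptr[i + 1]
    k = bisect_left(indices, j, lo, hi)
    if k < hi and indices[k] == j:
        return data[k]
    return 0.0
```

Every read pays an O(log nnz-per-row) search plus interpreter overhead, and inserting a new nonzero would shift every later entry of `indices` and `data` - the memmove mentioned above.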

I thought initially that dok might solve this for some kinds of sparse  
matrix problems, but it seems that it's too slow for large problems.  
Perhaps a little benchmark should be set up.
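A benchmark along those lines could start from something like the sketch below. Since a dok format is essentially a dictionary keyed by (i, j), a plain dict stands in for it here; the sizes and names are made up for illustration:

```python
import random
import time

def build_dok(n, nnz, seed=0):
    """dok-style storage: a plain dict keyed by (i, j) tuples."""
    rng = random.Random(seed)
    d = {}
    for _ in range(nnz):
        d[(rng.randrange(n), rng.randrange(n))] = rng.random()
    return d

n, nnz = 1000, 50_000
t0 = time.perf_counter()
dok = build_dok(n, nnz)
t1 = time.perf_counter()
# Reads are O(1) hash lookups, but each one still pays Python-level
# overhead: tuple allocation, hashing, interpreter dispatch per element.
hits = sum(1 for key in dok if dok[key] > 0.5)
t2 = time.perf_counter()
print(f"insert: {t1 - t0:.3f}s  lookup: {t2 - t1:.3f}s  entries: {len(dok)}")
```

The per-element constant factor, not the asymptotics, is what I would expect such a benchmark to expose for large problems.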


For N-d, are you thinking of something along the lines of Tammy Kolda's  
tensor toolbox for Matlab? That could be accomplished with just  
distributed dense vectors - I'm not sure what the performance would be  
like. It's the same as using a distributed coo format.
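By an N-d coo format I mean something like the following sketch: parallel lists of coordinate tuples and values. The class name `NDCoo` and its interface are hypothetical, just to fix ideas:

```python
class NDCoo:
    """Hypothetical N-dimensional coo-style sparse array: a list of
    coordinate tuples (one index per axis) plus a parallel value list."""

    def __init__(self, shape):
        self.shape = shape
        self.coords = []   # e.g. [(1, 2, 3), ...] for a 3-d array
        self.values = []

    def __setitem__(self, idx, val):
        self.coords.append(idx)
        self.values.append(val)

    def __getitem__(self, idx):
        # Linear scan for clarity; a real implementation would keep the
        # coordinates sorted or hashed to avoid O(nnz) single-element reads.
        for c, v in zip(self.coords, self.values):
            if c == idx:
                return v
        return 0.0

    def sum(self):
        return float(sum(self.values))

a = NDCoo((10, 100, 100))
a[1, 2, 3] = 5.0
a[0, 0, 0] = 2.5
```

In the distributed setting, the coordinate and value lists would themselves be distributed dense vectors, which is why the two approaches come out the same.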


> 3) That we begin to move their implementation to using Cython (as an
> aside, cython does play very well with templated C++ code).  This
> could provide a much nicer way of tying into the c++ code than using
> swig.
Not being intimately familiar with cython or swig, I did take a look  
at the cython homepage, and it sounded like a good fit. It does seem  
that some people have strong objections to it. Support for templated  
C++ code would be nice to have, though!

