[SciPy-dev] Ideas for scipy.sparse?
Viral Shah
vshah@interactivesupercomputing....
Mon Apr 14 13:36:31 CDT 2008
Hi Brian, and others,
I have been following this thread for the past few days, and it's been
quite interesting. Some of my comments are interspersed below:
> So, I am currently implementing a distributed memory array package
> for python:
>
> http://projects.scipy.org/ipython/ipython/browser/ipythondistarray
>
> The goal is to have distributed/parallel arrays that look and feel
> just like numpy arrays. Here is an example:
>
> import ipythondistarray as ipda
>
> a = ipda.random.rand((10,100,100), dist=(None,'b','c'))
> b = ipda.random.rand((10,100,100), dist=(None,'b','c'))
> c = 0.5*ipda.sin(a) + 0.5*ipda.cos(b)
> print c.sum(), c.mean(), c.std(), c.var()
>
> This works today on multiple processors. Don't get too excited
> though, there is still _tons_ of work to be done....
This looks really exciting. I looked at the project, and it seems like
you are using MPI as the basic infrastructure - but I also see
references to GASNet. My first question on seeing this was: why not
just use Global Arrays for the basic array infrastructure? It has a
bunch of stuff that would let you get quite far quickly:
http://www.emsl.pnl.gov/docs/global/
I believe, looking at the project organization, that you plan to add
fft, sparse arrays, the works.
I should mention that I work on Star-P (www.interactivesupercomputing.com),
where we have plugged our parallel runtime in to overload parts of
numpy and scipy. We have sparse support for M (Matlab), but not for
Python yet.
I am guessing that your framework also allows one to run several
instances of Python scripts, one per node? That way, one could solve
several ODEs across nodes. You'd probably need some dynamic scheduling
for load balancing.
Have you thought about integrating this with the Stackless Python
project, for a programming model that abstracts away from MPI and
allows high-performance parallel programs to be written entirely in
Python?
> Here is the main issue: I am running into a need for sparse arrays.
> There are two places I am running into this:
>
> 1) I want to implement sparse distributed arrays and need sparse local
> arrays for this.
>
> 2) There are other places in the implementation where sparse arrays
> are needed.
>
> Obviously, my first though was scipy.sparse. I am _really_ excited
> about the massive improvements that have been happening in this area
> recently. Here are the problems I am running into:
Not having followed the numpy/scipy split for very long, I must say
that I was a little surprised to find sparse array support in scipy
rather than numpy.
> 1) I need N-dimensional sparse arrays. Some of the storage formats in
> scipy.sparse (dok, coo, maybe lil) could be generalized to
> N-dimensions, but some work would have to be done.
From the discussion, it seemed to me that these formats are written
in Python itself. In general, I like the idea of writing data
structures in the same language, and as much as possible in
higher-level languages. Some people pointed out that this tends to be
slow when used in for loops. I couldn't figure out what the cause of
this is. Is it that for loops in Python are generally slow, or that
indexing individual elements of sparse data structures is slow, or
that Python's data structures are slow, or some combination of all
three?
I can say something about Matlab's sparse matrices, which I am
familiar with. It's not that Matlab's for loops are the cause of the
slowness (they may well be slow), but that indexing a single element
of a sparse matrix is inherently slow. Every access A(i, j) has to
perform a binary search. Insertion is even more expensive, because it
requires a memmove for the CSC/CSR formats. Thus you almost never use
for loops and single-element indexing with sparse matrices in
Matlab. That said, one could design a data structure that partially
solves some of these problems.
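To make the cost concrete, here is a rough Python sketch (illustrative only, not scipy's or Matlab's actual implementation) of how a single-element lookup works in a CSR-like layout - a binary search over the stored column indices of one row:

```python
import numpy as np

# Minimal CSR-like lookup sketch (illustrative, not scipy's implementation).
# indptr[i]:indptr[i+1] delimits the stored entries of row i;
# indices holds the column index of each stored entry, data its value.
def csr_get(indptr, indices, data, i, j):
    start, stop = indptr[i], indptr[i + 1]
    # binary search within the row's sorted column indices
    k = start + np.searchsorted(indices[start:stop], j)
    if k < stop and indices[k] == j:
        return data[k]
    return 0.0

# 3x3 matrix with nonzeros at (0,1)=2, (1,0)=5, (2,2)=7
indptr  = np.array([0, 1, 2, 3])
indices = np.array([1, 0, 2])
data    = np.array([2.0, 5.0, 7.0])

assert csr_get(indptr, indices, data, 0, 1) == 2.0
assert csr_get(indptr, indices, data, 1, 1) == 0.0
```

Inserting a new nonzero is worse: every entry of indices and data after the insertion point must shift by one (the memmove), and every later indptr entry must be incremented.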
I thought initially that dok might solve this for some kinds of sparse
matrix problems, but it seems that it's too slow for large problems.
Perhaps a little benchmark should be set up.
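Something like the following is the kind of benchmark I mean - a rough sketch that uses a plain dict keyed by (i, j) as a stand-in for the dok format, just to measure raw insertion cost (the sizes and the helper name are made up for illustration):

```python
import random
import time

# Rough benchmark sketch: a plain dict keyed by (i, j) as a stand-in
# for a dok-style sparse matrix, filled with random entries.
def fill_dok(n, nnz, seed=0):
    rng = random.Random(seed)
    d = {}
    for _ in range(nnz):
        d[(rng.randrange(n), rng.randrange(n))] = 1.0
    return d

t0 = time.perf_counter()
d = fill_dok(10000, 100000)
elapsed = time.perf_counter() - t0
print(f"{len(d)} stored entries in {elapsed:.3f}s")
```

Comparing this against the same fill through scipy.sparse's dok and lil classes would show how much overhead the wrapper layers add on top of the underlying dict.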
Brian,
For N-d, are you thinking of something along the lines of Tammy Kolda's
tensor toolbox for Matlab? That could be accomplished with just
distributed dense vectors - I'm not sure what the performance would be
like. It's essentially the same as using a distributed coo format.
http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
http://www.models.kvl.dk/source/nwaytoolbox/index.asp
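By "the same as a distributed coo format" I mean roughly this: an N-d sparse tensor stored as one index array per dimension plus a parallel values array, each of which could be a distributed dense vector. A minimal local sketch (the CooTensor class is hypothetical, not any existing API):

```python
import numpy as np

# Sketch of an N-d COO-style tensor: one row of coords per dimension
# plus a parallel array of values. Hypothetical helper, not scipy API.
class CooTensor:
    def __init__(self, shape, coords, values):
        self.shape = shape                  # e.g. (2, 3, 4)
        self.coords = np.asarray(coords)    # shape (ndim, nnz)
        self.values = np.asarray(values, dtype=float)  # shape (nnz,)

    def todense(self):
        out = np.zeros(self.shape)
        # sum duplicate coordinates, like coo does on conversion
        np.add.at(out, tuple(self.coords), self.values)
        return out

# two stored entries: (0,2,3) = 5.0 and (1,0,1) = -1.0
t = CooTensor((2, 3, 4),
              coords=[[0, 1], [2, 0], [3, 1]],
              values=[5.0, -1.0])
assert t.todense()[0, 2, 3] == 5.0
```

In the distributed case, each of the ndim+1 arrays would simply be one of the distributed dense vectors the framework already provides.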
<snip>
>
> 3) That we begin to move their implementation to using Cython (as an
> aside, cython does play very well with templated C++ code). This
> could provide a much nicer way of tying into the c++ code than using
> swig.
Not being intimately familiar with Cython or SWIG, I did take a look
at the Cython homepage, and it sounds like a good fit. It does seem
that some people have strong objections to it. Support for templated
C++ code would be nice to have, though!
-viral