[SciPy-Dev] sparse vectors / matrices / tensors

Nathaniel Smith njs@pobox....
Tue Sep 20 22:30:27 CDT 2011


On Tue, Sep 20, 2011 at 10:26 AM, David Cournapeau <cournape@gmail.com> wrote:
> I think those statementds are pretty uncontroversial. Implementing a
> sparse array (not limited to 1/2 dimensions) is a non trivial
> endeavour because the efficient 2d representations (csr/csc) do not
> generalize well.
>
> I started looking into the literature about those issues, and found a
> few interesting data structures, such as the UB-tree
> (http://www.scholarpedia.org/article/B-tree_and_UB-tree). I am
> actually currently looking into implementing this for a basic sparse
> tensor to get an idea of its efficiency (memory and speed-

The Matlab Tensor Toolbox uses a COO format for sparse tensors (arrays):
  http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
They explain their reasoning here:
  http://csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/SIAM-67648.pdf

I wouldn't be surprised if UB-trees or similar were better, but the
COO format is extremely simple to reason about and seems reasonably
efficient. The memory overhead versus CSR/CSC is not much, and if you
want to switch from column-wise to row-wise access, you just do an
in-place sort on one of the coordinate columns. Sorting is a fast
operation, and even more so if you have an algorithm that is (a)
stable (so re-sorting along one dimension leaves some structure in the
previous dimension), and (b) is able to exploit such structure to make
later sorts faster. Python's timsort is just such an algorithm. A
similar strategy works for applying element-wise binary operations to
two sparse COO arrays -- you treat the two arrays as being a single
big unsorted array, sort it to bring matching non-zero elements
together, and then use your binary operation to combine duplicate
entries.

As for general interfaces, I think it'd make more sense to talk about
that once we had at least one example of such a library released
publically... trying to generalize over zero examples is hard :-).

-- Nathaniel


More information about the SciPy-Dev mailing list