[SciPy-Dev] sparse vectors / matrices / tensors

Yannick Versley yversley@gmail....
Tue Sep 20 17:12:43 CDT 2011


I think the most important tradeoffs are about

- writable vs. read-only; where I assume that writes are batched (so you can put
the contents of writes somewhere and transition it to a compact read-only form
at the next read access)

- only access in lexical order vs. multidimensional access; my own
assumes always access in lexical order (or enumeration in that order), but
other people (Dan Goodman) have written about adding auxiliary indices
to facilitate
column-wise access.

My guess is that there will be something that works well for a
commonly encountered
subset of those requirements, but nothing that fits in a
one-size-fits-all fashion
if one also considers multidimensional access. (Keeping both the matrix and its
transpose enough is sometimes ok, but fails when you want to modify the values).

>* - scipy.sparse is limited to matrices and has no vectors or order-k tensors*>* - LIL and DOK are not really efficient or convenient data structures to*>* create*>*   sparse matrices (my own library basically keeps a list of unordered COO*>* items*>*   and compacts/sorts them when the matrix is actually used as a matrix)*
> I think those statementds are pretty uncontroversial. Implementing a
> sparse array (not limited to 1/2 dimensions) is a non trivial
> endeavour because the efficient 2d representations (csr/csc) do not
> generalize well.

Basically, you have dense access where you compute the offset
of the data from the address itself, and then you have COO where
you store (address,data) tuples (either separately or together).
CSR uses a mixture where you use a computed offset for a prefix of
the address and then store (address,data) tuples for a suffix of the
address. You could generalize this by always use a computed offset
for the first dimension of the address and then use (address,data)
for the rest (which gives you diminishing returns in terms of space
saved); or you could use some multilevel scheme where you first use
a computed offset and then look up N-2 (address,offset) pairs (with
the range end being specified by the offset in the following tuple)
followed by one last group of (address,data) tuples.

> I started looking into the literature about those issues, and found a
> few interesting data structures, such as the UB-tree
> (http://www.scholarpedia.org/article/B-tree_and_UB-tree). I am
> actually currently looking into implementing this for a basic sparse
> tensor to get an idea of its efficiency (memory and speed-wise).

Neat idea. Space-filling curves should give a decent tradeoff for
multidimensional access. (Using B-trees vs batching up writes is
yet another case of "it depends" - which would suggest a common
interface rather than a single common data structure for all sparse

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-dev/attachments/20110921/ecf1d724/attachment.html 

More information about the SciPy-Dev mailing list