[SciPy-dev] Sparse data structures.

Viral Shah vshah@interactivesupercomputing....
Mon Apr 7 14:15:19 CDT 2008


Sparse folks,

I had a couple of other general questions/comments about Python's  
sparse matrix infrastructure. Coming from the Matlab land, I am used  
to the system providing me just one data storage for sparse matrices.  
Python provides 5. The rationale for Matlab's sparse matrix design is  
described in great detail in:

http://www.hpl.hp.com/personal/Robert_Schreiber/papers/Sparse%20Matrices%20in%20Matlab/simax.pdf

For linear algebra operations such as direct solvers and iterative  
methods, it doesn't really matter what format you start out in. The  
cost of conversion is small compared to the cost of the operation.  
Most of these are provided in libraries, and hence not a huge source  
of worry. The real questions are array operations that are performed  
in code that a Python programmer may write. These include operations  
such as assembly, indexing, assignment, and concatenation. This is  
where the choice of data structure becomes important.

Offering 5 choices to the naive programmer who doesn't understand the  
details about sparse matrix implementations may make Scipy look  
daunting. I am wondering whats the best way to address this ? I  
summarize my understanding of the different formats below. I'd love to  
have a conversation with someone on how to make this useful in a  
coherent way (one default data structure, for example), or at the very  
least, update the documentation with some information to guide someone.

For row/col indexing, CSR and CSC formats can be decent - especially  
if you are extracting columns from CSC and rows from CSR. Random  
access is decent, but not the greatest. These formats are also good  
for matrix-vector multiplication, the workhorse of many iterative  
solvers - but one could use a library like OSKI for that. CSR/CSC are  
terrible for assembly and assignment. They can be decent for  
concatenation along the preferred dimension.

A dictionary representation is great for random access, assembly, and  
concatenation, but not very efficient for submatrix indexing and  
assignment. Co-ordinate representation can be good for operations,  
given that the right rule for combining duplicates is used. The good  
thing about both these representations is that they can be extended to  
N-d sparse arrays.

The list of lists approach lies somewhere in the middle. I am assuming  
these lists are kept sorted. The main drawback is the excessive memory  
usage to store all the pointers. I haven't yet looked at the code, and  
I could be wrong.

Thanks,

-viral





More information about the Scipy-dev mailing list