[SciPy-user] Creating coo_matrix from data in text file

Dinesh B Vadhia dineshbvadhia@hotmail....
Thu Feb 7 16:49:55 CST 2008


Thank you, Nathan, but I was looking for a method that doesn't use the interim arrays:

row  = IJV[:,0]
col  = IJV[:,1]
data = IJV[:,2]

because our datasets are very large and these interim arrays cause out-of-memory errors.

We are looking for a method to populate a coo_matrix (or csr_matrix) directly from a file containing the (i, j, v) items.  We can then save/load the csr_matrix using Andrew Straw's fast code.
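
For illustration, one way to keep the working arrays small is to read the file a fixed number of lines at a time and add each piece into a running csr_matrix. This is only a sketch under assumptions, not a method confirmed in this thread: the function name csr_from_ijv_file and the chunk_lines parameter are hypothetical, the final matrix shape is assumed to be known in advance, and duplicate (i, j) entries are summed.

```python
import itertools

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def csr_from_ijv_file(path, shape, chunk_lines=1000000):
    """Build a CSR matrix from a text file of 'i j v' lines, one chunk at a time.

    Hypothetical helper: `shape` must be supplied by the caller because the
    largest indices are not known until the whole file has been read.
    """
    total = csr_matrix(shape, dtype=np.float64)
    with open(path) as f:
        while True:
            # Pull at most chunk_lines lines; an empty list means end of file.
            lines = list(itertools.islice(f, chunk_lines))
            if not lines:
                break
            # Parse only this chunk into an (n, 3) float array.
            ijv = np.array(" ".join(lines).split(), dtype=float).reshape(-1, 3)
            if ijv.size == 0:
                continue  # chunk contained only blank lines
            rows = ijv[:, 0].astype(np.intp)
            cols = ijv[:, 1].astype(np.intp)
            part = coo_matrix((ijv[:, 2], (rows, cols)), shape=shape)
            # Sparse addition; entries at the same (i, j) are summed.
            total = total + part.tocsr()
    return total
```

Peak memory is then roughly one chunk of working arrays plus the accumulated matrix, at the cost of one sparse addition per chunk. A smaller chunk_lines lowers the peak but makes the build slower.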

Hope this makes sense!

Dinesh

------------------------------
Date: Tue, 5 Feb 2008 18:07:51 -0600
From: "Nathan Bell" <wnbell@gmail.com>
Subject: Re: [SciPy-user] Creating coo_matrix from data in text file
To: "SciPy Users List" <scipy-user@scipy.org>
Message-ID:
<d05265cb0802051607g7fde0e22q1a6b1324c71ee21b@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Feb 5, 2008 5:08 PM, Dinesh B Vadhia <dineshbvadhia@hotmail.com> wrote:
> The sparse coo_matrix method performs really well but our data sets are very
> large and the working arrays (ie. ij, row, column and data) take up
> significant memory.  The judicious use of del <working array object> helps
> but not that much.
>
> Is there a fast method available similar to coo_matrix to create a sparse
> matrix from a text file instead of through a set of interim working arrays?
> The file would contain the coordinates (i, j) and the value of each item.
> Once the sparse matrix has been created we can then save/load it at will
> (using Andrew Straw's fast load/save code).

Suppose you have a file named matrix.txt with the following contents:

$ cat matrix.txt
0 1 10
0 2 20
5 3 -5
6 4 14


Now run this script:

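from numpy import fromfile
from scipy.sparse import coo_matrix

# Read every whitespace-separated number, then view them as (i, j, v) rows
IJV = fromfile("matrix.txt", sep=" ").reshape(-1, 3)

# fromfile returns floats, so cast the index columns to integers
row  = IJV[:,0].astype(int)
col  = IJV[:,1].astype(int)
data = IJV[:,2]

A = coo_matrix( (data,(row,col)) )

print(repr(A))
print(A.todense())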



You should see:

<7x5 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in COOrdinate format>
[[  0.  10.  20.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.  -5.   0.]
 [  0.   0.   0.   0.  14.]]
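
The 7x5 shape above is inferred from the largest row and column indices plus one (6 + 1 and 4 + 1). If the real matrix has trailing empty rows or columns, the shape can be passed explicitly instead. A small sketch, where the 10x10 shape is an arbitrary value chosen only for illustration:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Same entries as matrix.txt, written out as arrays
row = np.array([0, 0, 5, 6])
col = np.array([1, 2, 3, 4])
data = np.array([10.0, 20.0, -5.0, 14.0])

# Shape inferred from the maximum indices
inferred = coo_matrix((data, (row, col)))

# Shape given explicitly (10x10 is an assumed, illustrative size)
explicit = coo_matrix((data, (row, col)), shape=(10, 10))

print(inferred.shape)  # (7, 5)
print(explicit.shape)  # (10, 10)
```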


This should be very fast.  The only thing likely to be faster is the
recent scipy.io MATLAB file support, which stores data in binary format
(or storing the data in your own binary format, I suppose).
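
As a sketch of the binary-format idea: later SciPy releases provide scipy.sparse.save_npz and scipy.sparse.load_npz, which store a sparse matrix in NumPy's binary .npz format. These functions did not exist when this message was written and are not the load/save code mentioned earlier in the thread; the temporary file path below is just for the example.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, load_npz, save_npz

# Same entries as matrix.txt, converted to CSR before saving
row = np.array([0, 0, 5, 6])
col = np.array([1, 2, 3, 4])
data = np.array([10.0, 20.0, -5.0, 14.0])
A = coo_matrix((data, (row, col))).tocsr()

# Write to a temporary .npz file and read it back
path = os.path.join(tempfile.mkdtemp(), "matrix.npz")
save_npz(path, A)
B = load_npz(path)

# The round trip should preserve shape and values
print(B.shape)
print(np.array_equal(A.toarray(), B.toarray()))
```

Once the text file has been parsed a single time, later runs can load the binary file directly and skip the text parsing.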


-- 
Nathan Bell wnbell@gmail.com
http://graphics.cs.uiuc.edu/~wnbell/

