[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?

Dan Lenski Daniel.Lenski@seagate....
Wed Aug 13 14:56:48 CDT 2008


Hi all,
I'm using NumPy to read and process data from ASCII UCD files.  This is a 
file format for describing unstructured finite-element meshes.

Most of the file consists of rectangular, numerical text matrices, easily 
and efficiently read with loadtxt().  But there is one particularly nasty 
section that consists of matrices with variable numbers of columns, like 
this:

# index property type nodes
1       1        tet  620 583 1578 1792
2       1        tet  656 551 553 566
3       1        tet  1565 766 1600 1646
4       1        tet  1545 631 1566 1665
5       1        hex  1531 1512 1559 1647 1648 1732
6       1        hex  777 1536 1556 1599 1601 1701
7       1        quad 296 1568 1535 1604
8       1        quad 54 711 285 666

As you might guess, the "type" label in the third column determines the 
number of node columns that follow.
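
To make that relationship concrete, here is a minimal sketch (not the 
reader discussed in this post) that groups the sample rows above by 
their "type" label and records how many node columns each label carries. 
The names SAMPLE and node_counts_by_type are hypothetical, and the counts 
come from the sample rows only, not from the UCD specification.

```python
from collections import defaultdict

# The eight sample rows shown above, copied verbatim.
SAMPLE = """\
# index property type nodes
1       1        tet  620 583 1578 1792
2       1        tet  656 551 553 566
3       1        tet  1565 766 1600 1646
4       1        tet  1545 631 1566 1665
5       1        hex  1531 1512 1559 1647 1648 1732
6       1        hex  777 1536 1556 1599 1601 1701
7       1        quad 296 1568 1535 1604
8       1        quad 54 711 285 666
"""


def node_counts_by_type(text):
    """Map each "type" label to the set of node-column counts seen for it."""
    counts = defaultdict(set)
    for line in text.splitlines():
        # Skip blank lines and the commented header row.
        if not line.strip() or line.startswith("#"):
            continue
        # Split off index, property and type; the remainder holds the nodes.
        index, prop, cell_type, nodes = line.split(None, 3)
        counts[cell_type].add(len(nodes.split()))
    return dict(counts)


counts = node_counts_by_type(SAMPLE)
```

If every label maps to a single count, as it does for these rows, then 
each label's rows form a rectangular block of their own.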

Some of my files contain sections like this of *more than 1 million 
lines*, so I need to be able to read them fast.  I have not yet come up
with a good way to do this.  Right now I split the lines into separate 
arrays based on the "type" label:

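import numpy as N
from numpy import array, uint
# f is the open UCD file, n is the number of lines in this section
lines = [next(f) for i in range(n)]
lines = [l.split(None, 3) for l in lines]
id, prop, types, nodes = zip(*lines) # THIS TAKES /FOREVER/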

id = array(id, dtype=uint)
prop = array(prop, dtype=uint)
types = array(types)

cells = {}
for t in N.unique(types):
 # indices of the rows carrying this type label
 these = N.nonzero(types==t)[0]
 # THIS NEXT LINE TAKES FOREVER
 these_nodes = array([nodes[ii].split() for ii in these], dtype=uint).T
 # one column per cell: id, property, then the node numbers
 cells[t] = N.vstack(( id[these], prop[these], these_nodes ))

This is slow and clearly sub-optimal.  Has anyone developed a more 
efficient way to read arrays with variable numbers of columns?

Dan


