[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?
Dan Lenski
Daniel.Lenski@seagate....
Thu Aug 14 16:38:28 CDT 2008
On Thu, 14 Aug 2008 04:40:16 +0000, Daniel Lenski wrote:
> I assume that list-of-arrays is more memory-efficient since array
> elements don't have the overhead of full-blown Python objects. But
> list-of-lists is probably more time-efficient since I think it's faster
> to convert the whole array at once than do it row-by-row.
>
> Dan
Just a follow-up...
Well, I tried the simple, straightforward list-of-lists approach and it's
the fastest. About 20 seconds for 1.5 million cells on my machine:
def _read_cells(self, f, n, debug=False):
    cells = dict()
    for i in xrange(n):
        cell = f.readline().split()
        celltype = cell.pop(2)
        if celltype not in cells: cells[celltype]=[]
        cells[celltype].append(cell)
    for k in cells:
        cells[k] = N.array(cells[k], dtype=int).T
    return cells
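For reference, here is a self-contained sketch of the same list-of-lists idea in Python 3 syntax. The function name read_cells and the sample rows are made up for illustration; they are not from the original code, which was a method reading from a real file.

```python
import io
import numpy as np

def read_cells(f, n):
    """Read n whitespace-separated rows and group them by the tag in column 2."""
    rows_by_type = {}
    for _ in range(n):
        fields = f.readline().split()
        celltype = fields.pop(2)  # remove the type tag; the rest stay as strings
        rows_by_type.setdefault(celltype, []).append(fields)
    # One bulk string-to-int conversion per cell type, transposed as in the post.
    return {k: np.array(v, dtype=int).T for k, v in rows_by_type.items()}

# Hypothetical sample data: two integers, a type tag, then more integers.
sample = io.StringIO(
    "1 10 tri 100 101 102\n"
    "2 11 quad 200 201 202 203\n"
    "3 12 tri 103 104 105\n"
)
cells = read_cells(sample, 3)
print(cells["tri"].shape, cells["quad"].shape)
```

Rows of different cell types can have different lengths, which is why each type gets its own array rather than everything going into one matrix.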
List-of-arrays uses about 20% less memory, but is about 4-5 times slower
(presumably due to the overhead of array creation?).
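The list-of-arrays code is not shown in this message, so the following is only a guess at what such a variant might look like, in Python 3 syntax. The function name and the sample rows are assumptions. The point is that every input line triggers its own small array creation.

```python
import io
import numpy as np

def read_cells_list_of_arrays(f, n):
    """Hypothetical list-of-arrays variant: convert each row as it is read."""
    rows_by_type = {}
    for _ in range(n):
        fields = f.readline().split()
        celltype = fields.pop(2)
        # One small array is created per line; this is the suspected overhead.
        row = np.array(fields, dtype=int)
        rows_by_type.setdefault(celltype, []).append(row)
    # Stack the per-row arrays into one 2-D array per type, then transpose.
    return {k: np.vstack(v).T for k, v in rows_by_type.items()}

# Hypothetical sample data in the same layout as the list-of-lists reader.
sample = io.StringIO(
    "1 10 tri 100 101 102\n"
    "3 12 tri 103 104 105\n"
    "2 11 quad 200 201 202 203\n"
)
cells_loa = read_cells_list_of_arrays(sample, 3)
print(cells_loa["tri"])
```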
And my preallocation approach is 2-3 times slower than list-of-lists.
Again, I *think* this is due to array creation/conversion overhead, when
assigning to a slice of the array:
def _read_cells2(self, f, n, debug=False):
    cells = dict()
    count = dict()
    curtype = None
    for i in xrange(n):
        cell = f.readline().split()
        celltype = cell[2]
        if celltype!=curtype:
            curtype = celltype
            if curtype not in cells:
                # preallocate for the worst case: all remaining rows are this type
                cells[curtype] = N.empty((n-i, len(cell)-1), dtype=int)
                count[curtype] = 0
            block = cells[curtype]
        block[count[curtype]] = cell[:2]+cell[3:] ### THIS LINE HERE
        count[curtype] += 1
    for k in cells:
        # trim the unused preallocated rows, then transpose
        cells[k] = cells[k][:count[k]].T
    return cells
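A rough way to check the suspected cause is to time per-row assignment into a preallocated array against a single bulk conversion. The sketch below uses Python 3 syntax and synthetic rows of numeric strings (the row count and width are arbitrary assumptions). It verifies that both paths give the same array and prints timings, which will vary by machine and NumPy version.

```python
import timeit
import numpy as np

# Synthetic input: rows of numeric strings, as produced by str.split().
rows = [[str(i), str(i + 1), str(i + 2), str(i + 3)] for i in range(20000)]

def bulk():
    # Single string-to-int conversion of the whole list of lists.
    return np.array(rows, dtype=int)

def row_by_row():
    # Preallocate, then convert one row of strings per assignment.
    out = np.empty((len(rows), len(rows[0])), dtype=int)
    for i, row in enumerate(rows):
        out[i] = row
    return out

same = np.array_equal(bulk(), row_by_row())
t_bulk = timeit.timeit(bulk, number=3)
t_rows = timeit.timeit(row_by_row, number=3)
print("identical results:", same)
print("bulk: %.3fs  row-by-row: %.3fs" % (t_bulk, t_rows))
```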
So my conclusion is... you guys are right. List-of-lists is the fastest
way to build up an array. Then do the string-to-numeric and
list-to-array conversion ALL AT ONCE with numpy.array(list_of_lists, dtype=int).
Thanks for all the insight!
Dan