[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?

Dan Lenski Daniel.Lenski@seagate....
Thu Aug 14 16:38:28 CDT 2008

```
On Thu, 14 Aug 2008 04:40:16 +0000, Daniel Lenski wrote:
> I assume that list-of-arrays is more memory-efficient since array
> elements don't have the overhead of full-blown Python objects.  But
> list-of-lists is probably more time-efficient since I think it's faster
> to convert the whole array at once than do it row-by-row.
>
> Dan

Just a follow-up...

Well, I tried the simple, straightforward list-of-lists approach and it's
the fastest.  About 20 seconds for 1.5 million cells on my machine:

cells = dict()
for i in xrange(n):
    # cell is the list of string fields parsed from input line i
    celltype = cell.pop(2)
    if celltype not in cells: cells[celltype] = []
    cells[celltype].append(cell)
for k in cells:
    cells[k] = N.array(cells[k], dtype=int).T
return cells

List-of-arrays uses about 20% less memory, but is about 4-5 times slower
(presumably due to the overhead of array creation?).
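To see the gap for yourself, here is a minimal, self-contained timing sketch of the two strategies (the data is synthetic and the sizes are made up for illustration; the real code parses the rows from a text file):

```python
import timeit
import numpy as np

# Hypothetical stand-in for the parsed text cells: n rows of numeric strings.
n, width = 20000, 8
rows = [[str(j) for j in range(width)] for _ in range(n)]

def list_of_lists():
    # Accumulate plain Python lists, convert to an array once at the end.
    return np.array(rows, dtype=int)

def list_of_arrays():
    # Convert every row to its own small array first (the slower variant,
    # since each row pays the array-creation overhead separately).
    return np.array([np.array(r, dtype=int) for r in rows])

t_lol = timeit.timeit(list_of_lists, number=3)
t_loa = timeit.timeit(list_of_arrays, number=3)
print("list-of-lists: %.3fs  list-of-arrays: %.3fs" % (t_lol, t_loa))
```

Both produce identical results; only the per-row array creation differs.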

And my preallocation approach is 2-3 times slower than list-of-lists.
Again, I *think* this is due to array creation/conversion overhead, when
assigning to a slice of the array:

cells = dict()
count = dict()
curtype = None

for i in xrange(n):
    # cell is the list of string fields parsed from input line i
    celltype = cell[2]

    if celltype != curtype:
        curtype = celltype
        if curtype not in cells:
            cells[curtype] = N.empty((n-i, len(cell)-1), dtype=int)
            count[curtype] = 0
        block = cells[curtype]
    block[count[curtype]] = cell[:2]+cell[3:] ### THIS LINE HERE
    count[curtype] += 1

for k in cells:
    cells[k] = cells[k][:count[k]].T

return cells

So my conclusion is... you guys are right.  List-of-lists is the fastest
way to build up an array.  Then do the string-to-numeric and
list-to-array conversion ALL AT ONCE with numpy.array(list_of_lists, dtype=int).
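Putting it all together, the winning pattern looks like this end-to-end sketch (the sample lines and their "n1 n2 celltype n3 ..." layout are made up for illustration; the real rows come from the input file):

```python
import numpy as np

# Made-up sample input rows; field index 2 holds the cell type tag.
lines = [
    "1 0 tri 10 11 12",
    "2 0 tri 13 14 15",
    "3 0 quad 20 21 22 23",
]

cells = {}
for line in lines:
    cell = line.split()
    celltype = cell.pop(2)                    # strip the type tag from the row
    cells.setdefault(celltype, []).append(cell)  # group rows by cell type

# One numpy.array() call per cell type does the string-to-int conversion
# for the whole group at once, then transpose as in the original code.
for k in cells:
    cells[k] = np.array(cells[k], dtype=int).T
```

Grouping by cell type first is what makes the single bulk conversion possible, since rows of different types can have different lengths.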

Thanks for all the insight!

Dan

```