[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?

Dan Lenski Daniel.Lenski@seagate....
Thu Aug 14 16:38:28 CDT 2008

On Thu, 14 Aug 2008 04:40:16 +0000, Daniel Lenski wrote:
> I assume that list-of-arrays is more memory-efficient since array
> elements don't have the overhead of full-blown Python objects.  But
> list-of-lists is probably more time-efficient since I think it's faster
> to convert the whole array at once than do it row-by-row.
> Dan

Just a follow-up...

Well, I tried the simple, straightforward list-of-lists approach and it's 
the fastest.  About 20 seconds for 1.5 million cells on my machine:

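    def _read_cells(self, f, n, debug=False):
        cells = dict()
        for i in xrange(n):
            cell = f.readline().split()
            # column 2 holds the cell type; the rest are integer fields
            celltype = cell.pop(2)
            if celltype not in cells: cells[celltype]=[]
            cells[celltype].append(cell)
        # convert each type's list-of-lists to an int array in one call
        for k in cells:
            cells[k] = N.array(cells[k], dtype=int).T
        return cells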

List-of-arrays uses about 20% less memory, but is about 4-5 times slower 
(presumably due to the overhead of array creation?).
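
For comparison, a minimal sketch of what a list-of-arrays variant might look like. The code used for the timing above is not shown in this post, so the function name, the use of `np.vstack`, and the sample data are all assumptions, and it is written for Python 3 rather than the Python 2 of the original snippets. Each row is converted to a small integer array as soon as it is read, which is where the per-row array creation overhead would come from.

```python
import io
import numpy as np

def read_cells_list_of_arrays(f, n):
    """Hypothetical list-of-arrays reader: one small int array per row."""
    arrays_by_type = {}
    for _ in range(n):
        fields = f.readline().split()
        celltype = fields.pop(2)            # column 2 holds the cell type
        row = np.array(fields, dtype=int)   # per-row string-to-int conversion
        arrays_by_type.setdefault(celltype, []).append(row)
    # Stack each type's rows into one 2-D array, then transpose as in the post.
    return {k: np.vstack(rows).T for k, rows in arrays_by_type.items()}

# Made-up sample input, three rows with the type token in column 2.
sample_loa = io.StringIO("1 2 tri 3 4\n5 6 quad 7 8\n9 10 tri 11 12\n")
cells_loa = read_cells_list_of_arrays(sample_loa, 3)
```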

And my preallocation approach is 2-3 times slower than list-of-lists.  
Again, I *think* this is due to array creation/conversion overhead, when 
assigning to a slice of the array:

    def _read_cells2(self, f, n, debug=False):
        cells = dict()
        count = dict()
        curtype = None

        for i in xrange(n):
            cell = f.readline().split()
            celltype = cell[2]

            if celltype!=curtype:
                curtype = celltype
                if curtype not in cells:
                    cells[curtype] = N.empty((n-i, len(cell)-1), dtype=int)
                    count[curtype] = 0
                block = cells[curtype]
            block[count[curtype]] = cell[:2]+cell[3:] ### THIS LINE HERE
            count[curtype] += 1

        for k in cells:
            cells[k] = cells[k][:count[k]].T

        return cells

So my conclusion is... you guys are right.  List-of-lists is the fastest 
way to build up an array.  Then do the string-to-numeric and list-to-
array conversion ALL AT ONCE with numpy.array(list_of_lists, dtype=int).
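
A self-contained sketch of that all-at-once conversion, written for Python 3 and run on a small in-memory file. The function name, the use of `setdefault`, and the sample data are illustrative assumptions, not taken from the original code. Rows are kept as lists of strings until the end, and a single `np.array(..., dtype=int)` call per cell type does both the string-to-integer and the list-to-array conversion.

```python
import io
import numpy as np

def read_cells(f, n):
    """Read n whitespace-separated rows and group them by the token in column 2."""
    rows_by_type = {}
    for _ in range(n):
        fields = f.readline().split()
        celltype = fields.pop(2)            # column 2 holds the cell type
        rows_by_type.setdefault(celltype, []).append(fields)
    # One conversion per cell type: strings to ints and lists to array together.
    return {k: np.array(rows, dtype=int).T for k, rows in rows_by_type.items()}

# Made-up sample input, three rows with the type token in column 2.
sample = io.StringIO("1 2 tri 3 4\n5 6 quad 7 8\n9 10 tri 11 12\n")
cells = read_cells(sample, 3)
```

Each value in `cells` is a transposed integer array, so every field ends up as one row with one column per cell of that type.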

Thanks for all the insight!

