[Numpy-discussion] reading *big* inhomogenous text matrices *fast*?

robert.kern@gmai...
Wed Aug 13 20:55:02 CDT 2008


On 2008-08-13, Daniel Lenski <dlenski@gmail.com> wrote:
> On Wed, 13 Aug 2008 16:57:32 -0400, Zachary Pincus wrote:
>> Your approach generates numerous large temporary arrays and lists. If
>> the files are large, the slowdown could be because all that memory
>> allocation is causing some VM thrashing. I've run into that at times
>> parsing large text files.
>
> Thanks, Zach.  I do think you have the right explanation for what was
> wrong with my code.
>
> I thought the slowdown was due to the overhead of interpreted code.  So I
> tried to do everything in list comprehensions and array statements rather
> than explicit Python loops.  But you were definitely right: the slowdown
> was due to memory use, not interpreted code.
>
>> Perhaps better would be to iterate through the file, building up your
>> cells dictionary  incrementally. Finally, once the file is read in
>> fully, you could convert what you can to arrays...
>>
>> import numpy
>>
>> f = open('big_file')
>> header = f.readline()
>> cells = {'tet':[], 'hex':[], 'quad':[]}
>> for line in f:
>>    vals = line.split()
>>    index_property = vals[:2]
>>    type = vals[2]
>>    nodes = vals[3:]
>>    cells[type].append(index_property+nodes)
>> for type, vals in cells.items():
>>    cells[type] = numpy.array(vals, dtype=int)
>
> This is similar to what I tried originally!  Unfortunately, repeatedly
> appending to a list seems to be very slow... I guess Python keeps
> reallocating and copying the list as it grows.  (It would be nice to be
> able to tune the increments by which the list size increases.)

The list reallocation schedule is actually fairly well-tuned as it is.
Appending to a list object should be amortized O(1) time.
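
One rough way to see this on CPython is to watch sys.getsizeof() change
as a list grows; each change marks a reallocation. This is only a sketch:
the over-allocation pattern is a CPython implementation detail, and
count_reallocations is a made-up helper name, not a library function.

```python
import sys

def count_reallocations(n):
    """Append n items to a list and count how often its allocated size changes."""
    items = []
    last_size = sys.getsizeof(items)
    reallocations = 0
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:  # the list's backing storage was resized
            reallocations += 1
            last_size = size
    return reallocations

n = 100000
resizes = count_reallocations(n)
print("%d appends, %d reallocations" % (n, resizes))
```

The number of reallocations should come out far smaller than the number
of appends, which is what makes the per-append cost amortized constant.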

>> I'm not sure if this is exactly what you want, but you get the idea...
>> Anyhow, the above only uses about twice as much RAM as the size of the
>> file. Your approach looks like it uses several times more than that.
>>
>> Also you could see if:
>>    cells[type].append(numpy.array([index, property]+nodes, dtype=int))
>>
>> is faster than what's above... it's worth testing.
>
> Repeatedly concatenating arrays with numpy.append or numpy.concatenate is
> also quite slow, unfortunately. :-(

Yes. There is no preallocation here.
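
A small sketch of what that means in practice: numpy.append returns a
new array on every call, so the data accumulated so far is copied each
time. The array contents here are arbitrary illustration values.

```python
import numpy as np

a = np.arange(5)
b = np.append(a, 5)            # allocates a new 6-element array and copies a into it

print(b is a)                  # False: a itself is never grown in place
print(np.shares_memory(a, b))  # False: b has its own buffer
```

Growing an array one row at a time to n rows therefore copies on the
order of n**2 rows in total.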

>> If even that's too slow, maybe you'll need to do this in C? That
>> shouldn't be too hard, really.
>
> Yeah, I eventually came up with a decent Python solution:
> preallocate the arrays to the maximum size that might be needed.  Trim
> them down afterwards.  This is very wasteful of memory when there may be
> many cell types (less so if the OS does lazy allocation), but in the
> typical case of only a few cell types it works great:

Another approach would be to preallocate a substantial chunk at a
time, then concatenate all of the chunks.
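
A minimal sketch of that idea, assuming every row has the same number of
integer columns (so it would be applied per cell type). The function
name read_rows_chunked, the chunk_rows parameter, and the sample lines
are all hypothetical, chosen just for illustration.

```python
import numpy as np

def read_rows_chunked(rows, ncols, chunk_rows=100000):
    """Collect integer rows into one 2-D array, allocating chunk_rows rows at a time."""
    chunks = []                                   # completed, full chunks
    chunk = np.empty((chunk_rows, ncols), dtype=int)
    used = 0                                      # rows filled in the current chunk
    for row in rows:
        if used == chunk_rows:                    # current chunk is full: start a new one
            chunks.append(chunk)
            chunk = np.empty((chunk_rows, ncols), dtype=int)
            used = 0
        chunk[used] = row
        used += 1
    chunks.append(chunk[:used])                   # trim the unused tail of the last chunk
    return np.concatenate(chunks)

# Tiny example with an artificially small chunk size.
lines = ["1 10 100 101 102 103",
         "2 10 104 105 106 107",
         "3 11 108 109 110 111"]
parsed = ([int(v) for v in line.split()] for line in lines)
result = read_rows_chunked(parsed, ncols=6, chunk_rows=2)
print(result.shape)
```

At most one chunk's worth of memory is wasted at any time, and the only
full copy happens once, in the final concatenate.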

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

