[Numpy-discussion] fast constructor for arrays from byte data (unpickling?)

Robert Kern kern at caltech.edu
Wed Aug 8 16:24:14 CDT 2001


On Wed, Aug 08, 2001 at 01:37:02PM -0700, Chris Barker wrote:
> Robert Kern wrote:
> 
> > > Thanks, that works, but now I am wondering: what I want is a fast and
> > > memory-efficient way to get the contents of a file into a NumPy array,
> > > this sure doesn't look any better than:
> > >
> > > a = fromstring(file.read())
> > 
> > Depends on how large the file is. file.read() creates a temporary string the
> > size of the file. That string isn't freed until fromstring() finishes and
> > returns the array object. For a brief time, both the string and the array have
> > duplicates of the same data taking up space in memory.
> 
> Exactly. Also, is the memory used for the string guaranteed to be freed
> right away? I have no idea how Python internals work.

Once fromstring returns, the string's refcount drops to 0, so it should be freed
just about right away. I'm not sure, but I don't think the new gc will affect that.

> Anyway, that's why I want a "fromfile()" method, like the one in the
> library array module.

Yes, I agree, it would be nice to have.
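
For reference, here is a rough sketch of what the array module's fromfile()
looks like in use, reading in chunks so that any intermediate buffer stays no
larger than one chunk. The names load_doubles, chunk_items and demo_path are
made up for illustration, and the file is assumed to hold native-endian
doubles. The array may still reallocate as it grows, so this is not a
guarantee about peak memory.

```python
import array
import os
import tempfile

def load_doubles(path, chunk_items=65536):
    """Read a file of native-endian doubles into an array.array, chunk by chunk."""
    a = array.array("d")
    remaining = os.path.getsize(path) // a.itemsize
    with open(path, "rb") as f:
        while remaining > 0:
            n = min(chunk_items, remaining)
            # fromfile() appends n items read from the file object and
            # raises EOFError if fewer than n items are available.
            a.fromfile(f, n)
            remaining -= n
    return a

# Round-trip demo with a throwaway file (hypothetical data).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    array.array("d", [1.0, 2.5, -3.0]).tofile(f)
loaded = load_doubles(demo_path, chunk_items=2)
os.remove(demo_path)
```

A small chunk_items is used in the demo only to exercise the loop.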

> > I don't know the details of mmap, so it's certainly possible that the only way
> > that fromstring knows how to access the data is to pull all of it into memory
> > first, thus recreating the problem. Alas.
> 
> I don't understand mmap at all. From the name, it sounds like the entire
> contents of the file are mapped into memory, so the memory would get used
> as soon as you set it up. If anyone knows, I'd like to hear...

Performing a very cursory, very non-scientific study, I created a 110M file,
mmap'ed it, then made a string from some of it. I'm using 64MB of RAM, 128MB
swap partition on a Linux 2.4.7 machine. According to top(1), the memory use 
didn't jump up until I made the string. Also, given that the call to mmap
returned almost instantaneously, I'd say that mmap doesn't pull the whole file
into memory when the object is created (one could just read the file for that).
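
The experiment above can be sketched roughly as follows, using a much smaller
file. The function name read_doubles_via_mmap and the demo file are
illustrative only, and the file is assumed to contain native-endian doubles.
The point is that only the sliced portion of the mapping is copied into a new
bytes object; how the OS pages the file in is up to the platform.

```python
import mmap
import os
import struct
import tempfile

def read_doubles_via_mmap(path, start, count):
    """Return `count` native doubles starting at item index `start`."""
    itemsize = struct.calcsize("d")
    with open(path, "rb") as f:
        # Length 0 maps the whole file for reading.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes out of the mapping.
            chunk = mm[start * itemsize:(start + count) * itemsize]
    return list(struct.unpack("%dd" % count, chunk))

# Demo: write 1000 doubles (0.0 .. 999.0), then pull out three of them.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("1000d", *[float(i) for i in range(1000)]))
values = read_doubles_via_mmap(demo_path, 10, 3)
os.remove(demo_path)
```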

I don't know what test to perform to see whether the fromstring() constructor
uses double memory, but my guess would be that memcpy won't pull in the whole
file before copying. OTOH, the accessed portions of the mmap'ed file may be kept
in memory.

Does anyone know the details on mmap? I'm shooting in the dark, here.
<reads Phil Austin's e-mail>  Ooh, nice.

> -Chris

-- 
Robert Kern
kern at caltech.edu

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter



