[Numpy-discussion] Possible roadmap addendum: building better text file readers

Robert Kern robert.kern@gmail....
Wed Feb 29 09:51:45 CST 2012

On Wed, Feb 29, 2012 at 15:11, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
>> > Even for binary, there are pathological cases, e.g. 1) reading a random
>> > subset of nearly all rows.  2) reading a single column when rows are
>> > small.  In case 2 you will only go this route in the first place if you
>> > need to save memory.  The user should be aware of these issues.
>> FWIW, this route actually doesn't save any memory as compared to np.memmap.
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows.  Here is an
> example that will end up pulling the entire file into memory
>    mm=numpy.memmap(fname, dtype=dtype)
>    rows=numpy.arange(mm.size)
>    x=mm['x'][rows]
> I just tested this on a 3G binary file and I'm sitting at 3G memory
> usage.  I believe this is because numpy.memmap only understands rows.  I
> don't fully understand the reason for that, but I suspect it is related
> to the fact that the ndarray really only has a concept of itemsize, and
> the fields are really just a reinterpretation of those bytes.  It may be
> that one could tweak the ndarray code to get around this.  But I would
> appreciate enlightenment on this subject.

Each record (I would avoid the word "row" in this context) is
contiguous in memory whether that memory is mapped to disk or not.
Additionally, the way that virtual memory (i.e. mapped memory) works
is that when you request the data at a given virtual address, the OS
looks up the page it resides in (typically 4-8 KiB in size) and
pulls the whole page into main memory. When records are smaller than
a page, every page holds at least one 'x' value, so requesting 'x'
from most of the records probably pulls all of the file into main
memory. Memory mapping works best when you pull out contiguous chunks
at a time rather than pulling out stripes.
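
To make the page arithmetic concrete, here is a rough sketch that counts
how many pages a strided field read would touch. The 4096-byte page size,
the two-field dtype, and the record count are illustrative assumptions,
not values from the file discussed above.

```python
import numpy as np

page_size = 4096  # assumed page size in bytes; the real value is OS-dependent
dtype = np.dtype([('x', 'f8'), ('y', 'f8')])  # hypothetical 16-byte record
n_records = 100_000  # hypothetical record count

# Byte offset of the 'x' field inside every record of the file.
x_offset = dtype.fields['x'][1]
offsets = np.arange(n_records) * dtype.itemsize + x_offset

# Pages that must be faulted in to read 'x' from every record.
pages_touched = np.unique(offsets // page_size).size

# Pages occupied by the whole file (ceiling division).
total_bytes = n_records * dtype.itemsize
total_pages = (total_bytes + page_size - 1) // page_size

print(pages_touched, "of", total_pages, "pages touched")
```

With records this small, the two counts come out equal: reading one field
from every record touches every page of the file.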

numpy structured arrays do not rearrange your data to make all of the
'x' values contiguous with each other. You can arrange that yourself, if
you like (use a structured scalar with a dtype such that each field is
an array of the appropriate length and dtype). Then pulling out all of
the 'x' field values will touch only a fraction of the file.
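
A minimal sketch of that layout, assuming a hypothetical file with two
float64 fields and 1000 values per field (the file name, field names, and
sizes are made up for illustration):

```python
import os
import tempfile

import numpy as np

n = 1000  # hypothetical number of values per field

# Usual layout: n records, with x and y interleaved on disk.
aos_dtype = np.dtype([('x', 'f8'), ('y', 'f8')])
aos = np.zeros(n, dtype=aos_dtype)

# Alternative layout: ONE record whose fields are whole arrays, so all
# x values are stored contiguously, followed by all y values.
soa_dtype = np.dtype([('x', 'f8', (n,)), ('y', 'f8', (n,))])

# Write a small example file in the alternative layout.
data = np.zeros(1, dtype=soa_dtype)
data['x'][0] = np.arange(n)
data['y'][0] = -np.arange(n)
path = os.path.join(tempfile.mkdtemp(), 'soa_example.bin')
data.tofile(path)

# Map the file and pull out the 'x' field of the single record.
mm = np.memmap(path, dtype=soa_dtype, mode='r', shape=(1,))
x = mm['x'][0]

print(aos['x'].strides)  # stride spans a whole record
print(x.strides)         # stride is one float64: contiguous on disk
```

Here reading `x` only touches the first `n * 8` bytes of the file, so the
pages holding `y` never need to be loaded.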

Robert Kern
