[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs@pobox....
Wed Feb 29 12:17:53 CST 2012


On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
>> > Even for binary, there are pathological cases, e.g. 1) reading a random
>> > subset of nearly all rows.  2) reading a single column when rows are
>> > small.  In case 2 you will only go this route in the first place if you
>> > need to save memory.  The user should be aware of these issues.
>>
>> FWIW, this route actually doesn't save any memory as compared to np.memmap.
>
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows.  Here is an
> example that will end up pulling the entire file into memory
>
>    mm=numpy.memmap(fname, dtype=dtype)
>    rows=numpy.arange(mm.size)
>    x=mm['x'][rows]
>
> I just tested this on a 3G binary file and I'm sitting at 3G memory
> usage.  I believe this is because numpy.memmap only understands rows.  I
> don't fully understand the reason for that, but I suspect it is related
> to the fact that the ndarray really only has a concept of itemsize, and
> the fields are really just a reinterpretation of those bytes.  It may be
> that one could tweak the ndarray code to get around this.  But I would
> appreciate enlightenment on this subject.

Ahh, that makes sense. But, the tool you are using to measure memory
usage is misleading you -- you haven't mentioned what platform you're
on, but AFAICT none of them have very good tools for describing memory
usage when mmap is in use. (There isn't a very good way to handle it.)

What's happening is this: numpy read out just that column from the
mmap'ed memory region. The OS saw this and decided to read the entire
file, for reasons discussed previously. Then, since it had read the
entire file, it decided to keep it around in memory for now, just in
case some program wanted it again in the near future.

Now, if you instead fetched just those bytes from the file using
seek+read or whatever, the OS would treat that request in the exact
same way: it'd still read the entire file, and it would still keep the
whole thing around in memory. On Linux, you could test this by
dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
memory is listed as "free" in top, and then using your code to read
the same file -- you'll see that the 'free' memory drops by 3
gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
gigabytes.

[Note: if you try this experiment, make sure that you don't have the
same file opened with np.memmap -- for some reason Linux seems to
ignore the request to drop_caches for files that are mmap'ed.]
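
As a concrete illustration of the seek+read variant, here is a minimal
sketch using only the Python standard library. The record layout (two
little-endian doubles, 'x' then 'y') and all the function names are
made up for this example; they are not taken from the code under
discussion or from numpy.

```python
import os
import struct
import tempfile

# Hypothetical record layout: two little-endian doubles, 'x' then 'y'.
RECORD = struct.Struct("<dd")
X_FIELD = struct.Struct("<d")

def write_records(path, n):
    """Write n records with x = i and y = -i."""
    with open(path, "wb") as f:
        for i in range(n):
            f.write(RECORD.pack(float(i), float(-i)))

def read_x_column(path):
    """Read only the 'x' field of every record using seek + read."""
    nrec = os.path.getsize(path) // RECORD.size
    xs = []
    with open(path, "rb") as f:
        for i in range(nrec):
            f.seek(i * RECORD.size)  # 'x' sits at offset 0 of each record
            (x,) = X_FIELD.unpack(f.read(X_FIELD.size))
            xs.append(x)
    return xs

def demo(n=1000):
    """Write a temporary file and read back just its 'x' column."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "records.bin")
        write_records(path, n)
        return read_x_column(path)
```

With 16-byte records, every page of the file (typically 4096 bytes)
contains some 'x' values, so the OS still has to bring in every page
even though the program only asks for half of the bytes.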

The difference between mmap and reading is that in the former case
this cache memory will be "counted against" your process's
resident set size. The same memory is used either way -- it's just
that it gets reported differently by your tool. And in fact, this
memory is not really "used" at all, in the way we usually mean that
term -- it's just a cache that the OS keeps, and it will immediately
throw it away if there's a better use for that memory. The only reason
it's loading the whole 3 gigabytes into memory in the first place is
that you have >3 gigabytes of memory to spare.
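
If you want to watch the accounting effect directly, something like
the following sketch might help. It assumes Linux (it parses VmRSS
from /proc/self/status and returns None elsewhere), and the function
names are my own invention for the example.

```python
import mmap
import os
import tempfile

def rss_kib():
    """Return this process's resident set size in KiB, or None if unavailable."""
    try:
        with open("/proc/self/status") as f:  # Linux-specific
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

def touch_every_page(path):
    """Map the file read-only and read one byte from every page."""
    size = os.path.getsize(path)
    touched = 0
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            before = rss_kib()
            for off in range(0, size, mmap.PAGESIZE):
                mm[off]  # reading one byte faults the page in
                touched += 1
            after = rss_kib()  # measured while the mapping is still live
        finally:
            mm.close()
    return before, after, touched

def demo(size=32 << 20):
    """Create a temporary file of the given size and touch it via mmap."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "big.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * size)
        return touch_every_page(path)
```

I would expect "after" to exceed "before" by roughly the file size on
Linux, because the touched page-cache pages are now mapped into the
process. Reading the same file with plain read() calls should leave
the RSS roughly unchanged while using the same cache memory.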

You might even be able to tell the OS that you *won't* be reading that
file again, so there's no point in keeping it all cached -- on Unix
this is done via the madvise() or posix_fadvise() syscalls. (No
guarantee the OS will actually listen, though.)
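
From Python, the posix_fadvise() route could look roughly like the
sketch below. os.posix_fadvise and os.POSIX_FADV_DONTNEED exist only
on some Unix platforms, so the call is guarded; the function name and
chunk size are just choices made for the example.

```python
import os
import tempfile

def read_then_hint_dontneed(path, chunk=1 << 20):
    """Read a file sequentially, then advise the OS we will not need it again."""
    nbytes = 0
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            nbytes += len(buf)
        if hasattr(os, "posix_fadvise"):
            try:
                # offset=0, length=0 covers the whole file. Advisory only:
                # the OS is free to ignore it.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # some filesystems may reject the hint
    finally:
        os.close(fd)
    return nbytes

def demo(size=3000):
    """Write a small temporary file and read it back with the hint applied."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * size)
        return read_then_hint_dontneed(path)
```

For a mapped region, the analogous hint is madvise(); recent Python
versions expose it as a method on mmap objects where the platform
supports it.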

> This fact was the original motivator for writing my code; the text
> reading ability came later.
>
>> Cool. I'm just a little concerned that, since we seem to have like...
>> 5 different implementations of this stuff all being worked on at the
>> same time, we need to get some consensus on which features actually
>> matter, so they can be melded together into the Single Best File
>> Reader Evar. An interface where indexing and file-reading are combined
>> is significantly more complicated than one where the core file-reading
>> inner-loop can ignore indexing. So far I'm not sure why this
>> complexity would be worthwhile, so that's what I'm trying to
>> understand.
>
> I think I've addressed the reason why the low level C code was written.
> And I think a unified, high level interface to binary and text files,
> which the Recfile class provides, is worthwhile.
>
> Can you please say more about "...one where the core file-reading
> inner-loop can ignore indexing"?  I didn't catch the meaning.

Sure, sorry. What I mean is just, it's easier to write code that only
knows how to do a dumb sequential read, and doesn't know how to seek
to particular places and pick out just the fields that are being
requested. And it's easier to maintain, and optimize, and document,
and add features, and so forth. (And we can still have a high-level
interface on top of it, if that's useful.) So I'm trying to understand
if there's really a compelling advantage that we get by building seeking
smarts into our low-level C code, that we can't get otherwise.
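
To make the layering concrete, here is a rough sketch of what I have
in mind, in pure Python for brevity. The record layout and the names
iter_records and select are hypothetical; a real implementation would
have the inner loop in C and would produce arrays instead of lists.

```python
import io
import struct

RECORD = struct.Struct("<dd")  # hypothetical layout: fields 'x' and 'y'
FIELDS = ("x", "y")

def iter_records(f, batch=4096):
    """Low-level core: read every record in order, no seeking, no selection."""
    while True:
        buf = f.read(RECORD.size * batch)
        if not buf:
            break
        # Assumes a buffered file, where a short read only happens at EOF;
        # a trailing partial record is ignored.
        usable = len(buf) - len(buf) % RECORD.size
        yield from RECORD.iter_unpack(buf[:usable])

def select(f, field, rows=None):
    """High-level layer: pick one field and, optionally, a subset of rows."""
    idx = FIELDS.index(field)
    wanted = None if rows is None else set(rows)
    return [rec[idx] for i, rec in enumerate(iter_records(f))
            if wanted is None or i in wanted]

def demo():
    """Build five in-memory records and select from them."""
    data = b"".join(RECORD.pack(i, 10 * i) for i in range(5))
    return select(io.BytesIO(data), "y", rows=[1, 3])
```

The reader knows nothing about rows or fields, and all of the
selection logic lives in the small wrapper on top. The price is that
every record gets parsed even when only a few are wanted, which is
exactly the trade-off in question.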

-- Nathaniel

