[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wed Feb 29 12:57:05 CST 2012
Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <firstname.lastname@example.org> wrote:
> > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> >> > Even for binary, there are pathological cases, e.g. 1) reading a random
> >> > subset of nearly all rows. 2) reading a single column when rows are
> >> > small. In case 2 you will only go this route in the first place if you
> >> > need to save memory. The user should be aware of these issues.
> >> FWIW, this route actually doesn't save any memory as compared to np.memmap.
> > Actually, for numpy.memmap you will read the whole file if you try to
> > grab a single column and read a large fraction of the rows. Here is an
> > example that will end up pulling the entire file into memory:
> > mm=numpy.memmap(fname, dtype=dtype)
> > rows=numpy.arange(mm.size)
> > x=mm['x'][rows]
> > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > usage. I believe this is because numpy.memmap only understands rows. I
> > don't fully understand the reason for that, but I suspect it is related
> > to the fact that the ndarray really only has a concept of itemsize, and
> > the fields are really just a reinterpretation of those bytes. It may be
> > that one could tweak the ndarray code to get around this. But I would
> > appreciate enlightenment on this subject.
> Ahh, that makes sense. But, the tool you are using to measure memory
> usage is misleading you -- you haven't mentioned what platform you're
> on, but AFAICT none of them have very good tools for describing memory
> usage when mmap is in use. (There isn't a very good way to handle it.)
> What's happening is this: numpy read out just that column from the
> mmap'ed memory region. The OS saw this and decided to read the entire
> file, for reasons discussed previously. Then, since it had read the
> entire file, it decided to keep it around in memory for now, just in
> case some program wanted it again in the near future.
> Now, if you instead fetched just those bytes from the file using
> seek+read or whatever, the OS would treat that request in the exact
> same way: it'd still read the entire file, and it would still keep the
> whole thing around in memory. On Linux, you could test this by
> dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> memory is listed as "free" in top, and then using your code to read
> the same file -- you'll see that the 'free' memory drops by 3
> gigabytes, and the 'buffers' or 'cached' numbers will grow by 3 gigabytes.
> [Note: if you try this experiment, make sure that you don't have the
> same file opened with np.memmap -- for some reason Linux seems to
> ignore the request to drop_caches for files that are mmap'ed.]
> The difference between mmap and reading is that in the former case
> this cache memory will be "counted against" your process's
> resident set size. The same memory is used either way -- it's just
> that it gets reported differently by your tool. And in fact, this
> memory is not really "used" at all, in the way we usually mean that
> term -- it's just a cache that the OS keeps, and it will immediately
> throw it away if there's a better use for that memory. The only reason
> it's loading the whole 3 gigabytes into memory in the first place is
> that you have >3 gigabytes of memory to spare.
> You might even be able to tell the OS that you *won't* be reading that
> file again, so there's no point in keeping it all cached -- on Unix
> this is done via the madvise() or posix_fadvise() syscalls. (No
> guarantee the OS will actually listen, though.)
This is interesting, and on my machine I think I've verified that what
you say is actually true.
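A minimal sketch of the posix_fadvise() hint mentioned above, as it might
look from Python. The helper names (read_then_release, demo_fadvise) and the
chunk size are assumptions made for illustration; os.posix_fadvise is only
available on some Unix platforms, so the call is guarded, and the hint is
purely advisory.

```python
import os
import tempfile


def read_then_release(fname, chunk=1 << 20):
    """Read a file sequentially, then hint that its cached pages can go."""
    total = 0
    fd = os.open(fname, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        # offset=0 and len=0 cover the whole file. The OS is free to
        # ignore this; it is a hint, not a command.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


def demo_fadvise(nbytes=4096):
    """Write a scratch file (hypothetical name) and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        fname = os.path.join(tmp, "scratch.bin")
        with open(fname, "wb") as f:
            f.write(b"\0" * nbytes)
        return os.path.getsize(fname), read_then_release(fname)
```

Whether the cached pages are actually released depends on the kernel, so
this changes what gets reported as 'cached', not what the reader returns.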
This all makes theoretical sense, but goes against some experiments I
and my colleagues have done. For example, a colleague of mine was able
to read a couple of large files in using my code but not using memmap.
The combined files were greater than memory size. With memmap the code
started swapping. This was on 32-bit OS X. But as I said, I just tested
this on my Linux box and it works fine with numpy.memmap. I don't have
an OS X box to test this on.
So if what you say holds up on non-Linux systems, it is in fact an
indicator that the section of my code dealing with binary could be
dropped; that bit was trivial anyway.
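For reference, a rough sketch of the two ways of pulling one field out of a
binary record file that are being compared in this thread: numpy.memmap
versus explicit seek+read. This is not the code from my package. The record
layout, the file name and the function names are made up for illustration;
only the field name 'x' comes from the example quoted above.

```python
import os
import tempfile

import numpy as np

# Hypothetical packed record layout (20 bytes per row).
dtype = np.dtype([("x", "f8"), ("y", "f8"), ("id", "i4")])


def column_via_memmap(fname, dtype, field):
    """Map the file and copy one field out of the mapped records."""
    mm = np.memmap(fname, dtype=dtype, mode="r")
    return np.array(mm[field])  # copy so the result outlives the mapping


def column_via_seek_read(fname, dtype, field):
    """Fetch one field per row with explicit seek + read calls."""
    field_dtype, offset = dtype.fields[field][:2]
    nrows = os.path.getsize(fname) // dtype.itemsize
    out = np.empty(nrows, dtype=field_dtype)
    with open(fname, "rb") as f:
        for i in range(nrows):
            # Jump to this row's copy of the field and read only its bytes.
            f.seek(i * dtype.itemsize + offset)
            raw = f.read(field_dtype.itemsize)
            out[i] = np.frombuffer(raw, dtype=field_dtype)[0]
    return out


def demo(nrows=1000):
    """Write a small test file and read column 'x' back both ways."""
    data = np.zeros(nrows, dtype=dtype)
    data["x"] = np.arange(nrows) * 0.5
    data["y"] = -1.0
    data["id"] = np.arange(nrows)
    with tempfile.TemporaryDirectory() as tmp:
        fname = os.path.join(tmp, "records.bin")
        data.tofile(fname)
        a = column_via_memmap(fname, dtype, "x")
        b = column_via_seek_read(fname, dtype, "x")
    return a, b
```

Both routes return the same values. With rows this small, either one ends
up touching every page of the file, which is the point made above about the
OS cache.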
Erin Scott Sheldon
Brookhaven National Laboratory