[Numpy-discussion] Possible roadmap addendum: building better text file readers

Warren Weckesser warren.weckesser@enthought....
Sun Feb 26 15:22:35 CST 2012

On Sun, Feb 26, 2012 at 3:00 PM, Nathaniel Smith <njs@pobox.com> wrote:

> On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
> <warren.weckesser@enthought.com> wrote:
> > Right, I got that.  Sorry if the placement of the notes about how to clear
> > the cache seemed to imply otherwise.
> OK, cool, np.
> >> Clearing the disk cache is very important for getting meaningful,
> >> repeatable benchmarks in code where you know that the cache will
> >> usually be cold and where hitting the disk will have unpredictable
> >> effects (i.e., pretty much anything doing random access, like
> >> databases, which have complicated locality patterns, you may or may
> >> not trigger readahead, etc.). But here we're talking about pure
> >> sequential reads, where the disk just goes however fast it goes, and
> >> your code can either keep up or not.
> >>
> >> One minor point where the OS interface could matter: it's good to set
> >> up your code so it can use mmap() instead of read(), since this can
> >> reduce overhead. read() has to copy the data from the disk into OS
> >> memory, and then from OS memory into your process's memory; mmap()
> >> skips the second step.
> >
> > Thanks for the tip.  Do you happen to have any sample code that demonstrates
> > this?  I'd like to explore this more.
> No, I've never actually run into a situation where I needed it myself,
> but I learned the trick from Tridge so I tend to believe it :-).
> mmap() is actually a pretty simple interface -- the only thing I'd
> watch out for is that you want to mmap() the file in pieces (so as to
> avoid VM exhaustion on 32-bit systems), but you want to use pretty big
> pieces (because each call to mmap()/munmap() has overhead). So you
> might want to use chunks in the 32-128 MiB range. Or since I guess
> you're probably developing on a 64-bit system you can just be lazy and
> mmap the whole file for initial testing. git uses mmap, but I'm not
> sure it's very useful example code.
> Also it's not going to do magic. Your code has to be fairly quick
> before avoiding a single memcpy() will be noticeable.
> HTH,

Yes, thanks!  I'm working on an mmap version now.  I'm very curious to see
just how much of an improvement it can give.
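
For reference, the chunked approach described above (map the file in large
pieces, scan each piece, unmap, move on) might look roughly like the
following in Python. This is only a minimal sketch using the standard
library's mmap module, not the reader being developed in this thread. The
function name count_newlines_mmap and the 64 MiB default chunk size are
illustrative assumptions; the chunk size is simply picked from the 32-128 MiB
range suggested above.

```python
import mmap
import os


def count_newlines_mmap(path, chunk_size=64 * 1024 * 1024):
    """Count b"\\n" bytes in a file by mapping it one chunk at a time.

    Hypothetical helper for illustration only. Because the pattern is a
    single byte, no match can straddle a chunk boundary.
    """
    # Offsets passed to mmap must be multiples of the allocation
    # granularity, so round the chunk size down to such a multiple.
    gran = mmap.ALLOCATIONGRANULARITY
    chunk_size = max(gran, (chunk_size // gran) * gran)

    total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        # An empty file is never mapped (mapping zero bytes is an error).
        while offset < size:
            length = min(chunk_size, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                # Search the mapping directly, without copying it out.
                pos = m.find(b"\n")
                while pos != -1:
                    total += 1
                    pos = m.find(b"\n", pos + 1)
            offset += length
    return total
```

On a 64-bit system the loop could be replaced by a single mapping of the
whole file for initial testing, as suggested above. Whether this beats plain
read() calls would need to be measured; the sketch makes no claim either way.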

