[Numpy-discussion] Possible roadmap addendum: building better text file readers
Sun Feb 26 13:58:35 CST 2012
On Sun, Feb 26, 2012 at 1:49 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
> <email@example.com> wrote:
> > On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> >> For this kind of benchmarking, you'd really rather be measuring the
> >> CPU time, or reading byte streams that are already in memory. If you
> >> can process more MB/s than the drive can provide, then your code is
> >> effectively perfectly fast. Looking at this number has a few
> >> advantages:
> >> - You get more repeatable measurements (no disk buffers and stuff
> >> messing with you)
> >> - If your code can go faster than your drive, then the drive won't
> >> make your benchmark look bad
> >> - There are probably users out there that have faster drives than you
> >> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
> >> array), so it's nice to be able to measure optimizations even after
> >> they stop mattering on your equipment.
> > For anyone benchmarking software like this, be sure to clear the disk
> > cache before each run. In Linux:
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
Right, I got that. Sorry if the placement of the notes about how to clear
the cache seemed to imply otherwise.
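
To make the hot-cache suggestion concrete, here is a minimal sketch of timing a reader against a buffer that is already in memory, using CPU time rather than wall-clock time. It assumes a modern Python 3 (time.process_time did not exist in 2012), and make_csv_bytes, parse_floats and cpu_throughput_mb_s are hypothetical names for a toy reader, not anything from NumPy:

```python
import io
import time


def make_csv_bytes(n_rows, n_cols=4):
    """Build a synthetic CSV payload entirely in memory (made-up test data)."""
    row = ",".join(["1.5"] * n_cols) + "\n"
    return (row * n_rows).encode("ascii")


def parse_floats(stream):
    """Toy reader: sum every comma-separated float in a binary stream."""
    total = 0.0
    for line in stream:
        for field in line.strip().split(b","):
            total += float(field)
    return total


def cpu_throughput_mb_s(data, repeats=3):
    """Best-of-N throughput in MB/s, measured with CPU time, not wall-clock."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        parse_floats(io.BytesIO(data))  # no disk involved at all
        best = min(best, time.process_time() - start)
    # Guard against a coarse clock reporting zero elapsed CPU time.
    return len(data) / 1e6 / best if best > 0 else float("inf")


if __name__ == "__main__":
    data = make_csv_bytes(200_000)
    print(f"{cpu_throughput_mb_s(data):.1f} MB/s of CPU-bound parsing")
```

Comparing the printed figure with what the drive can deliver shows whether the parser or the disk is the bottleneck, without the OS cache affecting the measurement.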
> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (i.e., pretty much anything doing random access, like
> databases, which have complicated locality patterns, you may or may
> not trigger readahead, etc.). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.
> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.
Thanks for the tip. Do you happen to have any sample code that
demonstrates this? I'd like to explore this more.
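
As a starting point, here is a minimal sketch of the two access patterns using Python's standard mmap module. It is not the code Nathaniel had in mind; count_newlines_read and count_newlines_mmap are hypothetical helpers, and whether the mapped version is actually faster depends on the platform and has to be measured:

```python
import mmap
import os
import tempfile


def count_newlines_read(path, chunk_size=1 << 20):
    """Count newlines with read(): every chunk is copied into a new bytes object."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_newlines_mmap(path):
    """Count newlines by searching a read-only memory mapping of the file."""
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count


if __name__ == "__main__":
    # Write a small temporary file and check that both approaches agree.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(b"1,2,3\n" * 1000)
        path = tmp.name
    try:
        print(count_newlines_read(path), count_newlines_mmap(path))
    finally:
        os.remove(path)
```

In the mapped version the search runs directly over the pages the OS has mapped into the process, so no intermediate bytes objects are created. In C the same idea is an mmap() call followed by ordinary pointer access.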