[Numpy-discussion] Possible roadmap addendum: building better text file readers

Francesc Alted francesc@continuum...
Sun Feb 26 14:00:01 CST 2012


On Feb 26, 2012, at 1:49 PM, Nathaniel Smith wrote:

> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
> <warren.weckesser@enthought.com> wrote:
>> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
>>> For this kind of benchmarking, you'd really rather be measuring the
>>> CPU time, or reading byte streams that are already in memory. If you
>>> can process more MB/s than the drive can provide, then your code is
>>> effectively perfectly fast. Looking at this number has a few
>>> advantages:
>>>  - You get more repeatable measurements (no disk buffers and stuff
>>> messing with you)
>>>  - If your code can go faster than your drive, then the drive won't
>>> make your benchmark look bad
>>>  - There are probably users out there that have faster drives than you
>>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>>> array), so it's nice to be able to measure optimizations even after
>>> they stop mattering on your equipment.
>> 
>> 
>> For anyone benchmarking software like this, be sure to clear the disk cache
>> before each run.  In Linux:
> 
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
> 
> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (i.e., pretty much anything doing random access, like
> databases, which have complicated locality patterns, you may or may
> not trigger readahead, etc.). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.

Exactly.
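
For what it's worth, here is a minimal sketch of that kind of measurement in Python. It times only CPU time (time.process_time) while parsing a byte buffer that is already in memory, and reports MB/s so the figure can be compared against what a drive delivers. The names parse_floats and throughput_mb_per_s are made up for illustration, and the toy parser just stands in for whatever reader is actually being benchmarked.

```python
import io
import time


def parse_floats(stream):
    """Toy comma-separated float parser (hypothetical stand-in for a real reader)."""
    count = 0
    total = 0.0
    for line in stream:
        for field in line.split(b","):
            total += float(field)
            count += 1
    return count, total


def throughput_mb_per_s(data, parser=parse_floats, repeats=3):
    """Best-of-N throughput in MB/s, measured in CPU time on an in-memory buffer."""
    best = float("inf")
    for _ in range(repeats):
        stream = io.BytesIO(data)          # no disk, no OS cache involved
        start = time.process_time()        # CPU time, not wall-clock time
        parser(stream)
        best = min(best, time.process_time() - start)
    # Guard against a zero reading on very small inputs.
    return len(data) / 1e6 / max(best, 1e-9)


if __name__ == "__main__":
    data = b"1.5,2.5,3.0\n" * 200_000      # a few MB of text, built in memory
    print("%.1f MB/s" % throughput_mb_per_s(data))
```

If the number printed is well above what the drive can sustain on a sequential read, the parser is not the bottleneck on that machine.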

> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.

Cool.  Nice trick!
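
A rough sketch of the two access styles, using Python's standard mmap module. The function names are hypothetical, and the newline count is just a placeholder for real parsing work. Whether the mmap version is actually faster will depend on the platform and the file size, so treat this as something to measure rather than a guaranteed win.

```python
import mmap
import os
import tempfile


def count_lines_read(path):
    """Read the whole file into a bytes object (a copy in process memory)."""
    with open(path, "rb") as f:
        return f.read().count(b"\n")


def count_lines_mmap(path):
    """Scan the file through a read-only memory map, without an explicit read().

    Note: mapping an empty file raises ValueError, so callers should
    handle that case separately.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count


if __name__ == "__main__":
    # Write a small temporary file so the example is self-contained.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"1.5,2.5,3.0\n" * 100_000)
        print(count_lines_read(path), count_lines_mmap(path))
    finally:
        os.remove(path)
```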

-- Francesc Alted




