[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs@pobox....
Sun Feb 26 13:49:47 CST 2012

On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
<warren.weckesser@enthought.com> wrote:
> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> For this kind of benchmarking, you'd really rather be measuring the
>> CPU time, or reading byte streams that are already in memory. If you
>> can process more MB/s than the drive can provide, then your code is
>> effectively perfectly fast. Looking at this number has a few
>> advantages:
>>  - You get more repeatable measurements (no disk buffers and stuff
>> messing with you)
>>  - If your code can go faster than your drive, then the drive won't
>> make your benchmark look bad
>>  - There are probably users out there that have faster drives than you
>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>> array), so it's nice to be able to measure optimizations even after
>> they stop mattering on your equipment.
> For anyone benchmarking software like this, be sure to clear the disk cache
> before each run.  In linux:

Err, my argument was that you should do exactly the opposite, and just
worry about hot-cache times (or time reading a big in-memory buffer,
to avoid having to think about the OS's caching strategies).

Clearing the disk cache is very important for getting meaningful,
repeatable benchmarks in code where you know that the cache will
usually be cold and where hitting the disk will have unpredictable
effects (e.g., pretty much anything doing random access, like
databases, which have complicated locality patterns and may or may
not trigger readahead). But here we're talking about pure
sequential reads, where the disk just goes however fast it goes, and
your code can either keep up or not.
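
To make the hot-cache suggestion concrete, here is a rough sketch of
timing a parser against a buffer that is already in memory, counting
CPU time rather than wall-clock time. Everything in it is an
assumption for illustration: parse_floats() is a hypothetical stand-in
for whatever reader is being benchmarked, and the buffer size and
repeat count are arbitrary.

```python
import io
import time


def make_buffer(n_rows=20000, n_cols=5):
    """Build a CSV-like byte buffer entirely in memory (no disk involved)."""
    row = ",".join("1.5" for _ in range(n_cols)) + "\n"
    return (row * n_rows).encode("ascii")


def parse_floats(stream):
    """Hypothetical stand-in for the text reader being benchmarked."""
    return [
        [float(field) for field in line.decode("ascii").split(",")]
        for line in stream
    ]


def throughput_mb_per_s(parser, data, repeats=3):
    """Best-of-N throughput of `parser` over `data`, using CPU time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        parser(io.BytesIO(data))
        best = min(best, time.process_time() - start)
    # Guard against a timer too coarse to register a very fast run.
    return len(data) / 1e6 / best if best > 0 else float("inf")


data = make_buffer()
rate = throughput_mb_per_s(parse_floats, data)
print(f"{len(data) / 1e6:.2f} MB parsed at {rate:.1f} MB/s of CPU time")
```

The resulting MB/s figure can be compared directly against whatever
the drive delivers for sequential reads; if it is higher, the parser
is not the bottleneck on that machine.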

One minor point where the OS interface could matter: it's good to set
up your code so it can use mmap() instead of read(), since this can
reduce overhead. read() has to copy the data from the disk into the
OS's page cache, and then from the page cache into your process's
buffer; mmap() maps the cached pages straight into your address
space, so it skips the second copy.
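
As a rough illustration of the two access styles, here is a sketch
using Python's standard mmap module. The helper names and the sample
file are made up for the example, and it only shows that both paths
see the same bytes; it does not try to measure the difference in
overhead.

```python
import mmap
import os
import tempfile


def count_lines_read(path):
    """Count newlines after copying the whole file into a bytes object."""
    with open(path, "rb") as f:
        return f.read().count(b"\n")


def count_lines_mmap(path):
    """Count newlines by scanning a read-only mapping of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            n = 0
            pos = view.find(b"\n")
            while pos != -1:
                n += 1
                pos = view.find(b"\n", pos + 1)
            return n


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical sample file
    with open(path, "wb") as f:
        f.write(b"1.0 2.0 3.0\n" * 1000)
    n_read = count_lines_read(path)
    n_mmap = count_lines_mmap(path)

print(n_read, n_mmap)
```

A parser written against a generic buffer interface can accept either
the bytes from read() or the mapping, which keeps the choice open.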

-- Nathaniel
