[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs@pobox....
Sun Feb 26 13:00:38 CST 2012


On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser
<warren.weckesser@enthought.com> wrote:
> I haven't pushed it to the extreme, but the "big" example (in the examples/
> directory) is a 1 gig text file with 2 million rows and 50 fields in each
> row.  This is read in less than 30 seconds (but that's with a solid state
> drive).

Obviously this was just a quick test, but FYI, a solid state drive
shouldn't really make any difference here -- this is a pure sequential
read, and for sequential reads SSDs are, if anything, slower than
traditional spinning-platter drives.

For this kind of benchmarking, you'd really rather be measuring the
CPU time, or reading byte streams that are already in memory (rough
sketch below the list). If you can process more MB/s than the drive
can provide, then your code is effectively perfectly fast. Looking at
this number has a few advantages:
 - You get more repeatable measurements (no disk caches or OS buffers
messing with you)
 - If your code can go faster than your drive, then the drive won't
make your benchmark look bad
 - There are probably users out there who have faster drives than you
(e.g., I just measured ~340 megabytes/s off our lab's main RAID
array), so it's nice to be able to measure optimizations even after
they stop mattering on your equipment.

Cheers,
-- Nathaniel


More information about the NumPy-Discussion mailing list