[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs@pobox....
Tue Feb 28 16:22:16 CST 2012


[Re-adding the list to the To: field, after it got dropped accidentally]

On Tue, Feb 28, 2012 at 12:28 AM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Mon Feb 27 17:33:52 -0500 2012:
>> On Mon, Feb 27, 2012 at 6:02 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
>> > Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
>> >> On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
>> >> > What I've got is a solution for writing and reading structured arrays to
>> >> > and from files, both in text files and binary files.  It is written in C
>> >> > and python.  It allows reading arbitrary subsets of the data efficiently
>> >> > without reading in the whole file.  It defines a class Recfile that
>> >> > exposes an array-like interface for reading, e.g. x=rf[columns][rows].
>> >>
>> >> What format do you use for binary data? Something tiled? I don't
>> >> understand how you can read in a single column of a standard text or
>> >> mmap-style binary file any more efficiently than by reading the whole
>> >> file.
>> >
>> > For binary, I just seek to the appropriate bytes on disk and read them,
>> > no mmap.  The user must have input an accurate dtype describing rows in
>> > the file of course.  This saves a lot of memory and time on big files if
>> > you just need small subsets.
>>
>> Have you quantified the time savings? I'd expect this to either be the
>> same speed or slower than reading the entire file.
>
> Nathaniel -
>
> Yes I've verified it, but as you point out below there are pathological
> cases.    See below.
>
>> The reason is that usually the OS cannot read just a few bytes from a
>> middle of a file -- if it is reading at all, it will read at least a
>> page (4K on Linux). If your rows are less than 4K in size, then
>> reading a little bit out of each row means that you will be loading
>> the entire file from disk regardless. You avoid any parsing overhead
>> for the skipped columns, but for binary files that should be zero.
>> (Even if you're doing endian conversion or something it should still
>> be trivial compared to disk speed.)
>
> I'll say up front, the speed gains for binary data are often huge over
> even numpy.memmap because memmap is not column aware.  My code doesn't
> have that limitation.

Hi Erin,

I don't doubt your observations, but... there must be something more
going on! In a modern VM implementation, what happens when you request
to read from an arbitrary offset in the file is:
  1) The OS works out which disk page (or set of pages, for a longer
read) contains the given offset
  2) It reads those pages from the disk, and loads them into some OS
owned buffers (the "page cache")
  3) It copies the relevant bytes out of the page cache into the
buffer passed to read()
And when you mmap() and then attempt to access some memory at an
arbitrary offset within the mmap region, what happens is:
  1) The processor notices that the given memory address doesn't map
to any real memory it knows about (a page fault), so it asks the OS
  2) The OS notices that this is a memory-mapped region, and works out
which disk page maps to the given memory address
  3) It reads that page from disk, and loads it into some OS owned
buffers (the "page cache")
  4) It updates the page tables to point at the cached page, and tells
the processor to retry the access

That is, reading at a bunch of fixed offsets inside a large memory
mapped array (which is what numpy does when you request a single
column of a recarray) should end up issuing *exactly the same read
commands* as writing code that explicitly seeks to those addresses and
reads them.
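
(To make that concrete, here's a minimal sketch of the numpy side --
the file name and dtype are made up, and the file has to exist with a
size that's a multiple of the row size:)

import numpy as np

# Memory-map a structured binary file; this reads nothing yet, it
# just sets up the mapping.
row_dtype = np.dtype([("x", np.float64), ("y", np.float64)])
arr = np.memmap("bigfile.bin", dtype=row_dtype, mode="r")

# Selecting one field gives a strided view; summing it touches one
# 8-byte value per 16-byte row, so the kernel faults pages in on
# demand -- the same page-sized reads that an explicit seek()+read()
# loop would trigger.
total = arr["x"].sum()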

But, I realized I've never actually tested this myself, so I wrote a
little test (attached). It reads a bunch of uint32's at equally-spaced
offsets from a large file, using either mmap, explicit seeks, or the
naive read-everything approach. I'm finding it very hard to get
precise results, because I don't have a spare drive and anything that
touches the disk really disrupts the timing here (and apparently
Ubuntu no longer has a real single-user mode :-(), but here are some
examples on a 200,000,000 byte file with different simulated row
sizes:

1024 byte rows:
Mode: MMAP. Checksum: bdd205e9. Time: 3.44 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.34 s
Mode: READALL. Checksum: bdd205e9. Time: 3.53 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.39 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.30 s
Mode: READALL. Checksum: bdd205e9. Time: 3.17 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.16 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.41 s
Mode: READALL. Checksum: bdd205e9. Time: 3.43 s

65536 byte rows (64 KiB):
Mode: MMAP. Checksum: da4f9d8d. Time: 3.25 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.27 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.34 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.36 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.44 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.18 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.19 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s

1048576 byte rows (1 MiB):
Mode: MMAP. Checksum: 22963df9. Time: 1.57 s
Mode: SEEK. Checksum: 22963df9. Time: 1.44 s
Mode: READALL. Checksum: 22963df9. Time: 3.13 s
Mode: MMAP. Checksum: 22963df9. Time: 1.59 s
Mode: SEEK. Checksum: 22963df9. Time: 1.43 s
Mode: READALL. Checksum: 22963df9. Time: 3.16 s
Mode: MMAP. Checksum: 22963df9. Time: 1.55 s
Mode: SEEK. Checksum: 22963df9. Time: 1.66 s
Mode: READALL. Checksum: 22963df9. Time: 3.15 s

And for comparison:

In [32]: a = np.memmap("src/bigfile", np.uint32, "r")
In [33]: time hex(np.sum(a[::1048576//4][:-1], dtype=a.dtype))
CPU times: user 0.00 s, sys: 0.01 s, total: 0.01 s
Wall time: 1.54 s
Out[34]: '0x22963df9L'

(Ubuntu Linux 2.6.38-13, traditional spinning-disk media)

So, in this test:
  For small rows: seeking is irrelevant; reading everything is just as
fast. (And the cutoff for "small" is not very small... I tried 512 KiB
rows too, and they behaved like the 32 KiB case.)
  For large rows: seeking is faster than reading everything. But mmap,
explicit seeks, and np.memmap all act the same.
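
(In case the attached C program doesn't survive the archive, here's a
rough Python sketch of the same three access patterns. This is not
the original read-by.c -- the checksum here is just an XOR of one
uint32 per simulated row, so the values won't match the tables above:)

import mmap
import os
import struct

FILENAME = "bigfile"   # ~200,000,000 bytes of arbitrary data
ROW_SIZE = 1048576     # simulated row size in bytes; vary as above

def checksum(mode):
    total = 0
    size = os.path.getsize(FILENAME)
    with open(FILENAME, "rb") as f:
        if mode == "SEEK":
            # Explicitly seek to the start of each row, read 4 bytes.
            for off in range(0, size - 4, ROW_SIZE):
                f.seek(off)
                total ^= struct.unpack("<I", f.read(4))[0]
        elif mode == "MMAP":
            # Map the file, read the same offsets through the mapping.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            for off in range(0, size - 4, ROW_SIZE):
                total ^= struct.unpack("<I", m[off:off + 4])[0]
        else:  # READALL: the naive read-everything baseline
            buf = f.read()
            for off in range(0, size - 4, ROW_SIZE):
                total ^= struct.unpack("<I", buf[off:off + 4])[0]
    return total

for mode in ("MMAP", "SEEK", "READALL"):
    print("Mode: %s. Checksum: %08x." % (mode, checksum(mode)))

(To get cold-cache numbers you'd also want to drop the page cache
between runs, e.g. by writing 3 to /proc/sys/vm/drop_caches as root.)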

I guess it's possible the difference you're seeing could just mean
that, like, Windows has a terrible VM subsystem, but that would be
weird.

> In the ascii case the gains for speed are less
> for the reasons you point out; you have to read through the data even to
> skip rows and fields.  Then it is really about memory.
>
>
> Even for binary, there are pathological cases, e.g. 1) reading a random
> subset of nearly all rows.  2) reading a single column when rows are
> small.  In case 2 you will only go this route in the first place if you
> need to save memory.  The user should be aware of these issues.

FWIW, this route actually doesn't save any memory as compared to np.memmap.

> I wrote this code to deal with a typical use case of mine where small
> subsets of binary or ascii data are being read from a huge file.  It
> solves that problem.

Cool.

>> If your rows are greater than 4K in size, then seeking will allow you
>> to avoid loading some parts of the file from disk... but it also might
>> defeat the OS's readahead heuristics, which means that instead of
>> doing a streaming read, you might be doing a disk seek for every
>> row. On an SSD, this is fine, and probably a nice win. On a
>> traditional spinning-disk, losing readahead will cause a huge
>> slowdown; you should only win if each row is like... a megabyte
>> apiece. Seeks are much much slower than continuous reads. So I'm
>> surprised if this actually helps! But that's just theory, so I am
>> curious to see the actual numbers.
>>
>> Re: memory savings -- it's definitely a win to avoid allocating the
>> whole array if you just want to read one column, but you can
>> accomplish that without any particular cleverness in the low-level
>> file reading code. You just read the first N rows into a fixed-size
>> temp buffer, copy out the relevant column into the output array,
>> repeat.
>
> Certainly this could also be done that way, and would be equally good
> for some cases.
>
> I've already got all the C code to do this stuff, so it's not much work
> for me to hack it into my numpy fork.  If it turns out there are cases
> that are faster using another method, we should root them out during
> testing and add the appropriate logic.

Cool. I'm just a little concerned that, since we seem to have like...
5 different implementations of this stuff all being worked on at the
same time, we need to get some consensus on which features actually
matter, so they can be melded together into the Single Best File
Reader Evar. An interface where indexing and file-reading are combined
is significantly more complicated than one where the core file-reading
inner-loop can ignore indexing. So far I'm not sure why this
complexity would be worthwhile, so that's what I'm trying to
understand.
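
(For concreteness, the fixed-size temp buffer approach I described
above needs nothing from the reader beyond plain sequential reads.
A minimal sketch, with a made-up file name and dtype:)

import os
import numpy as np

def read_column(filename, dtype, column, chunk_rows=8192):
    # Extract one field from a flat binary file of structured rows,
    # never holding more than chunk_rows full rows in memory.
    nrows = os.path.getsize(filename) // dtype.itemsize
    out = np.empty(nrows, dtype=dtype[column])
    with open(filename, "rb") as f:
        start = 0
        while start < nrows:
            n = min(chunk_rows, nrows - start)
            chunk = np.fromfile(f, dtype=dtype, count=n)  # temp buffer
            out[start:start + n] = chunk[column]
            start += n
    return out

# e.g. read_column("bigfile.bin",
#                  np.dtype([("x", "f8"), ("y", "f8")]), "y")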

Cheers,
-- Nathaniel

> Also, for some crazy ascii files we may want to revert to pure python
> anyway, but I think these should be special cases that can be flagged
> at runtime through keyword arguments to the python functions.
>
> BTW, did you mean to go off-list?
>
> cheers,
>
> -e
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: read-by.c
Type: text/x-csrc
Size: 3525 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120228/bf21a619/attachment.bin 

