[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Robert Kern robert.kern@gmail....
Tue Mar 1 19:13:31 CST 2011


On Tue, Mar 1, 2011 at 18:40, Jon Olav Vik <jonovik@gmail.com> wrote:
> Robert Kern <robert.kern <at> gmail.com> writes:
>> > Within a single machine, that sounds fine. What about processes running on
>> > different nodes, with different main memories?
>>
>> You mean mmaping a file on a shared file system?
>
> Yes. GPFS, I believe, presumably this:
> http://en.wikipedia.org/wiki/GPFS
> Horrible latency on first access, but otherwise fast enough for my uses. I
> could have worked on local disk, copied them to my home directory, then
> consolidated the results, but the convenience of a single file appeals to my
> one-screenful attention span.
>
>> Then it's up to the file
>> system. I'm honestly not sure what would happen for your particular
>> file system. Try it and report back.
>>
>> In any case, using the offset won't help. The virtual memory manager
>> always deals with whole pages of size mmap.ALLOCATIONGRANULARITY aligned
>> with the start of the file. Under the covers, np.memmap() rounds the
>> offset down to the nearest page boundary and then readjusts the
>> pointer.
>
> I have had 1440 processes writing timing information to a numpy file with about
> 60000 records of (starttime, finishtime) without problem. Likewise, I've
> written large amounts of output, which was sanity-checked during analysis. I
> ought to have noticed any errors.
>
>> For performance reasons, I don't recommend doing it anyway. The
>> networked file system becomes the bottleneck, in my experience.
>
> What would you suggest instead? Using separate files is an option, but requires
> a final pass to collect data, or lots of code to navigate the results. Having a
> master node collect data and write them to file is cumbersome on our queueing
> system (it's easier to schedule many small jobs that can run whenever there is
> free time, than require a master and workers to run at the same time).
>
> I don't recall the exact numbers, but I have had several hundred processors
> running simultaneously, writing to the same numpy file on disk. It has been my
> impression that this is much faster than doing it through a single process. I
> was hoping to get the speed of writing separate files with the self-
> documentation of a single structured np.array on disk (open_memmap also saves
> me a few lines of code in writing output back to disk).
>
> That was before I learned about how the virtual memory manager enters into
> memory-mapping, though -- maybe I was just imagining things 8-/

Well, if it's working for you, that's great!
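
For reference, the offset rounding mentioned in the quoted text above
can be sketched as follows. This is only an illustration of the
arithmetic, not a copy of np.memmap()'s code; split_offset() is a
hypothetical helper name.

```python
import mmap


def split_offset(offset, granularity=mmap.ALLOCATIONGRANULARITY):
    """Split a byte offset into (aligned_start, remainder).

    Hypothetical helper: aligned_start is the largest multiple of
    `granularity` that is <= offset, which is where the mapping has to
    begin. `remainder` is how far the array pointer must then be moved
    forward inside the mapping to reach the requested offset.
    """
    aligned_start = offset - offset % granularity
    return aligned_start, offset - aligned_start


if __name__ == "__main__":
    # The granularity is platform dependent, so no output is shown here.
    print(split_offset(10000))
```

So two processes that ask for nearby offsets still end up mapping the
same underlying pages.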

>> > Then I humbly suggest that having offset and shape arguments to open_memmap
>> > is useful.
>>
>> I disagree. The important bit is to get the header information and the
>> data offset out of the file without loading any data. Once you have
>> that, np.memmap() suffices. You don't need to alter open_memmap()
>> at all.
>
> But if you're suggesting that the end user
> 1) use read_magic to check the version,
> 2) use read_array_header_1_0 to get shape, fortran_order, dtype,
> 3) call np.memmap with a suitable offset
> -- isn't that pretty much tantamount to duplicating np.lib.format.open_memmap?
>
> Actually, the function names read_array_header_1_0 and read_magic sound rather
> internal, not like something intended for an end-user. read_array_header_1_0
> seems to be used only by open_memmap and read_array.

That's not what I suggested. I suggested that 1+2 could be wrapped in
a single function read_header() that provides the header information
and the offset to the data.
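
A minimal sketch of what such a read_header() might look like,
assuming a version 1.0 NPY file. The function does not exist in
numpy.lib.format; its name and return value are only what is proposed
here. It relies on the file position sitting at the first data byte
once the header has been parsed.

```python
import numpy as np
from numpy.lib import format as npy_format


def read_header(filename):
    """Return (shape, fortran_order, dtype, offset) for an NPY file.

    Hypothetical helper: only the magic string and the header are read,
    so no array data is loaded or mapped.
    """
    with open(filename, "rb") as fp:
        version = npy_format.read_magic(fp)
        if version != (1, 0):
            # Other format versions need a different header reader;
            # this sketch only handles 1.0.
            raise ValueError("unsupported NPY version: %r" % (version,))
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        # The data starts right after the header.
        offset = fp.tell()
    return shape, fortran_order, dtype, offset
```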

> Given the somewhat
> confusing array (pardon the pun) of ways to load files in Numpy, np.load()
> might in fact be a reasonable place to centralize all the options...

np.load()'s primary purpose is to be the one *simple* way to access
NPY files, not the one place to expose every way to access NPY files.

>> In fact, if you do use open_memmap() to read the
>> information, then you can't implement your "64-bit-large file on a
>> 32-bit machine" use case.
>
> Do you mean that it could be done with np.memmap? Pardon me for being slow, but
> what is the crucial difference between np.memmap and open_memmap in this
> respect?

You cannot use open_memmap() to read the header information on the
32-bit system since it will also try to map the data portion of the
too-large file. If you can read the header information with something
that does not try to read the data, then you can select a smaller
shape that does fit into your 32-bit address space. It's not that
np.memmap() works where open_memmap() doesn't; it's that my putative
read_header() function would work where open_memmap() doesn't.

My point about np.memmap() is that once you have the header
information loaded, you don't need to use open_memmap() any more.
np.memmap() does everything you need from that point on. There is no
point in making open_memmap() or np.load() more flexible to support
this use case. You just need the read_header() function.
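
To make that concrete, here is a rough sketch of mapping only a block
of rows with plain np.memmap(). memmap_rows() is a hypothetical name,
and it assumes a version 1.0 file holding a C-ordered array without
object fields. The header is read inline here, which is the part a
read_header() function would take over.

```python
import numpy as np
from numpy.lib import format as npy_format


def memmap_rows(filename, start, stop, mode="r"):
    """Map rows [start, stop) of the array stored in an NPY file.

    Hypothetical helper: only the requested rows are mapped, so the
    mapping can be far smaller than the file itself.
    """
    with open(filename, "rb") as fp:
        if npy_format.read_magic(fp) != (1, 0):
            raise ValueError("this sketch only handles NPY version 1.0")
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        data_offset = fp.tell()  # first byte of array data
    if fortran_order or dtype.hasobject or not shape:
        raise ValueError("expected a C-ordered, non-object, >=1-d array")
    if not 0 <= start < stop <= shape[0]:
        raise ValueError("row range out of bounds")
    # Bytes taken by one row along the first axis.
    row_nbytes = dtype.itemsize * int(np.prod(shape[1:], dtype=np.int64))
    return np.memmap(
        filename,
        dtype=dtype,
        mode=mode,
        offset=data_offset + start * row_nbytes,
        shape=(stop - start,) + tuple(shape[1:]),
    )
```

On a 32-bit machine you would pick (stop - start) small enough that
the mapped block fits into the address space, and walk through the
file one block at a time.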

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

