[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load

Jon Olav Vik jonovik@gmail....
Tue Mar 1 18:40:17 CST 2011


Robert Kern <robert.kern <at> gmail.com> writes:
> > Within a single machine, that sounds fine. What about processes running on
> > different nodes, with different main memories?
> 
> You mean mmaping a file on a shared file system?

Yes. GPFS, I believe -- presumably this:
http://en.wikipedia.org/wiki/GPFS
Horrible latency on first access, but otherwise fast enough for my uses. I
could have written to local disk, copied the files to my home directory, and
then consolidated the results, but the convenience of a single file appeals
to my one-screenful attention span.

> Then it's up the file
> system. I'm honestly not sure what would happen for your particular
> file system. Try it and report back.
> 
> In any case, using the offset won't help. The virtual memory manager
> always deals with whole pages of size mmap.ALLOCATIONGRANULARITY,
> aligned with the start of the file. Under the covers, np.memmap()
> rounds the offset down to the nearest page boundary and then readjusts
> the pointer.
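
For concreteness, I take the rounding to mean something like this (the
offset value is just an example):

    import mmap

    offset = 123456                  # desired byte offset into the file
    granularity = mmap.ALLOCATIONGRANULARITY   # page/allocation size, e.g. 4096
    aligned = (offset // granularity) * granularity  # where the mapping starts
    skip = offset - aligned          # bytes np.memmap skips inside the mapping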

I have had 1440 processes writing timing information to a single numpy file
with about 60000 records of (starttime, finishtime), without problems.
Likewise, I've written large amounts of output that were sanity-checked
during analysis; I ought to have noticed any errors.
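
For the record, the pattern was essentially this sketch (the file name,
dtype layout and the worker's rank are illustrative, not my exact code):

    import numpy as np

    dtype = np.dtype([("starttime", "f8"), ("finishtime", "f8")])

    # Done once, up front: allocate the full file on the shared filesystem.
    np.lib.format.open_memmap("timings.npy", mode="w+",
                              dtype=dtype, shape=(60000,))

    # In each worker: open the existing file, write only this worker's slot.
    rank = 0  # e.g. from the queueing system's task id
    timings = np.lib.format.open_memmap("timings.npy", mode="r+")
    timings[rank] = (1298934000.0, 1298934017.5)
    timings.flush()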

> For performance reasons, I don't recommend doing it anyway. The
> networked file system becomes the bottleneck, in my experience.

What would you suggest instead? Using separate files is an option, but it
requires either a final pass to collect the data or lots of code to navigate
the results. Having a master node collect the data and write it to file is
cumbersome on our queueing system (it's easier to schedule many small jobs
that can run whenever there is free time than to require a master and
workers to run at the same time).

I don't recall the exact numbers, but I have had several hundred processes
running simultaneously, writing to the same numpy file on disk. My impression
has been that this is much faster than funnelling everything through a single
process. I was hoping to combine the speed of writing separate files with the
self-documentation of a single structured np.array on disk (open_memmap also
saves me a few lines of code when writing output back to disk).

That was before I learned about how the virtual memory manager enters into 
memory-mapping, though -- maybe I was just imagining things 8-/

> > Then I humbly suggest that having offset and shape arguments to
> > open_memmap is useful.
> 
> I disagree. The important bit is to get the header information and the
> data offset out of the file without loading any data. Once you have
> that, np.memmap() suffices. You don't need to alter
> np.lib.format.open_memmap() at all.

But if you're suggesting that the end user
1) use read_magic to check the version,
2) use read_array_header_1_0 to get shape, fortran_order and dtype,
3) call np.memmap with a suitable offset
-- isn't that tantamount to duplicating np.lib.format.open_memmap?
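
Here is a sketch of those three steps as I understand them (the file name
is just an example):

    import numpy as np
    from numpy.lib import format as npformat

    with open("results.npy", "rb") as f:
        version = npformat.read_magic(f)       # 1) check the format version
        assert version == (1, 0)
        shape, fortran_order, dtype = npformat.read_array_header_1_0(f)  # 2)
        offset = f.tell()                      # data begins after the header

    # 3) map the data, with a suitable offset (and possibly only a window)
    data = np.memmap("results.npy", dtype=dtype, mode="r", offset=offset,
                     shape=shape, order="F" if fortran_order else "C")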

Actually, the function names read_array_header_1_0 and read_magic sound
rather internal, not like something intended for an end user.
read_array_header_1_0 seems to be used only by open_memmap and read_array.
Given the somewhat confusing array (pardon the pun) of ways to load files in
NumPy, np.load() might in fact be a reasonable place to centralize all the
options...
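
For what it's worth, np.load() already exposes memory-mapping through
mmap_mode; an offset/shape pair like the one proposed here could perhaps sit
alongside it (the second call is hypothetical, not current NumPy):

    import numpy as np

    arr = np.load("results.npy", mmap_mode="r")   # already returns a np.memmap
    # hypothetical extension:
    # arr = np.load("results.npy", mmap_mode="r", offset=1000, shape=(500,))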

> In fact, if you do use np.lib.format.open_memmap() to read the
> information, then you can't implement your "64-bit-large file on a
> 32-bit machine" use case.

Do you mean that it could be done with np.memmap? Pardon me for being slow, but 
what is the crucial difference between np.memmap and open_memmap in this 
respect?
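
If it helps, here is my guess at what you mean: use np.memmap to map only a
window of a huge file rather than the whole array. All numbers here are
illustrative:

    import numpy as np

    dtype = np.dtype("f8")
    header_offset = 80                 # data offset taken from the header
    start, nrows = 500000000, 1000     # the slice this process needs
    window = np.memmap("huge.npy", dtype=dtype, mode="r",
                       offset=header_offset + start * dtype.itemsize,
                       shape=(nrows,))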


