[Numpy-discussion] Adding `offset` argument to np.lib.format.open_memmap and np.load
Jon Olav Vik
Tue Mar 1 17:06:51 CST 2011
Robert Kern <robert.kern <at> gmail.com> writes:
> >> It's up to the virtual memory manager, but usually, it will just load
> >> those pages (chunks the size of mmap.PAGESIZE) that are touched by
> >> your request and write them back.
> > What if two processes touch adjacent chunks that are smaller than a page? Is
> > there a risk that writing back an entire page will overwrite the efforts of
> > another process?
> I believe that there is only one page in main memory. Each process is
> simply pointed to the same page. As long as you don't write to the
> same specific byte, you'll be fine.
Within a single machine, that sounds fine. What about processes running on
different nodes, with different main memories?
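The single-machine case can be checked with a small sketch. It assumes that two independent np.memmap objects on the same file behave like two processes sharing one page. The file name and the 8-byte size are made up for illustration. It says nothing about processes on different nodes.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file: 8 bytes, far smaller than one memory page.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")
np.zeros(8, dtype=np.uint8).tofile(path)

# Two independent mappings of the same file stand in for two processes.
first = np.memmap(path, dtype=np.uint8, mode="r+")
second = np.memmap(path, dtype=np.uint8, mode="r+")

first[0] = 1   # "process A" writes byte 0
second[1] = 2  # "process B" writes the adjacent byte 1
first.flush()
second.flush()

# Read the file back through ordinary file I/O.
result = np.fromfile(path, dtype=np.uint8)
```

If both writes survive in result, neither write-back clobbered the neighbouring byte.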
> > Pardon me if I misunderstand, but isn't that what np.load does already,
> > without my modifications?
> With your modifications, the user does not get to see the header
> information before they pick the offset and shape. I contend that the
> user ought to read the shape information before deciding the shape to
Actually, that is what I've done for my own use (trivial parallelism, where I
know that axis 0 is "long" and suitable for dividing the workload): Read the
shape first, divide its first dimension into chunks with np.array_split(), then
memmap the portion I need. I didn't submit that function for inclusion because
it is rather specific to my own work. For process "ID" out of "NID", the code
is roughly as follows:
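import numpy as np
from numpy.lib.format import open_memmap

def memmap_chunk(filename, ID, NID, mode="r"):
    # offset and shape are the open_memmap arguments proposed in this thread
    r = open_memmap(filename, "r")
    n = r.shape[0]  # length of the "long" axis 0
    # Row indices assigned to process ID out of NID
    i = np.array_split(range(n), NID)[ID]
    if len(i) > 0:
        offset = i[0]
        shape = 1 + i[-1] - i[0]
        return open_memmap(filename, mode=mode, offset=offset, shape=shape)
    # This process was assigned no rows
    return np.empty(0, r.dtype)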
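For comparison, roughly the same "read the header, then map only some rows" idea can be sketched with functions that already exist in NumPy. This is not the patch discussed in this thread. The helper name memmap_rows is hypothetical, and the sketch assumes a C-ordered file in .npy format version 1.0.

```python
import numpy as np
from numpy.lib import format as npy_format


def memmap_rows(filename, start, stop, mode="r"):
    """Memory-map rows start:stop (axis 0) of a version-1.0 .npy file."""
    with open(filename, "rb") as fp:
        version = npy_format.read_magic(fp)
        if version != (1, 0):
            raise ValueError("this sketch only handles .npy format version 1.0")
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        data_start = fp.tell()  # byte position where the array data begins
    if fortran_order:
        raise ValueError("rows are not contiguous in Fortran-ordered files")
    if stop <= start:
        # An empty range cannot be mapped, so return an ordinary empty array.
        return np.empty((0,) + shape[1:], dtype)
    # Bytes occupied by one row along axis 0.
    row_bytes = dtype.itemsize * int(np.prod(shape[1:], dtype=np.int64))
    return np.memmap(filename, dtype=dtype, mode=mode,
                     offset=data_start + start * row_bytes,
                     shape=(stop - start,) + shape[1:])
```

Here the caller sees the shape from the header before choosing start and stop, and np.memmap takes its offset in bytes rather than in rows.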
> I don't think that changing the np.load() API is the best way to
> solve this problem.
I can agree with that. What I actually use is open_memmap() as shown above, but
couldn't have done it without offset and shape arguments.
In retrospect, changing np.load() was maybe a misstep in trying to generalize
from my own hacks to something that might be useful to others. I kind of added
offset and shape to np.load "for completeness", as it offers a mmap_mode
argument but no way to memory-map just a portion of a file.
So to attempt a summary: memory-mapping with np.load may be useful to conserve
memory in a single process (with no need for offset and shape arguments), but
splitting workload across multiple processes is best done with open_memmap.
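The single-process case needs nothing beyond the existing mmap_mode argument. A minimal sketch, assuming a small made-up example file in a temporary directory:

```python
import os
import tempfile

import numpy as np

# Write a small example array to a temporary .npy file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "example.npy")
np.save(path, np.arange(12, dtype=np.int64).reshape(6, 2))

# mmap_mode="r" maps the file read-only instead of reading it all into memory.
arr = np.load(path, mmap_mode="r")

# Slicing gives a view on the mapping; typically only the pages that are
# actually touched get read from disk.
part = arr[2:4]
total = int(part.sum())
```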
Then I humbly suggest that having offset and shape arguments to open_memmap is