[Numpy-discussion] Views of memmaps and offset

Nathaniel Smith njs@pobox....
Sun Sep 23 13:34:59 CDT 2012


On Sat, Sep 22, 2012 at 4:46 PM, Olivier Grisel
<olivier.grisel@ensta.org> wrote:
> There is also a third use case that is problematic on numpy master:
>
> orig = np.memmap('tmp.mmap', dtype=np.float64, shape=100, mode='w+')
> # negative markers to detect under / overflows
> orig[:] = np.arange(orig.shape[0]) * -1.0
>
> a = np.memmap('tmp.mmap', dtype=np.float64, shape=50, mode='r+', offset=16)
> a[:] = np.arange(50)
> b = np.asarray(a[10:])
>
> Now b does not even have a 'filename' attribute anymore. `b.base` is a
> Python mmap instance, but the latter is created with a file descriptor.
>
> It would still be possible to use:
>
> from _multiprocessing import address_of_buffer
>
> to find the memory address of the mmap buffer and use that to open new
> buffer views on the same memory segment from subprocesses using
> `numpy.frombuffer((ctypes.c_byte * n_byte).from_address(addr))` but in
> case of failure (e.g. the file has been deleted on the HDD) one gets a
> segmentation fault instead of a much more user-friendly, catchable
> file-not-found exception.

On Unix, if the processes are related in a way that lets this work,
then this would actually be a far better solution... it will always
refer to the same file that was opened in the parent, even if it has
since been deleted or renamed or replaced by a different file. (And if
they aren't related by fork(), then sending the fd would be better
than sending the filename, for the same reason.) Of course that
doesn't help for Windows; no idea what happens there.
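
To make the fd point concrete, here is a minimal Unix-only sketch using
just the standard library. The function name mapping_survives_unlink is
made up for illustration; it maps a temporary file through its
descriptor, deletes the file name, and then maps the same descriptor
again, roughly the way a process that inherited or received the fd
could.

```python
import mmap
import os
import tempfile


def mapping_survives_unlink(n_bytes=4096):
    """Hypothetical helper: show that an fd-based mapping outlives the file name (Unix)."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, n_bytes)          # give the file a size so it can be mapped
        first = mmap.mmap(fd, n_bytes)     # shared read/write mapping by default on Unix
        first[:5] = b"hello"

        os.unlink(path)                    # the name is gone, the open fd keeps the file alive

        # A second mapping made from the same descriptor still reaches the same data.
        second = mmap.mmap(fd, n_bytes)
        seen = bytes(second[:5])
        name_still_exists = os.path.exists(path)

        second.close()
        first.close()
        return seen, name_still_exists
    finally:
        os.close(fd)
```

Reopening by filename at the point of the second mmap call would fail
(or, worse, silently pick up a different file created under the same
name), which is the case the fd-based approach avoids.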

Numpy in general really does not provide any reliable way of tracking
the relationship between different views of the same buffer.
Introspecting on .base will work in many cases, but it's not
guaranteed to work, even in earlier versions. Maybe you don't care,
because it works well enough, but it's an inherently rickety design :-). Trying
to think of the correct solution here, I think it would have to be
something like... have the numpy mmap code keep a global scorecard of
all extant memory mappings -- filename, offset, length, memory
address. And then when you want to do an "mmap aware pickle", you
check the address of the array you're trying to save to see if it
falls into an mmap'ed region. That'd be simpler and more reliable than
anything involving base tracking.
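
A rough sketch of what such a scorecard could look like, in plain
Python. Nothing here exists in numpy; the names Mapping, register,
unregister and locate are all hypothetical, and the addresses in the
usage note are invented. For a real array, the start address could be
read from arr.__array_interface__['data'][0] and the size from
arr.nbytes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    """One live memory mapping (hypothetical record type)."""
    filename: str
    offset: int    # file offset of the first mapped byte
    length: int    # number of mapped bytes
    address: int   # memory address of the first mapped byte


_REGISTRY: List[Mapping] = []


def register(filename: str, offset: int, length: int, address: int) -> Mapping:
    """Record a mapping when it is created."""
    entry = Mapping(filename, offset, length, address)
    _REGISTRY.append(entry)
    return entry


def unregister(entry: Mapping) -> None:
    """Forget a mapping when it is closed."""
    _REGISTRY.remove(entry)


def locate(address: int, nbytes: int) -> Optional[Tuple[str, int]]:
    """Return (filename, file offset) if [address, address + nbytes) lies
    entirely inside one registered mapping, else None."""
    for entry in _REGISTRY:
        start = entry.address
        end = start + entry.length
        if start <= address and address + nbytes <= end:
            return entry.filename, entry.offset + (address - start)
    return None
```

With the example from the quoted message (50 float64 values mapped at
file offset 16, then a view starting at element 10), and pretending the
mapping sits at address 0x1000, locate(0x1000 + 80, 320) gives
('tmp.mmap', 96): the view starts 80 bytes into the mapping, so 96
bytes into the file. No .base introspection is needed for the lookup.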

-n

