[Numpy-discussion] Deserialized arrays with base mutate strings

Hrvoje Niksic hrvoje.niksic@avl....
Wed Sep 23 02:15:44 CDT 2009

Numpy arrays whose "base" attribute is set (that is, views) are
deserialized as arrays whose storage lives inside a Python string.  This
is a problem because such arrays are mutable, so writing to them mutates
an existing string.  Here is how to create one:

  >>> import numpy, cPickle as p
  >>> a = numpy.array([1, 2, 3])    # create an array
  >>> b = a[::-1]                   # create a view
  >>> b
array([3, 2, 1])
  >>> b.base                        # view's base is the original array
array([1, 2, 3])
  >>> c = p.loads(p.dumps(b, -1))   # roundtrip the view through pickle
  >>> c
array([3, 2, 1])
  >>> c.base                        # base is now a simple string:
  >>> s = c.base
  >>> s
  >>> type(s)
<type 'str'>
  >>> c[0] = 4                      # when the array is mutated...
  >>> s                             # ...the string changes value!

This is somewhat disconcerting, as Python strings are supposed to be
immutable.  In this case the string was created by numpy and is probably
not shared by anyone, so it doesn't present a problem in practice.  But
in corner cases it can lead to serious bugs.  CPython keeps a cache of
one-character strings, which cannot be turned off, so one-character
strings with the same value are normally one shared object.  This means
that one-byte array views can change existing Python strings used
elsewhere in the code.  For example:

  >>> a = numpy.array([65], 'int8')
  >>> b = a[::-1]
  >>> c = p.loads(p.dumps(b, -1))
  >>> c
array([65], dtype=int8)
  >>> c.base
'A'
  >>> c[0] = 66
  >>> c.base
'B'
  >>> 'A'
'B'

Note how changing a numpy array permanently changed the contents of all
'A' strings in this Python instance, rendering Python unusable.
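
The sharing that makes this corruption global can be observed without
numpy.  The following is only a sketch, written in Python 3 syntax rather
than the Python 2 used in the sessions above; same_object is a
hypothetical helper introduced for illustration, and the behaviour it
probes is a CPython implementation detail, not a language guarantee.  On
CPython the one-character case is expected to report a shared object,
while the longer string built at run time is not.

```python
def same_object(make):
    """Return True when two separate calls to make() yield one object."""
    first = make()
    second = make()
    return first is second

# chr(65) builds 'A' at run time, so compile-time sharing of literals
# is not what is being measured here.
one_char_shared = same_object(lambda: chr(65))

# A three-character string assembled at run time, for comparison.
three_char_shared = same_object(
    lambda: "".join(chr(65 + i) for i in range(3)))

print("one-character string shared:", one_char_shared)
print("three-character string shared:", three_char_shared)
```

Because every user of a one-character value holds the same object, a
write through an array that aliases that object's memory is visible to
all of them at once.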

The fix should be straightforward: use a string subclass (which will
skip the one-letter cache), or an entirely separate type for storage of
"base" memory referenced by deserialized arrays.

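Until something like that is done inside numpy, a caller can sidestep the
aliasing by copying the array right after unpickling.  This is a
caller-side workaround, not the fix proposed above, and it is only a
minimal sketch in Python 3 syntax; loads_owned is a hypothetical helper
name, not part of numpy or pickle.

```python
import pickle

import numpy


def loads_owned(data):
    """Unpickle an array and return a copy that owns its own memory.

    Hypothetical helper: the copy has no "base", so writing to it
    cannot reach whatever object backed the unpickled array.
    """
    arr = pickle.loads(data)
    return arr.copy()


a = numpy.array([65], 'int8')
b = a[::-1]                            # a view, as in the example above
c = loads_owned(pickle.dumps(b, -1))   # roundtrip, then detach

c[0] = 66                              # mutates only c's private buffer
print(c.base is None, c.flags.owndata)
```

The cost is one extra copy of the data per unpickled array, which may
matter for large arrays.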