[Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

Dag Sverre Seljebotn d.s.seljebotn@astro.uio...
Fri Mar 11 12:41:39 CST 2011


There are a few libraries out there that need to know whether or not an 
array changed since the last time it was used: joblib and pymc come to 
mind. I believe joblib computes a SHA1 or md5 hash of array contents, 
while pymc simply assumes you never change an array and uses the id().

The pymc approach is fragile, while in my case the joblib approach is 
too expensive since I'll call the function again many times in a row 
with the same large array (yes, I can code around it, but the code gets 
less streamlined).
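
To make the trade-off concrete, here is a rough sketch of the two 
strategies. It is not the actual joblib or pymc code; content_key and 
identity_key are names made up for illustration, and a bytearray stands 
in for a large array so that the example needs only the standard library.

```python
import hashlib


def content_key(buf):
    """Hash-based check: reads every byte, so it is O(n) on each call."""
    return hashlib.sha1(bytes(buf)).hexdigest()


def identity_key(buf):
    """id()-based check: O(1), but blind to in-place writes."""
    return id(buf)


data = bytearray(1_000_000)  # stand-in for a large array
before = (identity_key(data), content_key(data))

data[0] = 1  # in-place modification; the object identity stays the same
after = (identity_key(data), content_key(data))

id_detects_change = before[0] != after[0]    # the write goes unnoticed
hash_detects_change = before[1] != after[1]  # caught, at full-scan cost
```

The id() check misses the write entirely, while the hash check catches 
it only by rescanning the whole buffer on every call.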

So, would it be possible to very quickly detect whether a NumPy array is 
guaranteed to not have changed? Here's a revision counter approach:

  1) Introduce a new 64-bit int field "modification_count" in the array 
object struct.

  2) modification_count is incremented any time it is possible that an 
array changes. In particular, PyArray_DATA would increment the counter.

  3) A new PyArray_READONLYDATA is introduced that does not increment 
the counter, which can be used in strategic spots. However, the point is 
simply to rule out *most* sources of having to recompute a checksum for 
the array -- a non-matching modification_count is not a guarantee the 
array has changed, but a matching modification_count is a guarantee of 
an unchanged array.

  4) The counter can be ignored for readonly (base) arrays.

  5a) A method is introduced Python-side,  
arr.checksum(algorithm="md5"|"sha1"), that uses this machinery to cache 
checksum computation and that can be plugged into joblib.

  5b) Alternatively, the modification count is exposed directly to 
Python-side, and it is up to users to store the modification count (e.g. 
in a WeakKeyDictionary indexed by the array's base array).
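
The steps above can be modelled in pure Python. This is only a sketch 
under assumptions: CountedBuffer, data(), readonly_data() and 
cached_checksum() are hypothetical names standing in for the array 
struct, PyArray_DATA, PyArray_READONLYDATA and the checksum method; none 
of them exist in NumPy. The cache follows 5b and lives outside the 
object, in a WeakKeyDictionary.

```python
import hashlib
import weakref


class CountedBuffer:
    """Hypothetical stand-in for an array carrying a modification_count."""

    def __init__(self, nbytes, readonly=False):
        self._data = bytearray(nbytes)
        self.readonly = readonly
        self.modification_count = 0

    def data(self):
        """Plays the role of PyArray_DATA: writable access bumps the counter."""
        if self.readonly:
            raise ValueError("buffer is read-only")
        self.modification_count += 1
        return self._data

    def readonly_data(self):
        """Plays the role of PyArray_READONLYDATA: no bump."""
        return memoryview(self._data).toreadonly()


# Maps buffer -> {algorithm: (count when hashed, digest)}.
_checksums = weakref.WeakKeyDictionary()
recomputed = []  # records every real hash computation, for the demo below


def cached_checksum(buf, algorithm="sha1"):
    per_buffer = _checksums.setdefault(buf, {})
    hit = per_buffer.get(algorithm)
    # A matching count guarantees unchanged contents; for read-only
    # buffers the counter can be ignored altogether (step 4).
    if hit is not None and (buf.readonly or hit[0] == buf.modification_count):
        return hit[1]
    digest = hashlib.new(algorithm, buf.readonly_data()).hexdigest()
    recomputed.append(algorithm)
    per_buffer[algorithm] = (buf.modification_count, digest)
    return digest


buf = CountedBuffer(1024)
first = cached_checksum(buf)
second = cached_checksum(buf)  # counter matches: served from the cache
buf.data()[0] = 1              # writable access, followed by a real write
third = cached_checksum(buf)   # counter differs: recomputed
buf.data()                     # writable access without any write
fourth = cached_checksum(buf)  # recomputed anyway, same digest as before
```

The last two lines show the conservative side of the scheme: a bumped 
counter only forces a recomputation, it does not prove that the contents 
changed.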

Another solution to the problem would be to allow registering event 
handlers. The main reasons I'm not proposing that are that I don't want 
to spend the time to implement it (it sounds a lot more difficult), that 
it appears to be considerably less backwards-compatible, and so on.

Why not a simple dirty flag? Because you'd need one for every possible 
application of this (e.g., md5 and sha1 would need separate dirty flags, 
and other uses than hashing would need yet more flags, and so on).
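
A small sketch of why a single shared flag breaks down with more than 
one consumer, while a counter does not. All class and function names 
here are invented for illustration; write() stands for any operation 
that may modify the array.

```python
class FlaggedArray:
    """Array with one shared dirty flag."""

    def __init__(self):
        self.dirty = True

    def write(self):
        self.dirty = True


class CountedArray:
    """Array with a modification counter."""

    def __init__(self):
        self.modification_count = 0

    def write(self):
        self.modification_count += 1


def flag_needs_update(arr):
    """A consumer checks the shared flag and clears it."""
    stale = arr.dirty
    arr.dirty = False
    return stale


class CounterConsumer:
    """Each consumer privately remembers the last count it saw."""

    def __init__(self):
        self.last_seen = None

    def needs_update(self, arr):
        stale = self.last_seen != arr.modification_count
        self.last_seen = arr.modification_count
        return stale


flagged = FlaggedArray()
flagged.write()
md5_sees_flag = flag_needs_update(flagged)   # first consumer clears the flag
sha1_sees_flag = flag_needs_update(flagged)  # second consumer misses the write

counted = CountedArray()
md5_user, sha1_user = CounterConsumer(), CounterConsumer()
counted.write()
md5_sees_count = md5_user.needs_update(counted)
sha1_sees_count = sha1_user.needs_update(counted)
sha1_sees_again = sha1_user.needs_update(counted)  # no write since last check
```

With the flag, whichever consumer looks first hides the change from the 
other; with the counter, the per-consumer state lives with the consumer, 
so the array needs only one field no matter how many users there are.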

Dag Sverre

