[SciPy-user] hash function on arrays

Robert Kern robert.kern@gmail....
Tue Oct 9 16:16:08 CDT 2007

Tom Johnson wrote:
> Hi,
> I need to do comparisons on a very large number of arrays.  I was
> thinking that it might speed things up if I first compared on a hash
> value...and then compared on those arrays with the same hash value (if
> needed)
> I am wondering:
> 1) about other methods for doing this...
> 2) how many collisions I can expect...(that is, will this technique be
> effective)
> 3) what is the hash function (specific to scipy arrays or just python's builtin)

numpy arrays do not have a hash function defined because they are mutable. You
can probably construct one suitable to your purpose if you are careful. If you
are looking for exact matches, then you could hash a tuple like so:

  ('numpy.ndarray', a.shape, a.dtype, a.strides, str(a.flags), buffer(a))

Note that this means that a0=arange(10) won't match a1=arange(10).astype(float)
although (a0 == a1).all() is True. If you do want to accept a0 and a1 as equal,
then you have a harder problem.

You will want to look at Objects/stringobject.c:string_hash() and
Objects/tupleobject.c:tuplehash() for the hash function implementations for
strings and tuples in order to determine how good this will be. Most likely,
you'll just have to try it and see.

Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco

More information about the SciPy-user mailing list