[Numpy-discussion] dtype comparison, hash

David Cournapeau cournape@gmail....
Mon Mar 5 21:07:30 CST 2012


On Tue, Jan 17, 2012 at 9:28 AM, Robert Kern <robert.kern@gmail.com> wrote:
> On Tue, Jan 17, 2012 at 05:11, Andreas Kloeckner
> <lists@informa.tiker.net> wrote:
>> Hi Robert,
>>
>> On Fri, 30 Dec 2011 20:05:14 +0000, Robert Kern <robert.kern@gmail.com> wrote:
>>> On Fri, Dec 30, 2011 at 18:57, Andreas Kloeckner
>>> <lists@informa.tiker.net> wrote:
>>> > Hi Robert,
>>> >
>>> > On Tue, 27 Dec 2011 10:17:41 +0000, Robert Kern <robert.kern@gmail.com> wrote:
>>> >> On Tue, Dec 27, 2011 at 01:22, Andreas Kloeckner
>>> >> <lists@informa.tiker.net> wrote:
>>> >> > Hi all,
>>> >> >
>>> >> > Two questions:
>>> >> >
>>> >> > - Are dtypes supposed to be comparable (i.e. implement '==', '!=')?
>>> >>
>>> >> Yes.
>>> >>
>>> >> > - Are dtypes supposed to be hashable?
>>> >>
>>> >> Yes, with caveats. Strictly speaking, we violate the condition that
>>> >> objects that equal each other should hash equal since we define == to
>>> >> be rather free. Namely,
>>> >>
>>> >>   np.dtype(x) == x
>>> >>
>>> >> for all objects x that can be converted to a dtype.
>>> >>
>>> >>   np.dtype(float) == np.dtype('float')
>>> >>   np.dtype(float) == float
>>> >>   np.dtype(float) == 'float'
>>> >>
>>> >> Since hash(float) != hash('float') we cannot implement
>>> >> np.dtype.__hash__() to follow the stricture that objects that compare
>>> >> equal should hash equal.
>>> >>
>>> >> However, if you restrict the domain of objects to just dtypes (i.e.
>>> >> only consider dicts that use only actual dtype objects as keys instead
>>> >> of arbitrary mixtures of objects), then the stricture is obeyed. This
>>> >> is a useful domain that is used internally in numpy.
>>> >>
>>> >> Is this the problem that you found?
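[Editorial illustration of the point above — a minimal sketch of the loose `==` versus `hash` behavior, assuming a stock NumPy install:]

```python
import numpy as np

# dtype equality is deliberately permissive: np.dtype(x) == x for any
# x that np.dtype() can convert.
assert np.dtype(float) == np.dtype('float')
assert np.dtype(float) == float
assert np.dtype(float) == 'float'

# float (the type) and 'float' (the string) do not hash alike, so
# np.dtype.__hash__ cannot satisfy "equal implies equal hash" over
# arbitrary convertible objects.
print(hash(float) == hash('float'))  # almost certainly False

# Restricted to actual dtype instances, the invariant does hold:
assert hash(np.dtype(float)) == hash(np.dtype('float'))
```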
>>> >
>>> > Thanks for the reply.
>>> >
>>> > It doesn't seem like this is our issue--instead, we're encountering two
>>> > different dtype objects that claim to be float64, compare as equal, but
>>> > don't hash to the same value.
>>> >
>>> > I've asked the user who encountered the issue to investigate, and I'll
>>> > be back with more detail in a bit.
>>>
>>> I think we've run into this before and tried to fix it. Try to find
>>> the version of numpy the user has and a minimal example, if you can.
>>
>> This is what Thomas found:
>>
>> http://projects.scipy.org/numpy/ticket/2017
>
> It looks like the .flags attribute is different between np.uintp and
> np.uint32. The .flags attribute forms part of the hashed information
> about the dtype (or PyArray_Descr at the C-level).
>
> [~]
> |15> np.dtype(np.uintp).flags
> 1536
>
> [~]
> |16> np.dtype(np.uint32).flags
> 2048
>
> The same goes for np.intp and np.int32 in numpy 1.6.1 on OS X, so
> unlike the comment in the ticket, they do have different hashes for
> me.
>
> However, diving through the source a bit, I'm not entirely sure I
> trust the values being given at the Python level. It appears that the
> flag member of the PyArray_Descr struct is declared as a char.
> However, it is exposed as a T_INT member in the PyMemberDef table by
> direct addressing. Basically, a Python descriptor gets added to the
> np.dtype type that will look up sizeof(long) bytes from the starting
> position of the flags member in the struct. This includes 3 bytes of
> the following type_num member. Obviously, 2048 does not fit into a
> char. Nonetheless, the type_num is also part of the hash, so either
> the flags member or the type_num member is different between the two.
>
> Two bugs for the price of one!

Good catch!

So basically, the flags member was changed from a char to an int and
back to a char, and some of the code did not follow.

I could not really reconstruct the exact history from the log alone, but basically:
  - there is indeed a char vs. int discrepancy (the member is declared
as a char but exposed as T_INT)
  - in most dtype functions handling the flags variable, temporary
computations were done with an int (even though every possible flag
combination fits in a char)
  - there are quite a few uses of "i" instead of "c" in
PyArg_ParseTuple and Py_BuildValue.

Even after fixing all of those, the original bug remains, because uintp
and uint32 have different type numbers, even on 32-bit builds. I would
actually consider this a bug in PyArray_EquivTypes, but changing that
now could be quite disruptive. Shall I remove type_num from the hash
input (in which case the bug would be fixed)?
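
[Editorial illustration: a small check mirroring the reported symptom. It pairs uintp with the fixed-width unsigned type of the same itemsize so it reflects the uintp/uint32 case on any platform; the hash mismatch itself only reproduces on the affected NumPy versions.]

```python
import numpy as np

# Pair uintp with the fixed-width unsigned type of the same itemsize
# (uint32 on 32-bit builds, uint64 on 64-bit ones).
p = np.dtype(np.uintp)
fixed = np.dtype('u%d' % p.itemsize)

# Equivalent layout, so the PyArray_EquivTypes-based == says True ...
print(p == fixed)

# ... even though the internal type numbers may differ:
print(p.num, fixed.num)

# The bug: type_num fed the hash, so these equal dtypes could hash
# differently. With the fix in place, equal dtypes hash equal:
print(hash(p) == hash(fixed))
```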

David


More information about the NumPy-Discussion mailing list