[Numpy-discussion] Possible inconsistency in enumerated type mapping
Travis Oliphant
oliphant.travis at ieee.org
Wed Sep 20 05:48:28 CDT 2006
Francesc Altet wrote:
> Hi,
>
> I'm sending a message here because discussing this in the bug tracker is
> not very comfortable. This is my last try before giving up, so don't be
> afraid ;-)
>
> In bug #283 (http://projects.scipy.org/scipy/numpy/ticket/283) I complained
> about the fact that a numpy.int32 is being mapped in NumPy to NPY_LONG
> enumerated type, and I think I failed to explain well why I think this is a
> bad thing. Now I'll try to present a (real-life) example, in the hope that
> things will become clearer.
>
> Imagine that you are coding a C extension that receives NumPy arrays and
> saves them on disk for later retrieval. Imagine also that a user is running
> your extension on a 32-bit platform. If she passes this extension an array
> of type 'int32', and the extension reads the enumerated type (using
> array.dtype.num), it will get NPY_LONG. So the extension uses this code
> (NPY_LONG) to save the type (together with the data) on disk. Now she sends
> this data file to a teammate who works on a 64-bit machine and tries to read
> the data using the same extension. The extension would see that the data is
> of NPY_LONG type and would try to deserialize it, interpreting the data
> elements as 64-bit integers (this is the size of an NPY_LONG on 64-bit
> platforms), and this is clearly wrong.
>
>
In my view, this "real-life" example points to a flaw in the coding
design that will not be fixed by altering what numpy.int32 maps to under
the covers. It is wrong to use a code for the platform C data-type
(NPY_LONG) as a key to understand data written to disk. This is and
always has been a bad idea. No matter what we do with numpy.int32 this
can cause problems. Just because a lot of platforms think an int is
32 bits does not mean all of them do. C gives you no such guarantee.

Notice that pickling of NumPy arrays does not store the "enumerated
type" as the code. Instead it stores the data-type object (which itself
pickles using the kind and element size so that the correct data-type
object can be reconstructed on the other end, if it is available at all).

Thus, you should not be storing the enumerated type but instead
something like the kind and element-size.
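
As a minimal sketch of that idea (not the code NumPy's pickling actually
uses), the Python snippet below records a data type as a (kind, element
size) pair and rebuilds it from that pair. The helper names describe_dtype
and rebuild_dtype are hypothetical, byte order is ignored, and the approach
is assumed to be applied to numeric types only.

```python
import numpy as np


def describe_dtype(dtype):
    """Return a platform-independent (kind, itemsize) tag for a numeric dtype."""
    dt = np.dtype(dtype)
    # kind is a single character such as 'i' (signed int) or 'f' (float);
    # itemsize is the element size in bytes.
    return (dt.kind, dt.itemsize)


def rebuild_dtype(kind, itemsize):
    """Rebuild a dtype from the tag produced by describe_dtype.

    Assumption: only numeric kinds are passed in, so that a string such as
    'i4' unambiguously means a 4-byte signed integer.
    """
    return np.dtype(f"{kind}{itemsize}")


arr = np.arange(5, dtype=np.int32)
tag = describe_dtype(arr.dtype)      # the value one would write to disk
restored = rebuild_dtype(*tag)       # the value recovered on another machine
print(tag, restored)
```

Because the tag carries the element size explicitly, a reader on another
platform does not need to know which C type the writer's NumPy happened to
associate with int32.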
> Besides this, if your C extension uses a C library that is meant to save
> data in a platform-independent way (say, HDF5), then having an NPY_LONG
> will not automatically tell you which C library datatype it maps to, because
> the library only has datatypes that are of a definite size on all platforms.
> So, this is a second problem.
>
>
Making sure you get the correct data-type is why there are NPY_INT32 and
NPY_INT64 enumerated types. You can't code using NPY_LONG and expect
it will give you the same sizes when moving between 32-bit and 64-bit
platforms. That's a problem that has been fixed with the bit-width
types. I don't understand why you are using the enumerated types at all
in this circumstance.
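
To illustrate the difference between platform C types and bit-width types,
here is a small sketch that uses Python's standard ctypes module as a
stand-in for the C compiler's view of these types. The dictionary name
c_type_sizes is hypothetical. The sizes reported for int and long depend on
the platform the snippet runs on, while the fixed-width types do not.

```python
import ctypes

# Sizes in bytes, as seen by the C compiler that built this Python.
c_type_sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "long long": ctypes.sizeof(ctypes.c_longlong),
    # Fixed-width types: the same size on every platform.
    "int32_t": ctypes.sizeof(ctypes.c_int32),
    "int64_t": ctypes.sizeof(ctypes.c_int64),
}

for name, nbytes in c_type_sizes.items():
    print(f"{name:10s} {nbytes} bytes")
```

Only the last two rows are safe to rely on when data moves between
machines, which is the role the NPY_INT32 and NPY_INT64 names play in C.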
> Of course there are workarounds for this, but my impression is that they can
> be avoided with a more sensible mapping between NumPy Python types and NumPy
> enumerated types, like:
>
> numpy.int32 --> NPY_INT
> numpy.int64 --> NPY_LONGLONG
> numpy.int_ --> NPY_LONG
>
> on all platforms, avoiding the current situation of ambiguous mapping between
> platforms.
>
The problem is that C gives us this ambiguous mapping. You are asking
us to pretend it isn't there because it "simplifies" a hypothetical case
so that poor coding practice can be allowed to work in a special case.
I'm not convinced.

Your proposed mapping perpetuates the myth that C data-types have a defined
length. This is not guaranteed. The current system defines data-types with a
guaranteed length. Yes, there is ambiguity as to which is "the" underlying C type
on certain platforms, but if you are running into trouble with the
difference, then you need to change how you are coding because you would
run into trouble on some combination of platforms even if we made the
change.

Basically, you are asking to make a major change, and at this point I'm
very hesitant to make such a change without a clear and pressing need
for it. Your hypothetical example does not rise to the level of "clear
and pressing need." In fact, I see your proposal as a step backwards.

Now, it is true that we could change the default type that gets first
grab at int32 to be int (instead of the current long); I could see
arguments for that. But since the choice is ambiguous and the Python
integer type is the C type long, I let long get first dibs on everything,
as this seemed to work better for code I was wrapping in the past. I
don't see any point in changing this choice now and risking code breakage,
especially when your argument is that it would let users think that a C
int is always 32 bits.
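
The ambiguity described here can be inspected from Python. The sketch below
assumes a NumPy installation and uses the character codes 'i', 'l' and 'q'
for the C types int, long and long long; the variable names are
illustrative. It lists which of those C types has the same layout as int32
on the current platform, and which enumerated type number int32 reports.
The output differs between platforms and NumPy builds, which is the point.

```python
import numpy as np

# NumPy character codes for the platform C integer types.
c_types = {"i": "int", "l": "long", "q": "long long"}
int32 = np.dtype(np.int32)

# More than one C type can be equivalent to int32 on the same platform.
matches = [name for code, name in c_types.items() if np.dtype(code) == int32]

print("C types equivalent to int32 here:", matches)
print("enumerated type number of int32 here:", int32.num)
print("element size of int32 in bytes:", int32.itemsize)
```

Only the element size is fixed; which C type "wins" the int32 name is a
choice the library has to make wherever more than one candidate exists.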
Best regards,
-Travis