[Numpy-discussion] Possible inconsistency in enumerated type mapping

Travis Oliphant oliphant.travis at ieee.org
Wed Sep 20 05:48:28 CDT 2006


Francesc Altet wrote:
> Hi,
>
> I'm sending a message here because discussing this in the bug tracker is 
> not very comfortable. This is my last try before giving up, so don't be 
> afraid ;-)
>
> In bug #283 (http://projects.scipy.org/scipy/numpy/ticket/283) I complained 
> about the fact that numpy.int32 is mapped in NumPy to the NPY_LONG 
> enumerated type, and I think I failed to explain well why this is a 
> bad thing. Now I'll try to present a (real-life) example, in the hope that 
> things will become clearer.
>
> Imagine that you are writing a C extension that receives NumPy arrays and 
> saves them on disk for later retrieval. Imagine also that a user is running 
> your extension on a 32-bit platform. If she passes an array of type 'int32' 
> to this extension, and the extension reads the enumerated type (using 
> array.dtype.num), it will get NPY_LONG. So the extension uses this code 
> (NPY_LONG) to save the type (together with the data) on disk. Now she sends 
> this data file to a teammate who works on a 64-bit machine and tries to 
> read the data using the same extension. The extension would see that the 
> data is of NPY_LONG type and would try to deserialize it, interpreting the 
> data elements as 64-bit integers (the size of an NPY_LONG on 64-bit 
> platforms), and this is clearly wrong.
>
>   

In my view, this "real-life" example points to a flaw in the coding 
design that will not be fixed by altering what numpy.int32 maps to under 
the covers. It is wrong to use a code for the platform C data-type 
(NPY_LONG) as a key to understand data written to disk. This is and 
always has been a bad idea. No matter what we do with numpy.int32, this 
can cause problems. Just because a lot of platforms think an int is 
32 bits does not mean all of them do. C gives you no such guarantee.
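
As a minimal Python sketch (the exact number printed is platform-dependent, 
which is precisely the problem):

    import numpy as np

    a = np.zeros(3, dtype=np.int32)

    # The enumerated code reports whichever C type happens to back int32
    # (e.g. NPY_LONG on a 32-bit platform, NPY_INT on a typical 64-bit
    # Linux box), so it is not a portable description of the data.
    print(a.dtype.num)

    # kind and itemsize describe the data itself and are the same
    # everywhere: 'i' (signed integer) and 4 bytes.
    print(a.dtype.kind, a.dtype.itemsize)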

Notice that pickling of NumPy arrays does not store the enumerated type 
code. Instead it stores the data-type object (which itself pickles using 
the kind and element size so that the correct data-type object can be 
reconstructed on the other end --- if it is available at all).
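
A minimal sketch of that round-trip:

    import pickle
    import numpy as np

    dt = np.dtype(np.int32)
    restored = pickle.loads(pickle.dumps(dt))

    # No enumerated code is stored; the dtype is rebuilt from its kind and
    # element size, so it still means "4-byte signed integer" on the other
    # end, whatever C type backs it there.
    assert restored == dt
    print(restored.kind, restored.itemsize)   # i 4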

Thus, you should not be storing the enumerated type but instead 
something like the kind and element-size.  
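
A minimal sketch of that approach, using hypothetical helper names 
describe and reconstruct: record the kind, element size, and an explicit 
byte order, and rebuild the dtype from those when reading.

    import sys
    import numpy as np

    def describe(dtype):
        # Portable description: kind ('i', 'u', 'f', ...), element size in
        # bytes, and an explicit byte order -- nothing tied to a platform
        # C type such as NPY_LONG.
        byteorder = dtype.byteorder
        if byteorder in ('=', '|'):
            # Resolve "native"/"not applicable" to the writer's real order.
            byteorder = '<' if sys.byteorder == 'little' else '>'
        return dtype.kind, dtype.itemsize, byteorder

    def reconstruct(kind, itemsize, byteorder):
        # np.dtype understands strings like '<i4': little-endian, signed
        # integer, 4 bytes.
        return np.dtype('%s%s%d' % (byteorder, kind, itemsize))

    desc = describe(np.dtype(np.int32))
    print(desc)                 # e.g. ('i', 4, '<')
    print(reconstruct(*desc))   # int32, on any platform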

> Besides this, if your C extension uses a C library that is meant to save 
> data in a platform-independent format (say, HDF5), then having an NPY_LONG 
> will not automatically tell you which C library datatype it maps to, 
> because such a library only has datatypes with a definite size on all 
> platforms. So this is a second problem.
>
>   
Making sure you get the correct data-type is why there are NPY_INT32 and 
NPY_INT64 enumerated types. You can't code using NPY_LONG and expect it 
to give you the same sizes when moving between 32-bit and 64-bit 
platforms. That's a problem that has been fixed with the bitwidth types. 
I don't understand why you are using the enumerated types at all in this 
circumstance.
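
The same distinction, sketched at the Python level (the second result is 
platform-dependent): compare against a fixed-width dtype rather than 
against the code of a platform C type.

    import numpy as np

    a = np.zeros(3, dtype='int32')

    # Portable question: "is this a 4-byte signed integer?"
    print(a.dtype == np.dtype(np.int32))      # True everywhere

    # Non-portable question: "is this backed by the C type long?"
    # True on a 32-bit platform, typically False on 64-bit Linux.
    print(a.dtype.num == np.dtype('l').num)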


> Of course there are workarounds for this, but my impression is that they can 
> be avoided with a more sensible mapping between NumPy Python types and NumPy 
> enumerated types, like:
>
> numpy.int32 --> NPY_INT
> numpy.int64 --> NPY_LONGLONG
> numpy.int_  --> NPY_LONG
>
> on all platforms, avoiding the current ambiguous mapping between 
> platforms.
>   

The problem is that C gives us this ambiguous mapping.  You are asking 
us to pretend it isn't there because it "simplifies" a hypothetical case 
so that poor coding practice can be allowed to work in a special case.  
I'm not convinced.

This perpetuates the myth that C data-types have a defined length. This 
is not guaranteed. The current system defines data-types with a 
guaranteed length. Yes, there is ambiguity as to which is "the" 
underlying C type on certain platforms, but if you are running into 
trouble with the difference, then you need to change how you are coding, 
because you would run into trouble on some combination of platforms even 
if we made the change.
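
A sketch of that ambiguity, assuming an LP64 platform such as 64-bit 
Linux, where long and long long are both 8 bytes:

    import numpy as np

    l = np.dtype('l')   # C long
    q = np.dtype('q')   # C long long

    print(l.num, q.num)            # two different enumerated codes
    print(l.itemsize, q.itemsize)  # both 8 on LP64
    print(l == q)                  # True there: equivalent dtypes,
                                   # different codes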

Basically, you are asking to make a major change, and at this point I'm 
very hesitant to make such a change without a clear and pressing need 
for it.  Your hypothetical example does not rise to the level of "clear 
and pressing need."  In fact, I see your proposal as a step backwards. 

Now, it is true that we could change the default type that gets first 
grab at int32 to be int (instead of the current long) --- I could see 
arguments for that. But since the choice is ambiguous and the Python 
integer type is the C type long, I let long get first dibs on everything, 
as this seemed to work better for code I was wrapping in the past. I 
don't see any point in changing this choice now and risking code 
breakage, especially when your argument is that it would let users think 
that a C int is always 32 bits.
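
A rough sketch of that "first dibs" rule, assuming a NumPy of this era 
where numpy.int_ is the C type long:

    import numpy as np

    # Arrays built from Python ints get the C type long, and therefore a
    # platform-dependent width.
    a = np.array([1, 2, 3])
    print(a.dtype.char)                # 'l'
    print(np.dtype(np.int_).itemsize)  # 4 on a 32-bit platform, 8 on LP64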


Best regards,

-Travis






