[Numpy-discussion] Extent of unicode types in numpy

Tim Hochberg tim.hochberg at cox.net
Mon Feb 6 17:14:11 CST 2006


Travis Oliphant wrote:

> Francesc Altet wrote:
>
>> Hi,
>>
>> I'm a bit surprised that unicode types are the only ones breaking the
>> rule: they have to be specified with a number different from the number
>> of bytes they actually take. For example:
>> [example snipped]
>>
>
> Right now, the array protocol typestring is a little ambiguous about 
> unicode characters.  Ideally, the array interface would describe what 
> kind of Unicode characters are being dealt with so that 2-byte and 
> 4-byte unicode characters have a different description in the typestring.
>
> Python can be compiled with Unicode as either 2-byte or 4-byte.    The 
> 'U#' descriptor is supposed to be the Python unicode data-type with # 
> representing the number of characters.   If this data-type is handed 
> off to a Python that is compiled with a different representation for 
> Unicode, then we have a problem.
>
> Right now, the typestring value gives the number of bytes in the 
> type.  Thus, "U4" gives dtype("<U8") on my system, where 
> sizeof(Py_UNICODE)==2, but on another system it could give dtype("<U16").
> I know only a little bit about unicode.  A full Unicode character fits 
> in 4 bytes (UTF-32), but there are standard 2-byte (UTF-16) and even 
> 1-byte (UTF-8) encodings.
>
> I changed the source so that "<U8" gets interpreted the same as "U4" 
> (i.e. if you specify an endianness then you are being byte-conscious 
> anyway, so the number is interpreted as a byte count; otherwise the 
> number is interpreted as a length in characters).  This fixes issues 
> on the same platform, but does not fix issues where data is saved out 
> with one Python interpreter and read in by another with a different 
> value of sizeof(Py_UNICODE).
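
For reference, the differing byte widths Travis mentions are easy to see 
with plain Python's standard codecs (nothing numpy-specific here; the 
widths shown are for a basic-plane character):

    >>> ch = u'\u00e9'               # LATIN SMALL LETTER E WITH ACUTE
    >>> len(ch.encode('utf-8'))      # UTF-8: 1-4 bytes per character
    2
    >>> len(ch.encode('utf-16-le'))  # UTF-16: 2 bytes (4 for surrogate pairs)
    2
    >>> len(ch.encode('utf-32-le'))  # UTF-32: always 4 bytes
    4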

This sounds like a mess. I'm not sure what the level of Unicode 
expertise is on this list (I certainly don't add to it), but I'd be 
tempted to raise this issue on PythonDev and see if anyone there has any 
good suggestions.
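
To make the ambiguity concrete, here is roughly what Travis's account 
amounts to in code (I'm going by his description of the patched 
behavior and can't check a narrow build here, so treat the narrow-build 
numbers as assumed):

    import numpy as np

    dt = np.dtype('U4')           # 'U4' counts unicode *characters*
    print(dt.str, dt.itemsize)    # wide build: '<U4', stored in 16 bytes
    # On a narrow build (sizeof(Py_UNICODE)==2) the same spec would take
    # 8 bytes and, per Travis, repr as dtype('<U8').  With an explicit
    # byte order -- e.g. np.dtype('<U8') -- his patch reads the number
    # as a byte count, so it names the same type as 'U4'.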

I'm way out of my depth here, but it really sounds like there needs to 
be one descriptor for each character width.  Just for example, "U" could 
be 2-byte unicode and "V" (assuming it's not taken already) could be 
4-byte unicode. Then the size for a given descriptor would be constant 
and things would be much less confusing.
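
A quick sketch of what I mean (the letters and the helper here are 
invented for illustration -- and I see that "V" is actually taken for 
void in numpy, so a real proposal would need some other letter, say "W"):

    # One descriptor per character width makes itemsize platform-independent.
    CHAR_WIDTH = {'U': 2, 'W': 4}   # hypothetical: 'U' = UCS-2, 'W' = UCS-4

    def itemsize(descr):
        """Byte size of a spec like 'U4' (4 two-byte characters)."""
        code, count = descr[0], int(descr[1:])
        return CHAR_WIDTH[code] * count

    assert itemsize('U4') == 8      # same on narrow and wide builds
    assert itemsize('W4') == 16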

-tim




