[Numpy-discussion] PyArray_Scalar() and Unicode

Pauli Virtanen pav@iki...
Sun Jun 13 07:55:22 CDT 2010


Sat, 12 Jun 2010 17:33:13 -0700, Dan Roberts wrote:
[clip: refactoring PyArray_Scalar]
>     There are a few problems with this.  The biggest problem for me is
> that it appears PyUCS2Buffer_FromUCS4() doesn't produce UCS2 at all, but
> rather UTF-16 since it produces surrogate pairs for code points above
> 0xFFFF.  My first question is: is there any time when the data produced
> by PyUCS2Buffer_FromUCS4() wouldn't be parseable by a standards-compliant
> UTF-16 decoder?

As far as I know, UTF-16 is UCS-2 plus surrogate pairs, so the data
produced should always be parseable by a UTF-16 decoder such as
PyUnicode_DecodeUTF16().
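
To illustrate the point, here is a minimal Python sketch of the surrogate
pair scheme. The helper name ucs4_to_utf16_units is hypothetical and is not
the NumPy routine; it only mimics what a UCS-4 to UTF-16 conversion does,
and it assumes the input contains no code points in the surrogate range
0xD800-0xDFFF. The result is handed to Python's own "utf-16-le" codec.

```python
import struct

def ucs4_to_utf16_units(code_points):
    """Hypothetical helper: encode code points as 16-bit UTF-16 units."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            # Outside the BMP: split the 20-bit offset into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            # Inside the BMP: the unit equals the code point (same as UCS-2).
            units.append(cp)
    return units

code_points = [0x41, 0x20AC, 0x1D11E]   # 'A', euro sign, musical G clef
units = ucs4_to_utf16_units(code_points)

# Pack as little-endian 16-bit units and decode with a standard decoder.
raw = struct.pack("<%dH" % len(units), *units)
decoded = raw.decode("utf-16-le")
```

The decoder accepts the buffer and returns the original three code points,
which is what one would expect if the output is valid UTF-16.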

Conversion to real UCS-2 from UCS-4 would be a lossy procedure, since not 
all code points can be represented with 2 bytes.
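
A rough Python sketch of why that would be lossy, assuming a hypothetical
converter that forces every code point into a single 16-bit unit and
substitutes U+FFFD for anything that does not fit (the substitution policy
is an assumption made only for this example):

```python
def ucs4_to_ucs2_lossy(code_points, replacement=0xFFFD):
    """Hypothetical helper: one 16-bit unit per code point, no surrogates."""
    return [cp if cp <= 0xFFFF else replacement for cp in code_points]

original = [0x41, 0x1D11E, 0x1F600]
ucs2 = ucs4_to_ucs2_lossy(original)

# Both non-BMP code points collapse to the same replacement value,
# so the original sequence cannot be recovered from the UCS-2 data.
```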

-- 
Pauli Virtanen


