[Numpy-discussion] Extent of unicode types in numpy

Francesc Altet faltet at carabos.com
Wed Feb 8 02:10:07 CST 2006


On Wednesday 08 February 2006 09:41, Travis Oliphant wrote:
> Hmm.  I think I'm beginning to like your idea.   We could in fact make

Good :-)

> the NumPy Unicode type always UCS4 and then keep the Python Unicode
> scalar.  On Python UCS2 builds the conversion would use UTF-16 to go to
> the Python scalar (which would always inherit from the native unicode
> type).

Yes, exactly.
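
For readers who have not dealt with surrogate pairs, here is a minimal
sketch of the UCS4 -> UTF-16 split that such a conversion implies (the
helper name is mine, not anything that exists in NumPy):

def ucs4_to_utf16(code_point):
    # Code points inside the BMP fit in a single UCS2 unit;
    # anything above U+FFFF becomes a surrogate pair.
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
    return [high, low]

# U+1D11E (musical G clef) needs two 16-bit units on a UCS2 build:
print([hex(u) for u in ucs4_to_utf16(0x1D11E)])  # ['0xd834', '0xdd1e']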

> But, all in all, it sounds like a good plan. If the time comes that
> somebody wants to add a reduced-size UCS2 array of unicode characters
> then we can cross that bridge if and when it comes up.

Well, given the recommendations about migrating to 32-bit unicode
objects, I'd say that this would be a strange desire. If the problem
is memory consumption, users can always choose regular 8-bit strings
(of course, without support for completely general unicode
characters).
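
Something like this shows the memory trade-off (itemsize is in bytes
per element; 'S' is the 8-bit string type, 'U' the UCS4 unicode type):

In [1]: numpy.array(["abc"], dtype="S3").dtype.itemsize
Out[1]: 3

In [2]: numpy.array([u"abc"], dtype="U3").dtype.itemsize
Out[2]: 12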

> I still like using explicit typecode characters in the array interface
> to denote UCS2 or the UCS4 data-type.  We could still change from 'W',
> 'w' to other characters...

But why do you want to do this? If the data type for unicode in arrays
is always UCS4 and in scalars is always determined by the Python
build, then why try to distinguish them with specific type codes? At
the C level there are straightforward ways to determine whether a
scalar is UCS2 or UCS4 (just look at the native Python type), while at
the Python level there is no obvious way to distinguish (correct me if
I'm wrong here) between a UCS2 and a UCS4 unicode string; in fact, the
user will not notice the difference in general (but see later).
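
As far as I know, the nearest thing at the Python level is checking
the interpreter build itself via sys.maxunicode, which tells you the
build width but nothing about a particular unicode string:

import sys

# 0xFFFF on a narrow (UCS2) build, 0x10FFFF on a wide (UCS4) build
if sys.maxunicode > 0xFFFF:
    print("wide (UCS4) build")
else:
    print("narrow (UCS2) build")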

Besides, using 'U' as the indicator for unicode is consistent with the
way Python expresses 32-bit unicode chars (i.e. \Uxxxxxxxx). So I find
that keeping 'U' for specifying unicode types would be more than
enough, and that introducing 'w' and 'W' (or whatever) would only add
unnecessary burden, IMO. Moreover, if a user checks the type using the
.dtype descriptor, he will find that the type continues to be 'U'
regardless of the build he is using. Something like:

# We are in a UCS2 interpreter
In [30]: numpy.array([1],dtype="U2")[0].dtype
Out[30]: dtype('<U4')

In [31]: numpy.array([1],dtype="U2")[0].dtype.char
Out[31]: 'U'

Of course, he would be able to notice that his unicode scalars are
smaller than the unicode in arrays, but only if he looks at the type
descriptor and notices that the extent of the type is shorter than
expected (4 instead of 8); apart from that, nothing else will be
different.
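
If he did want to check explicitly, comparing itemsizes on the same
UCS2 interpreter as above would show it (a sketch; on a UCS4 build
both numbers would be 8):

In [32]: numpy.array([1], dtype="U2").dtype.itemsize
Out[32]: 8

In [33]: numpy.array([1], dtype="U2")[0].dtype.itemsize
Out[33]: 4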

BTW, in order to penalize people as little as possible, it would be
nice if we could ask the Python developers to make UCS4 the default
build, just to avoid conversions between UCS4 <--> UCS2. I'm still
wondering why this is not the default... :-/

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"




