[Numpy-discussion] Extent of unicode types in numpy
Francesc Altet
faltet at carabos.com
Mon Feb 6 10:25:07 CST 2006
Hi,
I'm a bit surprised that the unicode types are the only ones breaking
the rule that the number in a type code is the number of bytes the item
really takes. For example:
In [120]:numpy.dtype([('x','c16')])
Out[120]:dtype([('x', '<c16')])
In [121]:numpy.dtype([('x','S16')])
Out[121]:dtype([('x', '|S16')])
but:
In [119]:numpy.dtype([('x','U4')])
Out[119]:dtype([('x', '<U16')])
Even worse:
In [126]:numpy.dtype(numpy.dtype('u4').str)
Out[126]:dtype('<u4')
but:
In [125]:numpy.dtype(numpy.dtype('U4').str)
Out[125]:dtype('<U64') # !!!!
which can quickly lead to problems in users' code.
I think that, for the sake of consistency, and just as the user must
know that a c16 is a complex number taking 16 bytes, he should know that
a unicode character takes 4 bytes. With this, we would have:
In [119]:numpy.dtype([('x','U4')])
Out[119]:dtype([('x', '<U4')])
and forbid unicode lengths that are not a multiple of 4. I know that,
initially, it would be a bit strange for the user to specify 'S4' for a
string of 4 chars and 'U16' for a unicode string of 4 chars as well, but
hopefully he would soon get used to this.
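To make the proposed convention concrete, here is a minimal sketch in
plain Python. The helper names are hypothetical and not part of numpy,
and the sketch assumes that every unicode character takes exactly
4 bytes, which is the very point I'm unsure about:

```python
# Hypothetical helpers illustrating the proposed convention; they are
# NOT part of numpy. Assumption: every unicode character takes 4 bytes.
BYTES_PER_UNICODE_CHAR = 4


def unicode_code_for_chars(nchars):
    """Return the proposed type code for a unicode string of nchars chars."""
    if nchars < 0:
        raise ValueError("number of characters must be non-negative")
    # Under the proposal the number in the code counts bytes, not chars.
    return "U%d" % (nchars * BYTES_PER_UNICODE_CHAR)


def chars_in_unicode_code(code):
    """Return the number of characters described by a 'U<bytes>' code."""
    if not code.startswith("U"):
        raise ValueError("not a unicode type code: %r" % (code,))
    nbytes = int(code[1:])
    if nbytes < 0 or nbytes % BYTES_PER_UNICODE_CHAR != 0:
        # The proposal forbids lengths that are not a multiple of 4.
        raise ValueError("unicode length must be a non-negative multiple "
                         "of %d bytes, got %d"
                         % (BYTES_PER_UNICODE_CHAR, nbytes))
    return nbytes // BYTES_PER_UNICODE_CHAR


print(unicode_code_for_chars(4))     # U16
print(chars_in_unicode_code("U16"))  # 4
```

With these rules a code such as 'U6' would be rejected, and the code
printed for a type would always map back to the same type.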
The only problem I see with what I'm proposing is that I don't know
whether a unicode character would always take 4 bytes on all platforms
(--> 64-bit issues?). OTOH, I thought that Python represented unicode
strings internally with 16-bit chars. Oh well, I'm a bit lost on this.
Can anybody shed some light?
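To show the two candidate character widths side by side, the same text
can be encoded with fixed-width codecs from the Python standard
library. This is only a sketch of the sizes involved. It says nothing
about how a given Python build or numpy actually stores unicode
internally, and the variable names are made up for illustration:

```python
# Compare the storage needed with 16-bit and with 4-byte characters,
# using standard-library codecs. The "-le" variants are used so that
# no byte-order mark is prepended to the encoded output.
text = "abcd"  # 4 characters, all in the Basic Multilingual Plane

two_byte_form = text.encode("utf-16-le")   # 2 bytes per BMP character
four_byte_form = text.encode("utf-32-le")  # 4 bytes per character

print(len(two_byte_form))   # 8
print(len(four_byte_form))  # 16
```

With 4-byte characters a string of 4 chars needs 16 bytes, which is
where the 'U16' above comes from. With 16-bit characters it would need
only 8, so the multiple-of-4 rule would not hold on such a platform.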
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"