[Numpy-discussion] Extent of unicode types in numpy

Gerard Vermeulen gerard.vermeulen at grenoble.cnrs.fr
Tue Feb 7 10:50:04 CST 2006


On Tue, 7 Feb 2006 15:42:34 +0100
Francesc Altet <faltet at carabos.com> wrote:

> On Tuesday 07 February 2006 08:16, Travis Oliphant wrote:
> > In current SVN, numpy assumes 'w' is 2-byte unicode and 'W' is 4-byte
> > unicode in the array interface typestring.   Right now these codes
> > require that the number of bytes be specified explicitly (to satisfy the
> > array interface requirement).   There is still only 1 Unicode data-type
> > on the platform and it has the size of Python's Py_UNICODE type.  The
> > character 'U' continues to be useful on data-type construction to stand
> > for a unicode string of a specific character length. Its internal dtype
> > representation will use 'w' or 'W' depending on how Python was compiled.
> >
> > This may not solve all issues, but at least it's a bit more consistent
> > and solves the problem of
> >
> > dtype(dtype('U8').str) not producing the same datatype.
> >
> > It also solves the problem of unicode written out with one compilation
> > of Python and attempted to be written in with another (it won't let you
> > because only one of 'w#' or 'W#' is supported on a platform).
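> >
> > For instance, a hypothetical session on a UCS4 build (the exact
> > reprs here are an assumption, not verified output):
> >
> > >>> import numpy
> > >>> dt = numpy.dtype('U4')       # 4 unicode characters
> > >>> dt.str                       # typestring counts bytes: 4 chars * 4 bytes
> > '<W16'
> > >>> numpy.dtype(dt.str) == dt    # the round-trip is now consistent
> > True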
> 
> While I agree that this solution is more consistent, I must say that
> I'm not very comfortable with having to deal with two different widths
> for unicode characters. What bothers me is the lack of portability of
> unicode strings when saving them to disk with a UCS4-enabled python
> interpreter and retrieving them with a UCS2-enabled one, in the
> context of PyTables (or any other database). Let's suppose that a user
> has a numpy unicode object that has been created in a python with
> UCS4. This would look like:
> 
> # UCS4-aware interpreter here
> >>> numpy.array(u"\U000110fc", "U1")
> array(u'\U000110fc', dtype=(unicode,4))
> 
> Now, suppose that you save this in a PyTables file (for example) and
> you want to regenerate it on a python interpreter compiled with UCS2.
> As the buffer on-disk has a fixed length, we are forced to use unicode
> types twice as large as containers for this data. So the net effect is
> that we will end up in the UCS2 interpreter with an object like:
> 
> # UCS2-aware interpreter here
> >>> numpy.array(u"\U000110fc", "U2")
> array(u'\U000110fc', dtype=(unicode,4))
> 
> which apparently is the same as the one above, but not quite. To begin
> with, the former is a unicode scalar array with only *one* character,
> while the latter has *two* characters. But worse than that, the
> interpretation of the original content changes drastically on the UCS2
> platform. For example, if we select the first and second characters of
> the string on the UCS2-aware platform, we have:
> 
> >>> numpy.array(u"\U000110fc", "U2")[()][0]
> u'\ud804'
> >>> numpy.array(u"\U000110fc", "U2")[()][1]
> u'\udcfc'
> 
> which have nothing to do with the original \U000110fc character (I'd
> expect to get at least the truncated values \u0001 and \u10fc). I
> think this is because of the convention used to represent 32-bit
> unicode characters in UTF-16, a technique called "surrogate pairs"
> (see: http://www.unicode.org/glossary/).
> 
> All in all, my opinion is that allowing the coexistence of different
> sizes of unicode types in numpy would be a recipe for disaster when
> one wants to transport unicode characters between platforms with
> python interpreters compiled with different unicode sizes.
> Consequently, I'd propose to support just one unicode size in numpy,
> namely the 4-byte one, and if this size doesn't match that of the
> underlying python platform, then refuse to deliver native unicode
> objects when the user asks for them. Something like this would work:
> 
> # UCS2-aware interpreter here
> >>> h = numpy.array(u"\U000110fc", "U1")
> >>> h  # This is a 'true' 32-bit unicode array in numpy
> array(u'\U000110fc', dtype=(unicode,4))
> >>> h[()]    # Try to get a native unicode object in python
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> ValueError: unicode sizes in numpy and your python interpreter don't
> match. Sorry, but you should get a UCS4-enabled python interpreter if
> you want to successfully complete this operation.
> 
> As a premium, we could get rid of the 'w' and 'W' typecodes, which
> have been introduced a bit forcedly, IMO. I don't know, however, how
> difficult implementing this in numpy would be. Another option would be
> to refuse to compile numpy with UCS2-aware interpreters; this sounds a
> bit extreme, but see below.
> 
> OTOH, I'm not an expert in Unicode, but after googling a bit, I've
> found interesting recommendations about its use in Python. The first
> is from Uche Ogbuji in http://www.xml.com/pub/a/2005/06/15/py-xml.html.
> Here is the relevant excerpt:
> 
> """
> I also want to mention another general principle to keep in mind: if
> possible, use a Python install compiled to use UCS4 character storage
> [...] UCS4 uses more space to store characters, but there are some
> problems for XML processing in UCS2, which the Python core team is
> reluctant to address because the only known fixes would be too much of
> a burden on performance. Luckily, most distributors have heeded this
> advice and ship UCS4 builds of Python.
> """
> 
> So, it seems that the Python crew is not interested in solving
> problems with UCS2. Now, towards the end of PEP 261 ('Support for
> "wide" Unicode characters') one can read this as a final conclusion:
> 
> """
> This PEP represents the least-effort solution. Over the next several
> years, 32-bit Unicode characters will become more common and that may
> either convince us that we need a more sophisticated solution or (on
> the other hand) convince us that simply mandating wide Unicode
> characters is an appropriate solution.
> """
> 
> This PEP dates from 27-Jun-2001, so the "next several years" the
> author is referring to is nowadays. In fact, the interpreters on my
> Debian-based Linux are both compiled with UCS4. Despite this, it seems
> that the default for compiling python is still UCS2, given that you
> need to pass the flag "--enable-unicode=ucs4" in order to end up with
> a UCS4-enabled interpreter. I wonder why this is so, if it can
> positively lead to problems with XML, as Uche Ogbuji said.
> 
> Anyway, I don't know whether the recommendation of compiling Python
> with UCS4 is widespread enough among the different distributions, but
> people can easily check this with:
> 
> >>> len(buffer(u"u"))
> 4
> 
> if the output of this is 4 (as in my example), then the interpreter is
> using UCS4; if it is 2, it is using UCS2.
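> 
> Equivalently, sys.maxunicode tells the same story without creating a
> buffer; it is 65535 (0xFFFF) on a UCS2 build and 1114111 (0x10FFFF) on
> a UCS4 one:
> 
> >>> import sys
> >>> sys.maxunicode   # on my UCS4-enabled interpreter
> 1114111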
> 
> Finally, I agree that asking for help about these issues in the python
> list would be a good idea.
> 

I have no good solution for this problem, but the standard Python on my
1-year-old Mandrake is still UCS2, and I quote from PEP 261:

    Windows builds will be narrow for a while based on the fact that
    there have been few requests for wide characters, those requests
    are mostly from hard-core programmers with the ability to buy
    their own Python and Windows itself is strongly biased towards
    16-bit characters.

I suppose that is still true. Maybe Vista will change that.

Wouldn't it be possible for numpy to take care of the "surrogate pairs"
when transferring unicode strings from UCS2 interpreters to
UCS4 ndarrays, and vice versa?

It would be nice to be able to cast explicitly between UCS2- and
UCS4-arrays, too.
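
As a rough sketch, the pairing/unpairing such a conversion involves can
be done in plain python through the UTF-16 codec (the helper names here
are made up for illustration):

    import struct

    def to_utf16_units(u):
        # split a unicode string into 16-bit code units; characters
        # above U+FFFF come out as surrogate pairs, which is what a
        # UCS2 build stores internally anyway
        data = u.encode('utf-16-be')
        return [struct.unpack('>H', data[i:i+2])[0]
                for i in range(0, len(data), 2)]

    def from_utf16_units(units):
        # recombine 16-bit code units, folding surrogate pairs back
        # into single characters (on a UCS4 build)
        data = ''.join(struct.pack('>H', x) for x in units)
        return data.decode('utf-16-be')

On a UCS4 interpreter, to_utf16_units(u"\U000110fc") gives
[0xd804, 0xdcfc], and from_utf16_units() on that list returns the
single original character; numpy would have to do essentially this at
the UCS2/UCS4 boundary.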

Requesting users to recompile their Python is a rather brutal solution :-)


Gerard
