[Numpy-discussion] Unicode revisited
Charles R Harris
Fri Aug 3 21:05:18 CDT 2012
On Fri, Aug 3, 2012 at 7:03 PM, Travis Oliphant <firstname.lastname@example.org> wrote:
> Hey all,
> Ondrej has been working hard with feedback from many others on improving
> Unicode support in NumPy (especially for Python 3.3). Looking at what
> Python has done in Python 3.3 (PEP 393) and chatting on the Python issue
> tracker with the author of that PEP has made me wonder if we aren't "doing
> the wrong thing" in NumPy quite often.
> Basically, NumPy only supports UTF-32 in its Unicode representation.
> All bytes in NumPy arrays should be either UTF-32LE or UTF-32BE. This is
> all pretty easy to understand as long as you stick with NumPy arrays only.
> The difficulty starts when you interact with the unicode array
> scalar (which is exactly the same data structure as a Python unicode object
> with a different type-name --- numpy.unicode_). However, I overlooked
> the "encoding" argument to the standard "unicode" constructor which might
> have simplified what we are doing. If I understand things correctly,
> now, all we need to do is to "decode" the UTF-32LE or UTF-32BE raw bytes in
> the array (depending on the dtype) into a unicode object.
> This is easily accomplished with numpy.unicode_(<bytes object>,
> 'utf_32_be' or 'utf_32_le'). There is also an "encoding" equivalent to
> go from the Python unicode object to the bytes representation in the NumPy
> array. I think this is what we should be doing in most of the places and
> it should considerably simplify the Unicode code in NumPy --- eliminating
> possibly the ucsnarrow.c file.
> Am I missing something?
I can't comment on the rest, but I'd be happy to see the end of the
ucsnarrow.c file. It needs more work to be properly generalized and if
there is a way to avoid that, so much the better.
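
For reference, the decode/encode round trip described in the quoted message can be sketched in plain Python. This is only an illustration, not NumPy's actual implementation: the helper names decode_item and encode_item are hypothetical, the byte strings stand in for the raw memory of one array item, and the NUL padding of fixed-width items is an assumption made for the sketch.

```python
# Hypothetical helpers; plain bytes stand in for one item of array memory.

def _codec(byteorder):
    # '<' and '>' mirror the dtype byte-order characters.
    return "utf_32_le" if byteorder == "<" else "utf_32_be"

def decode_item(raw, byteorder):
    """Decode one fixed-width UTF-32 item into a Python string."""
    # Assumption: unused trailing slots hold NUL code points, so strip them.
    return raw.decode(_codec(byteorder)).rstrip("\x00")

def encode_item(text, byteorder, width):
    """Encode a string into a fixed-width UTF-32 item of `width` code points."""
    if len(text) > width:
        raise ValueError("string does not fit in item")
    # The explicit-endian codecs emit no BOM; pad with zero bytes to full width.
    return text.encode(_codec(byteorder)).ljust(4 * width, b"\x00")

le = encode_item("h\u00e9", "<", 4)   # little-endian item, 16 bytes
be = encode_item("h\u00e9", ">", 4)   # big-endian item, 16 bytes
assert decode_item(le, "<") == decode_item(be, ">") == "h\u00e9"
```

On Python 2 the decode step would be spelled unicode(raw, 'utf_32_le'), which is the constructor form mentioned in the quoted text.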