[Numpy-discussion] Unicode revisited
Fri Aug 3 20:03:17 CDT 2012
Ondrej has been working hard with feedback from many others on improving Unicode support in NumPy (especially for Python 3.3). Looking at what Python has done in Python 3.3 (PEP 393) and chatting on the Python issue tracker with the author of that PEP has made me wonder if we aren't "doing the wrong thing" in NumPy quite often.
Basically, NumPy only supports UTF-32 in its Unicode representation. All bytes in NumPy unicode arrays should be either UTF-32LE or UTF-32BE. This is all pretty easy to understand as long as you stick with NumPy arrays only.
The difficulty starts when you interact with the unicode array scalar (which is exactly the same data structure as a Python unicode object, just with a different type name: numpy.unicode_). However, I overlooked the "encoding" argument to the standard "unicode" constructor, which might have simplified what we are doing. If I understand things correctly now, all we need to do is "decode" the UTF-32LE or UTF-32BE raw bytes in the array (depending on the dtype) into a unicode object.
This is easily accomplished with numpy.unicode_(<bytes object>, 'utf_32_be' or 'utf_32_le'). There is also an "encode" equivalent to go from the Python unicode object to the bytes representation in the NumPy array. I think this is what we should be doing in most places, and it should considerably simplify the Unicode code in NumPy, possibly eliminating the ucsnarrow.c file.
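To make the idea concrete, here is a rough sketch of the decode/encode round trip in plain Python 3, where the built-in str plays the role of "unicode". The helper names decode_element and encode_element are hypothetical, not NumPy functions. The NUL padding of fixed-width elements and the '<' / '>' byte-order characters are assumptions about how the array element would be laid out and described.

```python
def _codec(byteorder):
    # Map a dtype-style byte-order character to a Python codec name.
    # '<' and '>' are assumed here to mirror NumPy's byte-order markers.
    return 'utf_32_le' if byteorder == '<' else 'utf_32_be'


def decode_element(raw, byteorder):
    """Decode the raw bytes of one fixed-width UTF-32 element into a str."""
    # str(bytes, encoding) is the constructor form discussed above.
    text = str(raw, _codec(byteorder))
    # Assumption: unused trailing slots in the element are NUL code points.
    return text.rstrip('\x00')


def encode_element(text, byteorder, width):
    """Encode a str into a fixed-width UTF-32 element of `width` code points."""
    raw = text.encode(_codec(byteorder))  # the explicit LE/BE codecs add no BOM
    if len(raw) > 4 * width:
        raise ValueError("string does not fit in the element")
    # Pad with NUL bytes up to the full element size (4 bytes per code point).
    return raw.ljust(4 * width, b'\x00')


if __name__ == "__main__":
    s = "caf\u00e9 \U0001F600"          # includes a non-BMP code point
    raw = encode_element(s, '<', 8)
    print(len(raw), decode_element(raw, '<') == s)
```

Because the codec handles code points outside the BMP on its own, a sketch like this needs no separate narrow/wide conversion step, whatever the interpreter's internal string representation is.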
Am I missing something?