[Numpy-discussion] Bytes vs. Unicode in Python3
Fri Nov 27 06:23:10 CST 2009
On Fri, Nov 27, 2009 at 11:50 AM, Francesc Alted <email@example.com> wrote:
> On Friday 27 November 2009 11:27:00 Pauli Virtanen wrote:
>> Yes. But now I wonder, should
>> array(['foo'], str)
>> be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
>> which will mean unavoidable code breakage -- there's probably no
>> avoiding it.
> Mmh, you are right. Yes, this seems to be difficult to solve. Well, I'm
> changing my mind and think that both 'str' and 'S' should stand for Unicode in
> NumPy for Python 3. If people are aware of the change for Python 3, they
> should be expecting the same change happening in NumPy too, I guess. Then, I
> suppose that a new dtype "bytes" that replaces the existing "string" would be
> absolutely necessary.
>> > Also, I suppose that there will be issues with the current Unicode
>> > support in NumPy:
>> > In : u = np.array(['asa'], dtype="U10")
>> > In : u
>> > Out: u'asa' # will become 'asa' in Python 3
>> > In : u.dtype.itemsize
>> > Out: 40 # not sure about the size in Python 3
>> I suspect the Unicode stuff will keep working without major changes,
>> except maybe dropping the u in repr. It is difficult to believe the
>> CPython guys would have significantly changed the current Unicode
>> implementation, if they didn't bother changing the names of the
>> functions :)
>> > For example, if it is true that internal strings in Python 3 are Unicode
>> > UTF-8 (as René seems to suggest), I suppose that the internal conversions
>> > from 2-byte or 4-byte characters (depending on how the Python interpreter
>> > has been compiled) in the NumPy Unicode dtype to the new Python string
>> > would have to be reworked (perhaps you have dealt with that already).
>> I don't think they are internally UTF-8:
>> """Python’s default builds use a 16-bit type for Py_UNICODE and store
>> Unicode values internally as UCS2."""
> Ah! No changes for that matter. Much better then.
>>> 'Hello\u0020World !'.encode()
b'Hello World !'
The default encoding does appear to be utf-8 in py3, although the
interpreter is compiled to store strings internally as something
different, namely UCS2 or UCS4.
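To make the difference between the default encoding and the in-memory width a bit more concrete, here is a small sketch using only the standard library. The fixed-width codecs 'utf-16-le' and 'utf-32-le' are stand-ins I chose to approximate 2-byte and 4-byte storage; they are not how the interpreter actually lays out its strings.

```python
import sys

text = 'Hello\u0020World !'

# What str.encode() falls back to when no codec is named.
default_encoding = sys.getdefaultencoding()

# Encoding with no argument, i.e. UTF-8 on Python 3.
utf8 = text.encode()

# Fixed-width encodings used here only as stand-ins for UCS2/UCS4 storage.
utf16 = text.encode('utf-16-le')   # 2 bytes per character for this text
utf32 = text.encode('utf-32-le')   # 4 bytes per character

print(default_encoding)
print(len(text), len(utf8), len(utf16), len(utf32))   # 13 13 26 52
```

So the same 13 characters take 13, 26 or 52 bytes depending on the representation, which is why the choice matters for a fixed-itemsize dtype.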
I imagine dtype 'S' and 'U' need more clarification, since they seem
to lack the concept of encodings. Currently, S appears to mean 8-bit
characters with no encoding, and U appears to mean 16-bit characters
with no encoding? Or are some sort of default encodings assumed?
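One way to poke at these questions is to inspect the item sizes directly and convert between the two dtypes with an explicit codec. This is only a sketch, and it assumes a recent NumPy running on Python 3, which may behave differently from the trees being discussed here. The codec names passed to np.char.decode and np.char.encode are my own choice for the example.

```python
import numpy as np

s = np.array([b'foo'], dtype='S10')   # byte strings, no encoding attached
u = np.array(['foo'], dtype='U10')    # Unicode code points

s_size = s.dtype.itemsize   # bytes reserved per element for 'S10'
u_size = u.dtype.itemsize   # bytes reserved per element for 'U10'

# What a plain Python 3 str maps to when passed as the dtype.
default_kind = np.array(['foo'], str).dtype.kind

# Moving between the two dtypes requires naming an encoding explicitly.
decoded = np.char.decode(s, 'ascii')
encoded = np.char.encode(u, 'utf-8')

print(s_size, u_size, default_kind)
print(decoded.dtype.kind, encoded.dtype.kind)
```

With such a setup 'U10' reserves 4 bytes per character (40 in total), matching the itemsize quoted earlier in the thread, and neither dtype records an encoding of its own.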
2to3/3to2 fixers will probably have to be written for users' code
here, whatever is decided. At least warnings should be generated.
btw, in my numpy tree there is a unicode_() alias to str in py3, and
to unicode in py2 (inside the compat.py file). This helped us in many
cases with compatible string code in the pygame port. This allows you
to create unicode strings on both platforms with the same code.
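Roughly, the alias can be sketched like this. It is a guess at the shape of the idea rather than the actual contents of compat.py, and the as_unicode() helper is a hypothetical addition for illustration only.

```python
import sys

# Hypothetical reconstruction of the alias; the real compat.py may differ.
if sys.version_info[0] >= 3:
    unicode_ = str
else:
    unicode_ = unicode  # this branch is only evaluated on Python 2


def as_unicode(raw, encoding='utf-8'):
    """Return text unchanged; decode bytes with the given encoding.

    Hypothetical helper, not part of numpy or pygame.
    """
    if isinstance(raw, unicode_):
        return raw
    return unicode_(raw, encoding)


label = as_unicode(b'foo')
print(type(label) is unicode_, label)
```

Code written against unicode_ then never has to mention str or unicode directly, which is what makes it portable across both interpreters.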