[Numpy-discussion] Bytes vs. Unicode in Python3
Fri Nov 27 06:49:21 CST 2009
On Friday 27 November 2009 13:23:10 René Dudfield wrote:
> >> I don't think they are internally UTF-8:
> >> http://docs.python.org/3.1/c-api/unicode.html
> >> """Python’s default builds use a 16-bit type for Py_UNICODE and store
> >> Unicode values internally as UCS2."""
> > Ah! No changes for that matter. Much better then.
> in py3...
> >>> 'Hello\u0020World !'.encode()
> b'Hello World !'
> >>> "Äpfel".encode('utf-8')
> b'\xc3\x84pfel'
> >>> "Äpfel".encode()
> b'\xc3\x84pfel'
> The default encoding does appear to be utf-8 in py3.
> Although it is compiled with something different, and stores it as
> something different, that is UCS2 or UCS4.
OK. The default encoding for Unicode is one thing, and how Python keeps Unicode
internally is another. Internally, Python 3 still uses UCS2 or UCS4, i.e. the
same as Python 2, so no worries here.
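A minimal sketch of that distinction, assuming a Python 3 interpreter. The variable names are only illustrative. The default codec applies when a str is converted to bytes; it says nothing about how the str is laid out in memory.

```python
import sys

# Codec that str.encode() falls back to when no argument is given.
default_codec = sys.getdefaultencoding()

text = "\u00c4pfel"  # "Äpfel", written with an escape to stay ASCII-safe

# Encoding produces a separate bytes object from the str.
encoded_default = text.encode()
encoded_utf8 = text.encode("utf-8")

print(default_codec)
print(encoded_default)
print(encoded_default == encoded_utf8)
```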
> I imagine dtype 'S' and 'U' need more clarification, as they seem to miss the
> concept of encodings. Currently, S appears to mean 8-bit characters with no
> encoding, and U appears to mean 16-bit characters with no encoding? Or are
> some sort of default encodings assumed?
You only need an encoding if you are going to represent Unicode strings with
other types (for example bytes). Currently, NumPy can transparently
import/export native Python Unicode strings (UCS2 or UCS4) into its own
Unicode type (always UCS4). So we don't have to worry here either.
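As a rough illustration of the two dtypes, assuming a NumPy release that runs on Python 3 (the array contents are made up for the example): 'U' items take 4 bytes per character, while 'S' items are plain bytes with no encoding attached.

```python
import numpy as np

# 'U' stores fixed-width UCS4: 4 bytes per character, sized to the longest item.
u_arr = np.array(["\u00c4pfel", "Hello"])

# 'S' stores raw bytes: 1 byte per position, with no encoding information.
s_arr = np.array([b"Hello", b"World"])

print(u_arr.dtype.kind, u_arr.dtype.itemsize)  # 5 characters * 4 bytes
print(s_arr.dtype.kind, s_arr.dtype.itemsize)  # 5 bytes

# A Python str comes back out of the 'U' array unchanged.
roundtrip = str(u_arr[0])
```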
> btw, in my numpy tree there is a unicode_() alias to str in py3, and
> to unicode in py2 (inside the compat.py file). This helped us in many
> cases with compatible string code in the pygame port. This allows you
> to create unicode strings on both platforms with the same code.
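A sketch of what such an alias could look like. This is a hypothetical stand-in written for illustration, not the actual contents of compat.py:

```python
import sys

# Hypothetical version of the alias described above: pick the native
# Unicode string type for whichever major Python version is running.
if sys.version_info[0] >= 3:
    unicode_ = str
else:
    unicode_ = unicode  # noqa: F821 (this name exists only on Python 2)

# The same call creates a Unicode string on either version.
label = unicode_("Hello World !")
print(type(label).__name__)
```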
Correct. But, in addition, we are going to need a new 'bytes' dtype for NumPy
for Python 3, right?