[Numpy-discussion] Bytes vs. Unicode in Python3

René Dudfield renesd@gmail....
Fri Nov 27 01:05:10 CST 2009


On Fri, Nov 27, 2009 at 1:37 AM, Charles R Harris
<charlesr.harris@gmail.com> wrote:
> Hi Pauli,
>
> On Thu, Nov 26, 2009 at 4:08 PM, Pauli Virtanen <pav@iki.fi> wrote:
>>
>> Hi,
>>
>> The Python 3 porting needs some decisions on what is Bytes and
>> what is Unicode.
>>
>> I'm currently taking the following approach. Comments?
>>
>>        ***
>>
>> dtype field names
>>
>>        Either Bytes or Unicode.
>>        But 'a' and b'a' are *different* fields.
>>
>>        The issue is that:
>>            Python 2: {'a': 2}[u'a'] == 2, {u'a': 2}['a'] == 2
>>            Python 3: {'a': 2}[b'a'], {b'a': 2}['a'] raise exceptions
>>        so the current assumptions in the C code of u'a' == b'a'
>>        cease to hold.
>>
>> dtype titles
>>
>>        If Bytes or Unicode, work similarly as field names.
>>
>> dtype format strings, datetime tuple, and any other "protocol" strings
>>
>>        Bytes. User can pass in Unicode, but it's converted using
>>        UTF8 codec.
>>
>>        This will likely change repr() of various objects. Acceptable?
>>
>
> I'm not clear on your recommendation here, is it that we should use bytes,
> with unicode converted to UTF8? Will that support arrays that have been
> pickled and such? Or will we just have a minimum of code to fix up? And
> could you expand on the changes that repr() might undergo?
>
> Mind, I think using bytes sounds best, but I haven't looked into the whole
> strings part of the transition and don't have an informed opinion on the
> matter.
>
> Chuck

To help clarify for people who are not familiar with python 3...

To put it simply... in py3,
    str == unicode text (utf-8 is only the default codec); there is no separate 'unicode' type anymore.
    bytes == raw data... but kind of like the py2 str type with fewer methods.
    bytes + an encoding is how you do non utf-8 strings.
    'array' exists in both py2 and py3 with a very similar interface on both.
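
As a rough sketch of the summary above (assuming a py3 interpreter; the variable names are only for illustration), the str/bytes split looks something like this:

```python
# Python 3 only: str is text, bytes is raw data.
text = 'na\u00efve'                 # str: five Unicode characters ("naive" with a dotted i)
utf8 = text.encode('utf-8')         # bytes: the accented character takes two bytes in utf-8
latin1 = text.encode('latin-1')     # bytes: the same character takes one byte in latin-1

# The same text gives different raw data depending on the encoding.
lengths = (len(text), len(utf8), len(latin1))

# Decoding with the matching encoding gets the text back.
roundtrip = latin1.decode('latin-1')

# str and bytes never compare equal, so they are different dict keys.
fields = {'a': 2}
has_bytes_key = b'a' in fields
```

Here the text has 5 characters, the utf-8 data has 6 bytes, and the latin-1 data has 5 bytes.  The last lookup is False, which is the dtype field name issue Pauli describes.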

There's a more precise description of strings in python3 on these pages:
    http://diveintopython3.org/strings.html
    http://diveintopython3.org/porting-code-to-python-3-with-2to3.html


How each thing should work depends on its use cases, imho.  Mostly, if
you are using the str type now, then keep using the str type.

Many functions take both bytes and strings, since it is sane to work
on both from a user's perspective.  Some functions in the stdlib have
not accepted both; these have been treated as bugs and are being fixed
(eg, some urllib functions).
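
One way a function could accept both is to normalise its input first.  This is only a sketch: to_bytes is a hypothetical helper, not something in numpy or the stdlib, and utf-8 as the default codec is an assumption borrowed from Pauli's proposal for protocol strings.

```python
def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: accept either bytes or str and return bytes."""
    if isinstance(value, bytes):
        return value                    # already raw data, pass through
    if isinstance(value, str):
        return value.encode(encoding)   # text is converted with the codec
    raise TypeError('expected bytes or str, got %s' % type(value).__name__)


# Both spellings of the same format string end up as the same bytes.
same = to_bytes('float32') == to_bytes(b'float32')
```

With something like this at the boundary, the rest of the code only has to deal with one type.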



For dtype, using the python 'str' by default seems ok, since all of
those characters come out the same way on both pythons for the data
used by numpy.

eg.  'float32' is displayed the same as a py3 string and as a py2
string.  Internally, however, it is unicode data in py3.

Within py2, we save a pickle with the str:
>>> import pickle
>>> pickle.dump('float32', open('/tmp/p.pickle', 'wb'))


Within py3, we load the pickle and get a str back:
>>> import pickle
>>> pickle.load(open('/tmp/p.pickle', 'rb'))
'float32'
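
The same check can be done without a py2 interpreter at hand.  This is a sketch that assumes the bytes below are what py2 writes for 'float32' with its default protocol 0; the encoding argument to pickle.loads is the py3 option that controls how py2 str objects are read.

```python
import pickle

# Assumed output of pickle.dumps('float32') on py2 (protocol 0).
py2_pickle = b"S'float32'\np0\n."

# By default py3 decodes a py2 str as ASCII text and returns a str.
as_text = pickle.loads(py2_pickle)

# With encoding='bytes' the py2 str is returned as raw bytes instead.
as_bytes = pickle.loads(py2_pickle, encoding='bytes')
```

So an old pickle can come back as either type, depending on how it is loaded.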



cheers,

