[Numpy-discussion] Bytes vs. Unicode in Python3

Pauli Virtanen pav@iki...
Fri Nov 27 03:47:53 CST 2009


On Thu, 2009-11-26 at 17:37 -0700, Charles R Harris wrote:
[clip]
> I'm not clear on your recommendation here, is it that we should use
> bytes, with unicode converted to UTF8?

The point is that I don't think we can simply decide to use Unicode or
Bytes in every place where PyString was used earlier. Which one it
should be depends on the use. Users will expect that e.g. array([1,2,3],
dtype='f4') still works, and that they don't have to write
array([1,2,3], dtype=b'f4').

To summarize the use cases I've run across so far:

1) For the 'S' dtype, I believe we should use Bytes for both the raw
   data and the interface.

   Maybe we want to introduce a separate "bytes" dtype that's an alias
   for 'S'?

2) The field names:

	a = array([], dtype=[('a', int)])
	a = array([], dtype=[(b'a', int)])

This is somewhat of an internal issue. We need to decide whether we
internally coerce input to Unicode or Bytes. Or whether we allow for
both Unicode and Bytes (but preserving previous semantics in this case
requires extra work, due to semantic changes in PyDict).
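The PyDict semantic change mentioned above can be seen from pure Python: on Python 2, u'a' and b'a' compare and hash equal, so either spelling finds the same dict entry, while on Python 3 they are always distinct keys. A minimal Python 3 demonstration (the normalize_name helper is hypothetical, just to illustrate the coercion that preserving the old behavior would require):

```python
# Python 3: bytes and str never compare equal, so they are distinct
# dictionary keys -- unlike Python 2, where u'a' == b'a'.
fields = {'a': ('<i4', 0)}   # field name stored as str (unicode)

print('a' in fields)    # True  -- str key matches
print(b'a' in fields)   # False -- bytes key misses entirely

def normalize_name(name):
    # hypothetical helper: coerce bytes field names to str via ASCII
    # before any dict lookup, so both spellings keep working
    if isinstance(name, bytes):
        return name.decode('ascii')
    return name

print(normalize_name(b'a') in fields)   # True
```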

Currently, there's some code in Numpy to allow for Unicode field names,
but it's not been coherently implemented in all places, so e.g. direct
creation of dtypes with unicode field names fails.

This also has implications for field titles, since those are stored in
the fields dict as well.

3) Format strings

	a = array([], dtype=b'i4')

I don't think it makes sense to handle format strings as Unicode
internally -- they should always be coerced to bytes. This will make
things easier at many points, since it will be enough to do

	PyBytes_AS_STRING(str)

to get the char* pointer, rather than having to encode to utf-8 first.
The same goes for all other similar uses of strings, e.g. protocol
descriptors. User input should simply be coerced to ASCII on input,
I believe.

The problem here is that preserving repr() in this case requires some
extra work. But maybe that has to be done.
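At the Python level, the coercion policy described above could be sketched as follows (coerce_descr is a hypothetical name; in numpy the real work would happen in C):

```python
def coerce_descr(descr):
    """Coerce a user-supplied format string to bytes via ASCII.

    Hypothetical sketch of the policy described above: str input is
    encoded as ASCII (raising on non-ASCII characters), bytes pass
    through unchanged, and anything else is rejected.
    """
    if isinstance(descr, bytes):
        return descr
    if isinstance(descr, str):
        return descr.encode('ascii')   # UnicodeEncodeError on non-ASCII
    raise TypeError("format string must be str or bytes")

print(coerce_descr('i4'))    # b'i4'
print(coerce_descr(b'f8'))   # b'f8'
```

After this normalization, the C side only ever sees bytes, so PyBytes_AS_STRING can be applied directly.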

> Will that support arrays that have been pickled and such?

Are the pickles backward compatible between Python 2 and 3 at all?
I think using Bytes for format strings will be backward-compatible.

Field names are then a bit more difficult. Actually, we'll probably just
have to coerce them to either Bytes or Unicode internally, since we'll
need to do that on unpickling if we want to be backward-compatible.
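Within a single interpreter the bytes round-trip is easy to check; whether a protocol-2 pickle written this way loads on Python 2 as a native str is the actual compatibility question, and can't be verified from one interpreter alone. A sketch:

```python
import pickle

# Pickle a bytes format string with protocol 2, the highest protocol
# that Python 2 understands.  On Python 3 this round-trips as bytes;
# the hope is that Python 2 unpickles it as a native str, which is
# what the old code expects.
data = pickle.dumps(b'i4', protocol=2)
print(pickle.loads(data))   # b'i4' on Python 3
```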

> Or will we just have a minimum of code to fix up?

I think we will need in any case to replace all use of PyString in Numpy
by PyBytes or PyUnicode, depending on context, and #define PyString
PyBytes for Python 2.

This seems to be the easiest way to make sure we have fixed all points
that need fixing.

Currently, 193 of 800 numpy.core tests don't pass, and this seems
largely due to Bytes vs. Unicode issues.

> And could you expand on the changes that repr() might undergo?

The main thing is that

	dtype('i4')
	dtype([('a', 'i4')])

may become

	dtype(b'i4')
	dtype([(b'a', b'i4')])

Of course, we can write separate, #ifdef'd repr formatting code for
Py3, but that is a bit of extra work.
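Rather than maintaining a second repr implementation behind an #ifdef, one alternative would be to decode descriptor bytes to text only at display time. A hypothetical pure-Python sketch of that idea (descr_repr is an illustrative name, not an actual numpy function):

```python
def descr_repr(descr):
    """Format a dtype descriptor for repr without the b'' prefix.

    Hypothetical sketch: descriptor strings are stored as bytes
    internally but decoded as ASCII for display, so the repr output
    keeps the Python 2 form dtype('i4') rather than dtype(b'i4').
    """
    if isinstance(descr, bytes):
        descr = descr.decode('ascii')
    return "dtype(%r)" % (descr,)

print(descr_repr(b'i4'))   # dtype('i4')
```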

> Mind, I think using bytes sounds best, but I haven't looked into the
> whole strings part of the transition and don't have an informed
> opinion on the matter.

	***

By the way, should I commit this stuff (after factoring the commits to
logical chunks) to SVN?

It does not break anything for Python 2, at least as far as the test
suite is concerned.

	Pauli



