[Numpy-discussion] Bytes vs. Unicode in Python3

Francesc Alted faltet@pytables....
Fri Nov 27 04:50:03 CST 2009


On Friday 27 November 2009 11:27:00 Pauli Virtanen wrote:
> Yes. But now I wonder, should
> 
> 	array(['foo'], str)
> 	array(['foo'])
> 
> be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
> which will mean unavoidable code breakage -- there's probably no
> avoiding it.

Mmh, you are right.  Yes, this seems difficult to solve.  Well, I'm
changing my mind: I now think that both 'str' and 'S' should stand for Unicode
in NumPy for Python 3.  If people are aware of the change in Python 3, they
should expect the same change in NumPy too, I guess.  In that case, I suppose
that a new "bytes" dtype replacing the existing "string" one would be
absolutely necessary.
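
To check what a given build actually picks for the two calls Pauli quotes,
something like the following could be used.  This is only a minimal sketch: it
assumes a NumPy that imports under Python 3, and the variable names are mine,
for illustration.

```python
import numpy as np

# The two arrays from the quoted question, plus an explicit bytes input
# for comparison.
a_default = np.array(['foo'])
a_str = np.array(['foo'], str)
a_bytes = np.array([b'foo'])

# dtype.kind is 'U' for Unicode strings and 'S' for byte strings.
for name, arr in [("default", a_default), ("str", a_str), ("bytes", a_bytes)]:
    print(name, arr.dtype.kind, arr.dtype)
```

Looking at dtype.kind rather than the full dtype string keeps the check
independent of the string length and of byte order.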

> > Also, I suppose that there will be issues with the current Unicode
> > support in NumPy:
> >
> > In [5]: u = np.array(['asa'], dtype="U10")
> >
> > In [6]: u[0]
> > Out[6]: u'asa'  # will become 'asa' in Python 3
> >
> > In [7]: u.dtype.itemsize
> > Out[7]: 40      # not sure about the size in Python 3
> 
> I suspect the Unicode stuff will keep working without major changes,
> except maybe dropping the u in repr. It is difficult to believe the
> CPython guys would have significantly changed the current Unicode
> implementation, if they didn't bother changing the names of the
> functions :)
> 
> > For example, if it is true that internal strings in Python 3 are Unicode
> > UTF-8 (as René seems to suggest), I suppose that the internal conversions
> > from 2-byte or 4-byte characters (depending on how the Python interpreter
> > has been compiled) in the NumPy Unicode dtype to the new Python string
> > would have to be reworked (perhaps you have dealt with that already).
> 
> I don't think they are internally UTF-8:
> http://docs.python.org/3.1/c-api/unicode.html
> 
> """Python’s default builds use a 16-bit type for Py_UNICODE and store
> Unicode values internally as UCS2."""

Ah!  So no changes on that front.  Much better then.
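
The itemsize question from In [7] above is easy to check directly.  A small
sketch, again assuming a NumPy that runs on Python 3 (bytes_per_char is just a
name I made up here): NumPy stores each character of a 'U' dtype in 4 bytes,
whatever width the interpreter uses internally, so "U10" should come out as 40
bytes.

```python
import sys
import numpy as np

u = np.array(['asa'], dtype="U10")

# Storage per element, and per character, of the 10-character Unicode dtype.
bytes_per_char = u.dtype.itemsize // 10
print(u.dtype.itemsize, bytes_per_char)

# Largest code point the interpreter's str type can hold; this depends on
# how (and which version of) Python was built.
print(hex(sys.maxunicode))
```

Comparing sys.maxunicode with the per-character size shows whether NumPy has
to convert between its own 4-byte storage and a narrower interpreter
representation.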

-- 
Francesc Alted

