[Numpy-discussion] Bytes vs. Unicode in Python3

Pauli Virtanen pav+sp@iki...
Thu Dec 3 03:36:09 CST 2009

Fri, 27 Nov 2009 23:19:58 +0100, Dag Sverre Seljebotn wrote:
> One thing to keep in mind here is that PEP 3118 actually defines a
> standard dtype format string, which is (mostly) incompatible with
> NumPy's. It should probably be supported as well when PEP 3118 is
> implemented.

PEP 3118 is for the most part implemented in my Py3K branch now -- it was 
not actually much work, as I could steal most of the format string 
converter from numpy.pxd.
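
To make the incompatibility concrete, here is a minimal sketch comparing NumPy's own type character with the PEP 3118 format string that an array exports through the buffer interface. It assumes a NumPy build running on Python 3 with buffer export available; the helper name `codes` is made up for illustration.

```python
import numpy as np

def codes(dtype):
    """Return (NumPy type character, PEP 3118 format string) for a dtype."""
    arr = np.zeros(1, dtype=dtype)
    # memoryview() obtains the buffer via PEP 3118; .format is the
    # format string the exporter (NumPy) filled in.
    return arr.dtype.char, memoryview(arr).format

float_codes = codes(np.float64)       # the two conventions agree here
complex_codes = codes(np.complex128)  # and differ here
print(float_codes, complex_codes)
```

Simple real types tend to coincide, while complex types are one place where the two conventions part ways.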

Some questions:

How hard do we want to try supplying a buffer? E.g., if the consumer does 
not request strides but does request suboffsets, should we try to compute 
suitable suboffsets? Should we try making contiguous copies of the data? 
(I guess this would break buffer semantics.)
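
A rough illustration of why a contiguous copy would break buffer semantics, seen from the Python side. This is only a sketch of the trade-off, not of what the branch does: a strided view is exported with its strides, whereas a contiguous copy lives in separate memory, so writes through it never reach the original array.

```python
import numpy as np

base = np.arange(6, dtype=np.float64)
strided = base[::2]                   # view on every second element

view = memoryview(strided)            # exported with shape and strides
print(view.c_contiguous, view.strides, view.suboffsets)

copy = np.ascontiguousarray(strided)  # contiguous, but a separate allocation
copy[0] = 99.0                        # this write is not visible through base
shares = np.shares_memory(copy, base)
print(shares, base[0])
```

A consumer that was handed such a copy in place of the real buffer would silently lose its writes.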

> Just something to keep in the back of ones mind when discussing this.
> For instance one could, instead of inventing something new, adopt the
> characters PEP 3118 uses (if there isn't a conflict):
>   - b: Raw byte
>   - c: ucs-1 encoding (latin 1, one byte) 
>   - u: ucs-2 encoding, two bytes
>   - w: ucs-4 encoding, four bytes

The 'b' character is already taken so we can't easily use that. 'y' would 
be free for bYtes, however.
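
For reference, a quick check of what the relevant one-letter codes mean to NumPy today. Nothing here is new API; it only shows that 'b' already denotes a signed 8-bit integer, while byte strings and unicode strings use 'S' and 'U'.

```python
import numpy as np

int8_dtype = np.dtype('b')      # 'b' is already the signed byte integer
bytes_dtype = np.dtype('S4')    # fixed-width byte string, 4 bytes
unicode_dtype = np.dtype('U4')  # fixed-width unicode string, 4 code points

print(int8_dtype.name, bytes_dtype.kind, unicode_dtype.kind,
      unicode_dtype.itemsize)
```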

> Long-term I hope the NumPy-specific format string will be deprecated, so
> that repr print out the PEP 3118 format string etc. But, I'm aware that
> API breakage shouldn't happen when porting to Python 3.


A global switch could in principle be added for this, maybe -- the type 
codes are for the most part stored in a dict in numerictypes.py and could 
probably be easily replaced at runtime.
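
A sketch of what such a switch might look like. Everything here is hypothetical: the tables, `use_pep3118_codes` and `type_code` are invented for illustration and do not reflect the actual contents of numerictypes.py.

```python
# Hypothetical code tables; NOT the real dict from numerictypes.py.
NUMPY_CODES = {'float64': 'd', 'complex64': 'F', 'complex128': 'D'}
PEP3118_CODES = {'float64': 'd', 'complex64': 'Zf', 'complex128': 'Zd'}

_active_codes = dict(NUMPY_CODES)

def use_pep3118_codes(enabled):
    """Swap the active table in place so existing references see the change."""
    _active_codes.clear()
    _active_codes.update(PEP3118_CODES if enabled else NUMPY_CODES)

def type_code(name):
    """Look up the type code for a type name in the active table."""
    return _active_codes[name]

before = type_code('complex128')
use_pep3118_codes(True)
after = type_code('complex128')
use_pep3118_codes(False)          # restore the default table
print(before, after)
```

Mutating the dict in place rather than rebinding it means any module that already holds a reference to the table would pick up the change.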

Pauli Virtanen
