[Numpy-discussion] Py3 merge

Pauli Virtanen pav@iki...
Mon Dec 7 09:32:50 CST 2009

On Mon, 2009-12-07 at 09:50 -0500, Michael Droettboom wrote:
> Pauli Virtanen wrote:
> > The character 'B' is already used by unsigned bytes -- I wonder if it's easy
> > to support 'B123' and plain 'B' at the same time, or whether we have to
> > pick a different letter for "byte strings". 'y' would be free...
> It seems to me the motivation to change the 'S' dtype to something else 
> is to make things clearer with respect to the new conventions of Python 
> 3.  (Where str -> bytes, and unicode -> str). In that sense, I'm not 
> sure there's any advantage going from "S" to "y" (particularly without 
> doing "U" to "S"), whereas there's a strong backward-compatibility 
> advantage to keep it as "S", though admittedly it's confusing to someone 
> who doesn't know the pre Python 3 history. 

I think a better plan is to deprecate "U" instead of "S".

Also, I'm not completely convinced that staying with "S" == bytes has a
strong backward-compatibility advantage:

	array(['foo']).dtype == 'U'

and this will break code in several places. Also, for instance,

	array(['foo', 'bar'], dtype='S3')

will result in encoding errors. We probably don't want to start
implicitly casting Unicode to bytes, since Py3 does not do that either.
The only places where the dtype characters are used, AFAIK, are repr
and the dtype kwarg -- they are not used in pickles etc.
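
For reference, the Py3 behaviour referred to above can be checked with
plain Python. This is only a sketch of the str/bytes rules on the Python
side (the variable names are made up for illustration); it says nothing
about what NumPy itself should end up doing:

```python
# Python 3 keeps text (str) and binary data (bytes) strictly apart.
text = 'foo'
data = b'foo'

# Mixing the two in one operation is not implicitly converted;
# it raises TypeError instead.
try:
    data + text
except TypeError:
    mixing_fails = True
else:
    mixing_fails = False

# Going from text to bytes needs an explicit encoding step ...
encoded = text.encode('ascii')

# ... and that step can fail when the text does not fit the encoding.
try:
    'f\u00f6\u00f6'.encode('ascii')
except UnicodeEncodeError:
    encoding_fails = True
else:
    encoding_fails = False
```

Both flags end up True, which is the kind of failure an implicit
Unicode-to-bytes cast in array() would have to either hide or surface.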

One can actually argue that changing 'U' to 'S' is more backward-compatible:

	array(['foo', 'bar'], dtype='S3')

would still be valid code. Of course, the semantics change, but this
happens on the Python side anyway when moving to Py3.

The simplest way to get more insight would be to try to convert some
string-using Py2 code to work on Py3.

> I'm not sure your suggestion of making 'B' and 'B123' both work seems 
> like a good one because of the semantic differences between numbers and 
> strings. Would np.array(['a', 'b']) have a repr of [97, 98] or ['a', 
> 'b']?  Sorting them would also not necessarily do the right thing.

I think the point would be that 'B' and 'B1' would be treated as
completely separate dtypes with different typenums -- they'd look
similar only in the dtype character API (which is not so large) but not
internally. np.array([b'a', b'b']).dtype would be 'B1'. Might be a bit
confusing, though.
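
A rough sketch of how the character API could tell the two apart. The
helper parse_typecode and its return values are hypothetical and only
illustrate the idea; nothing like this exists in NumPy, and the real
typenum machinery would look different:

```python
import re


def parse_typecode(code):
    """Hypothetical helper: map a 'B...' type string to (kind, itemsize)."""
    match = re.fullmatch(r'B(\d*)', code)
    if match is None:
        raise ValueError('only the B codes are handled in this sketch')
    if match.group(1) == '':
        # Plain 'B': the existing unsigned byte type.
        return ('unsigned byte', 1)
    # 'B' followed by digits: a byte string of that length.
    return ('byte string', int(match.group(1)))


plain = parse_typecode('B')
single = parse_typecode('B1')
longer = parse_typecode('B123')

# Python 3 bytes already show both views: iterating them gives integers.
as_ints = list(b'ab')
```

Here 'B' and 'B1' both describe one-byte items but map to different
kinds, which is exactly where the possible confusion comes from.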

