[Numpy-discussion] Massive differences in numpy vs. numeric string handling
Travis Oliphant
oliphant at ee.byu.edu
Wed Apr 12 15:04:06 CDT 2006
Jeremy Gore wrote:
> In Numeric:
>
> Numeric.array('test') -> array([t, e, s, t],'c'); shape = (4,)
> Numeric.array(['test','two']) ->
> array([[t, e, s, t],
> [t, w, o, ]],'c')
>
> but in numpy:
>
> numpy.array('test') -> array('test', dtype='|S4'); shape = ()
> numpy.array('test','S1') -> array('t', dtype='|S1'); shape = ()
>
> in fact you have to do an extra list cast:
>
> numpy.array(list('test'),'S1') -> array([t, e, s, t], dtype='|S1');
> shape = (4,)
>
> to get the desired result. I don't think this is very pythonic, as
> strings are fully indexable and iterable objects.
Let's not cast this discussion in Pythonic vs. un-pythonic because that
does not really shed light on the issues.
NumPy adds full support for string arrays. Numeric had this step-child
called a character array which was really just an array of bytes that
printed differently.
This does raise some compatibility issues that have been hard to get
exactly right, and convertcode indeed does not really solve the problem
for a heavy character-array user. I have resisted simply adding a
1-character string data-type back into NumPy, but that could be done
if it is really necessary. But, I don't think it is.
> Furthermore, converting/treating a string as an array of characters
> is a very common thing. convertcode.py would not appear to convert
> this part of the code correctly either. Also, the use of quotes in
> the shape () array but not in the shape (4,) array is inconsistent.
>
>
> I realize the ability to use strings of arbitrary length as array
> elements is important in numpy, but there really should be a more
> natural option to convert/cast strings as character arrays.
Perhaps all that is needed to simplify handling is to handle the 'S1'
case better so that
array('test','S1') works the same as array('test','c') used to work
(i.e. not stopping at strings for the sequence decomposition).
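To make the shape difference under discussion concrete, here is a minimal sketch. It assumes a current NumPy on Python 3 (where 'S1' elements are bytes); the variable names are illustrative only, and np.frombuffer is offered as one possible alternative to the list() workaround, not as something proposed in this thread.

```python
import numpy as np

# A bare string is treated as a single element, so the result is 0-d.
whole = np.array('test')
print(whole.shape)

# The workaround from the quoted message: split the string into
# characters first, so each one becomes its own 1-byte element.
chars = np.array(list('test'), dtype='S1')
print(chars.shape)

# Assumed alternative: reinterpret a bytes object one byte per element.
from_bytes = np.frombuffer(b'test', dtype='S1')
print(from_bytes.shape)
```

Both of the last two arrays are one-dimensional with four 1-byte elements, which is the layout Numeric's 'c' arrays had.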
>
> Also, unlike Numeric.equal and 'c' arrays, numpy.equal cannot compare
> '|S1' arrays or presumably other strings for equality, although this
> is a very useful comparison to make.
This is a known missing feature: comparisons are implemented as ufuncs,
but ufuncs are not supported for variable-length arrays.
Currently, however, you can use the chararray class, which does allow
comparisons of strings.
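As a rough illustration of element-wise string comparison, the sketch below uses np.char.equal, the free-function form from the same np.char module that provides chararray. This assumes a current NumPy; the array contents and names are made up for the example.

```python
import numpy as np

# Two small arrays of fixed-width byte strings.
a = np.array([b'test', b'two'])
b = np.array([b'test', b'too'])

# Element-wise string equality, returning a boolean array.
same = np.char.equal(a, b)
print(same)
```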
There are simple ways to work around this, of course. If you do have
'S1' arrays, then you can simply view them as unsigned bytes (using the
.view method) and do the comparison that way.
For example, if s1 and s2 are "character arrays":
    s1.view(ubyte) >= s2.view(ubyte)
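A self-contained version of that workaround might look like the following sketch. The sample strings are arbitrary, and it assumes both arrays hold 1-byte elements, so each element maps to exactly one unsigned byte.

```python
import numpy as np

# Two "character arrays": one 1-byte string per element.
s1 = np.array(list('test'), dtype='S1')
s2 = np.array(list('tusk'), dtype='S1')

# Reinterpret the same memory as unsigned bytes (no copy is made),
# then compare numerically, element by element.
b1 = s1.view(np.ubyte)
b2 = s2.view(np.ubyte)

greater_equal = b1 >= b2
equal = b1 == b2
print(greater_equal)
print(equal)
```

Because the view only changes how the bytes are interpreted, the comparison is by byte value, which for ASCII text matches ordinary character ordering.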
-Travis