[Numpy-discussion] Massive differences in numpy vs. numeric string handling
tim.hochberg at cox.net
Wed Apr 12 15:32:04 CDT 2006
Travis Oliphant wrote:
> Jeremy Gore wrote:
>> In Numeric:
>> Numeric.array('test') -> array([t, e, s, t],'c'); shape = (4,)
>> Numeric.array(['test','two']) ->
>> array([[t, e, s, t],
>> [t, w, o, ]],'c')
>> but in numpy:
>> numpy.array('test') -> array('test', dtype='|S4'); shape = ()
>> numpy.array('test','S1') -> array('t', dtype='|S1'); shape = ()
>> in fact you have to do an extra list cast:
>> numpy.array(list('test'),'S1') -> array([t, e, s, t], dtype='|S1');
>> shape = (4,)
>> to get the desired result. I don't think this is very pythonic, as
>> strings are fully indexable and iterable objects.
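For reference, the behaviour described above can be reproduced with the short sketch below. It assumes a current NumPy on Python 3, where byte strings (b'...') are what map to the 'S' dtypes; the variable names are only illustrative. np.frombuffer is shown as one alternative to the list() cast, not as a recommended replacement.

```python
import numpy as np

# A whole string becomes a single 0-d element of dtype S4.
scalar = np.array(b'test')

# The explicit list() cast from the message: one S1 element per character.
chars_from_list = np.array(list('test'), dtype='S1')

# Alternative (an assumption, not from the thread): reinterpret the bytes directly.
chars_from_buffer = np.frombuffer(b'test', dtype='S1')

print(scalar.shape, scalar.dtype)
print(chars_from_list.shape, chars_from_buffer.shape)
```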
> Let's not cast this discussion in Pythonic vs. un-pythonic because
> that does not really shed light on the issues.
> NumPy adds full support for string arrays. Numeric had this
> step-child called a character array which was really just an array of
> bytes that printed differently.
> This does raise some compatibility issues that have been hard to get
> exactly right, and convertcode indeed does not really solve the
> problem for a heavy character-array user. I have resisted simply
> adding a 1-character string data-type back into NumPy, but that
> could be done if it is really necessary. But, I don't think it is.
>> Furthermore, converting/treating a string as an array of
>> characters is a very common thing. convertcode.py would not appear
>> to convert this part of the code correctly either. Also, the use of
>> quotes in the shape () array but not in the shape (4,) array is
>> I realize the ability to use strings of arbitrary length as array
>> elements is important in numpy, but there really should be a more
>> natural option to convert/cast strings as character arrays.
> Perhaps all that is needed to simplify handling is to handle the 'S1'
> case better so that
> array('test','S1') works the same as array('test','c') used to work
> (i.e. not stopping at strings for the sequence decomposition).
It seems a little wacky that 'S2' and 'S1' would have vastly different
>> Also, unlike Numeric.equal and 'c' arrays, numpy.equal cannot
>> compare '|S1' arrays or presumably other strings for equality,
>> although this is a very useful comparison to make.
> This is a known missing feature due to the fact that comparisons use
> ufuncs but ufuncs are not supported for variable-length arrays.
> Currently, however, you can use the chararray class which does allow
> comparisons of strings.
It seems like this should be easy to work around in __cmp__ (or
array_compare or however it's spelled). Since the strings really have a
fixed length, they're more or less equivalent to byte arrays with one
extra dimension. Writing a little lexicographic comparison thing on top of
the results of a ufunc operating on the result of a compare of these
byte arrays should be a piece of cake; in theory at least.
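A rough sketch of that idea, assuming a current NumPy and two byte-string ('S') arrays of the same shape. The helper name lex_less is made up for illustration and is not part of NumPy. Each string is viewed as a row of unsigned bytes (one extra trailing dimension), and the result is decided by the first byte position where the two rows differ.

```python
import numpy as np

def lex_less(a, b):
    """Elementwise a < b for same-shape fixed-width byte-string arrays (sketch)."""
    a = np.atleast_1d(np.asarray(a))
    b = np.atleast_1d(np.asarray(b))
    # Cast both to a common width; shorter strings are padded with NUL bytes.
    width = max(a.dtype.itemsize, b.dtype.itemsize)
    a = np.ascontiguousarray(a.astype(f'S{width}'))
    b = np.ascontiguousarray(b.astype(f'S{width}'))
    # View every string as `width` unsigned bytes along a new last axis.
    ab = a.view(np.uint8).reshape(a.shape + (width,))
    bb = b.view(np.uint8).reshape(b.shape + (width,))
    differ = ab != bb
    # Index of the first differing byte (0 when the strings are equal).
    first = differ.argmax(axis=-1)[..., None]
    a_byte = np.take_along_axis(ab, first, axis=-1)[..., 0]
    b_byte = np.take_along_axis(bb, first, axis=-1)[..., 0]
    # Equal strings are never "less"; otherwise compare the first differing byte.
    return differ.any(axis=-1) & (a_byte < b_byte)

a = np.array([b'abc', b'abd', b'b', b'abc'])
b = np.array([b'abd', b'abc', b'ab', b'abc'])
print(lex_less(a, b))
```

The other orderings follow from this one and from equality, so a single helper of this kind would be enough to build the rest.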
> There are simple ways to work around this, of course. If you do have
> 'S1' arrays, then you can simply view them as unsigned bytes (using
> the .view method) and do comparison that way.
> if s1 and s2 are "character arrays"
> s1.view(ubyte) >= s2.view(ubyte)
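Spelled out as a small runnable sketch (assuming a current NumPy on Python 3; the sample strings are arbitrary and np.frombuffer is just one way to build the 'S1' arrays):

```python
import numpy as np

# Two "character arrays": one S1 element per character.
s1 = np.frombuffer(b'test', dtype='S1')
s2 = np.frombuffer(b'tusk', dtype='S1')

# Reinterpret the same memory as unsigned bytes; no data is copied.
ge = s1.view(np.ubyte) >= s2.view(np.ubyte)
eq = s1.view(np.ubyte) == s2.view(np.ubyte)

print(ge)
print(eq)
```

This only works character by character because each element is exactly one byte wide; for longer 'S' widths the view changes the number of elements.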