[Numpy-discussion] Massive differences in numpy vs. numeric string handling

Travis Oliphant oliphant at ee.byu.edu
Wed Apr 12 15:04:06 CDT 2006

Jeremy Gore wrote:

> In Numeric:
> Numeric.array('test') -> array([t, e, s, t],'c'); shape = (4,)
> Numeric.array(['test','two']) ->
> array([[t, e, s, t],
>        [t, w, o,  ]],'c')
> but in numpy:
> numpy.array('test') -> array('test', dtype='|S4'); shape = ()
> numpy.array('test','S1') -> array('t', dtype='|S1'); shape = ()
> in fact you have to do an extra list cast:
> numpy.array(list('test'),'S1') -> array([t, e, s, t], dtype='|S1');  
> shape = (4,)
> to get the desired result.  I don't think this is very pythonic, as  
> strings are fully indexable and iterable objects.

Let's not cast this discussion in Pythonic vs. un-pythonic because that 
does not really shed light on the issues.

NumPy adds full support for string arrays.   Numeric had this step-child 
called a character array which was really just an array of bytes that 
printed differently.  

This does raise some compatibility issues that have been hard to get 
exactly right, and convertcode indeed does not really solve the problem 
for a heavy character-array user.  I have resisted simply adding a 
1-character string data-type back into NumPy, but that could be done 
if it is really necessary.  I don't think it is, though.

>   Furthermore,  converting/treating a string as an array of characters 
> is a very  common thing.  convertcode.py would not appear to convert 
> this part  of the code correctly either.  Also, the use of quotes in 
> the shape  () array but not in the shape (4,) array is inconsistent.

> I realize the ability to use strings of arbitrary length as array  
> elements is important in numpy, but there really should be a more  
> natural option to convert/cast strings as character arrays.

Perhaps all that is needed to simplify handling is to handle the 'S1' 
case better so that

array('test','S1')  works the same as array('test','c') used to work 
(i.e. not stopping at strings for the sequence decomposition). 
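
To make the difference concrete, here is a minimal sketch of the behaviour 
described above. It assumes a current NumPy on Python 3 (hence the bytes 
literals and the `np` alias, neither of which is from the original 
messages) and is an illustration, not a statement of how any particular 
release behaves. `np.frombuffer` is shown as one possible alternative to 
the list cast.

```python
import numpy as np

# A whole string becomes a single 0-d array element.
whole = np.array(b"test")                 # dtype S4, shape ()

# Asking for 'S1' does not split the string; it keeps one truncated element.
truncated = np.array(b"test", dtype="S1")  # shape (), holds b"t"

# The list cast from the quoted message yields one element per character.
chars = np.array(list("test"), dtype="S1")  # shape (4,)

# Assumed alternative: reinterpret the raw bytes as 1-byte strings.
chars_from_buffer = np.frombuffer(b"test", dtype="S1")  # shape (4,)

print(whole.shape, truncated.shape, chars.shape, chars_from_buffer.shape)
```

The last two forms give the per-character array that 
Numeric.array('test') used to return.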

> Also, unlike Numeric.equal and 'c' arrays, numpy.equal cannot compare  
> '|S1' arrays or presumably other strings for equality, although this  
> is a very useful comparison to make.

This is a known missing feature: comparisons use ufuncs, but ufuncs 
are not supported for variable-length (string) data types.  
Currently, however, you can use the chararray class, which does allow 
comparisons of strings.

There are simple ways to work around this, of course.   If you do have 
'S1' arrays, then you can simply view them as unsigned bytes (using the 
.view method) and do comparison that way.  

For example, if s1 and s2 are 'S1' "character arrays":

s1.view(numpy.ubyte) >= s2.view(numpy.ubyte)
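
A slightly fuller sketch of the same workaround, assuming a current NumPy 
on Python 3. The sample strings and the names `ge` and `eq` are made up 
for illustration. Viewing a 1-byte string array as unsigned bytes 
reinterprets the same memory without copying, so the ordinary numeric 
comparison ufuncs apply.

```python
import numpy as np

# Two hypothetical 'S1' character arrays of equal length.
s1 = np.array(list("test"), dtype="S1")
s2 = np.array(list("tent"), dtype="S1")

# Reinterpret each 1-byte string element as an unsigned byte.
b1 = s1.view(np.ubyte)
b2 = s2.view(np.ubyte)

ge = b1 >= b2   # element-wise ordering by byte value
eq = b1 == b2   # element-wise equality

print(ge)
print(eq)
```

This only works element by element because every item is exactly one byte 
wide; longer fixed-width strings would need a different approach, such as 
the chararray class mentioned above.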

