[Numpy-discussion] A disconnected numarray rant

Todd Miller jmiller at stsci.edu
Wed Oct 13 14:35:08 CDT 2004


Hi Alexander,

Thanks for taking the time to provide us with feedback.  I've responded
to many of your points below.  [and in the interest of keeping the text
bloat down, I've interjected my own comments in brackets--Perry]

On Tue, 2004-10-12 at 05:37, Alexander Schmolck wrote: 
> Hi,
> 
> I'm taking a 1 month break from computers (i.e. I will be completely
> off-line), and I have to catch a train in an hour; but I've recently 
> bitten
> the bullet and made a matrix class I've been using for some time work 
> with
> numarray; I've written down a number of things that occured to me 
> while I was
> doing it, including some things which I think are bugs in numarray, so
> I
> thought at least posting the bugs would be a useful service; the rest 
> is very
> raw and essentially unedited cut-and-paste of these notes -- sorry 
> about that
> and I hope it doesn't contain anything particularly offensive.
> 
> P.S. just dumped the code for the matrix class (nummat) at
> http://www.dcs.ex.ac.uk/~aschmolc/Stuff/
> 
> 'as
> 
> The following are my notes:
> 
> 
> Things that fairly clearly seem to be bugs:
>     - numarray.Int32 etc. can't be pickled

Known limitation,  but OK.   Arrays can be pickled, as can Numeric
typecodes so I'm not sure how critical this omission is. 

>     - ``a = array(1+0j); a.imag = a.real * 10`` => IndexError
>     - array(0, type=Float64) + 1e3000  => `inf` with right error modes
>       but  array(0, type=Float32) + 1e3000 => `OverflowError`
>     - numarray.array(10)/numarray.array(0) => 0
>     - numarray.array(10000000000000L) => array(1316134912)
>     - numarray.where(0,1,0) => array([0])

There seems to be an infinity of rank-0 issues and so little
justification for having them that at one point we considered ripping
them out altogether.  Noted,  but low priority.

[Amen. If I had known the problems that rank-0 zero arrays would cause
I think I would have excluded them. I'm not sure I see the need for
them now that coercion rules have changed and helper functions to change
scalars into rank-1 len-1 arrays which serve almost all other purposes.
I'm interested in seeing what real purpose they serve now (I understand
the backward compatibility issue, but backward compatibility is not the
be all and end all for numarray; more on that later)] 
>     - l = [1,2,3]; numarray.put(l,numarray.array([1,2,0]),[0,0,0]); l 
> => [1, 2, 3]

Should raise a TypeError I guess. 
>       a = array([1,2,3]); 
> numarray.put(a,numarray.array([1,2,0]),[0,0,0]); a => array([0, 0, 0])

I don't see what's wrong here. 
>     - repr(numarray.array([],typecode='i')) (etc. etc.) => 
> "numarray.array([])"

Zero length arrays are rather like rank-0 arrays: low priority. 
Agreed... this is a small wart. 
>     - getattr(array([1,2,3]), '_aligned') => SystemError

Interesting.  I've been thinking about ripping out the _align and
_contiguous self-test hacks for a long time.  You've made up my mind. 
>     - obscure: numarray.where(0, matrix(568, convert_scalars=True),2) 
> =>
>       ValueError (tries __len__ which fails, as len(array(568)) also 
> fails)

I think this may boil down to "no where() for object arrays".
numarray.where() can't handle object arrays and there is no
numarray.objects.where().  Not implemented yet. 
> Numeric incompatiblilities (that are either undocumented or bug-like)

The best Numeric compatibility in numarray comes from:

import numarray.numeric as Numeric

It's still not perfect,  but it is more compatible than ordinary
numarray. 
> - numarray.array('a', typecode='O') => TypeError (object arrays)
> - for extra fun try: numarray.array(1, type=numarray.Object) -=> 
> RuntimeError
>   something entirely different

Object arrays in numarray do not have the synergy they have in
Numeric.   In particular,  numarray.array() can't create them, only
numarray.objects.array(). [At the time we added object arrays, we 
noticed that they were not safe in Numeric; that is, Numeric was not
properly handling reference counts of objects in arrays for at least
some operations and it was possible to segfault object arrays. This may
have changed since then; we haven't had a chance to check the current
status. But the point is that handling object arrays safely is a lot
more than just loading them with object pointers. Any function that can
set values in arrays needs to handle their refcounts, and that isn't all
that trivial. We took a short cut of using a Python implementation for
object arrays that doesn't have all the old functionality, but also
didn't have the problems that they did at the time.] 
> - nonzero is completely incompatible

numarray.numeric covers this.

numarray's nonzero() is more powerful, capable of handling
multidimensional arrays,  so it returns a tuple of values rather than a
single value.  It's unfortunate that we chose to use the name nonzero()
for the "new" function;  it has the right interface and the wrong name.
Keep in mind though,  our compatibility goals have grown immensely since
we started. 
> - shape(None) etc. no longer works (IMHO a bug)

This may be related to the object array synergy.   I think
numarray.asarray() is the problem here, since it doesn't know how to
create object arrays.
> - cross_correlate & average missing

I think cross_correlate is in numarray.convolve.correlate.  It was a
conscious choice not to put it in core numarray.  Average has never been
implemented and should be, especially since it has different semantics
than the mean() method. 
> - left_shift et al missing

These were renamed lshift and rshift.   Note that << works fine.
Synonyms should probably be added. 
> - numarray.sqrt(a,a) is None (*not* the result, as it used to be)

What do you want here?  What we have now is, IMO, correct. [Amen. This 
was intentionally changed from Numeric.] 
> - num.put(a, [0,1,2,3], [10,20]) style behavior seems unavailable 
> (without numarray.numeric)

I wasn't exactly sure what the expected behavior was for this,  but
guessed is was some kind of repeat.  If that's what the behavior was,
Perry and I don't really like it.  Besides,  numarray.numeric.put *is*
Numeric.put, modulo numarray underpinnings.

>   put(array([[ 0.,  1.,  2.], [ 3.,  4.,  5.]]), [1, 4], [10,40])
> fails

numarray.put() does have different semantics for multi-dimensional
destinations... you need multi-dimensional indexes (i.e. a tuple of
index arrays).  Again,  there's now numarray.numeric.put().

> - boolean testing (not even bool(array(0)) works; I'm not sure this is
> good)

[I am. This was a clear and explicit decision to not replicate Numeric 
behavior. I'm convinced that it is the right decision. There is just too
much confusion about what the truth value of an array should be. Helper
functions should be used to make it unambiguous.] 

> - Generally different handling of rank0-arrays; e.g. 
> ``type(num.array(1.0) +
>   0) is float``; one potentially very nasty gotcha are inplace 
> operations
>   (e.g. a**=2) which have totally different semantics for python 
> scalars and
>   rank0 arrays, which, unlike Attribute errors on ``a.shape``, can 
> lead to
>   nasty bugs in corner cases (e.g. when a reduction just infrequently 
> yields
>   scalar ``a``) -- I think this should be mentioned in a gotchas 
> section

We have areduce() for this case, which always returns an array. 
>   (another possible entry would be the need to use .copy() to **save**
> memory
>   on slicing and 1xN, Nx1 matrices versus vectors (people are not used
> to
>   thinking properly about rank from mathematical training or matlab
>   exposure)).

[You will need to elaborate about what you mean here. E.g., as to the 
first: I'm guessing you mean when a slice is taken and then the original
array is deleted. But it isn't clear.] 
> - asarray downcasts arrays (e.g.: asarray(array([1.,2.,3.]),'i'))

True enough.  Is there some reason why the method should silently
succeed (I know we wanted that) and the function should not? 
> - numarray.ones(-5) => MemoryError (ValueError would be nicer)

Easy to change. 
> - numarray.ones(2.0),

This fails, and that's fine by me.  The idea of floating point shapes
seems bogus. 
> numarray.ones([2])

AFIK, this works, and should work. 
> fail (cf. numarray.range(2.0))

IMHO, arange() is a special case and not really equivalent to
numarray.ones(). 
>       b=num.array([[1,2,3,4],[5,6,7,8]]*2)
>       assert eq(num.diagonal(b), [1,6,3,8])
>       assert eq(num.diagonal(b, -1), [5,2,7])
>       c = num.array([b,b])
>       assert eq(num.diagonal(c,1), [[2,7,4], [2,7,4]])
> - no a.toscalar() !!!

a.toscalar() is written a[()] in numarray.
[This is one method that shouldn't be there IMO. What would people 
expect it to do for arrays with  len>1 ?]   
> - matrixmultiply in the docs

OK. 
> - what's the point of swapaxes (i.e. why not have a generalized 
> in-place
>   transpose?)

It's a very common function in implementation of numarray/Numeric.
[In many cases it is far easier to use than an generalized transpose
(which does exist, but requires all axes to be explicitly given)] 
> - what's the point of innerproduct?

Compatibility. [For a while the flavor is: "dammit, why aren't you
compatible?" Now it's: "dammit, why are you compatible?"] 
> - indexing by a list is different from indexing by tuple (I haven't 
> had time
>   to look closely at the docs whether that's intentional)

It's intentional.  Indexing by a list is "array" indexing.  Indexing by
a tuple is not.  Thus, a 3D array by [1,2,3] is pulling out 2D blocks,
while (1,2,3) is pulling out a single scalar. [In particular, tuples
have a special meaning for indexing; this distinction is unavoidable 
since it is a Python language issue.] 
> - doesn't know about Numeric's bizzarre '\x0b' typecode

Me either.  Should we add this? [Not unless there is a good reason.
What's it for? Why are you using it (particularly since you called it
bizarre)?] 
> - numarray.sqrt.reduce([]) raises (sensibly) TypeError, not ValueError

Got lucky I guess. 
> - len(array(1)) or array(1)[0] won't work anymore (understandable, but
>   should be documented)

OK. 
> - (should maximim, minimum reduce to -inf and inf?)

Don't they? 
> - <built-in method reduce of _BinaryUFunc object at 0x82dfc9c> is not
>   a very helpful repr; should be possible to get to the ufunc itself

Doesn't this comment fly in the face of Python itself?
[I imagine it is possible, but why? repr(dir) doesn't give you a usable
function creator, nor does it work in Numeric.] 
> - as in Numeric numarray.maximum.reduce(numarray.array([0,-0.])) => 
> -0.0

Talk about fine points...  noted.  I think the problem is that 0.0 ==
-0.0,  so there's no way for the reduction to get it right without
adding special code to look for this case, and that isn't gonna happen
without a strong case being made. [Again, a very good case needs to be
made for handling this. I doubt that it is important to many, and as
Todd mentions, not easy to handle.] 
> - __array__ protocol no longer supported (how can a non-derived class 
> convert
>   itself efficiently to an array?)

Maybe an old-timer can explain how this worked for Numeric.  I think
this is only partially implemented in numarray and that maybe we need to
add a check for an __array__() method to numarray.array(). 
> Documentation Gotchas
> - p. 34 IMO row vector is used incorrectly; row and column vectors are
> really
>      matrices (i.e. have rank 2) so ``array([[1,2,3]])`` would be a 
> row vector

Sounds reasonable. 
> - No proper explanation of differences between Numeric and numarray,
> or
>   numarray.numeric module differences to proper (e.g. argmin)

If there is,  I don't know where it is.  Noted,  but I'm not really an
encyclopedia of these facts myself. 
> - No migration and best-practice advice (e.g. there should be a 
> standard way
>   for packages which work with both numarray and numeric as backends 
> to let
>   the user choose his preference; how about setting an environment var
> NumPy
>   or something?)

We're just working this out ourselves. [Let me elaborate more. We 
haven't really had much experience yet porting tons of Numeric code (MA
is about the only example). We are working on scipy now so I expect that
in a few months we will know much better what the most important porting
issues are. At the moment, this is better documented by others.] 
> Waffle
[meaning?] 
> ------
> 
> - there *really* ought to be an array equality function (with optional
>   tolerance); it's quite difficult to get right for are normal user 
> (nans;
>   zero-size arrays etc.) and it's often required, especially for 
> testing

You're right.  Want submit one? [Make sure it isn't dependent on the
underlying C compiler's libraries for testing floating point special 
values!] 
> - rank preserving reduction seems useful as an option would be nice --
> e.g. to
>   subtract out or divide by the reduced portion (which currently won't
> e.g.
>   work for columns without adding a unit-dimension by hand).

Sounds like an interesting idea,  but also method bloat. 
> Design
> 
>   The (AFAICS) benefit-free but downside-rich introduction of `type`
>   ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
> 
>   Is there any reason that Typecode objects that compare as desired to
> the
>   relevant strings ("i", "d") wouldn't have done? Now there is an 
> explosion
>   and confusion of interfaces -- some numpy code will now only except
>   type(code)s as "typecode" keyword parameter (even in numarray! see
>   numarray.mlab!) and other stuff
> 
>   Never mind that type already is a highly overused word in the python
> world.

Personally,  I like type because it's succinct and we have type objects,
not single character codes.  More importantly,  Perry likes type, and
the bottom line is that it's his shot to call and he's called it.
[We wrestled with this a while. Given that the representation of the
type had changed from a character code, typecode is clearly misleading
and inappropriate. It is there only for backward compatibility; for new
code to be used under numarray only, people shouldn't use it. Type
certainly seemed by far the most descriptive and accurate term. It does
have the drawback of overloading the type function. Other considerations
were things like atype, but type is what we went with.] 
>   The big method bloat.
>   '''''''''''''''''''''
> 
>   As it says in the Numeric manual introductions there were "good 
> reasons" for

I actually don't buy the reasons myself.  Some methods are natural,
convenient, and good so I need to hear more voices arguing this point
before I'll budge.  Clearly there is *some* bloat,  but identifying what
to ax is more difficult.  I suppose we could do a vote to clean this up.
>   "very few array methods" -- now there are **56** public methods and 
> 8 public
>   attributes (public == not starting with '_'); of those 56 methods 
> about 11
>   are accessors and of the rest about half are redundant or worse 
> (i.e. they
>   either also exist as numarray functions (argmin, argmax, diagonal, 
> ...) or

Which of the public attributes do you have a problem with?

Which accessors? 
>   they really ought to be functions (mean, stddev) or they are quite 
> confusing

The need for these is common so I thought it would be good to add them.
Functions could be added as well. 
>   (``a.min``, ``a.max``

These require tricks to get right so we added them.  The doc-strings
explain what they do. 
> which behave quite differenlty from ``a.argmin`` and
>   ``a.argmax``,

Good point.  These are inconsistent with min and max, which were added
independently at a later date.  I'm thinking we should deprecate the
argmin and argmax methods,  which I added hoping to do polymorphism for
strings and records and if I recall correctly never did anyway.

IMHO,  min(), max(), mean(), and stddev() are simple, useful, and should
remain. 
> never mind ``numarray.minimum``) or

min != minimum, and because it is a little tricky to get right, we
codified it as a method. 
> simply utterly pointless
>   (``a.nelements`` == ``a.size``)).

I added nelements() because I needed it and didn't know about
a.size()... simple as that.  a.size() came later for compatibility
only. [I'll argue that nelements is far clearer in meaning. What
does size mean? Total bytes? Total number of elements? Sorry,
I disagree on this one.] 
> If there really is a need for these (there isn't) if anything a.min 
> and a.max
>     should be called a.flatmin, a.flatmax

flatmin is certainly clear,  but the min/max docstrings also explain it
with no fuzz. 
>   - diagonal, mean, nelements, nonzero, ...

nonzero(), and diagonal() I could care less about so they
can probably be deprecated and removed.  I like mean(). 
>   - perversely the **only** function that I can think off that could 
> have
>     sensibly become a method hasn't: ``put`` (it used to work only on 
> arrays
>     under Numeric and not without reason, so making it a method would 
> have
>     been sensible; numarray.put of course also "works" on non-arrays, 
> it just
>     doesn't do anything with them)

Well,  we need the numarray.put() function for compatibility, and
there's already a more succinct syntax for put(), which is array based
indexing so I don't see any point in adding a put() method. 
>   Test Code
>   '''''''''
>   numtest.py doesn't inspire full confidence (it's about 1000 lines of
> actual
>   code but it doesn't seem that clearly structured and AFAICT contains
> no
>   single loop (and that despite the diversity of shapes, types etc. 
> that exist
>   in numarray -- why not try something slightly more systematic?))

Testing could certainly be better.  unittest might work better for this
kind of thing than doctest.  I agree that we should test for a wider
variety of shapes, types, sizes, and behaviors but it takes time and
effort to do it so it hasn't been done yet.  There's little doubt we'd
find bugs and the system would be better for it. [On the other hand,
is it the most important thing to do next? Any volunteers to improve the
test suite? It may not be the most complete and systematic one out 
there, but it's at least as good as the one for Numeric ;-)]

There's a lot of input here.  We'll see what we can do.  Thanks again.

Regards,
Todd

[A few more editorial comments. When we started numarray, compatibility 
was not high on the list of priorities, so the initial implementation
didn't focus on it. A number of the problems you point out reflect that
origin. While it is more important, it isn't the only guide. We seek
compatibility when there is no strong reason to be incompatible. But
there are a number of issues where we definitely wanted different
behavior (if it were to be completely compatible, we wouldn't have
bothered in the first place; we needed some changes).

Given the odd corners you've run into, it makes me curious to see the
code that generated this; particularly with regard to rank-0 arrays.
If I get a chance I'll take a look at the link you provided.
I wonder if it is typical of what other users will encounter or not.
I guess our experience in porting scipy will give us a better 
indication.

To summarize what we see as work that should be done to address the 
points
made:

rank-0 issues:
1) a.imag doesn't work
2) array(0, type=Float64) + 1e3000  => `inf` with right error modes
      but  array(0, type=Float32) + 1e3000 => `OverflowError`
3) numarray.array(10)/numarray.array(0) => 0
4) numarray.array(10000000000000L) => array(1316134912)
5) numarray.where(0,1,0) => array([0])
6) documentation of behavior (how to turn into scalar, that len and [0] 
indexing
      doesn't work, etc.)

Others

1) puts into lists should raise Type error
   l = [1,2,3]; numarray.put(l,numarray.array([1,2,0]),[0,0,0]); l => 
[1, 2, 3]
2) repr for zero length arrays needs to show type and other info.
3) rip out _align and _contiguous self-test hacks
4) improved object array handling (e.g., where and the like)
5) average function
6) change MemoryError to ValueError for ones(-5)
7) document matrixmultiply
8) support for __array__ protocol?
9) Documentation fix for p34 row vector usage.
10) Numeric to numarray conversion guide
11) Better tests

Most of these are not likely to get immediate attention as our focus 
now is on integrating scipy. To the extent they make it easier to do,
their priority may be raised. There are a lot of "should"s but we have
limited resources just like anyone else; we can't do it all at once.]





More information about the Numpy-discussion mailing list