[Numpy-discussion] Speeding up numarray -- questions on its design
oliphant at ee.byu.edu
Tue Jan 18 10:28:37 CST 2005
Thanks for the comments that have been made. One of my reasons for
commenting is to get an understanding of which design issues of Numarray
are felt to be important and which can change. There seems to be this
idea that small arrays are not worth supporting. I hope this is just
due to time-constraints and not some fundamental idea that small arrays
should never be considered with Numarray. Otherwise, there will
always be two different array implementations developing at their own pace.
I really want to gauge how willing developers of numarray are to
Perry Greenfield wrote:
>> 1) Are there plans to move the nd array entirely into C?
>> -- I would like to see the nd array become purely a c-type. I would
>> be willing to help here. I can see that part of the work has been done.
> I don't know that I would say they are definite, but I think that at
> some point we thought that would be necessary. We haven't yet since
> doing so makes it harder to change so it would be one of the last
> changes to the core that we would want to do. Our current priorities
> are towards making all the major libraries and packages available
> under it first and then finishing optimization issues (another issue
> that has to be tackled soon is handling 64-bit addressing; apparently
> the work to make Python sequences use 64-bit addresses is nearing
> completion so we want to be able to handle that. I expect we would
> want to make sure we find a way of handling that before we turn it
> all into C but maybe it is just as easy doing them in the opposite
I do not think it would be difficult at this point to move it all to C
and then make future changes there (you can always call pure Python code
from C). With the structure in place and some experience behind you,
now seems like as good a time as any. Especially, because now is a
better time for me than any... I like what numarray is doing by not
always defaulting to ints with the maybelong type. It is a good idea.
>> 2) Why is the ND array C-structure so large? Why are the dimensions
>> and strides array static? Why can't the extra stuff that the fancy
>> arrays need be another structure and the numarray C structure just
>> extended with a pointer to the extra stuff?
> When Todd moved NDArray into C, he tried to keep it simple. As such, it
> has no "moving parts." We think making dimensions and strides malloc'ed
> rather than static would be fairly easy. Making the "extra stuff"
> variable is something we can look at.
But allocating dimensions and strides when needed is not difficult and
it reduces the overhead of the ndarray object. Currently, that overhead
seems extreme. I could be over-reacting here, but it just seems like it
would have made more sense to expand the array object as little as
possible to handle the complexity that you were searching for. It seems
like more modifications were needed in the ufunc than in the arrayobject.
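To make the storage argument concrete, here is a rough sketch that uses Python's ctypes to model the two layouts under discussion. The struct names, the field list, and the MAXDIM value of 40 are assumptions chosen for illustration; they are not copied from the numarray headers, and the figures ignore allocator bookkeeping.

```python
import ctypes

MAXDIM = 40  # assumed compile-time cap on the number of dimensions


class StaticArray(ctypes.Structure):
    """Hypothetical header with fixed-size dimension and stride arrays."""
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("dimensions", ctypes.c_long * MAXDIM),
        ("strides", ctypes.c_long * MAXDIM),
    ]


class DynamicArray(ctypes.Structure):
    """Hypothetical header that stores only pointers to per-object arrays."""
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("dimensions", ctypes.POINTER(ctypes.c_long)),
        ("strides", ctypes.POINTER(ctypes.c_long)),
    ]


def static_footprint():
    """Bytes used per array when the arrays are embedded in the header."""
    return ctypes.sizeof(StaticArray)


def dynamic_footprint(nd):
    """Header size plus two separately allocated blocks of nd entries each."""
    return ctypes.sizeof(DynamicArray) + 2 * nd * ctypes.sizeof(ctypes.c_long)


if __name__ == "__main__":
    for nd in (1, 2, 3):
        print(nd, dynamic_footprint(nd), static_footprint())
```

For the low-dimensional arrays that dominate everyday use, the pointer-based layout comes out several times smaller per object, at the cost of one extra allocation when the array is created.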
> The bottom line is that adding the variability adds complexity and we're
> not sure we understand the storage economics of why we would be doing it.
> Numarray was designed, first and foremost, for large arrays.
Small arrays are never going to disappear (Fernando Perez has an
excellent example) and there are others. A design where a single
pointer not being NULL is all that is needed to distinguish "simple"
Numeric-like arrays from "fancy" numarray-like arrays seems like a great
way to make sure that
> For that case,
> the array struct size is irrelevant whereas additional complexity is
> not. I guess we would like to see some good practical examples where
> the array struct size matters. Do you have code with hundreds of
> small arrays existing simultaneously?
As mentioned before, such code exists especially when arrays become a
basic datatype that you use all the time. How much complexity is
really generated by offloading the extra struct material to a bigarray
structure, thereby only increasing the Numeric array structure by 4
bytes instead of 200+?
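The single-pointer design can be sketched as follows, again modelling the C layout with Python's ctypes. Everything named here (BasicArray, ExtendedArray, FancyExtras, and its fields) is hypothetical and only illustrates the idea: a NULL extras pointer marks a "simple" Numeric-like array, and a non-NULL pointer leads to the additional state a "fancy" array needs.

```python
import ctypes


class FancyExtras(ctypes.Structure):
    """Hypothetical state that only 'fancy' arrays would carry."""
    _fields_ = [
        ("byteswapped", ctypes.c_int),
        ("misaligned", ctypes.c_int),
        ("byteoffset", ctypes.c_long),
    ]


class BasicArray(ctypes.Structure):
    """Hypothetical Numeric-like header."""
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("dimensions", ctypes.POINTER(ctypes.c_long)),
        ("strides", ctypes.POINTER(ctypes.c_long)),
    ]


class ExtendedArray(ctypes.Structure):
    """The same header plus one pointer to the optional extra state."""
    _fields_ = BasicArray._fields_ + [
        ("extras", ctypes.POINTER(FancyExtras)),
    ]


def is_simple(arr):
    """A NULL extras pointer is falsy in ctypes, so this tests for NULL."""
    return not arr.extras


if __name__ == "__main__":
    plain = ExtendedArray()            # pointer fields start out as NULL
    extras = FancyExtras(byteswapped=1)
    fancy = ExtendedArray()
    fancy.extras = ctypes.pointer(extras)
    print(is_simple(plain), is_simple(fancy))
    print(ctypes.sizeof(ExtendedArray) - ctypes.sizeof(BasicArray))
```

The size difference printed at the end is one pointer, which is 4 bytes on a 32-bit build and 8 on a 64-bit one, so a simple array pays only that much for the option of becoming fancy.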
On another fundamental note, numarray is being sold as a replacement for
Numeric. But, then, on closer inspection many things that Numeric does
well, numarray is ignoring or not doing very well. I think this
presents a certain amount of false advertising to new users, who don't
understand the history. Most of them would probably never need the
fanciness that numarray provides and would be quite satisfied with
Numeric. They just want to know what others are using. I think it is
a disservice to call numarray a replacement for Numeric until it
actually is. It should currently be called an "alternative
implementation" focused on large arrays. This (unintentional) sleight of
hand that has been occurring over the past year has been my biggest
complaint with numarray. Making numarray a replacement for Numeric
means that it has to support small arrays, object arrays, and ufuncs at
least as well as but preferably better than Numeric. It should also be
faster than Numeric whenever possible, because Numeric has lots of
potential optimizations that have never been applied. If numarray does
not do these things, then in my mind it cannot be a replacement for
Numeric and should stop being called that on the numpy web site.
>> 3) There seem to be too many files to define the array. The mixture of
>> Python and C makes trying to understand the source very difficult. I
>> thought one of the reasons for the re-write was to simplify the source
> I think this reflects the transitional nature of going from mostly Python
> to a hybrid. We agree that the current state is more convoluted than it
> ought to be. If NDarray were all C, much of this would end (though in
> some respects, being all in C will make it larger, harder to understand
> as well). The original hope was that most of the array setup computation
> could be kept in Python but that is what made it slow for small arrays
> (but it did allow us to implement it reasonably quickly with big array
> performance so that we could start using it for our own projects without
> a long development effort). Unfortunately, the simplification in the
> rewrite is offset by handling the more complex cases (byte-swapping,
> etc.) and extra array indexing capabilities.
I never really understood the "code is too complicated" argument
anyway. I was just wondering if there is some support for reducing the
number of source code files, or reorganizing them a bit.
>> 4) Object arrays must be supported. This was a bad oversight and an
>> important feature of Numeric arrays.
> The current implementation does support them (though in a different
> way, and generally not as efficiently, though Todd is more up on the
> details here). What aspect of object arrays are you finding lacking?
I did not see such support when I looked at it, but given the previous
comment, I could easily have missed where that support is provided. I'm
mainly following up on Konrad's comment that his automatic
differentiation code does not work with Numarray because of the missing
support for object arrays. There are other applications for object
arrays as well. Most of the support needs to come from the ufunc side.
>> 5) The ufunc code interface needs to continue to be improved. I do see
>> that some effort into understanding the old ufunc interface has taken
>> place which is a good sign.
> You are probably referring to work underway to integrate with scipy (I'm
> assuming you are looking at the version in CVS).
Yes, I'm looking at the CVS version.
>> Again, thanks to the work that has been done. I'm really interested to
>> see if some of these modifications can be done as in my mind it will
>> help the process of unifying the two camps.
> I'm glad to see that you are taking a look at it and welcome the
> comments and any offers of help in improving speed.
I would be interested in helping if there is support for making
numarray a real replacement for Numeric, by addressing the concerns that
I've outlined. As stated at the beginning, I'm really just looking
for how receptive numarray developers would be to the kinds of changes
I'm talking about: (1) reducing the size of the array structure, (2)
moving the ndarray entirely into C, (3) improving support for object
arrays, (4) improving ufunc API support.
I care less about array and ufunc C-API names being the same than the
underlying capabilities being available.