[Numpy-discussion] Broadcasting rules (Ticket 76).
oliphant.travis at ieee.org
Wed Apr 26 20:30:12 CDT 2006
> In my view the most appealing feature in Python is
> the Zen of Python <http://www.python.org/doc/Humor.html#zen>" and in
> particular "There should be one-- and preferably only one --obvious
> way to do it." In my view Python represents the "hard science"
> approach appealing to physics and math types while Perl is more of a
> "soft science" language.
Interesting analogy. I've not heard that expression before.
> Unfortunately, it is the fact of life that there are
> always many ways to solve the same problem and a successful "pythonic"
> design has to pick one (preferably the best) of the possible ways and
> make it obvious.
And it's probably impossible to agree as to what is "best" because of
the different uses that array's receive. That's one reason I'm
anxious to get a basic structure-only basearray into Python itself.
> This said, let me present a specific problem that I will use to
> illustrate my points below. Suppose we study school statistics in
> different cities. Let city A have 10 schools with 20 classes and 30
> students in each. It is natural to organize the data collected about
> the students in a 10x20x30 array. It is also natural to collect some
> of the data at the per-school or per-class level. This data may come
> from aggregating student level statistics (say average test score) or
> from the characteristics that are class or school specific (say the
> grade or primary language). There are two obvious ways to present
> such data. 1) We can use 3-d arrays for everything and make the shape
> of the per-class array 10x20x1 and the shape of per-school array
> 10x1x1; and 2) use 2-d and 1-d arrays. The first approach seems to be
> more flexible. We can also have 10x1x30 or 1x1x30 arrays to represent
> data which varies along the student dimension, but is constant across
> schools or classes. However, this added benefit is illusory: the
> first student in one class list has no relationship to the first
> student in the other class, so in this particular problem an average
> score of the first student across classes makes no sense (it will also
> depend on whether students are ordered alphabetically or by an
> achievement rank).
> On the other hand this approach has a very significant drawback:
> functions that process city data have no way to distinguish between
> per-school data and a lucky city that can afford educating its
> students in individual classes. Just as it is extremely unlikely to
> have one student per class in our toy example, in real-world problems
> it is not unreasonable to assume that dimension of size 1 represents
> aggregate data. A software designed based on this assumption is what
> I would call broken in a subtle way.
I think I see what you are saying. This is a very specific
circumstance. I can verify that the ndarray has not been designed to
distinguish such hierarchial data. You will never be able to tell from
the array itself if a dimension of length 1 means aggregate data or
not. I don't see that as a limitation of the ndarray but as evidence
that another object (i.e. an R-like data-frame) should probably be
used. Such an object could even be built on top of the ndarray.
>> I don't think anyone is fundamentally opposed to multiple repetitions.
>> We're just being cautious. Also, as you've noted, the assignment code
>> is currently not using the ufunc broadcasting code and so they really
>> aren't the same thing, yet.
> It looks like there is a lot of development in this area going on at
> the moment. Please let me know if I can help.
Well, I did some refactoring to make it easier to expose the basic
concept of the ufunc elsewhere:
1) Adjusting the inputs to a common shape (this is what I call
broadcasting --- it appears to me that you use the term a little more
2) Setting up iterators to iterate over all but the longest dimension so
that the inner loop is done.
These are the key ingredients to a fast ufunc. There is 1 more
optimization in the ufunc machinery for the contiguous case (when the
inner loop is all that is needed) and then there is code to handle the
buffering needed for unaligned and/or byte-swapped data.
The final thing that makes a ufunc is the precise signature of the inner
loop. Every inner loop as the same signature. This signature does
not contain a slot for the length of the array element (that's a big
reason why variable-length arrays are not supported in ufuncs). The
ufuncs could be adapted, of course, but it was a bigger fish than I
wanted to try and fry pre 1.0
Note, though, that I haven't used these concepts yet to implement
ufunc-like copying. The PyArray_Cast function will also need to be
adjusted at the same time and this could actually prove more difficult
as it must implement buffering. Of course it could give us a chance to
abstract-out the buffered, broadcasted call as well. That might make a
useful C-API function.
Any help you can provide would be greatly appreciated. I'm focused
right now on the scalar math module as without it, NumPy is still slower
for people that use a lot of array elements.
>>> In my experience broadcasting length-1 and not broadcasting other
>>> lengths is very error prone as it is.
>> That's not been my experience.
> I should have been more specific. As I explained above, the special
> properties of length-1 led me to design a system that distinguished
> aggregate data by testing for unit length. This system was subtly
> broken. In a rare case when the population had only one element, the
> system was producing wrong results.
Yes I can see that now. Your comments make a lot more sense. Trying
to use ndarray's to represent hierarchial data can cause these subtle
issues. The ndarray is a "flat" object in the sense that every element
is seen as "equal" to every other element.
>> dim(x) <- c(2,5)
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0 0 0 0 0
> [2,] 0 0 0 0 0
> (R uses Fortran order). Broadcasting ignores the dim attribute, but
> does the right thing for conformable vectors:
Thanks for the description of R.
>> x + c(1,2)
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 1 1 1 1
> [2,] 2 2 2 2 2
> However, the following is unfortunate:
Ahh... So, it looks like R does on arithmetic what NumPy copying is
currently doing (treating both as flat spaces to fill).
> Sorry I was not specific in the original post. I hope you now
> understand where I come from. Can you point me to some examples of
> the correct way to use dimension-preserving broadcasting? I would
> assume that it is probably more useful in the problem domains where
> there is no natural ordering of the dimensions, unlike in the
> hierarchial data example that I used.
Yes, the ndarray does not recognize any natural ordering to the
dimensions at all. Every dimension is "equal." It's designed to be a
very basic object.
I'll post some examples later. I've got to go right now.
> The dimension-increasing broadcasting is very natural when you deal
> with hierarchical data where various dimensions correspond to the
> levels of aggregation. As I explained above, average student score
> per class makes sense while the average score per student over classes
> does not. It is very common to combine per-class data with
> per-student data by broadcasting per-class data. For example, the
> total time spent by student is a sum spent in regular per-class
> session plus individual elected courses.
I think you've hit on something here regarding the use of an array for
"hierachial" data. I'm not sure I understand the implications entirely,
but at least it helps me a little bit see what your concerns really are.
> I hope you understand that I did not mean to criticize anyone's coding
> style. I was not really hinting at optimization issues, I just had a
> particular design problem in mind (see above).
I do understand much better now. I still need to think about the
hierarchial case a bit more. My basic concept of an array which
definitely biases me is a medical imaging volume.... (i.e. the X-ray
density at each location in 3-space).
I could use improved understanding of how to use array's effectively in
hierarchies. Perhaps we can come up with some useful concepts (or maybe
another useful structure that inherits from the basearray) and can
therefore share data effectively with the ndarray....
> In the spirit of appealing to obscure languages ;-), let me mention
> that in the K language (kx.com) element assignment is implemented
> using an Amend primitive that takes four arguments: @[x,i,f,y] id more
> or less equivalent to numpy's x[i] = f(x[i], y[i]), where x, y and i
> are vectors and f is a binary (broadcasting) function. Thus, x[i] +=
> y[i] can be written as @[x,i,+,y] and x[i] = y[i] is @[x,i,:,y], where
> ':' denotes a binary function that returns it's second argument and
> ignores the first. K interpretor's Linux binary is less than 200K and
> that includes a simple X window GUI! Such small code size would not be
> possible without picking the right set of primitives and avoiding
> special case code.
Not to mention limiting the number of data-types :-)
> I know close to nothing about variable length arrays. When I need to
> deal with the relational database data, I transpose it so that each
> column gets into its own fixed length array.
Yeah, that was my strategy too and what I always suggested to the
numarray folks who wanted the variable-length arrays. But,
memory-mapping can't be done that way....
> This is the approach
> that both R and K take. However, at least at the C level, I don't see
> why ufunc code cannot be generalized to handle variable length arrays.
They of course, could be, it's just more re-factoring than I wanted to
do. The biggest issue is the underlying 1-d loop function signature.
I hesitated to change the signature because that would break
compatibility with Numeric extension modules that defined ufuncs (like
scipy-special...) The length could piggy-back in the data argument
passed into those functions, but doing that right was more work than I
wanted to do. If you solve that problem, everything else could be
made to work without too much trouble.
> At the python level, pre-defined arithmetic or math functions are
> probably not feasible for variable length, but the ability to define a
> variable length array function by just writing an inner loop
> implementation may be quite useful.
Yes, it could have helped write the string comparisons much faster :-)
>> However, the broadcasting machinery has been abstracted in NumPy and can
>> therefore be re-used in the copying code. In Numeric, broadcasting was
>> basically implemented deep inside a confusing while loop.
> I've never understood the Numeric's while loop and completely agree
> with your characterization. I am still studying the numpy code, but
> it is clearly a big improvement.
Well, it's more straightforward because I'm not the genius Jim Hugunin
is. It makes heavy use of the iterator concept which I finally grok'd
while trying to write things (and realized I had basically already
implemented in writing the old scipy.vectorize).
I welcome many more eyes on the code. I know I've taken shortcuts in
places that should be improved.
Thanks for your continued help and useful comments.
More information about the Numpy-discussion