[Numpy-discussion] Broadcasting rules (Ticket 76).

Travis Oliphant oliphant.travis at ieee.org
Wed Apr 26 20:30:12 CDT 2006


Sasha wrote:

> In my view the most appealing feature in Python is
> the "Zen of Python" <http://www.python.org/doc/Humor.html#zen> and in
> particular "There should be one-- and preferably only one --obvious
> way to do it."  In my view Python represents the "hard science"
> approach appealing to physics and math types while Perl is more of a
> "soft science" language. 
Interesting analogy.  I've not heard that expression before. 
> Unfortunately, it is the fact of life that there are
> always many ways to solve the same problem and a successful "pythonic"
> design has to pick one (preferably the best) of the possible ways and
> make it obvious.
>   
And it's probably impossible to agree as to what is "best"  because of 
the different uses that arrays receive.    That's one reason I'm 
anxious to get a basic structure-only basearray into Python itself. 
> This said, let me present a specific problem that I will use to
> illustrate my points below.  Suppose we study school statistics in
> different cities.  Let city A have 10 schools with 20 classes and 30
> students in each.  It is natural to organize the data collected about
> the students in a 10x20x30 array.  It is also natural to collect some
> of the data at the per-school or per-class level.  This data may come
> from aggregating student level statistics (say average test score) or
> from the characteristics that are class or school specific (say the
> grade or primary language).  There are two obvious ways to present
> such data. 1) We can use 3-d arrays for everything and make the shape
> of the per-class array 10x20x1 and the shape of per-school array
> 10x1x1; and 2) use 2-d and 1-d arrays.  The first approach seems to be
> more flexible.  We can also have 10x1x30 or 1x1x30 arrays to represent
> data which varies along the student dimension, but is constant across
> schools or classes.  However, this added benefit is illusory: the
> first student in one class list has no relationship to the first 
> student in the other class, so in this particular problem an average
> score of the first student across classes makes no sense (it will also
> depend on whether students are ordered alphabetically or by an
> achievement rank).
>
> On the other hand this approach has a very significant drawback:
> functions that process city data have no way to distinguish between
> per-school data and a lucky city that can afford educating its
> students in individual classes.  Just as it is extremely unlikely to
> have one student per class in our toy example, in real-world problems
> it is not unreasonable to assume that dimension of size 1 represents
> aggregate data.  Software designed based on this assumption is what
> I would call broken in a subtle way.
>   

I think I see what you are saying.  This is a very specific 
circumstance.  I can verify that the ndarray has not been designed to 
distinguish such hierarchical data.  You will never be able to tell from 
the array itself if a dimension of length 1 means aggregate data or 
not.   I don't see that as a limitation of the ndarray but as evidence 
that another object (i.e. an R-like data-frame) should probably be 
used.  Such an object could even be built on top of the ndarray.
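
To make that concrete with your school example (just a toy sketch, with
shapes made up to match your numbers):

    import numpy as np

    # per-student scores: 10 schools x 20 classes x 30 students
    scores = np.zeros((10, 20, 30))

    # per-class aggregate kept as a 3-d array (your approach 1)
    per_class = scores.mean(axis=2).reshape(10, 20, 1)

    # a (hypothetical) city that really does have one student per class
    tiny_city = np.zeros((10, 20, 1))

    # nothing in the arrays themselves tells the two apart
    print(per_class.shape == tiny_city.shape)     # True

    # dimension-preserving broadcasting treats both exactly the same
    deviation = scores - per_class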

>> [...]
>> I don't think anyone is fundamentally opposed to multiple repetitions.
>> We're just being cautious.   Also, as you've noted, the assignment code
>> is currently not using the ufunc broadcasting code and so they really
>> aren't the same thing, yet.
>>     
>
> It looks like there is a lot of development in this area going on at
> the moment.  Please let me know if I can help.
>   

Well, I did some refactoring to make it easier to expose the basic 
concept of the ufunc elsewhere:

1) Adjusting the inputs to a common shape  (this is what I call 
broadcasting --- it appears to me that you use the term a little more 
loosely)

2) Setting up iterators to iterate over all but the longest dimension so 
that only the inner 1-d loop over that remaining dimension is left to do.

These are the key ingredients to a fast ufunc.  There is one more 
optimization in the ufunc machinery for the contiguous case (when the 
inner loop is all that is needed) and then there is code to handle the 
buffering needed for unaligned and/or byte-swapped data.
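
In Python terms, 1) amounts to something like the following (a rough
sketch of the shape negotiation only, not the actual C code):

    import numpy as np

    a = np.zeros((10, 20, 1))
    b = np.zeros((20, 30))

    # step 1: the common shape the inputs get adjusted to; np.broadcast
    # does the negotiation without copying any data
    common_shape = np.broadcast(a, b).shape
    print(common_shape)                       # (10, 20, 30)

    # step 2 (conceptually): iterate over everything but the inner
    # dimension and hand 1-d pieces to the inner loop
    out = np.empty(common_shape)
    for i in range(common_shape[0]):
        for j in range(common_shape[1]):
            out[i, j, :] = a[i, j, :] + b[j, :]      # the "inner loop"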

The final thing that makes a ufunc is the precise signature of the inner 
loop.    Every inner loop has the same signature.   This signature does 
not contain a slot for the length of the array element (that's a big 
reason why variable-length arrays are not supported in ufuncs).  The 
ufuncs could be adapted, of course, but it was a bigger fish than I 
wanted to try and fry pre 1.0.
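
Very roughly, every inner loop looks like this Python paraphrase (the
real thing is a C function that gets raw pointers and byte strides, but
the shape of it is the same):

    import numpy as np

    def add_loop(args, dimensions, steps, data=None):
        # args: flat views of the two inputs and the output
        # dimensions[0]: how many elements this 1-d pass covers
        # steps: per-element strides (bytes in the real C loop)
        # note: no slot anywhere for a per-element item size, which is
        # why variable-length items don't fit this scheme
        in1, in2, out = args
        n = dimensions[0]
        s1, s2, so = steps
        for k in range(n):
            out[k * so] = in1[k * s1] + in2[k * s2]

    # toy call on two contiguous vectors
    x = np.arange(5.0); y = np.ones(5); z = np.empty(5)
    add_loop((x, y, z), (5,), (1, 1, 1))
    print(z)                                  # [1. 2. 3. 4. 5.]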

Note, though, that I haven't used these concepts yet to implement 
ufunc-like copying.   The PyArray_Cast function will also need to be 
adjusted at the same time and this could actually prove more difficult 
as it must implement buffering.   Of course it could give us a chance to 
abstract-out the buffered, broadcasted call as well.  That might make a 
useful C-API function.   

Any help you can provide would be greatly appreciated.   I'm focused 
right now on the scalar math module since, without it, NumPy is still 
slower for people who work with a lot of individual array elements. 
>> [...]
>>     
>>> In my experience broadcasting length-1 and not broadcasting other
>>> lengths is very error prone as it is.
>>>       
>> That's not been my experience.
>>     
>
> I should have been more specific.  As I explained above, the special
> properties of length-1 led me to design a system that distinguished
> aggregate data by testing for unit length.  This system was subtly
> broken.  In a rare case when the population had only one element, the
> system was producing wrong results.
>   
Yes, I can see that now.   Your comments make a lot more sense.  Trying 
to use ndarrays to represent hierarchical data can cause these subtle 
issues.  The ndarray is a "flat" object in the sense that every element 
is seen as "equal" to every other element. 

>> dim(x) <- c(2,5)
>> x
>>     
>      [,1] [,2] [,3] [,4] [,5]
> [1,]    0    0    0    0    0
> [2,]    0    0    0    0    0
>
> (R uses Fortran order).  Broadcasting ignores the dim attribute, but
> does the right thing for conformable vectors:
>
>   

Thanks for the description of R.

>> x + c(1,2)
>>     
>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    1    1    1    1
> [2,]    2    2    2    2    2
>
> However, the following is unfortunate:
>   
Ahh...   So it looks like R's arithmetic does what NumPy's copying is 
currently doing (treating both as flat spaces to fill).
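
If I have it right, the "flat fill" your R example shows would look
something like this in NumPy terms (just an illustration of the
recycling, not code anyone should actually write):

    import numpy as np

    x = np.zeros((2, 5))
    v = np.array([1, 2])

    # recycle v over the Fortran-ordered flattening of x, the way R does
    flat = x.ravel(order='F') + np.resize(v, x.size)
    print(flat.reshape(x.shape, order='F'))
    # [[1. 1. 1. 1. 1.]
    #  [2. 2. 2. 2. 2.]]

    # dimension-preserving broadcasting refuses the same operation:
    # x + v raises a ValueError because the trailing dimensions
    # (5 and 2) don't match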

>> x
>>     
> Sorry I was not specific in the original post.  I hope you now
> understand where I come from.  Can you point me to some examples of
> the correct way to use dimension-preserving broadcasting?  I would
> assume that it is probably more useful in the problem domains where
> there is no natural ordering of the dimensions, unlike in the
> hierarchical data example that I used.
>   

Yes,  the ndarray does not recognize any natural ordering to the 
dimensions at all.  Every dimension is "equal."  It's designed to be a 
very basic object.

I'll post some examples later.  I've got to go right now.

> The dimension-increasing broadcasting is very natural when you deal
> with hierarchical data where various dimensions correspond to the
> levels of aggregation.  As I explained above, average student score
> per class makes sense while the average score per student over classes
> does not.  It is very common to combine per-class data with
> per-student data by broadcasting per-class data.  For example, the
> total time spent by a student is the sum of time spent in regular
> per-class sessions plus individually elected courses.
>   

I think you've hit on something here regarding the use of an array for 
"hierachial" data.  I'm not sure I understand the implications entirely, 
but at least it helps me a little bit see what your concerns really are.
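
If I follow you, the kind of combination you mean looks something like
this (numbers invented, names hypothetical):

    import numpy as np

    # 10 schools x 20 classes x 30 students
    elective_hours = np.random.rand(10, 20, 30)   # per-student data
    class_hours = np.random.rand(10, 20)          # per-class data

    # dimension-increasing broadcast: give the per-class data a
    # length-1 student axis and let it repeat across the 30 students
    total_hours = elective_hours + class_hours[:, :, np.newaxis]
    print(total_hours.shape)                      # (10, 20, 30)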


> I hope you understand that I did not mean to criticize anyone's coding
> style.  I was not really hinting at optimization issues, I just had a
> particular design problem in mind (see above).  
I do understand much better now.  I still need to think about the 
hierarchical case a bit more.  My basic concept of an array, which 
definitely biases me, is a medical imaging volume.... (i.e. the X-ray 
density at each location in 3-space). 

I could use improved understanding of how to use arrays effectively in 
hierarchies.  Perhaps we can come up with some useful concepts (or maybe 
another useful structure that inherits from the basearray) and can 
therefore share data effectively with the ndarray....


> In the spirit of appealing to obscure languages ;-), let me mention
> that in the K language (kx.com) element assignment is implemented
> using an Amend primitive that takes four arguments: @[x,i,f,y] is more
> or less equivalent to numpy's x[i] = f(x[i], y[i]), where x, y and i
> are vectors and f is a binary (broadcasting) function.  Thus, x[i] +=
> y[i] can be written as @[x,i,+,y] and x[i] = y[i] is @[x,i,:,y], where
> ':' denotes a binary function that returns its second argument and
> ignores the first. The K interpreter's Linux binary is less than 200K and
> that includes a simple X window GUI! Such small code size would not be
> possible without picking the right set of primitives and avoiding
> special case code.
>   

Not to mention limiting the number of data-types :-)
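
For what it's worth, my reading of that Amend in NumPy terms is roughly
this (a sketch only, and I may be glossing over how y lines up with i):

    import numpy as np

    def amend(x, i, f, y):
        # @[x,i,f,y]: combine the selected elements of x with y using f
        x[i] = f(x[i], y)

    x = np.zeros(5)
    amend(x, np.array([1, 3]), np.add, np.array([10.0, 20.0]))
    print(x)                                  # [ 0. 10.  0. 20.  0.]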


> I know close to nothing about variable length arrays.  When I need to
> deal with the relational database data, I transpose it so that each
> column gets into its own fixed length array. 
Yeah, that was my strategy too and what I always suggested to the 
numarray folks who wanted the variable-length arrays.   But, 
memory-mapping can't be done that way....

>  This is the approach
> that both R and K take.  However, at least at the C level, I don't see
> why ufunc code cannot be generalized to handle variable length arrays.
>   
They could be, of course; it's just more re-factoring than I wanted to 
do.   The biggest issue is the underlying 1-d loop function signature.  
I hesitated to change the signature because that would break 
compatibility with Numeric extension modules that defined ufuncs (like 
scipy-special...)  The length could piggy-back in the data argument 
passed into those functions, but doing that right was more work than I 
wanted to do.   If you solve that problem,  everything else could be 
made to work without too much trouble.

>  At the python level, pre-defined arithmetic or math functions are
> probably not feasible for variable length, but the ability to define a
> variable length array function by just writing an inner loop
> implementation may be quite useful.
>   
Yes, it could have helped write the string comparisons much faster :-)

>> However, the broadcasting machinery has been abstracted in NumPy and can
>> therefore be re-used in the copying code.  In Numeric, broadcasting was
>> basically implemented deep inside a confusing while loop.
>>     
>
> I've never understood Numeric's while loop and completely agree
> with your characterization.  I am still studying the numpy code, but
> it is clearly a big improvement.
>   

Well, it's more straightforward because I'm not the genius Jim Hugunin 
is.  It makes heavy use of the iterator concept, which I finally grok'd 
while trying to write things (and realized I had basically already 
implemented while writing the old scipy.vectorize). 
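
The broadcast object gives a Python-level peek at that iterator idea, if
it helps (a minimal example, nothing more):

    import numpy as np

    a = np.array([[1], [2], [3]])      # shape (3, 1)
    b = np.array([10, 20])             # shape (2,)

    # walk both operands in lock-step over their common (3, 2) shape
    it = np.broadcast(a, b)
    print(it.shape)                    # (3, 2)
    for u, v in it:
        print(u + v)                   # 11 21 12 22 13 23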

I welcome many more eyes on the code.   I know I've taken shortcuts in 
places that should be improved.

Thanks for your continued help and useful comments.


-Travis






