[Numpy-discussion] packaging scipy (was Re: Simple financial functions for NumPy)
Joe Harrington
jh@physics.ucf....
Fri Apr 4 14:27:37 CDT 2008
Every once in a while the issue of how to split things into packages
comes up. In '04, I think, we had such a discussion regarding scipy
(with Numeric as its base at the time). One idea was a
core-plus-many-modules approach. We could then have metapackages that
just consisted of dependencies and would draw in the packages that
were needed for a given application. A user would install just one
metapackage (one of which would install every package), or could roll
their own.
Then Travis took as much of scipy as was reasonable to maintain at
once, took a chunk of numarray, and made numpy. Numpy is now a bit
more than what I would have thought of as "core" functionality.
One might consider trimming Travis's collection and moving many of the
functions out to add-ons in scipy. The question is where to draw the
line in the sand with regard to where a given function belongs, and
the problem is that the line is very hard to draw in a way that
pleases many people and mortally offends very few. Certainly lots of
content areas would straddle the divide and require both a package in
scipy and representation in the core.
I would think of core functionality as what is now the ndarray class
(including all slicing and broadcasting); things to make and index
arrays like array, zeros, ones, where and some truth tests; the most
elementary math functions, such as add, subtract, multiply, divide,
power, sin, cos, tan, asin, acos, atan, log, log10; sum, mean, median,
standard deviation; the constants pi, e, NaN, Inf, -Inf; a few simple
file I/O functions including loadtxt, and not a whole lot else. This
is an incomplete list but you get the idea. Everything else would be
in optional packages, possibly broken down by topic.
Now ask yourself, what would you add to this to make your core? Would
you take anything out? And the kicker: do you really think much of
the community would agree with any one of our lists, including mine?
Almost all such very-light collections would require users to load
additional packages. For example, complex things like FFTs and random
numbers would not belong in the core outlined above, nor would linear
algebra, masked arrays, fitting, most stats, or financial functions.
There is one division that does make sense, and that would be to
distribute ONLY ndarray as the core, and have NO basic math functions,
etc., in the core. Then you have to load something, a lot of things,
to make it useful, but you don't have the question of what should be
in which package. But ndarray is already a stand-alone item.
At that point you have to ask, if the core is so small that *everyone*
has to load an add-on, what's the point of making the division? You
can argue that it's easier maintenance-wise, but I'm not certain that
having many packages to build, test, and distribute is easier. Travis
already made a decision based on maintenance, and it seems to be
working.
That brings us to the motivation for dividing in the first place. I
think people like the idea because we're all scientists and we like to
categorize things. We like neat little packages. But we're not
thinking about the implications for the code we'd actually write.
Wouldn't you rather do:
import numpy as N
...
c = (N.sin(b) + N.exp(d)) / N.mean(g)
rather than:
import numpy as N
import numpy.math as N.M
import numpy.trig as N.T
import numpy.stat as N.S
...
c = (N.T.sin(b) + N.M.exp(d)) / N.S.mean(g)
?
In the latter example, you start N.whatevering yourself to death, and
it is harder to learn because you have to remember what little
container each function is in and what you happened to name the
container when you loaded it. Code is also harder to read as the
N.whatevers distract from the sense of the equation. Lines start to
lengthen. Sure, you can do:
from whatever import functionlist
to pull the functions into your top-level namespace, but do you really
want to, in effect, declare every function you ever call? The whole
point of array languages is to get rid of mechanics like function
declarations so you can focus on programming, not housekeeping.
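To make the comparison concrete, here is a minimal sketch of the three
styles side by side, runnable against numpy as it exists. The T, M, and S
containers are hypothetical stand-ins built with types.SimpleNamespace,
since numpy has no trig, math, or stat submodules; the array values are
made up for illustration.

```python
import types
import numpy as N

# Hypothetical sub-namespaces, built here only to mimic a split layout.
# numpy does not ship modules named trig, math, or stat.
T = types.SimpleNamespace(sin=N.sin)
M = types.SimpleNamespace(exp=N.exp)
S = types.SimpleNamespace(mean=N.mean)

# Made-up sample data.
b = N.array([0.0, 0.5, 1.0])
d = N.array([0.0, 1.0, 2.0])
g = N.array([1.0, 2.0, 3.0])

# Flat namespace: one prefix to remember.
c_flat = (N.sin(b) + N.exp(d)) / N.mean(g)

# Split namespace: you must recall which container holds each function.
c_split = (T.sin(b) + M.exp(d)) / S.mean(g)

# Explicit import list: no prefixes, but every function is named up front.
from numpy import sin, exp, mean
c_named = (sin(b) + exp(d)) / mean(g)
```

All three compute the same array; the only difference is how much
bookkeeping sits between you and the equation.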
As I've emphasized before, finding the function you want is a
documentation problem, not a packaging problem. We're working on
that. Anne's function index on the web site is an excellent start,
and there will be much more done this summer.
Though I didn't initially agree with it, I now think Travis's line in
the sand is a pretty good one. Numpy is enough that many people don't
have to go to scipy most of the time. It's being maintained well and
released reasonably often. The problems of the rest of scipy are not
holding back the core anymore. Installing today at a tiny 10 MB,
numpy could easily stand to grow by adding small functions that are
broadly used, without making it unwieldy for even the constrained
space of OLPC.
Let's think good and hard before we introduce more divisions into the
namespace.
--jh--