[Numpy-discussion] the direction and pace of development
jh at oobleck.astro.cornell.edu
Wed Jan 21 10:45:02 CST 2004
This is a necessarily long post about the path to an open-source
replacement for IDL and Matlab. While I have tried to be fair to
those who have contributed much more than I have, I have also tried to
be direct about what I see as some fairly fundamental problems in the
way we're going about this. I've given it some section titles so you
can navigate, but I hope that you will read the whole thing before
posting a reply. I fear that this will offend some people, but please
know that I value all your efforts, and offense is not my intent.
THE PAST VS. NOW
While there is significant and dedicated effort going into
numeric/numarray/scipy, it's becoming clear that we are not
progressing quickly toward a replacement for IDL and Matlab. I have
great respect for all those contributing to the code base, but I think
the present discussion indicates some deep problems. If we don't
identify those problems (easy) and solve them (harder, but not
impossible), we will continue not to have the solution so many people
want. To be convinced that we are doing something wrong at a
fundamental level, consider that Python was the clear choice for a
replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on
interactive data analysis environments. That was over 7 years ago.
When people asked at that conference, "what does Python need to
replace IDL or Matlab", the answer was clearly "stable interfaces to
basic numerics and plotting; then we can build it from there following
the open-source model". Work on both these problems was already well
underway then. Now, both the numerical and plotting development
efforts have branched. There is still no stable base upon which to
build. There aren't even packages for popular OSs that people can
install and play with. The problem is not that we don't know how to
do numerics or graphics; if anything, we know these things too well.
In 1996, if anyone had told us that in 2004 there would be no
ready-to-go replacement system because of a factor of 4 in small array
creation overhead (on computers that ran 100x as fast as those then
available) or the lack of interactive editing of plots at video
speeds, the response would not have been pretty. How would you have
We are not following the open-source development model. Rather, we
pay lip service to it. Open source's development mantra is "release
early, release often". This means release to the public, for use, a
package that has core capability and reasonably-defined interfaces.
Release it in a way that as many people as possible will get it,
install it, use it for real work, and contribute to it. Make the main
focus of the core development team the evaluation and inclusion of
contributions from others. Develop a common vision for the program,
and use that vision to make decisions and keep efforts focused.
Include contributing developers in decision making, but do make
decisions and move on from them.
Instead, there are no packages for general distribution. The basic
interfaces are unstable, and the choice among them is not even being
publicly debated (save for the past 3 days). The core developers seem to
spend most of their time developing, mostly out of view of the
potential user base. I am asked probably twice a week by different
fellow astronomers when an open-source replacement for IDL will be
available. They are mostly unaware that this effort even exists.
Still, this suggests that there are at least hundreds of potential
contributors of application code in astronomy alone, as I know only a
small fraction of the field. The current efforts look rather more like the GNU
project than Linux. I'm sorry if that hurts, but it is true.
I know that Perry's group at STScI and the fine folks at Enthought
will say they have to work on what they are being paid to work on.
Both groups should consider the long term cost, in dollars, of
spending those development dollars 100% on coding, rather than 50% on
coding and 50% on outreach and intake. Linus himself has written only
a small fraction of the Linux kernel, and almost none of the
applications, yet in much less than 7 years Linux became a viable
operating system, something much bigger than what we are attempting
here. He couldn't have done that himself, for any amount of money.
We all know this.
Here is what I suggest:
1. We should identify the remaining open interface questions. Not,
"why is numeric faster than numarray", but "what should the syntax
of creating an array be, and of doing different basic operations".
If numeric and numarray are in agreement on these issues, then we
can move on, and debate performance and features later.
2. We should identify what we need out of the core plotting
capability. Again, not "chaco vs. pyxis", but the list of
requirements (as an astronomer, I very much like Perry's list).
3. We should collect or implement a very minimal version of the
featureset, and document it well enough that others like us can do
simple but real tasks to try it out, without reading source code.
That documentation should include lists of things that still need
to be done.
4. We should release a stand-alone version of the whole thing in the
formats most likely to be installed by users on the four most
popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this
means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2.
Tarballs and CVS checkouts are right out. We have seen that nobody
in the real world installs them. To be most portable and robust,
it would make sense to include the Python interpreter, named such
that it does not stomp on versions of Python in the released
operating systems. Static linking likewise solves a host of
problems and greatly reduces the number of package variants we will
have to maintain.
5. We should advertise and advocate the result at conferences and
elsewhere, being sure to label it what it is: a first-cut effort
designed to do a few things well and serve as a platform for
building on. We should also solicit and encourage people either to
work on the included TODO lists or to contribute applications. One
item on the TODO list should be code converters from IDL and Matlab
to Python, and compatibility libraries.
6. We should then all continue to participate in the discussions and
development efforts that appeal to us. We should keep in mind that
evaluating and incorporating code that comes in is in the long run
much more efficient than writing the universe ourselves.
7. We should cut and package new releases frequently, at least once
every six months. It is better to delay a wanted feature by one
release than to hold a release for a wanted feature. The mountain
is climbed in small steps.
The open source model is successful because it follows closely
something that has worked for a long time: the scientific method, with
its community contributions, peer review, open discussion, and
progress mainly in small steps. Once basic capability is out there,
we can twiddle with how to improve things behind the scenes.
IS SCIPY THE WAY?
The recipe above sounds a lot like SciPy. SciPy began as a way to
integrate the necessary add-ons to numeric for real work. It was
supposed to test, document, and distribute everything together. I am
aware that there are people who use it, but the numbers are small and
they seem to be tightly connected to Enthought for support and
application development. Enthought's focus seems to be on servicing
its paying customers rather than on moving SciPy development along,
and I fear they are building an installed customer base on interfaces
that were not intended to be stable.
So, I will raise the question: is SciPy the way? Rather than forking
the plotting and numerical efforts from what SciPy is doing, should we
not be creating a new effort to do what SciPy has so far not
delivered? These are not rhetorical or leading questions. I don't
know enough about the motivations, intentions, and resources of the
folks at Enthought (and elsewhere) to know the answer. I do think
that such a fork will occur unless SciPy's approach changes
substantially. The way to decide is for us all to discuss the
question openly on these lists, and for those willing to participate
and contribute effort to declare so openly. I think all that is
needed, either to help SciPy or replace it, is some leadership in the
direction outlined above. I would be interested in hearing, perhaps
from the folks at Enthought, alternative points of view. Why are
there no packages for popular OSs for SciPy 0.2? Why are releases so
infrequent? If the folks running the show at scipy.org disagree with
many others on these lists, then perhaps those others would like to
roll their own. Or, perhaps stable/testing/unstable releases of the
whole package are in order.
HOW TO CONTRIBUTE?
Judging by the number of PhDs in sigs, there are a lot of researchers
on this list. I'm one, and I know that our time for doing core
development or providing the aforementioned leadership is very
limited, if not zero. Later we will be in a much better position to
contribute application software. However, there is a way we can
contribute to the core effort even if we are not paid, and that is to
put budget items in grant and project proposals to support the work of
others. Those others could be either our own employees or
subcontractors at places like Enthought or STScI. A handful of
contributors would be all we'd need to support someone to produce OS
packages and tutorial documentation (the stuff core developers find
boring) for two releases a year.