[Numpy-discussion] the direction and pace of development
eric at enthought.com
Thu Jan 22 00:05:01 CST 2004
Good thing Duke is beating Maryland as I read, otherwise, mail like this
can make you grumpy. :-)
Joe Harrington wrote:
>This is a necessarily long post about the path to an open-source
>replacement for IDL and Matlab. While I have tried to be fair to
>those who have contributed much more than I have, I have also tried to
>be direct about what I see as some fairly fundamental problems in the
>way we're going about this. I've given it some section titles so you
>can navigate, but I hope that you will read the whole thing before
>posting a reply. I fear that this will offend some people, but please
>know that I value all your efforts, and offense is not my intent.
>THE PAST VS. NOW
>While there is significant and dedicated effort going into
>numeric/numarray/scipy, it's becoming clear that we are not
>progressing quickly toward a replacement for IDL and Matlab. I have
>great respect for all those contributing to the code base, but I think
>the present discussion indicates some deep problems. If we don't
>identify those problems (easy) and solve them (harder, but not
>impossible), we will continue not to have the solution so many people
>want. To be convinced that we are doing something wrong at a
>fundamental level, consider that Python was the clear choice for a
>replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on
>interactive data analysis environments. That was over 7 years ago.
The effort has fallen short of the mark you set. I also wish the
community were more efficient at pursuing this goal. There are
fundamental issues. (1) The effort required is large. (2) Free time is
in short supply. (3) Financial support is difficult to come by for
library development. Other potential problems would be a lack of
interest and a lack of competence. I do not think many of us suffer
from the first. As for competence, the development team beyond the
walls of Enthought self selects in open source projects, so we're stuck
with what we've got. I know most of the people and happen to think they
are a talented bunch, so I'll consider us no worse than the average
group of PhDs (some consider that a pretty low bar ...). I believe the
tasks that go undone (multi-platform support, bi-yearly releases,
documentation, etc.) are due more to (2) and (3) above than to some
other deep (or shallow) issue.
I guess another possibility is organization. This can be improved
upon. Thanks to the gracious help of Caltech (CACR) and NCBR, the
community has gathered at a low-cost SciPy workshop at Caltech the last
couple of years. I believe this is a positive step. Adding this to the
newsgroups and mailing lists provides us with a solid framework within
which to operate.
I still have confidence that we will reach the IDL/Matlab replacement
point. We don't have the resources that those products have behind
them. We do have a superior language, but without a lot of sweat and
toiling at hours of grunt work, we don't stand a chance. As for
Enthought's efforts, our success in building applications (scientific
and otherwise) has diverted our developers (myself included) away from
SciPy as the primary focus. We do continue to develop it and provide
significant (for us) financial support to maintain it. I am lucky
enough to work with a fine set of software engineers, and I am itching
for us to get more time devoted to SciPy. I do believe that we will
get the opportunity in the future -- it is just a matter of time. Call
me an optimist.
>replace IDL or Matlab", the answer was clearly "stable interfaces to
>basic numerics and plotting; then we can build it from there following
>the open-source model". Work on both these problems was already well
>underway then. Now, both the numerical and plotting development
>efforts have branched. There is still no stable base upon which to
>build. There aren't even packages for popular OSs that people can
>install and play with. The problem is not that we don't know how to
>do numerics or graphics; if anything, we know these things too well.
>In 1996, if anyone had told us that in 2004 there would be no
>ready-to-go replacement system because of a factor of 4 in small array
>creation overhead (on computers that ran 100x as fast as those then
>available) or the lack of interactive editing of plots at video
>speeds, the response would not have been pretty. How would you have
>We are not following the open-source development model. Rather, we
>pay lip service to it. Open source's development mantra is "release
>early, release often". This means release to the public, for use, a
>package that has core capability and reasonably-defined interfaces.
>Release it in a way that as many people as possible will get it,
>install it, use it for real work, and contribute to it. Make the main
>focus of the core development team the evaluation and inclusion of
>contributions from others. Develop a common vision for the program,
>and use that vision to make decisions and keep efforts focused.
>Include contributing developers in decision making, but do make
>decisions and move on from them.
>Instead, there are no packages for general distribution. The basic
>interfaces are unstable, and not even being publicly debated to decide
>among them (save for the past 3 days). The core developers seem to
>spend most of their time developing, mostly out of view of the
>potential user base. I am asked probably twice a week by different
>fellow astronomers when an open-source replacement for IDL will be
>available. They are mostly unaware that this effort even exists.
>However, this indicates that there are at least hundreds of potential
>contributors of application code in astronomy alone, as I don't nearly
>know everyone. The current efforts look rather more like the GNU
>project than Linux. I'm sorry if that hurts, but it is true.
Speaking from the standpoint of SciPy, all I can say is we've tried to
do what you outline here. The effort of releasing the huge load of
Fortran/C/C++/Python code across multiple platforms is difficult and
takes many hours. I would venture that 90% of the effort on SciPy is
with the build system. This means that the exact part of the process
that you are discussing is the majority of the effort. We keep a
version for Windows up to date because that is what our current clients
use. In all the other categories, we do the best we can and ask others
to fill the gaps. It is also worth saying that SciPy works quite well
for most purposes once built -- we and others use it daily on commercial
>I know that Perry's group at STScI and the fine folks at Enthought
>will say they have to work on what they are being paid to work on.
>Both groups should consider the long term cost, in dollars, of
>spending those development dollars 100% on coding, rather than 50% on
>coding and 50% on outreach and intake. Linus himself has written only
>a small fraction of the Linux kernel, and almost none of the
>applications, yet in much less than 7 years Linux became a viable
>operating system, something much bigger than what we are attempting
>here. He couldn't have done that himself, for any amount of money.
>We all know this.
Elaborate on the outreach idea for me. At Enthought, we (spend money to)
provide funding to core developers outside of our company (Travis and
Pearu), we (spend money to) give talks at many conferences a year, we
(spend a little money to) co-sponsor a 70-person workshop on scientific
computing every year, we have an open mailing list, we release most of
the general software that we write, and in the past I practically begged
people to take CVS write access when they provided a patch to SciPy. We
even spent a lot of time early on trying to set up the scipy.org site as
a collaborative Zope based environment -- an effort that was largely a
failure. Still we have a functioning largely static site, the mailing
list, and CVS. As far as tools, that should be sufficient.
It is impossible to argue with the results though. Linus pulled off the
OS model, and Enthought and the SciPy community, thus far, have been less
successful. If there are suggestions beyond "spend more *time*
answering email," I am all ears. Time is the most precious commodity of
all these days.
Also, SciPy has only been around for 3+ years, so I guess we still have
some rope left. I continue to believe it'll happen -- this seems like
the perfect project for open source contributions.
>Here is what I suggest:
>1. We should identify the remaining open interface questions. Not,
> "why is numeric faster than numarray", but "what should the syntax
> of creating an array be, and of doing different basic operations".
> If numeric and numarray are in agreement on these issues, then we
> can move on, and debate performance and features later.
?? I don't get this one. This interface (at least for numarray) is
largely decided. We have argued the points, and Perry et al. at STScI
made the decisions. I didn't like some of them, and I'm sure everyone
else had at least one thing they wished was changed, but that is the way
this open stuff works.
It is not the interface but the implementation that started this furor.
Travis O.'s suggestion was to back port (much of) the numarray interface
to the Numeric code base so that those stuck supporting large codebases
(like SciPy) and needing fast small arrays could benefit from the
interface enhancements. One or two of them had backward compatibility
issues with Numeric, so he asked how it should be handled. Unless some
magic porting fairy shows up, SciPy will be a Numeric only tool for the
next year or so. This means that users of SciPy either have to forgo
some of these features or back port.
On speed: <excerpt from private mail to Perry>
Numeric is already too slow -- in a recent project we've had to recode in C
a number of routines that I don't think we should have had to. For us, the
goal is not to approach Numeric's speed but to significantly beat it for
all array sizes. That has to be a possibility for any replacement.
Otherwise, our needs (with the exception of a few features) are already
better met by Numeric. I have some worries about all of the endianness
and memory mapped support that are built into Numarray imposing too much
overhead for speed-ups on small arrays to be possible (this echoes
Travis O.'s thoughts -- we will happily be proven wrong). None of our
current work needs these features, and paying a price for them is hard
to do with an alternative already there. It is fairly easy to improve
Numeric's performance on mathematical operations by just changing the way the ufunc
operations are coded. With some reasonably simple changes, Numeric
should be comparable (or at least closer) to Numarray speed for large
arrays. Numeric also has a large number of other optimizations that can
be made (memory is zeroed twice in zeros(), asarray was recently
improved significantly for the typical case, etc.). Making these
changes would help our selling of Python and, since we have at least a
year's worth of applications that will be on the SciPy/Numeric platform,
it will also help the quality of these applications.
Oh yeah, I have also been surprised at how much of our code uses
alltrue(), take(), isnan(), etc. The speed of these array manipulation
methods is really important for us.
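To put rough numbers on why small-array overhead matters so much here, the following is a minimal sketch of a linear cost model (fixed per-call overhead plus a per-element cost). The factor of 4 is the figure quoted in this thread; the microsecond values, the function names, and the assumption that both libraries share the same per-element cost are hypothetical and chosen only for illustration, not measured from Numeric or numarray.

```python
def call_time(overhead_us, per_element_us, n):
    """Time for one array operation: fixed overhead plus per-element work."""
    return overhead_us + per_element_us * n


def slowdown(n, overhead_ratio=4.0, base_overhead_us=1.0, per_element_us=0.01):
    """Ratio of slow-library time to fast-library time for arrays of length n.

    All default values are hypothetical. Both libraries are assumed to have
    the same per-element cost and to differ only in fixed per-call overhead.
    """
    fast = call_time(base_overhead_us, per_element_us, n)
    slow = call_time(overhead_ratio * base_overhead_us, per_element_us, n)
    return slow / fast


if __name__ == "__main__":
    # The ratio starts near the overhead factor and approaches 1 as n grows.
    for n in (1, 10, 100, 10_000, 1_000_000):
        print(n, round(slowdown(n), 3))
```

Under this model the penalty is close to the full overhead factor for arrays of a few elements and fades toward nothing for very large ones, which is why code that makes many calls such as alltrue(), take(), and isnan() on small arrays feels the difference most.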
>2. We should identify what we need out of the core plotting
> capability. Again, not "chaco vs. pyxis", but the list of
> requirements (as an astronomer, I very much like Perry's list).
Yep, we obviously missed on this one. Chaco (and the related libraries)
is extremely advanced in some areas but lags in ease-of-use. It is
primarily written by a talented and experienced computer scientist (Dave
Morrill) who likely does not have the perspective of an astronomer. It
is clear that areas of the library need to be re-examined, simplified,
and improved. Unfortunately, there is not time for us to do that right
now, and the internals have proven too complex for others to contribute
to in a meaningful way. I do not know when this will be addressed. The
sad thing here is that STScI won't be using it. That pains me to no
end, and Perry and I have tried to figure out some way to make it work
for them. But, it sounds like, at least in the short term, there will
be two new additions to the plotting stable. We will work hard though
to make the future Chaco solve STScI's problems (and everyone else's)
better than it currently does.
By the way, there is a lot of Chaco bashing going on. It is worth
saying that we use Chaco every day in commercial applications that
require complex graphics and heavy interactivity with great success.
But, we also have mixed teams of scientists and computer scientists
along with the "U Manual" (If I have a question, I ask you -- being
Dave) to answer any questions. I continue to believe Chaco's Traits
based approach is the only one currently out there that has the chance
of improving on Matlab and other plotting packages available. And,
while SciPy is moving slowly, Chaco is moving at a frantic development
pace and gets new capabilities daily (which is part of the complaints
about it). I feel certain in saying that it has more resources tied to
its development than the other plotting options out there -- it is just
currently being exercised in GUI environments instead of as a day-to-day
plotting tool. My advice is dig in, learn traits, and learn Chaco.
>3. We should collect or implement a very minimal version of the
> featureset, and document it well enough that others like us can do
> simple but real tasks to try it out, without reading source code.
> That documentation should include lists of things that still need
> to be done.
>4. We should release a stand-alone version of the whole thing in the
> formats most likely to be installed by users on the four most
> popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this
> means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2.
> Tarballs and CVS checkouts are right out. We have seen that nobody
> in the real world installs them. To be most portable and robust,
> it would make sense to include the Python interpreter, named such
> that it does not stomp on versions of Python in the released
> operating systems. Static linking likewise solves a host of
> problems and greatly reduces the number of package variants we will
> have to maintain.
>5. We should advertize and advocate the result at conferences and
> elsewhere, being sure to label it what it is: a first-cut effort
> designed to do a few things well and serve as a platform for
> building on. We should also solicit and encourage people either to
> work on the included TODO lists or to contribute applications. One
> item on the TODO list should be code converters from IDL and Matlab
> to Python, and compatibility libraries.
>6. We should then all continue to participate in the discussions and
> development efforts that appeal to us. We should keep in mind that
> evaluating and incorporating code that comes in is in the long run
> much more efficient than writing the universe ourselves.
>7. We should cut and package new releases frequently, at least once
> every six months. It is better to delay a wanted feature by one
> release than to hold a release for a wanted feature. The mountain
> is climbed in small steps.
>The open source model is successful because it follows closely
>something that has worked for a long time: the scientific method, with
>its community contributions, peer review, open discussion, and
>progress mainly in small steps. Once basic capability is out there,
>we can twiddle with how to improve things behind the scenes.
Everything here is great -- it is the implementation part that is hard.
I am all for it happening though.
>IS SCIPY THE WAY?
>The recipe above sounds a lot like SciPy. SciPy began as a way to
>integrate the necessary add-ons to numeric for real work. It was
>supposed to test, document, and distribute everything together. I am
>aware that there are people who use it, but the numbers are small and
>they seem to be tightly connected to Enthought for support and
Not so. The user base is not huge, but I would conservatively venture
to say it is in the hundreds to thousands. We are a company of 12
without a single support contract for SciPy.
>Enthought's focus seems to be on servicing
>its paying customers rather than on moving SciPy development along,
Continuing to move SciPy along at the pace we initially were would have
ended Enthought -- something had to change. It is surprising how
important paying customers are to a company.
>and I fear they are building an installed customer base on interfaces
>that were not intended to be stable.
Not sure what you mean here, but I'm all for stable interfaces.
Huge portions of SciPy's interface haven't changed, and I doubt they
will change. I do indeed feel, though, that SciPy is still a 0.2
release level, so some of the interfaces can change. It would be
irresponsible to say otherwise. This is not "intentionally unstable"
>So, I will raise the question: is SciPy the way? Rather than forking
>the plotting and numerical efforts from what SciPy is doing, should we
>not be creating a new effort to do what SciPy has so far not
>delivered? These are not rhetorical or leading questions. I don't
>know enough about the motivations, intentions,
Man, this sounds like an interview (or interaction) question. Well,
we're a company, so we do wish to make money -- otherwise, we'll have to
do something else. We also care deeply about science and are
passionate about scientific computing. Let's see, what else. We have
made most of the things we do open source because we do believe in it in
principle and as a good development philosophy. And, even though we all
wish SciPy was moving faster, SciPy wouldn't be anywhere close to where
it is without Travis Oliphant and Pearu Peterson -- neither of whom
would have worked on it had it not been openly available. That alone
validates the decision to make it open.
I'm not sure what we have done to make someone question our "motivations
and intentions" (sounds like a date interrogation), but it is hard to
think of malicious ones when you are making the fruits of your labors
and dollars freely available.
>and resources of the
Well, we have 12 people, and Pearu and Travis O work with us quite a bit
also. The developers here are very good (if I do say so myself), but
unfortunately primarily working on other projects at the moment.
Besides scientists/computer scientists, we have a technical writer and a
human-computer-interface specialist on staff.
>folks at Enthought (and elsewhere) to know the answer. I do think
>that such a fork will occur unless SciPy's approach changes
Enthought has more commitments than we used to. SciPy remains important
and core to what we do, it just has to share time with other things.
Luckily Pearu and Travis have kept their ears to the ground to help out
people on the mailing lists as well as working on the codebase.
I'm not sure what our approach has been that would force a fork... It
isn't like someone has come and asked to be release manager, offered to
keep the web pages up to date, provided peer review of code, etc., and we
have turned them away. Almost from the beginning most effort has been
provided by a small team (fairly standard for OS stuff). We have
repeatedly pointed out areas we need help at the conference and in mail
-- code reviews, build help, release help, etc. In fact, I double dare
ya to ask to manage the next release or the documentation effort.
okay... triple dare ya.
Some people (like Konrad, I believe) have philosophical differences with
how SciPy is packaged and believe it should be 12 smaller packages
instead of one large one. This has its own set of problems obviously,
but forking based on this kind of principle would make at least a
modicum of sense.
Forking because you don't like the pace of the project makes zero
sense. Pitch in and solve the problem. The social barriers are very
small. The code barriers (build, etc.) are what need to be solved.
>The way to decide is for us all to discuss the
>question openly on these lists, and for those willing to participate
>and contribute effort to declare so openly. I think all that is
>needed, either to help SciPy or replace it, is some leadership in the
>direction outlined above. I would be interested in hearing, perhaps
>from the folks at Enthought, alternative points of view. Why are
>there no packages for popular OSs for SciPy 0.2?
Please build them, ask for web credentials, and upload them. Then
answer the questions people have about them on the mailing list. It is
as simple as that. There is no magic here -- just work.
>Why are releases so
>If the folks running the show at scipy.org disagree with
>many others on these lists, then perhaps those others would like to
>roll their own. Or, perhaps stable/testing/unstable releases of the
>whole package are in order.
>HOW TO CONTRIBUTE?
>Judging by the number of PhDs in sigs, there are a lot of researchers
>on this list. I'm one, and I know that our time for doing core
>development or providing the aforementioned leadership is very
>limited, if not zero.
Surprisingly, commercial developers have about the same amount of free time.
> Later we will be in a much better position to
>contribute application software. However, there is a way we can
>contribute to the core effort even if we are not paid, and that is to
>put budget items in grant and project proposals to support the work of
For the academics, supporting a *dedicated* student to maintain SciPy
would be a much more cost-effective use of your dollars. Unfortunately,
it is hard to get a PhD for supporting SciPy...
<begin shameless plugs that somehow seem appropriate here>
For companies, national laboratories, etc., supporting development on
SciPy (or numarray) directly is a great idea. Projects that we work on
in other areas also indirectly support SciPy, Chaco, etc. so get us
involved with the development efforts at your company/lab.
Other options? Government (NASA, Military, NIH, etc) and national lab
people can get SciPy/numarray/Python related SBIR
(http://www.acq.osd.mil/sadbu/sbir/) topics that would impact their
research/development put on the solicitation list this summer. Email me
if you have any questions on this. ASCI people can propose PathForward
projects. There are probably numerous other ways to do this. We will
have a GSA schedule soon, so government contracting will also work.
</end shameless plug>
>subcontractors at places like Enthought or STScI. A handful of
>contributors would be all we'd need to support someone to produce OS
>packages and tutorial documentation (the stuff core developers find
>boring) for two releases a year.
Joe, as you say, things haven't gone as fast as any of us would wish,
but it hasn't been for lack of trying. Many of us have put zillions of
hours into this. The results are actually quite stable tools. Many
people use Numeric/Numarray/SciPy in daily work without problems. But,
like Linux in the early years, they still require "geeks" willing to do
some amount of meddling to use them. Huge resources (developer and
financial) have been pumped into Linux to get it to the point it's at
today. Anything we can do to increase the participation in building
tools and financially supporting those who do build tools, I am all
for... I'd love to see releases on 10 platforms and full documentation
for the libraries as much as the next person.
Whew, and Duke managed to hang on and win.
my .01 worth,