strang at nmr.mgh.harvard.edu
Mon Feb 7 08:42:17 CST 2005
[Long post: a long-time lurker here sharing his numerical
experiences/input re: "the numeric split" and Numeric3.]
I'm a long-time python user (and follower of this list), a heavy user in
the numerical realm, and an "incidental" contributor to SciPy (statistical
functions in stats.py). First, I'm impressed (and indeed amazed) at all
the hard work folks have put in--usually without pay--to make Numpy and
Numarray and SciPy and Matplotlib (to name the major players in this
discussion). I am indebted to each of you. Second, I must say that I think
this protracted discussion is EXACTLY what the python numerical community
needs. It appears as though we have critical mass in terms of code and
interest (and opinions), and just need to bring them all together.
Since the inception of numarray, I've just been standing back and waiting
to see how this all sorts itself out. My stats functions work for lists
and numpy arrays. I didn't want to convert them to numarray (given my lack
of spare time) unless that was going to be the "new path". It appears,
however, even after all this time, there isn't (quite) a consensus on a
new path. After the recent message-storm, however, I am very hopeful. I
see 4 issues at stake here, with the caveat that I'm not the code writer,
just a user ...
1) Multiarray in Python core: I agree that this (as already stated) is (1)
mostly irrelevant for heavy-duty numerical folks, BUT (2) is critical for
giving python a standardized data exchange format. Being able to
trivially (i.e., out of the python box) load, save, pickle and load again
on a new platform N-D array objects would be a big deal for me (and many I
work with). Such a core object can't favor any particular size array ...
so it would need to provide good (or excellent) performance on both little
arrays (a la numpy) and on big arrays (a la numarray). To be in keeping
with other python objects, it seems this object would need to be tight,
fast and easily extensible. I *think* this is exactly what Numeric3 is
intended to do. Getting this right is tricky, but it seems like current
solutions are EXTREMELY close.
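As a rough illustration of the round trip described in point 1 (save or pickle an N-D array, then load it again), here is a minimal sketch using only the standard library. The NDArray class is hypothetical and invented for this example; it is not the Numeric3 design or any existing array type, just a flat typed buffer plus a shape.

```python
import pickle
from array import array


class NDArray:
    """Hypothetical minimal N-D container: a flat typed buffer plus a shape."""

    def __init__(self, typecode, shape, values):
        self.shape = tuple(shape)
        self.data = array(typecode, values)
        size = 1
        for dim in self.shape:
            size *= dim
        if size != len(self.data):
            raise ValueError("shape does not match number of values")

    def __getitem__(self, index):
        # Row-major (C-order) lookup: fold the index tuple into a flat offset.
        offset = 0
        for i, dim in zip(index, self.shape):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            offset = offset * dim + i
        return self.data[offset]


original = NDArray("d", (2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
blob = pickle.dumps(original)   # bytes that could be written to disk or sent elsewhere
restored = pickle.loads(blob)   # the "load again" step

print(restored.shape, restored[1, 2])  # prints: (2, 3) 6.0
```

The point of the sketch is only that shape and contents survive the round trip with no extra packages installed; a real core multiarray would of course need far more (slicing, arithmetic, performance on both small and large arrays).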
2) Numerical function "packaging": Looking at this from a distance (i.e.,
as a user) numerical packaging is too complex. The python spirit seems to
call for being a bit more of a splitter (and encapsulator) than a lumper.
For example, to do web programming in python, one often depends on several
separate modules (html, xml, cgi, etc) rather than one all-encompassing
one. To give numerical work the same modular feel (as well as structure
and insulation from installation headaches), it seems that collections of
numerical operations should be similarly organized on themes (e.g.,
timeseries analysis, morphology (nd_image?), statistics (stats.py), etc).
This way, if you're doing timeseries analysis you import the relevant
modules and go to work ... no worries about installing stuff required for
morphology or statistics that you don't need. I realize this might require
(in some cases) more refactoring, but I don't think I'm supporting
anything *that* different from what already exists. Granted, the notion of
what's "basic" vs. advanced is relative (e.g., where do you put fft, or
linear_algebra?). But if made modular and encapsulated (e.g., an fft.py,
linear_algebra.py, integration.py, morphology.py) and made available both
individually and as part of one or more suites (see #4 below), it's easier
to build on existing code rather than reinvent. Interestingly, although
not obvious, this is how Matlab works too. Your first $500 pays for basic
array-based functionality (fft, psd, etc). Then there are add-on toolkits
(at $500 each) specifically for timeseries analysis, imaging, wavelets,
engineering simulation, etc.
3) Plotting: Until perhaps a year ago, I did almost all my computations in
python, then saved data out to disk, and read it into matlab to plot it. I
hated that situation, but it was the only way to quickly and easily look
at data interactively, with zoom, easy subplotting, etc. Matplotlib has
all but solved this problem (thanks!!). John indicates that the ultimate
goal with matplotlib is to provide plotting, not just scientific plotting,
which is even better! In that case, though, and in keeping with my
previous comment, perhaps the name matplotlib is a little misleading
(suggesting scientific plotting only). Again, if I were familiar with
python but just starting timeseries analysis, I would expect to load my
data into a (multiarray) python object, import timeseries.py, import
plotlib.py (i.e., matplotlib) and go to work doing timeseries analysis ...
be that at LLNL, Wall St, or in my neuro lab.
4) Matlab-like Environment: Both SciPy and Matplotlib have a stated goal
of creating a matlab-style environment. This is great, as it might help
wean more folks off of Matlab or IDL and into the python community.
However, I think that this (as has been suggested ... sorry, I forgot who)
should be a separate goal from any of the above. Building an environment
with python is different from providing functionality to python (think
website design *environment* vs. tools for handling web content ...
they're different). SciPy, with its integration goal, plus matplotlib's
plotting goal would be an outstanding combination to this end.
In sum, I pretty much agree with most previous contributors. With one
exception. I do agree with Konrad that, in theory, the best analysis approach
is to build a custom class for each domain of investigation. I've even
done that a couple times. However, of the dozens of researchers (i.e.,
users, not developers) I've talked about this with, almost none have the
time (or even desire) to design and develop a class-based approach. Their
goal is data analysis and evaluation. So, they pull the data into python
(or, more typically, matlab or IDL), analyze it, plot the results, try to
get a plot that's pretty enough for publication, and then go on to the
next project ... no time for deciding what functionality should be a
method or an attribute, subclassing hierarchies, etc. In those rare
cases where they have the desire, matlab (I don't know about IDL) doesn't
give you the option, meaning that users (as opposed to developers)
probably won't go the (unfamiliar and more time consuming) route.
I apologize for the long post that "simply" supports others' opinions,
particularly when my opinion cannot count for much (after all, I'm not
likely to be doing much of the coding). But, I did want to express my
appreciation for ALL the hard work that's been done, and to give the
strongest encouragement to hashing things out now. I would LOVE to see
some consensus on (1) what a core multiarray object should look like,
(2-3) how to imbue python with numerical functionality and plotting for
generations to come ;-) and (4) how to create environments for scientific
exploration within python. I think we're SOOO close ...
Gary Strangman, PhD | Director, Neural Systems Group
Office: 617-724-0662 | Massachusetts General Hospital
Fax: 617-726-4078 | 149 13th Street, Ste 10018
strang/@/nmr.mgh.harvard.edu | Charlestown, MA 02129