[SciPy-user] python (against java) advocacy for scientific projects
Mon Jan 19 10:14:28 CST 2009
On 1/19/2009 9:01 AM, Marko Loparic wrote:
> For a new project one manager suggested me to use java instead of
> python. He says that python has performance problems.
Some managers prefer Java because it is hyped; they tend to be
ill-informed. Python does not have performance problems. But a program
written in Python might. Usually it is not due to Python. Most likely,
programs written in Java will experience the same performance problems.
An O(N**2) algorithm in Python will still be O(N**2) in Java. One needs
to be a bit more clever than just swapping languages. Python is used to run
YouTube.com and a web spider called Googlebot. It is used to analyze
NASA's images from the Hubble space telescope. Do you have performance
issues that exceed that?
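To make the point about algorithms concrete, here is a minimal sketch. The two function names are made up for illustration, and the second one assumes the items are hashable. Both answer the same question, but the first does O(N**2) comparisons while the second does O(N) work on average:

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items: O(N**2) comparisons in the worst case."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember what has been seen in a set: O(N) on average."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Port either function to Java and its growth rate stays the same; only the constant factor changes.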
Java is not commonly used for scientific computing. Scientists generally
prefer languages like IDL, R, Matlab, S-PLUS, Mathematica, Perl and
Python. Java requires too much 'boilerplate' code. You don't get to
focus on the important work. Matlab and Python programs tend to be much
shorter than Java programs (1/10 to 1/5 of the lines of code). As for performance, I
tend to find that Python scripts (with NumPy) run faster than similar
scripts written in Matlab. Matlab is perhaps the most commonly used
language for numerical computing today.
The advantage of Java over Python for scientific computing is faster
for-loops. Anything else is in favour of Python. One particularly important
issue is memory use. Python's strategy of reference counting keeps the
memory use down at all times. Java is much more greedy with memory,
and only collects garbage now and then. Using too much RAM can cause the
OS to begin swapping out pages to disk. If you are worried about speed,
you really don't want this to happen.
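A small sketch of what reference counting means in practice, assuming CPython (other implementations such as Jython use a different collector, and reference cycles still need the cycle collector). The class `Buffer` and the function name are hypothetical, made up for this illustration:

```python
import weakref


class Buffer:
    """Stand-in for a large temporary object (hypothetical example class)."""

    def __init__(self, n):
        self.data = bytearray(n)


def buffer_is_freed_immediately():
    buf = Buffer(10_000_000)   # roughly 10 MB of data
    probe = weakref.ref(buf)   # a weak reference does not keep buf alive
    del buf                    # the last strong reference goes away here
    # Under CPython's reference counting the object is already gone;
    # no separate garbage collection pass was needed.
    return probe() is None
```

On CPython the function returns True: the memory is released as soon as the last reference disappears, not at some later collection.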
Sometimes speed does matter. If a calculation takes a day in pure
Fortran, it may take half a year in pure Python. Remember the famous
statement, published by D. Knuth and often attributed to C.A.R. Hoare, that
'premature optimization is the root of all evil in computer
programming.' There is a reason we don't write everything in Fortran 77
or assembly, even if that generates the fastest code (faster than C).
Focusing optimization on anything but the worst offending bottlenecks is
a waste of effort. And that is why scientists don't use Java or Fortran
all the time: Java and Fortran may be faster than Python or Matlab on
average, but the computationally important parts will be concentrated in less
than 5% of your code. There is nothing that says a program written in
Python must be 'pure Python'. If you migrate that offending 5% to
Fortran or C, you would beat Java in terms of speed, and still retain
all the advantages of Python. That is why we don't have performance
problems when using Python for HPC. We don't use Python all the time; we
use Python where it is convenient.
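Finding that offending 5% is a job for a profiler, not for guesswork. Here is a minimal sketch using the standard library's cProfile and pstats; the functions `cheap_setup`, `expensive_kernel`, `run` and `profile_report` are hypothetical stand-ins for a real program:

```python
import cProfile
import io
import pstats


def cheap_setup(n):
    return list(range(n))


def expensive_kernel(values):
    # Deliberately quadratic so that it dominates the run time.
    total = 0
    for x in values:
        for y in values:
            total += x * y
    return total


def run(n=300):
    return expensive_kernel(cheap_setup(n))


def profile_report(n=300, top=5):
    """Profile run(n); return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(run, n)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

Printing the second return value of `profile_report()` lists the functions by cumulative time. In this toy case `expensive_kernel` is where the time goes, so it would be the only candidate worth moving to Fortran or C.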
Here is a 10-point strategy for writing correct and fast programs with Python:
1. Write everything in Python with NumPy (and possibly SciPy,
Matplotlib, wxPython, psyco, twisted, etc.) Get a verified, working
program. Correctness is far more important than speed in scientific
computing. Scientists must be pedantic about correctness.
2. If your program is fast enough, quit and be happy with it. You don't
need to fix something that works. 9 out of 10 times, the development
cycle ends here.
3. Identify the worst bottlenecks using a profiler. Your guess and gut
feeling will likely be incorrect.
4. If the bottleneck is I/O (disk, network, SQL server) or calls to
libraries like SciPy, there is very little that can be done about it.
Faster hardware may help; Java certainly will not. Java and C do not
read data from disk etc. faster than Python.
5. Hardware is expensive but much cheaper than labour. If you can solve
the problems by buying more hardware or better hardware, then do that.
6. If bottlenecks are most easily solved by numerical libraries, e.g.
LAPACK, FFTW, MKL, ATLAS, GSL, etc., then use these. People have spent
years optimizing them. There is likely nothing you can hand-code - in
any language - that will be faster. Remember that NumPy and SciPy will
use some of these libraries as well.
7. Did you remember to use vectorized array syntax? Neither Python (with
NumPy) nor Matlab is meant to be used like Java. For-loops are plain
evil. Most of Peter J. Acklam's vectorization guide to Matlab applies to
NumPy as well.
8. Check your algorithm. O(N) or O(N log N) is better than O(N**2) if N
is large. This is where you can get really big speed improvements,
regardless of language.
9. If the bottleneck cannot be solved by libraries or changing
algorithm, re-write these parts in Fortran 95. Compile with f2py to get
a Python callable extension module. Real scientists do not use C++ (if
we need OOP, we have Python.)
10. If you need to use parallel processors (e.g. multicore CPUs), begin
by inserting OpenMP directives into your Fortran code. If this is not
enough, use the standard lib packages 'multiprocessing' or 'threading'
for coarser-grained parallelism. Ensure that the GIL is released if you
choose 'threading'; f2py can release the GIL around thread-safe Fortran
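To illustrate the last point, here is a rough sketch of coarse-grained parallelism with 'threading'. It assumes the heavy work happens inside compiled code that releases the GIL. As a stand-in for an f2py-wrapped Fortran routine it uses hashlib, which releases the GIL while hashing large buffers; the function name `digest_all` is made up for this example:

```python
import hashlib
import threading


def digest_all(blocks):
    """Hash each block in its own thread; results keep the input order."""
    results = [None] * len(blocks)

    def worker(i):
        # The hashing runs in compiled code with the GIL released,
        # so several workers can make progress at the same time.
        results[i] = hashlib.sha256(blocks[i]).hexdigest()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(blocks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

If the worker were pure Python instead, the threads would take turns holding the GIL and little would be gained; in that case 'multiprocessing' is the better choice.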