[SciPy-user] python (against java) advocacy for scientific projects

Sturla Molden sturla@molden...
Mon Jan 19 10:14:28 CST 2009


On 1/19/2009 9:01 AM, Marko Loparic wrote:

> For a new project one manager suggested me to use java instead of
> python. He says that python has performance problems.

Some managers prefer Java because it is hyped; they tend to be
ill-informed. Python does not have performance problems, but a program
written in Python might. Usually that is not due to Python. Most likely,
the same program written in Java will experience the same performance
problems. An O(N**2) algorithm in Python will still be O(N**2) in Java.
One needs to be a bit more clever than just swapping languages. Python
is used to run YouTube.com and a web spider called Googlebot. It is used
to analyze NASA's images from the Hubble space telescope. Do you have
performance issues that exceed that?
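
To make the point about complexity concrete, here is a minimal sketch of
two ways to check a list for duplicates. The function names are made up
for this illustration and the items are assumed to be hashable. The
pairwise version stays quadratic no matter which language it is written
in; the set-based version is roughly linear on average.

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items: O(N**2) comparisons."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember items already seen in a set: about O(N) on average."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both functions give the same answer; only the amount of work differs as
N grows.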

Java is not commonly used for scientific computing. Scientists generally
prefer languages like IDL, R, Matlab, S-PLUS, Mathematica, Perl and
Python. Java requires too much boilerplate code. You don't get to
focus on the important work. Matlab and Python programs tend to be much
shorter than their Java equivalents (1/10 to 1/5 of the lines of code).
As for performance, I tend to find that Python scripts (with NumPy) run
faster than similar scripts written in Matlab. Matlab is perhaps the
most commonly used language for numerical computing today.

The advantage of Java over Python for scientific computing is faster for
loops. Everything else is in favour of Python. One particularly
important issue is memory use. Python's strategy of reference counting
keeps the memory use down at all times. Java is much greedier with
memory, and only collects garbage now and then. Using too much RAM can
cause the OS to begin swapping out pages to disk. If you are worried
about speed, you really don't want this to happen.
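
A small sketch of what reference counting means in practice, assuming
the standard CPython interpreter (other Python implementations may
reclaim memory later). The Buffer class and the function name are
hypothetical, made up for this illustration. A weak reference is used
as a probe because it does not keep the object alive.

```python
import weakref


class Buffer:
    """Stand-in for a large temporary object (hypothetical class)."""

    def __init__(self, n):
        self.data = [0.0] * n


def is_freed_immediately():
    """Return (alive before del, alive after del) for a Buffer."""
    buf = Buffer(1000)
    probe = weakref.ref(buf)      # does not increase the reference count
    alive_before = probe() is not None
    del buf                       # the last strong reference goes away
    alive_after = probe() is not None
    return alive_before, alive_after
```

On CPython the object is reclaimed as soon as the last reference is
dropped, so the second value should be False without any garbage
collection pass having run.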

Sometimes speed does matter. If a calculation takes a day in pure 
Fortran, it may take half a year in pure Python. Remember C.A.R. Hoare's 
famous statement (sometimes erroneously attributed to D. Knuth) that 
'premature optimization is the root of all evil in computer 
programming.' There is a reason we don't write everything in Fortran 77
or assembly, even if it generates the fastest code (faster than C).
Focusing optimization on anything but the worst offending bottlenecks is
a waste of effort. And that is why scientists don't use Java or Fortran
all the time: Java and Fortran may be faster than Python or Matlab on
average, but the computationally important parts are concentrated in
less than 5% of your code. There is nothing that says a program written in
Python must be 'pure Python'. If you migrate that offending 5% to 
Fortran or C, you would beat Java in terms of speed, and still retain 
all the advantages of Python. That is why we don't have performance 
problems when using Python for HPC. We don't use Python all the time; we 
use Python where it is convenient.
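
As a rough illustration of moving only the hot spot out of pure Python,
the sketch below computes a sum of squares twice: once with an
interpreted loop, once by handing the same work to NumPy's compiled
code. The function names are made up for this example, and NumPy here
stands in for whatever compiled code (Fortran, C) the bottleneck would
actually be migrated to.

```python
import numpy as np


def sum_of_squares_python(values):
    """The hot loop written in pure Python."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """The same computation done inside NumPy's compiled routines."""
    a = np.asarray(values, dtype=float)
    return float(np.dot(a, a))
```

The rest of the program does not need to know which of the two is
called; only the offending function changes.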

Here is a 10 point strategy for writing correct and fast programs with 
Python:

1. Write everything in Python with NumPy (and possibly SciPy, 
Matplotlib, wxPython, psyco, twisted, etc.) Get a verified, working 
program. Correctness is far more important than speed in scientific 
computing. Scientists must be pedantic about correctness.

2. If your program is fast enough, quit and be happy with it. You don't 
need to fix something that works. 9 out of 10 times, the development 
cycle ends here.

3. Identify the worst bottlenecks using a profiler. Your guess and gut 
feeling will likely be incorrect.

4. If the bottleneck is I/O (disk, network, SQL server) or calls to
libraries like SciPy, there is very little that can be done about it.
Faster hardware may help; Java certainly will not. Neither Java nor C
reads data from disk etc. faster than Python.

5. Hardware is expensive but much cheaper than labour. If you can solve 
the problems by buying more hardware or better hardware, then do that.

6. If bottlenecks are most easily solved by numerical libraries, e.g.
LAPACK, FFTW, MKL, ATLAS, GSL, etc., then use these. People have spent 
years optimizing them. There is likely nothing you can hand-code - in 
any language - that will be faster. Remember that NumPy and SciPy will 
use some of these libraries as well.

7. Did you remember to use vectorized array syntax? Neither Python (with 
NumPy) nor Matlab is meant to be used like Java. For-loops are plain 
evil. Most of Peter J. Acklam's vectorization guide to Matlab applies to 
NumPy as well:

http://home.online.no/~pjacklam/matlab/doc/mtt/doc/mtt.pdf

8. Check your algorithm. O(N) or O(N log N) is better than O(N**2) if N 
is large. This is where you can get really big speed improvements, 
regardless of language.

9. If the bottleneck cannot be solved by libraries or by changing the
algorithm, re-write these parts in Fortran 95. Compile with f2py to get
a Python callable extension module. Real scientists do not use C++ (if 
we need OOP, we have Python.)

10. If you need to use parallel processors (e.g. multicore CPUs), begin
by inserting OpenMP directives into your Fortran code. If this is not
enough, use the standard library packages 'multiprocessing' or
'threading' for coarser-grained parallelism. Ensure that the GIL is
released if you choose 'threading'; f2py can release the GIL around
thread-safe Fortran routines.
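
Step 3 of the strategy above can be sketched with the profiler from the
standard library. This is only a minimal example under stated
assumptions: slow_part, fast_part, run and profile_run are hypothetical
names standing in for a real program, and the workload is invented.

```python
import cProfile
import pstats


def slow_part(n):
    """Deliberately slow nested loop, standing in for the real bottleneck."""
    total = 0
    for i in range(n):
        for j in range(50):
            total += i * j
    return total


def fast_part(n):
    """Cheap work that should not show up near the top of the profile."""
    return sum(range(n))


def run(n=2000):
    """The 'program' being profiled."""
    return slow_part(n) + fast_part(n)


def profile_run():
    """Run the program under cProfile and return its result and the stats."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = run()
    profiler.disable()
    stats = pstats.Stats(profiler)
    return result, stats
```

Calling stats.sort_stats("cumulative").print_stats(5) on the returned
object lists the most expensive functions first, which is usually a
better guide than intuition about where the time goes.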




Regards,
Sturla Molden











