[AstroPy] reading one line from many small fits files

Erik Bray embray@stsci....
Fri Aug 3 12:48:15 CDT 2012


On 08/02/2012 11:40 PM, John K. Parejko wrote:
> Follow up on this:
>
> Erin's suggestion to use fitsio gave me a factor of more than 10
> improvement in speed. I was quite astonished at how much faster it was,
> so I've written up a short example, and attached it. On my laptop (13"
> macbook pro, OS X 10.6.8, regular HDD), the code produces the following:
>
> $ python fits_tester.py
> fitsio version: 0.9.0
> pyfits version: 3.0.6
> Single pass: fitsio took 1.14109 seconds.
> Single pass: pyfits took 14.64361 seconds.
>
> One of the problems with the pyfits version is that I don't know how to
> efficiently get at row(n) of a pyfits object in a form that can be
> directly ingested into an ndarray. If there is a way to make the pyfits
> version significantly faster just by calling pyfits differently, I'm all
> ears.
>
> Looking at the profiles for the runs (output to .prof files), it looks
> like pyfits is doing a lot of object creation and destruction in the
> background, which may be what's killing it.
>
> Anyway, there does seem to be a major difference in speed here, even in
> what is probably the most favorable configuration for pyfits, with it
> running last and thus having files potentially cached.
>
> Assuming this difference isn't just me, is there a way to get these speed
> improvements merged into pyfits?
>
> John

Thanks John for this benchmarking--this is very helpful.  For what it's 
worth, a lot of improvements have been made since PyFITS 3.0.6, and 
these are the results I'm getting on my end:

fitsio version: 0.9.0
pyfits version: 3.1.dev
Single pass: fitsio took  1.62691 seconds.
Single pass: pyfits took  7.50556 seconds.

A few additional trials gave roughly the same results.  I'm also less 
astonished by the speed differences, simply in that fitsio wraps 
CFITSIO, a C library, while much of PyFITS is pure Python.  Looking at 
the profile, PyFITS spends about two-fifths of its time just opening the file and 
creating objects for the Header and HDU structures.  There are some more 
micro-optimizations to be made there, but not much.  PyFITS provides a 
very flexible and extensible object-oriented interface that simply isn't 
possible with CFITSIO, but there's a tradeoff there in terms of raw 
performance, since so much of it is pure Python.  For example, in this 
benchmark, PyFITS spends over half a second (cumulatively, under the 
profiler) just on the routine for determining which HDU subclass to 
initialize based on the header keywords--CFITSIO has no equivalent 
routine because it doesn't even care what the HDU type is until you try 
to read some data.  And even then the only real distinction it tries to 
make is, "Is this an image or a table?"
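
To give a feel for the kind of work that routine does, here is a rough
sketch of choosing an HDU kind from header keywords.  It is not PyFITS'
actual code: classify_hdu and the plain-dict header are made up for
illustration, and only the keyword names (SIMPLE, XTENSION, ZIMAGE) come
from the FITS conventions.

```python
def classify_hdu(header):
    """Guess the HDU kind from a dict of header keywords (illustration only)."""
    # A primary HDU starts with SIMPLE = T.
    if header.get("SIMPLE") is True:
        return "primary"
    # Extensions announce their type in the XTENSION keyword.
    xtension = str(header.get("XTENSION", "")).strip().upper()
    if xtension == "IMAGE":
        return "image"
    if xtension == "BINTABLE":
        # Tile-compressed images are stored as binary tables with ZIMAGE = T.
        if header.get("ZIMAGE") is True:
            return "compressed image"
        return "binary table"
    if xtension == "TABLE":
        return "ascii table"
    return "unknown"
```

Even a chain of checks this small adds up when it runs in pure Python for
every HDU of thousands of files.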

So in simply opening files you'll always get better performance with 
fitsio.  That said, when I amend the benchmark to just open files and 
read the headers (without touching the data) fitsio is only about three 
times faster.  Still a big difference when dealing with a lot of files, 
but far less dramatic.
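
For anyone who wants to repeat the header-only comparison, a minimal
timing helper along these lines should do.  This is not the code from
John's attachment; time_pass is a made-up name, and the reader is
whatever callable you choose to pass in.

```python
import time


def time_pass(reader, filenames):
    """Return the wall-clock seconds spent calling reader(name) on each file."""
    start = time.perf_counter()
    for name in filenames:
        reader(name)
    return time.perf_counter() - start


# Possible readers for a header-only pass (assuming both packages are installed):
#   time_pass(pyfits.getheader, filenames)
#   time_pass(fitsio.read_header, filenames)
```

Running each reader over the same list of files more than once helps
separate the library cost from disk caching effects.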

Where PyFITS really takes a big hit performance-wise is in the handling 
of table columns, and, as Perry mentioned, the conversion from the raw 
data to Python data types like bools and strings.  As I wrote earlier in 
this thread, the biggest problem is that PyFITS' design has always been 
optimized for column-based access, and is horribly inefficient for 
row-based access, since the latter usually involves reading entire 
columns into memory anyway.  The reason for this is mostly 
historical--PyFITS' table interface is built on top of Numpy's recarray 
object, which I think is pretty flawed to begin with.  At the time this 
was necessary because Numpy did not yet support compound dtypes in its 
normal ndarrays.  At least I think that was the issue.  But now it seems 
to be more of a hindrance.
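
To illustrate why row access does not have to be expensive at the file
level: a FITS binary table stores its rows back to back as fixed-size
big-endian records, so a single row sits at a known byte offset.  The toy
below uses only the standard library and an invented three-field row
layout.  It is not how PyFITS or fitsio are implemented, just a sketch of
the two access patterns.

```python
import io
import struct

# Hypothetical row layout: big-endian int32 id, float64 flux, 4-byte string.
ROW = struct.Struct(">id4s")


def write_rows(rows):
    """Pack rows back to back, the way a binary table stores them."""
    return io.BytesIO(b"".join(ROW.pack(*row) for row in rows))


def read_row(stream, n):
    """Read only row n: one seek plus one fixed-size read."""
    stream.seek(n * ROW.size)
    return ROW.unpack(stream.read(ROW.size))


def read_column(stream, index):
    """Read one column: every row in the table has to be visited."""
    stream.seek(0)
    data = stream.read()
    return [fields[index] for fields in ROW.iter_unpack(data)]


table = write_rows([(1, 0.5, b"abcd"), (2, 1.5, b"efgh"), (3, 2.5, b"ijkl")])
second_row = read_row(table, 1)    # touches 16 bytes
flux_column = read_column(table, 1)  # touches the whole table
```

An interface that builds whole columns first and then indexes into them
pays the read_column cost for every field even when only one row is
wanted, which is the pattern that hurts in this benchmark.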

In any case, I'm glad fitsio is available too.  It's clear from this 
experiment that in cases where reading and parsing FITS files is the 
major bottleneck, it's probably the way to go for now.  I don't know how 
much time there will be going forward to devote to improving PyFITS in 
this regard.

Erik

