[AstroPy] reading one line from many small fits files
Tue Jul 31 10:52:07 CDT 2012
On 07/30/2012 08:57 PM, Derek Homeier wrote:
> Hi John,
> On 31.07.2012, at 1:40AM, "John K. Parejko"<firstname.lastname@example.org> wrote:
>> This is really more of a pyfits question, but I've upgraded to pyfits 3.1 (SVN), which is the version in astropy.
>> I have data stored in thousands of ~few MB .fits files (photoObj files from SDSS) totaling a few TB of data, and I know the one single line I want to extract from some known subset of those files. But pyfits is taking more than a second per file to extract the fields I want, which seems very long, especially if it is using memmapped access, and thus should only have to read that single line (plus the header) from each file.
>> I'm doing something like this:
>>     result = np.empty(len(data), dtype=dtype)
>>     for i, x in enumerate(data):
>>         photo = pyfits.open(photo, memmap=True)
>>         result[i] = photo.data[x[otherfield]-1]
>> Is there a better way to go about this? Is pyfits known to be quite slow when reading a single row from a lot of different files? Anyone have suggestions on how to speed this up?
> that seems quite slow; it takes me about 50 ms to read a random line from the DR8 example file
> with pyfits 3.0.2. Unless the file access itself takes so long something appears to be odd.
> But the only thing coming to my mind now is that pyfits supports scaled column data (similar to
> BSCALE/BZERO in image HDUs, I assume), and if such keywords were present, they would probably
> cause a corresponding transformation for the entire bintable. They don't seem to exist in the standard
> SDSS files, though.
> Naïve question: do you call photo.close() after each read?
It probably shouldn't matter whether or not he's calling close(), but
the question about BSCALE/BZERO is possibly relevant. Is the data
you're reading from an image or a table? If it's an image, then as
Derek wrote PyFITS is still pretty inefficient, in that it will
transform the entire image, even if using mmap (which is the default now
by the way). I have plans for overhauling this but it hasn't been a high
priority for the most part. You can also turn off image scaling by passing
do_not_scale_image_data=True when opening the file; that might speed
things up.
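To illustrate why scaling is the expensive step (a sketch in plain NumPy, not
PyFITS internals, with made-up BSCALE/BZERO values): transforming the whole
array touches every element, while slicing first and scaling only the
requested row gives the same answer at a fraction of the cost.

```python
import numpy as np

# Hypothetical raw image data with BSCALE/BZERO-style keywords (made-up values).
bscale, bzero = 0.01, 1000.0
raw = np.arange(12, dtype=np.int16).reshape(4, 3)

# What an eager reader effectively does for a scaled image: transform
# everything, even if only one row is wanted afterwards.
scaled_all = raw * bscale + bzero
row_via_full_scale = scaled_all[2]

# The cheap alternative: slice first, then scale just that row.
row_scaled_lazily = raw[2] * bscale + bzero

assert np.allclose(row_via_full_scale, row_scaled_lazily)
```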
This is one area where using the .section feature on Image HDUs might
still be useful. For example:
result[i] = photo.section[x[otherfield] - 1]
PyFITS 3.1 has improved support for scaling just sections of the file, which
didn't work well before, so that might also be faster.
Of course this is all a moot point if this is not a scaled image. In
any case, opening a file and reading a single row out of the data should
not generally take as long as 1 second--especially if they're small files.
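As a rough illustration of the cost model one would expect here (plain NumPy
on a made-up binary file, not the actual FITS layout or real photoObj
fields): with a fixed-size row, a memmapped read of one record only faults
in the pages backing that record, so it stays cheap regardless of file size.

```python
import os
import tempfile
import numpy as np

# Made-up record layout standing in for a bintable row.
row = np.dtype([('objid', '<i8'), ('ra', '<f8'), ('dec', '<f8')])

# Write a small binary "table" to disk.
table = np.zeros(1000, dtype=row)
table['objid'] = np.arange(1000)
table['ra'] = np.linspace(0.0, 360.0, 1000)
path = os.path.join(tempfile.mkdtemp(), 'table.bin')
table.tofile(path)

# Memmap it and read a single record; nothing else in the file is read.
mm = np.memmap(path, dtype=row, mode='r')
rec = mm[742]
assert rec['objid'] == 742
```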