[Numpy-discussion] first recarray steps
Thu May 22 02:35:01 CDT 2008
Anne Archibald wrote:
> 2008/5/21 Vincent Schut <firstname.lastname@example.org>:
>> Christopher Barker wrote:
>>> Also, if your image data is rgb, usually, that's a (width, height, 3)
>>> array: rgbrgbrgbrgb... in memory. If you have a (3, width, height)
>>> array, then that's rrrrrrr....gggggggg......bbbbbbbb. Some image libs
>>> may give you that, I'm not sure.
>> My data is. In fact, this is a simplification of my situation; I'm
>> processing satellite data, which usually has more (and other) bands than
>> just rgb. But the data is definitely in shape (bands, y, x).
> You may find your life becomes easier if you transpose the data in
> memory. This can make a big difference to efficiency. Years ago I was
> working with enormous (by the standards of the day) MATLAB files on
> disk, storing complex data. The way (that version of) MATLAB
> represented complex data was the way you describe: matrix of real
> parts, matrix of imaginary parts. This meant that to draw a single
> pixel, the disk needed to seek twice... depending on what sort of
> operations you're doing, transposing your data so that each pixel is
> all in one place may improve cache coherency as well as making the use
> of record arrays possible.
Anne, thanks for the thoughts. In most cases, you'll probably be right.
In this case, however, it won't give me much (if any) speedup, and may
even cause a slowdown. Satellite images are often stored on disk in a
band-sequential manner. The library I use for IO is GDAL, which is a highly
optimized C library for reading/writing almost any kind of satellite
data type. It also features an internal caching mechanism. And it gives
me my data as (bands, y, x).
I'm not reading single pixels anyway. The amounts of data I have to
process (enormous, even by the standards of today ;-)) require me to do
this in chunks, in parallel, even on different cores/CPUs/computers.
Every chunk usually is (chunkYSize, chunkXSize, allBands) with xsize and
ysize being not so small (think from 64^2 to 1024^2) so that pretty much
eliminates any performance issues regarding the data on disk.
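As a rough illustration of that chunked processing, here is a minimal
sketch in plain NumPy. It assumes the (bands, y, x) layout mentioned
earlier in the thread; the helper name chunk_windows, the tiny image and
the "multiply by 2" step are made up for the example, and an in-memory
array stands in for the file that would really be read window by window
through the IO library.

```python
import numpy as np

def chunk_windows(ysize, xsize, chunk_y, chunk_x):
    """Yield (yoff, xoff, ny, nx) windows that tile a ysize-by-xsize image.

    Hypothetical helper: windows on the bottom and right edges are
    clipped so that they never run past the image.
    """
    for yoff in range(0, ysize, chunk_y):
        ny = min(chunk_y, ysize - yoff)
        for xoff in range(0, xsize, chunk_x):
            nx = min(chunk_x, xsize - xoff)
            yield yoff, xoff, ny, nx

# Stand-in for an image on disk, in (bands, y, x) order.
image = np.arange(4 * 10 * 7).reshape(4, 10, 7)
bands, ysize, xsize = image.shape

result = np.empty_like(image)
for yoff, xoff, ny, nx in chunk_windows(ysize, xsize, 4, 3):
    # Every chunk carries all bands for its window.
    chunk = image[:, yoff:yoff + ny, xoff:xoff + nx]
    # Placeholder for the real per-chunk processing.
    result[:, yoff:yoff + ny, xoff:xoff + nx] = chunk * 2
```

Since each window is independent of the others, the list of windows
could just as well be handed out to separate processes or machines.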
Furthermore, having to process on multiple computers forces me to have
my data on networked storage. The latency and transfer rate of the
network will probably eliminate any small speedup because my drive has
to do fewer seeks...
Now for the recarray part, that would indeed ease my life a bit :)
However, having to transpose the data in memory on every read and write
does not sound very attractive. It will waste cycles and memory, and be
asking for bugs. I can live without recarrays, for sure. I only hoped
they might make my life a bit easier and my code a bit more readable,
without too much effort. Well, they won't, apparently... I'll just go on
like I did before this little exercise.
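To make the cost concrete, here is a minimal sketch of what the
transpose-then-view route might look like in NumPy. The chunk size, the
uint16 data type and the band names (red, green, nir) are assumptions
for illustration only. The point is that the transpose itself is a free
view, but getting each pixel's bands adjacent in memory, which a record
view needs, forces a full copy of the chunk.

```python
import numpy as np

# Hypothetical chunk in band-sequential layout: (bands, y, x).
bands, ny, nx = 3, 4, 5
chunk = np.arange(bands * ny * nx, dtype=np.uint16).reshape(bands, ny, nx)

# Hypothetical band names, one record field per band.
pixel_dtype = np.dtype([("red", np.uint16),
                        ("green", np.uint16),
                        ("nir", np.uint16)])

# Move the band axis last. This is only a view; nothing is copied yet.
pixels_last = chunk.transpose(1, 2, 0)  # shape (ny, nx, bands)
assert np.shares_memory(pixels_last, chunk)

# A record view needs the bands of one pixel to sit next to each other
# in memory, so the data has to be copied into C order here.
interleaved = np.ascontiguousarray(pixels_last)
assert not np.shares_memory(interleaved, chunk)

# Reinterpret the band axis as one record per pixel.
records = interleaved.view(pixel_dtype).reshape(ny, nx).view(np.recarray)

# Field access gives the same values as indexing the band axis directly.
assert np.array_equal(records.nir, chunk[2])
```

The same copy would be needed in the other direction before writing.
A cheaper way to get part of the readability, under the same
assumptions, might be a plain dict of band names to indices, so that
chunk[band["nir"]] works on the untransposed array.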
Thanks all for the inputs.