[Numpy-discussion] Assignment from a list is slow in Numarray

Francesc Alted falted at pytables.org
Mon Sep 20 10:42:05 CDT 2004


On Monday 20 September 2004 15:16, Timo Korvola wrote:
> ... which appears to be actually a HDF5 file.  Thanks for the tip.  It
> is clear that a binary file format would be more advantageous
> simply because text files are not seekable in the way needed for
> parallel reading.
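
To make the seekability point concrete, here is a minimal sketch using only
the Python standard library. The record layout (three little-endian 32-bit
integers per row), the file name and the helper names are assumptions made
for illustration; they are not taken from Todd's example or from any of the
libraries discussed here. With fixed-size binary records the byte offset of
row i is known in advance, so each reader can jump straight to its own slice.

```python
import os
import struct
import tempfile

ROWS = 1000
# Assumed layout: each row is three little-endian 32-bit integers (12 bytes).
RECORD = struct.Struct("<3i")


def write_rows(path):
    """Write ROWS fixed-size binary records; row i holds (3i, 3i+1, 3i+2)."""
    with open(path, "wb") as f:
        for i in range(ROWS):
            f.write(RECORD.pack(3 * i, 3 * i + 1, 3 * i + 2))


def read_row(path, i):
    """Read row i directly: its byte offset is simply i * RECORD.size."""
    with open(path, "rb") as f:
        f.seek(i * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))


def demo():
    """Write a temporary file and fetch three rows without scanning it."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_rows(path)
        return read_row(path, 0), read_row(path, 500), read_row(path, ROWS - 1)
    finally:
        os.remove(path)
```

In a text file the rows have variable length, so finding row 500 means
parsing everything before it; that is what makes splitting the work among
several readers awkward.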

Well, if you are considering parallel reading for the sake of speed, try
PyTables first; you may be surprised how fast it can be. For example, using
the same example that Todd sent today (i.e. writing and reading an array
of (10**5,3) integer elements), I've re-run it using PyTables and, just for
the sake of comparison, NetCDF (using the Scientific Python wrapper). Here
are the results (on a laptop with a 2 GHz Pentium IV running Debian
GNU/Linux):

Time to write file (text mode)                 2.12    sec
Time to write file (NetCDF version)            0.0587  sec
Time to write file (PyTables version)          0.00682 sec
Time to read file (strings.fasteval version)   0.259   sec
Time to read file (NetCDF version)             0.0470  sec
Time to read file (PyTables version)           0.00423 sec

So, for reading, PyTables can be more than 60 times faster than
numarray.strings.eval and about 11 times faster than Scientific.IO.NetCDF
(the latter using Numeric). And I'm pretty sure that these ratios would
increase for bigger datasets.
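
For reference, the ratios can be recomputed directly from the timings listed
above. This is only a small arithmetic sketch; the dictionary and function
names are mine and are not part of any benchmark script.

```python
# Timings in seconds, copied from the results listed above.
write_times = {"text": 2.12, "NetCDF": 0.0587, "PyTables": 0.00682}
read_times = {"strings.fasteval": 0.259, "NetCDF": 0.0470, "PyTables": 0.00423}


def speedups(times, baseline="PyTables"):
    """Return how many times slower each method is than the baseline."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


read_ratios = speedups(read_times)
write_ratios = speedups(write_times)

for label, ratios in (("read", read_ratios), ("write", write_ratios)):
    for name, ratio in ratios.items():
        print(f"{label:5s} {name:18s} {ratio:7.1f}x slower than PyTables")
```

For reading this gives roughly 61x for the text-based version and roughly
11x for NetCDF; for writing, roughly 311x and 8.6x respectively.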

> I was thinking of using NetCDF because OpenDX does
> not support HDF5.

Are you sure? Here are a couple of OpenDX data importers for HDF5:

http://www.cactuscode.org/VizTools/OpenDX.html
http://www-beams.colorado.edu/dxhdf5/

> An advantage of HDF5 would be that the libraries support parallel I/O
> via MPI-IO but can this be utilised in PyTables?  There is the problem
> that there are no standard MPI bindings for Python.

Curiously enough, Paul Dubois asked me the very same question during the
recent SciPy '04 Conference. And the answer is the same: PyTables does not
support MPI-IO at this time, because I guess that implementing it could
consume a formidable amount of developer time. I think I should first try to
make PyTables threading-aware before embarking on larger enterprises. I
recognize, though, that an MPI-IO-aware PyTables would be quite nice.

> I have also considered writing Python bindings for Parallel-NetCDF but
> I suppose that would not be totally trivial even if the library turns
> out to be well Swiggable.

Before doing that, talk with Konrad. I know that Scientific Python supports
MPI and BSPlib out of the box, so maybe there is a shorter path to what you
want.

In addition, you should be aware that the next version of NetCDF (version 4)
will be implemented on top of HDF5 [1]. So perhaps spending your time writing
Python bindings for Parallel-HDF5 would be a better bet for future
applications.

[1] http://my.unidata.ucar.edu/content/software/netcdf/netcdf-4/index.html

Cheers,

-- 
Francesc Alted





