[Numpy-discussion] Data file format choice.
Fri Jan 30 13:42:23 CST 2009
On Friday 30 January 2009, Jeff Whitaker wrote:
> Gary Pajer wrote:
> > It's time for me to select a data format.
> > My data are (more or less) spectra ( a couple of thousand samples),
> > six channels, each channel running around 10 Hz, collecting for a
> > minute or so. Plus all the settings on the instrument.
> > I don't see any significant differences between netCDF4 and HDF5.
> Gary: netCDF4 is just a thin wrapper on top of HDF5 1.8 - think of
> it as a higher level API.
> > Similarly, I don't see significant differences between pytables and
> > h5py. Does one play better with numpy?
> pytables has been around longer and is well-tested, has nice pythonic
> features, but files you write with it may not be readable by C or
> fortran clients.
Just to be clear: PyTables will only write pickled objects to the file
if it is not possible to reasonably represent them as native HDF5 objects.
But, if you try to save NumPy objects or regular Python scalars they
are effectively written as native HDF5 objects (see ).
Regarding a comparison with h5py (disclaimer: I'm the main author of
PyTables), I'd say that h5py is designed to map directly onto NumPy
array capabilities, but doesn't try to go further. It is also worth
noting that h5py offers access to the low-level HDF5 functions, which
can be useful if you want to get deeper into HDF5 intricacies.
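As a minimal sketch of what that direct mapping onto NumPy looks like
in h5py (the file name, dataset name and attribute names are made up
for illustration, and the array shape only loosely follows Gary's
description of six channels of spectra):

```python
import os
import tempfile

import numpy as np
import h5py

# Hypothetical layout: 6 channels x 10 spectra x 2000 samples each.
spectra = np.random.default_rng(0).random((6, 10, 2000)).astype("f4")

path = os.path.join(tempfile.mkdtemp(), "spectra_h5py.h5")

with h5py.File(path, "w") as f:
    # A NumPy array is stored as a native HDF5 dataset.
    dset = f.create_dataset("spectra", data=spectra)
    # Instrument settings can be kept as attributes (names are assumptions).
    dset.attrs["sample_rate_hz"] = 10.0
    dset.attrs["instrument"] = "demo"

with h5py.File(path, "r") as f:
    dset = f["spectra"]
    shape = dset.shape
    # NumPy-style slicing reads only the requested part from disk.
    channel0_first = dset[0, 0, :]
    rate = float(dset.attrs["sample_rate_hz"])
```

The dataset behaves much like an array that lives on disk: slicing it
returns an ordinary NumPy array holding just the selected region.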
For its part, PyTables doesn't try to go this low-level and, besides
supporting general NumPy objects, it is more focused on implementing
advanced features that are normally only available in database-oriented
approaches, like enumerated types, flexible query iterators for tables
(on-disk equivalent to recarrays), indexing (only Pro version), do/undo
features or natural naming (for an enhanced interactive experience).
PyTables also tries hard to be a high performance interface to HDF5,
implementing niceties like internal LRU caches for nodes, automatic
chunksizes for the datasets or making use of numexpr internally so as
to accelerate queries to a maximum.
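To illustrate the table query features mentioned above, here is a
small sketch using the snake_case method names of current PyTables
releases. The record layout (channel, time_s, peak), the node name and
the file name are assumptions invented for this example, not anything
prescribed by PyTables:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical record layout: one summary row per acquired spectrum.
row_dtype = np.dtype([("channel", "i4"), ("time_s", "f8"), ("peak", "f8")])
rows = np.zeros(60, dtype=row_dtype)
rows["channel"] = np.arange(60) % 6
rows["time_s"] = np.arange(60) * 0.1
rows["peak"] = np.linspace(0.0, 1.0, 60)

path = os.path.join(tempfile.mkdtemp(), "spectra_pytables.h5")

with tables.open_file(path, "w") as h5:
    # A Table is the on-disk counterpart of a NumPy record array.
    table = h5.create_table("/", "summary", description=row_dtype)
    table.append(rows)
    table.flush()
    # The condition string is evaluated in-kernel rather than row by row
    # in Python; the result comes back as a NumPy structured array.
    hits = table.read_where("(channel == 2) & (peak > 0.5)")
```

The same condition string can also be passed to table.where() to
iterate over the matching rows instead of loading them all at once.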
Finally, although h5py is relatively recent, I'm really impressed by
the work that Andrew has already done, and in fact I'm looking forward
to backporting some of the h5py features (like general NumPy-like fancy
indexing capabilities) to PyTables. At any rate, it is clear that the
h5py/PyTables pairing will benefit users, with the only handicap that
they have to choose their preferred API to HDF5 (or they can use both,
which could really be a lot of fun ;-). NetCDF4-based interfaces are
probably also a good approach and, as they are based on HDF5,
compatibility is ensured.