[SciPy-User] Pylab - standard packages

Francesc Alted francesc@continuum...
Mon Sep 24 04:46:10 CDT 2012


Hey, nice to see this conversation going on.  I'm no longer an 
active developer of PyTables (other brave people like Antonio 
Valentino, Anthony Scopatz and Josh Moore took over the project during 
the past year), but I created it and led its development for many years, 
and I still provide some feedback on the PyTables mailing list, so I'd 
be glad to contribute my view here too (although it will definitely not 
be impartial because, what the heck, I still consider it my little 
boy ;)

On 9/22/12 5:04 PM, Andrew Collette wrote:
> On Sat, Sep 22, 2012 at 3:56 AM, Thomas Kluyver <takowl@gmail.com> wrote:
>
>> Andrew: Thanks for the info about h5py. As I don't use HDF5 myself,
>> can someone describe, as impartially as possible, the differences
>> between PyTables and h5py: how do the APIs differ, any speed
>> difference, how well known are they, what do they depend on, and what
>> depends on them (e.g. I think pandas can use PyTables?). If it's
>> sensible to include both, we can do so, but I'd like to get a feel for
>> what they each are.

PyTables is a high-level API for the HDF5 library that includes 
advanced features not present in HDF5 itself, like indexing 
(via OPSI), out-of-core operations (via numexpr) and very fast 
compression (via Blosc).  Unlike h5py, PyTables does not try to 
expose the complete HDF5 API; it is centered on providing 
an easy-to-use and performant interface to the HDF5 library.

> I'm certainly not unbiased, but while we're waiting for others to
> rejoin the discussion I can give my perspective on this question.  I
> never saw h5py and PyTables as direct competitors; they have different
> design goals.  To me the basic difference is that PyTables is both a
> way to talk to HDF5 and a really great database-like interface with
> things like indexing, searching, etc. (both NumExpr and Blosc came out
> of work on PyTables, I believe).  In contrast, h5py arose by asking
> "how can we map the basic HDF5 abstractions to Python in a direct but
> still Pythonic way".

Yeah, I agree that h5py and PyTables cannot be seen as direct 
competitors (although many people, including myself at times :), see 
them as such).  As I said before, performance is one of the aspects that 
is *extremely* important for PyTables, and you are right that both 
OPSI and Blosc were developed for the sake of PyTables.

The numexpr case is somewhat different, since it was originally developed 
outside of the project (by David M. Cooke), and was then adopted and 
largely enhanced to allow fast queries and out-of-core computations in 
PyTables.  Of course, all these enhancements were contributed back to 
the original numexpr project, and it continues to be a stand-alone 
library that is useful in many scenarios other than PyTables, in a 
typical case of fertile cross-pollination between projects.
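
To make "out-of-core computation" concrete, here is a minimal sketch of
the underlying idea: evaluate an expression block by block so that no
full-size temporary is ever built.  It is only an illustration under
stated assumptions, not numexpr's implementation (numexpr does this in
compiled code).  The function name `evaluate_blocked`, the block size
and the expression 2*a + 3*b are all made up for the example, and plain
Python lists stand in for on-disk arrays.

```python
# Sketch of blocked evaluation of 2*a + 3*b.
# Hypothetical names; plain lists stand in for arrays that live on disk.

BLOCK = 4  # elements handled per step; an arbitrary value for this example


def evaluate_blocked(a, b, block=BLOCK):
    """Yield 2*a + 3*b one block at a time, never building a full-size temporary."""
    for start in range(0, len(a), block):
        chunk_a = a[start:start + block]
        chunk_b = b[start:start + block]
        # Only `block` elements of each operand are in play here.
        yield [2 * x + 3 * y for x, y in zip(chunk_a, chunk_b)]


a = list(range(10))
b = list(range(10, 20))

result = []
for block_result in evaluate_blocked(a, b):
    # In a real out-of-core setting each block would be written back to disk
    # instead of being accumulated in memory.
    result.extend(block_result)
```

The point of the design is that memory use depends on the block size
rather than on the length of the operands.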

>
> The API for h5py has both a high-level and low-level component; like
> PyTables, the high-level component is oriented around files, datasets
> and groups, allows iteration over elements in the file, etc. The
> emphasis in h5py is to use existing objects and abstractions from
> NumPy; for example, datasets have .dtype and .shape attributes and can
> be sliced like NumPy arrays.  Groups are treated like dictionaries,
> are iterable, have .keys() and .iteritems() and friends, etc.

So for PyTables the approach in this regard is very close to h5py, with 
the exception that the group hierarchy is primarily (but not only) 
accessed via a natural naming approach, i.e. something like:

file.root.group1.group2.dataset

instead of the h5py approach:

file['group1']['group2']['dataset']

I find the former much more useful for discovering the structure of a 
file (by hitting the TAB key in interactive REPL environments), but this 
is probably a matter of taste.
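
The two access styles can be mimicked with a toy class over a nested
dictionary.  This is a sketch only: it uses neither PyTables nor h5py,
and `NaturalNode` is a hypothetical name invented for the example.  It
does show why attribute-style access plays well with TAB completion:
the children are reported by `__dir__`.

```python
# Toy model of "natural naming" versus dictionary-style access.
# NOT PyTables or h5py code; NaturalNode is a made-up helper.


class NaturalNode:
    """Wrap a nested dict so that children are reachable as attributes."""

    def __init__(self, children):
        self._children = children

    def _wrap(self, child):
        # Groups (dicts) are wrapped again; leaves are returned unchanged.
        return NaturalNode(child) if isinstance(child, dict) else child

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._wrap(self._children[name])
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        # Dictionary-style access, like the second spelling above.
        return self._wrap(self._children[name])

    def __dir__(self):
        # Listing the children here is what lets a REPL offer them on TAB.
        return list(super().__dir__()) + list(self._children)


tree = {"group1": {"group2": {"dataset": [1, 2, 3]}}}
root = NaturalNode(tree)

by_attribute = root.group1.group2.dataset
by_key = root["group1"]["group2"]["dataset"]
```

Both spellings reach the same object; the difference is purely in how
the hierarchy is navigated.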

>
> The "main" high level interface in h5py also rests on a huge low-level
> interface written in Cython
> (http://h5py.alfven.org/docs/low/index.html), which exposes the
> majority of the HDF5 C API in a Pythonic, object-oriented way.  The
> goal here is anything you can do with HDF5 in C, you can do in Python.

Yeah, as has already been said, here lies one of the big 
differences between the two projects: PyTables does not come with a 
low-level interface to HDF5.  This, however, has been a deliberate 
design decision, as the HDF5 C API can be rather complex and 
cumbersome for the large majority of Python users, and it was estimated 
that most people were not interested in delving into 
HDF5 intricacies (those PyTables users interested in accessing such 
HDF5 capabilities can always take the Cython sources and build new 
features on top of them, which I find a more sensible approach, especially 
if performance matters to the user).

>
> It has no dependencies beyond NumPy and Python itself; I will let
> others chime in for specific projects which depend on h5py.  As a
> rough proxy for popularity, h5py has roughly 30k downloads over the
> life of the project (10k in the past year).

I cannot tell how many downloads PyTables has had over its almost 10 
years of existence (the first public release was made back in October 
2002), but probably a lot.  Sourceforge reports that it received more 
than 50K downloads for the 2.3 series (released one year ago) and more 
than 6.5K downloads for the recent 2.4.0 version released a couple of 
months ago.  However, those numbers are a bit tricky because PyTables is 
shipped in most Linux distributions, and Windows binaries are no longer 
available on Sourceforge, only through independent Windows distributions 
like Gohlke's, EPD, Python(x,y) or Anaconda, so the actual number is 
likely much higher than that (but the same should apply to h5py).

>
> I have never benchmarked PyTables against h5py, but I strongly suspect
> PyTables is faster.

Yes.  I don't know of anybody having done an exhaustive comparison 
covering most scenarios, but my own benchmarks confirm that PyTables is 
generally faster than h5py.  It is true that both projects use HDF5 as 
the basic I/O library, but combining things like OPSI, Blosc and 
numexpr can make a huge difference.  For example, in some 
benchmarks that I did some months ago, the difference in performance 
ranged from 10 thousand to more than 100 thousand times, especially 
when browsing and querying medium-size on-disk tables (100 thousand rows 
long):

http://www.digipedia.pl/usenet/thread/16009/26243/#post26257
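
Why an index can change query times so much is easy to show with a toy
example.  The sketch below compares a full scan with a lookup through a
sorted index.  It is a simplification under stated assumptions: OPSI
uses partially sorted indexes over on-disk tables, whereas this uses a
fully sorted in-memory list and the standard `bisect` module.  The
table contents and the function names are invented for the example.

```python
import bisect

# Toy table of (row_id, value) pairs; values are deliberately unsorted.
rows = [(i, (i * 37) % 101) for i in range(1000)]


def query_scan(rows, lo, hi):
    """Full scan: test every row against lo <= value < hi."""
    return sorted(rid for rid, value in rows if lo <= value < hi)


# Build the index once: values in sorted order, row ids kept alongside.
index = sorted((value, rid) for rid, value in rows)
index_values = [value for value, _ in index]


def query_indexed(lo, hi):
    """Indexed query: two binary searches, then read only the matching slice."""
    start = bisect.bisect_left(index_values, lo)
    stop = bisect.bisect_left(index_values, hi)
    return sorted(rid for _, rid in index[start:stop])
```

The scan touches every row on each query, while the indexed version
locates the matching range with two binary searches and then reads only
the rows that qualify.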

Also, Gaël Varoquaux blogged on some real-life benchmarks about how the 
ultra-fast Blosc (and LZO) compressors integrated in PyTables can 
accelerate the I/O:

http://gael-varoquaux.info/blog/?p=159

Anyway, I think the home page about PyTables does a good job expressing 
how important performance is for the project:

http://www.pytables.org/moin

>    Most of the development effort that has recently
> gone into h5py has been focused in other areas like API coverage,
> Python 3 support, Unicode, and thread safety; we've never done careful
> performance testing.

Yep, h5py is probably more advanced than PyTables here, especially 
because the latter does not provide full Python 3 support yet.  However, 
Antonio has made great strides in this area, and most of the pieces are 
already there; the most significant missing part is Python 3 
support for numexpr.  In fact, Antonio already kindly provided patches 
for this, but I still need to check them out and release a new version 
of numexpr.  I think I should stop procrastinating and give this a go 
sooner rather than later (Antonio, if you are reading this, be ready for 
some questions about your patches soon :).

-- 
Francesc Alted


