[Numpy-discussion] ANNOUNCE: Pytables 0.4

Francesc Alted falted at openlc.org
Tue Mar 18 12:47:17 CST 2003


Announcing PyTables 0.4
-----------------------

I'm happy to announce the first beta release of PyTables. It is labelled
beta because it has been thoroughly tested, even in production
environments, and is getting fairly mature.

From now on, the API will remain mostly stable, so you can start using
PyTables now with some guarantee that your code will also work (well,
mostly ;-) in the future. The large number of unit tests included in
PyTables will also help ensure backward compatibility, and that the
quality of future releases remains at least as good as it is now
(although hopefully it will improve!).

What's new
-----------

- numarray objects (NumArray, CharArray and RecArray) supported

- As a consequence of a large internal code redesign (numarray is at
  the core of PyTables now), performance has been improved by a factor
  of 10 (see "How Fast Is It?" section)

- It consumes far less memory than the previous version

- Support for reading generic HDF5 files added (!)

- Several bugs and memory leaks present in 0.2 fixed

- Updated documentation

- Added more unit tests (more than 200 now!)

What it is
----------

In short, PyTables provides a powerful and very Pythonic interface to
process table and array data.

Its goal is to enable the end user to easily manipulate scientific
data tables and Numerical and numarray Python objects in a persistent
hierarchical structure. The foundation of the underlying hierarchical
data organization is the excellent HDF5 library
(http://hdf.ncsa.uiuc.edu/HDF5).

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure, and all
values in each field have the same data type. The terms "fixed-length"
and strict "data types" may seem like strange requirements for an
interpreted language like Python, but they serve a useful purpose when
the goal is to save very large quantities of data (such as those
generated by many scientific applications) in an efficient manner that
reduces demand on CPU time and I/O resources.
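The idea of fixed-length records can be sketched with nothing but the
standard library's struct module. The field layout below is hypothetical,
chosen to mirror the Particle example at the end of this message (a
22-byte string, a 16-bit integer and a 32-bit float per record):

```python
import struct

# One fixed-length record: 22-byte string + int16 + float32,
# little-endian, no padding (22 + 2 + 4 = 28 bytes).
record = struct.Struct("<22shf")

packed = record.pack(b"This is particle:  0".ljust(22), 0, 0.0)

# Every record occupies exactly the same number of bytes, which is
# what makes very large files cheap to write and to seek into.
assert len(packed) == record.size == 28
```

Because every record has the same size, record number N starts at byte
offset N * 28, so no parsing is needed to locate it.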

Quite a bit of effort has been invested to make browsing the
hierarchical data structure a pleasant experience. PyTables implements
just two (orthogonal) easy-to-use methods for browsing.

What is HDF5?
-------------

For those who know nothing about HDF5: it is a general purpose
library and file format for storing scientific data, developed at
NCSA. HDF5 can store two primary objects: datasets and groups. A
dataset is essentially a multidimensional array of data elements, and
a group is a structure for organizing objects in an HDF5 file. Using
these two basic constructs, one can create and store almost any kind of
scientific data structure, such as images, arrays of vectors, and
structured and unstructured grids. You can also mix and match them in
HDF5 files according to your needs.
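The two constructs compose naturally. As a toy illustration only (this
is plain Python, not the real HDF5 API), groups act as containers and
datasets as arrays of elements, addressed by slash-separated paths:

```python
# A hypothetical model of an HDF5 object tree.
root = {                               # the root group "/"
    "images": {                        # a group
        "frame0": [[0, 1], [2, 3]],    # a 2x2 dataset
    },
    "vectors": {                       # another group
        "velocity": [0.5, 1.5, 2.5],   # a 1-D dataset
    },
}

def lookup(group, path):
    """Resolve an HDF5-style path such as '/images/frame0'."""
    node = group
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

print(lookup(root, "/images/frame0"))  # -> [[0, 1], [2, 3]]
```

The real library adds typed, chunked, compressed storage behind the
same tree-shaped addressing scheme.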

How fast is it?
---------------

PyTables can write table records between 20 and 30 times faster than
cPickle and between 3 and 10 times faster than struct (a module in the
standard library), and it retrieves information around 100 times faster
than cPickle and between 8 and 10 times faster than struct.
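A rough sketch of how such a serialization comparison can be set up
(pickle has since absorbed the old cPickle; the record layout is
hypothetical, and absolute numbers depend heavily on the machine and
Python version, so only the relative trend is meaningful):

```python
import pickle
import struct
import timeit

# 1000 hypothetical particle records: (name, id, speed)
records = [("particle %3d" % i, i, i * 2.0) for i in range(1000)]
fmt = struct.Struct("<12shf")  # fixed-length: 12s + int16 + float32

def with_pickle():
    return pickle.dumps(records)

def with_struct():
    return b"".join(fmt.pack(name.encode(), num, speed)
                    for name, num, speed in records)

t_pickle = timeit.timeit(with_pickle, number=100)
t_struct = timeit.timeit(with_struct, number=100)
print("pickle: %.4f s   struct: %.4f s" % (t_pickle, t_struct))
```

PyTables gains its additional edge over both by buffering many such
fixed-length records and handing them to HDF5 in large blocks.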

When compared with SQLite (http://www.sqlite.org/), one of the fastest
(free) relational databases available, PyTables achieves between 60%
and 80% of the speed of SQLite during selects on dataset sizes that
fit in the OS filesystem memory cache. However, when those sizes do
not fit in the cache (i.e. when dealing with large amounts of data),
PyTables beats SQLite by a factor of 2 or even more (depending on the
kind of record selected), and its performance in that case is limited
only by the I/O speed of the disk subsystem.

Go to http://pytables.sourceforge.net/doc/PyCon.html#section4 for a
detailed description of the benchmarks conducted.

Platforms
---------

I'm using Linux as the main development platform, but PyTables should
be easy to compile/install on other UNIX machines. This package has
also passed all the tests on an UltraSPARC platform with Solaris 7 and
Solaris 8. It also compiles and passes all the tests on an SGI
Origin2000 with MIPS R12000 processors running IRIX 6.5.

If you are using Windows and you get the library to work, please let
me know.

An example?
-----------

At the bottom of this message there is some code that shows basic
capabilities of PyTables. You may also look at
http://pytables.sourceforge.net/tut/tutorial1-1.html and 
http://pytables.sourceforge.net/tut/tutorial1-2.html
for online code.

Web site
--------

Go to the PyTables web site for downloading and more details:

http://pytables.sf.net/

Share your experience
---------------------

Let me know of any bugs, suggestions, gripes, kudos, etc. you may
have.

Have fun!

-- Francesc Alted
falted at openlc.org


*-*-*-**-*-*-**-*-*-**-*-*- Small code example  *-*-*-**-*-*-**-*-*-**-*-*-*
from tables import *

class Particle(IsDescription):
    identity = Col("CharType", 22, " ", pos = 0)  # character string
    idnumber = Col("Int16", 1, pos = 1)           # short integer
    speed    = Col("Float32", 1, pos = 2)         # single-precision float

# Open a file in "w"rite mode
fileh = openFile("objecttree.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root

# Create the groups:
group1 = fileh.createGroup(root, "group1")
group2 = fileh.createGroup(root, "group2")

# Create a string array in the root group
array1 = fileh.createArray(root, "array1", ["string", "array"],
                           "String array")
# Create a table in each of the two groups
table1 = fileh.createTable(group1, "table1", Particle)
table2 = fileh.createTable("/group2", "table2", Particle)
# Create another array, this time in group1
array2 = fileh.createArray("/group1", "array2", [1,2,3,4])

# Now, fill the tables:
for table in (table1, table2):
    # Get the record object associated with the table:
    row = table.row
    # Fill the table with 10 records
    for i in xrange(10):
        # First, assign the values to the Particle record
        row['identity']  = 'This is particle: %2d' % (i)
        row['idnumber'] = i
        row['speed']  = i * 2.
        # This injects the Record values
        row.append()

    # Flush the table buffers
    table.flush()

# Select actual data from the last table:
# entries where idnumber is greater than 3 and speed is between 4 and 10
out = [ x['identity'] for x in table.iterrows()
        if x['idnumber'] > 3 and 4 < x['speed'] < 10 ]

print out

# Finally, close the file (this also will flush all the remaining buffers!)
fileh.close()



