[SciPy-dev] Binary i/o package

Erin Sheldon erin.sheldon@gmail....
Sun Jun 3 15:42:04 CDT 2007


On 6/3/07, Anne Archibald <peridot.faceted@gmail.com> wrote:
> On 01/06/07, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> > The overwhelming silence tells me that either no one here thinks this
> > is relevant or no one bothered reading the email.  I feel like the
> > functionality I have written into this package is so basic it belongs
> > in scipy io if not in numpy itself. Please give me some feedback one
> > way or another.
> >
> > If it just seems irrelevant then I may just look into making it a
> > scikits package.
>
> I'm not trying to knock your work, but it's not clear to me that
> there's enough room between readarray/writearray/tofile/fromfile and
> pytables to accommodate another package. Maybe I don't see what your
> package does, but why wouldn't I just install pytables instead? What
> are its advantages and disadvantages compared to pytables?

Anne -

fromfile works on the whole file or nothing (or on contiguous chunks of
rows).  read_array can read selected fields and rows from ascii. It is
pure Python, which means it is rather slow, but that's OK because ascii
files are rarely large.  PyTables or a database like postgres are at a
different level, but they are built on complex libraries and have complex
interfaces.
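
To make the fromfile limitation concrete, here is a rough sketch using only
plain numpy, not code from my package. The record layout and the helper name
read_row_chunk are made up for illustration. You can seek to a row and read a
contiguous run of whole records, but every field comes along whether you
want it or not:

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-length record layout (20 bytes per row, no padding).
rec_dtype = np.dtype([("id", "<i4"), ("ra", "<f8"), ("dec", "<f8")])


def read_row_chunk(path, first_row, nrows, dtype=rec_dtype):
    """Read nrows whole records starting at first_row (all fields)."""
    with open(path, "rb") as f:
        # Fixed-length records make the byte offset of a row trivial.
        f.seek(first_row * dtype.itemsize)
        return np.fromfile(f, dtype=dtype, count=nrows)


# Write a small demo file, then read rows 3, 4 and 5 back.
data = np.zeros(10, dtype=rec_dtype)
data["id"] = np.arange(10)
data["ra"] = np.linspace(0.0, 90.0, 10)
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
data.tofile(path)

chunk = read_row_chunk(path, 3, 3)
```

Anything finer than that (scattered rows, a subset of fields) has to be done
by hand on top of this.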

The need to random-access into a binary file with fixed-length records
is basic for most data storage and retrieval.  For example most
standardized file formats are self-describing binary tables which
require no previous knowledge of the data other than the format (e.g.
FITS in astronomy).  But in scripting languages one is usually limited
to a read-all-or-nothing approach, because all you have is the
equivalent of fromfile. I included a working example of such a
self-describing format in the simple_format sub-module of readfields.
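
To show what I mean by "self-describing", here is a minimal sketch. It is
NOT the layout used by simple_format; the one-line JSON header and the
function names write_table and read_table are assumptions made up for this
example. The header records the number of rows and the field names and types,
so a reader needs no prior knowledge of the data:

```python
import json
import os
import tempfile

import numpy as np


def write_table(path, arr):
    """Write a one-line JSON header, then the raw fixed-length records."""
    header = {"nrows": int(arr.shape[0]), "descr": arr.dtype.descr}
    with open(path, "wb") as f:
        f.write(json.dumps(header).encode("ascii") + b"\n")
        f.write(arr.tobytes())


def read_table(path):
    """Rebuild the dtype from the header, then read the records."""
    with open(path, "rb") as f:
        header = json.loads(f.readline().decode("ascii"))
        # JSON turns the (name, type) tuples into lists; convert them back.
        dtype = np.dtype([tuple(field) for field in header["descr"]])
        raw = f.read(header["nrows"] * dtype.itemsize)
    return np.frombuffer(raw, dtype=dtype).copy()


# Round trip with a hypothetical two-field table.
table = np.zeros(4, dtype=[("id", "<i4"), ("flux", "<f8")])
table["id"] = [10, 11, 12, 13]
table["flux"] = [0.5, 1.5, 2.5, 3.5]
path = os.path.join(tempfile.mkdtemp(), "table.bin")
write_table(path, table)
restored = read_table(path)
```

This sketch only handles simple scalar fields; a real format would also need
to deal with sub-array fields and so on.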

Another example is a simple relational database which is a group of
tables, with each table in a flat file or spread across flat files
(again no variable length fields).  For efficiency one needs low-level
random access to the files.

This package fills the niche and is the backbone of such systems.  And
it is a small chunk of code.  You can extract what you want from the
file and store it directly into a numpy array in the most efficient
manner possible.
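
For comparison, the kind of access I am talking about can be approximated
with plain numpy using np.memmap. This is only a sketch of the idea, not the
package's interface or implementation; read_fields_rows and the record layout
are hypothetical names chosen for the example:

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-length record layout.
rec_dtype = np.dtype([("id", "<i4"), ("ra", "<f8"), ("dec", "<f8")])


def read_fields_rows(path, fields, rows, dtype=rec_dtype, offset=0):
    """Copy only the requested fields of the requested rows into an array."""
    mm = np.memmap(path, dtype=dtype, mode="r", offset=offset)
    out_dtype = np.dtype([(name, dtype[name]) for name in fields])
    out = np.empty(len(rows), dtype=out_dtype)
    for name in fields:
        # Indexing with a list of rows copies just those elements.
        out[name] = mm[name][rows]
    del mm  # drop the mapping once the data has been copied out
    return out


# Demo file with 10 records; pull two fields from three scattered rows.
data = np.zeros(10, dtype=rec_dtype)
data["id"] = np.arange(10)
data["dec"] = np.arange(10) * 2.0
path = os.path.join(tempfile.mkdtemp(), "records.bin")
data.tofile(path)

subset = read_fields_rows(path, ["id", "dec"], [1, 7, 4])
```

The offset argument is there so the same function could skip over a header
in a self-describing file.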

I can speak for myself that with the larger astronomical data sets
that have come online it has become useful to write big files in a
standardized format and treat them as a simple database.   One does
not have to install and administer a database system like postgres or
pytables (HDF5), and one does not have to learn a new system beyond
numpy. But one gets most of the performance benefits of low-level
random-access to the data.

Erin

