[SciPy-dev] timeseries moved to scikits

Bradford Cross bradford.n.cross@gmail....
Thu Jan 24 11:40:09 CST 2008


On Jan 24, 2008 7:22 AM, David Huard <david.huard@gmail.com> wrote:

> Hi Bradford,
>
> Before putting too much time on our respective side, maybe we should
> discuss the basic implementation plan to make sure we agree on requirements
> and the general layout of the code.
>
> 1) In my draft, I divided the date sequence, the data and the mask into
> three separate entities. Does this also seem reasonable to you?


Yes, we really have no reasonable alternative.  Homogeneous data can be
stored as arrays, the mask needs to be stored on its own, and datetimes must
be converted to ISO strings.  There is no other good way to store datetimes
without losing milliseconds, which is unacceptable for many kinds of
timeseries.  (That is also one of the problems with the timeseries package's
internal dates: there is currently no sub-second implementation.)
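
A minimal sketch of the round trip I have in mind, using only the Python
standard library: a datetime with millisecond precision goes to an ISO 8601
string and comes back unchanged.  The helper names to_iso and from_iso and
the sample timestamp are made up for illustration; they are not part of the
timeseries package or PyTables.

```python
from datetime import datetime


def to_iso(dt):
    """Serialize a datetime to an ISO 8601 string, always keeping microseconds."""
    return dt.isoformat(timespec="microseconds")


def from_iso(text):
    """Parse a string produced by to_iso back into a datetime."""
    return datetime.fromisoformat(text)


# Hypothetical sample: an observation stamped at 11:40:09 plus 123 ms.
tick = datetime(2008, 1, 24, 11, 40, 9, 123000)
stored = to_iso(tick)
restored = from_iso(stored)
```

Forcing timespec="microseconds" keeps every string the same width (26
characters for naive datetimes), which is convenient if the strings end up
in a fixed-size string column.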


>
>
> 2) Data is stored in a Table for record arrays, and in an EArray for simple
> ndarrays. With this setup, object arrays cannot be stored. One solution
> could be to convert them to record arrays and store them in a table with a
> flag indicating that when loaded, the data should be returned as an object
> array.


We can use the VLArray, as mentioned in the reply from Francesc.  As I did in
the first prototype, I think we can provide a nice API based on the
Repository pattern that encapsulates our PyTables mapping implementation
regardless of whether we are using the timeseries package, user-defined time
series based on objects, etc.  We can try a few things, maybe even the
mapper function approach that I took in the first prototype, which maps from
objects to rows in a PyTables table.  This allows for element-by-element
reads/writes and borrows the idea from the DataMapper pattern often
seen in O/R mapping frameworks.
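
Roughly, the Repository plus mapper idea could look like the sketch below.
This is not the prototype code.  Tick, tick_to_row, row_to_tick and
SeriesRepository are hypothetical names, and a plain dict of row lists
stands in for the PyTables file so that the example runs on its own.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tick:
    """Hypothetical user-defined time series element."""
    when: datetime
    price: float


def tick_to_row(tick):
    """Mapper: object -> flat row made of storable primitives."""
    return {"date": tick.when.isoformat(timespec="microseconds"),
            "price": tick.price}


def row_to_tick(row):
    """Mapper: flat row -> object."""
    return Tick(datetime.fromisoformat(row["date"]), row["price"])


class SeriesRepository:
    """Hides the storage layout behind add/get; callers never see rows."""

    def __init__(self, to_row, from_row):
        self._to_row = to_row
        self._from_row = from_row
        # Assumption: a dict of row lists stands in for tables in a file.
        self._tables = {}

    def add(self, name, item):
        """Append one element to the named series (element-by-element write)."""
        self._tables.setdefault(name, []).append(self._to_row(item))

    def get(self, name):
        """Read the named series back as objects."""
        return [self._from_row(row) for row in self._tables.get(name, [])]


repo = SeriesRepository(tick_to_row, row_to_tick)
repo.add("series_a", Tick(datetime(2008, 1, 24, 11, 40, 9, 123000), 1.5))
ticks = repo.get("series_a")
```

Swapping the storage backend then only touches the repository class, and
supporting a new element type only needs a new pair of mapper functions.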



>
> 3) Dates are stored in an EArray of ISO-compliant strings. I used EArray
> because they are enlargeable. This could become handy eventually, but I
> don't know if there are any drawbacks (file size, load speed, etc.).
>


Yep, sounds good.  Nice tip in response from Francesc.

>
>
> 4) The mask is stored similarly to the data. The only exception is that
> when the mask is scalar (e.g. all mask values are False), the mask is
> simply stored as an attribute of the data.


Cool.
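
For the record, the scalar-mask shortcut could be sketched as below.  The
function names pack_mask and unpack_mask and the ("attr", ...) /
("array", ...) tags are hypothetical, and plain lists stand in for the
boolean mask arrays.

```python
def pack_mask(mask):
    """Choose a storage form for a boolean mask.

    Returns ("attr", False) when nothing is masked, so the mask can be kept
    as a single attribute; otherwise ("array", mask) so it is stored in full
    alongside the data.
    """
    if not any(mask):
        return ("attr", False)
    return ("array", list(mask))


def unpack_mask(kind, value, length):
    """Rebuild the full boolean mask from its stored form."""
    if kind == "attr":
        return [value] * length
    return list(value)


dense = pack_mask([False, True, False])     # kept as a full array
compact = pack_mask([False, False, False])  # collapsed to one attribute
```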


>
>
> 5) We should decide on how and where attributes are stored (e.g.
> fill_value, frequency).


It has been a couple of months since I did the first prototype, but I think
I recall that the PyTables docs have something about storing metadata
associated with a table.

>
>
> 6) Each array has its own group, so that multiple arrays can be stored in
> the same file. Although this is flexible, we should think about how to
> access each individual array, and possibly provide a list of the available
> arrays within a file.


I am a fan of the one-timeseries-per-table approach, which can scale nicely
to distributed databases for parallelization.  One of the cool parts of
the initial prototype is that you can drill into PyTables hierarchically
with the same syntax you use to drill into the Unix file system hierarchy,
which makes it easy to scale in and out from PyTables files into
distributed PyTables.  The hierarchical structure just needs to be laid out
in a way that makes sense for the domain the data comes from, and I am
pretty sure that the functions/objects I used in the initial prototype can
be generalized for that.  It makes accessing each array easy, scaling to a
distributed database easy, and providing a list of arrays at any level in
the hierarchy easy.
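
To make the path idea concrete, here is a small sketch of listing series at
any level of a slash-separated hierarchy.  The layout, the series names and
the helpers list_series and children are all invented for illustration, and
a plain set of paths stands in for the nodes of a PyTables file.

```python
# Hypothetical layout: one series per leaf, grouped by domain.
SERIES_PATHS = {
    "/equities/us/series_a",
    "/equities/us/series_b",
    "/equities/eu/series_c",
    "/rates/series_d",
}


def list_series(paths, group="/"):
    """Return every series path stored below `group`, sorted."""
    prefix = group.rstrip("/") + "/"
    return sorted(p for p in paths if p.startswith(prefix))


def children(paths, group="/"):
    """Return the immediate child names of `group` (subgroups or series)."""
    prefix = group.rstrip("/") + "/"
    return sorted({p[len(prefix):].split("/", 1)[0]
                   for p in paths if p.startswith(prefix)})


us = list_series(SERIES_PATHS, "/equities/us")
top = children(SERIES_PATHS)
```

The same two calls work at the root or at any subgroup, which is what makes
"give me a list of the available arrays" cheap to offer at every level.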


>
> Waiting for your comments,
>
>
> David
>
> 2008/1/24, Bradford Cross <bradford.n.cross@gmail.com>:
>
> > congrats matt...i am working with someone now on merging our code that
> > creates a time series repository using numpy + pytables + timeseries ...
> > ability to store timeseries in pytables as timeseries, numpy arrays, or ad
> > hoc for heterogeneous arrays...
> >
> > On Jan 22, 2008 8:28 PM, Matt Knox <mattknox_ca@hotmail.com> wrote:
> >
> > >  The timeseries module has been moved to the scikits svn repository (
> > > http://svn.scipy.org/svn/scikits/trunk/timeseries) and removed from
> > > the sandbox. It installs as a scikits namespace package as per the scikits
> > > convention.
> > >
> > > The maskedarray branch of numpy (only available in svn) is currently
> > > required for the timeseries scikit. This requirement will go away once the
> > > maskedarray merging is complete and an official  release of numpy has been
> > > made with the new masked array module.
> > >
> > > I'll try to whip up a quick page on the trac site for the timeseries
> > > module some time this week and port the existing documentation over to
> > > there.
> > >
> > > - Matt
> > >
> > > _______________________________________________
> > > Scipy-dev mailing list
> > > Scipy-dev@scipy.org
> > > http://projects.scipy.org/mailman/listinfo/scipy-dev
> > >
> > >