[SciPy-dev] time series implementation approach
mattknox_ca at hotmail.com
Tue Dec 12 14:15:43 CST 2006
Hi everyone. I have been discussing the approach I've used for my time series module (available in the sandbox) with another reader of this mailing list. There is one particular issue we seem to disagree on, and I would like to hear other opinions on it, one way or the other.
I'm going to outline the two proposed approaches and highlight some pros and cons of each. I'm also open to hearing a completely different approach altogether if you have ideas.
Common to both implementations is a Date class, where each Date has a frequency (daily, monthly, business days, etc.) and a value. The value represents the number of periods since the origin, where the origin is some chosen fixed date (currently the first period of the year 1850). Also, in both proposed implementations, every time series object has a frequency and a starting date.
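To make the Date idea concrete, here is a minimal sketch of my own, not the code in the sandbox. It assumes the value at the origin is 0, handles only daily ('d') and monthly ('m') frequencies, and the constant ORIGIN_YEAR and the method bodies are illustrative choices.

```python
import datetime

ORIGIN_YEAR = 1850  # the origin mentioned above: first period of the year 1850


class Date:
    """Illustrative sketch: a frequency code plus periods since the origin."""

    def __init__(self, freq, year, month=1, day=1):
        if freq == 'd':
            # Daily: count days since Jan 1 of the origin year (assumed value 0).
            origin = datetime.date(ORIGIN_YEAR, 1, 1)
            value = (datetime.date(year, month, day) - origin).days
        elif freq == 'm':
            # Monthly: count months since January of the origin year.
            value = (year - ORIGIN_YEAR) * 12 + (month - 1)
        else:
            raise ValueError("unsupported frequency: %r" % (freq,))
        self.freq = freq
        self.value = value

    def __int__(self):
        return self.value

    def __add__(self, periods):
        # Adding an integer moves forward by that many periods.
        result = Date.__new__(Date)
        result.freq = self.freq
        result.value = self.value + periods
        return result


start = Date(freq='d', year=1999, month=1, day=25)
end = start + 50
```

Because a Date reduces to an integer, int(start) and int(end) can be used directly as array positions or slice bounds.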
== Implementation A ==
Sub-class masked array. This allows usage of all the currently available functions and methods for masked arrays and minimizes the amount of custom internals that need to be written. Indexing for the timeseries object is always done relative to the start and end dates of the series. So for example, if series1.start == '1 jan 1999' (shown as a string here for clarity, but not implemented as a string), and this is a daily frequency series, then series1[0] represents the value at Jan 1, 1999, series1[2] represents the value at Jan 3, 1999, etc.
The __getitem__ and __setitem__ methods would be overridden to additionally accept a Date object of the same frequency as the series, so you could do something like: jan25val = series1[Date(freq='d',year=1999,month=1,day=25)]
Functions would be provided to take multiple series and align them appropriately so they can be added together, and so forth.
The drawback of this approach (relative to the next one to be discussed) is that an index used for one series has no inherent meaning for any other series unless you have explicitly aligned them ahead of time. Doing something like: foo = series1[5:25] + series2[5:25] doesn't make any sense unless you are careful to align the two series beforehand.
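A rough sketch of how Implementation A behaves is given here. It is a simplification of my own: a plain Python list stands in for the masked array, None stands in for a masked point, and the names Series, end_value and align are hypothetical. The Date here is reduced to a (freq, value) pair.

```python
from collections import namedtuple

# Hypothetical minimal Date: a frequency code plus periods since the origin.
Date = namedtuple('Date', ['freq', 'value'])


class Series:
    """Stand-in for the masked-array subclass; None plays a masked point."""

    def __init__(self, start, data):
        self.start = start        # Date of the first observation
        self.data = list(data)

    @property
    def end_value(self):
        # Date value of the last observation.
        return self.start.value + len(self.data) - 1

    def __getitem__(self, key):
        # Accept a Date of the series' own frequency as well as a position.
        if isinstance(key, Date):
            if key.freq != self.start.freq:
                raise ValueError("frequency mismatch")
            key = key.value - self.start.value
            if not 0 <= key < len(self.data):
                raise IndexError("date outside the series")
        return self.data[key]


def align(a, b):
    """Pad two same-frequency series so they cover the same date range."""
    if a.start.freq != b.start.freq:
        raise ValueError("frequency mismatch")
    first = min(a.start.value, b.start.value)
    last = max(a.end_value, b.end_value)

    def pad(s):
        before = [None] * (s.start.value - first)
        after = [None] * (last - s.end_value)
        return Series(Date(s.start.freq, first), before + s.data + after)

    return pad(a), pad(b)


series1 = Series(Date('d', 100), [1.0, 2.0, 3.0])
series2 = Series(Date('d', 101), [10.0, 20.0, 30.0])
aligned1, aligned2 = align(series1, series2)
```

Before alignment, position 1 means a different date in each series; after align(), position 1 refers to the same date in both, which is the step the user has to remember to perform.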
== Implementation B ==
Construct a new class (let's call it ShiftingArray) that has no inherent size. It stores an underlying data array that is hidden from the user, and when points outside the bounds of this underlying array are requested, the array is dynamically resized to accommodate the new bounds. Index X means the same thing for any ShiftingArray. If I add two shifting arrays, they are aligned appropriately behind the scenes with no user intervention. The TimeSeries class is then constructed as a sub-class of ShiftingArray. This makes it possible to do things like the following:
startDate = Date(freq='d',year=1999,month=1,day=25)
endDate = startDate + 50
mySlice = slice(int(startDate),int(endDate))
foo1 = series1[mySlice]
foo2 = series2[mySlice]
blah = series1 + series2
without worrying about where series1 and series2 start and end.
A problem with this approach is that there is more overhead than just subclassing masked array, because the dynamic shifting has a cost and existing functions will have to be wrapped in order to act on the time series objects. The internals of the class will be more complicated, but it takes away some micro-management from the user.
So... I realize this was a bit long-winded, and for that I apologize, but if you have any thoughts on the subject, please share.
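For comparison, a bare-bones sketch of the ShiftingArray idea follows. It is an illustration under my own assumptions rather than the proposed implementation: storage is a plain list, None marks a point with no data, slices need explicit bounds, and the slice step is ignored. The helper _grow is a hypothetical name.

```python
class ShiftingArray:
    """Illustrative sketch: absolute integer indices, storage grows on demand."""

    def __init__(self, start=0, data=()):
        self._start = start       # absolute index of self._data[0]
        self._data = list(data)   # hidden storage; None marks a missing point

    def _grow(self, lo, hi):
        """Extend storage so it covers absolute indices lo .. hi - 1."""
        if hi <= lo:
            return
        if not self._data:
            self._start = lo
        if lo < self._start:
            self._data[:0] = [None] * (self._start - lo)
            self._start = lo
        end = self._start + len(self._data)
        if hi > end:
            self._data.extend([None] * (hi - end))

    def __getitem__(self, key):
        if isinstance(key, slice):
            lo, hi = key.start, key.stop   # sketch: explicit bounds, no step
        else:
            lo, hi = key, key + 1
        self._grow(lo, hi)                 # requests outside the bounds resize
        chunk = self._data[lo - self._start:hi - self._start]
        return chunk if isinstance(key, slice) else chunk[0]

    def __setitem__(self, index, value):
        self._grow(index, index + 1)
        self._data[index - self._start] = value

    def __add__(self, other):
        # Both operands are read over their combined range, so they line up.
        lo = min(self._start, other._start)
        hi = max(self._start + len(self._data),
                 other._start + len(other._data))
        a, b = self[lo:hi], other[lo:hi]
        total = [None if x is None or y is None else x + y
                 for x, y in zip(a, b)]
        return ShiftingArray(lo, total)


s1 = ShiftingArray(100, [1, 2, 3])      # covers indices 100..102
s2 = ShiftingArray(102, [10, 20, 30])   # covers indices 102..104
total = s1 + s2
```

Note that in this sketch even a read outside the current bounds enlarges the hidden storage, which is one place the extra overhead mentioned above comes from.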
- Matt Knox