[Numpy-discussion] fixing up datetime
Dave Hirschfeld
dave.hirschfeld@gmail....
Tue Jun 7 07:34:16 CDT 2011
As a user of numpy/scipy in finance I thought I would put in my 2p worth as
it's something which is of great importance in this area.
I'm currently a heavy user of the scikits.timeseries package by Matt & Pierre
and I'm also following the development of statsmodels and pandas should we
require more sophisticated statistics in future. Hopefully the numpy datetime
type will provide a foundation such packages can build upon...
I'll use the timeseries package for reference since I'm most familiar with it
and it's a very good api for my requirements. Apologies to Matt/Pierre if
I get anything wrong - feel free to correct my misconceptions...
I think some of the complexity is coming from the definition of the timedelta.
In the timeseries package each date simply represents the number of periods
since the epoch and the difference between dates is therefore just and integer
with no attached metadata - its meaning is determined by the context it's used
in. e.g.
In [56]: M1 = ts.Date('M',"01-Jan-2011")
In [57]: M2 = ts.Date('M',"01-Jan-2012")
In [58]: M2 - M1
Out[58]: 12
timeseries gets on just fine without a timedelta type - a timedelta is just an
integer and if you add an integer to a date it's interpreted as the number of
periods of that dates frequency. From a useability point of view M1 + 1 is
much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1).
Something like the dateutil relativedelta pacage is very convenient and
could serve as a template for such functionality:
In [59]: from dateutil.relativedelta import relativedelta
In [60]: (D1 + 30).datetime
Out[60]: datetime.datetime(2011, 1, 31, 0, 0)
In [61]: (D1 + 30).datetime + relativedelta(months=1)
Out[61]: datetime.datetime(2011, 2, 28, 0, 0)
...but you can still get the same behaviour without a timedelta by asking that
the user explicitly specify what they mean by "adding one month" to a date of
a different frequency. e.g.
In [62]: (D1 + 30)
Out[62]: <D : 31-Jan-2011>
In [63]: _62.asfreq('M') + 1
Out[63]: <M : Feb-2011>
In [64]: (_62.asfreq('M') + 1).asfreq('D','END')
Out[64]: <D : 28-Feb-2011>
In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day
Out[65]: <D : 04-Mar-2011>
As Pierre noted when converting dates from a lower frequency to a higher one
it's very useful (essential!) to be able to specify whether you want the end
or the start of the interval. It may also be useful to be able to specify an
arbitrary offset from either the start or the end of the interval so you could
do something like:
In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0)
Out[66]: <D : 01-Feb-2011>
In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1)
Out[67]: <D : 28-Feb-2011>
In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15)
Out[68]: <D : 16-Feb-2011>
I don't think it's useful to define higher 'frequencies' as arbitrary multiples
of lower 'frequencies' unless the conversion is exact otherwise it leads
to the following inconsistencies:
In [69]: days_per_month = 30
In [70]: D1 = M1.asfreq('D',relation='START')
In [71]: D2 = M2.asfreq('D','START')
In [72]: D1, D2
Out[72]: (<D : 01-Jan-2011>, <D : 01-Jan-2012>)
In [73]: D1 + days_per_month*(M2 - M1)
Out[73]: <D : 27-Dec-2011>
In [74]: D1 + days_per_month*(M2 - M1) == D2
Out[74]: False
If I want the number of days between M1 and M2 I explicitely do the conversion
myself:
In [75]: M2.asfreq('D','START') - M1.asfreq('D','START')
Out[75]: 365
thus avoiding any inconsistency:
In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2
Out[76]: True
I'm not convinced about the events concept - it seems to add complexity
for something which could be accomplished better in other ways. A [Y]//4
dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There
may well be a good reason for it however I can't see the need for it in my
own applications.
In the timeseries package, because the difference between dates represents the
number of periods between the dates they must be of the same frequency to
unambiguopusly define what a "period" means:
In [77]: M1 - D1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\dev\code\<ipython console> in <module>()
ValueError: Cannot subtract Date objects with different frequencies.
I would argue that in the spirit that we're all consenting adults
adding dates of the same frequency can be a useful thing for example
in finding the mid-point between two dates:
In [78]: M1.asfreq('S','START')
Out[78]: <S : 01-Jan-2011 00:00:00>
In [79]: M2.asfreq('S','START')
Out[79]: <S : 01-Jan-2012 00:00:00>
In [80]: ts.Date('S', (_64.value + _65.value)//2)
Out[80]: <S : 02-Jul-2011 12:00:00>
I think any errors which arose from adding or multiplying dates would be pretty
easy to spot in your code.
As Robert mentioned the economic data we use is often supplied as weekly,
monthly, quarterly or annual data. So these frequencies are critical if we're
to use the the array as a container for economic data. Such data would usually
represent either the sum or the average over that period so it's very easy to
get a consistent "linear-time" representation by interpolating down to a higher
frequency such as daily or hourly.
I really like the idea of being able to specify multiples of the base frequency
- e.g. [7D] is equivalenty to [W] not the least because it provides an easy
way to specify quarters [3M] or seasons [6M] which are important in my work.
NB: I also deal with half-hourly and quarter-hourly timeseries and I'm sure
there are many other example which are all made possible by allowing
multipliers.
One aspect of this is that the origin becomes important - i.e. does the week
[7D] start on Monday/Tuesday etc. In scikits.timeseries this is solved by
defining a different weekly frequency for each day of the week, a different
annual frequency starting at each month etc...
http://pytseries.sourceforge.net/core.constants.html
I'm thinking however that it may be possible to use the origin/zero-date/epoch
attribute to define the start of such periods - e.g. if you had a weekly [7D]
frequency and the origin was 01-Jan-1970 then each week would be defined as a
Thursday-Thursday week. To get a Monday-Monday week you could supply
05-Jan-1970 as the origin attribute.
Unfortunately business days and holidays are also very important in finance,
however I agree that this may be better suited to a calendar API. I would
suggest that leap seconds would be something which could also be handled by
this API rather than having such complex functionality baked in by default.
I'm not sure how this could be implemented in practice except for some vague
thoughts about providing hooks where users could provide functions which
converted to and from an integer representation for their particular
calendar. Creating a weekday calendar would be a good test-case for such
an API.
Apologies for the very long post! I guess it can be mostly summarised as
you've got a pretty good template for functionality in scikits.timeseries!
Pandas/statsmodels may have more sophisticated requirements though so their
input on the finance/econometric side would be useful...
-Dave
More information about the NumPy-Discussion
mailing list