[Numpy-discussion] fixing up datetime
Mark Wiebe
mwwiebe@gmail....
Tue Jun 7 18:16:25 CDT 2011
Hi Dave,
Thanks for all the feedback on datetime. It's very useful for
understanding the timeseries ideas, particularly with the many examples
you're sprinkling in.
One overall impression I have about timeseries in general is the use of the
term "frequency" synonymously with the time unit. To me, a frequency is a
numerical quantity with a unit of 1/(time unit), so while it's related to
the time unit, naming it the same is something the specific timeseries
domain has chosen to do. I think the numpy datetime class shouldn't have
anything called "frequency" in it, and I would like to remove the current
usage of that terminology from the codebase.
In Wes's comment, he said
> I'm hopeful that the datetime64 dtype will enable scikits.timeseries
> and pandas to consolidate much of their datetime / frequency code.
> scikits.timeseries has a ton of great stuff for generating dates with
> all the standard fixed frequencies.
implying to me that the important functionality needed in time series is the
ability to generate arrays of dates in specific ways. I suspect equating the
specification of the array of dates and the unit of precision used to store
the date isn't good for either the datetime functionality or supporting
timeseries, and I'm presently trying to understand what it is that
timeseries uses.
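To make the distinction concrete, here is a minimal sketch that keeps the two ideas separate: how the array of dates is specified (one entry per month) versus the unit used to store it (days). It assumes a released NumPy (1.7 or later) rather than the development branch discussed in this thread, and the variable names are only illustrative.

```python
import numpy as np

# Assumption: a released NumPy (1.7+) with the stable datetime64 API.
# The specification of the dates: one entry per month, Jan through Jun 2011.
months = np.arange('2011-01', '2011-07', dtype='datetime64[M]')

# The storage unit is chosen independently: the same six dates held at
# day precision (each month maps to its first day).
month_starts = months.astype('datetime64[D]')

# Last day of each month: step to the next month, convert, go back one day.
next_months = months + np.timedelta64(1, 'M')
month_ends = next_months.astype('datetime64[D]') - np.timedelta64(1, 'D')
```

The monthly spacing lives in how the array was generated, while the dtype only records the precision of each stored value.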
On Tue, Jun 7, 2011 at 7:34 AM, Dave Hirschfeld
<dave.hirschfeld@gmail.com> wrote:
> As a user of numpy/scipy in finance I thought I would put in my 2p worth as
> it's something which is of great importance in this area.
>
> I'm currently a heavy user of the scikits.timeseries package by Matt &
> Pierre
> and I'm also following the development of statsmodels and pandas should we
> require more sophisticated statistics in future. Hopefully the numpy
> datetime
> type will provide a foundation such packages can build upon...
>
> I'll use the timeseries package for reference since I'm most familiar with
> it
> and it's a very good api for my requirements. Apologies to Matt/Pierre if
> I get anything wrong - feel free to correct my misconceptions...
>
> I think some of the complexity is coming from the definition of the
> timedelta.
> In the timeseries package each date simply represents the number of periods
> since the epoch and the difference between dates is therefore just an
> integer
> with no attached metadata - its meaning is determined by the context it's
> used
> in. e.g.
>
> In [56]: M1 = ts.Date('M',"01-Jan-2011")
>
> In [57]: M2 = ts.Date('M',"01-Jan-2012")
>
> In [58]: M2 - M1
> Out[58]: 12
>
> timeseries gets on just fine without a timedelta type - a timedelta is just
> an
> integer and if you add an integer to a date it's interpreted as the number
> of
> periods of that date's frequency. From a usability point of view M1 + 1 is
> much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1).
>
I think the timedelta is important, especially with the large number of
units NumPy's datetime supports. When you're subtracting two nanosecond
datetimes and two minute datetimes in the same code, having the units there
to avoid confusion is pretty useful. Ideally, timedelta would just be a
regular integer or float with a time unit associated, but NumPy doesn't have
a physical units system integrated at present.
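As a small illustration of why the unit matters, the sketch below subtracts two nanosecond datetimes and two minute datetimes. It assumes a released NumPy (1.7 or later), and the timestamps are made up for the example.

```python
import numpy as np

# Assumption: released NumPy (1.7+); the timestamps are illustrative only.
ns_delta = (np.datetime64('2011-06-07T18:16:25.000000500')
            - np.datetime64('2011-06-07T18:16:25.000000000'))
min_delta = (np.datetime64('2011-06-07T18:16')
             - np.datetime64('2011-06-07T17:16'))

# Both differences are small integers underneath (500 and 60), but each
# carries its own unit: nanoseconds and minutes respectively.
ns_count = int(ns_delta.astype('int64'))
min_count = int(min_delta.astype('int64'))

# Adding them converts to the finer unit instead of computing 500 + 60.
total = ns_delta + min_delta
```

With bare integers the sum would be 560 of nothing in particular; with timedelta64 the result is one hour plus 500 nanoseconds.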
> Something like the dateutil relativedelta package is very convenient and
> could serve as a template for such functionality:
>
> In [59]: from dateutil.relativedelta import relativedelta
>
> In [60]: (D1 + 30).datetime
> Out[60]: datetime.datetime(2011, 1, 31, 0, 0)
>
> In [61]: (D1 + 30).datetime + relativedelta(months=1)
> Out[61]: datetime.datetime(2011, 2, 28, 0, 0)
>
> ...but you can still get the same behaviour without a timedelta by asking
> that
> the user explicitly specify what they mean by "adding one month" to a date
> of
> a different frequency. e.g.
>
> In [62]: (D1 + 30)
> Out[62]: <D : 31-Jan-2011>
>
> In [63]: _62.asfreq('M') + 1
> Out[63]: <M : Feb-2011>
>
> In [64]: (_62.asfreq('M') + 1).asfreq('D','END')
> Out[64]: <D : 28-Feb-2011>
>
> In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day
> Out[65]: <D : 04-Mar-2011>
>
I don't envision 'asfreq' being a datetime function. This is the kind of
thing that would layer on top in a specialized timeseries library. The
behavior of timedelta follows a more physics-like idea with regard to the
time unit, and I don't think something more complicated belongs at the
bottom layer that is shared among all datetime uses.
Here's a rough approximation of your calculations above:
>>> d = np.datetime64('2011-01-31')
>>> d.astype('M8[M]') + 1
numpy.datetime64('2011-02','M')
>>> (d.astype('M8[M]') + 2) - np.timedelta64(1, 'D')
numpy.datetime64('2011-02-28','D')
>>> (d.astype('M8[M]') + 1) + np.timedelta64(31, 'D')
numpy.datetime64('2011-03-04','D')
> As Pierre noted when converting dates from a lower frequency to a higher one
> it's very useful (essential!) to be able to specify whether you want the
> end
> or the start of the interval. It may also be useful to be able to specify
> an
> arbitrary offset from either the start or the end of the interval so you
> could
> do something like:
>
> In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0)
> Out[66]: <D : 01-Feb-2011>
>
> In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1)
> Out[67]: <D : 28-Feb-2011>
>
> In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15)
> Out[68]: <D : 16-Feb-2011>
>
I think this kind of functionality belongs at a higher level, but the idea
is to make it reasonable to implement it with the NumPy datetime primitives:
>>> (d.astype('M8[M]') + 1).astype('M8[D]')
numpy.datetime64('2011-02-01','D')
>>> ((d.astype('M8[M]') + 1) + 1) - np.timedelta64(1, 'D')
numpy.datetime64('2011-02-28','D')
>>> (d.astype('M8[M]') + 1) + np.timedelta64(15, 'D')
numpy.datetime64('2011-02-16','D')
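The three conversions above could be wrapped by a higher-level library along these lines. This is only a sketch: month_to_day is a hypothetical helper, not a NumPy function, and it assumes a released NumPy (1.7 or later).

```python
import numpy as np

def month_to_day(month, offset=0):
    """Hypothetical helper: convert a month to a day-precision datetime64.

    offset >= 0 counts days from the start of the month (0 is the first
    day); offset < 0 counts back from the end (-1 is the last day).
    """
    month = np.datetime64(month).astype('datetime64[M]')
    if offset >= 0:
        anchor = month.astype('datetime64[D]')
    else:
        # First day of the following month; negative offsets step back.
        anchor = (month + np.timedelta64(1, 'M')).astype('datetime64[D]')
    return anchor + np.timedelta64(offset, 'D')

d = np.datetime64('2011-01-31')
feb = d.astype('datetime64[M]') + np.timedelta64(1, 'M')

first_day = month_to_day(feb, 0)
last_day = month_to_day(feb, -1)
sixteenth = month_to_day(feb, 15)
```

The offset convention mirrors the one proposed in the quoted text, with the start or end of the interval chosen by the sign.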
> I don't think it's useful to define higher 'frequencies' as arbitrary
> multiples
> of lower 'frequencies' unless the conversion is exact otherwise it leads
> to the following inconsistencies:
>
> In [69]: days_per_month = 30
>
> In [70]: D1 = M1.asfreq('D',relation='START')
>
> In [71]: D2 = M2.asfreq('D','START')
>
> In [72]: D1, D2
> Out[72]: (<D : 01-Jan-2011>, <D : 01-Jan-2012>)
>
> In [73]: D1 + days_per_month*(M2 - M1)
> Out[73]: <D : 27-Dec-2011>
>
> In [74]: D1 + days_per_month*(M2 - M1) == D2
> Out[74]: False
>
Here's what I get:
>>> d1, d2 = np.datetime64('2011-01-01'), np.datetime64('2012-01-01')
>>> m1, m2 = d1.astype('M8[M]'), d2.astype('M8[M]')
>>> d1 + 30 * (m2 - m1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Cannot get a common metadata divisor for types
dtype('datetime64[D]') and dtype('timedelta64[M]') because they have
incompatible nonlinear base time units
>>> d1 + 30 * (m2 - m1).astype('i8')
numpy.datetime64('2011-12-27','D')
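For comparison, a sketch of the same calculation done both ways, assuming a released NumPy (1.7 or later), where mixing day datetimes with month timedeltas is expected to raise a TypeError much like the one above. The names are illustrative.

```python
import numpy as np

d1 = np.datetime64('2011-01-01')
m1, m2 = np.datetime64('2011-01'), np.datetime64('2012-01')

months_apart = m2 - m1  # timedelta64[M], a count of 12 months

# Months have no fixed length in days, so the mixed arithmetic is refused.
try:
    d1 + months_apart
    mixed_units_allowed = True
except TypeError:
    mixed_units_allowed = False

# The 30-days-per-month shortcut falls five days short of a full year.
approx = d1 + np.timedelta64(30 * int(months_apart.astype('int64')), 'D')

# Converting each month to a day first gives the true span.
days_apart = m2.astype('datetime64[D]') - m1.astype('datetime64[D]')
exact = d1 + days_apart
```

The error forces the choice between the approximate and exact conversions to be made explicitly in the calling code.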
> If I want the number of days between M1 and M2 I explicitly do the
> conversion
> myself:
>
> In [75]: M2.asfreq('D','START') - M1.asfreq('D','START')
> Out[75]: 365
>
> thus avoiding any inconsistency:
>
> In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2
> Out[76]: True
>
> I'm not convinced about the events concept - it seems to add complexity
> for something which could be accomplished better in other ways. A [Y]//4
> dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There
> may well be a good reason for it however I can't see the need for it in my
> own applications.
>
> In the timeseries package, because the difference between dates represents
> the
> number of periods between the dates they must be of the same frequency to
> unambiguously define what a "period" means:
>
> In [77]: M1 - D1
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
>
> C:\dev\code\<ipython console> in <module>()
>
> ValueError: Cannot subtract Date objects with different frequencies.
>
> I would argue that in the spirit that we're all consenting adults
> adding dates of the same frequency can be a useful thing for example
> in finding the mid-point between two dates:
>
> In [78]: M1.asfreq('S','START')
> Out[78]: <S : 01-Jan-2011 00:00:00>
>
> In [79]: M2.asfreq('S','START')
> Out[79]: <S : 01-Jan-2012 00:00:00>
>
> In [80]: ts.Date('S', (_78.value + _79.value)//2)
> Out[80]: <S : 02-Jul-2011 12:00:00>
>
Adding dates definitely doesn't work, because datetimes have no zero, but I
would express it like this:
>>> s1, s2 = m1.astype('M8[s]'), m2.astype('M8[s]')
>>> s1 + (s2 - s1)/2
numpy.datetime64('2011-07-02T07:00:00-0500','s')
>>> np.datetime_as_string(s1 + (s2 - s1)/2)
'2011-07-02T12:00:00Z'
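The same midpoint written as a self-contained sketch, assuming a released NumPy. One difference from the session above: since NumPy 1.11, datetime64 is timezone-naive, so no local offset appears when the value is printed.

```python
import numpy as np

s1 = np.datetime64('2011-01-01T00:00:00')
s2 = np.datetime64('2012-01-01T00:00:00')

# Datetimes cannot be added, but half of their difference (a timedelta)
# can be added to the earlier one.
midpoint = s1 + (s2 - s1) // 2

midpoint_text = np.datetime_as_string(midpoint)
```

Because the arithmetic goes through a timedelta, the result does not depend on where the epoch is.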
Printing times in the local timezone by default makes that first printout a
bit weird, but I really like having that default so this looks good:
>>> np.datetime64('now')
numpy.datetime64('2011-06-07T18:15:46-0500','s')
> I think any errors which arose from adding or multiplying dates would be
> pretty
> easy to spot in your code.
>
> As Robert mentioned the economic data we use is often supplied as weekly,
> monthly, quarterly or annual data. So these frequencies are critical if
> we're
> to use the array as a container for economic data. Such data would
> usually
> represent either the sum or the average over that period so it's very easy
> to
> get a consistent "linear-time" representation by interpolating down to a
> higher
> frequency such as daily or hourly.
>
> I really like the idea of being able to specify multiples of the base
> frequency
> - e.g. [7D] is equivalent to [W] not least because it provides an easy
> way to specify quarters [3M] or seasons [6M] which are important in my
> work.
> NB: I also deal with half-hourly and quarter-hourly timeseries and I'm sure
> there are many other examples which are all made possible by allowing
> multipliers.
>
> One aspect of this is that the origin becomes important - i.e. does the
> week
> [7D] start on Monday/Tuesday etc. In scikits.timeseries this is solved by
> defining a different weekly frequency for each day of the week, a different
> annual frequency starting at each month etc...
>
> http://pytseries.sourceforge.net/core.constants.html
> I'm thinking however that it may be possible to use the
> origin/zero-date/epoch
> attribute to define the start of such periods - e.g. if you had a weekly
> [7D]
> frequency and the origin was 01-Jan-1970 then each week would be defined as
> a
> Thursday-Thursday week. To get a Monday-Monday week you could supply
> 05-Jan-1970 as the origin attribute.
>
This is one of the things where I think mixing the datetime storage
precision with timeseries frequency seems counterproductive. Having
different origins for datetime64 starting on different weekdays near
1970-01-01 doesn't seem like the right way to tackle the problem to me. I
see other valid reasons for reintroducing the origin metadata, but this one
I don't really like.
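As a rough sketch of how a layer above datetime64 could provide Monday-based weeks without touching the origin, the hypothetical helper below shifts the date, truncates to the epoch-aligned week, and shifts back. It assumes a released NumPy (1.7 or later) and dates after 1970.

```python
import numpy as np

def week_start(day, first_weekday_offset=4):
    """Hypothetical helper: start of the 7-day week containing `day`.

    first_weekday_offset is the number of days from 1970-01-01 (a
    Thursday) to the desired first day of the week; 4 gives Monday
    (1970-01-05), 0 gives the default Thursday-aligned weeks.
    """
    shift = np.timedelta64(first_weekday_offset, 'D')
    day = np.datetime64(day).astype('datetime64[D]')
    # Truncating to weeks snaps to the Thursday-aligned epoch grid, so
    # shift before truncating and shift back afterwards.
    truncated = (day - shift).astype('datetime64[W]')
    return truncated.astype('datetime64[D]') + shift

tuesday = np.datetime64('2011-06-07')
monday_week = week_start(tuesday)        # Monday-based week
thursday_week = week_start(tuesday, 0)   # epoch-aligned week
```

The storage dtype stays a plain day count throughout, and the choice of weekday is an argument of the higher-level function rather than metadata on the dtype.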
>
> Unfortunately business days and holidays are also very important in
> finance,
> however I agree that this may be better suited to a calendar API. I would
> suggest that leap seconds would be something which could also be handled by
> this API rather than having such complex functionality baked in by default.
>
I've got a business day API in development, and will post it for feedback
soon.
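For readers on a released NumPy: versions from 1.7 onward include business day functions (np.busday_offset, np.busday_count, np.is_busday). The sketch below shows their basic use. Whether they match the API under development at the time of this message is not something the thread states, and the holiday date is invented for illustration.

```python
import numpy as np

# Assumption: NumPy 1.7+; the default week mask is Monday through Friday.
friday = np.datetime64('2011-06-03')

# One business day after a Friday is the following Monday.
next_busday = np.busday_offset(friday, 1)

# Treating that Monday as a holiday (made-up date) pushes it to Tuesday.
next_with_holiday = np.busday_offset(friday, 1, holidays=['2011-06-06'])

# Business days in the half-open range [2011-06-01, 2011-06-08).
n_busdays = np.busday_count('2011-06-01', '2011-06-08')

saturday_is_busday = np.is_busday(np.datetime64('2011-06-04'))
```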
> I'm not sure how this could be implemented in practice except for some vague
> thoughts about providing hooks where users could provide functions which
> converted to and from an integer representation for their particular
> calendar. Creating a weekday calendar would be a good test-case for such
> an API.
>
> Apologies for the very long post! I guess it can be mostly summarised as
> you've got a pretty good template for functionality in scikits.timeseries!
> Pandas/statsmodels may have more sophisticated requirements though so their
> input on the finance/econometric side would be useful...
>
Thanks again for the feedback!
-Mark
>
> -Dave