[Numpy-discussion] fixing up datetime

Dave Hirschfeld dave.hirschfeld@gmail....
Wed Jun 8 05:59:48 CDT 2011


Mark Wiebe <mwwiebe <at> gmail.com> writes:
> 
> 
> It appears to me that a structured dtype with some further NumPy extensions 
> could entirely replace the 'events' metadata fairly cleanly. If the ufuncs 
> are extended to operate on structured arrays, and integers modulo n are 
> added as a new dtype, a dtype like 
> [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current
> 'M8[D]//100'.

Sounds like a cleaner API.
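
For concreteness, here is a rough sketch of what such a structured dtype
looks like in NumPy. The 'i8[mod 100]' dtype is only proposed above and does
not exist, so this uses a plain 'i8' field and applies the modulo by hand. The
names event_dtype and stamps and the sample values are made up for
illustration.

```python
import numpy as np

# Plain 'i8' stands in for the proposed 'i8[mod 100]', which does not exist.
event_dtype = np.dtype([('date', 'M8[D]'), ('event', 'i8')])

stamps = np.zeros(3, dtype=event_dtype)
stamps['date'] = np.array(['2011-06-08', '2011-06-08', '2011-06-09'],
                          dtype='M8[D]')
# Emulate the modulo-100 wraparound manually (assumption: this is the
# behaviour the proposed dtype would provide).
stamps['event'] = np.array([5, 105, 99]) % 100
```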

> 
>>
>> As Dave H. summarized, we used a basic keyword to do the same thing in 
>> scikits.timeseries, with the addition of some subfrequencies like A-SEP 
>> to represent a year starting in September, for example. It works, but it's
>> really not elegant a solution.
>>
> 
> This kind of thing definitely belongs in a layer above datetime.
> 

That's fair enough - my perspective as a timeseries user is probably a lot
higher level. My main aim is to point out some of the higher level uses so that
the numpy dtype can be made compatible with them - I'd hate to end up with
multiple different datetime representations in the different libraries and
have to continually convert at the boundaries.

That said, I think the starting point for a series at a weekly, quarterly
or annual frequency/unit/type is something which may need to be sorted out
at the lowest level...

> 
> One overall impression I have about timeseries in general is the use of the
> term "frequency" synonymously with the time unit. To me, a frequency is a 
> numerical quantity with a unit of 1/(time unit), so while it's related to 
> the time unit, naming it the same is something the specific timeseries 
> domain has chosen to do, I think the numpy datetime class shouldn't have 
> anything called "frequency" in it, and I would like to remove the current 
> usage of that terminology from the codebase.
> 

It seems that it's just a naming convention (possibly not the best), and
"frequency" can be read as a synonym for the "time unit"/resolution/dtype.

> I don't envision 'asfreq' being a datetime function, this is the kind 
> of thing that would layer on top in a specialized timeseries library. The 
> behavior of timedelta follows a more physics-like idea with regard to the 
> time unit, and I don't think something more complicated belongs at the bottom
> layer that is shared among all datetime uses.

I think that since freq <==> dtype, asfreq <==> astype. From your examples it
seems to do the same thing - i.e. if you go to a lower resolution (freq)
representation the higher resolution information is truncated - e.g. a
monthly resolution date has no information about days/hours/minutes/seconds
etc. The difference lies in converting in the other direction, low --> high
resolution: numpy always converts to the start of the interval, whereas the
timeseries Date class gives you the option of the start or the end.
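
A small sketch of that behaviour using numpy's datetime64 as it exists in
current releases (the in-development version discussed in this thread may have
differed). The end-of-interval value is computed by hand here, since astype
only gives the start. The variable names are illustrative.

```python
import numpy as np

day = np.datetime64('2011-03-15', 'D')

# High -> low resolution: the day-of-month information is dropped.
month = day.astype('M8[M]')          # 2011-03

# Low -> high resolution: numpy picks the start of the interval.
month_start = month.astype('M8[D]')  # 2011-03-01

# The end of the interval has to be computed explicitly:
# first day of the next month, minus one day.
month_end = (month + 1).astype('M8[D]') - np.timedelta64(1, 'D')  # 2011-03-31
```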

> I'm thinking of a datetime as an infinitesimal moment in time, with the 
> unit representing the precision that can be stored. Thus, '2011', 
> '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in 
> time, but with a different unit of precision. Computationally, this 
> perspective is turning out to provide a pretty rich mechanism to do 
> operations on date.

I think this is where the main point of difference is. I use the timeseries
as a container for data which arrives in a certain interval of time, e.g.
monthly temperatures are the average temperature over the interval defined by
the particular month. It's not the temperature at the instant of time that the
month began or ended, or at any other particular instant in that month.

Thus the conversion from a monthly resolution to say a daily resolution isn't 
well defined and the right thing to do is likely to be application specific.

For the average temperature example you may want to choose a value in the 
middle of the month so that you don't introduce a phase delay if you 
interpolate between the datapoints.

If, for example, you measured total rainfall in a month, you might want to
choose the last day of the month to represent the total rainfall for that
month, as that is the only date in the interval on which the cumulative
rainfall does in fact equal the monthly value.
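
To make the application-specific choice concrete, here is a minimal sketch of
a helper that maps monthly values to a chosen day within each month.
month_to_day and its 'how' argument are hypothetical names for illustration,
not part of numpy or scikits.timeseries.

```python
import numpy as np

def month_to_day(months, how='start'):
    """Map 'M8[M]' values to 'M8[D]' at the start, middle, or end of each month.

    Hypothetical helper for illustration; not part of numpy.
    """
    months = np.asarray(months, dtype='M8[M]')
    start = months.astype('M8[D]')
    # Last day of the month: first day of the next month, minus one day.
    end = (months + 1).astype('M8[D]') - np.timedelta64(1, 'D')
    if how == 'start':
        return start
    if how == 'end':
        return end
    if how == 'mid':
        # Whole days from the first to the last day of the month, halved.
        half = (end - start).astype('int64') // 2
        return start + half.astype('m8[D]')
    raise ValueError("how must be 'start', 'mid', or 'end'")

months = np.array(['2011-01', '2011-02'], dtype='M8[M]')
mid_days = month_to_day(months, 'mid')   # e.g. for average temperature
end_days = month_to_day(months, 'end')   # e.g. for total rainfall
```

With 'mid' the monthly averages sit near the centre of their interval, which
avoids the phase shift when interpolating; with 'end' the rainfall totals sit
on the last day of each month.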

It may be, as you say, that all this functionality does belong at a higher
level...

Regards,
Dave