[Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy

Francesc Alted falted@pytables....
Fri Jul 11 11:47:46 CDT 2008


On Friday 11 July 2008, Pierre GM wrote:
> Francesc,
>
> > We are planning to implement some date/time types for NumPy, and
> > I'm sending a document that explains our approach.  We would love
> > to hear the feedback of the NumPy community in order to cover their
> > needs as much as possible.
>
> That sounds like an excellent idea. Matt Knox and I tried something
> similar with the scikits.timeseries module we've been developing over
> the last 18 months (scipy.org/scipy/scikits/wiki/TimeSeries).
>
> Our approach for dealing with dates was to translate them into
> integers through a particular class (Date). The trick was to change
> the reference depending on the problem at hand: when dealing with
> annual series, the Date object is simply the year (since CE); when
> dealing with months, the number of months since 0CE; when dealing
> with hours, the number of hours since 1970... All the nitty-gritty
> parts were coded by Matt in C. And yes, we have routines to transform
> a datetime object into a Date object and back. We also used a parser
> from mxDate when dealing with dates in string formats.

That's very interesting.  We will have a look at your implementation and
see if we can reuse code/ideas.  I suppose this code is in your TimeSeries
module, right?

> We thought about creating specific dtypes to simplify the interface,
> but had problems finding proper documentation for that and were
> anyway more interested in having something running. The approach
> works well for us, but one of the biggest limitations we have is that
> we can't handle series with a frequency less than 1s (as we need
> integers), and your idea of a float for higher frequencies is great.

You can obtain at least microsecond precision with any of the
proposed int64-based types.  For the float64-based type, you can get that
precision too if you are dealing with dates in the [1970, 2038] range.

> About the types you propose, isn't there a typo somewhere in the
> resolution ? What's the difference between your datetime64 and
> timestamp64 ?

It's a typo indeed.  I'm attaching the new version here (with some
additional minor fixes, mainly in the formatting).

>
> > In this document, the emphasis has been put in comparing the
> > compatibility of future NumPy date/time types against the
> > ``datetime`` module that comes with Python.  Should we consider the
> > compatibility with mx.DateTime as well?  Are there many people
> > using ``mx.DateTime`` [2]_ out there?  If so, what are its
> > advantages over the ``datetime`` module?
>
> mx.DateTime has a great parser for strings, but its use adds yet some
> other requirements (you need to have the module installed and it
> doesn't come by default with python, there's some licensing
> issues...), so I wouldn't focus on that for now, if I were you.

Interesting...

>
> > A final note on time scales
> > ---------------------------
>
> Wow, indeed. In environmental sciences (my side) and in finances
> (Matt's), we very rarely have a need for that precision,
> thankfully...

I was surprised about this too when Ivan brought it to my attention :)

Thanks for the excellent feedback!

-- 
Francesc Alted

===========================================================
 A proposal for implementing some date/time types in NumPy
===========================================================

:Author: Francesc Alted i Abad
:Contact: faltet@pytables.com
:Author: Ivan Vilata i Balaguer
:Contact: ivan@selidor.net


Executive summary
=================

A date/time mark is something very handy to have in many fields where
one has to deal with data sets.  While Python has several modules that
define a date/time type, like ``mx.DateTime`` or the built-in
``datetime`` [1]_, NumPy lacks one.

In this document, we propose adding a series of date/time types to fill
this gap.  The requirements for the proposed types are twofold: 1) they
have to be fast to operate with and 2) they have to be as compatible as
possible with the existing ``datetime`` module that comes with Python.


Types proposed
==============

To start with, it is virtually impossible to come up with a single
date/time type that fills the needs of every use case.  So, after
pondering different possibilities, we have settled on three different
types that cover different needs, namely ``datetime64``,
``timestamp64`` and ``timefloat64``.  These names are preliminary and
can be changed; they are mostly useful for the sake of the discussion.

Here is a detailed description of the different types:

* ``datetime64``

  - Implemented internally as an ``int64`` type.

  - Expressed in microseconds since POSIX epoch (January 1, 1970).

  - Resolution: microseconds.

  - Time span: about 292,000 years in each direction since the POSIX
    epoch.

  Observations::

    This will be compatible with the Python ``datetime`` module not only
    in terms of precision (it also has a resolution of microseconds)
    and time span (its range is year 1 to year 9999), but also in that
    we will provide getters and setters for it.


* ``timestamp64``

  - Implemented internally as an ``int64`` type.

  - Expressed in nanoseconds since POSIX epoch (January 1, 1970).

  - Resolution: nanoseconds.

  - Time span: about 292 years in each direction since the POSIX epoch.

  Observations::

    This will not be fully compatible with the Python ``datetime``
    module in terms of either precision or time span.  However,
    getters and setters will be provided for it (losing precision or
    overflowing as needed).

* ``timefloat64``

  - Implemented internally as a ``float64`` type.

  - Expressed in microseconds since POSIX epoch (January 1, 1970).

  - Resolution: 1 microsecond while the value stays below 2**53
    microseconds (roughly +-285 years from epoch), or about 15
    significant digits (for more distant years).  So the precision is
    *variable*.

  - Time span: about 5.7e+294 years in each direction since the POSIX
    epoch (the largest float64 is about 1.8e+308 microseconds).

  Observations::

    In general, this will not be fully compatible with the Python
    ``datetime`` module in terms of either precision or time span.
    However, getters and setters will be provided for it (losing
    precision or overflowing as needed).
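
The time spans and resolutions listed for the three types follow from
the width of the underlying storage.  As a rough check, the sketch below
recomputes them in plain Python.  It assumes a Julian year of 365.25
days, and the helper names ``int64_span_years`` and
``float64_exact_span_years`` are hypothetical, chosen only for this
illustration; they are not part of the proposal.

```python
import sys

# Assumption for this estimate: a Julian year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 86400


def int64_span_years(ticks_per_second):
    """Years reachable on each side of the epoch with a signed 64-bit counter."""
    return 2**63 / ticks_per_second / SECONDS_PER_YEAR


def float64_exact_span_years(ticks_per_second):
    """Years around the epoch in which a float64 tick counter stays exact.

    A float64 has a 53-bit significand, so every integer tick count below
    2**53 is represented without rounding.
    """
    return 2**sys.float_info.mant_dig / ticks_per_second / SECONDS_PER_YEAR


us_span = int64_span_years(10**6)                # int64 counting microseconds
ns_span = int64_span_years(10**9)                # int64 counting nanoseconds
float_us_span = float64_exact_span_years(10**6)  # float64 counting microseconds

print(round(us_span), round(ns_span, 1), round(float_us_span, 1))

# Past 2**53 ticks, adding a single tick no longer changes the stored value.
limit = float(2**53)
print(limit + 1.0 == limit)
```

The last line shows why the precision of a float-based type is variable:
once the counter exceeds the exact range, neighbouring microseconds
collapse onto the same stored value.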


Example of use
==============

Here is an example of use of one of the types described above
(``datetime64``)::

  In [10]: t = numpy.zeros(5, dtype="datetime64")

  In [11]: t[0] = datetime.datetime.now()  # setter in action

  In [12]: t[0]
  Out[12]: 1215786423384724   # microseconds since epoch, as an int64 (scalar)

  In [13]: t
  Out[13]: array([1215786423384724, 0, 0, 0, 0], dtype=datetime64)

  In [14]: t[0].item()     # getter in action
  Out[14]: datetime.datetime(2008, 7, 11, 14, 27, 3, 384724)
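
The setter and getter in this session amount to converting between a
``datetime`` object and a count of microseconds since the POSIX epoch.
Below is a minimal pure-Python sketch of that conversion, using only the
standard library.  The helper names ``to_microseconds`` and
``from_microseconds`` are hypothetical and not part of the proposal, and
the naive ``datetime`` is assumed to be expressed in UTC.

```python
from datetime import datetime, timedelta

# POSIX epoch as a naive datetime (assumed to be UTC for this sketch).
EPOCH = datetime(1970, 1, 1)


def to_microseconds(dt):
    """Convert a naive datetime to integer microseconds since the epoch."""
    delta = dt - EPOCH
    # Use the integer fields of timedelta to avoid floating-point rounding.
    return (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds


def from_microseconds(ticks):
    """Convert integer microseconds since the epoch back to a datetime."""
    return EPOCH + timedelta(microseconds=ticks)


stamp = datetime(2008, 7, 11, 14, 27, 3, 384724)
ticks = to_microseconds(stamp)      # the "setter" direction
restored = from_microseconds(ticks)  # the "getter" direction
print(ticks)
print(restored == stamp)
```

Working with the integer fields of ``timedelta`` keeps the round trip
exact, which is the property an ``int64``-based type relies on.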


Final considerations
====================


About the ``mx.DateTime`` module
--------------------------------

In this document, the emphasis has been put on comparing the
compatibility of future NumPy date/time types against the ``datetime``
module that comes with Python.  Should we consider compatibility
with ``mx.DateTime`` as well?  Are there many people using
``mx.DateTime`` [2]_ out there?  If so, what are its advantages over
the ``datetime`` module?


A final note on time scales
---------------------------

[Only for people with high-precision time requirements or just for those
who love the feel of their brain exploding]

The POSIX time scale [3]_ (or UTC, on which POSIX is based) is tied to
the rotation of the Earth.  However, after the adoption of more precise
time standards (read: atomic clocks), it became clear that the Earth's
rotation is pretty irregular compared with them.  As a result, the UTC
standard adds *leap seconds* from time to time (somewhat less than
1 second per year, on average) in order to compensate for these
differences.

Because of this, when computing time deltas (using the UTC standard)
between two instants that differ by more than a year, it is likely
that an error of one or several seconds (depending on the time span)
will be introduced.  While this is generally harmless for common use
cases, there are situations where it can bite people quite badly.

For example, suppose that the IERS (International Earth Rotation and
Reference Systems Service) had decided to add a leap second at the end
of last June 30th (UTC) and that you were running an experiment
precisely at that time (this is not so rare, and has probably already
happened to someone, somewhere).  Then, in the analysis phase of your
experiment, you could be surprised to find that the time deltas computed
across this leap second are actually 1 second shorter than the time
that really elapsed (and then, why the heck would we want to use types
that supposedly support micro- or nanoseconds of precision?).

Because of this, we were initially tempted to use the TAI (Temps
Atomique International) [4]_ standard because it is strictly
*continuous* (unlike UTC or POSIX), thus avoiding the sort of
problems described above.  However, the omnipresence of UTC clocks in
the computing world would force us to be continuously converting UTC
timestamps to TAI ones and vice versa.

Unfortunately, this is not easy to do because there is no simple
mathematical relationship between UTC and TAI.  Instead, you have to use
a table to check when the IERS added the leap seconds, and take
them into account.  The problem is that, as the rotation of the Earth
is irregular, one cannot determine ahead of time when the leap seconds
will be added.  This can lead to problems with code that performs the
TAI <-> UTC conversion, in the sense that the internal conversion table
has to be continuously updated if we don't want to lose precision.  And
this is too much added complication to be worth the effort, in our
opinion.
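
To make the table-based conversion concrete, here is a minimal sketch of
how a time delta could be corrected for leap seconds.  It is only an
illustration, not part of the proposal: the names ``LEAP_TABLE``,
``tai_minus_utc``, ``posix_delta`` and ``elapsed_seconds`` are
hypothetical, and the table is deliberately incomplete, holding just two
TAI - UTC offsets (32 s from 1999 and 33 s from 2006) instead of the
full IERS list.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical, deliberately incomplete table: (UTC instant from which the
# offset applies, TAI - UTC in seconds).  A real table must follow the IERS
# announcements and be updated whenever a new leap second is scheduled.
LEAP_TABLE = [
    (datetime(1999, 1, 1), 32),
    (datetime(2006, 1, 1), 33),
]
_STARTS = [start for start, _ in LEAP_TABLE]


def tai_minus_utc(utc):
    """Look up TAI - UTC (in seconds) for a naive UTC datetime."""
    index = bisect_right(_STARTS, utc) - 1
    if index < 0:
        raise ValueError("instant predates the table")
    return LEAP_TABLE[index][1]


def posix_delta(start, end):
    """Seconds between two UTC instants as POSIX counts them (no leap seconds)."""
    return (end - start).total_seconds()


def elapsed_seconds(start, end):
    """Seconds that really elapsed, i.e. the delta measured on the TAI scale."""
    return posix_delta(start, end) + tai_minus_utc(end) - tai_minus_utc(start)


# A leap second was inserted at the end of 31 December 2005.
before = datetime(2005, 12, 31, 23, 59, 59)
after = datetime(2006, 1, 1, 0, 0, 0)
print(posix_delta(before, after))      # delta as POSIX sees it
print(elapsed_seconds(before, after))  # delta including the leap second
```

The lookup itself is simple; the maintenance burden comes from keeping
the table current, since entries for future leap seconds cannot be
known in advance.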

These are the difficulties that drove us to prefer the POSIX time scale
over TAI for this implementation.  However, more input on this issue is
very welcome.


.. [1] http://docs.python.org/lib/module-datetime.html
.. [2] http://www.egenix.com/products/python/mxBase/mxDateTime
.. [3] http://en.wikipedia.org/wiki/Unix_time
.. [4] http://en.wikipedia.org/wiki/International_Atomic_Time

