[Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy

Francesc Alted falted@pytables....
Fri Jul 11 08:59:03 CDT 2008


Hi,

We are planning to implement some date/time types for NumPy, and I'm 
sending a document that explains our approach.  We would love to hear 
the feedback of the NumPy community in order to cover their needs as 
much as possible.

Cheers,

Francesc


===========================================================
 A proposal for implementing some date/time types in NumPy
===========================================================

:Author: Francesc Alted i Abad
:Contact: faltet@pytables.com
:Author: Ivan Vilata i Balaguer
:Contact: ivan@selidor.net


Executive summary
=================

A date/time mark is something very handy to have in many fields where
one has to deal with data sets.  While Python has several modules that
define a date/time type, like ``mx.DateTime`` or the integrated
``datetime`` [1]_, NumPy lacks them.

In this document, we propose the addition of a series of date/time
types to fill this gap.  The requirements for the proposed types are
twofold: 1) they have to be fast to operate with and 2) they have to
be as compatible as possible with the existing ``datetime`` module that
comes with Python.


Types proposed
==============

To start with, it is virtually impossible to come up with a single
date/time type that fills the needs of every use case.  So, after
pondering different possibilities, we have settled on three different
types that cover different needs, namely ``datetime64``,
``timestamp64`` and ``timefloat64``.  These names are preliminary and
can be changed; they are mostly useful for the sake of the discussion.

Here is a detailed description of the different types:

``datetime64``

  - Implemented internally as an ``int64`` type.

  - Expressed in microseconds since POSIX epoch (January 1, 1970).

  - Resolution: microseconds.

  - Time span: about 292277 years in each direction since the POSIX
    epoch.

  Observations:

    This will be compatible with the Python ``datetime`` module not only
    in terms of precision (it also has a resolution of microseconds)
    and time span (its range is year 1 to year 9999), but also in that
    we will provide getters and setters for it.


``timestamp64``

  - Implemented internally as an ``int64`` type.

  - Expressed in nanoseconds since POSIX epoch (January 1, 1970).

  - Resolution: nanoseconds.

  - Time span: about 292 years in each direction since the POSIX epoch.

  Observations:

    This will not be fully compatible with the Python ``datetime``
    module in terms of either precision or time span.  However,
    getters and setters will be provided for it (losing precision or
    overflowing as needed).
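
The time spans of the two integer types follow from the range of a
signed 64-bit integer.  A minimal sketch of that arithmetic, assuming a
mean Gregorian year of 365.2425 days (the helper name ``span_in_years``
is hypothetical, not part of the proposal)::

  # Largest value a signed 64-bit integer can hold.
  INT64_MAX = 2**63 - 1

  # Assumption: mean Gregorian year of 365.2425 days.
  SECONDS_PER_YEAR = 365.2425 * 86400


  def span_in_years(ticks_per_second):
      """Years reachable in one direction from the epoch."""
      return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


  us_span = span_in_years(10**6)   # microsecond ticks (datetime64)
  ns_span = span_in_years(10**9)   # nanosecond ticks (timestamp64)

  print("microsecond ticks: about %.0f years" % us_span)
  print("nanosecond ticks:  about %.0f years" % ns_span)

The figures are approximate, since the result depends slightly on the
year length assumed.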

``timefloat64``

  - Implemented internally as a ``float64`` type.

  - Expressed in microseconds since POSIX epoch.

  - Resolution: 1 microsecond (for +-32 years from epoch) or 14 digits
    (for distant years from epoch).  So the precision is *variable*.

  - Time span: about 5.7e+294 years in each direction since the POSIX
    epoch (the ``float64`` maximum is about 1.8e+308 microseconds).

  Observations:

    In general, this will not be fully compatible with the Python
    ``datetime`` module in terms of either precision or time span.
    However, getters and setters will be provided for it (losing
    precision or overflowing as needed).
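
The variable precision of a floating-point time can be made concrete by
asking how far apart neighbouring ``float64`` values are at a given
distance from the epoch.  A rough sketch, assuming the value stores
microseconds since the epoch and a 365.2425-day year (the helper
``resolution_us`` is hypothetical; ``math.ulp`` needs Python 3.9 or
later)::

  import math

  # Assumption: mean Gregorian year of 365.2425 days, in microseconds.
  US_PER_YEAR = 365.2425 * 86400 * 1e6


  def resolution_us(years_from_epoch):
      """Gap to the next representable float64, in microseconds."""
      return math.ulp(years_from_epoch * US_PER_YEAR)


  near = resolution_us(32)       # a few decades from the epoch
  far = resolution_us(10000)     # a distant date

  print("32 years from epoch:    %g microseconds" % near)
  print("10000 years from epoch: %g microseconds" % far)

Close to the epoch the gap stays below one microsecond; far from it the
gap grows to many microseconds.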


Example of use
==============

Here is an example of the use of one of the types described above
(``datetime64``)::

  In [10]: t = numpy.zeros(5, dtype="datetime64")

  In [11]: t[0] = datetime.datetime.now()  # setter in action

  In [12]: t[0]
  Out[12]: 1215786423384724   # microseconds since the epoch, as an int64 (scalar)

  In [13]: t
  Out[13]: array([1215786423384724, 0, 0, 0, 0], dtype=datetime64)

  In [14]: t[0].item()     # getter in action
  Out[14]: datetime.datetime(2008, 7, 11, 14, 27, 3, 384724)
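
What the setter and getter would have to do for ``datetime64`` can be
emulated in plain Python.  This is only a sketch: the helper names
``datetime_to_us`` and ``us_to_datetime`` are hypothetical, and it
assumes that naive ``datetime`` objects are interpreted as UTC::

  import datetime

  # POSIX epoch as a naive datetime (assumed to mean UTC).
  EPOCH = datetime.datetime(1970, 1, 1)


  def datetime_to_us(dt):
      """Setter logic: naive datetime -> integer microseconds since epoch."""
      delta = dt - EPOCH
      # Integer arithmetic only, so no precision is lost.
      return (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds


  def us_to_datetime(us):
      """Getter logic: integer microseconds since epoch -> naive datetime."""
      return EPOCH + datetime.timedelta(microseconds=us)


  stamp = datetime.datetime(2008, 7, 11, 14, 27, 3, 384724)
  us = datetime_to_us(stamp)
  roundtrip = us_to_datetime(us)

  print(us)
  print(roundtrip)

Because both ``datetime`` and the proposed type have microsecond
resolution, the round trip is exact.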


Final considerations
====================


About the ``mx.DateTime`` module
--------------------------------

In this document, the emphasis has been put on comparing the
compatibility of future NumPy date/time types against the ``datetime``
module that comes with Python.  Should we consider compatibility with
``mx.DateTime`` [2]_ as well?  Are there many people using
``mx.DateTime`` out there?  If so, what are its advantages over the
``datetime`` module?


A final note on time scales
---------------------------

[Only for people with high-precision time requirements or just for those
who love the feel of their brain exploding]

The POSIX time scale [3]_ (or UTC, on which POSIX is based) is tied to
the time that the Earth takes to rotate about its axis.  However, after
the adoption of more precise time standards (read: atomic clocks), it
became clear that the Earth's rotation is quite irregular by comparison.
As a result, the UTC standard adds *leap seconds* from time to time
(at a rate of 1 second per year, approximately) in order to compensate
for these differences.

Because of this, when computing time deltas (using the UTC standard)
between two instants that differ by more than a year, it is quite
likely that an error of one or several seconds (depending on the time
span) will be introduced.  While this is generally harmless for common
use cases, there are situations where this can bite people quite a lot.

For example, suppose that the IERS (International Earth Rotation and
Reference Systems Service) had decided to add a leap second at the end
of last June 30th (23:59:60 UTC) and that you were running an experiment
precisely at that time (this is not so rare, and has probably happened
already to someone, somewhere).  Then, in the analysis phase of your
experiment, you could be surprised to find that the time deltas computed
across this leap second are actually 1 second shorter than the real
elapsed time (and then, why the heck do we want to use types that
supposedly support micro- or nanosecond precision?).

Because of this, we were initially tempted to use the TAI (Temps
Atomique International) [4]_ standard because it is strictly
*continuous* (unlike UTC or POSIX), thus avoiding the sort of
problems described above.  However, the omnipresence of UTC clocks in
the computing world would force us to be continuously converting UTC
timestamps to TAI ones and vice versa.

Unfortunately, this is not easy to do because there is no simple
mathematical relationship between UTC and TAI.  Instead, you have to use
a table to check when the IERS added the leap seconds, and take them
into account.  The problem is that, as the rotation of the Earth is
irregular, one cannot determine far ahead of time when the leap seconds
will be added.  This can lead to problems with code that performs the
TAI <-> UTC conversion, in the sense that the internal conversion table
has to be continuously updated if we want to avoid precision problems.
And this is too much added complication to be worth the effort, in our
opinion.
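
To illustrate the table-based conversion, here is a small sketch.  The
two entries are only an excerpt of the published TAI - UTC offsets (a
real implementation would need the complete table and would have to keep
it up to date), and the names ``tai_minus_utc`` and ``utc_to_tai`` are
hypothetical::

  import bisect
  import datetime

  # Excerpt only: (UTC instant from which the offset applies,
  # TAI - UTC in seconds).
  LEAP_TABLE = [
      (datetime.datetime(1999, 1, 1), 32),
      (datetime.datetime(2006, 1, 1), 33),
  ]
  _STARTS = [start for start, _ in LEAP_TABLE]


  def tai_minus_utc(utc):
      """Look up the TAI - UTC offset that applies at a UTC instant."""
      idx = bisect.bisect_right(_STARTS, utc) - 1
      if idx < 0:
          raise ValueError("instant precedes the excerpted table")
      return LEAP_TABLE[idx][1]


  def utc_to_tai(utc):
      """Shift a naive UTC datetime onto the TAI scale."""
      return utc + datetime.timedelta(seconds=tai_minus_utc(utc))


  # Two instants on either side of the leap second at the end of 2005.
  before = datetime.datetime(2005, 12, 31, 23, 59, 59)
  after = datetime.datetime(2006, 1, 1, 0, 0, 1)

  utc_delta = (after - before).total_seconds()
  tai_delta = (utc_to_tai(after) - utc_to_tai(before)).total_seconds()

  print("UTC delta:", utc_delta)
  print("TAI delta:", tai_delta)

The delta computed on the UTC labels is one second shorter than the one
computed on the TAI scale, which is the kind of error discussed above.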

These are the difficulties that drove us to prefer the POSIX time scale
over TAI for this implementation.  However, more input on this issue is
very welcome.


.. [1] http://docs.python.org/lib/module-datetime.html
.. [2] http://www.egenix.com/products/python/mxBase/mxDateTime
.. [3] http://en.wikipedia.org/wiki/Unix_time
.. [4] http://en.wikipedia.org/wiki/International_Atomic_Time


