[Numpy-discussion] Question on timeseries, for financial application
Wes McKinney
wesmckinn@gmail....
Sun Dec 13 13:59:15 CST 2009
On Sun, Dec 13, 2009 at 9:27 AM, Robert Ferrell <ferrell@diablotech.com> wrote:
>
> On Dec 13, 2009, at 7:07 AM, josef.pktd@gmail.com wrote:
>
>> On Sun, Dec 13, 2009 at 3:31 AM, Pierre GM <pgmdevlist@gmail.com>
>> wrote:
>>> On Dec 13, 2009, at 12:11 AM, Robert Ferrell wrote:
>>>> Have you considered creating a TimeSeries for each data series, and
>>>> then putting them all together in a dict, keyed by symbol?
>>>
>>> That's an idea
>>
>> As far as I understand, that's what pandas.DataFrame does.
>> pandas.DataMatrix uses a 2d array to store the data.
>>
>>>
>>>> One disadvantage of one big monster numpy array for all the series
>>>> is
>>>> that not all series may have a full set of 1800 data points. So the
>>>> array isn't really nicely rectangular.
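One way to square up unequal-length series is to pad the short ones. This is a sketch with invented symbols and a made-up helper (not anything from scikits.timeseries -- Pierre's adjust_endpoints below is the built-in route):

```python
import numpy as np

def pad_to_rect(series_dict, length, fill=np.nan):
    """Stack unequal-length 1-D arrays into one (n_series, length)
    array, padding the front of shorter series with `fill`."""
    out = np.full((len(series_dict), length), fill)
    for i, (sym, values) in enumerate(sorted(series_dict.items())):
        out[i, -len(values):] = values  # align on the latest observations
    return out

data = {"AAA": np.arange(5.0), "BBB": np.arange(3.0)}
rect = pad_to_rect(data, length=5)
# rect[1] starts with two NaNs, then 0., 1., 2.
```

The NaNs then mark missing observations, which nanmean and friends (or a masked array) can skip.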
>>>
>>> Bah, there's adjust_endpoints to take care of that.
>>>
>>>>
>>>> Not sure exactly what kind of analysis you want to do, but
>>>> grabbing a
>>>> series from a dict is quite fast.
>>>
>>> Thomas, as Robert F. pointed out, everything depends on the kind of
>>> analysis you want. If you want to normalize your series, having all
>>> of them in a big array is best (a plain array, not structured, so
>>> that you can apply .mean and .std directly without having to loop
>>> over the series). If you need to apply the same function over all
>>> the series, here again having a big ndarray is easiest. Give us an
>>> example of what you wanna do.
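To make the normalization point concrete, here is a sketch with random data at the OP's stated size (1800 observations, 1000 series; the seed and shapes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.normal(100.0, 10.0, size=(1800, 1000))  # 1800 obs x 1000 series

# z-score every series (column) at once -- no Python loop over series
normalized = (prices - prices.mean(axis=0)) / prices.std(axis=0)
```

Each column of `normalized` now has mean ~0 and standard deviation ~1, and the same axis-based pattern works for any per-series reduction.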
>>
>> Or a structured array with homogeneous type that allows fast creation
>> of views for data analysis.
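Josef's structured-array point in code -- the field names here are hypothetical, and modern numpy spells the view via `structured_to_unstructured`:

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# hypothetical OHLC-style record; every field shares one dtype (f8)
dt = np.dtype([("open", "f8"), ("high", "f8"),
               ("low", "f8"), ("close", "f8")])
bars = np.zeros(1800, dtype=dt)
bars["close"] = 100.0

# with homogeneous fields this comes back as a plain (1800, 4)
# float array, ready for .mean()/.std()-style analysis
flat = structured_to_unstructured(bars)
```

You keep named-field access (`bars["close"]`) for bookkeeping and get the flat numeric array for the math.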
>
> These kinds of financial series don't have that much data (speaking
> from the early 21st century point of view). The OP says 1000 series,
> 1800 observations per series. Maybe 5 data items per observation, 4
> bytes each. That's well under 50MB. I've found it satisfactory to
> keep the data someplace that's handy to get at, and easy to use. When
> I want to do analysis I pull it into whatever format is best for that
> analysis. Depending on the needs, it may not be necessary to try to
> arrange the data so you can get a view for analysis - the time for a
> copy can be negligible if the analysis takes a while.
>
> -r
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
As Josef mentioned, the pandas library is designed for the problem
we're discussing-- i.e. working with collections of time series or
cross-sectional data. The basic DataFrame object accepts a dict of
pandas.Series objects (or a dict of equal-length ndarrays and an array
of labels / dates) and provides slicing, reindexing, aggregation, and
other conveniences. I have not made an official release of the library
yet but it is quite robust and suitable for general use (I use it
actively in proprietary applications). HTML documentation is also not
available yet, but the docstrings are reasonably good. I hope to make
an official release by the end of the year with documentation, etc.
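For a rough idea of the dict-of-series construction, here is a sketch with invented symbols and dates; the API shown is present-day pandas, which has drifted somewhat since this 2009 post:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2009-01-01", periods=5, freq="B")  # business days
frame = pd.DataFrame({
    "AAA": pd.Series(np.linspace(10.0, 11.0, 5), index=dates),
    "BBB": pd.Series(np.linspace(20.0, 22.0, 5), index=dates),
})

# slicing and reindexing conveniences
recent = frame.loc[dates[2]:]                    # rows from a date onward
aligned = frame.reindex(columns=["BBB", "AAA"])  # reorder/select columns
```

Series passed in with different date indexes are aligned on the union of the dates, with missing observations filled as NaN.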