[SciPy-user] [timeseries] Missing dates
Sat Apr 4 12:41:42 CDT 2009
2009/4/4 Matt Knox <firstname.lastname@example.org>:
>> In the one plotting example (using yahoo finance) I saw that one can
>> fill missing dates before plotting so that the missing ones get
>> masked. Though when applying some moving windows functions that
>> caused all periods that were affected by the missing values to also
>> become masked, which isn't the behaviour I was expecting. It does
>> make sense to do it that way though.
>> Obviously it's simple enough to use the original timeseries to
>> calculate the moving window functions, or interpolate or something.
> You hit the nail on the head here. There is no way for the timeseries module to
> know what the user thinks is the proper way to handle the masked values here, so
> the sensible thing to do is mask the whole region. You can calculate the moving
> average on the original series (i.e. before you call fill_missing_dates), or
> interpolate the data somehow first (e.g. using forward_fill), etc.
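To make the two options above concrete, here is a minimal sketch in plain Python. It does not use the timeseries module: `None` stands in for a masked value, and `moving_average` and `forward_fill` below are hypothetical helpers written only for illustration, not the library functions of the same name. The prices are made up.

```python
def moving_average(values, window):
    """Trailing mean; None where the window is incomplete or contains a None."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)          # one missing value "masks" the whole window
        else:
            out.append(sum(chunk) / window)
    return out


def forward_fill(values):
    """Replace each None with the most recent non-None value (leading Nones stay None)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


prices = [10.0, 11.0, None, 13.0, 14.0, 15.0]   # None marks a filled-in missing date

# Option 0: average the filled series as-is; the gap spreads over `window` results.
masked_ma = moving_average(prices, 3)
# Option 1: average the original series, i.e. before the missing date was added.
original_ma = moving_average([p for p in prices if p is not None], 3)
# Option 2: forward-fill the gap first, then average.
filled_ma = moving_average(forward_fill(prices), 3)
```

With a window of 3, the single missing date hides three averages in `masked_ma`, in addition to the two start-up positions that are empty in any case, whereas the other two variants only lack the start-up positions.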
>> The question I'm trying to get at, though, is this: if I'm going to store my
>> timeseries in HDF5, should I fill in the missing dates before I do so, or
>> only do that whenever I plot the timeseries? I'm working with stock
>> prices, so the "missing" dates over the weekends will increase file
>> size by more than 30%. Is there any other reason to fill in missing
>> dates besides for plotting?
> Note that in the example you are talking about, the series is a "BUSINESS"
> frequency series
> dates = ts.date_array([q for q in quotes], freq='DAILY').asfreq('BUSINESS')
> so calling fill_missing_dates on this has the effect of adding masked values for
> the HOLIDAYS, but not Saturday and Sunday.
I'm trying to wrap my head around how the different frequencies
behave... Correct me if I'm wrong. Using the yahoo example as a
reference: A timeseries with daily frequency for one year does not
need to have a date value for every day in that year in its date
array. But when it gets plotted the index (x-axis) runs over the
entire year, and a line plot will simply connect all the dots
basically as if it were linearly interpolating the values for the
missing dates. Thus even though the frequency is expecting values for
every day, the process of masking the missing dates needs to be done
explicitly. Which is done by fill_missing_dates(), which then adds a
date to our array for every date that the frequency expects, and sets
the mask to true for those dates that weren't in the array initially.
So when I'm using a business frequency the index on plots doesn't
behave like a calendar; instead, Mondays immediately follow Fridays. And
the only dates 'missing' are in fact the holidays as you pointed out,
which we need to add explicitly so that we can mask them.
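To check my understanding, here is a rough sketch of that idea using only the standard library. It is not the timeseries API: `business_days` and `fill_missing_business_days` are hypothetical helpers, the prices are made up, and "business day" is assumed to mean simply Monday to Friday.

```python
from datetime import date, timedelta


def business_days(start, end):
    """All Monday-Friday dates from start to end, inclusive."""
    day, out = start, []
    while day <= end:
        if day.weekday() < 5:          # 0 = Monday ... 4 = Friday
            out.append(day)
        day += timedelta(days=1)
    return out


def fill_missing_business_days(observed):
    """Return (dates, values, mask) covering every business day in the observed span."""
    dates = business_days(min(observed), max(observed))
    values = [observed.get(d) for d in dates]     # None where nothing was recorded
    mask = [d not in observed for d in dates]     # True for dates added by the fill
    return dates, values, mask


# Made-up prices around Friday 2009-04-10 (Good Friday, a US market holiday).
observed = {
    date(2009, 4, 8): 100.0,    # Wednesday
    date(2009, 4, 9): 101.0,    # Thursday
    date(2009, 4, 13): 102.0,   # Monday
    date(2009, 4, 14): 103.0,   # Tuesday
}

dates, values, mask = fill_missing_business_days(observed)
```

Here the filled series gains exactly one date, the Friday holiday, with its mask set to true. The weekend of 11 and 12 April never appears, because those days are not part of a Monday to Friday calendar in the first place.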
> Now as to whether or not one should fill in the holidays for storage purposes is
> a judgement call, but I generally find it simpler to just forward fill all
> holidays (see the forward_fill function in the interpolation section of the
> docs) in a batch job overnight and that way any reports or models don't have to
> think about adding special logic to handle holidays which can be somewhat
> complicated, especially if you are talking about global data with different
> calendars and so forth. Yes, this can introduce inaccuracies to some degree, but
> for most use cases I have found the gains in simplicity more than outweigh those
The way I've been working with my data up until now was purely looking
at trading days, ignoring weekends, holidays etc. So any analysis
that takes time into account basically has 'trading day' as its unit
of time. This makes some things simpler. close[-11] would always be
the closing price 10 trading days ago. Rate of change or other
indicators aren't affected by long holidays, though that's just a
The disadvantage is obviously that indexing the array with an actual
date becomes a bit harder and you always need to do a search.
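A small sketch of what I mean, with made-up data and standard-library Python only. `close_on` is a hypothetical helper, and the dates list is assumed to be sorted and to contain trading days only.

```python
from bisect import bisect_left
from datetime import date

# Made-up data: sorted trading dates with matching closes (no weekends or holidays stored).
dates = [date(2009, 3, 30), date(2009, 3, 31), date(2009, 4, 1),
         date(2009, 4, 2), date(2009, 4, 3)]
close = [50.0, 51.0, 52.0, 53.0, 54.0]


def close_on(day):
    """Return the close for `day`, or None if `day` is not in the stored dates."""
    i = bisect_left(dates, day)               # binary search in the sorted date list
    if i < len(dates) and dates[i] == day:
        return close[i]
    return None


two_days_ago = close[-3]                      # positional: 2 trading days before the last bar
on_april_1 = close_on(date(2009, 4, 1))       # by date: needs the search
on_saturday = close_on(date(2009, 4, 4))      # a Saturday, not stored, so None
```

Positional access stays a plain index, while a lookup by date costs a binary search and has to cope with dates that were never stored.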
What would really be the advantage if I were to use 'business day' as
And another question: Do you use the closing price from yahoo or the
adjusted closing price? It seems they use the adjusted prices
themselves, though I've come across one or two graphs where they
Thanks for all your advice Matt and Pierre.
> - Matt