[SciPy-user] fill a timeseries with masked data by correlation from another series
Robert Kern
robert.kern@gmail....
Thu Apr 3 19:56:30 CDT 2008
On Thu, Apr 3, 2008 at 5:52 PM, Marco Tuckner
<marcotuckner@public-files.de> wrote:
> I would like to correlate two (or more) timeseries to estimate invalid and
> masked values in one series based on the values of another complete series using
> the correlation coefficient.
> How can I do that?
With difficulty. If your data is close enough to multivariate normal,
then one can use an Expectation-Maximization (EM) method to jointly
estimate the common mean and covariance along with the missing data.
This is fairly common in financial circles for measuring risk. I don't
have an Internet reference on-hand, but the book _Computational
Statistics_ by Givens and Hoeting has a chapter on this.
http://www.amazon.com/Computational-Statistics-Wiley-Probability/dp/0471461245
In my experience, it is slow and may not converge.
The approach that Pierre suggests (finding a common mask where you
have data for all time series) is good for 2 series (in fact, it's
probably optimal), but with increasing numbers of series, you will
most likely lose too many days to make a reasonable estimate.
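
As a rough illustration of the common-mask idea for two series, here is a
minimal sketch using numpy.ma. The function name common_mask_corrcoef and the
toy data are made up for this example; the count of usable days is returned
so you can see how much data survives the common mask.

```python
import numpy as np
import numpy.ma as ma

def common_mask_corrcoef(a, b):
    """Pearson correlation of two masked series over the days where both have data.

    Hypothetical helper for illustration; returns (correlation, number of common days).
    """
    a = ma.asarray(a, dtype=float)
    b = ma.asarray(b, dtype=float)
    # A day is usable only if neither series is masked there.
    valid = ~(ma.getmaskarray(a) | ma.getmaskarray(b))
    x = ma.getdata(a)[valid]
    y = ma.getdata(b)[valid]
    return np.corrcoef(x, y)[0, 1], int(valid.sum())

# Toy data, invented for this example; NaN marks an invalid reading.
a = ma.masked_invalid([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
b = ma.masked_invalid([2.1, 3.9, 6.2, np.nan, 10.1, 12.0])
r, n_common = common_mask_corrcoef(a, b)
print(r, n_common)
```

With more than two series the same mask has to be OR-ed across all of them,
which is where the number of usable days tends to collapse.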
Yet another approach is to find the correlations between each pair of
series using the common mask for each pair. This will almost certainly
give you an invalid correlation matrix (a valid one has all eigenvalues >= 0),
but from that you can find the closest valid correlation matrix. There
are a couple of ways you could implement this; the one I've used
successfully is called Alternating Projections.
http://citeseer.ist.psu.edu/higham02computing.html
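
A minimal sketch of the two pieces described above, assuming the data sit in
a 2-D float array with NaN for missing values. The names pairwise_corr and
nearest_corr are my own, and the second function is a plain unweighted
alternating-projections loop with Dykstra's correction in the spirit of the
Higham paper, not a tuned implementation. It assumes every pair of series has
at least a few common days.

```python
import numpy as np

def pairwise_corr(data):
    """Correlation of each pair of columns, using that pair's own common mask.

    data: 2-D float array, rows are days, columns are series, NaN = missing.
    """
    n = data.shape[1]
    corr = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            c = np.corrcoef(data[both, i], data[both, j])[0, 1]
            corr[i, j] = corr[j, i] = c
    return corr

def nearest_corr(a, tol=1e-8, max_iter=1000):
    """Alternate between the PSD cone and the unit-diagonal set."""
    y = np.array(a, dtype=float)
    ds = np.zeros_like(y)  # Dykstra correction for the PSD projection
    for _ in range(max_iter):
        r = y - ds
        # Project onto the PSD cone by clipping negative eigenvalues.
        w, v = np.linalg.eigh((r + r.T) / 2.0)
        x = (v * np.maximum(w, 0.0)) @ v.T
        ds = x - r
        # Project onto symmetric matrices with ones on the diagonal.
        y_new = x.copy()
        np.fill_diagonal(y_new, 1.0)
        done = np.linalg.norm(y_new - y) <= tol * np.linalg.norm(y_new)
        y = y_new
        if done:
            break
    return y

# An invalid "correlation" matrix: its smallest eigenvalue is negative.
bad = np.array([[1.0, 0.9, -0.9],
                [0.9, 1.0, 0.9],
                [-0.9, 0.9, 1.0]])
fixed = nearest_corr(bad)
print(np.linalg.eigvalsh(bad).min(), np.linalg.eigvalsh(fixed).min())
```

The result has a unit diagonal exactly and is positive semidefinite only up to
the tolerance, so a tiny negative eigenvalue can remain after the loop stops.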
To impute the missing data, you essentially apply the "Expectation"
step of the EM method.
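
In code, that step amounts to replacing each missing value with its
conditional expectation given the observed values on the same day, under a
multivariate normal model. The sketch below assumes you already have a mean
vector and a covariance matrix (for example a repaired correlation matrix
scaled by the per-series standard deviations, corr * np.outer(std, std)). The
function name impute_conditional_mean is made up for this example.

```python
import numpy as np

def impute_conditional_mean(data, mean, cov):
    """Fill NaNs with E[x_missing | x_observed] under a multivariate normal.

    data: 2-D float array, rows are days, columns are series, NaN = missing.
    mean, cov: assumed estimates of the common mean and covariance.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    filled = np.array(data, dtype=float)  # work on a copy
    for row in filled:  # each row is a view, so assignments modify `filled`
        miss = np.isnan(row)
        if not miss.any():
            continue
        if miss.all():
            row[:] = mean  # nothing observed that day; fall back to the mean
            continue
        obs = ~miss
        s_oo = cov[np.ix_(obs, obs)]
        s_mo = cov[np.ix_(miss, obs)]
        # mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        row[miss] = mean[miss] + s_mo @ np.linalg.solve(s_oo, row[obs] - mean[obs])
    return filled

# Toy two-series example with made-up numbers.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
data = np.array([[1.0, np.nan],
                 [np.nan, 2.0],
                 [0.5, 0.5]])
filled = impute_conditional_mean(data, mean, cov)
print(filled)
```

Note that this fills in point estimates only; the imputed values are smoother
than real data would be, because the conditional variance is ignored.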
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco