[SciPy-User] stats, classes instead of functions for results MovStats

Bruce Southey bsouthey@gmail....
Mon Nov 30 15:05:03 CST 2009


On 11/22/2009 11:43 PM, josef.pktd@gmail.com wrote:
> Following up on a question by Keith on the numpy list and his reminder
> that covariance can be calculated by the cross-product minus the
> product of the means, I redid and
> enhanced my moving stats functions.
>
> Suppose x and y are two time series, then the moving correlation
> requires the calculation of the mean, variance and covariance for each
> window. Currently in scipy stats intermediate results are usually
> thrown away on return (while rpy/R returns all intermediate results
> used for the calculation.
>
> Using a decorator/descriptor of Fernando written for nitime, I tried
> out to write the function as a class instead, so that any desired (
> intermediate) calculations are only made on demand, but once they are
> calculated they are attached to the class as attributes or properties.
> This seems to be a useful "pattern".
>
> Are there any opinion for using the pattern in scipy.stats ? MovStats
> will currently go into statsmodels
>
> Below is the class (with cutting part of init), a full script is the
> attachment, including examples that test the class.
>
> about MovStats:
> y and x are tested for 2d, either (T,N) with axis=0 or (N,T) with
> axis=1, should (but may not yet) work for nd arrays along any axis
> (signal.correlate docstring)
> nans are handled by dropping the corresponding observations from the
> window, not adding any additional observations,
> not tested if a window is empty because it contains only nans, nor if
> variance is zero
> (kern is intended for weighted statistics in the window but not tested
> yet, I still need to decide on normalization requirements)
> requires scipy.signal, all calculations done with signal.correlate, no loops
> as often, functions are one-liners
> all results are returned for valid observations only, initial
> observations with incomplete window are cut
> bonus: slope of moving regression of y on x, since it was trivial to add
> still some cleaning and documentation to do
>
> usage:
> ms = MovStats(x, y, axis=1)
> ms.yvar
> ms.xmean
> ms.yxcorr
> ms.yxcov
> ...
>
>
> Josef
>
> class MovStats(object):
>      def __init__(self, y, x=None, kern=5, axis=0):
>          self.y = y
>          self.x = x
>          if np.isscalar(kern):
>              ws = kern
> <... snip>
>
>      @OneTimeProperty
>      def ymean(self):
>          ys = signal.correlate(self.y, self.kern, mode='same')[self.sslice]
>          ym = ys/self.n
>          return ym
>
>      @OneTimeProperty
>      def yvar(self):
>          ys2 = signal.correlate(self.y*self.y, self.kern,
> mode='same')[self.sslice]
>          yvar = ys2/self.n  - self.ymean**2
>          return yvar
>
>      @OneTimeProperty
>      def xmean(self):
>          if self.x is None:
>              return None
>          else:
>              xs = signal.correlate(self.x, self.kern, mode='same')[self.sslice]
>              xm = xs/self.n
>              return xm
>
>      @OneTimeProperty
>      def xvar(self):
>          if self.x is None:
>              return None
>          else:
>              xs2 = signal.correlate(self.x*self.x, self.kern,
> mode='same')[self.sslice]
>              xvar = xs2/self.n  - self.xmean**2
>              return xvar
>      @OneTimeProperty
>      def yxcov(self):
>          xys = signal.correlate(self.x*self.y, self.kern,
> mode='same')[self.sslice]
>          return xys/self.n - self.ymean*self.xmean
>
>      @OneTimeProperty
>      def yxcorr(self):
>          return self.yxcov/np.sqrt(self.yvar*self.xvar)
>
>      @OneTimeProperty
>      def yxslope(self):
>          return self.yxcov/self.xvar
>    
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>    
I think your handling of NaN's is incorrect because you do not drop the 
corresponding observations. That is for two arrays

y=np.array([[ 1.229563, -0.339428,  0.83891 ,  4.026574,  3.069378,  
5.95668 ]])
x=np.array([[-1.236469,  1.941089, -0.346566, -0.268529,     np.nan,  
0.191336]])

For a windows size of 5, in the first window, the first mean and 
variance of y should use all 5 elements of y, the mean and variance of X 
should use the first 4 elements of x and, the regression and correlation 
coefficients should use the first 4 elements of x and y.


Some other points:
1) Your calculation of variance is susceptible to errors, see
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Provided that you are using sufficient precision (like numpy defaults) 
it is probably not that big a problem.
2) You only use the 'full' windows so when the window width is 5, you 
miss the first two windows and the last 2 windows. At least the mean 
exists in these windows and the variance in most of these partial 
windows. This may provided unexpected results to a user if they do not 
release which windows are not returned.
3) I think the user needs to define the kern argument for your MovStat 
class as there is probably no meaningful default value (except 42).
4) I do not know how you should handle positive and negative infinity.
5) Your code expects at least 2 dimensions so 1-d arrays fail because 
you can not do this assignment 'kdim[axis] = ws' with 1-d arrays.

Bruce



-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20091130/64ff314a/attachment.html 


More information about the SciPy-User mailing list