[SciPy-User] Proposal for a new data analysis toolbox
Mon Nov 22 09:52:33 CST 2010
On Mon, Nov 22, 2010 at 10:35 AM, Keith Goodman <email@example.com> wrote:
> This thread started on the numpy list:
> I think we should narrow the focus of the package by only including
> functions that operate on numpy arrays. That would cut out date
> utilities, label indexing utilities, and binary operations with
> various join methods on the labels. It would leave us with three
> categories: faster versions of numpy/scipy nan functions, moving
> window statistics, and group functions.
> I suggest we add a fourth category: normalization.
> FASTER NUMPY/SCIPY NAN FUNCTIONS
> This work is already underway: http://github.com/kwgoodman/nanny
> The function signatures for these are easy: we copy numpy, scipy. (I
> am tempted to change nanstd from scipy's bias=False to ddof=0.)
scipy.stats.nanstd is supposed to switch to ddof, so don't copy
inconsistent signatures that are supposed to be deprecated.
I would like statistics (scipy.stats and statsmodels) to stick with
I would be in favor of axis=None for nan extended versions of numpy
functions and axis=0 for stats functions as defaults, but since it
will be a standalone package with wider usage, I will be able to keep
track of axis=-1.
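
To make the bias/ddof point concrete: scipy's bias=False corresponds to
ddof=1, while numpy's own default is ddof=0. A small sketch, assuming a
NumPy recent enough to ship np.nanstd (it was added to NumPy after this
thread), so it illustrates the convention rather than the proposed package:

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0])

# ddof=0 divides by the number of non-NaN values (the "biased" estimate).
std_biased = np.nanstd(arr, ddof=0)

# ddof=1 divides by (number of non-NaN values - 1), i.e. the old bias=False.
std_unbiased = np.nanstd(arr, ddof=1)
```

With three non-NaN values the two results differ by a factor of
sqrt(3/2), which is why the default matters for small samples.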
> I'd like to use a partial sort for nanmedian. Anyone interested in coding that?
> dtype: int32, int64, float64 for now
> ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open
> project for anyone)
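
One way the partial-sort idea could look for the 1d case. This is only a
sketch: nanmedian_1d is a hypothetical name, and it leans on np.partition,
which assumes a NumPy newer than the one current for this thread. A cython
version would do the selection by hand.

```python
import numpy as np

def nanmedian_1d(arr):
    """Median of a 1d array ignoring NaNs, using a partial sort."""
    a = np.asarray(arr, dtype=np.float64)
    a = a[~np.isnan(a)]  # boolean indexing copies, so the input is untouched
    n = a.size
    if n == 0:
        return np.nan  # all-NaN (or empty) input has no median
    k = n // 2
    if n % 2 == 1:
        # Odd count: only the k-th smallest element has to be in place.
        return np.partition(a, k)[k]
    # Even count: place the two middle elements, then average them.
    part = np.partition(a, [k - 1, k])
    return 0.5 * (part[k - 1] + part[k])
```

The partial sort only guarantees the position of the requested elements,
which is all a median needs, so it avoids the cost of a full sort.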
> MOVING WINDOW STATISTICS
> I already have doc strings and unit tests
> (https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I
> have a cython prototype that moves the window backwards so that the
> stats can be filled in place. (This assumes we make a copy of the data
> at the top of the function: arr = arr.astype(float))
> Proposed function signature: mov_sum(arr, window, axis=-1),
> mov_nansum(arr, window, axis=-1)
> If you don't like mov, then: move? roll?
> I think requesting a minimum number of non-nan elements in a window or
> else returning NaN is clever. But I do like the simple signature
> Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.
> Optional: moving window bootstrap estimate of error (std) of the
> moving statistic. So, what's the std of each estimate in the
> mov_median output? Too specialized?
> dtype: float64
> ndim: 1, 2, 3, recursive for nd > 3
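
A pure-numpy sketch of the proposed mov_nansum(arr, window, axis=-1)
signature, with the "minimum number of non-nan elements" idea added as an
optional argument. The min_count name is my assumption, not part of the
proposal, and the cumulative-sum trick used here is a stand-in for the
backwards-moving in-place cython prototype described above.

```python
import numpy as np

def mov_nansum(arr, window, axis=-1, min_count=1):
    """Moving-window sum that ignores NaNs (illustrative sketch).

    The first window-1 outputs along `axis` are NaN, as is any window
    with fewer than `min_count` non-NaN elements.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, -1)
    valid = ~np.isnan(a)
    filled = np.where(valid, a, 0.0)  # NaNs contribute nothing to the sum

    # Cumulative sums with a leading zero, so that the sum over a window
    # ending at position i is csum[i + 1] - csum[i + 1 - window].
    pad = np.zeros(a.shape[:-1] + (1,))
    csum = np.concatenate([pad, np.cumsum(filled, axis=-1)], axis=-1)
    ccnt = np.concatenate([pad, np.cumsum(valid, axis=-1)], axis=-1)

    out = np.full(a.shape, np.nan)
    if window <= a.shape[-1]:
        sums = csum[..., window:] - csum[..., :-window]
        counts = ccnt[..., window:] - ccnt[..., :-window]
        out[..., window - 1:] = np.where(counts >= min_count, sums, np.nan)
    return np.moveaxis(out, -1, axis)
```

With min_count=1 this keeps the simple signature's behavior (a window is
NaN only if every element in it is NaN), and setting min_count=window
requires a full window of valid data. Differencing cumulative sums can
lose precision on long inputs, which is one reason to prefer an explicit
loop in cython.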
> NORMALIZATION
> I already have nd versions of ranking, zscore, quantile, demean,
> demedian, etc in larry. We should rename to nanzscore etc.
> ranking and quantile could use some cython love.
> I don't know, should we cut this category?
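
For reference, a NaN-aware z-score is short to write in plain numpy. This
sketch uses the proposed nanzscore name, but the axis=0 and ddof=0 defaults
are my assumptions, and it relies on np.nanmean/np.nanstd from a later
NumPy than the one current for this thread.

```python
import numpy as np

def nanzscore(arr, axis=0, ddof=0):
    """Z-score along `axis`, ignoring NaNs when computing mean and std."""
    a = np.asarray(arr, dtype=np.float64)
    # keepdims=True lets the statistics broadcast back against `a`.
    mean = np.nanmean(a, axis=axis, keepdims=True)
    std = np.nanstd(a, axis=axis, ddof=ddof, keepdims=True)
    # NaN inputs stay NaN in the output; a zero std gives inf or NaN.
    return (a - mean) / std
```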
> GROUP FUNCTIONS
> Input: array, sequence of labels such as a list, axis.
> For an array of shape (n,m), axis=0, and a list of n labels with d
> distinct values, group_nanmean would return a (d,m) array. I'd also
> like a groupfilter_nanmean which would return a (n,m) array and would
> have an additional, optional input: exclude_self=False.
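
A sketch of the group_nanmean shape contract described above: an (n, m)
array with n labels holding d distinct values gives a (d, m) result for
axis=0. Returning the sorted distinct labels alongside the means is my
addition so the rows can be identified, and the Python loop over groups is
for clarity only.

```python
import numpy as np

def group_nanmean(arr, labels, axis=0):
    """Mean of each label group along `axis`, ignoring NaNs (sketch).

    Returns (means, distinct_labels), with groups ordered as np.unique
    sorts the labels.
    """
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, 0)
    labels = np.asarray(labels)
    if labels.ndim != 1 or labels.shape[0] != a.shape[0]:
        raise ValueError("need exactly one label per element along axis")

    uniq, inverse = np.unique(labels, return_inverse=True)
    out = np.full((uniq.size,) + a.shape[1:], np.nan)
    for g in range(uniq.size):
        rows = a[inverse == g]  # all slices carrying label uniq[g]
        valid = ~np.isnan(rows)
        count = valid.sum(axis=0)
        total = np.where(valid, rows, 0.0).sum(axis=0)
        with np.errstate(invalid="ignore"):
            out[g] = total / count  # 0/0 gives NaN for all-NaN groups
    return np.moveaxis(out, 0, axis), uniq
```

A groupfilter_nanmean would broadcast these group means back to the
original (n, m) shape, for example by indexing the result with the inverse
array from np.unique.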
> What should we call the package?
> Numa, numerical analysis with numpy arrays
> Dana, data analysis with numpy arrays
> import dana as da (da=data analysis)
> ARE YOU CRAZY?
> If you read this far, you are crazy and would be a good fit for this project.