[SciPy-User] Proposal for a new data analysis toolbox

Keith Goodman kwgoodman@gmail....
Mon Nov 22 09:35:21 CST 2010

This thread started on the numpy list:

I think we should narrow the focus of the package by only including
functions that operate on numpy arrays. That would cut out date
utilities, label indexing utilities, and binary operations with
various join methods on the labels. It would leave us with three
categories: faster versions of numpy/scipy nan functions, moving
window statistics, and group functions.

I suggest we add a fourth category: normalization.


This work is already underway: http://github.com/kwgoodman/nanny

The function signatures for these are easy: we copy numpy, scipy. (I
am tempted to change nanstd from scipy's bias=False to ddof=0.)
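To make the bias/ddof point concrete, here is a rough pure-numpy sketch of a nanstd that takes ddof. The name nanstd_sketch is made up for illustration and this is not the code in the repo; ddof=0 gives the biased estimate and ddof=1 gives the unbiased one.

```python
import numpy as np

def nanstd_sketch(arr, axis=None, ddof=0):
    """Standard deviation ignoring NaNs (illustrative only, not the nanny code)."""
    arr = np.asarray(arr, dtype=np.float64)
    mask = ~np.isnan(arr)
    # Number of non-NaN elements along the reduction axis.
    count = mask.sum(axis=axis, keepdims=True)
    filled = np.where(mask, arr, 0.0)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = filled.sum(axis=axis, keepdims=True) / count
        # Squared deviations, with NaN positions contributing zero.
        sq = np.where(mask, (arr - mean) ** 2, 0.0)
        dof = count - ddof
        var = sq.sum(axis=axis, keepdims=True) / dof
    # Not enough data for the requested ddof: return NaN.
    var = np.where(dof > 0, var, np.nan)
    return np.squeeze(np.sqrt(var), axis=axis)
```

With ddof as the keyword, the call looks the same as numpy's std, which is the reason I am tempted by it.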

I'd like to use a partial sort for nanmedian. Anyone interested in coding that?
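For anyone who wants a starting point, the idea can be sketched in pure numpy for the 1d case. This assumes np.partition, which is available in current NumPy, standing in for a hand-written partial sort; a cython version would do its own selection. The name nanmedian_1d is hypothetical.

```python
import numpy as np

def nanmedian_1d(arr):
    """Median of a 1d array ignoring NaNs, using a partial sort (sketch)."""
    a = np.asarray(arr, dtype=np.float64)
    a = a[~np.isnan(a)]          # boolean indexing returns a copy
    n = a.size
    if n == 0:
        return np.nan
    half = n // 2
    if n % 2:
        # Odd length: only the middle element needs to be in sorted position.
        return float(np.partition(a, half)[half])
    # Even length: the two middle elements need to be in sorted position.
    part = np.partition(a, [half - 1, half])
    return float(0.5 * (part[half - 1] + part[half]))
```

The point of the partial sort is that only the middle element (or two) has to end up in place, so the rest of the array never gets fully sorted.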

dtype: int32, int64, float64 for now
ndim: 1, 2, 3 (need some recursive magic for nd > 3; that's an open
project for anyone)
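One non-recursive way to handle nd > 3, sketched here as an assumption rather than the plan for the package, is to move the reduction axis last, flatten everything else into one leading dimension, and call a 2d kernel. The helper name reduce_along_axis and its func2d argument are made up for this example.

```python
import numpy as np

def reduce_along_axis(func2d, arr, axis=-1):
    """Apply a 2d reduction (over axis 1) to an array of any ndim (sketch).

    func2d is a hypothetical kernel that takes a 2d array and reduces
    over its last axis, returning a 1d array.
    """
    a = np.moveaxis(np.asarray(arr), axis, -1)
    lead_shape = a.shape[:-1]
    # Collapse all non-reduction axes into one so the 2d kernel can be reused.
    flat = a.reshape(-1, a.shape[-1])
    return np.asarray(func2d(flat)).reshape(lead_shape)
```

The reshape may copy when the moved axis is not contiguous, so a real implementation would want to weigh that against writing the recursion.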


I already have doc strings and unit tests
(https://github.com/kwgoodman/la/blob/master/la/farray/mov.py). And I
have a cython prototype that moves the window backwards so that the
stats can be filled in place. (This assumes we make a copy of the data
at the top of the function: arr = arr.astype(float))

Proposed function signature: mov_sum(arr, window, axis=-1),
mov_nansum(arr, window, axis=-1)
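To show what the proposed signatures would return, here is a pure-numpy sketch based on a cumulative sum. It is not the cython prototype and does not move the window backwards; it assumes the output has the same shape as the input, with NaN in the first window-1 positions along the axis.

```python
import numpy as np

def mov_sum(arr, window, axis=-1):
    """Moving (trailing) window sum along axis (cumsum-based sketch)."""
    arr = np.asarray(arr).astype(np.float64)   # copy, as in the prototype
    n = arr.shape[axis]
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and arr.shape[axis]")
    a = np.moveaxis(arr, axis, -1)
    csum = np.cumsum(a, axis=-1)
    out = np.full(a.shape, np.nan)
    # First full window is just the cumulative sum up to that point.
    out[..., window - 1] = csum[..., window - 1]
    # Later windows are differences of cumulative sums.
    out[..., window:] = csum[..., window:] - csum[..., :-window]
    return np.moveaxis(out, -1, axis)

def mov_nansum(arr, window, axis=-1):
    """Like mov_sum, but NaNs inside a window count as zero."""
    arr = np.asarray(arr, dtype=np.float64)
    return mov_sum(np.where(np.isnan(arr), 0.0, arr), window, axis=axis)
```

The cumsum trick can lose precision on long arrays with large values, which is one reason an explicit loop in cython is attractive.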

If you don't like mov, then: move? roll?

I think requesting a minimum number of non-nan elements in a window or
else returning NaN is clever. But I do like the simple signature.
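For comparison, this is roughly what the minimum-count idea would look like. The function name mov_nansum_min and the min_count parameter are hypothetical; they are not part of the proposed signature.

```python
import numpy as np

def _window_sum(a, window):
    """Trailing-window sum along the last axis; first window-1 slots are NaN."""
    csum = np.cumsum(a, axis=-1, dtype=np.float64)
    out = np.full(a.shape, np.nan)
    out[..., window - 1] = csum[..., window - 1]
    out[..., window:] = csum[..., window:] - csum[..., :-window]
    return out

def mov_nansum_min(arr, window, min_count=1, axis=-1):
    """Moving nansum that returns NaN when a window has too few non-NaN values."""
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, -1)
    valid = ~np.isnan(a)
    total = _window_sum(np.where(valid, a, 0.0), window)
    count = _window_sum(valid, window)       # non-NaN elements per window
    with np.errstate(invalid="ignore"):
        enough = count >= min_count          # NaN counts compare as False
    total[~enough] = np.nan
    return np.moveaxis(total, -1, axis)
```

The cost is one extra keyword and one extra pass to count the valid elements, which is the trade-off against the simple signature.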

Binary moving window functions: mov_nancorr(arr1, arr2, window, axis=-1), etc.

Optional: moving window bootstrap estimate of error (std) of the
moving statistic. So, what's the std of each estimate in the
mov_median output? Too specialized?
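In case the bootstrap idea is unclear, a slow 1d sketch: for each window, resample the window with replacement, take the median of each resample, and report the std of those medians. The function name and the n_boot and seed parameters are assumptions made for this example, and it uses np.random.default_rng from current NumPy.

```python
import numpy as np

def mov_median_bootstrap_std(arr, window, n_boot=200, seed=None):
    """Bootstrap std of the moving median for a 1d array (illustrative, slow)."""
    a = np.asarray(arr, dtype=np.float64)
    rng = np.random.default_rng(seed)
    out = np.full(a.shape, np.nan)       # first window-1 entries stay NaN
    for i in range(window - 1, a.size):
        win = a[i - window + 1 : i + 1]
        # Each row of idx is one resample (with replacement) of the window.
        idx = rng.integers(0, window, size=(n_boot, window))
        medians = np.median(win[idx], axis=1)
        out[i] = medians.std()
    return out
```

A NaN in a window can propagate to that window's output here, since this sketch uses the plain median rather than a nanmedian.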

dtype: float64
ndim: 1, 2, 3, recursive for nd > 3


I already have nd versions of ranking, zscore, quantile, demean,
demedian, etc in larry. We should rename to nanzscore etc.
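As an illustration of what a nanzscore would do (this is not the larry implementation), here is a sketch that leans on np.nanmean and np.nanstd from current NumPy. The axis and ddof defaults are my assumptions.

```python
import numpy as np

def nanzscore(arr, axis=0, ddof=0):
    """Z-score along axis, ignoring NaNs when computing mean and std (sketch)."""
    a = np.asarray(arr, dtype=np.float64)
    mean = np.nanmean(a, axis=axis, keepdims=True)
    std = np.nanstd(a, axis=axis, ddof=ddof, keepdims=True)
    # NaN inputs stay NaN in the output; a zero std gives NaN or inf.
    return (a - mean) / std
```

The output has the same shape as the input, which is what separates this category from the reducing nan functions.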

ranking and quantile could use some cython love.

I don't know, should we cut this category?


Input: array, sequence of labels such as a list, axis.

For an array of shape (n,m), axis=0, and a list of n labels with d
distinct values, group_nanmean would return a (d,m) array. I'd also
like a groupfilter_nanmean which would return a (n,m) array and would
have an additional, optional input: exclude_self=False.
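A pure-numpy sketch of both functions follows. The assumptions here are mine: groups come back in sorted label order, and exclude_self=True means each element is replaced by the nanmean of the other members of its group.

```python
import numpy as np

def _group_sums(arr, labels, axis):
    """Per-group sums and non-NaN counts along the first (moved) axis."""
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, 0)
    uniq, inv = np.unique(np.asarray(labels), return_inverse=True)
    inv = inv.reshape(-1)
    valid = ~np.isnan(a)
    vals = np.where(valid, a, 0.0)
    sums = np.zeros((uniq.size,) + a.shape[1:])
    counts = np.zeros_like(sums)
    # Unbuffered accumulation: rows with the same label add into the same slot.
    np.add.at(sums, inv, vals)
    np.add.at(counts, inv, valid.astype(np.float64))
    return vals, valid, inv, sums, counts

def group_nanmean(arr, labels, axis=0):
    """(n, m) array with d distinct labels -> (d, m) array of group nanmeans."""
    _, _, _, sums, counts = _group_sums(arr, labels, axis)
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
    return np.moveaxis(means, 0, axis)

def groupfilter_nanmean(arr, labels, axis=0, exclude_self=False):
    """Same shape as arr: each element becomes the nanmean of its group."""
    vals, valid, inv, sums, counts = _group_sums(arr, labels, axis)
    s, c = sums[inv], counts[inv]
    if exclude_self:
        # Remove each element's own contribution from its group total.
        s, c = s - vals, c - valid
    with np.errstate(invalid="ignore", divide="ignore"):
        out = s / c
    return np.moveaxis(out, 0, axis)
```

For example, with arr = [[1, 2], [3, nan], [5, 6]] and labels ['a', 'b', 'a'], group_nanmean gives [[3, 4], [3, nan]] and groupfilter_nanmean gives [[3, 4], [3, nan], [3, 4]].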


What should we call the package?

Numa, numerical analysis with numpy arrays
Dana, data analysis with numpy arrays

import dana as da     (da=data analysis)


If you read this far, you are crazy and would be a good fit for this project.
