[SciPy-dev] RFR: Proposed fixes in scipy.stats functions for calculation of variance/error/etc.

josef.pktd@gmai... josef.pktd@gmai...
Sun Oct 25 23:59:45 CDT 2009

On Mon, Oct 26, 2009 at 12:19 AM,  <josef.pktd@gmail.com> wrote:
> On Sun, Oct 25, 2009 at 11:49 PM, Ariel Rokem <arokem@berkeley.edu> wrote:
>> Hi Josef and all,
>> thanks for looking. Concerning the z-score functions - I am also
>> confused by those and I would suggest unifying them under one
>> function. In particular, I can't imagine what the function 'z' is for.
>> However, I don't want to just remove these without discussion. What do
>> you think about this?
>> Another, more general thing, concerning the axis - I am wondering: why
>> is the default axis for scipy 0, while the default for numpy (in
>> np.mean, for example) is None? I think that it would be good to have
>> one convention for both libraries. I think that the more parsimonious
>> one is the one using "None" as the default value. This doesn't favor
>> any of the dimensions of an array over others, by default. I don't
>> know - how wide-spread is this convention within scipy?
> I had to run after the last message. My impression was that maybe in
> one of the changes the ddof=1 got lost, i.e. the distinction that was
> in scipy stats for population versus sample statistics.
> z and zmap look the same to me from the intended (?) calculation
> but zmap mixes up the axis arguments. (mean with "axis", std with
> hardcoded axis=0). Maybe the intention will be clearer when I look
> at the trac history or the original stats package.
> From looking at the three functions, I would assume that the combined
> function would have a signature like
> def zscore(a, compare=None, axis=0, ddof=0)
> or two functions, one with compare, one without ?


zs was the list version of the z-score and used z for the calculation. The
translation in the next changeset is correct only for 1d or raveled arrays,
and it is missing an axis argument. It looks like z was a helper function
for a scalar score. zmap was imported in its current form in revision 71.
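
To make the shape problem concrete, here is a minimal sketch (the array and
variable names are only illustrative, not taken from the patch). Reducing
along axis=0 happens to broadcast back against the input, while reducing
along axis=1 does not unless the reduced axis is put back:

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(3, 4)

# Reducing along axis=0 gives shape (4,), which broadcasts against (3, 4).
mu0, sigma0 = np.mean(a, axis=0), np.std(a, axis=0)
z0 = (a - mu0) / sigma0

# Reducing along axis=1 gives shape (3,), which cannot broadcast against (3, 4).
mu1, sigma1 = np.mean(a, axis=1), np.std(a, axis=1)
try:
    (a - mu1) / sigma1
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Re-inserting the reduced axis gives shape (3, 1), which broadcasts correctly.
z1 = (a - np.expand_dims(mu1, 1)) / np.expand_dims(sigma1, 1)
```

For a square array the axis=1 case would not raise at all; it would silently
normalize along the wrong axis, which is the more dangerous failure.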

stats.mstats has the same functions, but they look like literal translations,
since they have the same (ambiguous) treatment of axis if the input is not 1d.
stats.mstats.z uses ddof=1; the others use ddof=0.

With broadcasting and adjustment of the dimension of mean and std, only
a single score function seems necessary; the current functions look a bit
like historical relics.
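
A rough sketch of what such a single function could look like, following the
signature proposed in the quoted text. This is hypothetical and not an
existing scipy.stats function; in particular, treating compare as the
reference sample that supplies the mean and std is my assumption about the
intended meaning:

```python
import numpy as np

def zscore(a, compare=None, axis=0, ddof=0):
    """Hypothetical combined score function, not an existing scipy.stats API."""
    a = np.asarray(a, dtype=float)
    # Assumption: mean and std come from `compare` when it is given, else from `a`.
    ref = a if compare is None else np.asarray(compare, dtype=float)
    if axis is None:
        # Raveled statistics are scalars and broadcast against any shape.
        mu = ref.mean()
        sigma = ref.std(ddof=ddof)
    else:
        # Put the reduced axis back with length 1 so the result broadcasts.
        mu = np.expand_dims(ref.mean(axis=axis), axis)
        sigma = np.expand_dims(ref.std(axis=axis, ddof=ddof), axis)
    return (a - mu) / sigma

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
by_column = zscore(x)                    # population std (ddof=0), per column
by_row = zscore(x, axis=1)               # per row
sample = zscore(x, ddof=1)               # sample std (ddof=1), per column
mapped = zscore(np.array([[3.0, 4.0]]), compare=x)  # scores relative to x
```

With ddof as an argument, the population versus sample distinction becomes a
choice for the caller instead of a difference between function names.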


> About default axis=0:
> I think this is scipy.stats specific. We had a brief discussion a year
> ago, where Jarrod agreed that default for stats should remain axis=0.
> In statistics, you almost never want to ravel data, not mixing apples
> and cars, or prices and quantities. So the default should be reducing
> along an axis, e.g. mean over all observations by variable.
> axis=0 versus axis=-1: axis=0 is traditional in statistics/econometrics, both
> in other matrix packages (gauss, matlab) and in the textbook
> treatment (of the books that I know). Switching to -1 for the data would
> be a big mental break and would require axis translation of the
> textbook formulas, e.g. solve X'X beta = X'Y
> From my perspective, losing axis=0 as default is the main disadvantage
> of removing mean, var, and so on, from scipy.stats, e.g. I need to create
> a lambda function if I want mean(x, axis=0) as a callback function.
> Cheers,
> Josef
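
To illustrate the callback point, a small sketch: summarize is a made-up
consumer that calls whatever statistic it is handed, and functools.partial is
one alternative to the lambda for fixing axis=0:

```python
from functools import partial

import numpy as np

def summarize(data, stat):
    """Hypothetical consumer that calls a user-supplied statistic as stat(data)."""
    return stat(data)

# Two variables (columns) on very different scales, two observations (rows).
x = np.array([[1.0, 10.0], [3.0, 30.0]])

raveled = summarize(x, np.mean)                         # mixes both columns
by_lambda = summarize(x, lambda d: np.mean(d, axis=0))  # mean per column
by_partial = summarize(x, partial(np.mean, axis=0))     # same, without lambda
```

Passing np.mean directly averages over the raveled array, which is the
mixing of different variables that a default of axis=0 avoids.
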
>> Cheers,
>> Ariel
>> On Sun, Oct 25, 2009 at 8:16 PM,  <josef.pktd@gmail.com> wrote:
>>> On Sun, Oct 25, 2009 at 10:50 PM, Ariel Rokem <arokem@berkeley.edu> wrote:
>>>> Hi everyone,
>>>> I have been working on some fixes to the functions in scipy.stats
>>>> which calculate variance/error and related quantities. In particular,
>>>> in order to comply with the deprecation warnings that appear in use of
>>>> scipy.stats.samplevar/scipy.stats.samplestd, I have replaced use of
>>>> these functions with calls to np.std/np.var. I have also cleaned up
>>>> the documentation a bit.
>>>> This can all be found here: http://codereview.appspot.com/141051
>>>> Cheers,
>>>> Ariel
>>> I just gave it a quick look, looks good so far
>>> in  def zs  looks like a shape error for axis>0
>>> "return (a-mu)/sigma"
>>> def zs   changes definition, before it normalized with raveled mean,
>>> std not by axis
>>> - mu = np.mean(a,None)
>>> - sigma = samplestd(a)
>>> - return (array(a)-mu)/sigma
>>> + a,axis = _chk_asarray(a,axis)
>>> + mu = np.mean(a,axis)
>>> + sigma = np.std(a,axis)
>>> + return (a-mu)/sigma
>>> I never looked closely at these,
>>> zmap has a description I don't understand.
>>> z, zs, zm  ???
>>> Which is which? they look a bit inconsistent, population might refer
>>> to dof correction in z ?
>>> Is there a standard terminology for z scores?
>>> I think for axis, I have seen more "int or None" ?
>>> Josef
>>>> --
>>>> Ariel Rokem
>>>> Helen Wills Neuroscience Institute
>>>> University of California, Berkeley
>>>> http://argentum.ucbso.berkeley.edu/ariel
>>>> _______________________________________________
>>>> Scipy-dev mailing list
>>>> Scipy-dev@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
