[SciPy-dev] RFR: Proposed fixes in scipy.stats functions for calculation of variance/error/etc.

josef.pktd@gmai... josef.pktd@gmai...
Mon Oct 26 08:58:53 CDT 2009


On Mon, Oct 26, 2009 at 2:07 AM,  <josef.pktd@gmail.com> wrote:
> On Mon, Oct 26, 2009 at 1:51 AM,  <josef.pktd@gmail.com> wrote:
>> On Mon, Oct 26, 2009 at 1:31 AM, Ariel Rokem <arokem@berkeley.edu> wrote:
>>> Hi Josef -
>>>
>>>>
>>>> From looking at the three functions, I would assume that the combined
>>>> function would have a signature like
>>>>
>>>> def zscore(a, compare=None, axis=0, ddof=0)
>>>>
>>>> or two functions, one with compare, one without ?
>>>
>>> Yes - I think that would be best. After all, someone wrote zmap with
>>> some use case in mind (I assume), so we would still want that
>>> functionality to live on explicitly. So, I suggest (see attached diff)
>>> to have two functions: one will be zscore and the other would be
>>> zscore_compare. In the attached diff, I have decorated all these
>>> functions with a deprecation warning and added these two new
>>> functions, zscore (with the new, by-axis behavior. This makes more
>>> sense to me, somehow) and zscore_compare.
>>>
>>>>
>>>>
>>>> About default axis=0:
>>>>
>>> ...
>>>


>>> Thanks for the explanation and for digging into the history of this. I
>>> still think that in the long run it would be preferable to have these
>>> things be internally consistent (that is consistent between numpy and
>>> scipy), rather than consistent with other tools.

I hope the "long run" is very long.

axis=0 as default is consistent with views to structured arrays or recarrays
and to reading data from a csv file in the common orientation (genfromtxt).

I think it's more a convention for data analysis than a question of whether
you work with a package that has C orientation instead of Fortran orientation.
One exception to this is panel data, especially with a 3d array.
axis=-1 looks like a pain because it doesn't stay fixed when I add a
third dimension. axis=None is pretty much useless with a
dataset of variables with mixed units.
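
To illustrate the point about axis=-1 not staying fixed, here is a minimal
sketch. The array contents and the names `data` and `panel` are made up for
illustration; only the shapes matter.

```python
import numpy as np

# Hypothetical data: 5 observations (rows) of 3 variables (columns).
data = np.arange(15, dtype=float).reshape(5, 3)

# axis=0 reduces over observations, giving one value per variable.
col_means = data.mean(axis=0)          # shape (3,)

# Stack two such datasets into a panel with shape (5, 3, 2).
panel = np.stack([data, data + 100.0], axis=-1)

# axis=0 still reduces over observations ...
panel_means = panel.mean(axis=0)       # shape (3, 2)
# ... whereas axis=-1 now reduces over the new third dimension,
# not over the variables as it did for the 2d array.
last_axis_means = panel.mean(axis=-1)  # shape (5, 3)
```

With axis=0 the meaning of the reduction ("over observations") survives the
extra dimension; with axis=-1 it silently changes.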

Practicality beats Purity (especially if Purity is defined by the
software and not the problem)
and (almost) all of statsmodels is based on the convention that variables are columns.

Josef


>>> Finally - I have tried to combine sem and stderr into one function,
>>> under sem. Notice in particular the correction for ddof. My
>>> understanding is that this should produce per default the result
>>> std/sqrt(n-1), which is what we usually want for the sem. Is that
>>> correct?
>>
>>
>> Yes, I had to check the ttests, that's when I spent more time checking the
>> degrees of freedom. It looks like the denominator needs one "n" and one
>> "n-1"
>>
>>  v = np.var(a, axis, ddof=1)
>>  t = d / np.sqrt(v/float(n))
>>
>> sem(a, ddof=1, axis=0) should have ddof as last argument to match np.var.
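
A minimal sketch of the "one n and one n-1" denominator, assuming the
signature order suggested above (axis before ddof). The name `sem_sketch` is
hypothetical and this is not the code from the attached diff.

```python
import numpy as np

def sem_sketch(a, axis=0, ddof=1):
    """Standard error of the mean along `axis` (illustrative only)."""
    a = np.asarray(a, dtype=float)
    n = a.shape[axis]
    # ddof enters only through the standard deviation (the "n-1");
    # the outer division uses the plain sample size (the "n").
    return np.std(a, axis=axis, ddof=ddof) / np.sqrt(n)

x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
s = sem_sketch(x)

# The same quantity written as std/sqrt(n-1), with the biased (ddof=0) std.
s_alt = np.std(x, axis=0, ddof=0) / np.sqrt(x.shape[0] - 1)
```

Both expressions reduce to sqrt(SS / (n * (n - 1))), where SS is the sum of
squared deviations, so std/sqrt(n-1) is right only if std means the ddof=0
version.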
>>
>> your axis handling is still incorrect in zscore for 2d arrays
>>
>> if axis=1 then we need to add an axis
>> a.mean(1)[:,None]
>>
>> there is a function in numpy to do this, expand_dims, that
>> works for general axis. There was also a recent discussion
>> on the numpy list for getting the axis back after a reduce.
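
A sketch of axis handling that works for any axis, using np.expand_dims to
put the reduced axis back before broadcasting. The name `zscore_sketch` and
its defaults are assumptions taken from the signature discussed earlier in
the thread, not the final scipy.stats implementation.

```python
import numpy as np

def zscore_sketch(a, axis=0, ddof=0):
    """Standardize `a` along `axis` (illustrative only)."""
    a = np.asarray(a, dtype=float)
    # Reducing drops `axis`; expand_dims re-inserts it with length 1
    # so that mean and std broadcast against `a` for any axis.
    mean = np.expand_dims(a.mean(axis=axis), axis)
    std = np.expand_dims(a.std(axis=axis, ddof=ddof), axis)
    return (a - mean) / std

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
z_rows = zscore_sketch(x, axis=1)   # each row standardized separately
z_cols = zscore_sketch(x, axis=0)   # each column standardized separately
```

For axis=1 on a 2d array this is equivalent to the a.mean(1)[:,None] form
above, without hard-coding the position of the new axis.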
>
> (codereview is much easier to read than a diff in WordPad)
>
> for 2 arrays as in zscore_compare, you can also use _chk2_asarray
> zscore_compare should match the axis argument of zscore, I think.
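
One possible shape for the two-array version with the same axis argument,
as a sketch only. The name `zscore_compare_sketch` is hypothetical, and the
sketch does not use _chk2_asarray; it just converts both inputs with
np.asarray and relies on keepdims for broadcasting.

```python
import numpy as np

def zscore_compare_sketch(scores, compare, axis=0, ddof=0):
    """Standardize `scores` by the mean and std of `compare` (illustrative)."""
    scores = np.asarray(scores, dtype=float)
    compare = np.asarray(compare, dtype=float)
    # keepdims=True leaves the reduced axis in place with length 1,
    # so the statistics broadcast against `scores` for any axis.
    mean = compare.mean(axis=axis, keepdims=True)
    std = compare.std(axis=axis, ddof=ddof, keepdims=True)
    return (scores - mean) / std

compare = np.array([[0.0, 10.0], [2.0, 30.0]])
scores = np.array([[1.0, 20.0], [3.0, 40.0]])
z = zscore_compare_sketch(scores, compare, axis=0)
```

Passing the same array for both arguments gives the plain z-score, which is
one argument for keeping the axis and ddof arguments identical in the two
functions.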
>
> Cheers (and I'm off)
>
> Josef
>
>
>>
>> Josef
>>
>>
>>
>>>
>>>  Cheers,
>>>
>>> Ariel
>>>
>>> _______________________________________________
>>> Scipy-dev mailing list
>>> Scipy-dev@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>>
>>
>

