[SciPy-User] calculating the mean for each factor (like tapply in R)

Wes McKinney wesmckinn@gmail....
Wed Aug 1 20:32:39 CDT 2012


On Wed, Aug 1, 2012 at 9:35 AM, Oleksandr Huziy <guziy.sasha@gmail.com> wrote:
> Hi,
>
> It is pretty much the same as looping, but you could do the following
>
> In [1]: import numpy as np
>
> In [2]: x = np.array([10,13,12,3,4,6,33,44,55])
>
> In [3]: exps = np.array([1,1,1,2,2,2,3,3,3])
>
> z = [np.mean(x[exps == i]) for i in np.unique( exps )]
>
> --
> Oleksandr (Sasha) Huziy
>
>
> 2012/8/1 Andreas Hilboll <lists@hilboll.de>
>>
>> > Hi there,
>> >
>> > I've just moved from R to IPython and wondered if there was a good way of
>> > finding the means and/or variance of values in a dataframe given a factor
>> >
>> > e.g.:
>> > if df =
>> > x             experiment
>> > 10            1
>> > 13            1
>> > 12            1
>> > 3             2
>> > 4             2
>> > 6             2
>> > 33            3
>> > 44            3
>> > 55            3
>> >
>> > in tapply you would do:
>> >
>> > tapply(df$x, list(df$experiment), mean)
>> > tapply(df$x, list(df$experiment), var)
>> >
>> > I guess I can always loop through the array for each experiment type, but
>> > thought that this is the kind of functionality that would be included in a
>> > core library.
>>
>> Pandas (http://pandas.pydata.org/) seems to be what you're looking for. It
>> has a DataFrame class which allows grouping of data.
>>
>> Cheers, Andreas.
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>

For the #lazyweb, here is what this looks like in pandas:


In [24]: df
Out[24]:
    x  experiment
0  10           1
1  13           1
2  12           1
3   3           2
4   4           2
5   6           2
6  33           3
7  44           3
8  55           3

In [25]: df.groupby('experiment').x.mean()
Out[25]:
experiment
1             11.666667
2              4.333333
3             44.000000
Name: x

In [26]: df.groupby('experiment').x.var()
Out[26]:
experiment
1               2.333333
2               2.333333
3             121.000000
Name: x

or if you want to be fancy:

In [27]: df.groupby('experiment').x.agg(['mean', 'var'])
Out[27]:
                 mean         var
experiment
1           11.666667    2.333333
2            4.333333    2.333333
3           44.000000  121.000000
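
The session above starts from an existing df. To reproduce it from scratch,
something like the following should work. It is a minimal sketch that assumes
a reasonably recent pandas release and builds df from the values in the
original question; the names means, variances and summary are only
illustrative.

```python
import pandas as pd

# Build the example frame from the values given in the original question.
df = pd.DataFrame({
    "x": [10, 13, 12, 3, 4, 6, 33, 44, 55],
    "experiment": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Group rows by the factor column, then reduce the "x" column per group.
grouped = df.groupby("experiment")["x"]

means = grouped.mean()       # per-experiment mean
variances = grouped.var()    # per-experiment sample variance (ddof=1, as in R's var)

# Several aggregations at once; the result has one column per function.
summary = grouped.agg(["mean", "var"])

print(summary)
```

Note that var() here is the sample variance (dividing by n - 1), which is why
the numbers match what R's var would give for the same data.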

There are good reasons to use pandas over a DIY approach with NumPy
array operations; notably I use smart algorithms so that the runtime
scales linearly with the size of the data instead of quadratically.
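
To make the scaling point concrete: the list comprehension quoted above scans
the whole array once per distinct label, so the work grows with (number of
rows) x (number of groups). The sketch below avoids the per-group rescans by
mapping each label to an integer code and accumulating sums and counts with
np.bincount. This is only an illustration of the idea, not the algorithm
pandas uses internally; grouped_mean is a hypothetical helper name, and
because np.unique sorts the labels this version is O(n log n) rather than
strictly linear.

```python
import numpy as np

def grouped_mean(values, labels):
    """Mean of `values` for each distinct entry of `labels` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # keys: sorted distinct labels; inverse: integer group code for every row.
    keys, inverse = np.unique(labels, return_inverse=True)
    # Accumulate per-group sums and counts without rescanning per group.
    sums = np.bincount(inverse, weights=values, minlength=len(keys))
    counts = np.bincount(inverse, minlength=len(keys))
    return keys, sums / counts

x = np.array([10, 13, 12, 3, 4, 6, 33, 44, 55])
exps = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

keys, means = grouped_mean(x, exps)
print(dict(zip(keys.tolist(), means.tolist())))
```

With only three groups the difference is invisible; it starts to matter when
the number of distinct labels grows along with the number of rows.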

- Wes

