[Numpy-discussion] adding a cut function to numpy
Skipper Seabold
jsseabold@gmail....
Mon Apr 16 22:24:50 CDT 2012
On Mon, Apr 16, 2012 at 8:08 PM, Tony Yu <tsyu80@gmail.com> wrote:
>
>
> On Mon, Apr 16, 2012 at 6:01 PM, Skipper Seabold <jsseabold@gmail.com>
> wrote:
>>
>> On Mon, Apr 16, 2012 at 5:51 PM, Tony Yu <tsyu80@gmail.com> wrote:
>> >
>> >
>> > On Mon, Apr 16, 2012 at 5:27 PM, Skipper Seabold <jsseabold@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a pull request here [1] to add a cut function similar to R's
>> >> [2]. It seems there are often requests for similar functionality. It's
>> >> something I'm making use of for my own work and would like to use in
>> >> statsmodels and in generating instances of pandas' Factor class, but
>> >> is this generally something people would find useful to warrant its
>> >> inclusion in numpy? It will be even more useful I think with an enum
>> >> dtype in numpy.
>> >>
>> >> If you aren't familiar with cut, here's a potential use case: going
>> >> from a continuous to a categorical variable.
>> >>
>> >> Given a continuous variable
>> >>
>> >> [~/]
>> >> [8]: age = np.random.randint(15,70, size=100)
>> >>
>> >> [~/]
>> >> [9]: age
>> >> [9]:
>> >> array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44,
>> >> 27,
>> >> 17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42,
>> >> 69,
>> >> 50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52,
>> >> 40,
>> >> 27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22,
>> >> 25,
>> >> 36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50,
>> >> 27,
>> >> 23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])
>> >>
>> >> Give me a variable where people are in age groups (lower bound is not
>> >> inclusive)
>> >>
>> >> [~/]
>> >> [10]: groups = [14, 25, 35, 45, 55, 70]
>> >>
>> >> [~/]
>> >> [11]: age_cat = np.cut(age, groups)
>> >>
>> >> [~/]
>> >> [12]: age_cat
>> >> [12]:
>> >> array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5,
>> >> 1,
>> >> 3,
>> >> 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4,
>> >> 4,
>> >> 3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1,
>> >> 3,
>> >> 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3,
>> >> 5,
>> >> 3, 2, 3, 2, 1, 3, 2, 2])
>> >>
>> >> Skipper
>> >>
>> >> [1] https://github.com/numpy/numpy/pull/248
>> >> [2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html
>> >
>> >
>> > Is this the same as `np.searchsorted` (with reversed arguments)?
>> >
>> > In [292]: np.searchsorted(groups, age)
>> > Out[292]:
>> > array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1,
>> > 3,
>> > 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4,
>> > 4,
>> > 3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1,
>> > 3,
>> > 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3,
>> > 5,
>> > 3, 2, 3, 2, 1, 3, 2, 2])
>> >
>>
>> That's news to me, and I don't know how I missed it.
>
>
> Actually, the only reason I remember searchsorted is because I also
> implemented a variant of it before finding that it existed.
>
It's certainly not an obvious name for the behavior I wanted, at least
with my background. I.e., I want something that works on the data, not
the bins/groups. It's also not referenced in the docs for histogram or
digitize, though now that I wade back through some threads I see people
pointing to it. At a quick look, it also appears to be faster than my
implementation with digitize.
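For reference, a minimal sketch of how the two calls line up for explicit
bin edges. The ten-value sample is made up for illustration (the first
eight values are the start of the age array above), and the `right`
keyword of digitize is the one found in current NumPy releases, so treat
this as an assumption about your installed version rather than a
statement about the pull request.

```python
import numpy as np

groups = np.array([14, 25, 35, 45, 55, 70])               # bin edges from the thread
age = np.array([58, 32, 20, 25, 34, 69, 52, 27, 15, 45])  # illustrative sample

# Bins first, data second: result i means groups[i-1] < x <= groups[i].
by_searchsorted = np.searchsorted(groups, age, side="left")

# Data first, bins second: right=True closes each interval on the right,
# which makes digitize agree with searchsorted(..., side="left").
by_digitize = np.digitize(age, groups, right=True)

print(by_searchsorted)  # [5 2 1 1 2 5 4 2 1 3]
print(np.array_equal(by_searchsorted, by_digitize))  # True
```

The only difference between the two calls here is the argument order,
which is the point about the name and signature made above.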
>>
>> It looks like
>> there is overlap, but cut will also do binning for equal width
>> categorization
>>
>> [~/]
>> [21]: np.cut(age, 6)
>> [21]:
>> array([5, 2, 1, 2, 3, 6, 5, 2, 1, 1, 4, 6, 3, 5, 3, 4, 2, 1, 2, 1, 6, 2,
>> 4,
>> 1, 5, 2, 4, 5, 2, 3, 6, 2, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
>> 4, 2, 5, 5, 3, 2, 5, 4, 4, 2, 5, 1, 3, 2, 5, 1, 5, 1, 1, 2, 1, 2, 3,
>> 4, 2, 5, 3, 2, 4, 3, 1, 6, 2, 2, 1, 1, 3, 4, 2, 1, 6, 6, 6, 3, 3, 6,
>> 3, 3, 3, 3, 1, 3, 2, 2])
>>
>> and explicitly handles the case with constant x
>>
>> [~/]
>> [26]: x = np.ones(100)*6
>>
>> [~/]
>> [27]: np.cut(x, 5)
>> [27]:
>> array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3])
>>
>> I guess I could patch searchsorted. Thoughts?
>>
>> Skipper
>
>
> Hmm, ... I'm not sure if these other call signatures map as well to the name
> "searchsorted"; i.e. "cut" makes more sense in these cases.
>
> On the other hand, it seems these cases could be handled by `np.digitize`
> (although they aren't currently). Hmm,... why doesn't the above call to
> `cut` match (what I assume to be) the equivalent call to `np.digitize`:
>
> In [302]: np.digitize(age, np.linspace(age.min(), age.max(), 6))
> Out[302]:
> array([4, 2, 1, 1, 2, 6, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
> 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
> 3, 2, 4, 4, 3, 2, 4, 3, 3, 2, 4, 1, 2, 2, 4, 1, 4, 1, 1, 2, 1, 1, 2,
> 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 6, 5, 2, 3, 5,
> 3, 2, 3, 2, 1, 2, 2, 2])
>
> It's unfortunate that `digitize` and `histogram` have one call signature,
> but `searchsorted` has the reverse; in that sense, I like `cut` better.
>
I actually extended digitize to work the way I wanted, with the sole
intention of implementing cut.
https://github.com/numpy/numpy/pull/245
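A hedged sketch of why the digitize call quoted above does not reproduce
cut(age, 6): linspace(min, max, 6) yields six edges, hence only five
intervals, and digitize closes intervals on the left by default. The
ten-value sample and the variable names are made up for illustration, and
the `right` keyword is the one in current NumPy releases.

```python
import numpy as np

age = np.array([58, 32, 20, 25, 34, 69, 52, 27, 15, 51])  # illustrative sample

# Six edges give five intervals; the default is left-closed: [lo, hi).
five_bins = np.linspace(age.min(), age.max(), 6)
left_closed = np.digitize(age, five_bins)

# Six intervals need seven edges; right=True gives (lo, hi].
six_bins = np.linspace(age.min(), age.max(), 7)
right_closed = np.digitize(age, six_bins, right=True)

print(left_closed)   # [4 2 1 1 2 6 4 2 1 4]
print(right_closed)  # [5 2 1 2 3 6 5 2 0 4]
```

With seven right-closed edges the result agrees with the cut output above
except for the minimum, which falls into bin 0 because the lowest edge is
not inclusive. In the cut output the minimum age of 15 is in bin 1, so
the lowest edge evidently has to sit slightly below the data minimum.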
I agree about the call signature. As I mentioned, the way my workflow
goes, I have the data first and then think about the groups, rather than
thinking about performing an action on the groups themselves. In this way,
I still think having cut is beneficial.
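To make the data-first signature concrete, here is a rough sketch of a
cut-like helper layered on searchsorted. It is not the code from the pull
request and not a NumPy function: the name `cut`, the 0.1% padding of the
outer edges, and the treatment of constant input are assumptions chosen
so that the results agree with the outputs shown in this thread.

```python
import numpy as np

def cut(x, bins):
    """Hypothetical helper: bin x into right-closed intervals, data first.

    bins is either a sequence of increasing edges or an integer number of
    equal-width bins.
    """
    x = np.asarray(x)
    if np.ndim(bins) == 0:
        nbins = int(bins)
        lo, hi = float(x.min()), float(x.max())
        width = hi - lo
        if width == 0:
            # Constant data: build a small range around the single value
            # (assumed behavior, scaled by the value itself).
            width = abs(lo) if lo != 0 else 1.0
            edges = np.linspace(lo - width / 1000, hi + width / 1000, nbins + 1)
        else:
            edges = np.linspace(lo, hi, nbins + 1)
            # Pad only the outer edges so that min and max land in a bin.
            edges[0] -= width / 1000
            edges[-1] += width / 1000
    else:
        edges = np.asarray(bins)
    # Result i means edges[i-1] < x <= edges[i].
    return np.searchsorted(edges, x, side="left")

sample = [58, 32, 20, 25, 34, 69, 52, 27, 15, 51]  # illustrative sample
print(cut(sample, [14, 25, 35, 45, 55, 70]))  # [5 2 1 1 2 5 4 2 1 4]
print(cut(sample, 6))                         # [5 2 1 2 3 6 5 2 1 4]
print(np.unique(cut(np.ones(100) * 6, 5)))    # [3]
```

The caller passes the data first and either explicit edges or a bin
count second, which is the order argued for above.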
Skipper