[Numpy-discussion] adding a cut function to numpy

Tony Yu tsyu80@gmail....
Mon Apr 16 19:08:46 CDT 2012


On Mon, Apr 16, 2012 at 6:01 PM, Skipper Seabold <jsseabold@gmail.com> wrote:

> On Mon, Apr 16, 2012 at 5:51 PM, Tony Yu <tsyu80@gmail.com> wrote:
> >
> >
> > On Mon, Apr 16, 2012 at 5:27 PM, Skipper Seabold <jsseabold@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I have a pull request here [1] to add a cut function similar to R's
> >> [2]. It seems there are often requests for similar functionality. It's
> >> something I'm making use of for my own work and would like to use in
> >> statsmodels and in generating instances of pandas' Factor class, but
> >> is this generally something people would find useful to warrant its
> >> inclusion in numpy? It will be even more useful I think with an enum
> >> dtype in numpy.
> >>
> >> If you aren't familiar with cut, here's a potential use case: going
> >> from a continuous to a categorical variable.
> >>
> >> Given a continuous variable
> >>
> >> [~/]
> >> [8]: age = np.random.randint(15,70, size=100)
> >>
> >> [~/]
> >> [9]: age
> >> [9]:
> >> array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44, 27,
> >>        17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42, 69,
> >>        50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52, 40,
> >>        27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22, 25,
> >>        36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50, 27,
> >>        23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])
> >>
> >> Give me a variable where people are in age groups (lower bound is not
> >> inclusive)
> >>
> >> [~/]
> >> [10]: groups = [14, 25, 35, 45, 55, 70]
> >>
> >> [~/]
> >> [11]: age_cat = np.cut(age, groups)
> >>
> >> [~/]
> >> [12]: age_cat
> >> [12]:
> >> array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
> >>        1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
> >>        3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
> >>        3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
> >>        3, 2, 3, 2, 1, 3, 2, 2])
> >>
> >> Skipper
> >>
> >> [1] https://github.com/numpy/numpy/pull/248
> >> [2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html
> >
> >
> > Is this the same as `np.searchsorted` (with reversed arguments)?
> >
> > In [292]: np.searchsorted(groups, age)
> > Out[292]:
> > array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
> >        1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
> >        3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
> >        3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
> >        3, 2, 3, 2, 1, 3, 2, 2])
> >
>
> That's news to me, and I don't know how I missed it.


Actually, the only reason I remember searchsorted is because I also
implemented a variant of it before finding that it existed.
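
A minimal sketch of why the two outputs agree, using only the first eight
values of the `age` sample above. The `side` argument of `searchsorted`
decides which edge of each bin is inclusive; the default matches the "lower
bound is not inclusive" behaviour described for `cut`.

```python
import numpy as np

groups = np.array([14, 25, 35, 45, 55, 70])
age = np.array([58, 32, 20, 25, 34, 69, 52, 27])  # first eight values of the sample

# side='left' (the default) returns i with groups[i-1] < x <= groups[i]:
# each bin excludes its lower edge and includes its upper edge.
left = np.searchsorted(groups, age, side="left")
print(left)   # [5 2 1 1 2 5 4 2]

# side='right' flips the convention to groups[i-1] <= x < groups[i],
# so the value 25, which sits exactly on an edge, moves up one bin.
right = np.searchsorted(groups, age, side="right")
print(right)  # [5 2 1 2 2 5 4 2]
```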


> It looks like
> there is overlap, but cut will also do binning for equal width
> categorization
>
> [~/]
> [21]: np.cut(age, 6)
> [21]:
> array([5, 2, 1, 2, 3, 6, 5, 2, 1, 1, 4, 6, 3, 5, 3, 4, 2, 1, 2, 1, 6, 2, 4,
>        1, 5, 2, 4, 5, 2, 3, 6, 2, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
>        4, 2, 5, 5, 3, 2, 5, 4, 4, 2, 5, 1, 3, 2, 5, 1, 5, 1, 1, 2, 1, 2, 3,
>        4, 2, 5, 3, 2, 4, 3, 1, 6, 2, 2, 1, 1, 3, 4, 2, 1, 6, 6, 6, 3, 3, 6,
>        3, 3, 3, 3, 1, 3, 2, 2])
>
> and explicitly handles the case with constant x
>
> [~/]
> [26]: x = np.ones(100)*6
>
> [~/]
> [27]: np.cut(x, 5)
> [27]:
> array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>        3, 3, 3, 3, 3, 3, 3, 3])
>
> I guess I could patch searchsorted. Thoughts?
>
> Skipper
>

Hmm... I'm not sure these other call signatures map as well to the
name "searchsorted"; i.e., "cut" makes more sense in these cases.
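
To make the call patterns in this thread concrete (explicit edges, a number
of equal-width bins, and constant input), here is a rough sketch of a
`cut`-like function. The name `cut_sketch`, the clipping of the minimum into
the first bin, and the rule for widening a constant input are all assumptions
made for illustration; this is not the code from the pull request.

```python
import numpy as np

def cut_sketch(x, bins):
    """Hypothetical stand-in for the proposed np.cut (not the PR's code)."""
    x = np.asarray(x, dtype=float)
    if np.isscalar(bins):
        n = int(bins)
        lo, hi = x.min(), x.max()
        if lo == hi:
            # Constant input: widen the range slightly (assumed rule) so the
            # single value ends up in a middle bin instead of on an edge.
            pad = abs(lo) / 1000.0 if lo != 0 else 1e-3
            lo, hi = lo - pad, hi + pad
        edges = np.linspace(lo, hi, n + 1)   # n bins need n + 1 edges
        # Right-closed bins; clip so the minimum gets code 1 rather than 0.
        return np.clip(np.searchsorted(edges, x, side="left"), 1, n)
    # Explicit edges: plain searchsorted with right-closed bins.
    return np.searchsorted(np.asarray(bins), x, side="left")

print(cut_sketch(np.ones(5) * 6, 5))                       # [3 3 3 3 3]
print(cut_sketch([15, 25, 34, 58, 69], 6))                 # [1 2 3 5 6]
print(cut_sketch([58, 32, 20], [14, 25, 35, 45, 55, 70]))  # [5 2 1]
```

The scalar branch does work that has nothing to do with searching a sorted
array, which is why a separate name reads more naturally for it.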

On the other hand, it seems these cases could be handled by `np.digitize`
(although they aren't currently). Hmm... why doesn't the above call to
`cut` match (what I assume to be) the equivalent call to `np.digitize`:

In [302]: np.digitize(age, np.linspace(age.min(), age.max(), 6))
Out[302]:
array([4, 2, 1, 1, 2, 6, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
       1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
       3, 2, 4, 4, 3, 2, 4, 3, 3, 2, 4, 1, 2, 2, 4, 1, 4, 1, 1, 2, 1, 1, 2,
       3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 6, 5, 2, 3, 5,
       3, 2, 3, 2, 1, 2, 2, 2])
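
Two things probably account for the mismatch: `np.linspace(..., 6)` yields 6
edges, which delimit only 5 intervals, and `digitize` closes bins on the left
by default, so the maximum falls past the last interval. A small sketch on
seven values taken from the `age` sample (the subset is chosen here for
illustration, and how the pull request actually computes its edges has not
been checked):

```python
import numpy as np

age = np.array([15, 20, 25, 34, 44, 58, 69])  # subset including min and max

# 6 edges -> 5 intervals, left-closed by default: age.max() gets code 6.
edges = np.linspace(age.min(), age.max(), 6)
five_bins = np.digitize(age, edges)
print(five_bins)  # [1 1 1 2 3 4 6]

# 7 edges -> 6 intervals; right=True closes each bin on the right instead.
edges = np.linspace(age.min(), age.max(), 7)
six_bins = np.digitize(age, edges, right=True)
six_bins[six_bins == 0] = 1  # the minimum sits on the first edge
print(six_bins)   # [1 1 2 3 4 5 6]
```

For these seven values the second result agrees with the `np.cut(age, 6)`
codes quoted above.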

It's unfortunate that `digitize` and `histogram` have one call signature,
but `searchsorted` has the reverse; in that sense, I like `cut` better.
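
For reference, a short sketch of the argument orders side by side, on four
illustrative values. `searchsorted` takes the edges first, probably because
it is also a method called on the sorted array itself.

```python
import numpy as np

bins = np.array([14, 25, 35, 45, 55, 70])
x = np.array([58, 32, 20, 25])

# Data first, edges second.
counts, _ = np.histogram(x, bins)
codes = np.digitize(x, bins)

# Edges first, data second; side='right' matches digitize's default bins.
same_codes = np.searchsorted(bins, x, side="right")
method_form = bins.searchsorted(x, side="right")

print(counts)       # [1 2 0 0 1]
print(codes)        # [5 2 1 2]
print(same_codes)   # [5 2 1 2]
```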

Cheers
-Tony