[Numpy-discussion] adding a cut function to numpy
Skipper Seabold
jsseabold@gmail....
Mon Apr 16 16:27:27 CDT 2012
Hi,
I have a pull request here [1] to add a cut function similar to R's
[2]. There seem to be frequent requests for this kind of functionality.
It's something I'm making use of in my own work and would like to use in
statsmodels and in generating instances of pandas' Factor class, but
would people find it generally useful enough to warrant its inclusion
in numpy? I think it will be even more useful once numpy has an enum
dtype.
If you aren't familiar with cut, here's a potential use case: going
from a continuous to a categorical variable.
Given a continuous variable
[~/]
[8]: age = np.random.randint(15,70, size=100)
[~/]
[9]: age
[9]:
array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44, 27,
17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42, 69,
50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52, 40,
27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22, 25,
36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50, 27,
23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])
Give me a variable that places each person in an age group (the lower bound of each bin is not inclusive)
[~/]
[10]: groups = [14, 25, 35, 45, 55, 70]
[~/]
[11]: age_cat = np.cut(age, groups)
[~/]
[12]: age_cat
[12]:
array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
3, 2, 3, 2, 1, 3, 2, 2])
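For anyone who wants to reproduce this binning without the pull request, here is a minimal sketch. It assumes a NumPy version whose np.digitize accepts the right keyword; cut_codes is a hypothetical helper name used only for illustration and is not the function proposed in [1]. It shows the same convention as above, where each bin excludes its lower bound and includes its upper bound.

```python
import numpy as np


def cut_codes(values, bins):
    """Return 1-based bin codes for bins of the form (lower, upper].

    Hypothetical helper, not the proposed np.cut: it only wraps
    np.digitize with right=True so that bins[i-1] < x <= bins[i]
    maps to code i.
    """
    return np.digitize(values, bins, right=True)


# First few ages from the example above, with the same bin edges.
age = np.array([58, 32, 20, 25, 34, 69, 52])
groups = [14, 25, 35, 45, 55, 70]

age_cat = cut_codes(age, groups)
print(age_cat)  # [5 2 1 1 2 5 4]
```

Note that with this approach values at or below the first edge get code 0 and values above the last edge get code len(groups), so out-of-range data has to be handled separately.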
Skipper
[1] https://github.com/numpy/numpy/pull/248
[2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html