[SciPy-User] Proposal for a new data analysis toolbox

Wes McKinney wesmckinn@gmail....
Thu Nov 25 12:12:58 CST 2010

On Wed, Nov 24, 2010 at 5:39 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Wed, Nov 24, 2010 at 2:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> On Wed, Nov 24, 2010 at 12:05 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>> On Wed, Nov 24, 2010 at 4:43 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>>> I am not for placing arbitrary restrictions or having a strict
>>>> enumeration on what goes in this library. I think having a practical,
>>>> central dumping ground for data analysis tools would be beneficial. We
>>>> could decide about having "spin-off" libraries later if we think
>>>> that's appropriate.
>>> I'd like to start small (I've already bitten off more than I can chew)
>>> by delivering a well thought out (and implemented) small feature set.
>>> Functions of the form:
>>> sum(arr, axis=None)
>>> move_sum(arr, window, axis=0)
>>> group_sum(arr, label, axis)
>>> where sum can be replaced by a long (to be decided) list of functions
>>> such as std, max, median, etc.
>>> Once that is delivered and gets some use, I'm sure we'll want to push
>>> into new territory. What do you suggest for the next feature to add?
>> I have no problem if you would like to develop in this way, but I
>> don't personally work well like that. I think having a library with
>> twenty 80% solutions would be better than a library with five 100%
>> solutions. Of course over time you eventually want to build out those
>> twenty 80% solutions into 100% solutions, but I think that approach
>> is of greater utility overall.
>>> So it could be that we are talking about the same end point but are
>>> thinking about different development models. I cringe at the thought
>>> of the package becoming a dumping ground.
>> I find that the best and most useful code gets written (and gets
>> written fastest) when the person writing it has a concrete problem
>> they are trying to solve. So if someone comes along and says "I have
>> problem X", where X lives in the general problem domain we are talking
>> about, I might say, "Well I've never had problem X but I have no
>> problem with you writing code to solve it and putting it in my library
>> for this problem domain". So "dumping ground" here is a bit too
>> pejorative but you get the idea. Personally if you or someone else
>> told me "don't put that code here, we are only working on a small set
>> of features for now" I would be kind of bothered (assuming that the
>> code was related to the general problem domain).
> Let's talk about a specific value of X, either now or when it pops up.

All I'm saying is that I would be happy to actively contribute to a
library which is a one-stop shop for "practical data analysis tools"
focusing on NumPy arrays. This could include:

- NaN-aware statistics
- Moving window functions
- Group-by functions
- Data alignment routines
- Data manipulations for categorical data
- Record array utilities (a la matplotlib.mlab etc.)
- Miscellaneous exploratory data analysis code (console-based pretty
  summary statistics and matplotlib-based plotting stuff)
- Date-time tools

and topics that haven't even occurred to me. In my PhD program I'm
encountering types of data that R handles really well and Python does
not handle at all. Because I often have deadlines, I have to use R; I
can't spend a week writing the necessary Python code, though at some
point I would like to!
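
The first three items in the list above (NaN-aware statistics, moving
window functions, group-by functions) line up with the sum / move_sum /
group_sum signatures quoted earlier in this thread. As a rough
illustration only, here is a minimal 1-D sketch using NumPy. The names
nan_sum, move_sum_1d and group_sum_1d are hypothetical, chosen for this
example, and are not the API of any existing package; treating NaN as
zero inside a window or group is likewise an assumption.

```python
import numpy as np


def nan_sum(arr, axis=None):
    """Sum that treats NaN as missing (thin wrapper around np.nansum)."""
    return np.nansum(np.asarray(arr, dtype=float), axis=axis)


def move_sum_1d(arr, window):
    """Moving-window sum of a 1-D array, counting NaN as zero.

    The first window - 1 outputs are NaN because the window is not full yet.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    arr = np.asarray(arr, dtype=float)
    filled = np.where(np.isnan(arr), 0.0, arr)  # assumption: NaN contributes 0
    csum = np.cumsum(filled)
    out = np.full(arr.shape, np.nan)
    if window <= arr.size:
        # Window sum = cumulative sum now minus cumulative sum `window` steps back.
        out[window - 1:] = csum[window - 1:]
        out[window:] -= csum[:-window]
    return out


def group_sum_1d(arr, label):
    """Sum the values of a 1-D array per label, counting NaN as zero.

    Returns (sorted unique labels, per-label sums).
    """
    arr = np.asarray(arr, dtype=float)
    filled = np.where(np.isnan(arr), 0.0, arr)
    keys, inverse = np.unique(np.asarray(label), return_inverse=True)
    sums = np.bincount(inverse, weights=filled, minlength=keys.size)
    return keys, sums


data = [1.0, float("nan"), 3.0, 4.0]
print(nan_sum(data))
print(move_sum_1d(data, 2))
print(group_sum_1d(data, ["x", "y", "x", "y"]))
```

One consequence of the NaN-as-zero choice is that a window or group made
up entirely of NaN sums to 0 rather than NaN; a real implementation would
have to decide which behavior users expect.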

Anyway, my point is: "Fast, NaN-aware descriptive statistics of NumPy
arrays" is too narrowly focused. Can we please, please call the
library something more general and welcome any and all code
contributions within the "practical data analysis" problem domain? I
don't think there is any harm in this, and I will happily take an
active role in preventing the library from becoming a mess.

- Wes
