[Numpy-discussion] [ANN] Nanny, faster NaN functions

Wes McKinney wesmckinn@gmail....
Sun Nov 21 12:25:40 CST 2010


On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>
>> Keith (and others),
>>
>> What would you think about creating a library of mostly Cython-based
>> "domain specific functions"? So stuff like rolling statistical
>> moments, nan* functions like you have here, and all that-- NumPy-array
>> only functions that don't necessarily belong in NumPy or SciPy (but
>> could be included on down the road). You were already talking about
>> this on the statsmodels mailing list for larry. I spent a lot of time
>> writing a bunch of these for pandas over the last couple of years, and
>> I would have relatively few qualms about moving these outside of
>> pandas and introducing a dependency. You could do the same for larry--
>> then we'd all be relying on the same well-vetted and tested codebase.
>
> I've started working on Cython functions for moving-window
> statistics. I plan to make them into a package called Roly (for
> rolling). The signatures are: mov_sum(arr, window, axis=-1) and
> mov_nansum(arr, window, axis=-1), etc.
>
> I think of Nanny and Roly as two separate packages. A narrow focus is
> good for a new package. But maybe each package could be a subpackage
> in a super package?
>
> Would the function signatures in Nanny (exact duplicates of the
> corresponding functions in Numpy and Scipy) work for pandas? I plan to
> use Nanny in larry. I'll try to get the structure of the Nanny package
> in place. But if it doesn't attract any interest after that then I may
> fold it into larry.
>
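
To make the quoted signatures concrete, here is a minimal pure-NumPy sketch of what functions named mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1) might compute. It is not Roly's implementation (the quoted message describes Cython functions). The helper _moving, the NaN padding of the first window - 1 positions, and the use of sliding_window_view (NumPy 1.20 or later) are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _moving(func, arr, window, axis):
    """Apply the reduction `func` over a trailing window along `axis`.

    Hypothetical helper: positions with fewer than `window` observations
    behind them are filled with NaN (an assumed convention).
    """
    arr = np.asarray(arr, dtype=float)
    n = arr.shape[axis]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and the length of the axis")
    # Put the target axis last so the indexing below is uniform.
    a = np.moveaxis(arr, axis, -1)
    # Shape (..., n - window + 1, window): one row per complete window.
    windows = sliding_window_view(a, window, axis=-1)
    out = np.full(a.shape, np.nan)
    out[..., window - 1:] = func(windows, axis=-1)
    return np.moveaxis(out, -1, axis)


def mov_sum(arr, window, axis=-1):
    """Moving sum; a NaN inside a window makes that window's result NaN."""
    return _moving(np.sum, arr, window, axis)


def mov_nansum(arr, window, axis=-1):
    """Moving sum that treats NaN inside a window as zero."""
    return _moving(np.nansum, arr, window, axis)
```

With arr = [1, nan, 3, 4] and window=2, mov_sum returns NaN for every window that touches the missing value, while mov_nansum returns 1, 3, 7 after the initial NaN padding. Materialising every window is simple but not fast, which is presumably why a compiled version is attractive.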

Why make multiple packages? All of these functions seem related: they
are practical tools for real-world data analysis, where observations
are often missing. I suspect that having everything under one roof
would attract more interest than splitting things up, and it would be
very useful to folks in many different disciplines (finance,
economics, statistics, etc.). In R, for example, NA handling is just a
part of everyday life. Of course, R has a special NA value that is
distinct from NaN, and many folks object to using NaN for missing
values. The alternative is masked arrays, but in my case I wasn't
willing to sacrifice so much performance for purity's sake.
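
The two representations contrasted above can be put side by side in a short sketch. It uses present-day NumPy calls (np.nansum, np.nanmean, np.ma.masked_invalid); the array obs is made-up illustration data, and the snippet says nothing about relative performance.

```python
import numpy as np

# Hypothetical observations with one missing value.
obs = np.array([1.0, np.nan, 3.0, 4.0])

# NaN as the missing-value marker: a plain ndarray plus NaN-aware reductions.
nan_total = np.nansum(obs)
nan_mean = np.nanmean(obs)

# Masked array: the data carries an explicit boolean mask instead.
masked = np.ma.masked_invalid(obs)
ma_total = masked.sum()
ma_mean = masked.mean()

print(nan_total, ma_total)  # both reductions skip the missing entry
print(masked.mask)          # True marks the masked (missing) position
```

Both approaches skip the missing observation. The difference is where the "missing" information lives: in the float values themselves, or in a separate mask that travels with the array.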

I could certainly use the nan* functions to replace code in pandas
where I've handled things in a somewhat ad hoc way.

