[Numpy-discussion] Tools / data structures for statistical analysis and related applications
Fri Jun 11 08:46:44 CDT 2010
On 06/09/2010 03:40 PM, Wes McKinney wrote:
> Dear all,
> We've been having discussions on the pystatsmodels mailing list
> recently regarding data structures and other tools for statistics /
> other related data analysis applications. I believe we're trying to
> answer a number of different, but related questions:
> 1. What are the sets of functionality (and use cases) which would be
> desirable for the scientific (or statistical) Python programmer?
> Things like groupby fall into this category.
> 2. Do we really need to build custom data structures (larry, pandas,
> tabular, etc.) or are structured ndarrays enough? (My conclusion is
> that we do need to, but others might disagree). If so, how much
> performance are we willing to trade for functionality?
> 3. What needs to happen for Python / NumPy / SciPy to really "break
> in" to the statistical computing field? In other words, could a
> Python-based stack one day be a competitive alternative to R?
> These are just some ideas for collecting community input. Of course as
> we're all working in different problem domains, the needs of users
> will vary quite a bit across the board. We've started to collect some
> thoughts, links, etc. on the scipy.org wiki:
> A lot of what's there already is commentary and comparison on the
> functionality provided by pandas and la / larry (since Keith and I
> wrote most of the stuff there). But I think we're trying to identify
> more generally the things that are lacking in NumPy/SciPy and related
> libraries for particular applications. At minimum it should be good
> fodder for the SciPy conferences this year and afterward (I am
> submitting a paper on this subject based on my experiences).
> - Wes
If you need pure data storage, then all you require is a time-series,
masked, structured ndarray. That will handle times/dates, missing values
and named variables. This is probably the basis of all statistical
packages, databases and spreadsheets. But the real problem is the
BLAS/LAPACK usage, which prevents using anything but a standard ndarray.
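
As a rough illustration of the kind of storage described above, here is a
minimal sketch of a masked, structured ndarray with a date field. The field
names and values are hypothetical, and the datetime64 dtype assumes a NumPy
version that provides it:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical named variables: a date plus two measurements.
dtype = [("date", "datetime64[D]"), ("price", "f8"), ("volume", "f8")]

data = np.zeros(3, dtype=dtype)
data["date"] = np.array(["2010-06-09", "2010-06-10", "2010-06-11"],
                        dtype="datetime64[D]")
data["price"] = [10.0, 0.0, 11.0]    # the middle price is unknown
data["volume"] = [100.0, 120.0, 90.0]

# Build a per-field boolean mask and flag the unknown price as missing.
mask = np.zeros(3, dtype=ma.make_mask_descr(data.dtype))
mask["price"][1] = True

records = ma.masked_array(data, mask=mask)

# Named access returns a masked column; reductions skip the masked entry.
mean_price = records["price"].mean()
n_prices = records["price"].count()
```

Here the placeholder value stored under the mask never enters the mean, so
the missing observation does not need a sentinel such as -999.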
The issue that I have with all these packages like tabular, la and
pandas that extend ndarrays is the 'upstream'/'downstream' problem of
open source development. The real problem with these extensions of numpy
is that while you can have whatever storage you like, you either need to
write your own functions or preprocess the storage into an acceptable
form. So you have to rely on those extensions being updated along with
numpy/scipy, since a 'fix' upstream can cause havoc downstream. I
subscribe to what others have said elsewhere in the open source community:
it is very important to get your desired features upstream into
the original project source, preferably numpy, but scipy also counts.
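
To make the "preprocess the storage into an acceptable form" step concrete,
here is a small sketch, assuming a structured array with hypothetical fields
x1, x2 and y. The named columns are copied into a plain 2-D float array
before being handed to a LAPACK-backed routine (np.linalg.lstsq):

```python
import numpy as np

# Hypothetical records with named fields; y was constructed as x1 + 2*x2.
records = np.array(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 4.0), (3.0, 4.0, 11.0), (4.0, 3.0, 10.0)],
    dtype=[("x1", "f8"), ("x2", "f8"), ("y", "f8")],
)

# Preprocess: pull the named columns out into homogeneous float arrays,
# which is the form the linear algebra routines expect.
X = np.column_stack([records["x1"], records["x2"]])
y = np.ascontiguousarray(records["y"])

# Least-squares fit on the plain arrays.
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```

The column names are lost once X is built, so any labels have to be carried
alongside the result by hand, which is the bookkeeping the extension
packages try to take over.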