[Numpy-discussion] Tools / data structures for statistical analysis and related applications
Fri Jun 11 12:57:37 CDT 2010
On 06/11/2010 10:26 AM, Wes McKinney wrote:
> On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey<email@example.com> wrote:
>> On 06/09/2010 03:40 PM, Wes McKinney wrote:
>>> Dear all,
>>> We've been having discussions on the pystatsmodels mailing list
>>> recently regarding data structures and other tools for statistics /
>>> other related data analysis applications. I believe we're trying to
>>> answer a number of different, but related questions:
>>> 1. What are the sets of functionality (and use cases) which would be
>>> desirable for the scientific (or statistical) Python programmer?
>>> Things like groupby fall into this category.
>>> 2. Do we really need to build custom data structures (larry, pandas,
>>> tabular, etc.) or are structured ndarrays enough? (My conclusion is
>>> that we do need to, but others might disagree). If so, how much
>>> performance are we willing to trade for functionality?
>>> 3. What needs to happen for Python / NumPy / SciPy to really "break
>>> in" to the statistical computing field? In other words, could a
>>> Python-based stack one day be a competitive alternative to R?
>>> These are just some ideas for collecting community input. Of course as
>>> we're all working in different problem domains, the needs of users
>>> will vary quite a bit across the board. We've started to collect some
>>> thoughts, links, etc. on the scipy.org wiki:
>>> A lot of what's there already is commentary and comparison on the
>>> functionality provided by pandas and la / larry (since Keith and I
>>> wrote most of the stuff there). But I think we're trying to identify
>>> more generally the things that are lacking in NumPy/SciPy and related
>>> libraries for particular applications. At minimum it should be good
>>> fodder for the SciPy conferences this year and afterward (I am
>>> submitting a paper on this subject based on my experiences).
>>> - Wes
>> If you need pure data storage then all you require is a timeseries,
>> masked structured ndarray. That will handle time/dates, missing values
>> and named variables. This is probably the basis of all statistical
>> packages, databases and spreadsheets. But the real problem is the
>> blas/lapack usage that prevents anything but a standard ndarray.
> For storing data sets I can agree that a structured / masked ndarray
> is sufficient. But I think a lot of people are primarily concerned
> about data manipulations in memory (which can be currently quite
Well that is not storage :-)
Data manipulations are too case dependent and full of compromises between
flexibility, memory usage and cpu time. For example, do I create a
design matrix X so I can compute np.dot(X.T, X) or directly form the
product as I read the data? The former is a memory hog because I have a
potentially huge X array as well as the smaller product array - this
holds for any solving approach that works on X. Not to mention that
np.dot(X.T, X) is symmetric, which is a further saving, especially if you
can use the symmetric functions of blas/lapack.
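
To make the trade-off concrete, here is a minimal sketch of forming the
product directly while the data is read in blocks of rows. The helper name
crossprod_in_chunks and the random test data are made up for illustration;
a real reader would yield blocks from a file instead.

```python
import numpy as np


def crossprod_in_chunks(chunks):
    """Accumulate np.dot(X.T, X) from row blocks without holding all of X.

    `chunks` is any iterable of 2-D blocks with the same number of columns.
    Returns None if the iterable is empty.
    """
    xtx = None
    for block in chunks:
        block = np.asarray(block, dtype=float)
        if xtx is None:
            # Allocate the small (p x p) product array on the first block.
            xtx = np.zeros((block.shape[1], block.shape[1]))
        # Each block contributes its own cross-product to the running sum.
        xtx += np.dot(block.T, block)
    return xtx


# Illustrative data only: 1000 rows, 4 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

full = np.dot(X.T, X)  # needs all of X in memory
chunked = crossprod_in_chunks(X[i:i + 100] for i in range(0, 1000, 100))
```

Only the 4 x 4 accumulator and one block are alive at a time. The sketch
still stores both triangles of the symmetric result, so it does not capture
the extra saving from the symmetric blas/lapack routines.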
> If you are referring to scikits.timeseries-- it expects data
> to be fixed frequency which is too rigid an assumption for many
> applications (like mine).
I am referring to any container that holds a date/time variable such as
the datetime module.
>> The issue that I have with all these packages like tabular, la and
>> pandas that extend ndarrays is the 'upstream'/'downstream' problem of
>> open source development. The real problem with these extensions of numpy
>> is that while you can have whatever storage you like, you either need to
>> write your own functions or preprocess the storage into an acceptable
>> form. So you have to rely on those extensions being updated with
>> numpy/scipy since a 'fix' upstream can cause havoc downstream. I
> In theory this could be a problem but of all packages to depend on in
> the Python ecosystem, NumPy seems pretty safe. How many API breakages
> have there been in ndarray in the last few years? Inherently this is a
> risk of participating in open-source. After more than 2 years of
> running a NumPy-SciPy based stack in production applications I feel
> pretty comfortable. And besides, we write unit tests for a reason,
>> subscribe to what others have said elsewhere in the open source community
>> in that it is very important to get your desired features upstream to
>> the original project source - preferably numpy but scipy also counts.
> From my experience developing pandas it's not clear to me what I've
> done that _should_ make its way "upstream" into NumPy and / or SciPy.
> You could imagine some form of high-level statistical data structure
> making its way into scipy.stats but I'm not sure.
As I indicated above, you have to rewrite the functions to use some new
data structure and I think that would be a negative-sum game.
> If NumPy could
> incorporate something like R's NA value without substantially
> degrading performance then that would be a boon to the issue of
> handling missing data (which MaskedArray does do for us-- but at
> non-trivial performance loss).
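
For readers comparing the two approaches mentioned here, a small sketch of
the same mean computed with a masked array and with NaN as the missing-value
marker. The sample values are made up, and np.nanmean is the function found
in current NumPy releases, which may not match what was available when this
thread was written. Nothing here measures the performance difference.

```python
import numpy as np

# Illustrative data with one missing observation encoded as NaN.
data = np.array([1.0, 2.0, np.nan, 4.0])

# A plain mean propagates the NaN.
plain_mean = data.mean()

# MaskedArray route: mask the invalid entries, then reduce.
masked = np.ma.masked_invalid(data)
masked_mean = masked.mean()

# NaN-aware route: skip NaN entries during the reduction.
nan_mean = np.nanmean(data)
```

Both routes ignore the missing entry and average the three remaining values,
while the plain mean comes back as NaN.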
Numpy is not orientated to the same goals as S (or SAS or any other
stats application) so it is not a valid comparison to make. For example,
S was designed from the start to "support serious data analysis"
and "[f]rom the beginning, S was designed to provide a complete
environment for data analysis".
There is also the issue of how S/R handles missing values as well.
> Data alignment routines, groupby (which
> is already on the table), and NaN / missing data sensitive moving
> window functions (mean, median, std, etc.) would be nice general
> additions as well. Any other ideas?
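
As a rough illustration of what a NaN-sensitive moving window function could
look like, here is a sketch of a moving mean written with plain NumPy. The
name rolling_nanmean and its behaviour (one output per full window, NaN when
a window holds no valid values) are assumptions made for this example, not an
existing NumPy or pandas API.

```python
import numpy as np


def rolling_nanmean(values, window):
    """Moving mean over full windows, ignoring NaN entries.

    Returns an array of length len(values) - window + 1. A window that
    contains only NaN yields NaN.
    """
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    filled = np.where(valid, values, 0.0)

    # Cumulative sums with a leading zero, so that each window total is
    # the difference of two entries.
    csum = np.concatenate(([0.0], np.cumsum(filled)))
    ccount = np.concatenate(([0], np.cumsum(valid)))

    sums = csum[window:] - csum[:-window]
    counts = ccount[window:] - ccount[:-window]

    # Divide only where at least one valid value is present.
    out = np.full(sums.shape, np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


result = rolling_nanmean([1.0, np.nan, 3.0, 4.0], window=2)
```

The cumulative-sum trick keeps the cost linear in the length of the input,
at the price of some floating-point drift on very long series.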
At present I am waiting to see what happens with pystatsmodels, as Python
stats analysis is not as high on my list as other Python things.