[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Charles R Harris
Thu Jun 23 19:17:28 CDT 2011
On Thu, Jun 23, 2011 at 5:51 PM, <firstname.lastname@example.org> wrote:
> On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe <email@example.com> wrote:
> > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
> >> I'd like to see a statement of what the "missing data problem" is, and
> >> how this solves it? Because I don't think this is entirely intuitive,
> >> or that everyone necessarily has the same idea.
> > I agree it represents different problems in different contexts. For
> > NumPy, I think the mechanism for dealing with it needs to be intuitive
> > to work in a maximum number of contexts, avoiding surprises. Getting
> > feedback from a broad range of people is the only way a general
> > solution can be designed with any level of confidence.
> >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
> >> > as if the values weren't there
> >> For context: My experience with missing data is in statistical
> >> analysis; I find R's NA support to be pretty awesome for those
> >> purposes. The conceptual model it's based on is that an NA value is
> >> some number that we just happen not to know. So from this perspective,
> >> I find it pretty confusing that adding an unknown quantity to 3 should
> >> result in 3, rather than another unknown quantity. (Obviously it
> >> should be possible to compute the sum of the known values, but IME
> >> it's important for the default behavior to be to fail loudly when
> >> things are wonky, not to silently patch them up, possibly
> >> incorrectly!)
> > The conceptual model you describe sounds reasonable to me, and I
> > like the idea of consistently following one such model for all default
> > behaviors.
> >> Also, what should 'dot' do with missing values?
> > A matrix multiplication is defined in terms of sums of products, so it
> > could be implemented to behave consistently with your conceptual model.
> From the perspective of statistical analysis, I don't see much
> advantage of this.
> What to do with nans depends on the analysis, and needs to be looked
> at for each case.
> Only easy descriptive statistics work without problems (nansum, ....).
> All the other usages require rewriting the algorithm, see scipy.stats
> versus scipy.mstats. In R, the nan handling is often to remove all
> observations (rows) with at least one nan, or to use some fancier
> algorithm for imputation of missing values.
> What happens if I just want to go back and forth between using Lapack
> and minpack? Neither of them will suddenly grow missing-value handling,
> and if they did, it might not be what we want.
> Arrays with nans are nice for data handling, but I don't see why we
> should pay any overhead for number crunching with numpy arrays.
I didn't get the impression that there would be noticeable overhead. On the
other points, I think the idea should be to provide a low-level mechanism
that is flexible enough to allow implementation of various use cases at a
higher level. For instance, current masked arrays could be reimplemented if
desired, etc. Not that I think that should be done...
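(The two conceptual models under discussion — R-style propagation, where any unknown input makes the result unknown, versus skipping missing values in reductions — can be contrasted today with plain nans; this is a minimal sketch in which `np.nansum` stands in for the proposed skip-missing default:)

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# Propagating model (like R's NA): an unknown term makes the sum unknown.
print(np.sum(x))     # -> nan

# Skipping model (reductions "as if the values weren't there").
print(np.nansum(x))  # -> 4.0
```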