[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe@gmail....
Thu Jun 23 18:57:44 CDT 2011


On Thu, Jun 23, 2011 at 6:51 PM, <josef.pktd@gmail.com> wrote:

> On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> > On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith <njs@pobox.com> wrote:
> >>
> >> I'd like to see a statement of what the "missing data problem" is, and
> >> how this solves it? Because I don't think this is entirely intuitive,
> >> or that everyone necessarily has the same idea.
> >
> > I agree it represents different problems in different contexts. For NumPy, I
> > think the mechanism for dealing with it needs to be intuitive to work with
> > in a maximum number of contexts, avoiding surprises. Getting feedback from a
> > broad range of people is the only way a general solution can be designed
> > with any level of confidence.
> >>
> >> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
> >> > as if the values weren't there
> >>
> >> For context: My experience with missing data is in statistical
> >> analysis; I find R's NA support to be pretty awesome for those
> >> purposes. The conceptual model it's based on is that an NA value is
> >> some number that we just happen not to know. So from this perspective,
> >> I find it pretty confusing that adding an unknown quantity to 3 should
> >> result in 3, rather than another unknown quantity. (Obviously it
> >> should be possible to compute the sum of the known values, but IME
> >> it's important for the default behavior to be to fail loudly when
> >> things are wonky, not to silently patch them up, possibly
> >> incorrectly!)
> >
> > The conceptual model you describe sounds reasonable to me, and I definitely
> > like the idea of consistently following one such model for all default
> > behaviors.
> >
> >>
> >> Also, what should 'dot' do with missing values?
> >
> > A matrix multiplication is defined in terms of sums of products, so it can
> > be implemented to behave consistently with your conceptual model.
>
> From the perspective of statistical analysis, I don't see much
> advantage of this.
> What to do with nans depends on the analysis, and needs to be looked
> at for each case.
>

Sure, the alternatives to implementing 'dot' with missing values are to
raise an exception or produce all missing values.
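
To make the three options concrete, here is a rough sketch in present-day NumPy. It is not the NEP's design: the function name `dot_with_missing`, the `policy` argument, and the convention that a separate boolean array marks missing entries with True are all assumptions made for illustration. The "propagate" policy follows the NA-as-unknown model discussed above, where an output element is missing whenever any term in its sum of products is missing.

```python
import numpy as np

def dot_with_missing(a, a_missing, b, b_missing, policy="propagate"):
    """Hypothetical helper: 2-D matrix product with boolean 'missing' masks.

    Masks use True = missing. Returns (values, out_missing); entries of
    `values` where `out_missing` is True are placeholders only.
    """
    if policy not in ("propagate", "raise", "all_missing"):
        raise ValueError("unknown policy: %r" % (policy,))

    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_missing = np.asarray(a_missing, dtype=bool)
    b_missing = np.asarray(b_missing, dtype=bool)
    out_shape = (a.shape[0], b.shape[1])
    any_missing = bool(a_missing.any() or b_missing.any())

    # Alternative 1: refuse to compute when anything is missing.
    if policy == "raise" and any_missing:
        raise ValueError("dot: operand contains missing values")

    # Zero out missing entries so stale data cannot leak into known outputs.
    values = np.dot(np.where(a_missing, 0.0, a), np.where(b_missing, 0.0, b))

    if policy == "all_missing":
        # Alternative 2: one missing input makes the whole result missing.
        out_missing = np.full(out_shape, any_missing, dtype=bool)
    else:
        # Propagate: out[i, j] sums over row i of a and column j of b,
        # so it is unknown if either of those contains a missing value.
        out_missing = a_missing.any(axis=1)[:, None] | b_missing.any(axis=0)[None, :]
    return values, out_missing


a = [[1.0, 2.0], [3.0, 4.0]]
a_missing = [[False, True], [False, False]]
b = np.eye(2)
b_missing = np.zeros((2, 2), dtype=bool)

values, out_missing = dot_with_missing(a, a_missing, b, b_missing)
```

With these inputs only the first row of the result is flagged as missing under "propagate", while the second row is computed normally.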


> Only easy descriptive statistics work without problems, nansum, ....
>
> All the other usages require rewriting the algorithm, see scipy.stats
> versus scipy.mstats. In R often the nan handling is remove all
> observations (rows) with at least one nan, or we go to some fancier
> imputation of missing values algorithms.
>
> What happens if I just want to go back and forth between using Lapack
> and minpack? Neither of them will suddenly grow missing-value handling,
> and if they did it might not be what we want.
>
> arrays with nans are nice for data handling, but I don't see why we
> should pay for any overhead for number crunching with numpy arrays.
>

NaNs are certainly not going away; they're completely independent of masked
values.
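
As an illustration of that independence, here is a small sketch using the existing numpy.ma module. It stands in for the idea only and says nothing about how the proposed core-ndarray masks would behave. A NaN is an ordinary floating-point value stored in the data, while a mask is a separate flag per element, so an array can carry both at once.

```python
import numpy as np

# NaN lives in the data itself and propagates through arithmetic.
with_nan = np.array([1.0, np.nan, 3.0])
nan_total = with_nan.sum()           # NaN
nan_skipped = np.nansum(with_nan)    # skipping NaN is an explicit opt-in

# A mask is stored alongside the data; the masked slot keeps its value (2.0)
# but numpy.ma reductions leave it out.
masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
masked_total = masked.sum()

# Both at once: the NaN is unmasked, so it still takes part in the sum,
# while the masked 3.0 does not.
both = np.ma.masked_array([1.0, np.nan, 3.0], mask=[False, False, True])
both_total = both.sum()
```

Here `masked_total` and `nan_skipped` both come out as 4.0, while `nan_total` and `both_total` are NaN, since masking one element does nothing about a NaN in another.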

-Mark


>
> Josef
>
>
> >
> >>
> >> -- Nathaniel
> >>
> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> >> > Enthought has asked me to look into the "missing data" problem and how
> >> > NumPy could treat it better. I've considered the different ideas of
> >> > adding dtype variants with a special signal value and masked arrays,
> >> > and concluded that adding masks to the core ndarray appears to be the
> >> > best way to deal with the problem in general.
> >> > I've written a NEP that proposes a particular design, viewable here:
> >> >
> >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
> >> > There are some questions at the bottom of the NEP which definitely need
> >> > discussion to find the best design choices. Please read, and let me know
> >> > of all the errors and gaps you find in the document.
> >> > Thanks,
> >> > Mark
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > NumPy-Discussion@scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >> >
> >
> >
> >
> >
>