[Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Thu Jun 23 18:51:25 CDT 2011
On Thu, Jun 23, 2011 at 5:37 PM, Mark Wiebe <email@example.com> wrote:
> On Thu, Jun 23, 2011 at 4:19 PM, Nathaniel Smith <firstname.lastname@example.org> wrote:
>> I'd like to see a statement of what the "missing data problem" is, and
>> how this solves it? Because I don't think this is entirely intuitive,
>> or that everyone necessarily has the same idea.
> I agree it represents different problems in different contexts. For NumPy, I
> think the mechanism for dealing with it needs to be intuitive to work with
> in a maximum number of contexts, avoiding surprises. Getting feedback from a
> broad range of people is the only way a general solution can be designed
> with any level of confidence.
>> > Reduction operations like 'sum', 'prod', 'min', and 'max' will operate
>> > as if the values weren't there
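For illustration: today's numpy.ma masked arrays already give reductions this "skip the masked values" behavior. A minimal sketch using the existing numpy.ma module (not the proposed core-ndarray API):

```python
import numpy as np

# A masked array: the mask marks values to treat as "not there".
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0],
                       mask=[False, True, False, False])

# Reductions skip masked elements entirely.
print(a.sum())   # 8.0  (1 + 3 + 4)
print(a.min())   # 1.0
print(a.prod())  # 12.0 (1 * 3 * 4)
```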
>> For context: My experience with missing data is in statistical
>> analysis; I find R's NA support to be pretty awesome for those
>> purposes. The conceptual model it's based on is that an NA value is
>> some number that we just happen not to know. So from this perspective,
>> I find it pretty confusing that adding an unknown quantity to 3 should
>> result in 3, rather than another unknown quantity. (Obviously it
>> should be possible to compute the sum of the known values, but IME
>> it's important for the default behavior to be to fail loudly when
>> things are wonky, not to silently patch them up, possibly
>> incorrectly.)
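The contrast between the two semantics can be seen with NaN in NumPy today: plain reductions propagate the unknown value (as R's NA model would), while the nan-aware variants skip it on explicit request. A small sketch of that distinction:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])

# Propagating semantics (R's NA model): an unknown value
# poisons the result, failing loudly rather than silently.
print(np.sum(x))     # nan

# Skipping semantics: sum only the known values, but only
# when the caller asks for it explicitly.
print(np.nansum(x))  # 3.0
```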
> The conceptual model you describe sounds reasonable to me, and I definitely
> like the idea of consistently following one such model for all default
> behaviors.
>> Also, what should 'dot' do with missing values?
> A matrix multiplication is defined in terms of sums of products, so it can
> be implemented to behave consistently with your conceptual model.
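Since matrix multiplication is sums of products, propagating semantics mean that one unknown entry makes every output element that depends on it unknown. A sketch of how this plays out with NaN today (an illustration of the conceptual model, not the proposed masked implementation):

```python
import numpy as np

a = np.array([[1.0, np.nan],
              [3.0, 4.0]])
b = np.eye(2)

# Under propagating semantics, any product involving an unknown
# value makes the corresponding sum unknown: row 0 of the result
# is all nan, while row 1 is fully known.
print(np.dot(a, b))
```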
From the perspective of statistical analysis, I don't see much
advantage in this.
What to do with nans depends on the analysis, and needs to be looked
at for each case.
Only easy descriptive statistics work without problems (nansum, ...).
All the other usages require rewriting the algorithm; see scipy.stats
versus scipy.mstats. In R, nan handling often means removing all
observations (rows) with at least one nan, or moving to some fancier
missing-value imputation algorithm.
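The row-removal strategy (R's "complete cases") is a one-liner on a plain NumPy array; a minimal sketch:

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, 5.0]])

# Drop every observation (row) containing at least one nan,
# mirroring R's complete-cases / listwise deletion.
complete = data[~np.isnan(data).any(axis=1)]
print(complete)  # [[1. 2.], [4. 5.]]
```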
What happens if I just want to go back and forth between using Lapack
and minpack? Neither of them will suddenly grow missing-value handling,
and if they did, it might not be what we want.
Arrays with nans are nice for data handling, but I don't see why we
should pay any overhead for number crunching with numpy arrays.
>> -- Nathaniel
>> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <email@example.com> wrote:
>> > Enthought has asked me to look into the "missing data" problem and
>> > how NumPy could treat it better. I've considered the different ideas
>> > of adding dtype variants with a special signal value and masked
>> > arrays, and concluded that adding masks to the core ndarray appears
>> > to be the best way to deal with the problem in general.
>> > I've written a NEP that proposes a particular design, viewable here:
>> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
>> > There are some questions at the bottom of the NEP which definitely
>> > need discussion to find the best design choices. Please read, and
>> > let me know of all the errors and gaps you find in the document.
>> > Thanks,
>> > Mark
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion