[Numpy-discussion] missing data discussion round 2
Mark Wiebe
mwwiebe@gmail....
Mon Jun 27 16:01:17 CDT 2011
On Mon, Jun 27, 2011 at 2:59 PM, <josef.pktd@gmail.com> wrote:
> On Mon, Jun 27, 2011 at 2:24 PM, eat <e.antero.tammi@gmail.com> wrote:
> >
> >
> > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> >>
> >> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.tammi@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwiebe@gmail.com> wrote:
> >>>>
> >>>> First I'd like to thank everyone for all the feedback you're providing,
> >>>> clearly this is an important topic to many people, and the discussion
> >>>> has helped clarify the ideas for me. I've renamed and updated the NEP,
> >>>> then placed it into the master NumPy repository so it has a more
> >>>> permanent home here:
> >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
> >>>> In the NEP, I've tried to address everything that was raised in the
> >>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal
> >>>> with the issue of whether a mask is True or False for a missing value,
> >>>> I've removed the 'mask' attribute entirely, except for ufunc-like
> >>>> functions np.ismissing and np.isavail which return the two styles of
> >>>> masks. Here's a high level summary of how I'm thinking of the topic,
> >>>> and what I will implement:
> >>>> Missing Data Abstraction
> >>>> There appear to be two useful ways to think about missing data that are
> >>>> worth supporting.
> >>>> 1) Unknown yet existing data
> >>>> 2) Data that doesn't exist
> >>>> In 1), an NA value causes outputs to become NA except in a small number
> >>>> of exceptions such as boolean logic, and in 2), operations treat the
> >>>> data as if there were a smaller array without the NA values.
> >>>> Temporarily Ignoring Data
> >>>> In some cases, it is useful to flag data as NA temporarily, possibly in
> >>>> several different ways, for particular calculations or testing out
> >>>> different ways of throwing away outliers. This is independent of the
> >>>> missing data abstraction, still requiring a choice of 1) or 2) above.
> >>>> Implementation Techniques
> >>>> There are two mechanisms generally used to implement missing data
> >>>> abstractions,
> >>>> 1) An NA bit pattern
> >>>> 2) A mask
> >>>> I've described a design in the NEP which can include both techniques
> >>>> using the same interface. The mask approach is strictly more general
> >>>> than the NA bit pattern approach, except for a few things like the idea
> >>>> of supporting the dtype 'NA[f8,InfNan]' which you can read about in the
> >>>> NEP. My intention is to implement the mask-based design, and possibly
> >>>> also implement the NA bit pattern design, but if anything gets cut it
> >>>> will be the NA bit patterns.
> >>>> Thanks again for all your input so far, and thanks in advance for your
> >>>> suggestions for improving this new revision of the NEP.
> >>>
> >>> A very impressive NEP indeed.
> >
> > Hi,
> >>>
> >>> However, how would corner cases, like
> >>>
> >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
> >>> >>> np.mean(a, skipna=True)
> >>
> >> This should be equivalent to removing all the NA values, then calling
> >> mean, like this:
> >> >>> b = np.array([], dtype='f8')
> >> >>> np.mean(b)
> >>
> >>
> >> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
> >> RuntimeWarning: invalid value encountered in double_scalars
> >> return mean(axis, dtype, out)
> >> nan
> >>>
> >>> >>> np.mean(a)
> >>
> >> This would return NA, since NA values are sitting in positions that would
> >> affect the output result.
> >
> > OK.
> >>
> >>
> >>>
> >>> be handled?
> >>> My concern here is that there always seem to be such corner cases, which
> >>> can only be handled with specific context knowledge. Thus producing 100%
> >>> generic code to handle 'missing data' is not doable.
> >>
> >> Working out the corner cases for the functions that are already in numpy
> >> seems tractable to me. How to support missing data, or whether to, is
> >> something the author of each new function will have to consider once
> >> missing data support is in NumPy, but I don't think we can do more than
> >> provide the mechanisms for people to use.
> >
> > Sure. I'll go along with this and wait until I have something tangible
> > that outperforms the 'traditional' NaN handling.
> > - eat
>
> Just a question about how things would work with the new model.
> How can you implement the "use" keyword from R's cov (or cor) with
> minimal data copying?
>
> I think the basic masked array version would (or does) just assign 0
> to the missing values, calculate the covariance or correlation, and then
> adjust the result using the correct count.
>
> ------------
> cov(x, y = NULL, use = "everything",
> method = c("pearson", "kendall", "spearman"))
>
> cor(x, y = NULL, use = "everything",
> method = c("pearson", "kendall", "spearman"))
>
> cov2cor(V)
>
> Arguments
> x a numeric vector, matrix or data frame.
> y NULL (default) or a vector, matrix or data frame with compatible
> dimensions to x. The default is equivalent to y = x (but more
> efficient).
> na.rm logical. Should missing values be removed?
>
> use an optional character string giving a method for computing
> covariances in the presence of missing values. This must be (an
> abbreviation of) one of the strings "everything", "all.obs",
> "complete.obs", "na.or.complete", or "pairwise.complete.obs".
> ------------
>
> I'm especially interested in the complete.obs case (drop any rows that
> contain an NA).
>
I think this is mainly a matter of extending NumPy's equivalent cov function
with a parameter like this. Implemented in C, I'm sure it could be done with
minimal copying. I'm not exactly sure how it would have to look implemented
in Python. Perhaps someone could try it once I have a basic prototype ready
for testing.
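
For reference, here is a rough sketch of what the complete.obs case could
look like in plain Python with the NumPy that exists today. It is only an
illustration under some assumptions: NaN stands in for NA (the masked design
from the NEP is not implemented), and the helper name cov_complete_obs is
made up for this example rather than being part of NumPy. It also makes one
copy of the surviving rows, which is exactly the copy a C implementation
might avoid.

```python
import numpy as np

def cov_complete_obs(data):
    """Covariance between the columns of `data`, using only complete rows.

    Hypothetical helper, not a NumPy function. NaN is used as a stand-in
    for NA, mimicking R's use="complete.obs".
    """
    data = np.asarray(data, dtype=float)
    # True for each row (observation) in which no variable is missing.
    complete = ~np.isnan(data).any(axis=1)
    # Boolean indexing copies the complete rows once before np.cov runs.
    return np.cov(data[complete], rowvar=False)

x = np.array([[1.0, 2.0],
              [2.0, np.nan],   # dropped: one variable is missing
              [3.0, 6.0],
              [4.0, 8.0]])
c = cov_complete_obs(x)
print(c)
```

The second row is dropped entirely, so the result is the ordinary covariance
of the three remaining observations.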
-Mark
>
> Josef
>
> >>
> >> -Mark
> >>
> >>>
> >>> Thanks,
> >>> - eat
> >>>>
> >>>> -Mark
> >>>> _______________________________________________
> >>>> NumPy-Discussion mailing list
> >>>> NumPy-Discussion@scipy.org
> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
>