[Numpy-discussion] NA, and replacement or reimplementation of np.ma
Fri Jun 14 14:35:05 CDT 2013
On Jun 14, 2013, at 20:23 , Eric Firing <firstname.lastname@example.org> wrote:
> On 2013/06/14 7:22 AM, Nathaniel Smith wrote:
>> On Wed, Jun 12, 2013 at 7:43 PM, Eric Firing <email@example.com> wrote:
>>> On 2013/06/12 2:10 AM, Nathaniel Smith wrote:
>>>> Personally I think that overloading np.empty is horribly ugly, will
>>>> continue confusing newbies and everyone else indefinitely, and I'm
>>>> 100% convinced that we'll regret implementing such a warty interface
>>>> for something that should be so idiomatic. (Unfortunately I got busy
>>>> and didn't actually say this in the previous thread though.) So I
>>>> think we should just merge the PR as is. The only downside is the
>>>> np.ma inconsistency, but, np.ma is already inconsistent (cf.
>>>> masked_array.fill versus masked_array.filled!), somewhat deprecated,
>>> "somewhat deprecated"? Really? Since when? By whom? Replaced by what?
>> Sorry, not trying to start a fight, just trying to summarize the
>> situation. As far as I can tell:
>> Despite heroic efforts on the part of its authors, numpy.ma has a
>> number of weird quirks (masked data can still trigger invalid value
>> errors), misfeatures (hard versus soft masks), and just plain old pain
>> points (ongoing issues with whether any given operation will respect
>> or preserve the mask).
The "invalid value errors" are a side effect of some design decisions taken 6-7 years ago. It turned out to be faster to follow a "compute without the mask, put it back afterwards" approach than the original "mask before, fill the holes with some value, compute, put the mask back" one: some functions like `pow` that were not part of the very first implementations twisted my arm on this one. It's far from perfect, it's rather disappointing, but I don't see a workaround with the current "let's do it in python" approach. Any other implementation would have to be done directly in C (or maybe in Cython, it's been 5 years since I last touched it).
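To make the trade-off concrete, here is a rough sketch of the two strategies. It is not the actual np.ma code: the helpers `compute_then_mask` and `fill_then_compute` are hypothetical names made up for the illustration, and the fill value is something the caller has to choose per function.

```python
import warnings
import numpy as np

data = np.array([4.0, -1.0, 9.0])
mask = np.array([False, True, False])  # True marks an entry to ignore


def compute_then_mask(func, data, mask):
    """Hypothetical helper: apply func to everything, then reattach the mask."""
    return np.ma.array(func(data), mask=mask)


def fill_then_compute(func, data, mask, fill_value):
    """Hypothetical helper: plug the holes with a safe value, apply func, remask."""
    safe = np.where(mask, fill_value, data)  # extra pass and extra copy
    return np.ma.array(func(safe), mask=mask)


def count_warnings(thunk):
    """Run thunk and return (result, number of warnings it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with np.errstate(invalid="warn"):
            result = thunk()
    return result, len(caught)


# sqrt sees the masked -1.0 and complains, even though the result is masked.
fast, fast_warnings = count_warnings(
    lambda: compute_then_mask(np.sqrt, data, mask))

# sqrt only ever sees safe values, at the price of the additional np.where pass.
careful, careful_warnings = count_warnings(
    lambda: fill_then_compute(np.sqrt, data, mask, fill_value=1.0))

print(fast_warnings, careful_warnings)
print(fast.filled(0.0), careful.filled(0.0))
```

Both variants give the same visible values; the first one is cheaper but lets the masked entry reach the ufunc, which is where the spurious "invalid value" message comes from.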
>> It's been in deep maintenance mode for some time; we merge the
>> occasional bug fix that people send in, and that's it. (To be fair,
>> numpy as a whole is fairly slow-moving, but numpy.ma still gets much
>> less attention.)
It never had a lot...
>> Therefore, my impression is that a majority (not all, but a majority)
>> of numpy developers strongly recommend against the use of numpy.ma in
>> new projects.
And where do you take that from? OK, to be frank, *I* would advise against a very naive use of np.ma: there are plenty of tricks to know to be really efficient with masked arrays. Most of the functions of the module are just for convenience in interactive mode…
> I think we can agree that there is major interest in having good numpy
> support for one or more styles of missing/masked values. You might not
> agree, but I will assert that the style of support provided by np.ma is
> *very* useful; it serves a real purpose in working code. We do agree
> that np.ma has problems. It is not at all clear to me, however, that
> those problems cannot or should not be fixed. Even if they can't, I
> don't think they are so severe that it is wise to try to kill off np.ma
> *before* there is a good replacement.
Quite agreed with that.
> In the NA branch, an attempt was made to lay the groundwork for solid
> missing/masked support. I did not agree with every design aspect,
Speaking of which, was a consensus (or at least a majority) reached on NA versus missing data?
> but I
> thought it was nevertheless good as groundwork, and could be used to
> greatly improve np.ma, to provide a different style of support for those
> who require it, and perhaps to lead over the very long term to a
> withering away of the need for np.ma.
When I started rewriting np.ma, Paul Dubois wrote me that 'if he were to do it again, it'd be in C, and that he disagreed with my approach' (I'm paraphrasing, but the gist is there). Of course, like every kid, I thought I knew better. In retrospect, he was quite right. I'm no longer convinced that MaskedArray as a subclass of ndarray is a correct approach. It works, it worked well enough for my needs at the time, it was a very educational journey, but if I were to do it again...
> Is there any way to revive this line of development? To satisfy the
> needs of people coming from the R world *and* of people for whom np.ma
> is, despite its warts, an important tool? This seems to me to be the
> single biggest area where numpy needs development.
I'm always surprised by the antagonism some people have towards np.ma… You can't always use NaN to represent the missing information you're doomed to meet in the real world.
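A small illustration of the NaN point, as a sketch only: the array and the -999 sentinel below are made-up example data, not anything from a real project.

```python
import numpy as np

# Hypothetical integer measurements; -999 stands in for "no reading".
counts = np.array([3, 7, -999, 5])

# An integer array has no NaN to offer: the assignment is refused.
try:
    attempt = counts.copy()
    attempt[2] = np.nan
    nan_accepted = True
except ValueError:
    nan_accepted = False

# A masked array keeps the integer dtype and simply skips the hole.
masked_counts = np.ma.masked_equal(counts, -999)

print(nan_accepted)
print(masked_counts.dtype.kind, masked_counts.count(), masked_counts.sum())
```

Casting to float just to get a NaN would work here, but it changes the dtype of the data, which is not always acceptable.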
> It looks like this problem needs dedicated resources: a grant, a major
> corporate effort, or both.
<plug class="shameless">Fund me I'm yours</plug>
> Numpy is central to python in science, but it doesn't seem to have a
> corresponding level of direction and support.
More seriously, I'd be delighted to help. I can no longer work on it full time as I used to (even if I was not supposed to), but I can often explain why things were done the way they were and how we could improve them.