[Numpy-discussion] What is consensus anyway
Charles R Harris
Tue Apr 24 08:14:20 CDT 2012
On Mon, Apr 23, 2012 at 11:35 PM, Fernando Perez <firstname.lastname@example.org> wrote:
> On Mon, Apr 23, 2012 at 8:49 PM, Stéfan van der Walt <email@example.com> wrote:
> > If you are referring to the traditional concept of a fork, and not to
> > the type we frequently make on GitHub, then I'm surprised that no one
> > has objected already. What would a fork solve? To paraphrase the
> > regexp saying: after forking, we'll simply have two problems.
> I concur with you here: github 'forks', yes, as many as possible!
> Hopefully every one of those will produce one or more PRs :) But a
> fork in the sense of a divergent parallel project? I think that would
> only be indicative of a complete failure to find a way to make
> progress here, and I doubt we're anywhere near that state.
> That forks are *possible* is indeed a valuable and important option in
> open source software, because it means that a truly dysfunctional
> original project team/direction can't hold a community hostage
> forever. But that doesn't mean that full-blown forks should be
> considered lightly, as they also carry enormous costs.
> I see absolutely nothing in the current scenario to even remotely
> consider that a full-blown fork would be a good idea, and I hope I'm
> right. It seems to me we're making progress on problems that led to
> real difficulties last year, and from multiple parties I see signs
> that give me reason to be optimistic that the project is getting
> better, not worse.
We certainly aren't there at the moment, but I can see us heading that way.
But let's back up a bit. Numpy 1.6.0 came out just about 1 year ago. Since
then datetime, NA, polynomial work, and various other enhancements have
gone in along with some 280 bug fixes. The major technical problem blocking
a 1.7 release is getting datetime working reliably on Windows. So I think
that is where the short-term effort needs to be. Meanwhile, we are spending
effort to get out a 1.6.2 just so people can work with a stable version
with some of the bug fixes, and potentially we will spend more time and
effort to pull out the NA code. In the future there may be a transition to
C++ and eventually a break with the current ABI. Or not.
There are at least two motivations that get folks to write code for open
source projects: scratching an itch and money. Money hasn't been a big part
of the Numpy picture so far, so that leaves scratching an itch. One of the
attractions of Numpy is that it is a small project, BSD licensed, and not
overburdened with governance and process. This makes scratching an itch not
as difficult as it would be in a large project. If Numpy remains a small
project but acquires the encumbrances of a big project much of that
attraction will be lost. Momentum and direction also attract people, but
numpy is stalled at the moment as the whole NA thing circles around once more.
What would I suggest as a way forward with the NA option? Let's take the
1) Adding slots to PyArrayObject_fields. I don't think this is likely to be
a problem unless someone's code passes the struct by value or uses
assignment to initialize a statically allocated instance. I'm not saying no
one does that; low-level scientific code can contain all sorts of bizarre
and astonishing constructs, and it is also possible that these sorts of
things might turn up in an old FORTRAN program. The question here is
whether to allow any changes at all, and I think we will have to in the
future. Given that, consistent use of accessors will make later changes to
the organization or implementation of the base structure transparent. Numpy
itself now uses accessors for the heritage slots, but not for the new NA
slots. So I suggest at a minimum adding accessors for the maskna_dtype,
maskna_data, and maskna_strides. Of course, later removing these slots will
still remain a problem.
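
To make the by-value versus accessor point concrete, here is a rough sketch
that uses Python's ctypes to mimic the C situation. The struct names and the
single maskna_flag slot are made up for illustration and are not the real
PyArrayObject_fields layout; the only claim is the general one about slots
appended to the end of a struct.

```python
import ctypes

class FieldsOld(ctypes.Structure):
    # Hypothetical layout that old client code was compiled against.
    _fields_ = [("nd", ctypes.c_int), ("flags", ctypes.c_int)]

class FieldsNew(ctypes.Structure):
    # Same leading slots plus one appended slot (stand-in for maskna_*).
    _fields_ = [("nd", ctypes.c_int), ("flags", ctypes.c_int),
                ("maskna_flag", ctypes.c_int)]

def get_nd(fields):
    # Accessor: the only place that needs to know where nd lives.
    return fields.nd

new = FieldsNew(nd=2, flags=1, maskna_flag=7)

# Old code reaching the struct through a pointer still sees the right values,
# because the leading slots did not move.
seen_by_old_code = ctypes.cast(ctypes.pointer(new),
                               ctypes.POINTER(FieldsOld)).contents
nd_via_accessor = get_nd(seen_by_old_code)

# Old code copying the struct by value only copies sizeof(FieldsOld) bytes,
# so the appended slot is silently lost (it stays zero in the copy).
by_value_copy = FieldsNew()
ctypes.memmove(ctypes.addressof(by_value_copy), ctypes.addressof(new),
               ctypes.sizeof(FieldsOld))
```

Pointer access through an accessor keeps working after the slot is added,
while the by-value copy drops the new slot without any error, which is the
kind of breakage described above.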
2) NA. This breaks down into API and implementation issues. Personally, I
think marking the NA stuff experimental leaves room to modify both and
would prefer to go with what we have and change it into whatever looks best
by modification through pull requests. This kicks the can down the road,
but not so far that people sufficiently interested in working on the topic
can't get modifications in. My own preferences for future API modifications
are as follows.
a) All arrays should be implicitly masked, even if the mask isn't initially
allocated. The maskna keyword can then be removed, taking with it the sense
that there are two kinds of arrays.
b) There needs to be a distinction between missing and ignore. The
mechanism for this is already in place in the payload type, although it
isn't clear to me that that is uniformly used in all the NA code. There is
also a place for missing *and* ignored. Which leads to
c) Sums, etc. should always skip ignored data. If missing data is present,
but not ignored, then a sum should return NA. The main danger I see here is
that the behavior of arrays becomes state dependent, something that can
lead to subtle problems. An explicit request for a particular behavior, as is
done now, might be preferable for its clarity.
d) I think views are a good way to add another mask layer to existing arrays.
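
The semantics proposed in b) and c) can be sketched in a few lines of plain
Python. This is only an illustration of the rule, not the existing NA code:
the MISSING and IGNORED flag bits, the NA sentinel, and the na_sum helper are
all hypothetical names chosen here.

```python
MISSING = 1   # hypothetical flag bit: the value is not known
IGNORED = 2   # hypothetical flag bit: skip this element in reductions

class _NAType:
    # Minimal stand-in for an NA result.
    def __repr__(self):
        return "NA"

NA = _NAType()

def na_sum(values, flags):
    # Ignored elements are skipped whether or not they are also missing;
    # a missing element that is not ignored makes the whole sum NA.
    total = 0.0
    for value, flag in zip(values, flags):
        if flag & IGNORED:
            continue
        if flag & MISSING:
            return NA
        total += value
    return total
```

For example, na_sum([1.0, 2.0, 3.0], [0, IGNORED, 0]) skips the middle
element, while replacing IGNORED with MISSING returns NA. The same call with
MISSING | IGNORED again skips the element, which is the "missing *and*
ignored" case.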
And for implementation:
a) Ufunc loop support. This is most easily done with explicit masks.
b) Apropos a), I'm coming (again) to the opinion that byte masks are the
simplest and most general implementation.
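
As a rough picture of what an explicit byte mask looks like in an inner loop,
here is a pure-Python sketch of a masked add. It assumes one mask byte per
element with nonzero meaning "exposed"; that convention and the masked_add
name are assumptions made for the example, not a description of the current
implementation.

```python
from array import array

def masked_add(a, a_mask, b, b_mask):
    # a, b: array('d') of equal length; a_mask, b_mask: one byte per element.
    n = len(a)
    out = array("d", [0.0] * n)
    out_mask = bytearray(n)          # starts fully masked (all zero bytes)
    for i in range(n):
        # Compute and expose a result only where both inputs are exposed.
        if a_mask[i] and b_mask[i]:
            out[i] = a[i] + b[i]
            out_mask[i] = 1
    return out, out_mask
```

With a whole byte per element the loop needs no bit shifting, and the unused
bits leave room for something like the payload mentioned earlier.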