[Numpy-discussion] Missing data again

Travis Oliphant travis@continuum...
Sat Mar 3 14:30:47 CST 2012

Hi all, 

I've been thinking a lot about the masked array implementation lately. I finally had the time to look hard at what has been done, and I am now of the opinion that 1.7 cannot be released with the masked array implementation in its current state *unless* it is clearly marked as experimental and subject to change in 1.8.

I wish I had been able to be a bigger part of this conversation last year. But that is why I took the steps I took to try to figure out another way to feed my family *and* stay involved in the NumPy community. I would love to stay involved in what is happening in the SciPy community, but I am more satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and others are doing there right now, and I don't have time to keep up with everything. Even though SciPy was the heart and soul of why I got involved with Python for open source in the first place, and took many years of my volunteer labor, I won't be able to spend significant time on SciPy code over the coming months. At some point, I really hope to be able to make contributions to that code base again. Time will tell whether or not my aspirations will be realized. It depends quite a bit on whether or not my kids have what they need from me (which right now is money and time).

NumPy, on the other hand, is not in a position where I can feel comfortable leaving my "baby" to others. I recognize and value the contributions many people have made to make NumPy what it is today (e.g. code contributions, code rearrangement and standardization, build and install improvements, and most recently some architectural changes). But I feel a personal responsibility for the code base, as I spent a great many months writing NumPy in the first place, and I've spent a great deal of time interacting with NumPy users and feel like I have at least some sense of their stories. Of course, I built on the shoulders of giants, and much of what is there is *because of* where the code was adapted from (it was not created de novo). Currently, there remains much that needs to be communicated, improved, and worked on, and I have specific opinions about what some changes and improvements should be, how they should be written, and how users should benefit from them. It will take time to discuss all of this, and that's where I will spend my open-source time in the coming months.

In that vein: 

Because it is slated to go into release 1.7, we need to revisit the masked array discussion. The NEP process is the appropriate one and I'm glad we are taking that route for these discussions. My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code). It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things; look at the coming election...). We can represent different parts of what is fortunately a very large NumPy user base.

First of all, I want to be clear that I think a great deal of good work has been done in the current missing data code. There are some nice features in the where clause of the ufunc and in the iterator machinery that allows re-using ufunc loops that are not re-written to check for missing data. I'm sure there are other things as well that I'm not quite aware of yet. However, I don't think the API presently presented to the NumPy user is the correct one for NumPy 1.X.
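
As a rough illustration of the where clause mentioned above, the sketch below uses the where= and out= keyword arguments that ufuncs accept in released NumPy. The array names and values are made up for the example, and the reduction form assumes a reasonably recent NumPy (1.17 or later). It is meant only to show the idea of running an ordinary ufunc loop over the valid elements, not the missing-data API under discussion.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Hypothetical validity flags: True means "compute here".
valid = np.array([True, False, True, True])

# Positions where `valid` is False are left untouched in `out`,
# so pre-fill it with a known placeholder value.
out = np.full_like(a, np.nan)
np.add(a, b, out=out, where=valid)

# The same keyword restricts a reduction to the selected elements
# (assumes NumPy >= 1.17).
total = np.add.reduce(a, where=valid)

print(out)
print(total)
```

Note that the inner loop of np.add itself knows nothing about missing values; the selection is handled around it.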

A few particulars: 

	* the reduction operations need to default to "skipna"; this is the most common use case, which was reinforced to me again today by a new Python user who is presently using masked arrays
	* the mask needs to be visible to the user if they use that approach to missing data (people should be able to get hold of the mask and work with it in Python)

	* bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented. 

	* there should be some way when using "masks" for missing data (even if they are hidden from most users) to separate the low-level ufunc operation from the operation on the masks...
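
To make these particulars concrete, here is a minimal sketch of the behaviors being asked for. It uses the long-standing numpy.ma module and the NaN-aware functions as stand-ins, not the 1.7 development API. The int32 sentinel (the most negative value) and the masked_add helper are assumptions chosen for illustration only.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
missing = np.array([False, True, False, False])

# Mask approach (numpy.ma): reductions skip masked entries by default,
# and the mask is an ordinary boolean array the user can inspect.
m = np.ma.masked_array(data, mask=missing)
skipna_sum = m.sum()
visible_mask = np.ma.getmaskarray(m)

# Bit-pattern approach for float64: NaN marks a missing value.
f = data.copy()
f[missing] = np.nan
propagated = f.sum()        # NaN propagates through a plain sum
skipped = np.nansum(f)      # "skipna"-style reduction

# Bit-pattern approach for int32: reserve one value as the NA marker.
# The choice of the minimum int32 is an assumption for this sketch.
INT32_NA = np.iinfo(np.int32).min
i = np.array([1, INT32_NA, 3, 4], dtype=np.int32)
int_sum = i[i != INT32_NA].sum()


def masked_add(x_data, x_mask, y_data, y_mask):
    """Hypothetical helper: keep the data loop and the mask logic apart."""
    result = np.add(x_data, y_data)              # plain ufunc, no NA checks
    result_mask = np.logical_or(x_mask, y_mask)  # separate operation on masks
    return result, result_mask


no_mask = np.zeros(4, dtype=bool)
summed, summed_mask = masked_add(data, missing, data, no_mask)
```

The last function shows the separation described in the final bullet: the arithmetic runs on the raw data, and the mask is combined in its own step.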

I have heard from several users that they will *not use* the missing data support in NumPy as currently implemented, and I can now see why. For better or for worse, my approach to software is generally very user-driven and very pragmatic. On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure. Nonetheless, I'm an *applied* mathematician and am ultimately motivated by applications.

I will get a hold of the NEP and spend some time with it to discuss some of this in that document.   This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there).    For now, I do not think 1.7 can be released unless the masked array is labeled *experimental*.  


