[Numpy-discussion] Masked Arrays in NumPy 1.x

Paul Hobson pmhobson@gmail....
Mon Apr 23 16:57:56 CDT 2012


Nathan,

Apologies for not being clear. My interaction with these lists is
unfortunately constrained to lunch breaks and times when my code is
running. :-/

So the masked values are the detection limits. In other words, the
lab said that they can't see the chemical in the sample, but it may be
present at a concentration below what their machines can see, i.e.
the detection limit. Simply put, I receive data that looks like this:

Sample # - Result (mg/kg)
1 - 5.1
2 - 6.3
3 - <4.5
4 - <3.0
5 - 10.2
6 - <1.0
etc...
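
To make that concrete, here is a minimal sketch of how results like
these could be loaded into a masked array. It is not taken from the
gist; the names (raw, values, nondetect, results) are made up for
illustration, and it assumes a leading "<" is the only flag the lab
uses for a non-detect.

```python
import numpy as np

# Results as received from the lab (values from the listing above).
# Assumption: a leading "<" marks a non-detect reported at its detection limit.
raw = ["5.1", "6.3", "<4.5", "<3.0", "10.2", "<1.0"]

# Strip the "<" to get the number, and record which results were non-detects.
values = np.array([float(r.lstrip("<")) for r in raw])
nondetect = np.array([r.startswith("<") for r in raw])

# Masked entries hold detection limits; unmasked entries hold detected values.
results = np.ma.masked_array(values, mask=nondetect)
```

The detection limits stay in the underlying data, so nothing is thrown
away by masking.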

So when I parse the data, I mask the results that are non-detect (less
than the reported value) so that I can set them aside and sort the
data my way -- detected values descending stacked on top of non-detect
values descending. I then use the statistical distribution of the
detected values to model the non-detect values. Again, since we don't
know what they are, this is just a best guess. Some people just use
the detection limit, others half of the detection limit.
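
As a rough sketch of the ordering described above, plus the simple
"half of the detection limit" substitution for comparison: this is not
the ROS code from the gist, the distribution-fitting step is left out
entirely, and the variable names are hypothetical.

```python
import numpy as np

# The six results from the listing above; True marks a non-detect.
values = np.array([5.1, 6.3, 4.5, 3.0, 10.2, 1.0])
nondetect = np.array([False, False, True, True, False, True])
results = np.ma.masked_array(values, mask=nondetect)

# Detected values, descending.
detected = np.sort(results.compressed())[::-1]
# Detection limits of the non-detects, descending.
limits = np.sort(results.data[results.mask])[::-1]

# Detected values stacked on top of non-detects, with the mask carried along.
stacked = np.ma.masked_array(
    np.concatenate([detected, limits]),
    mask=np.concatenate([np.zeros(detected.size, dtype=bool),
                         np.ones(limits.size, dtype=bool)]),
)

# The simple substitution some people use: half of the detection limit.
half_dl = np.where(results.mask, results.data / 2.0, results.data)
```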

A picture is worth 1000 words, right? The attached plot shows what my
code does to my data. Hopefully this demonstrates the value of my
method over using the detection limits as received from the lab.

The best way I can explain it is that I don't use the mask to say, "I
don't know this" or "I don't want to see this". Instead I use it to
classify data within a single set into two types of data. For
simplicity's sake, we can call them "known" data and "upper bounded"
data. In order to do this with numpy.ma, I must be able to retrieve
the masked values (e.g. x.data[x.mask]).
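
A small sketch of that retrieval, assuming a hypothetical helper
called split_censored. It uses np.ma.getmaskarray rather than the
.mask attribute directly, because .mask can be the scalar
np.ma.nomask when nothing is masked, and indexing with that scalar
does not do what you'd want.

```python
import numpy as np

def split_censored(x):
    """Return (known, upper_bounds) as plain arrays from a masked array.

    Hypothetical helper for illustration; not part of numpy.ma.
    """
    mask = np.ma.getmaskarray(x)  # always a full boolean array
    data = np.ma.getdata(x)       # underlying values, masked ones included
    return data[~mask], data[mask]

x = np.ma.masked_array([5.1, 6.3, 4.5, 3.0, 10.2, 1.0],
                       mask=[False, False, True, True, False, True])
known, upper_bounds = split_censored(x)

# Also works when no value is masked at all.
all_known = np.ma.masked_array([1.0, 2.0, 3.0])
known_only, no_bounds = split_censored(all_known)
```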

I'm sure that if numpy.ma went away forever, I could work around this.
However the current implementation of numpy.ma works very well for me
as-is.

Please don't hesitate to ask for further clarification if I glossed
over any details. I've been working in this field since I was a 19-yo
intern, so I'm undoubtedly taking things for granted.

-paul

On Mon, Apr 23, 2012 at 1:40 PM, Nathaniel Smith <njs@pobox.com> wrote:
> Hi Paul,
>
> On Wed, Apr 11, 2012 at 8:57 PM, Paul Hobson <pmhobson@gmail.com> wrote:
>> Travis et al,
>>
>> This isn't a reply to anything specific in your email and I apologize
>> if there is a better thread or place to share this information. I've
>> been meaning to participate in the discussion for a long time and
>> never got around to it. The main thing I'd like to do is convey my
>> typical use of the numpy.ma module as an environmental engineer
>> analyzing censored datasets --contaminant concentrations that are
>> either at well understood values (not masked) or some unknown value
>> below an upper bound (masked).
>>
>> My basic understanding is that this discussion revolved around how to
>> treat masked data (ignored vs missing) and how to implement one, both,
>> or some middle ground between those two concepts. If I'm off-base,
>> just ignore all of the following.
>>
>> For my purposes, numpy.ma is implemented in a way very well suited to
>> my needs. Here's a gist of something that was *really* hard for me
>> before I discovered numpy.ma and numpy in general. (this is a bit
>> much, see below for the highlights)
>> https://gist.github.com/2361814
>>
>> The main message here is that I include the upper bounds of the
>> unknown values (detection limits) in my array and use that to
>> statistically estimate their values. I must be able to retrieve the
>> masked detection limits throughout this process. Additionally the
>> masks as currently implemented allow me to sort first the undetected
>> values, then the detected values (see __rosRanks in the gist).
>>
>> As a boots-on-the-ground user of numpy, I'm ecstatic that this tool
>> exists. I'm also pretty flexible and don't anticipate any major snags
>> in my work if things change dramatically as the masked/missing/ignored
>> functionality evolves.
>>
>> Thanks to everyone for the hard work and great tools,
>> -Paul Hobson
>
> Thanks for this note -- getting feedback from people on how
> they're actually using numpy.ma is *very* helpful, because there's a
> lot more data out there on the "missing data" use case.
>
> But, I couldn't quite figure out what you're actually doing in this
> code. It looks like the measurements that you're masking out have some
> values "hidden behind" the mask, which you then make use of?
> Unfortunately, I don't know anything about environmental engineering
> or the method of Hirsch and Stedinger (1987). Could you elaborate a
> bit on what these masked values mean and how you process them?
>
> -- Nathaniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_plot.png
Type: image/png
Size: 61872 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20120423/5f6ce279/attachment-0001.png 

