[Numpy-discussion] 2D binning

josef.pktd@gmail.com
Wed Jun 2 13:26:02 CDT 2010


On Wed, Jun 2, 2010 at 2:09 PM, Mathew Yeates <mat.yeates@gmail.com> wrote:
> I'm on Windows, using a precompiled binary. I never built numpy/scipy on
> Windows.

The ndimage measurements code has recently been rewritten. ndimage is
very fast, but the old version does insufficient type checking and may
crash on wrong inputs.

I managed to work with it on Windows in the past. Maybe you could try
checking the dtypes of the arguments you pass to ndi.mean (preferably
in an interpreter session where you don't mind a crash).

I don't remember the type restrictions, but there are/were several
tickets for it.
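
For instance, a minimal, self-contained sketch of what that check might
look like (the array names are stand-ins, and the dtype casts are just a
guess at what the old ndimage accepts):

import numpy as np
import scipy.ndimage as ndi

# Toy stand-ins for the real arrays; replace with whatever you pass to ndi.mean.
data = np.random.randn(100)               # values to average
labels = np.random.randint(0, 10, 100)    # integer group labels

# See what ndi.mean is actually being fed; the old ndimage was picky about dtypes.
print(data.dtype)
print(labels.dtype)

# Casting to well-supported dtypes before the call is a cheap thing to try.
result = ndi.mean(data.astype(np.float64),
                  labels=labels.astype(np.int32),
                  index=np.unique(labels))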

Josef


>
> On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates <mat.yeates@gmail.com>
>> wrote:
>> > thanks. I am also getting an error in ndi.mean
>> > Were you getting the error
>> > "RuntimeError: data type not supported"?
>> >
>> > -Mathew
>> >
>> > On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney <wesmckinn@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <schut@sarvision.nl>
>> >> wrote:
>> >> > On 06/02/2010 04:52 AM, josef.pktd@gmail.com wrote:
>> >> >> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus
>> >> >> <zachary.pincus@yale.edu> wrote:
>> >> >>>> I guess it's as fast as I'm going to get. I don't really see any
>> >> >>>> other way. (BTW, the lat/lons are integers.)
>> >> >>>
>> >> >>> You could (in c or cython) try a brain-dead "hashtable" with no
>> >> >>> collision detection:
>> >> >>>
>> >> >>> for lat, long, data in dataset:
>> >> >>>    bin = (lat ^ long) % num_bins
>> >> >>>    hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>> >> >>>
>> >> >>> you'll of course want to do some experiments to see if your data
>> >> >>> are
>> >> >>> sufficiently sparse and/or you can afford a large enough hashtable
>> >> >>> array that you won't get spurious hash collisions. Adding error-
>> >> >>> checking to ensure that there are no collisions would be pretty
>> >> >>> trivial (just keep a table of the lat/long for each hash value,
>> >> >>> which
>> >> >>> you'll need anyway, and check that different lat/long pairs don't
>> >> >>> get
>> >> >>> assigned the same bin).
>> >> >>>
>> >> >>> Zach
>> >> >>>
>> >> >>>
>> >> >>>
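
Roughly, a pure-Python sketch of that idea might look like the following
(the hash, the bin count, and the array names are arbitrary stand-ins;
the real thing would be written in C or Cython):

import numpy as np

num_bins = 1 << 20                  # comfortably larger than the number of
                                    # distinct lat/lon pairs you expect
counts = np.zeros(num_bins, dtype=np.int64)
means = np.zeros(num_bins, dtype=np.float64)
seen = {}                           # bin -> (lat, lon), for collision checking

lats = np.random.randint(0, 360, 10000)
lons = np.random.randint(0, 360, 10000)
data = np.random.randn(10000)

for la, lo, x in zip(lats, lons, data):
    b = hash((la, lo)) % num_bins   # stand-in for the (lat ^ long) % num_bins hash
    if b in seen and seen[b] != (la, lo):
        raise ValueError("hash collision: enlarge num_bins or change the hash")
    seen[b] = (la, lo)
    counts[b] += 1
    means[b] += (x - means[b]) / counts[b]   # incremental (running) mean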
>> >> >>>> -Mathew
>> >> >>>>
>> >> >>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus
>> >> >>>> <zachary.pincus@yale.edu> wrote:
>> >> >>>>> Hi
>> >> >>>>> Can anyone think of a clever (non-looping) solution to the
>> >> >>>> following?
>> >> >>>>>
>> >> >>>>> A have a list of latitudes, a list of longitudes, and list of
>> >> >>>>> data
>> >> >>>>> values. All lists are the same length.
>> >> >>>>>
>> >> >>>>> I want to compute an average of data values for each lat/lon
>> >> >>>>> pair.
>> >> >>>>> e.g. if lat[1001] == lat[2001] and lon[1001] == lon[2001], then
>> >> >>>>> data[1001] = (data[1001] + data[2001])/2
>> >> >>>>>
>> >> >>>>> Looping is going to take wayyyy too long.
>> >> >>>>
>> >> >>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e.
>> >> >>>> either
>> >> >>>> not floating-point, or floats that will always compare equal, that
>> >> >>>> is,
>> >> >>>> the floating-point bit-patterns will be guaranteed to be
>> >> >>>> identical)
>> >> >>>> or
>> >> >>>> approximately equal to float tolerance?
>> >> >>>>
>> >> >>>> If you're in the approx-equal case, then look at the KD-tree in
>> >> >>>> scipy
>> >> >>>> for doing near-neighbors queries.
>> >> >>>>
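
As a rough illustration of that approximate-equality route (a sketch
only; the tolerance, sizes, and array names are made up): build a KD-tree
on the (lat, lon) pairs and average each point's data over its
tolerance-neighborhood.

import numpy as np
from scipy.spatial import cKDTree

tol = 1e-6
lat = np.random.rand(1000) * 180 - 90
lon = np.random.rand(1000) * 360 - 180
data = np.random.randn(1000)

points = np.column_stack((lat, lon))
tree = cKDTree(points)

# For each point, indices of all points within tol (each point includes itself).
neighbors = tree.query_ball_point(points, r=tol)

# Mean of data over each point's tolerance-neighborhood.
averaged = np.array([data[idx].mean() for idx in neighbors])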
>> >> >>>> If you're in the exact-equal case, you could consider hashing the
>> >> >>>> lat/
>> >> >>>> lon pairs or something. At least then the looping is O(N) and not
>> >> >>>> O(N^2):
>> >> >>>>
>> >> >>>> import collections
>> >> >>>> grouped = collections.defaultdict(list)
>> >> >>>> for lt, ln, da in zip(lat, lon, data):
>> >> >>>>    grouped[(lt, ln)].append(da)
>> >> >>>>
>> >> >>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
>> >> >>>> grouped.items())
>> >> >>>>
>> >> >>>> Is that fast enough?
>> >> >>
>> >> >> If the lat lon can be converted to a 1d label as Wes suggested, then
>> >> >> in a similar timing exercise ndimage was the fastest.
>> >> >> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
>> >> >
>> >> > And as you said your lats and lons are integers, you could simply do
>> >> >
>> >> > ll = lat*1000 + lon
>> >> >
>> >> > to get unique 'hashes' or '1d labels' for your lat/lon pairs, as a lat
>> >> > or
>> >> > lon will never exceed 360 (degrees).
>> >> >
>> >> > After that, either use the ndimage approach, or use a histogram
>> >> > weighted by the data values divided by an unweighted histogram, or
>> >> > just loop.
>> >> >
>> >> > Vincent
>> >> >
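
A sketch of that histogram idea (the sum of data per label divided by the
count per label gives the per-label mean; sizes and names are made up):

import numpy as np

lat = np.random.randint(0, 360, 10000)
lon = np.random.randint(0, 360, 10000)
data = np.random.randn(10000)

ll = lat * 1000 + lon                       # unique 1d label per (lat, lon) pair
bins = np.arange(ll.min(), ll.max() + 2)    # one bin per possible label value

sums, _ = np.histogram(ll, bins=bins, weights=data)   # weighted histogram
counts, _ = np.histogram(ll, bins=bins)               # unweighted histogram

with np.errstate(divide='ignore', invalid='ignore'):
    means = sums / counts                   # NaN where a label never occurs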
>> >> >>
>> >> >> (This was for Python 2.4; later I also found np.bincount, which
>> >> >> requires that the labels be consecutive integers but is as fast as
>> >> >> ndimage.)
>> >> >>
>> >> >> I don't know how it would compare to the new suggestions.
>> >> >>
>> >> >> Josef
>> >> >>
>> >> >>
>> >> >>
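
For reference, a sketch of the np.bincount route, using np.unique's
return_inverse to get the consecutive integer labels it needs (array
names and sizes are made up):

import numpy as np

lat = np.random.randint(0, 360, 10000)
lon = np.random.randint(0, 360, 10000)
data = np.random.randn(10000)

keys = lat * 1000 + lon
unique_keys, inverse = np.unique(keys, return_inverse=True)  # inverse is 0..n_groups-1

group_means = np.bincount(inverse, weights=data) / np.bincount(inverse)
# group_means[i] is the mean of data where keys == unique_keys[i]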
>> >> >>>>
>> >> >>>> Zach
>> >>
>> >> I was curious about how fast ndimage was for this operation so here's
>> >> the complete function.
>> >>
>> >> import collections
>> >> import numpy as np
>> >> import scipy.ndimage as ndi
>> >>
>> >> N = 10000
>> >>
>> >> lat = np.random.randint(0, 360, N)
>> >> lon = np.random.randint(0, 360, N)
>> >> data = np.random.randn(N)
>> >>
>> >> def group_mean(lat, lon, data):
>> >>    indexer = np.lexsort((lon, lat))
>> >>    lat = lat.take(indexer)
>> >>    lon = lon.take(indexer)
>> >>    sorted_data = data.take(indexer)
>> >>
>> >>    keys = 1000 * lat + lon
>> >>    unique_keys = np.unique(keys)
>> >>
>> >>    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
>> >>    decoder = keys.searchsorted(unique_keys)
>> >>
>> >>    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
>> >>
>> >> Appears to be about 13x faster (and could be made faster still) than
>> >> the naive version on my machine:
>> >>
>> >> def group_mean_naive(lat, lon, data):
>> >>    grouped = collections.defaultdict(list)
>> >>    for lt, ln, da in zip(lat, lon, data):
>> >>      grouped[(lt, ln)].append(da)
>> >>
>> >>    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
>> >>
>> >>    return averaged
>> >>
>> >> I had to use the latest scipy trunk to avoid an error from
>> >> ndimage.mean.
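
For what it's worth, a rough way to time the two versions is sketched
below (this assumes group_mean, group_mean_naive, lat, lon and data are
defined as in the quoted code; the speedup will of course vary by
machine):

import timeit

fast = timeit.timeit(lambda: group_mean(lat, lon, data), number=10)
slow = timeit.timeit(lambda: group_mean_naive(lat, lon, data), number=10)
print(slow / fast)   # ratio of naive time to ndimage time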
>> >
>> >
>> >
>>
>> That's the error I was getting. Depending on your OS, upgrading to the
>> scipy trunk should be the easiest fix.
>
>

