[Numpy-discussion] 2D binning
Wed Jun 2 00:15:39 CDT 2010
On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney <firstname.lastname@example.org> wrote:
> On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus <email@example.com> wrote:
>>> Can anyone think of a clever (non-looping) solution to the following?
>>> I have a list of latitudes, a list of longitudes, and a list of data
>>> values. All lists are the same length.
>>> I want to compute an average of data values for each lat/lon pair.
>>> e.g. if lat[i], lon[i] == lat[j], lon[j] then
>>> data[i] = (data[i] + data[j])/2
>>> Looping is going to take wayyyy too long.
>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>> not floating-point, or floats that will always compare equal, that is,
>> the floating-point bit-patterns will be guaranteed to be identical) or
>> approximately equal to float tolerance?
>> If you're in the approx-equal case, then look at the KD-tree in scipy
>> for doing near-neighbors queries.
>> If you're in the exact-equal case, you could consider hashing the lat/
>> lon pairs or something. At least then the looping is O(N) and not
>> import collections
>> import numpy
>> grouped = collections.defaultdict(list)
>> for lt, ln, da in zip(lat, lon, data):
>>     grouped[(lt, ln)].append(da)
>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())
>> Is that fast enough?
>> NumPy-Discussion mailing list
> This is a pretty good example of the "group-by" problem that will
> hopefully work its way into a future edition of NumPy. Given that, a
> good approach would be to produce a unique key from the lat and lon
> vectors, and pass that off to the groupby routine (when it exists).
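For what it's worth, the unique-key idea can already be sketched with plain NumPy, assuming the lat/lon pairs are exactly equal. The function name `group_mean_by_key` is made up for illustration, and `np.unique(..., axis=0)` needs a reasonably recent NumPy, so treat this as a rough sketch rather than a proper groupby routine:

```python
import numpy as np

def group_mean_by_key(lat, lon, data):
    """Average data over exactly-equal (lat, lon) pairs without a Python loop.

    Hypothetical helper, not part of NumPy.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    data = np.asarray(data, dtype=float)
    # Stack into an (N, 2) array so that each row is one lat/lon key.
    pairs = np.column_stack((lat, lon))
    # unique_pairs: the distinct keys; inverse: group id of each input row.
    unique_pairs, inverse = np.unique(pairs, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # the shape of inverse varies across NumPy versions
    # Sum the data per group id, then divide by the group sizes.
    sums = np.bincount(inverse, weights=data, minlength=len(unique_pairs))
    counts = np.bincount(inverse, minlength=len(unique_pairs))
    return unique_pairs, sums / counts

# Three of the four points share the location (10, 1).
keys, means = group_mean_by_key([10., 10., 20., 10.],
                                [1., 1., 2., 1.],
                                [1., 3., 5., 8.])
# keys holds the pairs (10, 1) and (20, 2); means holds 4.0 and 5.0
```

The keys come back sorted by lat and then lon, not in order of first appearance.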
Meanwhile, groupby from itertools will work, but it might be a bit slower
since it'll have to convert every row to a tuple and group in a list.
import itertools
import numpy as np

# fake data: N // 250 distinct lat/lon locations, 250 values at each
N = 10000
lats = np.repeat(180 * (np.random.ranf(N // 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N // 250) - 0.5), 250)
vals = np.arange(N)

# sort by lat, then lon, so equal pairs are adjacent (groupby requires this)
inds = np.lexsort((lons, lats))
sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]

llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T
for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print(lat, lon, np.mean(group_vals))

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)
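Once the arrays are sorted, the Python-level loop can probably be dropped altogether by summing each run of equal keys with np.add.reduceat. A rough sketch, assuming the lat/lon pairs compare exactly equal; the seeded RandomState and the names `is_start`, `starts` and `group_means` are illustrative only:

```python
import numpy as np

# Same style of fake data: 40 distinct locations, 250 values at each.
N = 10000
rng = np.random.RandomState(0)  # fixed seed, only so the run is repeatable
lats = np.repeat(180 * (rng.random_sample(N // 250) - 0.5), 250)
lons = np.repeat(360 * (rng.random_sample(N // 250) - 0.5), 250)
vals = np.arange(N, dtype=float)

# Sort by lat, then lon, so that equal (lat, lon) pairs are adjacent.
inds = np.lexsort((lons, lats))
s_lats, s_lons, s_vals = lats[inds], lons[inds], vals[inds]

# A group starts at row 0 and wherever lat or lon differs from the previous row.
is_start = np.ones(N, dtype=bool)
is_start[1:] = (s_lats[1:] != s_lats[:-1]) | (s_lons[1:] != s_lons[:-1])
starts = np.flatnonzero(is_start)

# Sum each run of equal keys, then divide by the run length.
group_sums = np.add.reduceat(s_vals, starts)
group_counts = np.diff(np.append(starts, N))
group_means = group_sums / group_counts
group_lats, group_lons = s_lats[starts], s_lons[starts]
```

Here group_lats[k], group_lons[k] and group_means[k] describe the k-th distinct location in sorted order.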