[Numpy-discussion] 2D binning
Wes McKinney
wesmckinn@gmail....
Wed Jun 2 11:40:03 CDT 2010
On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <schut@sarvision.nl> wrote:
> On 06/02/2010 04:52 AM, josef.pktd@gmail.com wrote:
>> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus<zachary.pincus@yale.edu> wrote:
>>>> I guess it's as fast as I'm going to get. I don't really see any
>>>> other way. BTW, the lat/lons are integers)
>>>
>>> You could (in c or cython) try a brain-dead "hashtable" with no
>>> collision detection:
>>>
>>> for lat, long, data in dataset:
>>> bin = (lat ^ long) % num_bins
>>> hashtable[bin] = update_incremental_mean(hashtable[bin], data)
>>>
>>> you'll of course want to do some experiments to see if your data are
>>> sufficiently sparse and/or you can afford a large enough hashtable
>>> array that you won't get spurious hash collisions. Adding error-
>>> checking to ensure that there are no collisions would be pretty
>>> trivial (just keep a table of the lat/long for each hash value, which
>>> you'll need anyway, and check that different lat/long pairs don't get
>>> assigned the same bin).
>>>
>>> Zach
>>>
>>>
>>>
>>>> -Mathew
>>>>
>>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus<zachary.pincus@yale.edu
>>>>> wrote:
>>>>> Hi
>>>>> Can anyone think of a clever (non-lopping) solution to the
>>>> following?
>>>>>
>>>>> A have a list of latitudes, a list of longitudes, and list of data
>>>>> values. All lists are the same length.
>>>>>
>>>>> I want to compute an average of data values for each lat/lon pair.
>>>>> e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
>>>>> data[1001] = (data[1001] + data[2001])/2
>>>>>
>>>>> Looping is going to take wayyyy to long.
>>>>
>>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>>>> not floating-point, or floats that will always compare equal, that is,
>>>> the floating-point bit-patterns will be guaranteed to be identical) or
>>>> approximately equal to float tolerance?
>>>>
>>>> If you're in the approx-equal case, then look at the KD-tree in scipy
>>>> for doing near-neighbors queries.
>>>>
>>>> If you're in the exact-equal case, you could consider hashing the lat/
>>>> lon pairs or something. At least then the looping is O(N) and not
>>>> O(N^2):
>>>>
>>>> import collections
>>>> grouped = collections.defaultdict(list)
>>>> for lt, ln, da in zip(lat, lon, data):
>>>> grouped[(lt, ln)].append(da)
>>>>
>>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
>>>> grouped.items())
>>>>
>>>> Is that fast enough?
>>
>> If the lat lon can be converted to a 1d label as Wes suggested, then
>> in a similar timing exercise ndimage was the fastest.
>> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
>
> And as you said your lats and lons are integers, you could simply do
>
> ll = lat*1000 + lon
>
> to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
> lon will never exceed 360 (degrees).
>
> After that, either use the ndimage approach, or you could use
> histogramming with weighting by data values and divide by histogram
> withouth weighting, or just loop.
>
> Vincent
>
>>
>> (this was for python 2.4, also later I found np.bincount which
>> requires that the labels are consecutive integers, but is as fast as
>> ndimage)
>>
>> I don't know how it would compare to the new suggestions.
>>
>> Josef
>>
>>
>>
>>>>
>>>> Zach
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
I was curious about how fast ndimage was for this operation so here's
the complete function.
import scipy.ndimage as ndi
N = 10000
lat = np.random.randint(0, 360, N)
lon = np.random.randint(0, 360, N)
data = np.random.randn(N)
def group_mean(lat, lon, data):
indexer = np.lexsort((lon, lat))
lat = lat.take(indexer)
lon = lon.take(indexer)
sorted_data = data.take(indexer)
keys = 1000 * lat + lon
unique_keys = np.unique(keys)
result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
decoder = keys.searchsorted(unique_keys)
return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
Appears to be about 13x faster (and could be made faster still) than
the naive version on my machine:
def group_mean_naive(lat, lon, data):
grouped = collections.defaultdict(list)
for lt, ln, da in zip(lat, lon, data):
grouped[(lt, ln)].append(da)
averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
return averaged
I had to get the latest scipy trunk to not get an error from ndimage.mean
More information about the NumPy-Discussion
mailing list