[Numpy-discussion] 2D binning
Mathew Yeates
mat.yeates@gmail....
Wed Jun 2 12:23:29 CDT 2010
thanks. I am also getting an error in ndi.mean
Were you getting the error
"RuntimeError: data type not supported"?
-Mathew
On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
> On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <schut@sarvision.nl> wrote:
> > On 06/02/2010 04:52 AM, josef.pktd@gmail.com wrote:
> >> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus<zachary.pincus@yale.edu>
> wrote:
> >>>> I guess it's as fast as I'm going to get. I don't really see any
> >>>> other way. BTW, the lat/lons are integers)
> >>>
> >>> You could (in c or cython) try a brain-dead "hashtable" with no
> >>> collision detection:
> >>>
> >>> for lat, long, data in dataset:
> >>> bin = (lat ^ long) % num_bins
> >>> hashtable[bin] = update_incremental_mean(hashtable[bin], data)
> >>>
> >>> you'll of course want to do some experiments to see if your data are
> >>> sufficiently sparse and/or you can afford a large enough hashtable
> >>> array that you won't get spurious hash collisions. Adding error-
> >>> checking to ensure that there are no collisions would be pretty
> >>> trivial (just keep a table of the lat/long for each hash value, which
> >>> you'll need anyway, and check that different lat/long pairs don't get
> >>> assigned the same bin).
> >>>
> >>> Zach
> >>>
> >>>
> >>>
> >>>> -Mathew
> >>>>
> >>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary Pincus<
> zachary.pincus@yale.edu
> >>>>> wrote:
> >>>>> Hi
> >>>>> Can anyone think of a clever (non-lopping) solution to the
> >>>> following?
> >>>>>
> >>>>> A have a list of latitudes, a list of longitudes, and list of data
> >>>>> values. All lists are the same length.
> >>>>>
> >>>>> I want to compute an average of data values for each lat/lon pair.
> >>>>> e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
> >>>>> data[1001] = (data[1001] + data[2001])/2
> >>>>>
> >>>>> Looping is going to take wayyyy to long.
> >>>>
> >>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
> >>>> not floating-point, or floats that will always compare equal, that is,
> >>>> the floating-point bit-patterns will be guaranteed to be identical) or
> >>>> approximately equal to float tolerance?
> >>>>
> >>>> If you're in the approx-equal case, then look at the KD-tree in scipy
> >>>> for doing near-neighbors queries.
> >>>>
> >>>> If you're in the exact-equal case, you could consider hashing the lat/
> >>>> lon pairs or something. At least then the looping is O(N) and not
> >>>> O(N^2):
> >>>>
> >>>> import collections
> >>>> grouped = collections.defaultdict(list)
> >>>> for lt, ln, da in zip(lat, lon, data):
> >>>> grouped[(lt, ln)].append(da)
> >>>>
> >>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
> >>>> grouped.items())
> >>>>
> >>>> Is that fast enough?
> >>
> >> If the lat lon can be converted to a 1d label as Wes suggested, then
> >> in a similar timing exercise ndimage was the fastest.
> >> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
> >
> > And as you said your lats and lons are integers, you could simply do
> >
> > ll = lat*1000 + lon
> >
> > to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat or
> > lon will never exceed 360 (degrees).
> >
> > After that, either use the ndimage approach, or you could use
> > histogramming with weighting by data values and divide by histogram
> > withouth weighting, or just loop.
> >
> > Vincent
> >
> >>
> >> (this was for python 2.4, also later I found np.bincount which
> >> requires that the labels are consecutive integers, but is as fast as
> >> ndimage)
> >>
> >> I don't know how it would compare to the new suggestions.
> >>
> >> Josef
> >>
> >>
> >>
> >>>>
> >>>> Zach
> >>>> _______________________________________________
> >>>> NumPy-Discussion mailing list
> >>>> NumPy-Discussion@scipy.org
> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >>>>
> >>>> _______________________________________________
> >>>> NumPy-Discussion mailing list
> >>>> NumPy-Discussion@scipy.org
> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >>>
> >>> _______________________________________________
> >>> NumPy-Discussion mailing list
> >>> NumPy-Discussion@scipy.org
> >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >>>
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>
> I was curious about how fast ndimage was for this operation so here's
> the complete function.
>
> import scipy.ndimage as ndi
>
> N = 10000
>
> lat = np.random.randint(0, 360, N)
> lon = np.random.randint(0, 360, N)
> data = np.random.randn(N)
>
> def group_mean(lat, lon, data):
> indexer = np.lexsort((lon, lat))
> lat = lat.take(indexer)
> lon = lon.take(indexer)
> sorted_data = data.take(indexer)
>
> keys = 1000 * lat + lon
> unique_keys = np.unique(keys)
>
> result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
> decoder = keys.searchsorted(unique_keys)
>
> return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
>
> Appears to be about 13x faster (and could be made faster still) than
> the naive version on my machine:
>
> def group_mean_naive(lat, lon, data):
> grouped = collections.defaultdict(list)
> for lt, ln, da in zip(lat, lon, data):
> grouped[(lt, ln)].append(da)
>
> averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
>
> return averaged
>
> I had to get the latest scipy trunk to not get an error from ndimage.mean
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100602/b26c4f44/attachment.html
More information about the NumPy-Discussion
mailing list