[Numpy-discussion] MemoryError : with scipy.spatial.distance

Abhishek Pratap apratap@lbl....
Wed Apr 4 18:41:51 CDT 2012


Thanks Chris. So I guess the question becomes: how can I efficiently
cluster 1 million (x, y) coordinates?

-Abhi
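
One possible direction for this question, offered only as a sketch: use a
spatial index so that only pairs of points closer than some radius are ever
stored, instead of the full pairwise distance array. The example below uses
scipy.spatial.cKDTree and scipy.sparse.csgraph.connected_components. The
function name radius_clusters, the eps value, and the sample points are
illustrative assumptions, not anything from the thread. This is not DBSCAN
itself (there is no minimum-points density threshold); it only groups points
that are chained together by neighbours within eps.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree


def radius_clusters(points, eps):
    """Label points by connected component of the 'within eps' graph.

    Hypothetical helper for illustration; not a full DBSCAN.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    tree = cKDTree(points)
    # Only pairs at distance <= eps are materialised, not all n*(n-1)/2.
    pairs = tree.query_pairs(r=eps, output_type="ndarray")
    # Sparse adjacency matrix with one entry per close pair.
    graph = coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
        shape=(n, n),
    )
    _, labels = connected_components(graph, directed=False)
    return labels


# Tiny made-up sample: two well-separated groups of three points each.
sample = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])
print(radius_clusters(sample, eps=0.5))
```

Memory use here scales with the number of close pairs, so it depends on eps
and on how dense the data is; a large eps on dense data can still be
expensive.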

On Wed, Apr 4, 2012 at 4:35 PM, Chris Barker <chris.barker@noaa.gov> wrote:
> On Wed, Apr 4, 2012 at 4:17 PM, Abhishek Pratap
>> close to a 900K points using DBSCAN algo. My input is a list of ~900k
>> tuples each having two points (x,y) coordinates. I am converting them
>> to numpy array and passing them to pdist method of
>> scipy.spatial.distance for calculating distance between each point.
>
> I think pdist creates an array that is:
>
> sum(range(num_points)) in size.
>
> That's going to be pretty darn big:
>
> 404999550000 elements
>
> I think that's about 3 terabytes:
>
> In [41]: sum(range(900000)) / 1024. / 1024 / 1024 / 1024 * 8
> Out[41]: 2.946759559563361
>
> (for 64 bit floats)
>
>
>> I think the error has something to do with the default double dtype
>> of numpy array of pdist function.
>
> you *may* be able to get it to use float32 -- but as you can see, that
> probably won't help enough!
>
> You'll need a different approach!
>
> -Chris
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker@noaa.gov
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
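
The size estimate in the quoted message can be reproduced with integer
arithmetic, since sum(range(n)) equals n*(n-1)/2, the number of entries in
the condensed distance array that pdist returns. The sketch below is only a
restatement of that arithmetic; the helper name condensed_pdist_size is
made up for illustration.

```python
def condensed_pdist_size(n_points, bytes_per_element=8):
    """Return (element count, size in bytes) of a condensed distance array."""
    n_elements = n_points * (n_points - 1) // 2  # same as sum(range(n_points))
    return n_elements, n_elements * bytes_per_element


n_elements, n_bytes = condensed_pdist_size(900000)
print(n_elements)              # 404999550000
print(n_bytes / 1024.0 ** 4)   # roughly 2.95 (TiB) for 64-bit floats

# Halving the element size to 4 bytes (float32) only halves the total.
_, n_bytes_f32 = condensed_pdist_size(900000, bytes_per_element=4)
print(n_bytes_f32 / 1024.0 ** 4)
```

Even at 4 bytes per element the array is far larger than typical RAM, which
is why switching dtype alone does not solve the problem.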

