[SciPy-user] SOM in scipy.cluster

Eric Bruning eric@deeplycloudy....
Mon Aug 18 10:49:28 CDT 2008


On Aug 14, 2008, at 10:03 AM, Corran Webster wrote:

>> On Wed, Aug 13, 2008 at 10:31, Eric Bruning  
>> <eric@deeplycloudy.com> wrote:
>> > Greetings,
>> >
>> > I'm considering using self-organizing maps for mining some lightning
>> > data, and info.py in scipy.cluster mentions that implementation of
>> > self-organizing maps is under development. I've found a few examples
>> > written in Python around the web, but none that are likely to be
>> > efficient enough for my use.
>> >
>> > Is there still interest in including SOM in scipy? I'd be happy to
>> > coordinate on a contribution if nothing else is under way.
>>
>> Sure! My colleague Corran Webster was thinking about doing some SOM
>> stuff for scipy, too, so you two should talk.
>
> Hi,
>
> yes, I've been thinking seriously about adding some SOM algorithms  
> to scipy - I used them heavily in my previous job (we used SOMs to  
> classify documents for the Mayo clinic and slot machine players for  
> casinos...).
>
> I think that there is a place for a simple and fast batch SOM  
> algorithm in scipy.cluster.vq, since the batch SOM can be viewed as  
> a generalization of K-means.
>
> For more general variations of the SOM algorithm, it may make sense  
> to put them elsewhere - possibly in the machine learning scikit,  
> where the algorithms can access each other.
>
> I'd be interested to hear what your needs are as far as data types,  
> distance functions, data set sizes and SOM topologies, as that  
> would likely influence where I concentrate my energy.

Thanks for your interest!

I have two data types that I'm considering mining. One is a space and  
time tracing of lightning channels, on the order of 10^2 to 10^3  
points per flash. The spatial coordinates are inherently vectorial,  
but there is the complication of doing a distance measure along the  
time coordinate. I might want to look at the map generated by a  
single flash. I might also want to throw in 10^6 points from a  
bunch of flashes to characterize an entire thunderstorm at once.
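
One possible way to handle the time coordinate, as a rough sketch only: scale
time by a characteristic speed so it carries units of length, then treat each
point as an ordinary 4-vector with a Euclidean distance. The function name
spacetime_points and the default speed of 1e5 m/s are hypothetical
placeholders chosen for illustration, not settled choices.

```python
import numpy as np

def spacetime_points(xyz, t, speed=1.0e5):
    """Stack (x, y, z, speed * t) so one Euclidean metric covers all columns.

    xyz   : (n, 3) positions in meters
    t     : (n,) times in seconds
    speed : assumed scale factor in m/s (illustrative value only)
    """
    xyz = np.asarray(xyz, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.column_stack([xyz, speed * t])

# Two points 300 m apart in x and 4 ms apart in time
pts = spacetime_points([[0.0, 0.0, 0.0], [300.0, 0.0, 0.0]], [0.0, 0.004])
dist = np.linalg.norm(pts[1] - pts[0])  # combined space-time separation
```

The choice of speed sets how strongly separation in time counts relative to
separation in space, so it would need tuning against real flashes.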

The other data type is a collection of flash properties, which don't  
naturally form any sort of vector. These are properties like extent,  
altitude, brightness, etc. We'd like to use mapped channels to  
predict the optical signal.

SOMs are new to me, but my naive intuition finds them suited to this  
kind of exploratory data mining. Working on their implementation  
should be instructive.
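
As a starting point for discussion, here is a minimal sketch of a batch SOM
in NumPy. It assumes a rectangular grid, a Gaussian neighborhood, and
Euclidean distance; the function name batch_som and all of its parameters
are hypothetical, and nothing here reflects an existing or planned scipy API.

```python
import numpy as np

def batch_som(data, grid_shape=(5, 5), n_iter=20, sigma0=2.0, sigma_end=0.5, seed=0):
    """Minimal batch SOM; returns a codebook of shape (rows * cols, n_features)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_nodes = rows * cols

    # Grid coordinates of every node and squared map-space distances between nodes
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)

    # Initialize the codebook from randomly chosen observations
    pick = rng.choice(len(data), size=n_nodes, replace=len(data) < n_nodes)
    codebook = data[pick].copy()

    # Shrink the neighborhood width geometrically from sigma0 to sigma_end
    for sigma in np.geomspace(sigma0, sigma_end, n_iter):
        # Best-matching unit per observation (the same step as K-means assignment)
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        bmu = d2.argmin(axis=1)

        # Neighborhood weight of each node for each observation: (n_nodes, n_obs)
        h = np.exp(-grid_d2[:, bmu] / (2.0 * sigma ** 2))

        # Each node becomes the neighborhood-weighted mean of all observations
        codebook = (h @ data) / h.sum(axis=1, keepdims=True)
    return codebook

# Example on synthetic data standing in for real observations
rng = np.random.default_rng(42)
points = rng.normal(size=(500, 4))
codebook = batch_som(points, grid_shape=(4, 4))
```

As the neighborhood width shrinks toward zero, each node's update approaches
the plain mean of the observations assigned to it, which is the K-means
update. The brute-force distance array here would be too large for 10^6
points, so the assignment step would have to be done in chunks.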

-Eric

