[SciPy-dev] Machine learning datasets (was Presentation of pymachine, a python package for machine learning)

David Cournapeau david@ar.media.kyoto-u.ac...
Sun Jun 3 21:46:36 CDT 2007

Peter Skomoroch wrote:
> The licensing of datasets is an interesting issue; it sounds like they 
> will need to be tackled one by one unless explicitly released to the 
> public domain.
> Check out the wikipedia entry on "Open Data":
> http://en.wikipedia.org/wiki/Open_Data
> "Creators of data often do not consider the need to state the 
> conditions of ownership, licensing and re-use. For example, many 
> scientists do not regard the published data arising from their work to 
> be theirs to control and the act of publication in a journal is an 
> implicit release of the data into the commons. However the lack of a 
> license makes it difficult to determine the status of a data set 
> and may restrict the use of 
> data offered in an Open spirit. Because of this uncertainty it is also 
> possible for public or private organisations to aggregate such data, 
> protect it with copyright and then resell it."
> I remember a while back Leslie Kaelbling bought the Enron dataset 
> (http://www.cs.cmu.edu/~enron/) for use in machine learning.
> Maybe we can start a scipy wikipage with a list/table of datasets 
> along with license status...and check off the ones which we find are 
> not compatible so we can find replacements or get permission.  Also, 
> we might want to add a column for which modules use the data in scipy 
> tests, etc.
> Should I go ahead and create the page? 
I started something here: http://www.scipy.org/DataSets. I tried to list 
all the websites mentioned in this thread there, with license information 
where available, plus R. Kern's comment on licensing (at least in the US).



