[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Mon Mar 24 09:49:14 CDT 2008
The main problem that I have with using sparse input formats is that it
tends to ignore the complete picture. The algorithms are typically
not implemented to use sparse matrices and the associated
techniques, so the internals and outputs are not stored in a sparse
format. This means that the only gain is the apparent ease of input
because any storage advantage is lost if the input needs to be converted
to a dense format (especially if both copies are kept in memory).
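To put rough numbers on that, here is a minimal sketch. The sizes are made up and much smaller than the 1M by 150K case in this thread, and csr_nbytes is only a helper name I picked for illustration:

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 1000 documents, 5000 features, about 5 non-zeros per row.
n_docs, n_features, nnz = 1000, 5000, 5000
rng = np.random.default_rng(0)
rows = rng.integers(0, n_docs, size=nnz)
cols = rng.integers(0, n_features, size=nnz)
vals = np.ones(nnz)

# Duplicate (row, col) pairs are summed, so the stored count is at most nnz.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_docs, n_features))


def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


sparse_bytes = csr_nbytes(X)
dense_bytes = X.toarray().nbytes  # what an algorithm pays if it densifies

print(sparse_bytes, dense_bytes)
```

If the algorithm densifies its input and the caller still holds the sparse copy, the total cost is the sum of the two figures.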
Using record arrays with both masked and sparse arrays would
probably address many concerns. Record arrays would allow labels like
'target' without forcing any order in the data storage, masked arrays
would allow for missing values in the input and sparse arrays would
potentially provide storage and algorithmic advantages.
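A small sketch of the record array and masked array parts of that idea, using plain NumPy. The field names 'height' and 'weight' and the values are invented for illustration, and the sparse part is left out because I am not assuming any particular way of combining it with record arrays:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical dataset: the 'target' label is just another named field,
# so nothing depends on which column position holds it.
dtype = [("target", "i4"), ("height", "f8"), ("weight", "f8")]
raw = np.array(
    [(1, 1.80, 75.0), (0, np.nan, 62.0), (1, 1.65, np.nan)],
    dtype=dtype,
)
samples = raw.view(np.recarray)

targets = samples.target  # access by label, not by position

# Missing inputs (NaN here) become masked entries and are skipped
# by reductions such as mean().
height = ma.masked_invalid(samples.height)
mean_height = height.mean()

print(targets, mean_height)
```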
Anton Slesarev wrote:
> On Mon, Mar 24, 2008 at 3:55 PM, Nathan Bell <email@example.com> wrote:
> > On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
> > <email@example.com> wrote:
> > > > I mean that I have 1 million text pages with 150 thousand
> > > > words (features), but each document contains only a small part of
> > > > all 150 thousand words. If I use a simple matrix it will be huge,
> > > > but if I use a sparse format such as the libsvm data format then the
> > > > input file will be much smaller. I don't know how to do it with
> > > > scikits now. I know how to do it with libsvm and many other tools,
> > > > but I want to make scipy appropriate for this task. And I want to
> > > > make a tutorial in which one paragraph will be about "Sparse data".
> > >
> > > I understand sparse, I don't understand why you cannot use existing
> > > scipy implementation :)
> > Anton, can you describe libsvm's sparse format? I think it's highly
> > likely that scipy.sparse supports the functionality you need.
> > Currently you can load a sparse matrix from disk using MatrixMarket
> > format (scipy.io.mmread) or MATLAB format (scipy.io.loadmat). Both
> > of these functions should be fast enough for your 150K by 1M example.
> > FWIW the MATLAB files will generally be smaller and load faster.
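For reference, a minimal round trip through the MatrixMarket route mentioned above might look like the sketch below. The file name features.mtx and the tiny matrix are placeholders, and I have not timed this on anything close to the 150K by 1M case:

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Tiny stand-in for a document-by-feature matrix.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                                [0.0, 0.0, 0.0, 3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.mtx")  # placeholder file name
    mmwrite(path, X)                # only the non-zero entries are written
    X_loaded = mmread(path).tocsr() # convert the loaded matrix back to CSR

print(X_loaded.shape, X_loaded.nnz)
```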
> > --
> > Nathan Bell <email@example.com>
> > http://graphics.cs.uiuc.edu/~wnbell/
> > _______________________________________________
> > Scipy-dev mailing list
> > Scipyemail@example.com <mailto:Scipyfirstname.lastname@example.org>
> > http://projects.scipy.org/mailman/listinfo/scipy-dev
> libsvm format:
> "libsvm uses the so called "sparse" format where zero values do not
> need to be stored. Hence a data with attributes 1 0 2 0
> is represented as 1:1 3:2"
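Reading that format straight into scipy.sparse is not much code. Below is a rough sketch; parse_libsvm_lines is a hypothetical helper, not an existing scipy or scikits function. It assumes the usual libsvm file layout, where each line starts with a label followed by 1-based index:value pairs, and that the number of features is known in advance:

```python
import numpy as np
from scipy import sparse


def parse_libsvm_lines(lines, n_features):
    """Build a CSR matrix and a label vector from libsvm-style lines."""
    labels, data, indices, indptr = [], [], [], [0]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        labels.append(float(fields[0]))
        for item in fields[1:]:
            idx, val = item.split(":")
            indices.append(int(idx) - 1)  # libsvm indices start at 1
            data.append(float(val))
        indptr.append(len(data))  # marks the end of this row
    X = sparse.csr_matrix(
        (np.array(data, dtype=np.float64),
         np.array(indices, dtype=np.int32),
         np.array(indptr, dtype=np.int32)),
        shape=(len(labels), n_features),
    )
    return X, np.array(labels)


# First row is the "1 0 2 0" example quoted above, with a label added.
X, y = parse_libsvm_lines(["+1 1:1 3:2", "-1 2:5"], n_features=4)
print(X.toarray(), y)
```

Since lines can come from any iterable, an open file object can be passed directly and the input never has to exist as a dense array.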
> I understand that it is possible to use scipy.sparse and something
> else, but what if I need to do feature selection or some
> specific normalization? I think that we can integrate this
> procedure (with scipy.sparse and reading huge files) into the dataset
> class in scikits.learn.
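Both of those steps can stay sparse from start to finish. The sketch below is only an illustration: the document-frequency threshold and the L2 row scaling are choices I made up for the example, not something scikits.learn provides as far as I know:

```python
import numpy as np
from scipy import sparse

# Three documents, four features; the last feature never occurs.
X = sparse.csr_matrix(np.array([[3.0, 0.0, 4.0, 0.0],
                                [0.0, 0.0, 1.0, 0.0],
                                [0.0, 2.0, 0.0, 0.0]]))

# Feature selection: keep columns that occur in at least one document.
doc_freq = X.getnnz(axis=0)          # stored entries per column
keep = np.flatnonzero(doc_freq >= 1)
X_sel = X[:, keep]

# Normalization: scale every row to unit L2 norm without densifying.
norms = np.sqrt(np.asarray(X_sel.multiply(X_sel).sum(axis=1)).ravel())
norms[norms == 0] = 1.0              # leave empty rows untouched
X_norm = sparse.diags(1.0 / norms) @ X_sel

print(keep, X_norm.toarray())
```

The only dense objects here are one vector per row and one per column, so the memory cost stays proportional to the number of non-zeros.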
> Anton Slesarev