[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Mon Mar 24 08:35:02 CDT 2008
On Mon, Mar 24, 2008 at 3:55 PM, Nathan Bell <email@example.com> wrote:
> On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
> <firstname.lastname@example.org> wrote:
> > > I mean that I have 1 million text pages with 150 thousand different
> > > words (features), but each document contains only a small part of
> > > those 150 thousand words. If I use a simple dense matrix it will be
> > > huge, but if I use a sparse format such as the libsvm data format,
> > > the input file will be much smaller. I don't know how to do that with
> > > scikits now. I know how to do it with libsvm and many other tools,
> > > but I want to make scipy appropriate for this task, and I want to
> > > write a tutorial with a paragraph about "Sparse data".
> > I understand sparse; I don't understand why you cannot use the
> > existing scipy implementation :)
> Anton, can you describe libsvm's sparse format? I think it's highly
> likely that scipy.sparse supports the functionality you need.
> Currently you can load a sparse matrix from disk using MatrixMarket
> format (scipy.io.mmread) or MATLAB format (scipy.io.loadmat). Both
> of these functions should be fast enough for your 150K by 1M example.
> FWIW the MATLAB files will generally be smaller and load faster.
> Nathan Bell email@example.com
> Scipy-dev mailing list
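The MatrixMarket route Nathan suggests can be sketched roughly as follows (a minimal round-trip example; the file name `docs.mtx` is just illustrative):

```python
import numpy as np
from scipy import io, sparse

# Build a small sparse matrix in COO (coordinate) form and round-trip it
# through MatrixMarket format, which only stores the nonzero entries.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
A = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))

io.mmwrite("docs.mtx", A)      # write the nonzeros to disk
B = io.mmread("docs.mtx")      # read back as a sparse (COO) matrix

assert (A.toarray() == B.toarray()).all()
```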
"libsvm uses the so called "sparse" format where zero values do not need to
be stored. Hence a data with attributes 1 0 2 0
is represented as 1:1 3:2"
I understand that it is possible to use scipy.sparse and something else, but
what if I need to do feature selection or some specific normalization? I
think we can integrate this procedure (with scipy.sparse and reading huge
files) into the dataset class in scikits.learn.
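The kind of preprocessing described here can already be done while staying sparse; the following is a hypothetical sketch (the function name and `min_df` parameter are illustrative, not an existing scikits.learn API) that drops rare features by document-frequency cutoff and L2-normalizes each row:

```python
import numpy as np
from scipy import sparse

def select_and_normalize(X, min_df=2):
    """Keep features appearing in >= min_df documents, then L2-normalize rows."""
    X = X.tocsc()
    df = np.asarray((X != 0).sum(axis=0)).ravel()   # document frequency per feature
    X = X[:, df >= min_df].tocsr()                  # feature selection, still sparse
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0                         # avoid division by zero
    D = sparse.diags(1.0 / norms)                   # row scaling as a diagonal matrix
    return D @ X

X = sparse.csr_matrix(np.array([[1., 0., 2.],
                                [0., 0., 3.],
                                [4., 0., 0.]]))
Xn = select_and_normalize(X, min_df=1)   # the all-zero middle column is dropped
```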