[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Anton Slesarev
slesarev.anton@gmail....
Mon Mar 24 08:35:02 CDT 2008
On Mon, Mar 24, 2008 at 3:55 PM, Nathan Bell <wnbell@gmail.com> wrote:
>
> On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
> <david@ar.media.kyoto-u.ac.jp> wrote:
> > > I mean that I have 1 million text pages with 150 thousand different
> > > words (features), but each document contains only a small part of all
> > > 150 thousand words. If I use a simple dense matrix it will be huge, but
> > > if I use a sparse format such as the libsvm data format, then the input
> > > file will be much smaller. I don't know how to do it with scikits now.
> > > I know how to do it with libsvm and many other tools, but I want to
> > > make scipy appropriate for this task. And I want to make a tutorial in
> > > which one paragraph will be about "Sparse data".
> >
> > I understand sparse, I don't understand why you cannot use existing
> > scipy implementation :)
>
> Anton, can you describe libsvm's sparse format? I think it's highly
> likely that scipy.sparse supports the functionality you need.
>
> Currently you can load a sparse matrix from disk using MatrixMarket
> format (scipy.io.mmread) or MATLAB format (scipy.io.loadmat). Both
> of these functions should be fast enough for your 150K by 1M example.
>
> FWIW the MATLAB files will generally be smaller and load faster.
>
> --
> Nathan Bell wnbell@gmail.com
> http://graphics.cs.uiuc.edu/~wnbell/
>
>
>
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev@scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-dev
>
libsvm format:
"libsvm uses the so called "sparse" format where zero values do not need to
be stored. Hence a data with attributes 1 0 2 0
is represented as 1:1 3:2"
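
The format quoted above maps fairly directly onto a scipy.sparse CSR matrix. What follows is only a minimal sketch, not an existing scipy or scikits.learn function: `parse_libsvm_lines` is a hypothetical helper name, and it assumes each line begins with a numeric label followed by 1-based `index:value` pairs, as in libsvm data files.

```python
import numpy as np
from scipy.sparse import csr_matrix


def parse_libsvm_lines(lines, n_features=None):
    """Hypothetical helper: build (X, y) from libsvm-format text lines.

    Assumes each non-empty line looks like "<label> <index>:<value> ...",
    with 1-based feature indices.
    """
    data, indices, indptr, labels = [], [], [0], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        labels.append(float(parts[0]))
        for item in parts[1:]:
            idx, val = item.split(":")
            indices.append(int(idx) - 1)  # convert to 0-based column index
            data.append(float(val))
        indptr.append(len(data))  # end of this row in the data array
    if n_features is None:
        n_features = max(indices) + 1 if indices else 0
    X = csr_matrix(
        (data, indices, indptr),
        shape=(len(labels), n_features),
        dtype=np.float64,
    )
    return X, np.array(labels)


# The row "1 0 2 0" from the quote, written as "1:1 3:2" after a label.
X, y = parse_libsvm_lines(["+1 1:1 3:2", "-1 2:5"], n_features=4)
```

Only the non-zero entries are kept in memory, so the same loop could in principle be fed from a file object line by line instead of a list.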
I understand that it is possible to use scipy.sparse together with something
else, but what if I need to do feature selection or some specific
normalization? I think we can integrate this procedure (using scipy.sparse
and reading huge files) into the dataset class in scikits.learn.
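
To make the question concrete, here is a rough sketch of what feature selection and normalization could look like directly on a scipy.sparse matrix. Everything here is an assumption for illustration: the function names are hypothetical, a document-frequency threshold stands in for "feature selection", L2 row scaling stands in for "some specific normalization", and nothing of this is part of scikits.learn.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def select_by_document_frequency(X, min_df):
    """Hypothetical helper: keep columns stored in at least min_df rows."""
    X = csr_matrix(X)
    # X.indices holds the column index of every stored entry, so bincount
    # gives the number of rows in which each feature is stored.
    df = np.bincount(X.indices, minlength=X.shape[1])
    keep = np.flatnonzero(df >= min_df)
    return X[:, keep], keep


def l2_normalize_rows(X):
    """Hypothetical helper: scale each row to unit L2 norm, staying sparse."""
    X = csr_matrix(X, dtype=np.float64)
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return diags(1.0 / norms) @ X


X = csr_matrix(np.array([[3, 0, 4, 0],
                         [0, 0, 1, 0],
                         [0, 2, 0, 0]], dtype=float))
X_sel, kept = select_by_document_frequency(X, min_df=2)
X_norm = l2_normalize_rows(X)
```

Both steps work on the stored entries only, so they never build the dense documents-by-words matrix.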
--
Anton Slesarev