[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Anton Slesarev slesarev.anton@gmail....
Mon Mar 24 08:35:02 CDT 2008


On Mon, Mar 24, 2008 at 3:55 PM, Nathan Bell <wnbell@gmail.com> wrote:
>
> On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
> <david@ar.media.kyoto-u.ac.jp> wrote:
> >  > I mean that I have 1 million text pages with 150 thousand different
> >  > words (features), but each document contains only a small part of
> >  > those 150 thousand words. If I use a plain dense matrix it will be
> >  > huge, but if I use a sparse format such as the libsvm data format
> >  > then the input file will be much smaller. I don't know how to do it
> >  > with scikits now. I know how to do it with libsvm and many other
> >  > tools, but I want to make scipy appropriate for this task. And I
> >  > want to make a tutorial in which one paragraph will be about
> >  > "Sparse data".
> >
> >  I understand sparse, I don't understand why you cannot use existing
> >  scipy implementation :)
>
> Anton, can you describe libsvm's sparse format?  I think it's highly
> likely that scipy.sparse supports the functionality you need.
>
> Currently you can load a sparse matrix from disk using MatrixMarket
> format (scipy.io.mmread)  or  MATLAB format (scipy.io.loadmat).  Both
> of these functions should be fast enough for your 150K by 1M example.
>
> FWIW the MATLAB files will generally be smaller and load faster.
>
> --
> Nathan Bell wnbell@gmail.com
> http://graphics.cs.uiuc.edu/~wnbell/

libsvm format:

"libsvm uses the so called "sparse" format where zero values do not need to
be stored. Hence a data with attributes 1 0 2 0 is represented as 1:1 3:2"
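
To make the format concrete, here is a minimal sketch of reading such lines
into a scipy.sparse matrix. The function name parse_libsvm_lines is
hypothetical (it is not part of scipy or scikits), and the sketch assumes
each line starts with a numeric label followed by 1-based index:value pairs,
which is how libsvm data files are normally laid out.

```python
import numpy as np
from scipy.sparse import csr_matrix


def parse_libsvm_lines(lines, n_features=None):
    """Parse 'label idx:val idx:val ...' lines into (CSR matrix, labels).

    Hypothetical helper for illustration only.
    """
    data, indices, indptr, labels = [], [], [0], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        labels.append(float(parts[0]))
        for item in parts[1:]:
            idx, val = item.split(":")
            indices.append(int(idx) - 1)  # libsvm indices are 1-based
            data.append(float(val))
        indptr.append(len(data))  # end of this row in the data array
    if n_features is None:
        # Assumption: infer the width from the largest index seen.
        n_features = max(indices) + 1 if indices else 0
    X = csr_matrix(
        (
            np.array(data, dtype=float),
            np.array(indices, dtype=np.int32),
            np.array(indptr, dtype=np.int32),
        ),
        shape=(len(labels), n_features),
    )
    return X, np.array(labels)


# The attributes 1 0 2 0 from the quote, with an assumed label of +1.
X, y = parse_libsvm_lines(["+1 1:1 3:2"], n_features=4)
print(X.toarray())
```

Only the nonzero values are stored, so memory grows with the number of
words actually present in each document rather than with the full
vocabulary size.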


I understand that it is possible to use scipy.sparse together with something
else for loading, but what if I need to do feature selection or some specific
normalization? I think we can integrate this procedure (scipy.sparse plus
reading huge files) into the dataset class in scikits.learn.
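
As a rough sketch of the kind of steps I mean, the two functions below do a
simple feature selection and a normalization directly on a sparse matrix.
Both function names and the min_df threshold are hypothetical; nothing like
them exists in scikits.learn as far as this message is concerned, and
document-frequency filtering and L2 row scaling are just two examples chosen
for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def select_by_document_frequency(X, min_df=2):
    """Keep only the columns (features) stored in at least min_df rows."""
    X = csr_matrix(X)
    # Count stored entries per column; explicitly stored zeros would count too.
    df = np.bincount(X.indices, minlength=X.shape[1])
    keep = np.flatnonzero(df >= min_df)
    return X[:, keep], keep


def l2_normalize_rows(X):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    X = csr_matrix(X, dtype=float)
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    return diags(1.0 / norms) @ X


# Three tiny "documents" over four features.
X = csr_matrix([[1, 0, 2, 0], [3, 0, 0, 4], [0, 0, 5, 0]])
X_sel, kept = select_by_document_frequency(X, min_df=2)
X_norm = l2_normalize_rows(X_sel)
print(kept)
print(X_norm.toarray())
```

Both steps stay in sparse form from start to finish, which is what matters
for a 1 million by 150 thousand matrix.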

-- 
Anton Slesarev