[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

Nathan Bell wnbell@gmail....
Mon Mar 24 07:55:15 CDT 2008


On Mon, Mar 24, 2008 at 5:41 AM, David Cournapeau
<david@ar.media.kyoto-u.ac.jp> wrote:
>  > I mean that I have 1 million text pages with 150 thousand different
>  > words (features), but each document contains only a small part of all
>  > 150 thousand words. If I use a simple matrix it will be huge, but if
>  > I use a sparse format such as the libsvm data format then the input
>  > file will be much smaller. I don't know how to do it with scikits now.
>  > I know how to do it with libsvm and many other tools, but I want to
>  > make scipy appropriate for this task. And I want to make a tutorial in
>  > which one paragraph will be about "Sparse data".
>
>  I understand sparse, I don't understand why you cannot use existing
>  scipy implementation :)

Anton, can you describe libsvm's sparse format?  I think it's highly
likely that scipy.sparse supports the functionality you need.

Currently you can load a sparse matrix from disk using the MatrixMarket
format (scipy.io.mmread) or the MATLAB format (scipy.io.loadmat).  Both
of these functions should be fast enough for your 150K by 1M example.

FWIW the MATLAB files will generally be smaller and load faster.
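
As a rough sketch of the two round trips mentioned above (not a benchmark):
the matrix size, density, file names, and the variable name "X" are all
made up for illustration, and scipy.sparse.random is just a stand-in for a
real document-term matrix.  The writers used here, scipy.io.mmwrite and
scipy.io.savemat, are the counterparts of the two readers.

```python
import os
import tempfile

import scipy.sparse as sp
from scipy.io import loadmat, mmread, mmwrite, savemat

# Hypothetical stand-in for a document-term matrix: 1000 documents by
# 5000 words, with roughly 0.1% of the entries nonzero.
docs_by_words = sp.random(1000, 5000, density=0.001, format="csr",
                          random_state=0)

tmpdir = tempfile.mkdtemp()
mtx_path = os.path.join(tmpdir, "docs.mtx")
mat_path = os.path.join(tmpdir, "docs.mat")

# MatrixMarket: a plain-text coordinate format.
mmwrite(mtx_path, docs_by_words)
from_mtx = mmread(mtx_path).tocsr()

# MATLAB: a binary format; the matrix is stored under a variable name.
savemat(mat_path, {"X": docs_by_words})
from_mat = loadmat(mat_path)["X"].tocsr()

# Compare the on-disk sizes of the two files (in bytes).
print("MatrixMarket:", os.path.getsize(mtx_path))
print("MATLAB:      ", os.path.getsize(mat_path))
```

Both readers hand back a sparse matrix, so the dense 1000 by 5000 array is
never built; calling .tocsr() afterwards gives a format suited to row
slicing and matrix-vector products.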

-- 
Nathan Bell wnbell@gmail.com
http://graphics.cs.uiuc.edu/~wnbell/

