[Scipy-tickets] [SciPy] #326: coo_matrix.tocsr() and tocsc() are very slow

SciPy scipy-tickets at scipy.net
Thu Nov 30 14:46:30 CST 2006


#326: coo_matrix.tocsr() and tocsc() are very slow
--------------------------+-------------------------------------------------
 Reporter:  wnbell        |       Owner:  somebody              
     Type:  defect        |      Status:  new                   
 Priority:  highest       |   Milestone:                        
Component:  scipy.sparse  |     Version:  devel                 
 Severity:  major         |    Keywords:  sparse coo_matrix slow
--------------------------+-------------------------------------------------
 The coo_matrix format is used for many sparse matrix format conversions.
 However current implementation of coo_matrix.tocsr() and
 coo_matrix.tocsc() are very slow owing to the use of a sort on Python
 tuples in coo_matrix._normalize().  I've changed the tuple sort to a
 couple numpy sorts for a 10x speedup.

 The previous implementation was:
 {{{
 2277        def _normalize(self, rowfirst=False):
 2278            if rowfirst:
 2279                l = zip(self.row, self.col, self.data)
 2280                l.sort()
 2281                row, col, data = list(itertools.izip(*l))
 2282                return data, row, col
 2283            if getattr(self, '_is_normalized', None):
 2284                return self.data, self.row, self.col
 2285            l = zip(self.col, self.row, self.data)
 2286            l.sort()
 2287            # This breaks when len(self.data) etc == 0.  Does this
 matter?
 2288            col, row, data = list(itertools.izip(*l))
 2289            self.col = asarray(col, intc)
 2290            self.row = asarray(row, intc)
 2291            self.data = array(data, self.dtype)
 2292            setattr(self, '_is_normalized', 1)
 2293            return self.data, self.row, self.col
 }}}

 And the faster version is:
 {{{
     def _normalize(self, rowfirst=False):
         if rowfirst:
             if getattr(self, '_is_normalized', None):
                 #columns already sorted, use stable sort for rows
                 Pr = numpy.argsort(self.row,kind='mergesort')
                 return self.data[P], self.row[P], self.col[P]
             else:
                 #sort columns, then stable sort rows
                 Pc = numpy.argsort(self.col)
                 Pr = numpy.argsort(self.row[Pc],kind='mergesort')
                 return self.data[Pc][Pr], self.row[Pc][Pr],
 self.col[Pc][Pr]
         if getattr(self, '_is_normalized', None):
             return self.data, self.row, self.col
         #sort rows, then stable sort columns
         Pr = numpy.argsort(self.row)
         Pc = numpy.argsort(self.col[Pr],kind='mergesort')
         self.data,self.row,self.col = self.data[Pr][Pc], self.row[Pr][Pc],
 self.col[Pr][Pc]
         setattr(self, '_is_normalized', 1)
         return self.data, self.row, self.col
 }}}


 Using a 50K by 50K coo_matrix with approx 10 nnz per row I got the
 following timings:

 With the old method:
 {{{
 In [10]: A
 Out[10]:
 <50000x50000 sparse matrix of type '<type 'numpy.float64'>'
         with 500000 stored elements in COOrdinate format>

 In [11]: time B = A.tocsr()
 CPU times: user 9.03 s, sys: 0.12 s, total: 9.14 s
 Wall time: 9.22

 In [12]: time B = A.tocsc()
 CPU times: user 9.31 s, sys: 0.04 s, total: 9.36 s
 Wall time: 9.38

 In [13]: time B = A.tocsc()
 CPU times: user 0.03 s, sys: 0.00 s, total: 0.03 s
 Wall time: 0.03
 }}}


 And the new method:
 {{{
 In [19]: A
 Out[19]:
 <50000x50000 sparse matrix of type '<type 'numpy.float64'>'
         with 500000 stored elements in COOrdinate format>

 In [20]: time B = A.tocsr()
 CPU times: user 0.92 s, sys: 0.03 s, total: 0.94 s
 Wall time: 0.97

 In [21]: time B = A.tocsc()
 CPU times: user 0.93 s, sys: 0.05 s, total: 0.98 s
 Wall time: 1.00

 In [22]: time B = A.tocsc()
 CPU times: user 0.04 s, sys: 0.00 s, total: 0.04 s
 Wall time: 0.04
 }}}

 Note that the second call to tocsc() does not require a sort in either
 method.

-- 
Ticket URL: <http://projects.scipy.org/scipy/scipy/ticket/326>
SciPy <http://www.scipy.org/>
SciPy is open-source software for mathematics, science, and engineering.


More information about the Scipy-tickets mailing list