[SciPy-User] Sparse Matrices, summing columns !=sum
Aronne Merrelli
aronne.merrelli@gmail....
Fri Nov 16 12:25:30 CST 2012
On Fri, Nov 16, 2012 at 10:57 AM, Pauli Virtanen <pav@iki.fi> wrote:
> <gabe.g <at> me.com> writes:
> [clip]
>> So-- is this a bug or am I doing something wrong?
>>
>> I have a sparse matrix that I arrived at through
>> a complicated bunch of calculations which I cannot
>> reproduce here. I will try to find a simpler example of this.
> [clip]
>> In [170]: X
>> Out[170]:
>> <196980x43 sparse matrix of type '<type 'numpy.uint16'>'
>> with 70875 stored elements in Compressed Sparse Row format>
>
> It's an integer overflow due to using the same integer type
> as the accumulator.
>
> Dense Numpy arrays however appear to use the platform default integer
> as the accumulator:
>
>>>> np.array([[1,1],[1,1]], dtype=np.uint16).sum(axis=0)
> array([2, 2], dtype=uint64)
>
> IIRC, this was changed in Numpy at some point to work like this.
>
> The sparse matrices perhaps also should mirror this behavior,
> at least it would avoid overflows in the most common cases.
>
Tracing the code, it looks like the sum method of the spmatrix class
is used in this case (see base.py in the sparse module). It computes
column or row sums by a matrix product:
if axis == 0:
# sum over columns
return np.asmatrix(np.ones((1, m), dtype=self.dtype)) * self
elif axis == 1:
# sum over rows
return self * np.asmatrix(np.ones((n, 1), dtype=self.dtype))
My guess is by turning this into a multiplication it uses a more
efficient sparse matrix product. I'd think this could be modified to
take an optional keyword dtype argument instead of defaulting to the
same type as itself. The user could pick a larger dtype to prevent
overflow. Something like the following (which also might be a
workaround for the OP?):
In [1]: import scipy.sparse
In [2]: denseA = np.zeros((2,2), dtype=uint16)
In [3]: denseA[:,0] = 60000
In [4]: denseA.sum(0)
Out[4]: array([120000, 0], dtype=uint64)
In [5]: sparseA = scipy.sparse.csr.csr_matrix(denseA)
In [6]: sparseA
Out[6]:
<2x2 sparse matrix of type '<type 'numpy.uint16'>'
with 2 stored elements in Compressed Sparse Row format>
In [7]: sparseA.sum(0) # overflows
Out[7]: matrix([[54464, 0]], dtype=uint16)
In [12]: np.asmatrix(np.ones((1,2),dtype=uint64)) * sparseA # does not overflow
Out[12]: matrix([[120000, 0]], dtype=uint64)
More information about the SciPy-User
mailing list