[SciPy-User] creating sparse indicator arrays

josef.pktd@gmai... josef.pktd@gmai...
Wed Nov 30 20:04:18 CST 2011


On Tue, Nov 29, 2011 at 3:25 PM,  <josef.pktd@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 2:50 PM,  <josef.pktd@gmail.com> wrote:
>> On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
>>> On Tue, Nov 29, 2011 at 7:14 AM,  <josef.pktd@gmail.com> wrote:
>>>> Is there a simple or fast way to create a sparse indicator array, `a`
>>>> below, without going through the dense matrix first?
>>>
>>> The standard way is to use the LIL or DOK sparse formats. If you want
>>> to use them then you'll have to do your construction "by hand", though
>>> -- you can't do the nice broadcasting tricks you're using below.
>>> Alternatively, constructing CSC or CSR format directly is not that
>>> hard, though it may take some time to wrap your head around the
>>> definitions...
>>>
>>>>>>> from scipy import sparse
>>>>>>> g = np.array([0, 0, 1, 1])   #categories, integers,
>>>>>>> u = np.arange(2)    #unique's,  range(number_categories)
>>>
>>> If 'u' is *always* going to be np.arange(number_categories), then
>>> actually this is quite trivial (untested code):
>>>
>>> data = np.ones(len(g), dtype=np.int8)
>>> indices = g
>>> indptr = np.arange(len(g))
>>> a = np.csr_matrix((data, indices, indptr))
>>
>> This works nicely  (only "sparse" namespace)
>
> small correction:
> indptr needs to be 1 longer or it drops the last row
> (since I didn't RTFM, I don't know what it means, but it works)
>
>>>> g = np.array([0, 0, 1, 2, 1, 1, 2, 0])
>>>> data = np.ones(len(g), dtype=np.int8)
>>>> indptr = np.arange(len(g)+1)   #add 1
>>>> a = sparse.csr_matrix((data, g, indptr))
>>>> a.todense()
> matrix([[1, 0, 0],
>        [1, 0, 0],
>        [0, 1, 0],
>        [0, 0, 1],
>        [0, 1, 0],
>        [0, 1, 0],
>        [0, 0, 1],
>        [1, 0, 0]], dtype=int8)
>>>> np.all(a.todense() == (g[:,None] == np.arange(3)).astype(int))
> True
>
> Josef
>
>>
>> u = np.arange(number_categories)  will be a code requirement
>> (group or period labels are consecutive ints)
>>
>>>
>>> This gives you a CSR matrix, which you can either use as is or convert to CSC.
>>>
>>> If you want to build CSC directly, and want to support an arbitrary
>>> 'u' vector, then you could do something like (untested code):
>>>
>>> data = np.ones(len(g), dtype=np.int8)
>>> indices = np.empty(len(g), dtype=int)
>>> write_offset = 0
>>> indptr = np.empty(number_categories, dtype=int)
>>> for col_i, category in enumerate(u):
>>>  indptr[col_i] = write_offset
>>>  rows = (data == category).nonzero()[0]
>>>  indices[write_offset:write_offset + len(rows)] = rows
>>>  write_offset += len(rows)
>>
>> I still need to check this.
>>
>>>
>>> Or you could just use a loop that fills in an LIL matrix :-)
>>
>> I'm playing with panel data or general error component models. The
>> main point of using sparse is to have a compact, non-loop version.
>>
>> In some previous attempts at sparse the cost of constructing the array
>> with loops removed much of the advantage of using them, and I could
>> just loop in the algorithm directly.

finally got started with some timing

groupsums: 200,000 observations, 20 variables, 500 groups
sparse looks good, even better than bincount

In [69]: %timeit a = sparse.csr_matrix((data, g, np.arange(len(g)+1)))
1000 loops, best of 3: 732 us per loop

In [64]: a
Out[64]:
<200000x500 sparse matrix of type '<type 'numpy.int8'>'
	with 200000 stored elements in Compressed Sparse Row format>

In [65]: %timeit x.T * a  #sparse
100 loops, best of 3: 6.28 ms per loop

In [66]: %timeit np.array([np.bincount(g, weights=x[:,col]) for col in
range(x.shape[1])])
10 loops, best of 3: 57.5 ms per loop

In [67]: %timeit np.dot(xT, indi)
1 loops, best of 3: 1.29 s per loop

In [72]: %timeit for cat in u: x[g==cat].sum(0)
1 loops, best of 3: 635 ms per loop

In [68]: indi.shape
Out[68]: (200000, 500)

In [70]: x.shape
Out[70]: (200000, 20)


In [73]: res_dot = np.dot(xT, indi)

In [74]: res_sparse = x.T * a

In [75]: np.max(np.abs(res_dot - res_sparse))
Out[75]: 0.0

nice

Josef

>>
>> Thanks,
>>
>> Josef
>>
>>>
>>> -- Nathaniel
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>


More information about the SciPy-User mailing list