# [SciPy-User] creating sparse indicator arrays

josef.pktd@gmai... josef.pktd@gmai...
Tue Nov 29 13:50:34 CST 2011

On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Tue, Nov 29, 2011 at 7:14 AM,  <josef.pktd@gmail.com> wrote:
>> Is there a simple or fast way to create a sparse indicator array, `a`
>> below, without going through the dense matrix first?
>
> The standard way is to use the LIL or DOK sparse formats. If you want
> to use them then you'll have to do your construction "by hand", though
> -- you can't do the nice broadcasting tricks you're using below.
> Alternatively, constructing CSC or CSR format directly is not that
> hard, though it may take some time to wrap your head around the
> definitions...
>
>>>>> import numpy as np
>>>>> from scipy import sparse
>>>>> g = np.array([0, 0, 1, 1])   # categories, integers
>>>>> u = np.arange(2)    # unique values, range(number_categories)
>
> If 'u' is *always* going to be np.arange(number_categories), then
> actually this is quite trivial (untested code):
>
> data = np.ones(len(g), dtype=np.int8)
> indices = g
> # indptr needs one entry per row plus a final end pointer
> indptr = np.arange(len(g) + 1)
> a = sparse.csr_matrix((data, indices, indptr))

This works nicely (csr_matrix is in the "sparse" namespace, not np).

u = np.arange(number_categories) will be a requirement of the code
(group or period labels are consecutive integers).
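
For reference, a minimal runnable sketch of the CSR construction discussed above, assuming the labels in g are exactly 0, 1, ..., number_categories - 1. The explicit shape argument and the comparison against a dense indicator are additions for illustration, not part of the original suggestion.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])             # category label of each observation
number_categories = int(g.max()) + 1   # assumes labels are 0, 1, ..., k-1

data = np.ones(len(g), dtype=np.int8)  # exactly one nonzero entry per row
indices = g                            # column index of that entry
indptr = np.arange(len(g) + 1)         # row i owns data[i:i + 1]

a = sparse.csr_matrix((data, indices, indptr),
                      shape=(len(g), number_categories))

# Compare with the dense indicator built by broadcasting.
dense = (g[:, None] == np.arange(number_categories)).astype(np.int8)
assert (a.toarray() == dense).all()
```

Passing shape explicitly keeps the column count right even when the highest category happens to be empty in a given sample.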

>
> This gives you a CSR matrix, which you can either use as is or convert to CSC.
>
> If you want to build CSC directly, and want to support an arbitrary
> 'u' vector, then you could do something like (untested code):
>
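> data = np.ones(len(g), dtype=np.int8)
> indices = np.empty(len(g), dtype=int)
> write_offset = 0
> # one start pointer per column plus a final end pointer
> indptr = np.empty(number_categories + 1, dtype=int)
> for col_i, category in enumerate(u):
>     indptr[col_i] = write_offset
>     # rows whose label equals this category (compare g, not data)
>     rows = (g == category).nonzero()[0]
>     indices[write_offset:write_offset + len(rows)] = rows
>     write_offset += len(rows)
> indptr[-1] = write_offset
> a = sparse.csc_matrix((data, indices, indptr),
>                       shape=(len(g), number_categories))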

I still need to check this.
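
One way to run that check is to wrap the CSC loop in a function and compare it with the dense broadcasting result. This is only a sketch: the function name indicator_csc and the sample labels are made up for illustration, and it assumes the comparison is meant to be against g, with indptr holding one extra end pointer.

```python
import numpy as np
from scipy import sparse


def indicator_csc(g, u):
    """Column j of the result marks the rows where g == u[j]."""
    g = np.asarray(g)
    data = np.ones(len(g), dtype=np.int8)
    indices = np.empty(len(g), dtype=int)
    indptr = np.empty(len(u) + 1, dtype=int)
    write_offset = 0
    for col_i, category in enumerate(u):
        indptr[col_i] = write_offset
        rows = (g == category).nonzero()[0]
        indices[write_offset:write_offset + len(rows)] = rows
        write_offset += len(rows)
    indptr[-1] = write_offset
    # Trim in case some labels in g do not occur in u.
    return sparse.csc_matrix(
        (data[:write_offset], indices[:write_offset], indptr),
        shape=(len(g), len(u)))


g_example = np.array([3, 7, 3, 5])   # arbitrary integer labels
u_example = np.array([3, 5, 7])      # the categories, in column order
a_csc = indicator_csc(g_example, u_example)

dense_example = (g_example[:, None] == u_example).astype(np.int8)
assert (a_csc.toarray() == dense_example).all()
```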

>
> Or you could just use a loop that fills in an LIL matrix :-)
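
A sketch of what such an LIL loop might look like, for comparison. The dictionary col_of is a hypothetical helper that maps each category to its column; the conversion to CSR at the end is an assumption about how the matrix would be used afterwards.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])
u = np.arange(2)

# Hypothetical helper: category value -> column index.
col_of = {category: j for j, category in enumerate(u.tolist())}

a_lil = sparse.lil_matrix((len(g), len(u)), dtype=np.int8)
for row, category in enumerate(g.tolist()):
    a_lil[row, col_of[category]] = 1

a_from_lil = a_lil.tocsr()   # convert once filling is finished
```

The loop runs once per observation in Python, which is the construction cost that the direct CSR or CSC approach avoids.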

I'm playing with panel data or general error component models. The
main point of using sparse is to have a compact, non-loop version.
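
As an illustration of what the non-loop version buys, here is a sketch of group sums, group means and within-group demeaning done with the indicator matrix alone. The data vector x is made up, and the choice of these three quantities is an assumption about typical panel computations rather than something taken from this thread.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])              # group label per observation
x = np.array([1.0, 3.0, 10.0, 20.0])    # made-up observations

n_obs = len(g)
n_groups = int(g.max()) + 1             # assumes labels are 0, 1, ..., k-1
a = sparse.csr_matrix((np.ones(n_obs), g, np.arange(n_obs + 1)),
                      shape=(n_obs, n_groups))

group_sums = a.T.dot(x)                              # sum of x within each group
group_counts = np.asarray(a.sum(axis=0)).ravel()     # observations per group
group_means = group_sums / group_counts
demeaned = x - a.dot(group_means)                    # subtract each row's group mean
```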

In some previous attempts with sparse matrices, the cost of constructing
the array with loops removed much of the advantage of using them, and I
could just as well have looped in the algorithm directly.
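
If the labels were not consecutive integers, one loop-free alternative (not proposed in this thread, so treat it as an assumption) would be to let np.unique recode them and build the matrix in COO form. The sample labels below are made up.

```python
import numpy as np
from scipy import sparse

labels = np.array([10, 30, 10, 20])     # arbitrary, non-consecutive labels

# uniq holds the sorted categories; codes maps each label to its column.
uniq, codes = np.unique(labels, return_inverse=True)
codes = codes.ravel()

rows = np.arange(len(labels))
ones = np.ones(len(labels), dtype=np.int8)
a_coo = sparse.coo_matrix((ones, (rows, codes)),
                          shape=(len(labels), len(uniq)))
a_any = a_coo.tocsr()                   # convert for arithmetic and slicing
```

The recoding sorts the labels once, so no Python-level loop over observations or categories is needed.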

Thanks,

Josef

>
> -- Nathaniel