[SciPy-User] creating sparse indicator arrays

josef.pktd@gmai... josef.pktd@gmai...
Tue Nov 29 14:25:02 CST 2011


On Tue, Nov 29, 2011 at 2:50 PM,  <josef.pktd@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 2:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
>> On Tue, Nov 29, 2011 at 7:14 AM,  <josef.pktd@gmail.com> wrote:
>>> Is there a simple or fast way to create a sparse indicator array, `a`
>>> below, without going through the dense matrix first?
>>
>> The standard way is to use the LIL or DOK sparse formats. If you want
>> to use them then you'll have to do your construction "by hand", though
>> -- you can't do the nice broadcasting tricks you're using below.
>> Alternatively, constructing CSC or CSR format directly is not that
>> hard, though it may take some time to wrap your head around the
>> definitions...
>>
>>>>>> from scipy import sparse
>>>>>> g = np.array([0, 0, 1, 1])   #categories, integers,
>>>>>> u = np.arange(2)    #unique's,  range(number_categories)
>>
>> If 'u' is *always* going to be np.arange(number_categories), then
>> actually this is quite trivial (untested code):
>>
>> data = np.ones(len(g), dtype=np.int8)
>> indices = g
>> indptr = np.arange(len(g))
>> a = np.csr_matrix((data, indices, indptr))
>
> This works nicely  (only "sparse" namespace)

small correction:
indptr needs to be 1 longer or it drops the last row
(since I didn't RTFM, I don't know what it means, but it works)

>>> g = np.array([0, 0, 1, 2, 1, 1, 2, 0])
>>> data = np.ones(len(g), dtype=np.int8)
>>> indptr = np.arange(len(g)+1)   #add 1
>>> a = sparse.csr_matrix((data, g, indptr))
>>> a.todense()
matrix([[1, 0, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 0, 0]], dtype=int8)
>>> np.all(a.todense() == (g[:,None] == np.arange(3)).astype(int))
True

Josef

>
> u = np.arange(number_categories)  will be a code requirement
> (group or period labels are consecutive ints)
>
>>
>> This gives you a CSR matrix, which you can either use as is or convert to CSC.
>>
>> If you want to build CSC directly, and want to support an arbitrary
>> 'u' vector, then you could do something like (untested code):
>>
>> data = np.ones(len(g), dtype=np.int8)
>> indices = np.empty(len(g), dtype=int)
>> write_offset = 0
>> indptr = np.empty(number_categories, dtype=int)
>> for col_i, category in enumerate(u):
>>  indptr[col_i] = write_offset
>>  rows = (data == category).nonzero()[0]
>>  indices[write_offset:write_offset + len(rows)] = rows
>>  write_offset += len(rows)
>
> I still need to check this.
>
>>
>> Or you could just use a loop that fills in an LIL matrix :-)
>
> I'm playing with panel data or general error component models. The
> main point of using sparse is to have a compact, non-loop version.
>
> In some previous attempts at sparse the cost of constructing the array
> with loops removed much of the advantage of using them, and I could
> just loop in the algorithm directly.
>
> Thanks,
>
> Josef
>
>>
>> -- Nathaniel
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>


More information about the SciPy-User mailing list