[SciPy-User] vectorized version of 'multinomial' sampling function

josef.pktd@gmai... josef.pktd@gmai...
Wed Oct 14 23:07:01 CDT 2009

```On Wed, Oct 14, 2009 at 11:05 PM, per freem <perfreem@gmail.com> wrote:
> <josef.pktd@gmail.com> wrote:
>
>
>>>> If you only want n=1, then it seems to me, you could do this also with
>>>> drawing an integer out of range(3) with the given probabilities. This
>>>> can be vectorized easily directly with numpy. (just like an individual
>>>> observation in multinomial logit)
>>>>
>>>> Josef
>>>>
>>>>>
>>>>> to get a vector of multinomial samplers, each using the nth list in
>>>>> 'p'. something like:
>>>>>
>>>>> array([[1, 0, 0], [0, 0 1]]) in this case. is this possible? it seems
>>>>> like 'multinomial' takes only a one dimensional array. i could write
>>>>> this as a "for" loop of course but i prefer a vectorized version since
>>>>> speed is crucial for me here.
>>>
>>>
>>> for n=1 the following should work (unless I still cannot tell
>>> multinomial distribution and multinomial logit apart)
>>>
>>> Josef
>>>
>>>>>> p = np.array([[ 0.9 ,  0.05,  0.05],
>>>       [ 0.05,  0.05,  0.9 ]])
>>>>>> pcum = p.cumsum(1)
>>>>>> pcuma = np.repeat(pcum,20, 0)
>>>>>> rvs = np.random.uniform(size=(40))
>>>>>> a,b,c = pcuma.T
>>>>>> mnrvs = np.column_stack((rvs<=a, (rvs>a) & (rvs<=b), rvs>b)).astype(int)
>>>>>> mnrvs[:20].sum()
>>> 20
>>>>>> mnrvs[:20].sum(0)
>>> array([19,  0,  1])
>>>>>> mnrvs[20:].sum()
>>> 20
>>>>>> mnrvs[20:].sum(0)
>>> array([ 1,  0, 19])
>>>>>> mnrvs[15:25]
>>> array([[0, 0, 1],
>>>       [1, 0, 0],
>>>       [1, 0, 0],
>>>       [1, 0, 0],
>>>       [1, 0, 0],
>>>       [0, 0, 1],
>>>       [0, 0, 1],
>>>       [0, 0, 1],
>>>       [1, 0, 0],
>>>       [0, 0, 1]])
>>>
>>>
>>> check for larger sample:
>>>
>>>>>> pcuma = np.repeat(pcum,500, 0)
>>>>>> rvs = np.random.uniform(size=(1000))
>>>>>> a,b,c = pcuma.T
>>>>>> mnrvs = np.column_stack((rvs<=a, (rvs>a) & (rvs<=b), rvs>b)).astype(int)
>>>>>> mnrvs[:500].sum(0)
>>> array([456,  17,  27])
>>>>>> mnrvs[500:].sum(0)
>>> array([ 24,  35, 441])
>>>>>> p*500
>>> array([[ 450.,   25.,   25.],
>>>       [  25.,   25.,  450.]])
>>>
>>
>> similar works for e.g. n=5
>>
>> Josef
>
>
> hi Josef,
>
> thanks very much for your reply. i see how this works for the specific
> case where the multinomial vector of probabilities has 3 numbers in
> it, for which you defined the corresponding variables "a", "b" and
> "c". but is there a way to rewrite this code to make it work for an
> arbitrarily sized vector of probabilities? for example, i'm not sure
> how one would generalize this line:
>
> mnrvs = np.column_stack((rvs<=a, (rvs>a) & (rvs<=b), rvs>b)).astype(int)
>
> any ideas on how to do this would be greatly appreciated.  thanks again.
>

Here is what I used for the multinomial logit case

rvsunif = np.random.rand(nobs,1)
yrvs = (rvsunif<np.cumsum(pr,axis=1)).argmax(1)[:,np.newaxis]

the code I parked  here a while ago:

after cleaning up this will eventually move to statsmodels.

It's a while ago that I did this, so I'm not sure I remember all the
reasons I did things, but in the examples it worked.

I think, pr should be the column vector of probabilities defined by
the multinomial choice regression with the same number of rows as
rvsuinf (nobs).
yrvs might be the index of the choice/realization and not the
[[0,0,1],...] array. there should be a trick to get the indicator
back. I found it at the bottom of the script:

>>> dummyvar = (yrvs == np.arange(nk)).astype(int)
>>> dummyvar[:10,:]
array([[1, 0, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 1, 0],
[0, 1, 0],
[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1]])

Hope that helps,

Josef
```