[Numpy-discussion] vectorizing
josef.pktd@gmai...
josef.pktd@gmai...
Fri Jun 5 15:41:45 CDT 2009
On Fri, Jun 5, 2009 at 4:35 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Fri, Jun 5, 2009 at 1:31 PM, <josef.pktd@gmail.com> wrote:
>> On Fri, Jun 5, 2009 at 4:27 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>> On Fri, Jun 5, 2009 at 1:22 PM, Brent Pedersen <bpederse@gmail.com> wrote:
>>>> On Fri, Jun 5, 2009 at 1:05 PM, Keith Goodman<kwgoodman@gmail.com> wrote:
>>>>> On Fri, Jun 5, 2009 at 1:01 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>>>>> On Fri, Jun 5, 2009 at 12:53 PM, <josef.pktd@gmail.com> wrote:
>>>>>>> On Fri, Jun 5, 2009 at 2:07 PM, Brian Blais <bblais@bryant.edu> wrote:
>>>>>>>> Hello,
>>>>>>>> I have a vectorizing problem that I don't see an obvious way to solve. What
>>>>>>>> I have is a vector like:
>>>>>>>> obs=array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>>> and a matrix
>>>>>>>> T=zeros((6,6))
>>>>>>>> and what I want in T is a count of all of the transitions in obs, e.g.
>>>>>>>> T[1,2]=3 because the sequence 1-2 happens 3 times, T[3,4]=1 because the
>>>>>>>> sequence 3-4 only happens once, etc... I can do it unvectorized like:
>>>>>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>>>>>> T[o1,o2]+=1
>>>>>>>>
>>>>>>>> which gives the correct answer from above, which is:
>>>>>>>> array([[ 0., 0., 0., 0., 0., 0.],
>>>>>>>> [ 0., 0., 3., 0., 0., 1.],
>>>>>>>> [ 0., 3., 0., 1., 0., 0.],
>>>>>>>> [ 0., 0., 2., 0., 1., 0.],
>>>>>>>> [ 0., 0., 0., 2., 0., 0.],
>>>>>>>> [ 0., 0., 0., 0., 1., 0.]])
>>>>>>>>
>>>>>>>>
>>>>>>>> but I thought there would be a better way. I tried:
>>>>>>>> o1=obs[:-1]
>>>>>>>> o2=obs[1:]
>>>>>>>> T[o1,o2]+=1
>>>>>>>> but this doesn't give a count, it just yields 1's at the transition points,
>>>>>>>> like:
>>>>>>>> array([[ 0., 0., 0., 0., 0., 0.],
>>>>>>>> [ 0., 0., 1., 0., 0., 1.],
>>>>>>>> [ 0., 1., 0., 1., 0., 0.],
>>>>>>>> [ 0., 0., 1., 0., 1., 0.],
>>>>>>>> [ 0., 0., 0., 1., 0., 0.],
>>>>>>>> [ 0., 0., 0., 0., 1., 0.]])
>>>>>>>>
>>>>>>>> Is there a clever way to do this? I could write a quick Cython solution,
>>>>>>>> but I wanted to keep this as an all-numpy implementation if I can.
>>>>>>>>
>>>>>>>
>>>>>>> histogram2d or its imitation, there was a discussion on histogram2d a
>>>>>>> short time ago
>>>>>>>
>>>>>>>>>> obs=np.array([1,2,3,4,3,2,1,2,1,2,1,5,4,3,2])
>>>>>>>>>> obs2 = obs - 1
>>>>>>>>>> trans = np.hstack((0,np.bincount(obs2[:-1]*6+6+obs2[1:]),0)).reshape(6,6)
>>>>>>>>>> re = np.array([[ 0., 0., 0., 0., 0., 0.],
>>>>>>> ... [ 0., 0., 3., 0., 0., 1.],
>>>>>>> ... [ 0., 3., 0., 1., 0., 0.],
>>>>>>> ... [ 0., 0., 2., 0., 1., 0.],
>>>>>>> ... [ 0., 0., 0., 2., 0., 0.],
>>>>>>> ... [ 0., 0., 0., 0., 1., 0.]])
>>>>>>>>>> np.all(re == trans)
>>>>>>> True
>>>>>>>
>>>>>>>>>> trans
>>>>>>> array([[0, 0, 0, 0, 0, 0],
>>>>>>> [0, 0, 3, 0, 0, 1],
>>>>>>> [0, 3, 0, 1, 0, 0],
>>>>>>> [0, 0, 2, 0, 1, 0],
>>>>>>> [0, 0, 0, 2, 0, 0],
>>>>>>> [0, 0, 0, 0, 1, 0]])
>>>>>>>
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>>>>>> re
>>>>>>> array([[ 0., 0., 0., 0., 0., 0.],
>>>>>>> [ 0., 0., 3., 0., 0., 1.],
>>>>>>> [ 0., 3., 0., 1., 0., 0.],
>>>>>>> [ 0., 0., 2., 0., 1., 0.],
>>>>>>> [ 0., 0., 0., 2., 0., 0.],
>>>>>>> [ 0., 0., 0., 0., 1., 0.]])
>>>>>>>>>> h
>>>>>>> array([[ 0., 0., 0., 0., 0., 0.],
>>>>>>> [ 0., 0., 3., 0., 0., 1.],
>>>>>>> [ 0., 3., 0., 1., 0., 0.],
>>>>>>> [ 0., 0., 2., 0., 1., 0.],
>>>>>>> [ 0., 0., 0., 2., 0., 0.],
>>>>>>> [ 0., 0., 0., 0., 1., 0.]])
>>>>>>>
>>>>>>>>>> np.all(re == h)
>>>>>>> True
>>>>>>
>>>>>> There's no way my list method can beat that. But by adding
>>>>>>
>>>>>> import psyco
>>>>>> psyco.full()
>>>>>>
>>>>>> I get a total speed up of a factor of 15 when obs is length 10000.
>>>>>
>>>>> Actually, it is faster:
>>>>>
>>>>> histogram:
>>>>>
>>>>>>> h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>>>> timeit h, e1, e2 = np.histogram2d(obs[:-1], obs[1:], bins=6, range=[[0,5],[0,5]])
>>>>> 100 loops, best of 3: 4.14 ms per loop
>>>>>
>>>>> lists:
>>>>>
>>>>>>> timeit test(obs3, T3)
>>>>> 1000 loops, best of 3: 1.32 ms per loop
>>>>> _______________________________________________
>>>>> Numpy-discussion mailing list
>>>>> Numpy-discussion@scipy.org
>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>
>>>>
>>>>
>>>> here's a go:
>>>>
>>>> import numpy as np
>>>> import random
>>>> from itertools import groupby
>>>>
>>>> def test1(obs, T):
>>>> for o1,o2 in zip(obs[:-1],obs[1:]):
>>>> T[o1][o2] += 1
>>>> return T
>>>>
>>>>
>>>> def test2(obs, T):
>>>> s = zip(obs[:-1], obs[1:])
>>>> for idx, g in groupby(sorted(s)):
>>>> T[idx] = len(list(g))
>>>> return T
>>>>
>>>> obs = [random.randint(0, 5) for z in range(10000)]
>>>>
>>>> print test2(obs, np.zeros((6, 6)))
>>>> print test1(obs, np.zeros((6, 6)))
>>>>
>>>>
>>>> ##############
>>>>
>>>> In [10]: timeit test1(obs, np.zeros((6, 6)))
>>>> 100 loops, best of 3: 18.8 ms per loop
>>>>
>>>> In [11]: timeit test2(obs, np.zeros((6, 6)))
>>>> 100 loops, best of 3: 6.91 ms per loop
>>>
>>> Nice!
>>>
>>> Try adding
>>>
>>> import psyco
>>> psyco.full()
>>>
>>> to test1. Or is that cheating?
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>
>> how about
>> from scipy import ndimage
>> ndimage.histogram((obs[:-1])*6 +obs[1:], 0, 35, 36 ).reshape(6,6)
>>
>> (bincount doesn't take the limits into account if they don't show up
>> in the data, so first and last position would need special casing)
>
> Game over:
>
>>> from scipy.ndimage import histogram
>>> timeit histogram((obs[:-1])*6 +obs[1:], 0, 35, 36 ).reshape(6,6)
> 1000 loops, best of 3: 366 µs per loop
> Game over:
maybe not, ndimage is very fast but crash prone (try feeding the wrong
type) and will eventually be rewritten.
Josef
More information about the Numpy-discussion
mailing list