# [SciPy-User] Identify unique sequence data from array

josef.pktd@gmai... josef.pktd@gmai...
Wed Dec 22 23:24:50 CST 2010

```On Thu, Dec 23, 2010 at 12:12 AM,  <josef.pktd@gmail.com> wrote:
> On Wed, Dec 22, 2010 at 11:49 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
>> On Wed, Dec 22, 2010 at 6:00 PM, Charles R Harris
>> <charlesr.harris@gmail.com> wrote:
>>>
>>>
>>> On Wed, Dec 22, 2010 at 2:51 PM, Robert Kern <robert.kern@gmail.com> wrote:
>>>>
>>>> On Wed, Dec 22, 2010 at 16:47, Paul Anton Letnes
>>>> <paul.anton.letnes@gmail.com> wrote:
>>>> >
>>>> > On 22. des. 2010, at 21.18, otrov wrote:
>>>> >
>>>> >>>> The problem:
>>>> >>
>>>> >>>> I have 2D data sets (scipy/numpy arrays) of 10^7 to 10^8 rows, which
>>>> >>>> consists of repeated sequences of one unique sequence, usually ~10^5 rows,
>>>> >>>> but may differ in scale. Period is same for both columns, so there is not
>>>> >>>> really difference if we consider 2D or 1D array.
>>>> >>>> I want to track this data block.
>>>> >>
>>>> >>> for i in range(1, len(X)-1):
>>>> >>>    if (X[i:] == X[:-i]).all():
>>>> >>>        break
>>>> >>
>>>> >> Just look at that python beauty! Such a great language when in hand of
>>>> >> a smart user.
>>>> >> Thanks for you snippet, but unfortunately it takes forever to finish
>>>> >
>>>> > You could also check one element at a time. I think it will be faster,
>>>> > because it will break if comparison of the first element doesn't hold. Then,
>>>> > if you find such an occurrence, use Robert's method to double check that you
>>>> > found the true repetition period.
>>>>
>>>> Excellent point.
>>>>
>>>
>>> Why not do an FFT and look at the shape around the carrier frequency? The DC
>>> level should probably be subtracted first. It shoud also be possible to
>>> construct a Weiner filter to extract the sequences if they don't occur with
>>> strict periods.
>>>
>>
>> Could you give an example?  I've used a convolution to find the number
>> of successive discrete events, but I'm not sure how to generalize it
>> or if your suggestion is similar.
>>
>> For example, to count the number of three successes in a row for
>> Bernoulli trials and to find where
>>
>> In [1]: import numpy as np
>>
>> In [2]: x = np.array([1,1,1,0,0,1,0,1,0,1,1,1,0,1,0,1])
>>
>> In [3]: y = np.array([1,1,1])
>>
>> In [4]: idx = np.convolve(x,y)
>>
>> In [5]: num_runs = len(np.where(idx==len(y))[0])
>>
>> In [6]: # Extract runs from original array
>>
>> In [6]: idx = np.where(idx==len(y))[0]-(len(y)-1)
>>
>> In [7]: idx = np.hstack([np.arange(i,i+3) for i in idx])
>>
>> In [8]: x[idx]
>> Out[8]: array([1, 1, 1, 1, 1, 1])
>
> It's different because in the original case you want to find the
> periodicity, for example calculate the periodogram/fft and find the
> spike. This should give the frequency and then the period length. If
> there are small numerical problems, it would still narrow down the
> range to search with direct comparison.
>
> Josef
> (finding runs sounds like a fun problem, more than multiple tests and
> comparisons)

a one liner, just for fun and not relevant for the question
>>> x = np.array([1,1,1,0,0,1,0,1,0,1,1,1,0,1,0,1])
>>> np.bincount(np.diff(np.nonzero(np.diff(np.r_[[-np.inf], x, [np.inf]]))[0]))
array([0, 8, 1, 2])
>>> (_*np.arange(4)).sum() == len(x)
True

Josef

>
>>
>> Skipper
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
>
```