[SciPy-User] String Matching in SciPy

Pauli Virtanen pav@iki...
Sat Oct 9 10:03:52 CDT 2010

```
Sat, 09 Oct 2010 15:21:15 +0200, Lorenzo Isella wrote:
[clip]
> where L is the length I am looking for calculated for various choices of
> i. In the end of the day, I need a sort of built-in grep function for
> Python, but the first step is to understand if there is an efficient way
> to detect whether a certain substring (in the future of i) is a subset
> of the string giving the past of i.
> Any suggestion is welcome.

As far as I know, there's no built-in function in NumPy for doing this.
There are probably several ways to proceed, among them:

(i)

Python's regular-expression module (re) also works with buffers,
so you can use it directly on character arrays:

>>> import numpy as np
>>> import re
>>> x = np.array(list('asdasdasds'), dtype='S1')
>>> x
array(['a', 's', 'd', 'a', 's', 'd', 'a', 's', 'd', 's'],
      dtype='|S1')
>>> re.search(b'sda', x[:4]).start()
1

This does not copy the data to a string, so it should be efficient.

If you need to find all occurrences, you can do

>>> matches = re.finditer(b'sda', x)
>>> offsets = [m.start() for m in matches]
>>> offsets
[1, 4]

If you have a large number of matches, this approach may become
less efficient, as it needs to form a Python match object for each
match.

(ii)

Write a simple function in Cython that does the string matching,
and returns an integer array of offsets.

--
Pauli Virtanen

```
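For reference, here is a minimal sketch of how approach (i) could be wrapped so that it returns an integer array of offsets, which is the interface suggested for (ii). It is plain Python rather than Cython, so it still creates one match object per hit. The names `find_offsets` and `occurs_in_past` are hypothetical helpers invented for this illustration, and the code assumes Python 3 with a 1-D array of dtype 'S1'.

```python
import re

import numpy as np


def find_offsets(chars, needle):
    """Return start offsets of non-overlapping occurrences of `needle`.

    `chars` is assumed to be a 1-D array of dtype 'S1' and `needle` a
    bytes object; both names are hypothetical, chosen for this sketch.
    """
    # Reinterpret the characters as raw bytes. For an array that is
    # already contiguous this is a view, so the data is not copied.
    raw = memoryview(np.ascontiguousarray(chars).view(np.uint8))
    # re.escape makes the needle match literally, not as a regex.
    matches = re.finditer(re.escape(needle), raw)
    return np.fromiter((m.start() for m in matches), dtype=np.intp)


def occurs_in_past(chars, i, length):
    """Check whether chars[i:i+length] also occurs somewhere in chars[:i]."""
    needle = chars[i:i + length].tobytes()  # small copy of the needle only
    return find_offsets(chars[:i], needle).size > 0


x = np.array(list('asdasdasds'), dtype='S1')
offsets = find_offsets(x, b'sda')    # offsets 1 and 4, as in the session above
seen_before = occurs_in_past(x, 3, 3)  # 'asd' at i=3 also occurs in x[:3]
```

Note that `re.finditer` reports non-overlapping matches only, so a needle that can overlap itself (for example `b'aa'` in `b'aaa'`) is counted fewer times than a sliding-window search would count it.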