[SciPy-Dev] GSoC: Speed Up 2to3 in Pattern Matching

gombiuda JHL gombiuda@gmail....
Wed Mar 31 18:12:40 CDT 2010

Hi all, first of all, apologies for the length of this proposal.
My name is Gombiuda Jiang. Here is my proposal. I am eager to hear all
kinds of comments and advice. Thank you for reading it.

GSoC: Speed Up 2to3 in Pattern Matching

First, thank you for taking the time to look over my proposal.

Idea Background
As the PSF pays more and more attention to porting projects from Python 2.x
to Python 3.x, developers in the community have started putting effort into
porting their projects. Fortunately, the `2to3` tool automates this porting
work to some extent. However, `2to3` has its limitations. Among them, the
speed of the conversion can be improved by replacing the current
implementation. More specifically, `2to3` is slow because its pattern
matching uses back-tracking. Back-tracking is easy to understand and
implement, which is why it was the first approach used. In addition, the
implementation uses recursion rather than iteration, so as the workload of
`2to3` grows, the cost of creating more and more stack frames for the
recursive calls becomes apparent. It is therefore worth adopting a
different algorithm to speed up the pattern-matching process.

Pattern matching is a classical problem with several efficient algorithms.
One of them is the `McNaughton-Yamada-Thompson` construction, which is based
on NFA (Nondeterministic Finite Automaton) theory. It can efficiently decide
whether a string (a token stream works equally well) matches a given regular
expression, which covers the basic requirements of pattern matching in the
`2to3` project.
After implementing the algorithm and passing the test cases that verify it
works correctly, we can, if necessary, improve it further using DFA theory,
which may cost more memory and pattern-compilation time but matches patterns
faster.
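To make the idea concrete, here is a minimal, self-contained sketch of NFA simulation over a token stream (all names here are hypothetical illustrations, not `2to3` code): instead of back-tracking, we carry the whole *set* of states the NFA could currently be in and advance that set one token at a time, so matching is linear in the input length.

```python
# Sketch: simulating an NFA over a token stream, in the spirit of the
# McNaughton-Yamada-Thompson construction. Transitions map
# (state, symbol) -> set of successor states; symbol None models an
# epsilon (empty) move.

def epsilon_closure(states, transitions):
    """All states reachable from `states` via epsilon (None) moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_match(tokens, transitions, start, accepting):
    """Run the NFA on `tokens`; True iff an accepting state is reached."""
    current = epsilon_closure({start}, transitions)
    for tok in tokens:
        nxt = set()
        for s in current:
            nxt |= transitions.get((s, tok), set())
        current = epsilon_closure(nxt, transitions)
        if not current:          # no live state left: cannot match
            return False
    return bool(current & accepting)

# NFA for the pattern "a b* c": 0 --a--> 1, 1 --b--> 1, 1 --c--> 2
transitions = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}
print(nfa_match(["a", "b", "b", "c"], transitions, 0, {2}))  # True
print(nfa_match(["a", "c", "b"], transitions, 0, {2}))       # False
```

Because every token is consumed exactly once and the state set is bounded by the number of NFA states, there is no exponential back-tracking.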

The details are as follows:

In the original implementation, a fixer's pattern string is first tokenized
and then compiled into a pattern syntax tree, which is then matched against
the syntax tree of the Python 2.x source code using back-tracking.
Here we reuse the original tokenization result, but feed it into an
`NFAGenerator` to generate the NFA. Finally, we feed a node of the syntax
tree generated from the Python 2.x source code into the NFA, which outputs
`True` or `False`. The remaining process (the replacement step) is
unchanged.

So the changes are a new `NFAGenerator` class and a new `NFAPattern` class
representing the result produced by `NFAGenerator`. Here is a model of the
two classes:

class NFAPattern(object):
    # `content`: the original pattern to be matched
    # `graph`: a 2D array storing the transition diagram

    def __init__(self, content):
        self.graph = None
        self.content = content

    def optimize(self):
        # Maybe implemented later using DFA or graph-reduction algorithms
        pass

    def match(self, node, result=None):
        # Traverse the tree to find all matching nodes (as roots) and put
        # them into `result` to be passed out
        pass

class NFAGenerator(object):
    # Basically the same as the original `PatternCompiler`, except for the
    # implementation of the compile_node() method.

    def __init__(self):
        # Implemented like the original in `PatternCompiler`
        pass

    def compile_pattern(self, input, debug=False):
        tokens = tokenize_wrapper(input)
        root = self.driver.parse_tokens(tokens, debug=debug)
        return self.compile_node(root)

    def compile_node(self, node):
        # Implemented with the `McNaughton-Yamada-Thompson` construction,
        # compiling a pattern string into an NFA
        pass

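The `optimize` step sketched above could use the classical subset construction to convert the NFA into a DFA, trading compile time and memory for single-pass matching, as noted in the Background. A self-contained sketch (hypothetical names, not `2to3` code):

```python
# Sketch: subset construction. Each DFA state is a frozenset of NFA
# states; transitions map (state, symbol) -> set of NFA successors,
# with symbol None as an epsilon move.

def nfa_to_dfa(transitions, start, accepting, alphabet):
    def closure(states):
        """Epsilon closure of a set of NFA states."""
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in transitions.get((s, None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    start_d = closure({start})
    dfa, dfa_accepting = {}, set()
    todo, seen = [start_d], {start_d}
    while todo:
        state = todo.pop()
        if state & accepting:
            dfa_accepting.add(state)
        for sym in alphabet:
            nxt = set()
            for s in state:
                nxt |= transitions.get((s, sym), set())
            nxt = closure(nxt)
            dfa[(state, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return dfa, start_d, dfa_accepting

# NFA for the pattern "a b* c" again
nfa = {(0, "a"): {1}, (1, "b"): {1}, (1, "c"): {2}}
dfa, s0, acc = nfa_to_dfa(nfa, 0, {2}, {"a", "b", "c"})

def dfa_match(tokens):
    """One lookup per token: no state *sets* tracked at match time."""
    state = s0
    for tok in tokens:
        state = dfa.get((state, tok), frozenset())
    return state in acc

print(dfa_match(["a", "b", "c"]))  # True
```

The DFA can have exponentially more states than the NFA in the worst case, which is exactly the memory/compile-time trade-off mentioned above.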
With this improvement, the process will be much faster than before. But
there is still potential for more. That's right: parallelization.

Put simply, imagine we have a bunch of files to be converted. Multithreading
can be applied easily because there are no ordering dependencies between
files. As a result, the changes would be:

# Inside refactor.py
import os
import threading

class RefactoringTool(object):
    # ....
    THREADS = 10

    def refactor_dir(self, dir_name, write=False, doctests_only=False):
        lock = threading.Lock()

        def generate_filenames(dir_name):
            for dirpath, dirnames, filenames in os.walk(dir_name):
                for name in filenames:
                    if not name.startswith(".") and name.endswith(".py"):
                        yield os.path.join(dirpath, name)

        filenames = generate_filenames(dir_name)

        def refactor_one_file():
            while True:
                with lock:
                    # a generator must not be advanced by two threads at once
                    fullname = next(filenames, None)
                if fullname is None:
                    return
                self.refactor_file(fullname, write, doctests_only)

        threads = [threading.Thread(target=refactor_one_file)
                   for i in range(self.THREADS)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
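The fan-out pattern above can be exercised on its own. This standalone sketch (hypothetical names; a recording worker stands in for `refactor_file`) distributes work items to a fixed pool of threads via a `queue.Queue`:

```python
import queue
import threading

def process_all(filenames, worker, n_threads=4):
    """Fan `filenames` out to `n_threads` workers; each item handled once."""
    work = queue.Queue()
    for name in filenames:
        work.put(name)

    def run():
        while True:
            try:
                name = work.get_nowait()  # thread-safe: no explicit lock needed
            except queue.Empty:
                return
            worker(name)

    threads = [threading.Thread(target=run) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

processed = []
lock = threading.Lock()

def record(name):
    # stand-in for refactor_file(); just records what it was given
    with lock:
        processed.append(name)

process_all(["a.py", "b.py", "c.py"], record)
print(sorted(processed))  # ['a.py', 'b.py', 'c.py']
```

One caveat: under CPython's GIL, threads only speed things up when the work releases the GIL (e.g. file I/O); for CPU-bound refactoring, a `multiprocessing`-based pool may be needed for a real speedup.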

Time Schedule of GSoC
April 27~May 6: Communicate with the mentor to gather more requirements
May 7~May 16: Read more documentation to pin down the implementation
May 17~May 26: Write the structure of the new implementation
May 27~June 6: Write test cases and a benchmark tool for measuring
efficiency
June 7~July 6: Write the new implementation
July 7~July 11: Testing and bug fixing
July 12~July 16: Mid-term evaluation
July 17~July 26: Improve the NFA to a DFA, or reduce the NFA graph, or add
multithreading support
July 27~August 6: Scrub code and test
August 7~August 16: Write documentation
August 17: Pencils down

About Me
Name: Gombiuda Jiang
University: Sun Yat-sen University, China
Major: Software Engineering
Email/Gtalk: gombiuda@gmail.com

Knowledge Background:
    - 9 years of programming experience: 6 years focused on learning
algorithms and data structures, and 3 years of software development
    - Some experience researching face detection and recognition using
Hidden Markov Model algorithms
    - Some experience researching text clustering with artificial
    - Fundamental courses at the software school, such as `Software
Engineering`, `Operating System`, ...

    - First prize in the NOIP district competition (primary group), and
thus admitted to high school with bonus points
    - First prize in the NOIP district competition (high school group), and
thus recommended to Sun Yat-sen University
    - Third prize in the `Software Innovation Competition` at Sun Yat-sen
    - Second prize in the `Commercial Software Innovation Competition` held
by Citizen Bank

Development Experience:
    - An RSS news reader written in C++ in 2007
    - A commercial website for the Citizen Bank competition, using Java SSH
    - An educational online-judge website using Python and Django, which is
in use at our software school
    - Some small tools, such as a file-format converter, an automatic
Internet voter, a cheat tool for the Travian game, and so on

Why Me?
I have been using Linux as my default system for years, and I like open
source very much. I want to keep contributing my energy to the community.
I have been using Python for nearly 2 years and I like it very much for its
efficiency. I have read part of the source code of Python and came to
understand its internal structure. As a result, I'd like to take part in
developing Python projects, especially the core and the built-in libraries.
And with strong algorithmic skills, I think I can manage to do it well.

Future View
I would like to join the group of Python developers with the help of this
project's success. If you see my potential, please let me in; I will try my
best to learn and work hard, and not disappoint you.

Finally, thanks again for reading my proposal.

Gombiuda Jiang