Representing Regular Expressions as a Syntax Tree

In developing a chatbot, I recently became curious about generating regular expressions from a set of examples. I thought if I can acquire enough examples of a particular command, I could use regex to determine the intent of the message and any entities within the message and process the command. Creating a system that processes a set of strings and substrings to extract using regex was my first thought.

After a quick search on the internet, I found a few resources, which I'll link below that have more than extensively researched this topic. My goal for the following blog posts is to describe the steps I used to come up with my solution to this problem. I've quite barely scratched the surface in trying to solve this problem, and I'm aware that there may be better solutions, but I stumbled across one that I found particularly intriguing. This research paper describes a process for automatically generating regex from examples using genetic programming. With little more research in hand, I decided it would be a fun project to implement.

A Brief Overview of Genetic Programming

In order to write a genetic program using regular expressions, I needed to understand and represent regular expressions in a way that makes genetic programming easier. If you're familiar with genetic programming, you can skip ahead, but I'll briefly describe why I decided to use syntax trees.

In artificial intelligence, genetic programming (GP) is a technique of evolving programs, starting from a population of unfit (usually random) programs, fit for a particular task by applying operations analogous to natural genetic processes to the population of programs. https://en.wikipedia.org/wiki/Genetic_programming

Simply put, genetic programming is like Darwinian evolution of programs. In order to evolve a program, there are a few steps: fitness, crossover, mutation, and a termination condition. The fitness step is how you access your programs and deem them fit or unfit for the next step. Crossover is like chromosomal crossover, where two chromosomes swap genetic material producing new chromosomes with different genetic material. Likened to chromosomal crossover, genetic programming crossover is the step of swapping the fit portions of one program with another. Mutation, is the process of randomly modifying a fit program to see if it improves its fitness in the next generation. Finally, the termination step determines if your genetic program should continue to evolve or stop because it's found a fit enough population.

From my initial thought process, representing regex as strings makes the process of crossover and mutation difficult without a lofty set of rules and string manipulation. It's possible that it can be done, but conceptually, it was easier to think of regex as syntax trees. With a tree, I can snip of a branch and attach a new one for the mutation and crossover processes. This meant that I could boil the problem down to simple tree traversal and manipulation techniques.

Understanding Regular Expression Grammar

Until now, I only vaguely understood the regex syntax and operations. This project forced me to better understand valid and invalid regex so I could represent them as trees. An analogous representation that helped me understand regex is mathematical expressions. The expression 2 + 3 * 4 can be represented as a tree:

  +
 / \
2   *
   / \
  3   4

You can see that the numbers 2, 3, and 4 are leaves in the tree, and the operations are the nodes that create branching. The operation precedence is represented in the structure of the tree. In order to evaluate this, you'd run a function eval at the root, and run that same function on the children. Doing this recursively will yield you the result.

Unpacking this for regular expressions, I can represent them as a tree if I think of everything as either an operation or a literal. Literals are the leaves of the tree, and operations are parents of other operations or literals.

Literals

It took me a while to figure out what I wanted as a literal. At first I knew I wanted a-z and A-Z, but I wasn't sure how to represent these. Should I simply included all letters of the alphabet, upper and lower case? Should I just include the range? Do I include the range operation if I do this?

I decided on including all characters of the alphabet and all numbers. Additionally, I decided to include the \d, \w, and . literals. This decision will be important in a future blog post. These operators are more general than a single character like a. \d is any digit so it will match 0-9. \w is more broad still. It matches all alphanumeric characters, so a-z, A-Z, and 0-9. Finally, the . is even more broad. It matches any character, as far as I know.

Operations

I consulted this to help me understand the various operations in regex. It was more straightforward than I thought.

I created a little chart describing the operations and vaguely what they do:

Operator	Name	Description
`?`	question	matches zero or one of the preceding character
`*`	star	matches zero or more of the preceding character
`+`	plus	matches one ore more of the preceding character
`[a-c]`	range	matches a character in the range
`a	b`	bar
`[]`	list	matches any character in the list
`()`	group	matches a subexpression as a group
	concat	concatenates two expressions

I made a few observations looking at this table:

Operations either operate on single subexpression or multiple subexpressions
Single subexpression operations operate on the preceding character
The concat operation is probably the most important operation because with out it, I couldn't represent arbitrarily complex trees
The concat operation while it can take an arbitrary set of subexpressions, it's possible to represent any regex if it is a binary operator.
The list operation can also be represented as a binary operator
The range operator can be represented by a binary operator

Simple Regex as a Tree

Armed with these observations, I could represent any expression that uses these operations as a tree. Next, I worked through some simple examples to demonstrate the feasibility of this solution. I'd write a syntax tree that would match the input strings.

Input:

'aab', 'ab'

Solution Regex Tree:

     concat
     /    \
concat     b
 /  \
a    *
     |
     a

While this is by no means a mathematical proof, it was enough for me to start figuring out how to represent it in code, but that's for another post.

References

Posted on August 26, 2019