Automata Minimization and Beyond: A Systematic Evaluation of DFA-based Pattern Matching
thesis
posted on 2025-05-01, 00:00authored byFilippo Giovanni Del Nero
Pattern-matching based on Regular Expressions (REs) is crucial in several applications, including network intrusion detection, Deep Packet Inspection (DPI), genome analysis, spam filters, database queries, and natural language preprocessing. Finite State Automata (FSAs) are a representation with equivalent power to REs commonly employed to analyze patterns against data efficiently. FSAs can be divided into two macro categories: Deterministic Finite State Automata (DFAs) and Non-Deterministic Finite State Automata (NFAs). The former ensures bounded execution times at the cost of state-explosion issues – as they have a larger memory footprint – while the latter significantly reduces memory usage but demands higher throughput capability in the processing engine. In this context, numerous works aim at compressing DFAs to reduce their memory footprint while exploiting their attractive computation characteristics. These compressed DFAs often require modifying their execution algorithm to perform pattern matching on a data stream with new automata representation. However, most of these techniques focus on state and transition compression, potentially lacking a thorough evaluation of the impact the compression has when running the modified execution algorithms. This thesis proposes a new framework to systematically analyze two classes of highly effective DFA-compression algorithms stressing their impact on pattern-matching execution. Firstly, this thesis characterizes the most impacting DFA-based compression techniques regarding state compression, transition compression, and compression time. Then, it thoroughly assesses the memory footprint and execution time of pattern-matching applications exploiting these compression algorithms across modern benchmarks and architectures, thus enriching the analysis proposed in previous works. With this in-depth evaluation, this thesis clarifies the tradeoff between memory compression and the execution impact of DFA-based compression algorithms in a modern context, supporting the design of ad-hoc pattern matching engines.
History
Advisor
Wenjing Rao
Department
Computer Science
Degree Grantor
University of Illinois Chicago
Degree Level
Masters
Degree name
MS, Master of Science
Committee Member
Piotr Gmytrasiewicz
Zhiling Lan
Marco Santambrogio