Goals

1. Flesh out the dynamic programming algorithm for matching regular expressions
2. Parsing context-free grammars, CYK algorithm
3. Parsing regular expressions in one pass, in linear time: "finite automata"

open Base

type regex =
  | Empty                    (* Reject all strings. *)
  | Singleton of string      (* Singleton "abc": match the string "abc", reject everything else. *)
  | Concat of regex * regex  (* Concat(r1, r2). If the string can be split into two, so that the
                                first part matches r1 and the second part matches r2, then match
                                the whole string. *)
  | Union of regex * regex   (* Union(r1, r2). Match the string if it either matches r1 or it
                                matches r2. "r1 | r2". "r1 union r2". "r1 + r2". *)
  | Star of regex            (* Star(r1). Match the string if it can be split into zero or more
                                pieces, each of which matches r1. *)

Without Kleene star, each regular expression matches only a finite number of strings.

Empty <> Singleton "". In the arithmetic analogy below, Empty plays the role of 0 and
Singleton "" plays the role of 1.

No string can ever match Concat(r, Empty), regardless of r.
    string = "_______ matches r|__________ cannot match Empty"
    Concat(r, Empty) = Empty
    r * 0 = 0

A string matches Concat(r, Singleton "") iff it matches r.
    string = "___________ matches r | ___________ matches Singleton """
    Concat(r, Singleton "") = r
    r * 1 = r

Union(r, Empty) = r
    r + 0 = r

Union(r, Singleton "") is not very interesting.

Running example: matches "abcde" (Star r1)

let rec matches (str : string) (r : regex) : bool =
  match r with
  | Empty -> false
  | Singleton s -> String.equal s str
  | Concat (r1, r2) ->
    (* Break it into all possible combinations of two strings, check them all.
       |1|2|3|4|5|6|7|8| *)
    let split_lens = List.range 0 (1 + String.length str) in
    let splits =
      List.map split_lens ~f:(fun l -> (String.prefix str l, String.drop_prefix str l))
    in
    List.exists splits ~f:(fun (s1, s2) -> matches s1 r1 && matches s2 r2)
  | Union (r1, r2) -> matches str r1 || matches str r2
  | Star r1 ->
    (* ______________
       ==> ____ matches r1|____ matches r1|____ matches r1|__ matches r1|__________ matches r1
       ==> ____ matches r1|_________________ matches r1* *)
    (* Complaint 1: Incorporate case where Kleene-* is of length 0 *)
    (* Complaint 2: Doesn't finish. Non-terminating computation. Base-case is ill-defined. *)
    if String.length str = 0 then true
    else
      (* The first piece must be non-empty, so split lengths start at 1. A first piece of
         length 0 would lead to the call
           matches "" r1 && matches "abcde" (Star r1)
         which repeats the original question and never terminates. *)
      let split_lens = List.range 1 (1 + String.length str) in
      let splits =
        List.map split_lens ~f:(fun l -> (String.prefix str l, String.drop_prefix str l))
      in
      List.exists splits ~f:(fun (s1, s2) -> matches s1 r1 && matches s2 r)

(* This algorithm is a bad idea for two reasons:
   1. Sub-optimal algorithm, use finite automata, if applicable.
   2. Doesn't memoize computations. Repeated computation and exponential blowup.

   matches "abbbab" Star(r)
   ==> "a|bb|ab". Will eventually call matches "ab" r
   ==> "ab|b|ab". Will eventually call matches "ab" r

   Multiple paths to same function call. Repeated computation. *)

The same kind of repeated computation shows up in the naive Fibonacci function:

let fib n = ... fib (n - 1) + fib (n - 2)

fib 5 = fib 4 + fib 3
      = (fib 3 + fib 2) + fib 3

To match context-free grammars:

0. Given a string s, can the start symbol Start expand to s?
   "Does s match the pattern encoded by the start symbol Start?"

1. For every substring s1 of the full string s, for every non-terminal symbol NT,
   check if NT can expand to s1.

   In ourparser.mly, top_expr is the start symbol.
   { top_expr, expr_nt } are the non-terminal symbols.
   The tokens INT, LPAREN, RPAREN, PLUS, MINUS, etc. are the terminal symbols.

2. We can schedule these computations to look like a dynamic programming tree.

   expr ::= expr; PLUS; expr

   a. To check if a string s matches expr, split it, in all possible ways, into three
      pieces s = s1 s2 s3. Check if expr expands to s1, and s2 = PLUS, and expr expands
      to s3. If all yes, then expr can expand to s.

3. Maintain a three-dimensional memo table:
   First dimension: Start index of the substring
   Second dimension: Length of the substring
   Third dimension: Non-terminal symbol
   Base case: Where second dimension = 0 or second dimension = 1.

4. "3 + 4 - (8 + 9)" ==> (INT 3, PLUS, INT 4, MINUS, LPAREN, INT 8, PLUS, INT 9, RPAREN)

   Does this sequence of tokens match expr_nt?

   Grammar is:

   expr_nt:
     | c = INT                           { Int c }          <---- Rule 1
     | e1 = expr_nt; MINUS; e2 = expr_nt { Minus(e1, e2) }  <---- Rule 2
     | e1 = expr_nt; PLUS; e2 = expr_nt  { Plus(e1, e2) }   <---- Rule 3
     | LPAREN; e = expr_nt; RPAREN       { e }              <---- Rule 4

   Does "" match expr_nt? No.

   Does "INT 3" match expr_nt? Yes, because of rule 1.

   Does "PLUS" match expr_nt?
     Can "PLUS" be matched using rule 1? No, it is not of the form INT.
     Can it be matched using rule 2? No, because there is no MINUS.
     Can it be matched using rule 3? Possibly, we need more time.
       There is only one plausible way in which to apply rule 3:
       s1 = "", s2 = PLUS, s3 = "".
       But we already know that s1 = "" does not match expr_nt.
       Therefore, "PLUS" cannot be matched against expr_nt using rule 3.
     Can it be matched using rule 4? No, because there is no LPAREN.

   Strings of length 1 which match expr_nt = { INT 3, INT 4, INT 8, INT 9 }.

   Strings of length 2 which match expr_nt = {}
     For example, does "INT 3, PLUS" match expr_nt?
       Does rule 1 apply? No.
       Does rule 2 apply? No, because there is no MINUS.
       Does rule 3 apply? Possibly. The only feasible choice is
       s1 = INT 3, s2 = PLUS, s3 = "", and "" does not match expr_nt.
       So rule 3 does not apply.
       Rule 4 does not apply either.
     Does "PLUS, INT 4" match expr_nt? No.
     Does "INT 4, MINUS" match expr_nt? No.
     ...

This algorithm needs O(n^3 |G|) time to match a string of length n against a grammar of size |G|.

Imagine your source file contained 10000 characters (10KB).
You would need ~ ____ * (10^4)^3 * ______ = 10^12 * _______ ==> ~1000 seconds (roughly 17 minutes).
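
The memo-table scheme from steps 1 to 3 above can be sketched for the single non-terminal expr_nt as follows. This is only a minimal sketch under some assumptions, not the course's parser: it uses the OCaml standard library (Hashtbl, Array) instead of Base so that it runs without extra dependencies, and the token type and the names matches_expr, compute and binary are made up for illustration. Because there is only one non-terminal here, the memo key is just (start index, length); with several non-terminals the key would also carry the non-terminal, giving the three-dimensional table described above.

```ocaml
(* Hypothetical token type, mirroring the terminal symbols named in the notes. *)
type token = INT of int | PLUS | MINUS | LPAREN | RPAREN

(* Can expr_nt expand to the whole token array?
   memo maps (start, len) to the answer for tokens.(start .. start + len - 1). *)
let matches_expr (tokens : token array) : bool =
  let n = Array.length tokens in
  let memo : (int * int, bool) Hashtbl.t = Hashtbl.create 64 in
  let is_int = function INT _ -> true | _ -> false in
  let is_op = function PLUS | MINUS -> true | _ -> false in
  let rec expr start len =
    (* Look the answer up first, so each (start, len) is computed only once. *)
    match Hashtbl.find_opt memo (start, len) with
    | Some answer -> answer
    | None ->
      let answer = compute start len in
      Hashtbl.add memo (start, len) answer;
      answer
  and compute start len =
    if len = 0 then false                            (* "" never matches *)
    else if len = 1 then is_int tokens.(start)       (* Rule 1 *)
    else
      (* Rule 4: LPAREN; expr_nt; RPAREN *)
      let rule4 =
        tokens.(start) = LPAREN
        && tokens.(start + len - 1) = RPAREN
        && expr (start + 1) (len - 2)
      in
      rule4 || binary start len 1
  and binary start len k =
    (* Rules 2 and 3: the left operand has length k, the operator sits at
       index start + k, and the right operand has length len - k - 1. *)
    if k > len - 2 then false
    else if is_op tokens.(start + k)
         && expr start k
         && expr (start + k + 1) (len - k - 1)
    then true
    else binary start len (k + 1)
  in
  expr 0 n
```

For the token sequence from step 4, matches_expr [| INT 3; PLUS; INT 4; MINUS; LPAREN; INT 8; PLUS; INT 9; RPAREN |] evaluates to true, while matches_expr [| INT 3; PLUS |] evaluates to false, in line with the hand analysis above. There are O(n^2) table entries and each one tries O(n) split points, which is where the n^3 factor in the running time comes from.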