The recognizing power of "modern" regexes?

Question

What class of languages do real modern regexes actually recognise?Whenever there is an unbounded length capturing group with a back-reference (e.g. `(.*)_\1`) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like `S ::= '(' S ')' | ε` — the context-free language of matching pairs of parens.Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG _and_ the opposite?Links to relevant papers would be much appreciated.

Maitrik K. · Accepted Answer

Modern regular expressions (regexes) with features like back-references, recursive patterns, and unbounded-length capturing groups, go beyond the power of traditional finite automata and regular languages. Here’s a breakdown of the concepts and limitations in the context of regexes and context-free languages (CFLs):Modern Regexes and Their Power

Regular Expressions and Regular Languages: Traditional regexes (without back-references or recursion) recognize only regular languages, which are the languages that can be represented by finite automata (FA). These languages are described by regular grammars, which are typically simpler and less expressive than context-free languages (CFLs).

Back-references and Non-Regular Languages: When regexes include back-references (e.g., (.*)\1), they can recognize non-regular languages. This is because back-references allow for pattern matching that requires the automaton to "remember" parts of the string to match later, which introduces a level of memory that finite automata do not possess. In essence, back-references create dependencies between different parts of the string, which makes the language non-regular.
Example: (.*)\1 can match strings like ababc, where the part of the string before and after \1 must match.
These kinds of regexes typically recognize languages that are more powerful than regular languages but do not necessarily reach the complexity of context-free languages (CFLs).

Recursive Regexes: Recursive regexes (supported by engines like PCRE, Perl, etc.) are capable of matching nested structures, such as balanced parentheses, which is a hallmark of context-free languages (e.g., S ::= '(' S ')' | ε). This allows regexes to perform tasks that are typically reserved for context-free grammars (CFGs).

Example: A regex like $(?:[^()]|(?R))*$ in PCRE can match balanced parentheses, something that regular expressions (without recursion or back-references) cannot do.

Recursive regexes extend the expressive power of traditional regexes but do not necessarily match the full power of context-free grammars (CFGs). They can handle many CFLs but not all.
Limitations of Modern Regexes

Not All Context-Free Languages Are Recognizable: Even with recursion and back-references, there are certain context-free languages (CFLs) that modern regexes cannot recognize. This is because there are languages that require more sophisticated memory structures (e.g., stack-based memory) to recognize, which goes beyond the capabilities of modern regex engines that use finite memory for back-references and recursion.

Non-CFLs Recognizable by Regex: While most context-free languages can be recognized by recursive regexes, there are languages that can be matched by regular expressions (even with back-references) but cannot be represented by a context-free grammar (CFG). The addition of recursive patterns allows for increased complexity, but it's still within certain boundaries.

Limited Memory and Stack Simulation: Regex engines (even those with recursion) typically use a limited memory model, unlike pushdown automata (PDAs) which can simulate a stack. The inability to handle unbounded recursion in the way PDAs do means regex engines can't fully replicate the power of CFGs.
Relation to LL or LR Grammars:
LL and LR Grammars: LL and LR grammars are subsets of context-free grammars (CFGs), designed for top-down and bottom-up parsing, respectively. While these types of grammars are more structured and can be parsed by specific types of automata, they do not always capture the full range of context-free languages. A modern regex engine with recursion and back-references might be able to handle many CFGs, but not all. There are certain CFGs (especially those requiring stack-based memory management) that would be beyond the reach of regex engines.Research and Papers on Modern Regexes:﻿This topic has been explored in various research papers, but much of the work focuses on the complexity of matching non-regular languages with regular expressions, and how back-references and recursion can be seen as augmenting finite automata to achieve more powerful recognition.

"Regular Expressions and Context-Free Languages" by Jeffrey D. Ullman is a good starting point to understand the theoretical foundations and limitations of regexes versus CFGs.

"The Power of Regular Expressions" by David Maier and Jeffrey D. Ullman discusses the extended expressiveness of regexes and their limitations relative to CFGs.
The PCRE (Perl Compatible Regular Expressions) manual and its formal specifications provide insights into how recursive regexes are implemented and their capabilities.
In conclusion, modern regexes, especially those with back-references and recursion, recognize a broader class of languages than traditional regular expressions, but they still cannot recognize the full range of context-free languages. Some languages that are recognizable by regexes are beyond the reach of CFGs, while others are still outside the power of regexes, requiring more sophisticated parsing mechanisms like pushdown automata.

The recognizing power of "modern" regexes?

1 Expert Answer

Modern Regexes and Their Power

Limitations of Modern Regexes

Relation to LL or LR Grammars:

Research and Papers on Modern Regexes:

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

Write an employee payroll program that uses polymorphism to calculate and print the weekly payroll for your company

Hey, I have this question for Matlab, and I am completely lost on what I need to do .

Under what circumstances would you use a sequential file over a database?

Help writing this program homework

Using Visual Basic write Program

RECOMMENDED TUTORS

IXL

Rosetta Stone

Education.com

TPT

Vocabulary.com

ABCya

SpanishDictionary.com

Inglés.com

Emmersion

The recognizing power of "modern" regexes?

1 Expert Answer

Modern Regexes and Their Power

Limitations of Modern Regexes

Relation to LL or LR Grammars:

Research and Papers on Modern Regexes:

﻿

Still looking for help? Get the right answer, fast.

OR

RELATED TOPICS

RELATED QUESTIONS

Write an employee payroll program that uses polymorphism to calculate and print the weekly payroll for your company

Hey, I have this question for Matlab, and I am completely lost on what I need to do .

Under what circumstances would you use a sequential file over a database?

Help writing this program homework

Using Visual Basic write Program

RECOMMENDED TUTORS

find an online tutor