Overview
- The Pattern Specification Language (PSL) provides pattern matching. A given PSL program is applied to target text, and the search engine determines where (or whether) the specified patterns appear in the input. For example, a program could match all occurrences of "cat" in the target text:
-
Catherine medicated her cat in Decatur.
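- As a rough illustration of what such a search reports, the following Python sketch lists every offset at which "cat" occurs in the sample text. It is not PSL and says nothing about how the search engine is implemented. The helper name find_all is hypothetical, and the choice of case-insensitive substring matching (no word boundaries) is an assumption made only for this sketch.

```python
def find_all(needle: str, text: str) -> list[int]:
    """Return the start offset of every occurrence of needle in text.

    Assumption for this sketch: matching ignores ASCII case and does
    not require word boundaries. Overlapping occurrences are reported.
    """
    haystack = text.lower()
    needle = needle.lower()
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


target = "Catherine medicated her cat in Decatur."
hits = find_all("cat", target)  # offsets inside Catherine, medicated, cat, Decatur
```

- Under these assumptions the sample text yields four hits; a program that required whole words or exact case would report fewer.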
- Note:
- This specification uses program as a general term for a set of PSL pattern-matching instructions. See ref page_transunits for a more complete description of the term.
- This PSL reference manual specifies:
- the representation of PSL programs;
- the syntax and constraints of the PSL language;
- the semantic rules for interpreting PSL programs;
- the representation of input data to be processed by PSL programs.
- This PSL reference manual does not specify:
- the mechanism by which PSL programs are transformed for use by the execution search engine;
- the mechanism by which PSL programs are invoked for use by the execution search engine;
- the representation of executable PSL programs;
- the representation of output data produced by PSL programs;
- the size or complexity of a program and its data that will exceed the capacity of any specific data-processing system, during either program translation or program execution.
- The document introduces low-level concepts first, building the foundation for later sections.
- ref page_lex. This section specifies the lexical tokens used by the language.
- ref page_strings. As part of scanning target text, strings specify the textual content to match in the input.
- ref page_windows. Whereas strings specify what to match, a window specifies how much of the target text to examine during the matching process.
- ref page_defs. This section describes definitions, such as ref defs_pattern_macros "pattern macros" and ref defs_directives "compiler directives". Although not directly executable, these definitions support ref page_queries "queries" and provide ways to affect the ref main_translation_environment "translation environment" and the ref main_execution_environment "execution environment".
- ref page_idexprs. The language provides some functions for examining text or testing conditions on the target text. Additionally, a program can define its own ref defs_pattern_macros "pattern macros" and use them like pattern functions. Both intrinsic functions and macros are invoked with identifier expressions.
- ref page_patterns. A complete PSL matching rule is called a pattern.
- ref page_queries. This section consolidates preceding material. A query is the "executable unit" of a PSL program.
- ref page_transunits. This section describes the collective PSL input that can be given to the compiler.
- ref page_syntax. This section collects all the syntax descriptions for reference.
- ref page_defaults. This section describes the default configuration values.
- ref page_impl_brutesquad. This section describes implementation details that are specific to Brutesquad.
Conceptual Models
- Translation Environment
- After a PSL program is written, a translator converts the textual program into an executable representation (not specified here). The translator is responsible for checking the program syntax, possibly combining multiple translation units, and generating a program image that can be used to search target text. That executable program image can be presented immediately to the execution environment, or it can be preserved for later use. If the program is revised, the updated PSL is retranslated, and the new program image is then used for search execution.
- Execution Environment
- Once translated, a program can be applied to arbitrary input text. A search engine interprets the program's instructions, applies those conditions to the input target text, and determines whether the target text satisfies the program's constraints. If so, the target text is said to match the program. The search engine provides means to record these matches (but that mechanism is outside the scope of this manual).
- Technically, PSL refers to the language. For brevity, this manual also applies the term to the execution environment when no confusion would arise.
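- The split between the translation environment and the execution environment can be sketched in Python as follows. This is an analogy only: a regular expression stands in for a PSL program, a pickled compiled pattern stands in for the program image, and the names translate and execute are hypothetical. The actual representation of executable PSL programs is outside the scope of this manual.

```python
import pickle
import re


def translate(source: str) -> bytes:
    """Check the source and return a storable stand-in for a program image.

    re.compile raises re.error when the source is malformed, which plays
    the role of the translator's syntax check in this sketch.
    """
    program = re.compile(source)
    return pickle.dumps(program)


def execute(image: bytes, target: str) -> bool:
    """Apply a previously translated image to target text.

    Returns True when the target text matches the program.
    """
    program = pickle.loads(image)
    return program.search(target) is not None


image = translate("cat")                  # translate once
first = execute(image, "Decatur")         # reuse the same image ...
second = execute(image, "Indianapolis")   # ... on different target text
```

- The point of the sketch is the shape of the workflow: translation happens once per program revision, while the resulting image can be applied to any number of target texts.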
- Target Text Scanning
- When the search engine applies a program to target text, it scans that text from left to right. The program's constraints are checked against the input text, and the search engine declares a match when all the constraints are satisfied. These constraints can be simple (find all instances of "cat") or complex.
- Conceptually, the search engine scans the target text a single time, without backing up. Depending on the complexity of the program, many constraints can be active simultaneously, and the search engine tracks all of them during the scan.
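- The single-pass model can be sketched in Python as follows. This is a conceptual illustration, not the search engine's algorithm: it handles only literal, case-sensitive strings, and the name scan_once is hypothetical. Each partially matched string is kept as an active candidate and advanced one character at a time, so the text is never re-read.

```python
def scan_once(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Report (start_offset, pattern) for every occurrence of each literal.

    The text is read once, left to right. Matches are reported in the
    order in which they complete.
    """
    matches = []
    active = []  # (pattern, start offset, characters matched so far)
    for offset, ch in enumerate(text):
        # Any non-empty pattern may begin at the current character.
        candidates = active + [(p, offset, 0) for p in patterns if p]
        active = []
        for pattern, start, matched in candidates:
            if pattern[matched] != ch:
                continue  # this candidate fails and is dropped
            matched += 1
            if matched == len(pattern):
                matches.append((start, pattern))
            else:
                active.append((pattern, start, matched))
    return matches


result = scan_once(["cat", "ate"], "Catherine medicated her cat")
```

- While the scan is inside "medicated", the candidates for "cat" and "ate" are active at the same time, which is the situation the paragraph above describes.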
Character Sets
- PSL was designed for searching ASCII text. Although it can be used with Unicode (especially the UTF-8 encoding scheme), some limitations exist. Both the translation environment and the execution environment assume the following character set attributes:
- A character occupies eight bits.
- The textual character set is ASCII.
- Except for the ASCII subset in the UTF-8 encoding scheme, Unicode code points consume at least 16 bits and are not treated as individual units.
- ref lex_identifiers "Identifiers" in the translation environment can use only ASCII characters.
- Strings' ref strings_quoted_strings "word recognition" and ref strings_simple_chars "case sensitivity" features work only with ASCII target text.
- Some ref strings_wildcards "wildcards" recognize only ASCII characters.
- The ref page_strings section includes additional information about Unicode.
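- The consequences of the eight-bit assumption can be seen in a short Python sketch. The helper names byte_units and ascii_fold are hypothetical, and ascii_fold is only a stand-in for ASCII-only case handling, not a description of PSL's mechanism. The UTF-8 byte values themselves are standard.

```python
def byte_units(text: str) -> list[int]:
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


def ascii_fold(data: bytes) -> bytes:
    """Lower-case ASCII letters only; leave every other byte unchanged."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


# "caf\u00e9" is "cafe" with an acute accent on the final letter.
widths = byte_units("caf\u00e9")

# Folding the upper-case form touches only the ASCII letters, so the
# accented capital (U+00C9) keeps its original two bytes.
folded = ascii_fold("CAF\u00c9".encode("utf-8"))
same_as_lower = folded == "caf\u00e9".encode("utf-8")
```

- Here the accented letter occupies two eight-bit units, and ASCII-only folding does not make the upper-case and lower-case spellings compare equal, which is why case-insensitive matching is reliable only for ASCII target text.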
Future Directions
- TBD
Notation Conventions
- Language Syntax
- In the syntax notation used by following sections, syntactic categories (nonterminals) appear in italic type; literal words and character set members (terminals) appear in bold type. A colon (:) following a nonterminal introduces its definition. Alternative definitions appear on separate lines, except when prefaced by the words "one of". An optional symbol is indicated by the subscript "opt".
- The following example defines quoted-string-token with two alternatives:
- an optional string-token-sequence surrounded with double quotes (");
- an optional string-token-sequence surrounded with single quotes (').
- The example then defines letter as any single lower-case or upper-case letter character (ASCII letters only).
- quoted-string-token:
    " string-token-sequence_opt "
    ' string-token-sequence_opt '
- letter: one of
    a b c d e f g h i j k l m
    n o p q r s t u v w x y z
    A B C D E F G H I J K L M
    N O P Q R S T U V W X Y Z
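- One way to read these two definitions is as small recognizers, sketched here in Python. The function names are hypothetical. Because string-token-sequence is not defined in this section, the sketch assumes it is any run of characters that does not contain the enclosing quote character; the real definition appears elsewhere in the manual and may differ.

```python
import string

# The 52 terminals listed under "letter: one of".
LETTERS = set(string.ascii_letters)


def is_letter(ch: str) -> bool:
    """True when ch is exactly one ASCII lower-case or upper-case letter."""
    return len(ch) == 1 and ch in LETTERS


def is_quoted_string_token(token: str) -> bool:
    """True when token fits either alternative of quoted-string-token."""
    if len(token) < 2:
        return False
    quote = token[0]
    if quote not in ('"', "'") or token[-1] != quote:
        return False
    body = token[1:-1]  # may be empty, because the sequence is optional
    # Placeholder rule (assumption): the body must not contain the quote.
    return quote not in body
```

- The two alternatives of quoted-string-token become the two accepted quote characters, and the "opt" marker becomes the fact that an empty body is accepted.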
- Code Examples
- Code examples are shown using fixed width font:
{ 1 sentence -> 'Bach' and 'Beethoven' }
- Target Text
- Target text examples appear as below, with hit positions indicated by underlined, bold, red text. A sample block can show both the PSL and target text.
-
"two"
Sample target text with two
matches on two lines.