Computerworld

Pattern matching with PHP

Matching patterns in strings is important in just about all Web applications that deal with data. PHP has a large number of pattern matching functions and extensions, tightly integrated into the language itself. The easiest pattern in a string to match is a single character. This can be accomplished in PHP as follows:

Python vs. PHP: Choosing your next project's language

01 <?
02 $c = 'a';
03 $str = 'string with a';
04 $found = 0;
05 for($i=0;$i<strlen($str);$i++) {
06 if($str[$i] == $c) {
07 		$found = 1;
08 		break;
09 	 }
10 }
11 ?>

In this basic script, we loop through each character of $str, testing it against the character we are looking for, $c. If we find it, we set $found to true and break the loop.

Conveniently, PHP has a function which does exactly this. The following does the same:

01 <?
02 $found = (strpos($str,$c) === false ? 0 : 1);
03 ?>

The function strpos() searches for $c in $str. It returns the offset of the first occurrence of $c, or false if $c isn't found. Notice that we use '===' to ensure that false is being returned and not 0, which is the first character of $str.

Another common requirement is to find the portion of a string matching a number of characters. For example, say you wanted to read a string containing numeric data up until a non-numeric character appeared. Consider the following:

01 <?
02 $nums = "1234567890";
03 $str = "340872 * 10";
04 $numlen = strspn($str,$nums);
05 $numeric = substr($str,0,$numlen - 1);
06 ?>

The strspn() function runs through $str until it does not find a character in $nums. It returns the length of this segment. The script above stores the number in $numeric using substr() in conjunction with this data.

Advanced pattern matching

Widely used on UNIX systems and in many programming languages, regular expressions are probably the most sophisticated all-purpose textual pattern matching mechanisms. A regular expression is a script which denotes a pattern.

The most elementary regular expressions are 'literals', i.e., characters which are interpreted literally. Consider the following:

01 <?
02 $regs = array();
03 if(ereg("abc","a string with abc",$regs)) {
04 echo "Pattern matched: {$regs[0]}\n";
05 } else {
06 echo "No pattern matched\n";
07 }
08 ?>

The ereg() function matches patterns in the first argument within the string passed as the second argument. Matches for this type of pattern are stored in the array $regs, passed as the third argument.

This kind of pattern matching can easily be accomplished with the functions already covered; what those functions cannot do is match wild cards. Consider the following:

01 <?
02 $regs = array();
03 if(ereg("wildcard: (.)","The wildcard: x",$regs)) {
04 echo "The wildcard matched: {$regs[1]}\n";
05 } else {
06 echo "No pattern matched\n";
07 }
08 ?>

The pattern used this time is decidedly different. There are two distinct parts. The first, "wildcard: " (note the trailing space) are treated literally. The second part, "(.)", is used to match the character "x" in the string. The stop (".") matches any character except a newline. The parenthesis instructs ereg() to place the matched character (that is, the character matching ".") to be placed in $regs. The first matched parenthesis starts at $regs[1].

01 <?
02 $regs = array();
03 if(ereg("wildcard: (.)","The wildcard: x",$regs)) {
04 echo "The wildcard matched: {$regs[1]}\n";
05 } else {
06 echo "No pattern matched\n";
07 }
08 ?>

Pattern matching can be vastly enhanced. When an asterisk ("*") is appended to a pattern, the regular expression matches zero or more instances of that pattern. For example, a regular expression "abc*" will match "ab" (remember: zero or more instances), "abc", "abcc", "abccccc" and so on. Notice that "abc*" does not match zero or more instances of "abc", but only "ab" followed by zero or more instances of the character "c".

To match "abc" zero or more times, the regular expression would be: "(abc)*". In this case, "abc" is enclosed in parentheses and is treated as an "atom" or "single unit".

Page Break

Pattern matching with PHP: Part II

We have already examined asterisk (*) quantifier. This meta character operates on the previous atom and matches zero or more instances of that atom. For example, the RE 'ab*' will match 'a', 'ab', 'abbbb' and so on; the RE '(ab)*' will match '' (nothing), 'ab', 'abab', etc.

In many applications, however, one wants to match one or more instances of a pattern. A regular expression to match one or more instances of 'ab' might look like this: "ab(ab)*". This is slightly cumbersome, however. For this reason, regular expressions can have patterns followed by a quantifier plus ("+"). Unlike the asterisk, this matches one or more instances of the previous atom. So the previous RE could be modified to '(ab)+'.

There is another option, however. Some applications might require matching zero or one instance of a pattern! For this reason, a question mark (?) quantifier is supported. The RE '(ab)?' will match only '' (nothing) and 'ab'.

Notice that current syntax supports matching zero, one or an infinite number of patterns. If an application was required to match between one and three instances of 'ab', it could not be done.

It might at first appear that to match between one and three instances of a pattern 'ab', you could use the RE '(ab)*(ab)*(ab)*. This is not correct, since the first pattern atom will match all three instances of 'ab' in 'ababab'. You can test this with the following script:

01 <?
02 $str = "ababababababab test abab";
03 $pat = "(ab)*(ab)*(ab)*";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: {$regs[0]}\n";
07 }
08 ?>

Line 02 contains a string, $str, with several instances of 'ab'. On line 05, the script tests the pattern against $str and reports the first matched pattern on line 06. This is the whole initial 'ab' segment.

As such, to support a specific range of matches, REs support a quantifier called a 'bound'. Bounds determine how many instances of a pattern should be matched. For example, the following RE matches one to three instances of 'ab': '(ab){1,3}'. The bound is denoted by curly brackets ({}). The first value is the lower bound and the second is the upper bound. If the second value is omitted, only a lower bound is used when matching patterns.

Complex patterns

The wild card meta character '.' matches any character except a new line. The reason it doesn't match a new line as well is due to one of the most important rules in the RE syntax: the largest pattern is always matched. That is, the RE 'a.*b', when used against the string 'avbab', matches 'avbab', not 'avb' or 'ab'. If '.' matched new lines as well, many REs would match an entire string immediately (since '.' matches all characters in it). It is also convenient to match REs line by line - a point which will become more and more apparent with increased usage.

This begs the question, then, of how to match new lines with REs. There are two meta characters for new lines: circumflex/hat ('^') and dollar sign ('$'). The first denotes the beginning of a line, the second denotes the end of a line. The following RE matches a whole line: '^.*$'.

This syntax allows much more complex REs. For example, to match all lines beginning with 'Hello', the following RE would be used: '^Hello.*'. To match all lines ending with 'World' the following RE would be used: '.*World$'.

What if, however, an application was required to match either lines beginning with 'Hello' or ending with 'World'? The pipe ('|') meta character provides an alternation mechanism by which a string matches one pattern or another. Using '|', the following RE matches the above requirement: '(^Hello|World$)'.

In some circumstances, applications are required to match one of a number of characters. For example, you may be required to match a space, a tab or a lower case character. Using our existing syntax, the RE would look something like this: '^( |\t|a|b|c|....)' where '\t' is a meta character for a tab and '....' represents the rest of the alphabet in lower case joined by pipes. This is cumbersome and very inefficient from a performance perspective. The solution to this is 'bracket expressions'. A bracket expression is a list of one or more characters which are treated, character by character, as a pattern. The previous RE could be rewritten as: '^[ \tabc....]'.

It is still cumbersome, however, to type out the whole alphabet in lowercase. For this reason, bracket expressions support ranges of characters. Ranges include 'a-z', which matches all lower case alphabet characters; 'A-Z', which matches all upper case alphabet characters; and 0-9, matching the digits 0 to 9. This means we can finally create a basic RE which matches all possible patterns we're interested in: '^[ \ta-z]'.

On the next page, we will take a detailed look at back references in conjunction with regular expression substitutions in PHP.

Page Break

Pattern matching with PHP: Part III

Building on the previous page on pattern matching with regular expressions (REs), this time we will investigate substitutions and back references.

Substitutions and REs

PHP provides a framework for matching strings with REs and also for replacing substrings based on those REs. Say your application was required to strip all HTML tags from some text, except line breaks (<br>); you could use REs as follows:

01 <?
02 $pat = "(<[^b>]?[^r>]*>)";
03 $str = "<p>a <b>test</b><br>html string</p>";
04 echo eregi_replace($pat,"",$str);
05 ?>

On line 02, we define a pattern which matches a less than sign (<) that marks the start of an HTML tag. We then match any character other than 'b' and 'r'. We also want to avoid matching '>' since the second atom matches all characters up to the final '<'.

On line 05, we call eregi_replace() to match instances of the pattern. The second argument is our replacement text. In this case, it is "" (an empty string). As such, instances of the pattern in $str will be removed. To get familiar with this code, try replacing "" with "test" to see where substrings are removed. We also use eregi_replace() because it matches case insensitively and HTML is case insensitive. The case-sensitive equivalent to eregi_replace() is ereg_replace() (note the missing 'i' after 'ereg').

Understanding back references

A back reference is the text that has been matched by a sub-pattern in the pattern string. Back references allow the user to refer to parts of a matched string. The following example illustrates how they are used:

01 <?
02 $str = "01/03/2004";
03 $pat = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})";
04 $regs = array();
05 if(ereg($pat,$str,$regs)) {
06 echo "match: day: {$regs[1]}, month: {$regs[2]}, year: {$regs[3]}\n";
07 }
08 ?>

On line 02, we define a string that is a human-readable form of the date for 1 March. On line 03, we define a pattern to match the structure of this date. The first atom, '([0-9]{1,2})', is designed to match either one or two numerical characters. The second atom is identical. The third matches four numeric characters.

The use of parentheses not only allows us to form sub-patterns in our pattern string, but also to 'save' the text of the matched sub-pattern and recall it later. On line 05, we call ereg() and tell it to store the matched patterns in $regs. On line 06, we can output a broken-down date string. (Readers interested in extending their RE skills should attempt to modify the pattern on line 03 to validate dates; currently, $pat would match a string such as 99-99-2004.)

Parentheses nested in another set of parentheses can also be back referenced. For example, the pattern "(a (string))" allows two back references. Reference one, which would be stored in $regs[1] if called in conjunction with ereg(), would be 'a string'. Back reference two would be 'string'.

Combining back references and substitutions

By using back references in conjunction with substitutions we can design small scripts to perform very complex tasks. Consider the following problem: replace European dates of the form 'dd/mm/yyyy' or 'dd-mm-yyyy' with the ISO 8601 format of 'yyyy-mm-dd' in a text file.

01 <?
02 $pat = "([0-9]{1,2})[/-]([0-9]{1,2})[/-]([0-9]{4})";
03 $repl = "\\3-\\2-\\1";
04 $str = join('',file("test.txt"));
05 echo ereg_replace($pat,$repl,$str);
06 ?>

Place the following string in a file called test.txt: "this is a date 1/2/2004 and this is another 20-3-2003". Running the script will convert this text to: "this is a date 2004-2-1 and this is another 2003-3-20".

On line 02, we match a European date of the form defined above and isolate three different atoms. For the purpose of back referencing, the sub-patterns are number 1 through 3 from left to right. On line 03 we rearrange the order of the date by reversing the order of the atoms we are back referencing.

Page Break

Pattern matching with PHP: Part IV

Over the previous pages, we've focused on regular expressions. The implementation in the extension follows the POSIX standard. An alternative and much more feature-rich implementation is found in the Perl scripting language. Perl-compatible regular expressions (PCREs) can be used in PHP, and they're significantly faster and easier to use than the POSIX ones.

Using PCRE

Unlike the POSIX regular expressions we've used over the last two months, patterns in PCREs must be delimited. For example, consider the following pattern:

^.*news

This matches the start of a line, followed by all characters up until the string 'news'. The same regular expression, used as a PCRE, looks like this:

/^.*news/

The start and end of the pattern is delimited with a slash. The slash is the most widely used delimiter, but any other non alpha-numeric character is valid, as are square brackets [], curly brackets {} and less-than and greater-than symbols <>.

Modifiers

Delimiters are used to separate the pattern from syntax modifiers. Modifiers change how the PCRE library works. Every modifier is represented by a unique character. The modifier is placed at the end of the pattern, after the closing delimiter. For example, to use a modifier 'X', a PCRE would take the following form: '/pattern/X'.

The following problems can be solved with PCRE modifiers:

Case: using the modifier 'i', the PCRE library will ignore case when matching characters in the pattern against the string. For example, the RE '/abc/i' will match 'Abc', 'ABC', 'aBc' and so on.

Multiple lines: unlike POSIX regular expressions, PCREs do not operate on a single line at a time. That is, a string is treated as a single 'line', even if it contains embedded new lines. The '^' meta character matches the start of the string and '$' matches the end of the string. By using the 'm', or multi line, modifier the RE is executed in the same way as a POSIX regular expression. (New lines can be matched with the '\n' meta character.)

Matching new lines: we have already established that the '.' meta character matches every character except a new line. Since, however, PCREs treat a string as a single line this can be inconvenient. Using the 's' modifier, '.' will match all characters. This means that the PCRE '/^.*$/s' will match the whole string:

This
is a multi line
string

Greediness: default wild card matching in POSIX REs and PCREs is said to be greedy. That is, the largest pattern possible is matched. For example, the PCRE '/a.*c/' executed on the string 'abcbcdef' will match 'abcbc'. It is greedy because it consumes as many characters as possible. This behaviour can be reversed with the 'U', or ungreedy, modifier. The PCRE '/a.*c/U' matches 'abc'.

Meta characters

The PCRE extension also provides some useful meta characters missing from or which are very hard to use in the POSIX extension. Some the most useful meta characters deal with isolating words in a string.

For example, say you wanted to locate the word in the string 'Hello!?'.

There are three characters which need to be ignored. The following PCRE would be able to locate just the word: '/[^'!?]/+'. There are, however, many other non-alphabetical characters. Instead of forcing users to determine them and then type them out in every pattern, the '\w' meta character can be used to match any character in a word. As such, our RE can be simplified to '/\w+/'.

Complementing this is a meta character to match non-word characters: \W (upper case 'w'). This meta character will match white space (tabs, spaces, and so on), non-alphabetical characters such as punctuation and all those other characters you always forget about.

It is important to be able to differentiate white space from punctuation. To this end, the '\s' meta character can be used to match any white space character. Its complement is '\S' (upper case 's'), which matches any non-white space character (including punctuation).

Finally, the '\b', or word boundary meta character, can be used to match the start or end of a word. This is analogous to '^' and '$' - the start and end of a line.

Page Break

Pattern matching with PHP: Part V

We have already looked at the sophisticated features Perl-compatible regular expressions (PCRE) can bring to PHP developers. Now we will finish off with the final significant feature, assertions, and look at the functions provided in the PCRE extension.

Assertions

Assertions allow you to match a sub-pattern without actually consuming that sub-pattern. For example, say you wanted to match only the pattern 'end' if it was followed by the HTML tag '</p>'. We could do this using back references in a PCRE: '/(end)(<\/p>)/'.

Using assertions, however, we can prevent the PCRE matching routine from consuming the '</p>' portion of the string. To do so, we use a look ahead assertion: '/(end)(?=<\/p>)/'. The syntax '?=' tells the PCRE extension that the following pattern is a look ahead assertion.

Complementing this is the look behind assertion, in which the sub-pattern is matched only if it is preceded by another pattern. Say we wanted to match 'item' only if it was preceded by 'news '.The following PCRE could be used: '/(?<=news )(item)/'. Here, the syntax '?<=' tells the PCRE extension that the following pattern is a look behind assertion.

Sub-patterns are also permitted in assertions. For example, say you want to match the pattern 'end' if it is followed by '</p>' or '.' (a full stop). The following PCRE could be used:

'/(end)(?=(<\/p>|\.))/'.

Sub-patterns can be used in a similar way in look behind assertions, with one restriction: each sub-pattern must be of equal length. For example: '/(?<=(news|review) )(item)/' is invalid, since 'news' and 'review' differ in length.

Another feature of assertions is negative assertions. A negative assertion allows you to match a pattern not preceded or followed by a pattern. To match a pattern 'end' not followed by '</p>', the following PCRE could be used: '/(end)(?!<\/p>)/'. Notice that the normal look ahead assertion syntax is '?=' whereas the negative look ahead assertion syntax is '?!'. Likewise, the syntax for negative look behind assertions is '?<!'.

PCRE functions

The PCRE extension provides functions similar to those you would be familiar with from the POSIX regular express extension. For matching, use preg_match(), and for substitution, use preg_replace(). Consider the following examples:

01 <?
02 $str = "<p>Find the word at the end</p>";
03 $regs = array();
04 if(preg_match("/(end)(?=(<\/p>|\.))/",$str,$regs)) {
05 echo "Word is {$regs[1]}' followed by '{$regs[2]}'\n";
06 }
07 ?> 

In this example, we match the string $str with the pattern on line 04. The pattern uses the look ahead assertion discussed above. As with the POSIX extension, we are able to store the matched sub-patterns in $regs as an array and report to the user on our matches.

01 <?
02 $str = "Yet another news item.";
03 $str = preg_replace("/(?<=news )(item)/i","article",$str);
04 echo $str;
05 ?> 

In this example, we replace a matched pattern in the string $str. We match any instance of 'item' preceded by 'news '. Notice that only the sub-pattern 'item' is replaced, not the pattern in the look behind assertion.

The PCRE extension also comes with useful functions missing from the POSIX extension. The preg_grep() function can be used to match patterns in array data. In this sense, it treats each array element as a line of a file and behaves like the popular UNIX tool 'grep'.

01 <?
02 $file = file('test.html');
03 $lines = preg_grep("/<h1>.*<\/h1>/iU",$file);
04 print_r($lines);
05 ?> 

On line 02 of this example we read the input of the file test.html into an array $file, one line per element of the array. In line 03, we attempt to match lines with '<h1>' and '</h1>'. Note the 'i' (case) and 'U' (greediness) modifiers allow us to do this. The keys to the array are the line numbers of those lines from test.html which matched our pattern, counting from one.

Another invaluable function is preg_quote(). This function escapes any PCRE meta characters in the string passed as an argument, preparing the string to be used as a pattern in the PCRE extension. For example, it will escape the string '</h1>' to '<\/h1>'.

Readers interested in extending their PCRE and PHP skills should investigate the full range of functionality available in the extension, at www.php.net/manual/en/ref.pcre.php.