Perl Regular Expressions
This page was last modified on Wednesday, 13-Sep-2000 21:19:28 EDT

 

One of Perl's most useful features is its powerful string manipulation.  A key part of that is its support for regular expressions.  Perl's regular expression facility is nothing new; it borrows liberally from similar capabilities found in other UNIX tools. 

Regular expressions may be applied to the $_ implicit variable, or explicitly to any variable with the =~ operator.  The statement 

    $variable =~ someRE 
states that the regular expression on the right is applied to the variable on the left.  If the expression on the right is a substitution or a translation, the variable may be modified.  There is also a !~ operator, which is true when the match described by the regular expression fails. 

If the regular expression is a matching test, the result is reported as true if there is a match, and false otherwise. 
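
As a small sketch of the two binding operators and of the implicit $_ form, the following may help. The variable names ($line, $has_error, and so on) are made up for illustration and do not come from the text above.

```perl
use strict;
use warnings;

my $line = "error: disk full";

# Explicit binding: =~ is true on a match, !~ is true when there is no match.
my $has_error   = ($line =~ /error/) ? 1 : 0;
my $lacks_fatal = ($line !~ /fatal/) ? 1 : 0;

# Implicit binding: a bare pattern is tested against $_.
my $implicit = 0;
for ($line) {                  # "for" aliases $_ to $line
    $implicit = 1 if /disk/;
}

print "has_error=$has_error lacks_fatal=$lacks_fatal implicit=$implicit\n";
# prints: has_error=1 lacks_fatal=1 implicit=1
```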

Matching 

Assume, 
$sentence = "The quick brown fox jumped."; 

The command, 

    $sentence =~ /fox/
or, equivalently, 
    $sentence =~ m/fox/
is true and 
    $sentence !~ /fox/
is false. 

Substitution 

The general form for substitutions is 

    s/expr1/expr2/opt
where expr1 is replaced by expr2.  The details of the substitution are controlled by the options, opt.  Only the first occurrence of expr1 is replaced unless the global option, g, is given.  For example, the statement 
    $buffer =~ s/\ /+/;
replaces the first blank with a plus sign, while 
    $buffer =~ s/\ /+/g;
replaces all blanks with plus signs. 
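
The difference between the two forms can be checked with a short sketch like the one below. It also relies on the fact that s/// returns the number of replacements it made; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $first = "a b c d";
my $all   = "a b c d";

# Without g, only the first blank is replaced; s/// returns 1 on success.
my $n_first = ($first =~ s/ /+/);

# With g, every blank is replaced; s///g returns the number of replacements.
my $n_all = ($all =~ s/ /+/g);

print "$first ($n_first)\n";   # a+b c d (1)
print "$all ($n_all)\n";       # a+b+c+d (3)
```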
 
 


Translate 

The translate function 

    tr/expr1/expr2/
performs a character-by-character translation, replacing each occurrence of the i-th symbol in expr1 with the i-th symbol in expr2. 

For example, 

    tr/abc/xyz/
replaces a by x, b by y, and c by z in $_.  The command, 
    $buffer =~ tr/A-Z/a-z/;
replaces all upper case alphabetic characters with their lower case equivalents. 

Translate can also be used to count characters, because it returns the number of characters it translated.  For example, 

    $count = ($buffer =~ tr/A-Z/a-z/);
counts the upper case characters in $buffer while converting them to lower case. 
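
A minimal sketch of both uses follows. It assumes the common idiom of leaving the replacement list empty, which counts the characters in the search list without changing the string; the variable names are illustrative.

```perl
use strict;
use warnings;

my $buffer = "Hello World";

# Count upper case letters without changing the string:
# an empty replacement list leaves the matched characters as they are.
my $upper = ($buffer =~ tr/A-Z//);

# Translate to lower case; the return value is the number of
# characters that were translated.
my $changed = ($buffer =~ tr/A-Z/a-z/);

print "$buffer: $upper upper case, $changed translated\n";
# prints: hello world: 2 upper case, 2 translated
```

Note that tr works on lists of characters, not on patterns, so square brackets inside it are treated as ordinary characters.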
 
RE Commands

m/expr1/           match the expression
s/expr1/expr2/     substitute expr2 for expr1
tr/expr1/expr2/    translate symbols, replacing expr1[i] with expr2[i]
y/expr1/expr2/     same as tr///
/expr1/            same as m//
 
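RE Substitution and Matching Options

e   Evaluate the right side as an expression.  Valid for substitution, not for matching.
g   Global: replace (or match) all occurrences.
i   Ignore upper/lower case.
m   Treat the string as multiple lines, so ^ and $ also match at embedded newlines.
o   Compile the pattern once.
s   Treat the string as a single line, so . also matches a newline.
x   Extended regular expression: whitespace and comments inside the pattern are ignored.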
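
To make a few of these options concrete, here is a hedged sketch that exercises i, x, and e (combined with g). The sample strings and variable names are invented for the illustration.

```perl
use strict;
use warnings;

# i: ignore case when matching.
my $found = ("The Quick Brown Fox" =~ /quick/i) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored.
my $date = "2000-09-13";
my ($year, $month, $day) = $date =~ /
    (\d{4}) -    # year
    (\d{2}) -    # month
    (\d{2})      # day
/x;

# e with g: the replacement is evaluated as Perl code for every match.
my $prices = "cost 3 and 4";
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

print "$found $year/$month/$day '$doubled'\n";
# prints: 1 2000/09/13 'cost 6 and 8'
```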
 
 
RE Translate Options

c   Complement the search list.
d   Delete found but unreplaced characters.
s   Squeeze runs of the same replaced character into a single character.
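
The three translate options can be tried out with a sketch such as the following. The sample text is made up, and each option is applied to a copy so the original string stays intact.

```perl
use strict;
use warnings;

my $text = "aabbccdd  112233";

# d: delete characters in the search list that have no replacement.
(my $no_digits = $text) =~ tr/0-9//d;     # "aabbccdd  "

# s: squeeze runs of the same translated character into one.
(my $squeezed = $text) =~ tr/a-z//s;      # "abcd  112233"

# c: complement the search list, so every character that is
# NOT a digit is counted (the string itself is unchanged).
my $non_digits = ($text =~ tr/0-9//c);    # 10

print "[$no_digits] [$squeezed] $non_digits\n";
```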
 
Special Symbols and Backslash Codes 

There are several special characters that play specific roles in regular expressions.  The RE Special Symbols table lists each special character with its role and a simple illustration of its use.  These characters are important for specifying particular locations in the string and groups of substrings. 

Along with the special characters there is also a set of backslash codes.  The backslash codes serve several purposes: 

  1. Provide access to the special symbols when they are used as ordinary characters.
  2. Provide access to the $1, $2, ..., variables.
  3. Provide access to octal, hex, and control codes.
  4. Provide useful string processing indicators.
It's best to see the special characters and backslash codes working together.  Let's approach this from the point of view of some simple problems. 

EXAMPLE:  Write a Perl program to read a string and replace every occurrence of a blank with a plus sign. 

This program begins with the problem of indicating a blank.  A blank may be indicated by an actual blank space or by a backslash code (a backslash followed by a blank).  For the sake of readability, we use the backslash code.

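    #!/usr/local/bin/perl
    #
    $buf = <STDIN>;       # read one line from standard input
    chomp $buf;           # remove the trailing newline
    $buf =~ s/\ /+/g;     # replace every blank with a plus sign
    print "$buf\n";
    exit;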
The program reads a string into $buf, removes the end of record indicator, then processes $buf by replacing all blanks with plus signs, s/\ /+/g. This could also have been accomplished with s/ /+/g.  Below is another version of the program that uses the $_ variable.
    #!/usr/local/bin/perl
    #
    $_ = <STDIN>;
    chomp;
    s/\ /+/g;
    print "$_\n";
    exit;
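
As a quick check that the escaped and unescaped blank behave the same, the sketch below applies both forms to copies of one string. It also shows tr as a possible alternative for this one-character replacement; that is an addition for comparison, not something the programs above use, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $input = "one two  three";

(my $escaped = $input) =~ s/\ /+/g;   # backslash-escaped blank
(my $plain   = $input) =~ s/ /+/g;    # literal blank
(my $mapped  = $input) =~ tr/ /+/;    # character translation

print "$escaped\n";   # one+two++three
print "same\n" if $escaped eq $plain && $plain eq $mapped;
```
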
EXAMPLE:  Write a Perl program to read a string and make sure that the first symbol in every alphabetic substring is upper case.  Also, if the string begins with an alphabetic character, make that symbol upper case as well. 

Let's begin by solving the first part, when a blank precedes an alphabetic character.  Finding a blank followed by an alphabetic character is relatively easy, "\ [a-z]".  The backslash blank matches a blank and [a-z] matches any lower case letter a through z.

Next, the RE feature must be told to replace the lower case symbol by its upper case equivalent.  This is accomplished by placing parentheses around [a-z], giving ([a-z]), which captures the symbol in a temporary variable.  That variable is then used with the \U code, which forces the symbol to upper case, as illustrated in the program,

    #!/usr/local/bin/perl 
    # 
    $_ = <STDIN>; 
    chomp; 
    # $1 holds the captured letter (\1 also works here but draws a warning under -w)
    s/\ ([a-z])/\ \U$1/g; 
    print "$_\n"; 
    exit;
The program reads a string from standard input into the implicit variable, $_.  The chomp function removes any end of record indicator from the string that was read.  The substitution looks for any blank followed by a lower case letter (note the parentheses) and replaces it with a blank and the upper case equivalent of the symbol captured in $1.  Note the use of the global option, g; without it only the first occurrence would be replaced.

To handle the symbol at the beginning of the line, the beginning of line indicator, ^, is used.  Now the search is looking for the beginning of the line or a blank, "^|\ " followed by the "[a-z]".  Each of these is captured in a temporary variable for use with the substitution.  This leads to the program,

    #!/usr/local/bin/perl 
    # 
    $_ = <STDIN>; 
    chomp; 
    # $1 is the start of line or blank, $2 is the letter to capitalize
    s/(^|\ )([a-z])/$1\U$2/g; 
    print "$_\n"; 
    exit;
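
A possible variation, offered only as a sketch, uses the word boundary code \b together with \u instead of matching the start of line and the blank explicitly. This is not the approach taken in the programs above, and it is not exactly equivalent: \b also matches after punctuation, so a word such as don't would become Don'T.

```perl
use strict;
use warnings;

my $line = "the quick brown fox";

# \b matches at a word boundary, so no separate case is needed for the
# start of the string; \u upper-cases only the next character.
(my $title = $line) =~ s/\b([a-z])/\u$1/g;

print "$title\n";   # The Quick Brown Fox
```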
 
 
RE Special Symbols

.     Any single character except a newline.
      /t.e/ matches t and e with any one character between them.
^     Two meanings: as the first character in the expression, it means the match must be at the beginning of the string; as the first character inside [ ], it means anything except the characters that follow.
      /^a/ the string must begin with a.
      [^0-9] matches anything except 0 through 9.
$     Forces the match to occur at the end of the string.
      /x$/ the string must end with x.
*     Zero or more occurrences of the character preceding this one.
      /ab*c/ matches a and c with any number of b's between them, including ac.
+     One or more occurrences of the character preceding this one.
      /ab+c/ matches a and c with at least one b between them.
?     Zero or one occurrence of the character preceding this one.
      /ab?c/ matches ac or abc.
[ ]   Match any one of the characters inside the brackets.
      /b[iaou]g/ matches any of the strings big, bag, bog, and bug.
-     Range indicator inside [ ].  Like .. in Pascal and Ada.
      /[a-z]/ matches any lower case alphabetic character.
( )   Bind the match to the appropriate $1, $2, ..., variable.
{ }   Repeat the preceding character.
      a{5} matches exactly five a's.
      a{5,} matches five or more a's.
      a{2,12} matches between 2 and 12 a's.
\     Match the special character that follows, or apply the interpretation given to the symbol that follows (see the RE Backslash Codes table).
      /a\/b/ matches a/b.
/     Regular expression delimiter.
      s/abc/xyz/ means replace abc with xyz.
|     Logical or.
      /a|b/ matches a or b.
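
One way to check the entries in this table is a small self-test like the sketch below. Each pattern is paired with one string that should match and one that should not; the test strings and the @cases layout are assumptions made for the illustration.

```perl
use strict;
use warnings;

# [ pattern, string that should match, string that should not match ]
my @cases = (
    [ qr/t.e/,         "the",  "te"   ],  # . needs exactly one character
    [ qr/^a/,          "abc",  "cba"  ],  # ^ anchors at the start
    [ qr/x$/,          "box",  "oxen" ],  # $ anchors at the end
    [ qr/ab*c/,        "ac",   "adc"  ],  # * allows zero b's
    [ qr/ab+c/,        "abbc", "ac"   ],  # + needs at least one b
    [ qr/ab?c/,        "abc",  "abbc" ],  # ? allows at most one b
    [ qr/b[iaou]g/,    "bog",  "beg"  ],  # [ ] lists the allowed characters
    [ qr/^[^0-9]+$/,   "abc",  "a1c"  ],  # [^ ] negates the list
    [ qr/^a{2,3}$/,    "aaa",  "aaaa" ],  # { } bounds the repeat count
    [ qr/^(cat|dog)$/, "dog",  "cow"  ],  # | chooses between alternatives
);

my $failures = 0;
for my $case (@cases) {
    my ($re, $yes, $no) = @$case;
    $failures++ unless $yes =~ $re && $no !~ $re;
}

print $failures ? "$failures unexpected results\n"
                : "all patterns behaved as described\n";
```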
 
 
RE Backslash Codes

\        The special character that follows: \\, \/, \., ...
\0       Interpret what follows in octal.
\digit   \1, \2, ..., are codes for the implied variables $1, $2, ..., respectively.
\a       Alarm beep.
\A       Beginning of string.
\b       A word boundary.
\B       Not a word boundary.
\c       Control, as in \cC for control-C.
\d       Any digit.  Shorthand for [0-9].
\D       Any non-digit.  Shorthand for [^0-9].
\e       Escape.
\E       End the preceding \U, \L, or \Q.
\f       Form feed.
\l       Force next character to lower case.
\L       Force all following characters to lower case.
\n       Newline.
\Q       Backslash all the following non-alphanumeric characters.
\r       Carriage return.
\s       Any whitespace character: space, tab, newline, ...
\S       Any non-whitespace character.
\t       Tab.
\u       Force next character to upper case.
\U       Force all following characters to upper case.
\w       Any word character.  Shorthand for [a-zA-Z0-9_].
\W       Any non-word character.  Shorthand for [^a-zA-Z0-9_].
\x       Interpret what follows as hexadecimal.
\Z       End of string.
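
A few of these codes are sketched in use below, with an invented sample string and illustrative variable names. The last line relies on the case codes also working in ordinary double-quoted strings, not only in substitutions.

```perl
use strict;
use warnings;

my $entry = "order 42 shipped to new york";

# \d+ grabs a run of digits; \w+ and \s+ pick out word characters
# and whitespace.
my ($number)     = $entry =~ /(\d+)/;         # 42
my ($first_word) = $entry =~ /^(\w+)\s+/;     # order

# \b keeps "to" from matching inside a longer word.
my $has_to = ($entry =~ /\bto\b/) ? 1 : 0;    # 1

# \Q ... \E switches off special meanings, so "." matches only a dot.
my $literal = "a.c";
my $exact   = ("abc" =~ /\Q$literal\E/) ? 1 : 0;   # 0
my $loose   = ("abc" =~ /$literal/)     ? 1 : 0;   # 1

# \L lower-cases up to \E, \u upper-cases the next character.
my $shout = "\LHELLO \E\uworld";              # hello World

print "$number $first_word $has_to $exact $loose $shout\n";
```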
 
 