Regex in Perl – webscraping.pro

In this post we summarize some basic features of regex in Perl. We present the basic regex operators and the special regex pattern modifiers. More details are in the following articles…

Perl is powerful for string processing

Perl regular expressions (regex) are an essential part of Perl's syntax for string processing. The language integrates regex directly into its syntax, which keeps the code concise:

# Example: extract hours, minutes and seconds
if ($time =~ m/(\d\d):(\d\d):(\d\d)/) {   # match hh:mm:ss format
    $hours   = $1;
    $minutes = $2;
    $seconds = $3;
}
# or, in list context:
($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
# $hours, $minutes and $seconds now hold the hours, minutes and
# seconds from the $time string.

Regex matching is fast. Perl compiles each regex into a compact sequence of opcodes that often fits inside the processor cache, so matching runs quickly. A pattern without interpolated variables is compiled only once. A pattern that interpolates variables is recompiled only when those variables change; the 'o' pattern modifier forces a single compilation regardless of the number of match or substitution operations, at the cost of ignoring later changes to the variables. For details, see the special pattern modifiers section.

The basic method for applying a regular expression is to use the pattern binding operators =~ and !~. The first binds a string to a match or substitution and returns the result of that operation; the second does the same and returns the logical negation. Neither of them is an assignment.

$newvar = 'new variable';
$newvar =~ /\s/;   # true if whitespace is present
$newvar !~ /\d/;   # true if no digit is present
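The binding operators are normally used inside a condition. The following is a minimal sketch of that usage; it also shows qr//, which compiles a pattern once into a reusable regex object. The variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $newvar = 'new variable';

# =~ is true when the pattern matches, !~ is true when it does not.
my $has_space = ($newvar =~ /\s/) ? 1 : 0;   # the string contains whitespace
my $no_digit  = ($newvar !~ /\d/) ? 1 : 0;   # the string contains no digit

print "contains whitespace\n" if $has_space;
print "contains no digits\n"  if $no_digit;

# qr// compiles a pattern once and returns a regex object that can be reused.
my $word = 'new';                  # hypothetical search term
my $re   = qr/\b\Q$word\E\b/;      # \Q...\E quotes any metacharacters in $word

my @lines = ('a new line', 'renewal notice', 'brand new');
my @hits  = grep { $_ =~ $re } @lines;   # keeps only whole-word matches
print "$_\n" for @hits;
```

Only the first and third sample strings contain "new" as a whole word, so those two are printed.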

Regular expression operators

There are two basic regular expression operators in Perl:

Match Regular Expression – m/<pattern>/
Substitute Regular Expression – s/<pattern>/<replacement>/

Delimiters

The forward slashes in each operator act as delimiters for the regular expression that you specify. Any other character can serve as a regex delimiter except letters, digits and whitespace, for example ! # | etc.

Paired brackets such as ( ), { }, [ ] and < > can also be used to enclose the pattern and the replacement:

$text =~ /something/i; # all three forms are valid
$text =~ m(something)is;
$text =~ s{something}{anything}g;
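Alternative delimiters pay off when the pattern itself contains slashes. The sketch below compares the two spellings on a made-up file path; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical sample path

# With the default delimiter every slash in the pattern must be escaped.
my $slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# The same test with braces as delimiters needs no escaping.
my $braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# In s/// the pattern and the replacement may use different bracket pairs.
(my $moved = $path) =~ s{^/usr/local}[/opt];

print "$slashes $braces $moved\n";   # prints "1 1 /opt/bin/perl"
```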

In the first line of the example above the m of the m// operator is omitted, which is allowed only when / is the delimiter.

Patterns with the ? delimiter behave differently. m?...? matches only once: after its first successful match it stays switched off and returns false until reset is called, which re-arms all m?...? patterns in the current package. Since Perl 5.22 the leading m is mandatory for this form. The following script checks whether each input file contains an empty line:

while (<>) {
    if (m?^$?) { print "There is an empty line here.\n"; }
} continue {
    reset if eof;    # re-arm the pattern for the next file
}

Operators

The m/pattern/ operator searches the input string for a match of the pattern. Its behavior depends on whether it is used in scalar or list context, and on the 'g' (global) modifier, which enables a global search. In scalar context without the 'g' modifier, m/.../ returns 1 (true) if the search succeeds and an empty string '' (false) otherwise. If the pattern contains groups in round brackets, a successful match sets the numbered variables $1, $2, …, which hold the text matched by each group. After a successful search you can also use the special variables $&, $`, $' and $+.

Perl's built-in regex variables:

$' – substring following the match, read-only
$& – entire matched string, read-only
$` – substring preceding the match, read-only
$^R – result of the last code block (?{ ... }) evaluated inside the pattern
$n – text matched by the n-th group, counted by the order of the opening round brackets in the pattern
\n – backreference to the n-th group, used inside the pattern itself
$+ – text matched by the last (highest-numbered) group that took part in the match
$_ – the default input string. If the operator is bound with =~ or !~, the string on the left of that operator is searched; otherwise the search is done in $_.
@- – special array of start offsets: $-[0] is the start of the whole match, $-[n] the start of the n-th group
@+ – special array of end offsets: $+[0] is the offset just past the end of the whole match, $+[n] just past the end of the n-th group
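The variables listed above can be inspected right after a successful match. The following is a small sketch on a made-up string; the variable names are illustrative, and the values are copied into ordinary variables because the next successful match overwrites them.

```perl
use strict;
use warnings;

my $text = 'price: 42 USD today';   # hypothetical sample string

my ($before, $match, $after, $last_group, $start, $end, $group2_start);
if ($text =~ /(\d+)\s(\w+)/) {
    $before       = $`;       # 'price: '
    $match        = $&;       # '42 USD'
    $after        = $';       # ' today'
    $last_group   = $+;       # 'USD', the highest-numbered group that matched
    $start        = $-[0];    # 7, offset where the whole match starts
    $end          = $+[0];    # 13, offset just past the whole match
    $group2_start = $-[2];    # 10, offset where the second group starts
}

# \1 inside the pattern refers back to what the first group matched.
my $repeated = ('this is is a test' =~ /\b(\w+)\s\1\b/) ? $1 : '';

print "[$before][$match][$after] $last_group $start-$end $repeated\n";
```

The last pattern finds a doubled word, so `$repeated` ends up holding `is`.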

$text = '1223 cows go for a new grass.';
$scalar = ($text =~ m/(\d*)\s(\w+)\s(.*?)\./);
print "Result: $scalar\n";
print "Result: $1   $2   $3\n";
Result: 1
Result: 1223   cows   go for a new grass

When m// is used in list context without the 'g' modifier, the return value depends on the groups in round brackets. A successful match returns the list of captured substrings, or the list (1) if the pattern has no groups; a failed match returns an empty list. The captures are also available in the special variables $1, $2, etc., as seen in the very first example above.

$text = '1223 cows go for a new grass.';
@array = ($text =~ m/(\d*)\s(\w+)\s(.*?)\./);
print join(';', @array), "\n";
1223;cows;go for a new grass

With the global modifier ('g') in list context, m// returns the captures of all groups from every match in the string:

$text = '1st 2nd 3d 4th';
@array = ($text =~ m/((\d)\w+)/);   # single match
print "Single mode: ", join(', ', @array), "\n";
@array = ($text =~ m/((\d)\w+)/g);  # global match
print "Global mode: ", join(', ', @array), "\n";
Single mode: 1st, 1
Global mode: 1st, 1, 2nd, 2, 3d, 3, 4th, 4

In list context with a global search, distinguish between the matches of the whole pattern and the captures of each group. In the example above the pattern matched 4 times in global mode, and each match filled both groups, ((\d)\w+) and (\d), so the list holds 8 values. Note that using the service variables $&, $` and $' may slow down matching, particularly on Perl versions before 5.20.
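In scalar context the 'g' modifier works differently: each evaluation finds the next match and remembers where it stopped. The sketch below reuses the sample string from above to walk through the matches one at a time; pos() reports the offset where the next search would start, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = '1st 2nd 3d 4th';

# In scalar context each /g match resumes where the previous one ended.
my @found;
while ($text =~ /((\d)\w+)/g) {
    push @found, "$1 (digit $2) ends at offset " . pos($text);
}
print "$_\n" for @found;

# Assigning the list-context result to an empty list counts the matches.
my $count = () = $text =~ /\d\w+/g;
print "matches: $count\n";   # prints "matches: 4"
```

The loop form is handy when each match needs its own processing, while the list form fits cases where only the collected results matter.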

The operator s/<pattern>/<replacement>/ searches for the pattern (<pattern>) and, if it is found, replaces the matched text with the given replacement value (<replacement>). Without the 'g' modifier only the first match is replaced. With 'g', every match in the input string is replaced. The operator returns the number of substitutions made, or an empty string (false) if nothing matched. By default the operator works on the $_ variable; otherwise the target must be a scalar variable, array element or hash element placed to the left of the =~ operator.
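The return value and the effect of the 'g' modifier are easy to check on a short string. The following is a minimal sketch with made-up sample data and variable names.

```perl
use strict;
use warnings;

my $text = 'cat bat cat';   # hypothetical sample string

# Without /g only the first occurrence is replaced.
(my $first_only = $text) =~ s/cat/dog/;        # 'dog bat cat'

# With /g every occurrence is replaced; the operator returns the count.
my $all   = $text;
my $count = ($all =~ s/cat/dog/g);             # 2, and $all is 'dog bat dog'

# When nothing matches, the result is false and the string is unchanged.
my $none    = $text;
my $changed = ($none =~ s/cow/dog/g) ? 'yes' : 'no';   # 'no'

# With no =~ binding, s/// works on $_ (here, each array element in turn).
my @words = ('cat', 'catalog');
s/^cat/dog/ for @words;                        # ('dog', 'dogalog')

print "$first_only | $all | $count | $changed | @words\n";
```

Copying the string first, as in `(my $copy = $text) =~ s/.../.../`, leaves the original untouched.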

The Perl special pattern modifiers

We skip the basic pattern modifiers here and concentrate on the special ones. For the basic pattern modifiers, see the Regex expressions post.

c – keeps the current search position when a match fails in scalar context, instead of resetting it. It is valid only with the m/…/ operator together with the 'g' modifier.
o – compiles the pattern only once, the first time it is used. Variables interpolated into the pattern are substituted at that point, and later changes to them do not affect the pattern.

$var = 'line';
$text = 'a new line of the code lines';
while ($text =~ /(\s$var)/go) {
  print "$1\n";   # changing $var here would not affect the pattern
}
Result:
 line
 line

e – treats the right part of the s/.../.../ command as Perl code. The code is evaluated for each match and its result is used as the replacement.
ee – treats the right part of the s/.../.../ command as an expression whose resulting string is in turn evaluated as code (through the eval function). The final result is used as the replacement.
a – restricts the matches of \d, \s and \w to characters in the ASCII range. It is useful for keeping your program from being needlessly exposed to full Unicode when you only want to process English text.
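The modifiers c, e, ee and a can be tried out in a few lines each. The following is a sketch with made-up strings and variable names; the 'a' modifier needs Perl 5.14 or later, and 'ee' should only be used on trusted input because it runs the resulting string as code.

```perl
use strict;
use warnings;

# e: the replacement is evaluated as Perl code for every match.
(my $doubled = 'width=10 height=20') =~ s/(\d+)/$1 * 2/ge;
# $doubled is now 'width=20 height=40'

# ee: the replacement is evaluated, and the resulting string is eval'ed again.
my $formula = '2 + 3';                       # hypothetical trusted input
(my $sum = 'total: EXPR') =~ s/EXPR/$formula/ee;
# $sum is now 'total: 5'

# a: \d matches any Unicode digit by default, only 0-9 under /a.
my $arabic        = "\x{0663}";              # ARABIC-INDIC DIGIT THREE
my $unicode_digit = ($arabic =~ /^\d$/)  ? 1 : 0;   # 1
my $ascii_digit   = ($arabic =~ /^\d$/a) ? 1 : 0;   # 0

# c: a failed /g match normally resets the search position ...
my $s1      = 'abc123';
my $ok1     = ($s1 =~ /[a-z]+/g);            # matches 'abc', pos($s1) is 3
my $failed1 = ($s1 =~ /[a-z]+/g);            # fails, pos($s1) is reset
my $pos_without_c = pos($s1);                # undef

# ... but with /c the position survives the failure.
my $s2      = 'abc123';
my $ok2     = ($s2 =~ /[a-z]+/g);            # matches 'abc', pos($s2) is 3
my $failed2 = ($s2 =~ /[a-z]+/gc);           # fails, pos($s2) stays at 3
my $pos_with_c = pos($s2);                   # 3
my $digits = ($s2 =~ /\G(\d+)/gc) ? $1 : ''; # continues at offset 3: '123'

print "$doubled | $sum | $unicode_digit$ascii_digit | $pos_with_c | $digits\n";
```

The \G anchor in the last match ties the pattern to the saved position, which is what makes the c modifier useful for writing simple tokenizers.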

Summary

Text search and replacement in Perl is extremely powerful, so the basic techniques are always worth reviewing.
