Lasso Soft Inc. > Home

Optimizing Regular Expressions

This article shows how to optimize code that uses regular expressions by using the [RegExp] type rather than the individual string tags.

Introduction

Regular expressions provide a powerful method of parsing strings and performing search/replace operations. Lasso provides two different ways to use regular expressions: the familiar procedural [String_ReplaceRegExp] and [String_FindRegExp] tags, and the newer [RegExp] type. This tip shows some ways in which the [RegExp] type can be used to optimize code which relies on complex or repeated regular expressions.

Repeated Find/Replace Operations

Every time you call the [String_ReplaceRegExp] tag several things happen. The find and replace patterns are both compiled into an internal regexp object, and then the patterns are applied to the string repeatedly until the output has been generated. It is easy to dismiss the time to compile the find and replace patterns as inconsequential, but in fact this operation can sometimes take longer than performing the actual find/replace operation.

If we need to apply the same regular expression multiple times to different inputs, we can save this overhead by first creating a [RegExp] object with our desired find and replace patterns. Then we can apply this [RegExp] object repeatedly to different inputs. Each find/replace operation now incurs only the time it takes to perform the actual operation.

The simple code below takes on average 1.4 seconds to run on my machine. Each find/replace takes less than a millisecond to complete, but looping through several thousand of them adds up.

<?LassoScript
  var('input') = 'it was a dark and stormy night';
  var('find') = '[a-z]+';
  var('replace') = 'x';

  // Both patterns are recompiled on every pass through the loop.
  loop(10000);
    string_replaceregexp($input, -find=$find, -replace=$replace);
  /loop;
?>

If we rewrite the code using the [RegExp] type we can see a significant speed benefit. The following code, which performs the same operation as above, takes only 0.2 seconds to complete. In this code a [RegExp] object is created with the find and replace patterns. This object is then applied repeatedly to the input string.

<?LassoScript
  var('input') = 'it was a dark and stormy night';
  var('find') = '[a-z]+';
  var('replace') = 'x';

  // Compile the find and replace patterns once, before the loop.
  var('regexp') = regexp(-find=$find, -replace=$replace);
  loop(10000);
    // Only the find/replace itself is repeated.
    $regexp->replaceall(-input=$input);
  /loop;
?>

We see a speed benefit of just over a second (1.4 seconds down to 0.2), but only over the course of 10,000 repetitions. On your site you are likely to see a speed benefit only if you are using the [String_ReplaceRegExp] tag within a loop.
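
The examples above apply the pattern to the same string on every pass so that the timings are comparable. In practice the inputs usually differ. A minimal sketch of that case is shown below, reusing the same [RegExp->ReplaceAll] call. The $lines and $results variables and their contents are made up for illustration.

<?LassoScript
  // Hypothetical inputs; in practice these might be rows from a file or a database.
  var('lines') = array('it was a dark', 'and stormy night', 'the rain fell');
  var('results') = array;

  // Compile the find and replace patterns once, outside the loop.
  var('regexp') = regexp(-find='[a-z]+', -replace='x');

  iterate($lines, var('line'));
    // Only the find/replace work itself is repeated for each input.
    $results->insert($regexp->replaceall(-input=$line));
  /iterate;
?>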

Find Operations

As we saw above, the [RegExp] type has a direct analog of the [String_ReplaceRegExp] tag, but what about the [String_FindRegExp] tag?

The [RegExp] type supports an interactive mode which is a great replacement for [String_FindRegExp] if you are going to be iterating through the result of the tag. Where the [String_FindRegExp] tag creates an array of its results, the [RegExp] type produces each match one at a time within a while loop.

For example, the following code generates an array $myArray which contains all of the matches from our input string. We can iterate through $myArray to perform an operation using each word found in the input string.

<?LassoScript
  var('input') = 'it was a dark and stormy night';
  var('find') = '[a-z]+';

  // All matches are collected into an array before the iteration starts.
  var('myArray') = string_findregexp($input, -find=$find);
  iterate($myArray, var('myWord'));
    ...
  /iterate;
?>

The equivalent code using the [RegExp] type is shown below. A [While] ... [/While] loop tests the tag [RegExp->Find]. This tag returns true if there is another match in the string or false otherwise. Within the loop the [RegExp->MatchString] tag can be used to retrieve the current match.

<?LassoScript
  var('input') = 'it was a dark and stormy night';
  var('find') = '[a-z]+';
  var('regexp') = regexp(-find=$find);

  $regexp->input($input);
  // Each call to ->find advances to the next match and returns false when none remain.
  while($regexp->find);
    var('myWord') = $regexp->matchString;
    ...
  /while;
?>

We don't see quite as big an improvement using this code as we did by replacing the find/replace operation above. 10,000 repetitions of the [String_FindRegExp] tag run in 0.90 seconds and 10,000 repetitions using the [RegExp->Find] methodology run in 0.56 seconds.

However, the [RegExp->Find] methodology does have one big advantage when we are dealing with sub-patterns in the find string. Rather than looping through the array output from [String_FindRegExp] and jumping over the number of sub-patterns we can simply use [RegExp->MatchString(#)] to pull out each sub-pattern. The code ends up being a lot easier to maintain.
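
To make the comparison concrete, a rough sketch of the array-based approach follows. It assumes, as described above, that [String_FindRegExp] returns one flat array in which each full match is followed by its sub-patterns, so a pattern with three sub-patterns yields four elements per match. The $found variable, the -From/-To/-By form of [Loop], and the step size of 4 are choices made for this illustration.

<?LassoScript
  var('input') = 'My phone number is 360-555-1212.';
  var('found') = string_findregexp($input, -find='([0-9]{3})-([0-9]{3})-([0-9]{4})');

  // Step through the flat array four elements at a time:
  // the full match followed by its three sub-patterns.
  loop(-from=1, -to=$found->size, -by=4);
    var('phone') = $found->get(loop_count);
    var('areacode') = $found->get(loop_count + 1);
    var('prefix') = $found->get(loop_count + 2);
    var('linenumber') = $found->get(loop_count + 3);
    // Work with the pieces of this phone number here.
  /loop;
?>

If the pattern gains or loses a sub-pattern, the step size and every offset have to change with it, which is the maintenance cost the [RegExp->Find] approach avoids.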

The following simple code breaks out phone numbers in ###-###-#### format from a text string. It is very straightforward to deal with the sub-patterns when we use the [RegExp->Find] methodology.

<?LassoScript
  var('regexp') = regexp(-find='([0-9]{3})-([0-9]{3})-([0-9]{4})');
  $regexp->input('My phone number is 360-555-1212.');
  while($regexp->find);
    // With no parameter ->matchstring returns the whole match; 1 to 3 select the sub-patterns.
    var('phone') = $regexp->matchstring;
    var('areacode') = $regexp->matchstring(1);
    var('prefix') = $regexp->matchstring(2);
    var('linenumber') = $regexp->matchstring(3);
    ...
  /while;
?>

Split Operations

The [RegExp] type provides a great method of splitting strings into an array using a regular expression as the split operation. Using this method can result in significant speed improvements over using a series of loops and string operations to achieve the same result.

For example, if we want to split a document into lines we could first probe the document to find out what kind of line endings it has and then split on that line ending. However, in the real world we might encounter documents which use a mix of line endings. The following input is not split properly since our probe spots the \r\n line ending and then the split misses the solo \n later in the string.

<?LassoScript
  var('input') = 'Some sample text. \r\nThis is a sample\n with bad line endings.';

  // Probe for the line ending in use: \r\n first, then \r, falling back to \n.
  var('eol') = ($input >> '\r\n' ? '\r\n' | ($input >> '\r' ? '\r' | '\n'));
  $input->split($eol);
?>

   array: (Some sample text. ), (This is a sample with bad line endings.)

The [RegExp->Split] tag can make things easier by allowing us to split on a regular expression. Here any sequence of returns or newlines in a row is recognized as a single line ending.

<?LassoScript
  var('input') = 'Some sample text. \r\nThis is a sample\n with bad line endings.';

  // Any run of \r and \n characters counts as one delimiter.
  $input->split(regexp(-find='[\r\n]+'));
?>

   array: (Some sample text. ), (This is a sample), ( with bad line endings.)

Note that the pattern will collapse multiple line endings into a single delimiter, so \r\n\r\n will be treated the same as a solo \r\n. If we want to preserve multiple line endings we'd want to use a pattern more like '(?:\r\n|\r|\n)'.
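
A minimal sketch of that variant is shown below, using the same split call as the example above. The input string here is made up so that it contains a blank line.

<?LassoScript
  // Hypothetical input with a blank line between the first two lines of text.
  var('input') = 'First line.\r\n\r\nSecond line.\nThird line.';

  // Each \r\n, \r, or \n is matched as its own delimiter. \r\n is listed
  // first so a Windows line ending is not counted as two separate endings.
  $input->split(regexp(-find='(?:\r\n|\r|\n)'));
?>

With this pattern the blank line should come back as an empty element in the array rather than being collapsed away.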

We can even expand our regular expression to trim white space from each line while we perform the split operation. This could save us the need to do a [String->Trim] later.

<?LassoScript
  var('input') = 'Some sample text. \r\nThis is a sample\n with bad line endings.';

  // Spaces and tabs before the line ending, and any white space after it, become part of the delimiter.
  $input->split(regexp(-find='[ \t]*[\r\n]+\\s*'));
?>

   array: (Some sample text.), (This is a sample), (with bad line endings.)

Author: Fletcher Sandbeck
Created: 25 Apr 2008
Last Modified: 2 Mar 2011
