The Problem With Jamie Zawinski and Regular Expressions

Jamie Zawinski, also known simply as jwz, is famous for his quote: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” It’s a very amusing line, and I can totally see why people all over the world are using it in their .sig files: It makes them feel better about not being able to understand regexes. And it looks more erudite than just saying, “Some dude who used to work for Netscape thinks they’re a bad idea, so I don’t have to worry about the difference between \w and \S.”

(It turns out the “now they have two problems” construction didn’t even originate with Jamie; it can be found — slamming awk instead of regexes — in the .sig line of a Usenet post in comp.sys.ibm.pc.rt in 1988, and perhaps even further. Maybe this is just another instance of how, as Jamie puts it, people only remember the stupid stuff he says?)

Anyway, as witty as this line is, it’s wrong. Like any other tool, regular expressions are designed to solve a particular class of problems. And, like every other tool, there is a class of problems that they’re just not so good at handling.

Regexes are excellent for pattern matching. If you need to ensure that a text string is, for example, a standard ISBN, or a postal code from a particular nation, or a valid username for a given spec, regexes will save you a lot of time and trouble. These are all fairly standard and common validation scenarios.

To take a fairly trivial case, suppose you want to know if a given string is a valid US ZIP code. If you don’t have regexes available (either because you’re working in a language that doesn’t support them, or because you haven’t learned them), your first thought might be:

if (is_integer($input) && $input < 100000 && $input > 0)

But this doesn’t work, because “2134” is not a valid ZIP code — although “02134” is, and can be found in Boston. Plus, this conditional won’t handle 9-digit ZIP codes. What you really need to test for is: “This string has 5 digits in a row. Then, optionally, it has a dash, and then 4 more digits.”

If you don’t have regexes, this is where you wind up writing something painful like:

if (strlen(input) != 5 && strlen(input) != 10) {
    return false;
}
for (int i = 0; i < strlen(input); i++) {
    int ascii = ord(substr(input, i, 1));
    if (i == 5) {
        // Position 5 (the sixth character) must be the dash.
        if (ascii != 45) {
            return false;
        }
    } else if (ascii < 48 || ascii > 57) {
        // Everywhere else, only the digits 0-9 are allowed.
        return false;
    }
}
return true;

In fact, this isn’t some contrived example I whipped up to make the process seem more awkward than it has to be. This is the best solution I can find without regexes, and it relies on the fact that “the digits from 0-9” is a contiguous range of ASCII characters and values. If I’d instead wanted to allow “upper- and lower-case letters, but not the characters ` [ \ ] or ^” (a fairly reasonable restriction), I’d have to use two non-contiguous ranges, with a corresponding increase in complexity.
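For comparison, here is roughly what the same ZIP test looks like when regexes are available. This is only a sketch in Python, not part of the pseudocode above; the names ZIP_RE and is_valid_zip are made up for illustration, and it checks the format only, not whether the ZIP code actually exists.

```python
import re

# Five digits, then optionally a dash and four more digits.
# [0-9] is used instead of \d so that only ASCII digits match.
ZIP_RE = re.compile(r"[0-9]{5}(-[0-9]{4})?")

def is_valid_zip(text: str) -> bool:
    """Return True if text looks like a 5-digit or ZIP+4 code."""
    # fullmatch() requires the whole string to match the pattern,
    # so no explicit ^ or $ anchors are needed.
    return ZIP_RE.fullmatch(text) is not None
```

With this, is_valid_zip("02134") and is_valid_zip("02134-1234") are accepted, while "2134" and "021341234" are rejected.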

Now suppose you’ve decided usernames must be 4-20 characters long, and may contain letters, numbers, hyphens, underscores, and periods. If you look at this from a user’s perspective (where details of implementation are irrelevant, and all that matters is the actual functionality), this makes perfect sense: it lets users use all the characters they’re most likely to want. They can have names like John.Doe, mary_roe, curly-moe and b_hills90210.

But if you look at this as a developer who’s going to have to implement the filter (without regexes), it looks like a nightmare. “Do we really have to let them use underscores?” you whine (after a quick glance at the ASCII table). To which the answer is, pretty simply, “Yes.” Even people who never heard of an underscore five years ago now expect to be able to use one in their usernames, and if it’s not allowed, they’ll perceive your service as cheap and poorly-constructed.

But really, that isn’t the end of it. Taking a cue from DNS, you should probably ban the “separator” characters (period, underscore, and hyphen) from the initial and final spots in usernames. This means your complete function to test for username validity looks like this:

function is_valid_username(input, &errormsg) {
  if (strlen(input) < 4) {
    errormsg = "Too short!";
    return false;
  }
  if (strlen(input) > 20) {
    errormsg = "Too long!";
    return false;
  }

  for (pos = 0; pos < strlen(input); pos++) {
    char = substr(input, pos, 1);
    // First and last characters may not be separators.
    if (pos == 0 || pos == strlen(input) - 1) {
      if (! is_letter(char) && ! is_digit(char)) {
        errormsg = "Invalid first/last character: \"" \
        . char . "\"";
        return false;
      }
    } else {
      if (! is_letter(char) && ! is_digit(char) && \
      ! is_separator(char)) {
        errormsg = "Invalid character: \"" \
        . char . "\"";
        return false;
      }
    }
  }
  return true;
}

function is_digit(char) {
  asc = ord(char);  // ASCII code of the character
  if (48 <= asc && asc <= 57) {
    return true;
  }
  return false;
}
function is_letter(char) {
  asc = ord(char);
  // A-Z is 65-90, a-z is 97-122
  if ( (65 <= asc && asc <= 90) || \
  (97 <= asc && asc <= 122) ) {
    return true;
  }
  return false;
}
function is_separator(char) {
  asc = ord(char);
  // hyphen (45), period (46), underscore (95)
  if (asc == 45 || asc == 46 || asc == 95) {
    return true;
  }
  return false;
}

If you’re not familiar with regular expressions, this is where you’re probably saying something like “Yeah, that looks like a fairly decent, and reasonably flexible, solution.” But the regex solution is so simple, it isn’t even worth its own function:

if (! matchRE(/^[a-z0-9][a-z0-9_\.-]{2,18}[a-z0-9]$/i, \
input)) {
    errormsg = "Requested username is not valid!";
}

The people who already do understand regexes, on the other hand, might have noticed how the three is_...() helper functions in my earlier code sample are groping their way towards regular expressions’ character classes.
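To make that concrete, here is a runnable sketch of the same username check in Python. The matchRE() call above is pseudocode, so this is one possible translation rather than the definitive one; USERNAME_RE and is_valid_username are names chosen for illustration. Each bracketed character class does the job of one combination of the is_...() helpers.

```python
import re

# [a-z0-9]      -> is_letter() or is_digit(), for the first and last characters
# [a-z0-9_.-]   -> is_letter(), is_digit() or is_separator(), for the middle
# {2,18}        -> 2 to 18 middle characters, so 4 to 20 characters overall
# re.ASCII keeps the case-insensitive match from accepting non-ASCII letters.
USERNAME_RE = re.compile(
    r"[a-z0-9][a-z0-9_.-]{2,18}[a-z0-9]",
    re.IGNORECASE | re.ASCII,
)

def is_valid_username(name: str) -> bool:
    """Return True if name is 4-20 allowed characters with no separator at either end."""
    return USERNAME_RE.fullmatch(name) is not None
```

The sample names from earlier (John.Doe, mary_roe, curly-moe and b_hills90210) all pass, while names like ".john", "john." and "abc" are rejected.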

If regular expressions are such a powerful and useful tool, why did Jamie Zawinski diss them so harshly? One possible reason is simply “because he could” — the opportunity presented itself, and there’s a level on which wittiness is its own excuse.

But it’s also true that no matter how wonderful a tool is, it won’t give you any help if you use it on the wrong type of problem.

And regexes have been misused an awful lot, on a general class of problem that they’re simply not equipped to handle: parsing. Text parsing is a much bigger problem than mere text matching, but the two aren’t completely unrelated; matching is an important part of parsing. Which is why lots of people keep trying to parse text (like chunks of HTML) with nothing but regexes.

Which is like trying to build a cabinet with nothing but a screwdriver. A screwdriver is a nice tool, but when you turn it around and use it to drive nails, you’ll find that it leaves something to be desired. When you try to use it as a plane or an adze, you can’t help but be frustrated by its inadequacies, and when you need a saw, your screwdriver can’t even begin to do the job.

Blaming the screwdriver does not help. And deciding that screwdrivers are “another problem” and throwing away yours simply leaves you without a screwdriver when the time comes to drive (or remove) a screw.
