{"id":351,"date":"2011-08-16T19:43:11","date_gmt":"2011-08-17T02:43:11","guid":{"rendered":"http:\/\/kagan.mactane.org\/blog\/?p=351"},"modified":"2020-11-25T16:23:25","modified_gmt":"2020-11-26T00:23:25","slug":"the-problem-with-jamie-zawinski-and-regular-expressions","status":"publish","type":"post","link":"https:\/\/kagan.mactane.org\/blog\/2011\/08\/16\/the-problem-with-jamie-zawinski-and-regular-expressions\/","title":{"rendered":"The Problem With Jamie Zawinski and Regular Expressions"},"content":{"rendered":"<p>Jamie Zawinski, also known simply as jwz, is famous for his quote: &#8220;Some people, when confronted with a problem, think \u2018I know, I&#8217;ll use regular expressions.\u2019 Now they have two problems.&#8221; It&#8217;s a very amusing line, and I can totally see why people all over the world are using it in their .sig files: It makes them feel better about not being able to understand regexes. And it looks more erudite than just saying, &#8220;Some dude who used to work for Netscape thinks they&#8217;re a bad idea, so I don&#8217;t have to worry about the difference between \\w and \\S.&#8221;<\/p>\n<p>(It turns out the &#8220;now they have two problems&#8221; construction didn&#8217;t even originate with Jamie; it can be found&nbsp;\u2014 slamming awk instead of regexes&nbsp;\u2014 in the .sig line of a Usenet post in comp.sys.ibm.pc.rt in 1988, and perhaps even further. Maybe this is just another instance of how, as Jamie puts it, people <a href=\"http:\/\/en.wikiquote.org\/wiki\/Jamie_Zawinski\">only remember the stupid stuff<\/a> he&nbsp;says?)<\/p>\n<p>Anyway, as witty as this line is, it&#8217;s wrong. Like any other tool, regular expressions are designed to solve a particular class of problems. And, like every other tool, there is a class of problems that they&#8217;re just not so good at&nbsp;handling.<\/p>\n<p>Regexes are excellent for doing pattern matching. 
They shine when you need to ensure that a text string is, for example, a standard ISBN, a postal code from a particular nation, or a valid username for a given spec. In any of these (fairly standard and common) validation scenarios, regexes will save you a lot of time and&nbsp;trouble.<\/p>\n<p>To take a fairly trivial case, suppose you want to know if a given string is <span class=\"tooltip\" title=\"For international readers: these are five-digit numbers, optionally followed by a dash and four more digits.\">a valid US ZIP code<\/span>. If you don&#8217;t have regexes available (either because you&#8217;re working in a language that doesn&#8217;t support them, or because you haven&#8217;t learned them), your first thought might&nbsp;be:<\/p>\n<div class=\"code\">if (is_integer($input) &amp;&amp; $input &lt; 100000 &amp;&amp; $input &gt;&nbsp;0)<\/div>\n<p>But this doesn&#8217;t work, because &#8220;2134&#8221; is not a valid ZIP code&nbsp;\u2014 although &#8220;02134&#8221; is, and can be found in Boston. Plus, this conditional won&#8217;t handle 9-digit ZIP codes. What you really need to test for is: &#8220;This string has 5 digits in a row. Then, optionally, it has a dash, and then 4 more&nbsp;digits.&#8221;<!--more--><\/p>\n<p>If you don&#8217;t have regexes, this is where you wind up writing something painful&nbsp;like:<\/p>\n<pre class=\"code\">if (strlen(input) != 5 &amp;&amp; strlen(input) != 10) {\n    return false;\n}\nfor (int i = 0; i &lt; strlen(input); i++) {\n    \/\/ ord() gives the numeric ASCII code of a character\n    int ascii = ord(substr(input, i, 1));\n    if (i == 5) {\n        \/\/ position 5 must be the dash (ASCII 45)\n        if (ascii != 45) {\n            return false;\n        }\n    } else if (ascii &lt; 48 || ascii &gt; 57) {\n        \/\/ every other position must be a digit\n        return false;\n    }\n}\nreturn&nbsp;true;<\/pre>\n<p>In fact, this isn&#8217;t some contrived example I whipped up to make the process seem more awkward than it has to be. 
This is the <em>best<\/em> solution I can find without regexes, and it relies on the fact that &#8220;the digits from 0-9&#8221; is a contiguous range of ASCII characters and values. If I&#8217;d instead wanted to allow \u201cupper- and lower-case letters, but <em>not<\/em> the characters ` [ \\ ] or ^\u201d (a fairly reasonable restriction), I&#8217;d have to use <em>two non-contiguous<\/em> ranges, with a corresponding increase in complexity.<\/p>\n<p>Now suppose you&#8217;ve decided usernames must be 4-20 characters long, and may contain letters, numbers, hyphens, underscores, and periods. If you look at this from a user&#8217;s perspective (where details of implementation are irrelevant, and all that matters is the actual functionality), this makes perfect sense: it lets users use all the characters they&#8217;re most likely to want. They can have names like John.Doe, mary_roe, curly-moe and b_hills90210.<\/p>\n<p>But if you look at this as a developer who&#8217;s going to have to implement the filter (without regexes), it looks like a nightmare. &#8220;Do we really have to let them use underscores?&#8221; you whine (after a quick glance at <a href=\"http:\/\/www.ascii-code.com\/\">the ASCII table<\/a>). To which the answer is, pretty simply, &#8220;Yes.&#8221; Even people who never heard of an underscore five years ago now expect to be able to use one in their usernames, and if it&#8217;s not allowed, they&#8217;ll perceive your service as cheap and poorly-constructed.<\/p>\n<p>But really, that isn&#8217;t the end of it. Taking a cue from DNS, you should probably ban the &#8220;separator&#8221; characters (period, underscore, and hyphen) from the initial and final spots in usernames. 
This means your complete function to test for username validity looks like&nbsp;this:<\/p>\n<pre class=\"code\">function is_valid_username(input, &amp;errormsg) {\n&nbsp;&nbsp;if (strlen(input) &lt; 4) {\n&nbsp;&nbsp;&nbsp;&nbsp;errormsg = \"Too short!\";\n&nbsp;&nbsp;&nbsp;&nbsp;return false;\n&nbsp;&nbsp;}\n&nbsp;&nbsp;if (strlen(input) &gt; 20) {\n&nbsp;&nbsp;&nbsp;&nbsp;errormsg = \"Too long!\";\n&nbsp;&nbsp;&nbsp;&nbsp;return false;\n&nbsp;&nbsp;}\n\n&nbsp;&nbsp;for (pos = 0; pos &lt; strlen(input); pos++) {\n&nbsp;&nbsp;&nbsp;&nbsp;char = substr(input, pos, 1);\n&nbsp;&nbsp;&nbsp;&nbsp;if (pos == 0 || pos == strlen(input) - 1) {\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (! is_letter(char) &amp;&amp; ! is_digit(char)) {\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;errormsg = \"Invalid first\/last character: \\\"\" \\\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;. char . \"\\\"\";\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return false;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}\n&nbsp;&nbsp;&nbsp;&nbsp;} else {\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (! is_letter(char) &amp;&amp; ! is_digit(char) &amp;&amp; \\\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;! is_separator(char)) {\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;errormsg = \"Invalid character: \\\"\" \\\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;. char . 
\"\\\"\";\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return false;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}\n&nbsp;&nbsp;&nbsp;&nbsp;}\n&nbsp;&nbsp;}\n&nbsp;&nbsp;return true;\n}\n\nfunction is_digit(char) {\n&nbsp;&nbsp;asc = ord(char);\n&nbsp;&nbsp;if (48 &lt;= asc &amp;&amp; asc &lt;= 57) {\n&nbsp;&nbsp;&nbsp;&nbsp;return true;\n&nbsp;&nbsp;}\n&nbsp;&nbsp;return false;\n}\nfunction is_letter(char) {\n&nbsp;&nbsp;asc = ord(char);\n&nbsp;&nbsp;if ( (65 &lt;= asc &amp;&amp; asc &lt;= 90) || \\\n&nbsp;&nbsp;(97 &lt;= asc &amp;&amp; asc &lt;= 122) ) {\n&nbsp;&nbsp;&nbsp;&nbsp;return true;\n&nbsp;&nbsp;}\n&nbsp;&nbsp;return false;\n}\nfunction is_separator(char) {\n&nbsp;&nbsp;asc = ord(char);\n&nbsp;&nbsp;if (asc == 45 || asc == 46 || asc == 95) {\n&nbsp;&nbsp;&nbsp;&nbsp;return true;\n&nbsp;&nbsp;}\n&nbsp;&nbsp;return false;\n}\n\n<\/pre>\n<p>If you&#8217;re not familiar with regular expressions, this is where you&#8217;re probably saying something like &#8220;Yeah, that looks like a fairly decent, and reasonably flexible, solution.&#8221; But the regex solution is so simple, it isn&#8217;t even worth its own&nbsp;function:<\/p>\n<pre class=\"code\">if (! matchRE(\/^[a-z0-9][a-z0-9_\\.-]{2,18}[a-z0-9]$\/i, \\\ninput)) {\n    errormsg = \"Requested username is not valid!\";\n}<\/pre>\n<p>The people who already do understand regexes, on the other hand, might have noticed how the three <code>is_...()<\/code> helper functions in my earlier code sample are groping their way towards regular expressions&#8217; character&nbsp;classes.<\/p>\n<div class=\"separator\"><\/div>\n<p>If regular expressions are such a powerful and useful tool, why did Jamie Zawinski diss them so harshly? 
One possible reason is simply &#8220;because he could&#8221;&nbsp;\u2014 the opportunity presented itself, and there&#8217;s a level on which wittiness is its own&nbsp;excuse.<\/p>\n<p>But it&#8217;s also true that no matter how wonderful a tool is, it won&#8217;t give you any help if you use it on the wrong type of&nbsp;problem.<\/p>\n<p>And regexes have been misused an awful lot, on a general class of problem that they&#8217;re simply not equipped to handle: text <em>parsing<\/em>. Parsing is a much bigger problem than mere text <em>matching<\/em>, but the two aren&#8217;t completely unrelated; matching is an important part of parsing. Which is why lots of people keep trying to parse text (like chunks of HTML) with nothing but&nbsp;regexes.<\/p>\n<p>Which is like trying to build a cabinet with nothing but a screwdriver. A screwdriver is a nice tool, but when you turn it around and use it to drive nails, you&#8217;ll find that it leaves something to be desired. When you try to use it as a plane or an adze, you can&#8217;t help but be frustrated by its inadequacies, and when you need a saw, your screwdriver can&#8217;t even begin to do the&nbsp;job.<\/p>\n<p>Blaming the screwdriver does not help. 
And deciding that screwdrivers are &#8220;another problem&#8221; and throwing away yours simply leaves you without a screwdriver when the time comes to drive (or remove) a&nbsp;screw.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Jamie Zawinski, also known simply as jwz, is famous for his quote: &#8220;Some people, when confronted with a problem, think \u2018I know, I&#8217;ll use regular expressions.\u2019 Now they have two problems.&#8221; It&#8217;s a very amusing line, and I can totally see why people all over the world are using it in their .sig files: It [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[7,60,71,29],"_links":{"self":[{"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/posts\/351"}],"collection":[{"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/comments?post=351"}],"version-history":[{"count":18,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/posts\/351\/revisions"}],"predecessor-version":[{"id":771,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/posts\/351\/revisions\/771"}],"wp:attachment":[{"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/media?parent=351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/categories?post=351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kagan.mactane.org\/blog\/wp-json\/wp\/v2\/tags?post=351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}