Regular Expressions

While the technique we've used so far for building our reader gave a nice tour of the language, things are starting to get complicated. We still can't parse strings, negative numbers, and many other things. It's time to reset and use a better approach.

We'll now turn to regular expressions, which is a separate language for string parsing that is built into the JVM and browser runtimes. Let's start again with our text, and this time we won't break it into separate characters:


(def text "42 0.5 1/2 foo-bar")

To look for the first digit in this string, we can use re-find like this:


(re-find #"\d" text)

Regular expressions look like strings with a # in the beginning, and \d is a special flag for finding digits. If we want to find one or more digits, we just add a + to it:


(re-find #"\d+" text)

This won't work for decimals though:


(re-find #"\d+" "0.5")

We represent periods with \. because a standalone . as a special meaning. So we can match decimals by saying "one or more digits followed by a period followed by one or more digits":


(re-find #"\d+\.\d+" "0.5")

We can do something similar for fractions:


(re-find #"\d+/\d+" "1/2")

Spaces are easy. To find one or more spaces, just do this:


(re-find #" +" text)

To say that the regular expression must match the very beginning of the string, you just put a ^ in front:


(re-find #"^ +" text)

It returned nil this time, because text does not begin with a space. In Clojure, you can use or to run several expressions until one of them returns a non-nil value:


(or (re-find #"^ +" text)
    (re-find #"^\d+\.\d+" text)
    (re-find #"^\d+/\d+" text)
    (re-find #"^\d+" text))

Now let's put that in a function:


(defn read-next-token [text]
  (or (re-find #"^ +" text)
      (re-find #"^\d+\.\d+" text)
      (re-find #"^\d+/\d+" text)
      (re-find #"^\d+" text)))

The next step is to make a loop that continuously reads tokens from text until it can't anymore. To do that, we'll use an index like before, and cut off the beginning of the string as we move through it. We can do this with subs. For example, to remove the first two characters:


(subs text 2)

So let's begin our loop with this:


(loop [result []
       index 0]
  ,,,)

First, call subs to get the substring, and give that to read-next-token:


(loop [result []
       index 0]
  (let [sub-text (subs text index)
        token (read-next-token sub-text)]
    ,,,))

Then, if token is not nil, add it to result and increment the index. Otherwise, return result:


(loop [result []
       index 0]
  (let [sub-text (subs text index)
        token (read-next-token sub-text)]
    (if token
      (recur (conj result token) (+ index (count token)))
      result)))

We already have a pretty good tokenizer! But it didn't find foo-bar because we have no regex for symbols. If we want to say "one or more characters that are neither digits nor spaces", we can do this:


(re-find #"^[^\d ]+" "foo-bar")

This isn't perfect, but we'll work with it for now. Let's add it to our function:


(defn read-next-token [text]
  (or (re-find #"^ +" text)
      (re-find #"^\d+\.\d+" text)
      (re-find #"^\d+/\d+" text)
      (re-find #"^\d+" text)
      (re-find #"^[^\d ]+" text)))

No we can re-run our loop, and it parses the symbol correctly:


(loop [result []
       index 0]
  (let [sub-text (subs text index)
        token (read-next-token sub-text)]
    (if token
      (recur (conj result token) (+ index (count token)))
      result)))

Let's add something we haven't yet: comments! They'll help us remember what each regex matches. In Clojure, they start with a semicolon:


(defn read-next-token [text]
  (or ; spaces
      (re-find #"^ +" text)
      ; float
      (re-find #"^\d+\.\d+" text)
      ; ratio
      (re-find #"^\d+/\d+" text)
      ; integer
      (re-find #"^\d+" text)
      ; symbol
      (re-find #"^[^\d ]+" text)))

Previous: Loops Continued Next: Namespaces