# CS262 - Lesson 1: String Patterns.

Finding and specifying classes of strings using regular expressions.

Contents

## Building a Web Browser

In this course you will learn the theory and practice of programming languages that will culminate in the construction of a web browser. Your web browser will take HTML and JavaScript as input - the primary languages of the web- and use it to produce an image of the webpage.

You are probably familiar with HTML, which describes the basics of web pages. However, you may be less familiar with JavaScript, which allows you to describe computations so that you can have a lot of power and flashy graphics.

For example, sites use JavaScript to animate tabs, so that when you scroll over them a drop down appears or some floating text appears . When you look at the source code for pages with these features you will see that they use both HTML and JavaScript.

Course and Project Overview

• Step 1: Start with the web page source, and then break that source into important words.
• Step 2: Next, understand the structure of the words you have found.
• Step 3: Finally, you will look for and find the meaning of the structure.

The goal of this course is to use a browser to structure your learning.

### Breaking up Strings

We want to break up strings like the source to a webpage into important words and we're going to use Python to do it. We're given as input part of a webpage like this:

<b>Hello 1

One approach to breaking this up would be to use Python's string.find function, which would find the space between the "Hello" and the "1" and then split up the string into everything to the right of the space as well as everything to the left of the space.

Python's string.find function is often described as finding a needle in a haystack.

Example:

"Mifune Toshiro".find("fun")

"fun" is the needle you are looking for in "Mifune Toshiro" - the haystack string on the left.

The result you should expect is the string index of the beginning of the fun in "Mifune," which in this case is two. Remember, Python starts counting positions at zero; so that the "M" is in position zero, "i" is in position one, and "f" is in position two.

Do you know who Mifune Toshiro is?

Example:

"Hello world".find(" ")
5

Example:

"1 + 1=2".find("1", 2)
4

This expression defines a starting position, two. Therefore, you want to find the first occurrence of 1, starting at position two. Notice that there is a 1 in position zero and position four, however, since we are counting the 1s after position two, there is just one in position four. Example:

"haystack".find("needle")

What happens when the needle you are look for does not occur? In this case Python will return negative one (-1).

### Quiz: Breaking Up Strings

What answer would the Python interpreter return, given the following expressions:

"Alan Turing".find("n", 4)

Solution

## Selecting Substrings

So far, you can find positions (indices) in strings. Now, what you want to do is chop up those strings into substrings. Once you know where the spaces are you can start splitting a sentence up into words. The Python syntax for isolating strings is to use square brackets. The elements within the brackets are the starting position (included), followed by a semi-colon, and then the ending position (not included).

"hello"[1:3]
'el'

This statement says to start at position one and include all of the elements in the string up to but NOT including the element in position three. Therefore, the resulting string is el. You can also leave out one of the number specifiers. In the example below, leaving the position to the right of the semi-colon empty means to go as far as possible - to the end of the string.

'hello'[1: ]
'ello'

Now that you know how to find strings and chop them up, see if you can combine them together to write a Python procedure.

### Quiz: Selecting Substrings

Let p and q be strings containing two words separated by a space. Example: "bell hooks", "grace hopper", "alonzo church". Write a procedure called myfirst_yoursecond(p,q) that returns True if the first word in p equals the second word in q.

Example:

myfirst_yoursecond("bell hooks", "curer bell")
True

Solution

### Split

Splitting words by spaces in computer science is such a common task that Python has a built in function to do just that.

• string.split( )

For example:

"Jane Eyre".split( )
["Jane", "Eyre"]

### Quiz: Split

What is the number of elements in the list returned by split( ) expression in each of the following.

"Python is fun".split( )
"July-August 1842".split( )
"6*9==42".split( )

Solution

## Regular Expressions

For strings with elements such as hyphens or punctuation, like the second two above, you may be inclined to want to split them up too. Therefore, you need more control over splitting strings than Python's built in string function can offer.

Regular expressions are tools that give you more control over splitting up strings!

Suppose you want to find all the numbers in a string. You could make ten different calls to s.find:

s.find("1")
s.find("2")
s.find("3")
...

This gets really tedious, really fast. Regular expressions are a better way to do this.

The term "regular" has special meaning in mathematics and computer science. For now it means, simple strings. And an expression is concise notation.

The following equations correspond to some possible value for x:

• x = sqrt(4) → x =2 or x =-2
• 5<x&lt;9 →="" x="6" or="" 7="" or="" 8<="" li="">

These mathematical equations are concise notations for potentially large sets of values. Especially an equation such as:

• 50< x < 90

Regular expressions are going to be very concise and will let you describe a large number of simple strings.

Regular expressions:

1. [1-3], matches or denotes these three strings: "1", "2", "3"

The basic idea is that there is some symbol on the left and some symbol on the right, so that the regular expression matches each one of those and everything in between

1. [4-8] → "4", "5", "6", "7", "8"
2. [a-b] → "a" "b"

Regular expressions are very popular. They are very useful online and in computing in general. Credit cards, phone numbers, addresses and emails are all handled by regular expressions on websites you use everyday. For example, the form that you fill out when you apply for a U.S. Passport relies upon regular expressions. Regular expressions are very common when you want to enter structured data. Within the form, notice that different items; your birthday, your birthplace, your social security number, your email address, require different ways of writing this information. Regular expressions allows you to make sense of this type of data and process it when you see it on webpages.

Solution

## Import RE

Industrial software is often so big that it doesn't all fit on one page, so people have to break it up into chunks just like a book is broken up into chapters or the body of human knowledge is broken up into many books.

In computer science, a module is a repository or a library of code --functions and data that do useful things.

In Python, import brings in a module.

It turns out that there is already a bunch of functions related to regular expressions..We can just import them into our own Python programs and get their benefits for free.

Python's regular expression module is called, imaginatively enough, 're' --regular expressions.

At the beginning of a Python program, just write 'import re' and then you'll have access to all of the regular expression functions.

If we're going to write down regular expressions in Python, we need to know what they look like. Python regular expressions look just like strings. They start with double quotes or single quotes, and they have contents, except that to separate regular expressions from strings regular expressions begin with a lowercase 'r' outside of the double quotes.

• 5 letter string : "[0-9]"
• RE that matches 10 different one-letter strings: r"[0-9]"

Writing regular expressions is a creative process. You the programmer have to do it. I'm using 'creative' here in the same way that people often describe mathematics as elegant. Just as there are many different equations that could get you the number 4 — 2 plus 2, 1 plus 3, 8 divided by 2, absolute value of the square root of 16-- in fact, and infinite number--there are often an infinite number of regular expressions that could serve a purpose. Picking the right one, the small one, the simple one, the concise one, the elegant one, requires creativity.

It's a skill. It's something you'll learn in this class.

Let's say you've written a regular expression, though--like this one.

r"[0-9]"

Now we need to use it.

One of the most common functions involving regular expressions is findall.

re.findall(r"[0-9]", "1 + 2 == 3")

It takes a regular expression and a string and returns a list of all of the substrings that match that regular expression.

Here if we're looking for single letter strings that are between 0 and 9,

The return value of re.findall is:

>>> re.findall(r"[0-9]", "1 + 2 == 3")
['1', '2', '3']

The list could be empty if you didn't actually match anything. The 're' means it comes from the regular expression library. We really need that import statement at the beginning for this to work.

In this example, I'm using the same haystack string --'1+2==3'--but I'm using a different regular expression. This one only matches two single-letter strings.

>>> re.findall(r"[1-2]", "1 + 2 == 3")
['1', '2']

We'll get out 1 and 2. These two match. The 3 does not, because it's not between 1 and 2. It's not specified or matched by this regular expression.

This last example is a little more tricky. We're looking for the letters a through c, but if you look carefully, these are the lowercase letters a and c.

>>> re.findall(r"[a-c]", "Barbara Liskov")

So even though this 'B' is very tempting it's not between lowercase a and lowercase c. We'll match this a, b, a, a, and then there's nothing over here in 'Liskov'

>>> re.findall(r"[a-c]", "Barbara Liskov")
['a', 'b', 'a', 'a']

For some more details on regular expressions in Python check out this document from Google Code. For even more resources look to the forums.

Solution

### Concatenation

Now that we've mastered single character regular expressions, let's look into gluing them together.

We're going to need to find important bits of punctuation like /> or == to reason about JavaScript and HTML and thus write our web browser.

Thus we really need the ability to concatenate or put right next to each other in repeat regular expressions.

Well, with regular expressions that's actually as simple as just writing two regular expressions right next to each other.

>>>r"[a-c][1-2]"

This matches the string a1, a2, b1, b2, c1, and c2--six strings in all. In each one, the first letter comes in the first regular expression, and the second letters, 1 or 2, matches the second part of the regular expression. We've concatenated a through c and 1 through 2 together to match more complicated strings. You may have noticed that we suddenly had quite a few strings from a relatively small regular expression.

In fact, if we 0 through 9 next to 0 through 9, there are a huge number of strings that we would match--100 in total.

>>>r"[0-9][0-9]"

Just as the first part matches 10 and the second part matches 10 when you put them together, you match 10-squared strings.

So let's look for a two-digit number in the string July 28, 1921.

>>>re.findall(r"[0-9][0-9]", "July 28, 1821")

We'll end up getting:

>>> re.findall(r"[0-9][0-9]", "July 28, 1821")
['28', '18', '21']

Now I'm looking for two-digit numbers in 12345

>>> re.findall(r"[0-9][0-9]", "12345")
['12', '34']

12 is a two-digit number, 34 is a two-digit number, but 5 actually does not qualify. This regular expression requires that both subparts be matched.

Solution

### One or More

It's time to introduce a new regular expression: +.

This is really handy when we want to match one or more of something.

This is a very concise way of listing, actually, an infinite number of possibilities.

If I write the regular expression

r"a+"

it matches 'a', 'aa', 'aaa', 'aaaa'

The plus looks back to the previous regular expression and changes the meaning.

Instead of just matching that once, you match it once or more--as many times as you'd like.

Here I've shown another example.

r"[0-1]+"

We're looking for 0 through 1 repeated one or more times.

The interpretation here is every time you repeat you can make a different choice.

This matches '0' and '1', '00' and '11' but also '01', '10', ... '1010', '1011', '1111', '1101', ...

There is a minor bit of ambiguity I need to clear up with this plus.

Let's say that we're looking for numbers 0 through 9 plus in the string '13 from 1 in 1776'

re.findall(r"[0-9]+", "13 from 1 in 1776")

One possible answer is 13, 1, and 1776,

['13', '1', '1776']

but this plus just means 1 or more. Is there anything that says that I have to match them all at the same time or could I break up 1776 into four different one-letter strings.

['1', '3', '1', '1', '7', '7', '6']

It turns out that there is a rule in regular expressions called 'maximal munch' which says that a regular expression should consume, or eat, or match the biggest string it can and not smaller parts.

So we and Python and other people studying regular expressions are going to get this answer:

['13', '1', '1776']

13, 1, 1776, because 1776 is the maximal munch we can get here for 0 through 9 plus.

Don't stop early. Go all the way.

Solution

## Finite State Machines

We want to do even more with regular expressions, such as matching a word or a number.

To do this, we're going to introduce a visual representation for regular expressions that actually shows exactly what's going on behind the scenes, and then we're going to follow along in Python.

Suppose we have the regular expression

r"[0-9]+%"

Any character like this % that just appears on its own is matched directly,so this catches strings like 30%, 99%, 2% and various other things we might find describing sales or the fat content of milk.

Here I've drawn a finite state machine, a visual representation of this regular expression.

Often there's an arrow coming out of nowhere on the left that's not connected to the rest of the picture.

That indicates where we start. These 3 circles are states.

They represent what we're up to when we're matching a string against the regular expression--

what configuration we're in, what our current state of mind is, what we've seen so far.

I've labeled my states 1, 2, and 3. These other arrows are called edges or transitions.

They tell us when to move from 1 state to another. I start in state 1, and if I see a 0 - 9, I move over to state 2.

This 0 - 9 is the label associated with this edge.Finally, you'll notice that 1 of my states has a double circle.

That's an accepting state.

If we end up in an accepting state at the end of the input, this finite state machine matches the given string.

Let's trace through what happens on input 23%.

We start in the start state, and the character we see is a 2, so we follow this edge to state 2.

Now the next thing we see is a 3, so we follow this edge back to state 2.

These are sometimes called self-loops. It's a loop that takes me back to right where I started.

Now we see the % sign, and we end up in state 3, which is an accepting state, so our finite state machine accepts this string '23%' just like our regular expression would.

Let's try just the string '2'.

We start in the start state. We see a 2, so we move over here, and then we're done.

We ran out of input, but we're not in an accepting state.

Our finite state machine rejects this just like our regular expression would.

Finally, let's consider the string '2x'.

We start here in state 1. We see a 2, so we go over to state 2. Then we see an x, and there's no outgoing edge from state 2 on an x, so we fall off the finite state machine and die.

This is very sad, and when this happens our finite state machine does not accept the string, just like the regular expression would not.

Solution

Solution

## Disjunction

Consider this spiffy, new finite state machine.

It accepts both 1 or more letters like word--w-o-r-d--and also 1 or more digits--1, 2, 3.

In fact, there's a sense in which it accepts either [a - z]+ or [0 - 9]+.

Note its 2 accepting states. Such power!. Can we do the same thing with regular expressions?

It turns out we can with a new regular expression operator, a nubitive syntax in regular expressions that I'm going to teach you.

r"[a-z]+|[0-9]+"

This vertical bar means I match either the thing on the left or the thing on the right.

It's formally called 'disjunction' sometimes, but we can just read it as 'or'.

Match [a - z]+ or [0 - 9]+.

For example, let's say we want to find all matches of lowercase [a - z]+ or [0 - 9]+ in 'Goethe 1749'.

re.findall(r"[a-z]+|[0-9]+", "Goethe 1749")

We'll get both 'oethe' and '1749'.

['oethe', '1749']

We don't get the capital G because we asked for lowercase letters over here.

While we're here, our old friend regular expression [0 - 2] is really just 0|1|2.

So I could write out [a - z] as 26 different choices, but that's not very concise.

Solution

### Quiz: Disjunction Construction

Now let's turn the tables a bit and attack the problem from the other direction.

This time, I'd like you to use the interpreter, and you're going to create your own regular expression.

Assign to the variable regexp, a common abbreviation for regular expression, a regular expression that matches either the exact string ab--2 letters in ab-- or 1 or more digits. To help clarify this specification, I have 3 positive instances --you should match ab, 1, and 123-- and 3 negative instances-- don't match a alone, don't match abc, and don't match abc 123.

Solution

## Options

So now we have a way with regular expressions to choose either a or b.

Another very common choice is to choose between 'a' or nothing--that is, to have a part of a string that's optional.

For example, when you're writing out numbers, it's possible to put a negative sign at the begining of a number.

But you don't need to, depends on which number your trying to get across.

Here I've drawn a finite state machine that accepts numbers with no leading negative sign and numbers with leading negative sign.

Lets see how it goes.

For something like 1,2,3, we start here. 1,2,3, and we're in state 4, which accepts.

For something like -57 --negative, 5, 7-- we're in state 3, which accepts.

But you may have noticed quite a bit of duplication in this finite state machine.

These two red areas are the same.

It's an edge labeled 0 - 9, an accepting state with a self-loop labeled 0 - 9.

And one of the things we really wanted was to be concise.

So conceptually, it might be simpler to have an edge that somehow consumes no input.

This new finite state machine will still accept 1, 2, 3.

Here we start in state 1. I don't consume anything and move to state 2.

And then it's 1, 2, 3, and we accept.

-57, I take the negative, 5, 7, and then we accept.

We have a particular convention for indicating that an edge accepts no input.

We use the Greek letter Epsilon.

You can either think of Epsilon as meaning 'consume no input' when you go across this edge, or you can think of it as referring to 'the empty string', at which point you can consume the empty string all you want, but since it's of size 0, it doesn't effect what you're trying to recognize.

This idea of using this Greek letter --this is totally arbitrary. This is just an artificial convention. But it's a commonly used one, so it's worth knowing.

Continuing our theme that anything that can be done in finite state machines can be done in regular expressions and vice versa--we'll firm that up later on.

I'm going to give you a new regular expression-- ?, which we typically read as optional or the previous thing 0 or 1 times.

In that way, it's a lot like plus, which was the previous thing 1 or more times.

So I might write a regular expression that accepts numbers that may optionally have a leading negative sign:

re.findall(r"-?[0-9]+", "1861-1941 R. Tagore")

This negative sign may be present 1 time or 0 times. We definitely need the [0 - 9]+.

And the string we're looking for this needle in is '1861 - 1941 R. Tagore'

And on this particular input, we will find 2 substrings that match.

1861 matches without the leading negative sign, and - 1941 matches with the leading negative sign.

["1861", "-1941"]

## Escape Sequences

Just as we can get a lot of use out of + for 1 or more copies, sometimes it's nice to have 0 or more copies.

So we'll introduce the * regular expression for that.

You can convert between the 2 of them.

• a+ == aa*

If you wanted 1 or more copies of a, you could have a, followed by 0 or more copies of a. Some classes or texts will teach the star first and then move on to the plus. We teach the plus first because it's more common for specifying Python and Javascript.

So now, the plus, the star, the question mark, the open square brackets, the closed square brackets--they all mean something special in regular expressions.

+, *, ?, [, ]

They help us to note sets of strings.

What if the string I want to match is actually just a + sign.

How do I do it if + means 1 or more copies of what came before?

We need some way of saying, 'No, no--I really mean it! Actually +, not 1 or more, just +.'

We're going to solve this by using something called escape sequences, but before I get into them, let's introduce them by way of analogy.

In Python, you can declare a string using double quotes or single quotes.

So if you want to have a constant string that reads,

'P&P is Jane's book'

If you use single quotes, Python will get confused because you have 3 and it will think your string ends at the 'e' in Jane.

No problem, you say?

I will just use double quotes.That's what they're there for.

"P&P is Jane's book"

But what if you want to include quoted dialogue in your string?

So now I want to say,

I said, "P&P is Jane's book"

So now I'm using both the single quote and the double quote for their actual meanings. What do I put on the side?

Well, Python will actually let you bypass this by using triple quotes,

"""I said, "P&P is Jane's book""""

but what if I really wanted to have triple quotes in the middle too? We can't do this forever.

Well, it turns out that if I just put a backward slash, a backslash, in front of a quote or a double quote or whatnot, Python will treat it as being part of this string and not as being the end of the string.

"P&P is Jane\'s book"

We're escaping out of treating quotes as string delimiters, so this is Python's way of saying, "No, no, I really mean it! I actually have a quote there!"

I said, \"P&P is Jane's book\"

Now when you actually go to write this down in Python, you may not get different colors or different fonts, so this maybe a little hard to read, but it does work.

I can use these escape sequences to literally write down a special character.

The backslash is sometimes called an escape character and the 2 of these together are an escape sequence, a sequence of characters that are treated as if it were just the double quotes, just a single quote.

Note that Python is throwing away this backslash.You won't actually see it.

Just to show you how this sort of thing plays out in Python, I've written down 2 different versions of P&P is Jane's book-- one using double quotes and one using an escape sequence. We're going to print out both of their values and then check to see if they're actually equal.

And in fact, they are.

Even though we entered them slightly differently, Python treats them both the same internally and indicates that they're equal.

Here's one more example of this.

This time with a double quote being escaped twice.

And again, the 2 strings are equal under the hood as Python deals with them.

It turns out that we can do the same thing in regular expressions.

Suppose you want to find the string '++'.

This regular expression

r"\+\+"

has 2 escape sequences, and it finds only the string '++'.

Solution

## Tricky

So as we saw, this quiz was particularly tricky. It wasn't obvious how to work out the right answer. We're really interested in supporting phone numbers from a bunch of different countries or formats, but we're really only interested in that hyphen if it's followed by more digits.

Conceptually, you might think 'Wow, I really want to group the hyphen and the digits together and then have either all of them or none of them'

We don't know how to do that just yet, but we will in a few more minutes,and then you'll get a chance to show off your mastery in the homework.

Solution

## Quoted Strings

Quoted strings, that is, strings that are surrounded by double quotes, or the like, are a tricky issue that comes up in both JavaScript and HTML.

Let's bring all of our regular expression power to bear to see about separating quoted strings from other words.

Here I've drawn an evil quoted string that contains a bunch of double quotes.

"I said, \"Hello.\""

We really want to get to just the heart of it, just the contents and peel off these 2 double quotes at the end. They're like the rind. I want to get to the core. However, if I just repeatedly use string.find to look for double quotes, I'll find the first one, but also the one in the middle. And this one in the end.

So I might mistakenly think that it's 2 quoted strings,

'I said' , and nothing and the end.

Find isn't good enough! We'll have to use regular expressions instead.

But first, to make our job a little easier, let me introduce to you some new regular expressions.

The first is the dot, or period, (.) which matches any character except a new line,or what you get when you press enter or return.

For example, here I'm looking for any decimal digit [0 - 9] and then any character-- anything except a line break--and then another [0 - 9].

re.findall(r"[0-9].[0-9]", "1a1 222 cc3")

['1a1', '222']

So for example, this is a decimal digit. This is another decimal digit. And the 'a' between them is any character: '1a1'

This 2 is a decimal digit. The 2 between them is any character. This 2 is another decimal digit: '222'

This 'cc3' doesn't qualify because this 'c' is not in [0 - 9]

Recall that regular expression matching is non-overlapping. If we use re.findall() to look for r"[0-9][0-9]" in the string "123" we'll find "12" but we will not also find "23" because that "2" was already used as part of the first match.

And one more--sometimes it's nice to be able to say anything except a digit or anything except a number or anything except p.

The regular expression syntax \^ (pronounced caret) only means "not" or "set complement" or "everything except" if it is used just inside [square brackets]. If you use the caret outside of the square brackets, it has another meaning.

re.findall("[0-9][^a-b]", "1a1 222 cc3")

Here we're looking for [0 - 9], followed by anything that's not 'a'; and also not 'b'

So here ( 1a1) --oh!

That immediately didn't work because the next thing was an 'a', and we're asking for not 'a'.

Right over here we've got a 1 and a space, and space isn't 'a' or 'b', so that looks good.

Then here we've got a 2 and a 2, and this second 2 is not 'a' or 'b', so that looks good.

Here we've got a 2 and a space, and this space is not 'a' or 'b', so that looks good.

'C' is not [0-9]. 'C' is not [0-9], 3 is [0-9], but then we're at the end of the string.

So that's it.

['1 ', '22', '2 ']

## Structure

In mathematics, when an expression gets complicated, we can add parenthesis to show structure or grouping

(x - 3) * 5 is different than x - (3 * 5).

Python regular expressions have similar parenthesis, but they're written a litttle differently.

The closing parenthesis looks just the same, but the opening parenthesis--the version you'll be using in this class--is 3 characters--

(?: ).

There's a simple example.

This regular expression makes a group around xyz and then whole thing can be repeated 1 or more times, so some strings are xyz, xyzxyz, and so on.

Suppose we want to find words made up of combinations of musical notes.

In Western music, the notes are often given names--do, re, mi, fa, so, la, ti-- and you could put them together in various combinations--re, fa, fa--do, do, re-- stuff like that.

Let's say we want to recognize words that are made up of these syllables in order, or these syllables not in order but in any combination.

So we set out to try it.

re.findall(r"do+|re+|mi+","mimi rere midore doo-wop")

We can have a bunch of do's or a bunch of re's or a bunch of mi's. Let's say we're looking for all of the matching strings in mimi, rere, midore, doo-wop, and we want to see which ones we get.

We'd like to get mimi as 1, rere as another, midore, and then maybe do, just sort of as a corner case, but mostly these 3, but we will be unpleasantly surprised.

['mi', 'mi', 're', 're', 'mi', 'do', 're', 'doo']

We would really expect something like mi+ to get mimi. Maximum munch rule, why have you betrayed us? If you look over here, you'll see that actually everyone of these little musical syllables — 'mi', 'mi', 're', 're', 'mi', 'do', 're'--seem to come out separately. Why? Well, if you think about it, in the regular expression

r'mi+'

the plus only effects the 'i', so really this is getting mi, mii, miii — an entire virtue of selfishness — rather than the thing that we wanted.

You can actually see this at the end where do+ got us doo from doo-wop.

So here, the + isn't applying to the right thing, isn't binding correctly. This isn't quite the right way to do it.

This, however, is.

re.findall(r"(?:do+|re+|mi)+","mimi rere midore doo-wop")

Note our use of the parenthesis in regular expressions-- (?:--marks the beginning of such a group--), and then here in the middle we have do/re/mi.

Anything inside this group can be repeated 1 or more times. This gets us just the answer we were looking for.

["mimi", "rere", "midore", "do"]

But it's also worth pointing out that a very popular computer musical format or interface M-I-D-I--MIDI, the musical instrument digital interface used for recording things like pianos or synthesizers or drum sets, is more or less exactly what we've seen here.

Basically, a list of notes and modifiers and combinations over and over again.

By contrast, formats like MP3 or other audio compression approaches for recording voice, do not follow this general pattern, or at least they don't look like they do at first blush.

Solution

## Representing a FSM

Let's zoom back to finite state machines at 88 miles an hour.

Here's a finite state machine that corresponds to the regular expression 'a+1+'

Let's just verify that by tracing out the input, aa1, on this finite state machine. We start in the start state. We haven't seen anything yet. We see the a. We're in state 2. We see the a, self-loop back to state 2. We see the 1. We're in state 3. Oh! State 3 is an accepting state. Ha-za!

Surprisingly, this super high-tech sounding 'tracing with my finger' approach is actually pretty much exactly what computers do under the hood to check strings against regular expressions or evaluate finite state machines.

You really only have to keep track of where you are in the input and which state you're in and not much else.

So let's do this together.

We'll write a computer program in Python to check to see if a finite state machine accepts a string.

So if I somehow give it this finite state machine an aa1 as input, it should say, true.

If I instead give it aa1b, it should say false because that string is not accepted.

But the first big design decision is, how do we represent this finite state machine?

By now, we know how to pass a Python program a number or a string or a list, but how do I pass in a picture?

Well, for the states 1, 2, 3, presumably, I could just pass in a list of the states.

It's these edges, these arrows that go anywhere.

That's what really matters.

What we really want to know from an edge is, if I'm in state 1, and the next input is 'a', where do I go?

So let's use Python dictionaries or maps to do this.

I'll make a Python dictionary or map called edges, and I'll just pass in my current state and the input letter, and it will give me the new state at the end.

Before we dive into it though, let's have a little refresher on maps and also tuples.

You may have seen them before in a previous CS class, but if you haven't, I'll just go over them right now.

You make a new map or dictionary in Python using open curly braces and closed curly braces.

The purpose of a map is to associate 1 thing with another.

For example, here I'm making a map that's going to help me keep track of which things in the world are flowers because I might easily forget this critical knowledge.

So I can update my map by saying, oh, 'roses' should map to true in the is_flower dictionary. But 'dog' is not a flower, so that should map to false.

Then if I go look it up later, is_flower of 'rose' will return true.

There's an alternative notation for specifying a map.

Inside the curly braces you use to make a new map, you can actually just put all of the bindings--'rose' maps to true. Colon. 'Dog' maps to false.

There's a colon in the center.

Now at this point you're probably thinking, what's in a name?

Is this word, 'rose' really important, or would a 'rose' by any other name still smell as sweet?

Well, we may be able to tell synonyms, but Python cannot.

So if I try something like, is_flower 'juliet', that's not defined in this mapping, so we will get some sort of key error element not found exception.

For Python dictionaries, you need to get the name exactly right.

Dictionaries and mappings are synonyms. They both refer to the same thing.

A Python tuple is just an immutable list.

Immutable means you cannot change it. Once you make it, it is etched in stone.

For example, I can make a tuple to hold the Cartesian coordinates of some object.

Maybe my point on the grid is at (1, 5).

I can access its elements the same way I would for a list.

The 0th part of point is 1. The 1th part of point is 5.

And while Cartesian points may not be super exciting, many of you may have done navigation or taken long trips and used GPS coordinates or longitude and latitude.

### FSM Simulator

With all of that in mind, let's encode our finite state machine in Python.

Here I've redrawn our finite state machine for 'a+1+', and we said before that we were going to make the edges a mapping or a dictionary.

Well, one of our edges is at state 1 on 'a.' State 1 on input 'a' goes to state 2. And another one is at state 2 on 'a' stays in state 2.That's our self-loop. If we were on state 2 and we see a 1, we go to state 3. State 3 on 1 stays the same.

Let me just highlight one of these.

This particular edge from 2 to 3 on 1 corresponds to this entry in our edges mapping.

I also need to know which states are the accepting states. Previously, I denoted that by drawing double lines, but again we can't pass a picture into Python, so I'll just make a list of all the things it accepts.

Then actually that's it.

You'd think we'd need a list of nodes, but you're going to see that we're actually able to finesse it because all the nodes we really care about already appear in this edges listing.

### Quiz: FSM Simulator

So here we are writing our finite state machine simulator, and this is actually super exciting because it previews one of the concepts that we're going to have later in the course--interpreting another program.

It's like the junior grade version of it, even this will be a lot of fun.

So together, we're going to write a procedure called fsmsim for FSM simulation, finite state machine simulator.

You pass in the input string, the start state or the current state, the edges, and the accepting states, and it returns true if that string is accepted by the finite state machine and false otherwise.

We'll do it together.

Submit via the interpreter--I'll write the first half of this procedure with you.

So let's get started on our finite state machine simulation quiz.

Here I'm just recopying the edges definition so that we'll have a test input to work with.

These 2--the edges and the accepting state--correspond to the regular expression 'a+1+', and now we're going to define our procedure, finite state machine simulator given a string, the current state, the edges--these ones up here--and the accepting state.

What do we do? Well, one possibility is that we're already at the end of the input, at which point we should just check to see if our current state is an accepting state or not.

If we're at the end of the input and we are state 3, then we return true.

Otherwise, we should be returning false.

If the string isn't empty, then I can get the current letter as the 0th position from the string.

We know the current input letter we're looking at, the current state we're in, all of the edges are available to us-- you fill out the rest of this code.

Here's a hint.

Find out if there's a valid edge leaving from this state on this letter.

If there is, you should take that edge and keep walking.

If there is not, we fall off the end of the finite state machine and die, so you should return false.

And the big hint for you is recursion, which is always the hint in computer science because it's the secret of all power and knowledge in the universe.

Recursion, use it.

Solution

Solution

Solution

Solution

## Epsilon and Ambiguity

It turns out that Python's regular expression module actually uses something very similar to FSM sim under the hood.

You just take the regular expression, turn it into a finite-state machine, which you've done forwards and backwards many times, and then check with a simple recursive procedure to see if the finite-state machine accepts a string.

However, our simulation did not handle epsilon transitions or ambiguity, and what I mean by ambiguity is what if there are 2 outgoing edges labeled a?

Let's say one of them leads to an accepting state, and one of them doesn't.

What should we do?

Well, there is a formal definition for this kind of ambiguity.

However, it's not going to solve our problems.

We say that a finite-state machine accepts a string s if there exists even one path from the start state to any accepting state that follows s.

This finite-state machine accepts a because there's one way to do it where a causes you to end up in an accepting state.

If you like, you can say that finite-state machines are generous.

If there's any way to accept, we will make that work.

However, our finite-state machine simulation didn't code that up, so we're going to have to return to both of these issues.

Solution

Solution

## Nondeterminism

These easy-to-write FSMs that we've been using that involve epsilon transitions or ambiguity-- remember, ambiguity means that I can go to 2 different places on the same input-- are formerly known as non-deterministic finite state machines.

Non-deterministic here just means that you may not know exactly where to go or where to put your finger.

It's not lock-step. You have choices.You have freedom.

A lock-step FSM with no epsilon transitions and no ambiguity by contrast is called a deterministic finite state machine.

Everything is determined from the start.

Given the finite state machine and the input, you always know exactly what will happen.

Our finite state machine simulation procedure can handle deterministic finite state machines.

That makes them really useful for implementing regular expressions under the hood.

Let me just show you an example of this non-determinism just to drive it home.

Suppose we were back here in this previous finite state machine, but now the input is 1-23.

Where are we?

We started here, and on a 1 we went here, and then I guess if we're supposed to stay alive and there's a hyphen, we must have gone here and taken the hyphen.

And now there's a 2, but now this is really not obvious.

I could stay here on this self-loop for a 3, or I could have gone back on this free epsilon transition and done the self-loop here on a 3, so I could be in state 2 or state 5.

Since there isn't one single answer for where I could be, this is non-deterministic.

As a bit of a fun aside, this notion of determinism or non-determinism can be related to the question of free will in philosophy.

Can we actually make independent choices?

Or is everything preordained by the current state of the world and forces acting on it, like a lock-step game of billiards or snooker or pool?

Some philosophers will approach this by suggesting that we have the illusion of free will--that's a disconcerting thought-- which is handy for describing subjective experience.

We certainly often feel like we have free will.

Regardless of what's going on in the real world,

we're going to see that something similar holds for finite state machines.

Although non-deterministic finite state machines look like they have much more power and much more freedom, anything that could be done with them could also be done in our deterministic billiard ball world.

In fact, you can watch me suck free will out of this world right now.

Every non-deterministic finite state machine has a corresponding deterministic finite state machine that accepts exactly the same strings.

Non-deterministic finite state machines are not any more powerful than deterministic finite state machines.

They're just more convenient. It's easier to write them down.

Let's see an example of this extraordinary claim.

Suppose we have this regular expression.

There are only 2 strings in the language of this regular expression, but here I've drawn out a very elaborate finite state machine for it that has epsilon transitions coming out the wazoo.

This is very non-deterministic.

We definitely need to see an a, but then here these 2 epsilon transitions represent the explicit choice.

Do I do the b, or do I skip over it?

On the top road, we need to see the b.

On the bottom road, we skip entirely past it.

And then in any event, we need to see the c.

I'm now going to write a deterministic finite state machine that does exactly the same thing, and I'm going to hint at how this conversion works.

We'll see this again in just a minute.

After I see an a, I could be in 2, or I could take the epsilon to 3.

I could have taken the epsilon down here to 6 or all the way over to 4, so there are 4 places

I could be in. That's a lot of fingers.

I'll just record all of them as the name for my new state, 2364.

From here, if I see a b and I survive--remember, finite state machines work if there's any path that gets to the end--it must have been that I was in state 3, at which point now I'm just in state 4.

By contrast, if I was back here and I saw a c, it must have been that I was in state 4, and now I'm in state 5.

And then finally, if I'm in state 4 and I see a c,

I end up here, so this deterministic finite state machine accepts the same language as the one above.

The 2 strings, a, b, c, and a, c are both in it, but it does not have epsilon transitions or ambiguity.

In any particular node, there are never 2 edges going out labeled a, and there are never epsilon transitions.

Let's see another example of how this works.

Again, I'm going to build a deterministic machine where every state in the deterministic machine corresponds to a set of states

in the non-deterministic one.

Again, to restate that, you give me a non-deterministic machine,

I'm going to build you a deterministic machine d that accepts the same language, and the way I'll do it is every state in d is going to correspond to a set of states in the non-deterministic machine you gave me.

### Quiz: Nondet to det

As the quiz, what should the label for this edge be so that this deterministic equivalent and this non-deterministic machine accept exactly the same language?

Solution

## Save the World

Let's wrap up what we've learned in this unit.

Strings are just sequences of characters, and manipulating strings is going to be critically important for making a web browser.

Modern programming languages support regular expressions, which are just a concise notation for specifying sets of strings, and using regular expressions is more flexible than using fixed string matching like string.find.

With regular expressions, we can define phone numbers, words, numbers, quoted strings, and given a regular expression, we can search for and match it in a bigger string.

Finite state machines are a pictorial equivalent to regular expressions.

Every regular expression, concatenation, plus, question mark, star, has an equivalent finite state machine.

And in fact, although I didn't show it, vice versa.

Every regular expression has a finite state machine, and every finite state machine has a regular expression.

And then every finite state machine can be converted to a deterministic finite state machine.

No epsilons, no ambiguity.

Once we have a deterministic finite state machine, we can simulate it, and it turns out it's very easy- about 10 lines of recursive code--to see if a deterministic finite state machine accepts a string.

In fact, you've written that code.

Now that you know how to implement regular expressions, take the regular expression, make a finite state machine, make it deterministic, call FSM sim.

We'll just use Python's regular expression library, but it's doing exactly those steps under the hood.

It works the same way you would.

In our next exciting episode, we're going to use this knowledge to specify important parts of HTML and JavaScript like string constants or hypertext tags.

As the first step to writing our web browser, one great resource available to you as you revise this material and work on the homework is the forums.

## Conclusion

We've just finished learning about sets of strings, regular languages, regular expressions, and finite state machines, a beautiful formalism and a lovely way of implementing it in actual Python.

This idea, this tool of regular expressions specifying sets of strings is a really powerful and really expressive way of writing quite a few programs.

We're going to see this come up later on in everything from mail to web servers to web browsers to writing our interpreter for JavaScript and HTML.

## Quiz Solutions

### Breaking Up Strings

3
>>> "Alan Turing".find("n", 4)
9
>>>

### Split

>>> "Python is fun".split( )
['Python', 'is', 'fun']
>>> "July-August 1842".split( )
['July-August', '1842']
>>> "6*9==42".split( )
['6*9==42']

### Disjunctions in FSMS

# Disjunction Construction

import re

# Assign to the variable regexp a regular expression that matches either the
# exact string ab or one or more digits.

regexp = r"ab|[0-9]+"

# regexp matches:

print re.findall(regexp,"ab") == ["ab"]
#>>> True

print re.findall(regexp,"1") == ["1"]
#>>> True

print re.findall(regexp,"123") == ["123"]
#>>> True

# regexp does not match:

print re.findall(regexp,"a") != ["a"]
#>>> True

print re.findall(regexp,"abc") != ["abc"]
#>>> True

print re.findall(regexp,"abc123") != ["abc123"]
#>>> True

### Hyphenation

# Regexp Details and Challenges

import re

# Assign to the variable regexp a Python regular expression that matches
# lowercase words (a-z) or singly-hyphenated lowercase words.

# Hint: It may not be possible to get correctly - do your best!

regexp = r"[a-z]+\-?[a-z]+"

# regexp matches:

print re.findall(regexp,"well-liked") == ["well-liked"]
#>>> True

print re.findall(regexp,"html") == ["html"]
#>>> True

# regexp does not match:

print re.findall(regexp,"a-b-c") != ["a-b-c"]
#>>> True

print re.findall(regexp,"a--b") != ["a--b"]
#>>> True

However, 1 problem with this regular expression is that it does not accept single letter words like 'a' or 'i'

To see why, just look at the 2 plus signs.

This requires 1 or more letters here and 1 or more letters there,

That's at least 2 letters.

We might be tempted to fix it up by making 1 of these a star,

r"[a-z]*\-?[a-z]+"

but now we mistakenly accept things like just '-a'

What if I try to make the other one a star?

r"[a-z]+\-?[a-z]*"

Well, dual problem--now we'll mistakenly accept things like 'a-'.

Well, this is a bit of a challenge.What we really want is for this hypen and the second word to be grouped together, and either they're both there or they're not.

It's as if I really want this question mark to apply to both the hyphen and also the [a - z]+.

We don't know how to do that yet, but you'll get a chance after we've learned how to fix this up in the homework.

### Re-Challenges

# RE Challenges

# Assign to the variable regexp a Python regular expression that matches single-
# argument mathematical functions.

# The function name is a lowercase word (a-z), the function argument must be a
# number (0-9), and there may optionally be spaces before and/or after the
# argument.

# Hint: You may need to escape the ( and ).

import re

regexp = r"[a-z]+\( *[0-9]+ *\)"

# regexp matches:

print re.findall(regexp,"cos(0)") == ["cos(0)"]
#>>> True

print re.findall(regexp,"sqrt(   2     )") == ["sqrt(   2     )"]
#>>> True

# regexp does not match:

print re.findall(regexp,"cos     (0)") != ["cos     (0)"]
#>>> True

print re.findall(regexp,"sqrt(x)") != ["sqrt(x)"]
#>>> True

### Escaping the Escape

# Tricky REs with ^ and

# Assign to regexp a regular expression for double-quoted string literals that
# allows for escaped double quotes.

# Hint: Escape " and
# Hint: (?: (?: ) )

import re

regexp = r'"(?:[^\\]|(?:\\.))*"'

# regexp matches:

print re.findall(regexp,'"I say, \\"hello.\\""') == ['"I say, \\"hello.\\""']
#>>> True

# regexp does not match:

print re.findall(regexp,'"\\"') != ['"\\"']
#>>> True

### FSM Simulator

# FSM Simulation

edges = {(1, 'a') : 2,
(2, 'a') : 2,
(2, '1') : 3,
(3, '1') : 3}

accepting = [3]

def fsmsim(string, current, edges, accepting):
if string == "":
return current in accepting
else:
letter = string[0]
# QUIZ: You fill this out!
# Is there a valid edge?
if (current,letter) in edges:
# If so, take it.
current=edges[(current,letter)]
return fsmsim(string[1:], current, edges, accepting)
# If not, return False.
else:
return False
# Hint: recursion.

print fsmsim("aaa111",1,edges,accepting)
# >>> True

You may have noticed as we were going through it, that edges and accepting never change.

I defined them once at the beginning of the file.

So our finite state machine simulator is really just recursive in the input and the current state, and that matches our intuition because those are the 2 fingers I was using to work it out on paper.

### FSM Interpretation

# FSM Interpretation

# Define edges and accepting to encode r"q*". Name your start state 1.

edges = {(1,'q'):1}

accepting = [1]

def fsmsim(string, current, edges, accepting):
if string == "":
return current in accepting
else:
letter = string[0]
if (current, letter) in edges:
destination = edges[(current, letter)]
remaining_string = string[1:]
return fsmsim(remaining_string, destination, edges, accepting)
else:
return False

print fsmsim("",1,edges,accepting)
# >>> True

print fsmsim("q",1,edges,accepting)
# >>> True

print fsmsim("qq",1,edges,accepting)
# >>> True

print fsmsim("p",1,edges,accepting)
# >>> False

### More FSM Encoding

# FSM Interpretation

# Define edges and accepting to encode r"[a-b][c-d]?". Name your start state 1.

edges = {(1,'a'):2,
(1,'b'):2,
(2,'c'):3,
(2,'d'):3}

accepting = [2,3]

def fsmsim(string, current, edges, accepting):
if string == "":
return current in accepting
else:
letter = string[0]
if (current, letter) in edges:
destination = edges[(current, letter)]
remaining_string = string[1:]
return fsmsim(remaining_string, destination, edges, accepting)
else:
return False

print fsmsim("a",1,edges,accepting)
# >>> True

print fsmsim("b",1,edges,accepting)
# >>> True

# >>> True

print fsmsim("e",1,edges,accepting)
# >>> False

### mis MSF

# FSM Interpretation

# Provide s1 and s2 that are both accepted, but s1 != s2.

s1 = "bdf"

s2 = "bdgbdf"

edges = {(1,'a') : 2,
(1,'b') : 3,
(2,'c') : 4,
(3,'d') : 5,
(5,'c') : 2,
(5,'f') : 6,
(5,'g') : 1}

accepting = [6]

def fsmsim(string, current, edges, accepting):
if string == "":
return current in accepting
else:
letter = string[0]
if (current, letter) in edges:
destination = edges[(current, letter)]
remaining_string = string[1:]
return fsmsim(remaining_string, destination, edges, accepting)
else:
return False

print fsmsim(s1,1,edges,accepting)
# >>> True

print fsmsim(s2,1,edges,accepting)
# >>> True

print s1 != s2
# >>> True

### Phone It In

# Cumulative Recap!

# Suppose we want to recognize phone numbers with or without hyphens. The
# regular expression you give should work for any number of groups of any (non-
# empty) size, separated by 1 hyphen. Each group is [0-9]+.

# Hint: Accept "5" but not "-6"

import re

regexp = r"[0-9]+(?:-[0-9]+)*"

# regexp matches:

print re.findall(regexp,"123-4567") == ["123-4567"]
#>>> True

print re.findall(regexp,"1234567") == ["1234567"]
#>>> True

print re.findall(regexp,"08-78-88-88-88") == ["08-78-88-88-88"]
#>>> True

print re.findall(regexp,"0878888888") == ["0878888888"]
#>>> True

# regexp does not match:

#print re.findall(regexp,"-6") != ["-6"]
#>>> True

## World of Westley's Whimsical References

### From Unit 1-4 : Toshiro Mifune, Hello World, Ada Lovelace, Alan Turing

Toshiro Mifune was a Japanese actor in films such as Seven Samurai and Rashomon

Hello World is a traditional first program or first example text.

Ada Lovelace is widely regarded as the first computer programmer.

Alan Turing was highly influential in the development of computer science and in code breaking during World War II.

### From Unit 1-5 : Grace Hopper, Currer Bell, bell hooks

Grace Hopper was a computer scientist who developed the first compiler — or translator — for a computer programming language.

Currer Bell (note spelling — the video has a typo) was a pen name used by Charlotte Bronte when she wrote Jane Eyre.

bell hooks (intentionally uncapitalized) is a pen name used by Gloria Jean Watkins, an author and social activist.

### From Unit 1-6 : What do you get if you multiply six by nine, Douglas Adam's The Hitchhiker's Guide to the Galaxy

What do you get if you multiply six by nine? was a fairly important question in The Hitchhiker's Guide to the Galaxy, a science fiction comedy by Douglas Adams.

### From Unit 1-8 : Isak Dinesen, Out of Africa

Isak Dinesen was a pen name for Karen Blixen, the author of Out of Africa.

### From Unit 1-9 : Barbara Liskov, Turing Award

Barbara Liskov is a computer scientist and recipient of the Turing Award.

### From Unit 1-10 : Mir Taqi Mir, Khwaja Mir Dard, Urdu Poets

Mir Taqi Mir and Khwaja Mir Dard were 18th century Urdu poets.

### From Unit 1-12 : Peru's Independence from Spain

Peru declared independence from Spain on July 28, 1821.

### From Unit 1-14 : Murasaki Shikibu's The Tale of Genji, American Independence

Murasaki Shikibu wrote The Tale of Genji, often called the first historical or psychological novel.

Thirteen colonies declared independence from the British Empire in 1776 and became the United States of America.

### From Unit 1-16 : Brave New World

Aldous Huxley wrote Brave New World, a popular science fiction novel.

### From Unit 1-18 : Goethe, Faust

Johann von Goethe wrote Faust.

### From Unit 1-19 : Vaclav Havel

Vaclav Havel was a Czech politician and author.

### From Unit 1-21 : Greek Letter Epsilon, Rabindranath Tagore, Where the Mind is Without Fear

The lowercase Greek letter Epsilon (e) is often used to mean "the empty string" (when talking about languages) or "a very small number" (in some branches of mathematics).

Rabindranath Tagore is a Bengali poet and Nobel Prize laureate who wrote the oft-quoted Where the mind is without fear.

### From Unit 1-22: Pride and Prejudice, Jane Austen

Pride and Prejudice is an 1813 novel written by Jane Austen.

### From Unit 1-27 : Do-Re-Mi, The Virtue of Selfishness, MIDI

Do-Re-Mi is a song from the musical The Sound of Music.

The Virtue of Selfishness is a 1964 collection of philosophy essays.

MIDI, the Musical Instrument Digital Interface, is a standard for digital music.

### From Unit 1-28 : The Beatles' Hello, Goodbye, Turtles All The Way Down, Don Knuth's The Complexity of Songs, Ninety-Nine Bottles of Beer

Hello, Goodbye is a 1967 song by The Beatles.

Turtles all the way down refers to problems of infinite regression, such as when something is defined in terms of itself.

Don Knuth, a Turing-award winning computer scientist, wrote The Complexity of Songs.

Ninety-nine bottles of beer is an American children's song with repetitive lyrics.

### From Unit 1-29 : Back to the Future, Romeo & Juliet, Taj Mahal

In the movie Back to the Future, traveling at 88 miles per hour is related to traveling back in time.

"What's in a name? That which we call a rose / By any other name would smell as sweet." is spoken by Juliet in Shakespeare's play Romeo and Juliet (II,ii,1-2).

The Taj Mahal is a UNESCO World Heritage Site in India.

### From Unit 1-35 : History of the Telephone

The history of the telephone is actually fairly interesting.

### From Unit 1-37 : Newtonian Mechanics, Free Will & Determinism, Subjective Experience

Newtonian mechanics is a reasonable description for a billiard ball universe (at least, for objects that are neither too small nor too fast).

Free will and determinism are the subjects of much debate in philosophy.