
CS262 Lesson 2: Lexical Analysis

Breaking strings down into important words

Contents

Review

Introduction

Last lesson, we learned about regular expressions and finite state machines. In this lesson, we're going to put those concepts together to make a lexical analyzer--a program that reads in a web page or a bit of JavaScript and breaks it down into words, just like I might break an English sentence down into words. This is going to be a really important tool in our arsenal, and it's one of the first steps towards making a web browser.

Welcome back

Welcome back. In our last exciting episode we learned about regular expressions, a concise notation for writing down, denoting, or matching a number of strings.

This

[4-7]

corresponds to four different strings

4, 5, 6, 7

We learned to write more complicated regular expressions like this one

r"ba+"

this + means one or more copies of a, yielding words like ba, baa, baaa, and eventually yielding my sheep. sheep.png I assert that it's a sheep. You can tell because of the label. Those labels never lie.

We also learned how you can use regular expressions in Python by importing--bringing in--the functions and data types from the regular expression library:

import re

An example of such a function was findall:

re.findall(needle, haystack)

which, given a sort of needle regular expression, would return all of the places in the haystack that it matched.
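For instance, here is a quick sketch using the two patterns above (the input strings are my own examples):

```python
import re

# The character range [4-7] matches any single digit 4, 5, 6, or 7.
digits = re.findall(r"[4-7]", "24601")

# ba+ matches a b followed by one or more copies of a.
sheep = re.findall(r"ba+", "ba baa baaa")
```

`digits` comes back as `['4', '6']`, and `sheep` as `['ba', 'baa', 'baaa']`.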

We also learned that you could turn regular expressions into finite state machines.

machine.png

This finite state machine accepts the same language as our ba, baa, baaa regular expression from above. Starting in a start state, on a b we transition to the middle state, on an a we end up in the third state, which is an accepting state. You can tell by the double circle. Then there's a self loop back. That was last time.
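If you want to play with that machine, here is a minimal sketch in Python; the state numbers 1 (start), 2 (middle), and 3 (accepting) are labels I've chosen, not anything official:

```python
# Edges of the ba+ machine: (current state, input character) -> next state.
edges = {(1, "b"): 2, (2, "a"): 3, (3, "a"): 3}
accepting = [3]

def fsm_accepts(string, edges, accepting, start=1):
    """Simulate the machine; reject as soon as a transition is missing."""
    state = start
    for char in string:
        if (state, char) not in edges:
            return False
        state = edges[(state, char)]
    return state in accepting
```

`fsm_accepts("baa", edges, accepting)` is True, while `"b"` and `"ab"` are rejected, matching the ba, baa, baaa language.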

Specification

Now, we're going to learn how to specify important parts of HTML and JavaScript, and, in an incredible surprise move, we're going to do this specification using regular expressions. Just a quick reminder, an outline of the overall project:

speci.png

We want to start with a web page and then break it down into important words. Maybe < and > used for the tag are important words, the 1, +, 2, but we're largely ignoring white space or new line characters. Then we want to take those words and diagram them into this cool tree-like structure. Of course, this tree is growing upside down, but that won't be a problem at all. Finally, we're going to interpret that tree to figure out what it means to get the result.

HTML Fundamentals

Introduction

HTML stands for the "hypertext markup language." Many of you may have some previous experience with HTML, but that's not necessary. HTML as we know it was invented by Tim Berners-Lee, a British computer scientist working in Switzerland around 1990. For our purposes, HTML just tells a web browser how to display a webpage. In fact, HTML is not all that different from using symbols like stars or underscores to emphasize text that you're writing to someone else.

I <b> really </b> like you

In HTML this emphasis becomes I, and then the special punctuation <b> that means "let's do some bold now", really, the special punctuation </b> that means "I'm done with bold, let's go back to normal", and then like you. The b stands for "bold."

Let's go see how that pans out. Here in this particular window, I'm showing the raw HTML source on the left and how it might look in a web browser on the right. Here when I've added the bold tags around really, we see "I really like you." The "really" is rendered in bold.

speci2.png

Other common approaches in HTML for emphasis are the use of underlines, the "u," and italics, the "i." Each one of these is called a "tag." This special syntax with the "b" in angle brackets and the closing "b" in similar angle brackets--likewise the "u" and the closing "u," the "i" and the closing "i"--is a tag that's associated with that word, or that span of text, and tells the web browser how to render it.

speci3.png

A left angle bracket followed by the b and a right angle bracket--that's a starting tag. A left angle bracket followed by a slash--the slash is super important--marks the start of an ending tag and tells you that the current tag is about to stop. <b> is a beginning bold tag. </b> is an ending bold tag. You can see above that only the word "really" is bolded.

Quiz: Really

Let's check your knowledge of that with a multiple choice quiz. Here a number of times I've written the sentence "George Orwell was really Eric Blair." You might think I've written it, say, 1984 times, but really, just four. I hear if you repeat something like this enough, it becomes true. What I'd like you to do is mark in this multiple multiple choice quiz which of these will end up showing the word "really" in bold. They can show other words in bold, but I want to know whether they'll show "really" in bold.

  1.  George Orwell was <b>really</b> Eric Blair. 

  2.  George Orwell was </b>really<b> Eric Blair. 

  3.  George Orwell was <b>   really</b> Eric Blair. 

  4.  <b>George Orwell was really Eric Blair.</b> 

Solution

Tags

As we hinted before, <b> in HTML is called a tag. It's kind of like a price tag you might attach to a shirt or another item you're buying. This modifies nearby text and tells you how to interpret it, whether you can wash it or not in a machine, how much it costs, whether or not it should be bolded or underlined--that sort of thing. Another super common kind of tag is the anchor tag <a>, which is used to add hyperlinks to webpages. In some sense this is the defining characteristic of what it means to be a webpage. Here I've written a fragment of HTML that includes such an anchor tag.

Click here <a href="www.google.com"> now! </a>

It begins with <a, but unlike the relatively simple bold and underline tags, it has an argument. This means pretty much the same thing it did when we were talking about functions in Python or math. In sin(pi) the argument of my sine function is pi. Here the argument or modifier for my anchor tag is href=. This stands for hypertext reference--the target of this link. After href= we have a string that is a URL, a web address: www.google.com. This text in the middle is often rendered in blue with an underline, although it doesn't have to be. With </a> we're ending the anchor tag. Let's see how this plays out.

tag1.png

Here I've got the old "Eric Blair was really George Orwell" text, but I've added a new sentence--"Click here for a link to a webpage." Right after the anchor starts, the text is rendered in a slightly different color. If we were to click on it, you can potentially see down in the lower left that it goes to google.com.

Just to break this down, if this is a fragment of HTML,

Click here <a href="www.google.com"> now! </a>

then the words "Click here" and "now!" will all be drawn on the screen. This syntax <a marks the beginning of the anchor tag. This syntax </a> marks the end of the anchor tag. This part href="www.google.com" is the argument of the tag. It contains extra information for things that are more complicated than simple bold or underline.
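To foreshadow the lexing we're about to do, here's a rough regex sketch that separates tags from displayed words. This is a simplification, not a real HTML parser--it ignores corner cases like > inside quoted attribute values:

```python
import re

fragment = 'Click here <a href="www.google.com"> now! </a>'

# A tag is anything between a < and the next >.
tags = re.findall(r"<[^>]*>", fragment)

# The displayed words are whatever remains once the tags are removed.
words = re.sub(r"<[^>]*>", " ", fragment).split()
```

Here `tags` is `['<a href="www.google.com">', '</a>']` and `words` is `['Click', 'here', 'now!']`--exactly the split between markup and display text described above.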

Quiz: Interpreting html

Here I've written a significantly more complicated fragment of HTML, and I would like you, gentle student, to help me interpret it.

<a href="http://wikipedia.org"> Mary Wollstonecraft wrote <i> A Vindication of the Rights of Woman </i> </a>

In this multiple-multiple choice quiz, check each box that corresponds to a word that will be displayed on the screen by the web browser.

  1. href

  2. Mary Wollstonecraft

  3. wikipedia

  4. i

  5. Vindication

  6. wrote

Solution

Tokens

Taking Html Apart

Now that we understand how HTML works, we want to separate out these tags from the words that will be displayed on the screen. Breaking up words like this is actually a surprisingly common task in real life. For example, ancient Latin was often written or inscribed without spaces. This particular set of letters "SENATUSPOPULUSQUEROMANUS" is inscribed on the Arch of Titus. Roman inscriptions like this were written without spaces, and it requires a bit of domain knowledge to know how to break this up: "The Senate and the People of Rome." That inscription was made quite some time ago. Similarly, many written Asian languages don't explicitly include spaces or punctuation between the various characters or glyphs.

takapart.png

In this particular Japanese example, and both my handwriting and my stroke order are very, very poor--have pity--some amount of domain knowledge is required to break up "ano" from "yama"--"that mountain." Finally, even if you're not familiar with Asian languages or ancient Latin, you might have seen the same sort of thing in a much more modern guise, in text messaging. Some amount of domain knowledge is required to break I <3 u up into "I love you" even though no particular spaces are given.

Wollstonecraft </a> wrote

We will want to do the same thing for HTML to break it up into words like "Wollstonecraft" and "wrote" that will appear on the screen, or the special left angle bracket slash maneuver that tells us that we're starting an end tag, the special word in the middle that tells us which tag it was, and then the closing right angle bracket. Once again, for this HTML fragment we want to break it up into this first word, the start of the closing tag, another word, the end of the closing tag, and then another word.

We're going to need to do this to write our web browser. In order to interpret HTML and JavaScript, we're going to have to break sentences down into their component words to figure out what's going on. This process is called--dun, dun, dun, dun--lexical analysis. Lexical here has the same root as "lexicon," like a dictionary. This means "to break something down into words." You'll be pleased to know that we're going to use regular expressions to solve this problem. Here I've written another one of those decompositions.

word                  Wollstonecraft
start of closing tag  </
word                  a
end of closing tag    >
word                  wrote

Quiz: Taking Html Apart

We might have broken an HTML fragment down into these word-like objects, but this time you're going to help me out by doing the problem in reverse.

start of closing tag  </
word                  b
end of tag            >
word                  salvator
word                  dali

So in this multiple multiple choice quiz, I'd like you to mark each one of these HTML fragments that would decompose into this sequence of five elements.

  1. </b> salvator  dali

  2. < b>salvator  dali

  3. <b>salvatordali

  4. <b> salvator  dali </b>

  5. <b> salvator  dali

  6. <b> salvator  dali

Solution

Html Structure

Since HTML is structured, we're going to want to break it up into words and punctuation and word-like elements, and we use the special term token to mean all of those. In general, a token can refer to a word, a string, numbers, punctuation. It's the smallest unit of the output of a lexical analysis. Remember, that's what we're currently working on. Mostly tokens do not refer to white space, which is just a formal way of referring to the spaces between words.

We're going to be focusing on lexical analysis, a process whereby we break down a string, like a sentence or an utterance or a webpage, into a list of tokens.

HTML Structure.png

One string might contain many tokens in the same way that one sentence might contain many words. Here

LANGLE       <
LANGLESLASH  </
RANGLE       >
EQUAL        =
STRING       "google.com"
WORD         Welcome!

I've written 6 HTML tokens, giving them names on the left and examples on the right. Now, the naming of tokens is a bit arbitrary. In general, though, tokens are given uppercase names to help us tell them apart from other words or variables. Here LANGLE corresponds to an angle bracket facing left. LANGLESLASH is a < followed by a /, the division sign. RANGLE faces to the right. EQUAL is just =. STRING is going to have double quotes around it, and WORD is anything else.

Now, it turns out that the naming of tokens is not quite an arbitrary matter. You may think I'm as mad as a hatter. No, that's a different story. We're just going to go with these token names for now, but if you were designing a system from the ground up, you can rename them to be anything you like.

Specifying Tokens

We're going to use regular expressions, which are very good at specifying sets of strings, to specify tokens. Later on we'll want to match a bunch of different tokens from webpages or JavaScript, and this is how we write out token definitions in Python.

def t_RANGLE(token):
    r'>'
    return token

The t_ tells us--and it tells the Python system-- that we're declaring a token. The next letters are the name of the token. You either get to make this up yourself, or in the homework I'll tell you what I want it to be. Tokens are in some sense going to be functions of the text they match. More on this a bit later. Skip me for now.

Next, we have a regular expression corresponding to the token, which in this case, for the right angle token, there's really only one string it can correspond to, so we've written out the regular expression that corresponds to a single string. And then here. On the last line we're returning the text of the token unchanged.

We could transform it, and you'll see us do that for more complicated tokens like numbers where maybe we'll want to change the string '1.2' into the number 1.2.
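As a sketch of that kind of transformation--to_number is a hypothetical helper of my own, not part of any lexer library--converting matched text into a Python number might look like this:

```python
# Convert the matched text of a token into a numeric value, the way a
# token rule might rewrite token.value. (to_number is a hypothetical
# helper for illustration only.)
def to_number(text):
    return float(text) if "." in text else int(text)
```

So `to_number('1.2')` yields the float 1.2 and `to_number('42')` yields the integer 42, rather than leaving them as strings.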

The particular syntax we'll use to define tokens comes from PLY, the Python Lex-Yacc project.

Quiz: Specifying Tokens

Now it's your chance to define your first token. What I would like you to do is write code in the style of the procedure I just showed you before for the LANGLESLASH token. The LANGLESLASH token is surprisingly important. We really need it to know when all of our tags end. Use the interpreter to define a procedure t_LANGLESLASH that matches it.

# Specifying Tokens

# Write code for the LANGLESLASH token to match </ in our HTML.

Solution

Token Values

It's not enough to know that a string contains a token, just like it's not enough to know that a sentence contains a verb. We need to know which one it is, and formally we refer to that as the value of the token. By default, the value of a token is the value of the string it matched. We can rebuild it, however. We have the technology. Let's see how that plays out.

Quiz: Token Values

def t_NUMBER(token):
    r"[0-9]+"
    token.value = int(token.value)
    return token

Here I've written a definition for a slightly more complicated token, a number, one or more copies of the digit 0-9, and now I would like you, our last, best hope for victory, to help me understand it. If the input text is "1368", what will the value of the token be? Check all that apply.

  1. "1368"

  2. 1368

  3. "1"

  4.  1111111

Solution

Quoted Strings

When reasoning about HTML, it's critical that we understand quoted strings. They come up in almost every anchor tag, and anchor tags are the essence of hypertext.

<a href="www.google.com"> link! </a>

They're the interlinks between documents, so we really need those. They rely on quoted strings. That means we're going to need to understand quoted strings. It is convenient, then, that we are amazing at quoted strings. You have plenty of practice with them from the previous lesson. Once again, it's time for a quiz.

Quiz: Quoted Strings

Suppose that the strings we care about start with a double quote, end with a double quote, and in the middle they can contain any number of characters except double quotes. I'd like you to write a definition for the Python function t_STRING and submit it via the interpreter. And just to make this a bit clearer, we should definitely accept quoted strings like "cuneiform" or "sumerian writing". But we should not accept 30th century BCE because it doesn't start and end with quotes, and we should not accept "esc \" ape" which has an escaped double quote. We may want to get to that later, but for now, don't worry about escaped double quotes. Go to it. Submit via the interpreter.

# Quoted Strings

# Suppose a string starts with " and ends with " and contains any number of
# characters except ". Write a definition for t_STRING.

# Match Exactly:
#     "cuneiform"
#     "sumerian writing"
# Do Not Match Exactly:
#     "esc \" ape"
#     League of Nations Treaty Series

Solution

Whitespace

So we're using these token definitions in regular expressions to break down HTML and JavaScript into important words. As we've seen before, there can be lots of extra space between various tokens. We really want to skip or pass over spaces and possibly newline characters and tabs. More on that later.

def t_WHITESPACES(token):
    r" "
    pass

We do that using the same sort of token definition as before, so here I've made a regular expression that just matches a single space, but instead of returning the token, we pass over it. That's the power of pass. Let's test out our knowledge with a quiz.

Quiz: Whitespace

We've already seen how to do left and right angle bracket sorts of tokens. We've taken a look at strings before. And now let's do words, which are almost everything else on a web page. Let's say that we want a word to be any number of characters except a left angle bracket, a right angle bracket, or a space--and here I really mean the single character you get by pressing the space bar, not the 5-character word "space," but that's hard to write out. When you're writing your function to match word tokens, you should leave the value unchanged. Submit a definition for t_WORD using the interpreter.

# Whitespace

# Suppose a WORD is any number of characters EXCEPT < > or space.
# A WORD token should leave its value unchanged.

# Submit a definition for t_WORD.

Solution

Lexical-analyzer

You've now seen a bunch of these token definitions, one for words, one for strings. A lexical analyzer or lexer--this is a term of art-- is just a collection of such tokens. You tell me what makes a word, what makes a string, what makes a number, what makes white space. You put it all together, and the result is a lexer, something that splits a string into exactly the token definitions you've given it.

For example, once I put these 3 rules together, they become a lexer, or lexical analyzer.

def t_WHITESPACES(token):
    r" "
    pass

def t_WORD(token):
    r"[^ <>]+"
    return token

def t_NUMBER(token):
    r"[0-9]+"
    return token

Quiz: Lexical Analyzer

Suppose we passed to the above lexical analyzer the input string:

33 is less
than 55

Oh, gentle student, tell to me which one of these token output sequences could result. In this multiple choice quiz, indicate which of these 3 possible output lists could correspond to the values of the tokens extracted from this input string using these rules. Let's put it all together.

  • [33, 'is', 'less', 'than', 55]
  • [33, 'is', 'less', ' ', 'than',  55]
  • ['33', 'is', 'less', 'than', '55']

Solution

Quiz: Ambiguity

Now let's take a look at the same concept but do it backwards. Let's assume that we've fixed the problem from the last quiz. We're going to assume that NUMBER is preferred to WORD. Whenever we could match something as a number, we'll match it as a number instead of a word. What I'm going to do is write out a few phrases, a few sentences, and I want you to notice which of them could produce, could be decomposed into a word followed by a word followed by a number. Multiple, multiple choice. Check all that apply. Which of these 4 inputs could break down into WORD, WORD, NUMBER using the rules that we've been going over so far?

  1. grace hopper 1906

  2. grace hopper 1 9 0 6

  3. grace "hopper" 1906

  4. grace hopper onenineohsix

Solution

Rule Order

When two token definitions can match the same string, the behavior of our lexical analyzer may be ambiguous. In our implementation we favor the token definition listed first.

As we saw in that last quiz, it's not quite clear what to do when our token definitions overlap.

The 7-character sequence "hello" matches our regular expression for word but also matches our regular expression for string.

This is a problem not just with computer languages but also with natural languages.

As the hypothetical owner of this restaurant would notice, the sign "we don't just serve hamburgers, we serve people" could be interpreted the wrong way. Presumably those hamburgers are soylent green flavored. We want to have definitive rules for figuring out which of these interpretations we prefer.

In fact, we're going to use a very simple rule.

The first one you list wins, the one closer to the top of the file, so this is our big winner and is going to take priority over string.

If you're making a lexical analyzer for HTML or JavaScript, ordering your token definitions is of prime importance.
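As an aside, Python's own re module behaves analogously: in an alternation, alternatives are tried left to right, so listing a pattern first gives it priority. A small sketch--the group names NUMBER and WORD are just labels I chose to mirror our token names:

```python
import re

# The same input "33" matches both patterns; whichever alternative is
# listed first in the alternation wins, just like our rule ordering.
number_first = re.match(r"(?P<NUMBER>[0-9]+)|(?P<WORD>[^ <>\n]+)", "33")
word_first   = re.match(r"(?P<WORD>[^ <>\n]+)|(?P<NUMBER>[0-9]+)", "33")
```

`number_first.lastgroup` is "NUMBER" while `word_first.lastgroup` is "WORD"--same input, different token, determined purely by listing order.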

Rule Order 1.png

Quiz: Rule Order

Let's investigate this issue in the form of a quiz.

Suppose we have the input string hello, "world," and we really want that to yield word, the word hello, followed by a string.

I'm going to list 3 rules for you, and I want you to tell me which one has to come last for us to get the desired effect.

And here, because you've seen it all before, I'm eliding some of the details like the colon, token, blah, blah, blah.

Instead what I'd like you to do is tell me which one of these functions, which one of these rules, would have to come last, bearing in mind that the one that comes first wins all ties in order for hello, "world" to break down into a word followed by a string.

Rule Order 2.png

Solution

Lexical Analyzer

Strip Snipping

We turn "quoted strings" into quoted strings. In addition, we write our first lexer in Python.

So you may have noticed a bit of redundancy in our handling of "quoted strings".

We return the entire matched text, which includes these double quotes at the end.

But, in some sense, they're not as much part of the meaning as they are beginning and ending markers that tell us where the string starts and stops.

This is our default token value, but we might want to take a small pair of scissors to this string, and snip off the quotes at the beginning--and at the end.

Here we have an example of a token definition that does just that.

def t_STRING(t):
        r'"[^"]*"'
        t.value = t.value[1:-1] # drop "surrounding quotes"
        return t

After matching the right kind of string, we take the token value-- the entire thing-- and we're going to use substring selection, starting at character 1-- this is going to be character 1-- and going up to, but not including, character negative 1.

Strip Snipping.png

Now if you haven't seen this trick before in Python this might surprise you a bit, but you can count back from the end of the string, using negative numbers. So this is actually the negative first character. And remember that substring selection starts at 1 and goes up to, but not including, the negative 1. So this is going to get everything from the "q" over to the "s" in strings-- or in other words, have exactly the effect that we wanted. Cute little trick, huh?
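Here is the snipping trick on its own, outside of any token rule:

```python
quoted = '"quoted strings"'

# [1:-1] keeps everything from character 1 up to, but not including,
# the last character -- snipping off both surrounding double quotes.
snipped = quoted[1:-1]
```

`snipped` is the 14-character text quoted strings, with both quote marks gone.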

So now I'm going to show you how to make a lexical analyzer--which, recall-- is just a bunch of token definitions put together. I'm going to write it out in Python and we'll follow along.

import ply.lex as lex

tokens = (
        'LANGLE',       # <
        'LANGLESLASH',  # </
        'RANGLE',       # >
        'SLASHRANGLE',  # />
        'EQUAL',        # =
        'STRING',       # "144"
        'WORD',         # 'Welcome' in "Welcome to my webpage."
)

t_ignore                = ' \t\v\r' # shortcut for whitespace

def t_LANGLESLASH(t):
        r'</'
        return t

def t_LANGLE(t):
        r'<'
        return t

def t_SLASHRANGLE(t):
        r'/>'
        return t

def t_RANGLE(t):
        r'>'
        return t

def t_EQUAL(t):
        r'='
        return t

def t_STRING(t):
        r'"[^"]*"'
        t.value = t.value[1:-1] # drop "surrounding quotes"
        return t

def t_WORD(t):
        r'[^ <>\n]+'
        return t

webpage = "This is <b>my</b> webpage!"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
        tok = htmllexer.token()
        if not tok: break
        print(tok)

This top line--the import statement--is a lot like import re. It's telling Python where to find our lexical analyzer software or libraries that we're going to build upon.

Just like the regular expression library was called re to save space, a lexical analyzer is just called lex--to save space.

And now I'm going to give a list of all of the tokens that I care about.

Here, I'm just going to be concerned with the 7 that we've previously spoken about:

the Left Angle bracket; the Left Angle bracket, followed by a slash; the Right Angle bracket; a slash, followed by the Right Angle bracket--these make tags--an Equal sign, Strings that are surrounded by quotes, and every other word.

I'm also going to use a little shortcut. Before, we used a Whitespace token, but if you like, you can write the word t_ignore instead and, implicitly, we'll ignore everything matching this regular expression. Here's my first token definition rule.

It's for LANGLESLASH.

Here's the regular expression that it matches.

We return the text, unchanged.

Here's another rule for the Left Angle bracket, the regular expression that it matches--and we return the text, unchanged.

And you'll note that I have the LANGLESLASH rule ahead--before it--in the file.

And that's because I want this one to win, on ties.

If I see a Left Angle, followed by a slash, I want it to be the LANGLESLASH token--and not the Left Angle, followed by--say--a WORD token.

More on that in just a bit; I'll test that out and show it to you.

Here's our rule for the Right Angle bracket.

Here's our rule for the Equal sign token.

Note that while these are long-- they take up a bit of space--they're not actually particularly complicated.

This has mostly been listing 5 regular expressions.

Here's one now.

This one is a little bit more complicated--here's our rule for STRING tokens.

Here's our regular expression that matches it.

And there I am, dropping off--shaving off-- the surrounding double quotes, just as you've seen before.

Finally, there's our definition for the WORD token.

And now what we want to do is use these regular expressions, together--these token definitions-- to break up a Web page.

So here, I'll make a variable that holds the text of a hypothetical Web page.

"This is my webpage!"

Let's make it more exciting; Ho, ho--this is at least 10 percent more exciting!

This function call

lex.lex()

tells our lexical analysis library that we want to use all of the token definitions above to make a lexical analyzer, and break up strings.

This function call tells it which string to break up.

htmllexer.input(webpage)

I want to break up this Web page: "This is my webpage!"

Now, recall that the output of a lexical analyzer is a list of tokens.

I want to print out every element of that list.

This call, .token(), returns the next token that's available.

If there are no more tokens, then we're going to break out of this loop.

Otherwise, we print out the token.

Well, let's go see what sort of output we get.

The odds of me having written this, making no mistakes the first time, from scratch, are about zero. Let's go see what happens. Oh! I actually don't really believe it!

We can see the output here at the bottom:

LexToken(WORD,'T',1,0)
LexToken(WORD,'h',1,1)
LexToken(WORD,'i',1,2)
LexToken(WORD,'s',1,3)

but it's not quite the output I was expecting.

Oh, here's the mistake that I made-- right now, I only have one character in t_WORD and if you look down here, instead of seeing the word, "This"--for "This is my webpage!"--

def t_WORD(token):
    r'[^ <>\n]'
    return token

I have each letter spelled out separately. Let me fix that.

def t_WORD(token):
    r'[^ <>\n]+'
    return token

And now we get more of the output that we were expecting.

Our first token is ' This ';

our next token is a word, ' is '.

Then we saw the Left Angle bracket,

a word, ' b '--for bold,

the Right Angle bracket; a word, ' my ';

the LANGLESLASH,

and then the word, ' webpage '.

Lexer Fun Time

So this is a demonstration of a lexical analyzer.

And what I'm going to show you is how to test things out or change things around.

For example, let's go up here and turn "This" into a quoted string.

webpage = '"This" is <b>my</b> webpage!'

So previously, our first token was the WORD, ' This ', but when I rerun it, I'm really expecting to see the STRING, "This"-- because I've put This in quotes.

And now it is the STRING, ' This '.

These other numbers, after lexical analyzer token STRING, with token value, ' This ' are the line and character number information.

The token corresponding to ' This ' starts on line 1, at character zero. The token corresponding to ' is ' starts on line 1, character 7-- zero, 1, 2, 3, 4, 5, 6, 7--yep, that's where the "i" starts.
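We can sanity-check that column arithmetic with plain string indexing--a quick sketch, nothing lexer-specific:

```python
webpage = '"This" is <b>my</b> webpage!'

# Character 0 is the opening double quote.
first_char = webpage[0]

# Counting 0, 1, 2, 3, 4, 5, 6, 7: character 7 is where "is" begins.
word_is = webpage[7:9]
```

`first_char` is the double quote and `word_is` is "is", matching the line 1, character 7 position reported by the lexer.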

So let's say we made a mistake, we got one of our regular expressions wrong. Let's say that I mistakenly put in a Greater Than sign here, for LANGLESLASH.

Well, then right after "my", I would expect to see the output change.

I'm going to expect to see us get the wrong answer. Currently, right after ' my ', we see LANGLESLASH. Let's rerun it.

Oh! Now we don't see quite the right thing at all. Right after the ' b ', we see a LANGLESLASH, but it's associated with the wrong text. After ' my ', we just see a LANGLE. The slash has made its way into this WORD, ' /b '.

By looking at this output, we can notice mistakes in our lexical analyzer and help fix them.

I repair the problem, rerun, and the world looks a bit better.

Here, I've used Python's triple quoting approach.

webpage = """Tricky "string" <i>output</i>!"""

We expect the word Tricky; the quoted STRING, "string"; LANGLE; a word, i; RANGLE; a word, output; LANGLESLASH; a word, i; RANGLE; and a bang at the end--and then we're done.

Let's go see how this comes out. We've got the word Tricky; the STRING, "string"--and if you've noticed, we've shaved off the 2 double quotes, which is just what we wanted--a LANGLE, the i, the RANGLE, the word ' output ', LANGLESLASH, bang at the end.

Tracking Line Numbers

Lexers often track line number information (and sometimes column number or character count information as well).

You saw in the simple lexer that we wrote that tokens come, not just with a type of token in their value, but also their line number and column number.

That sort of information is very handy to keep track of for users.

At some point, you may have written an incorrect Python program--nah. I've only written about a million of them. And it's really nice when the interpreter tells you which line the mistake is on.

Unfortunately, the lexer won't do this for us--entirely automatically.

It will keep track of columns, but it won't keep track of lines-- unless we do some work.

Here's an example of a rule that matches new lines.

def t_newline(t):
        r'\n'
        t.lexer.lineno += 1

You may not have seen this before, but we've already seen how the backslash can be used to Escape special characters, like the quote.

And this basically just means: quote.

However, when paired with something else, the backslash often has a special meaning.

\n is the string equivalent of pressing the Return or Enter key on your keyboard, and we call this character the "newline." It advances your editor by 1 line.

So this rule--t_newline-- matches a new line that appears.

And when we see it, we take the lexer and increment its line count by 1, and then we pass over the new line, as if it were whitespace.
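The idea behind that bookkeeping can be sketched without any lexer machinery at all--line_of is a hypothetical helper of mine, not part of the lexer library:

```python
# A token's line number is one plus the count of newline characters
# that appear in the text before it.
def line_of(text, index):
    return 1 + text[:index].count("\n")
```

For the text "Line1\n Line2", position 0 is on line 1, while position 7 (the L of Line2) is on line 2.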

Here you can see that I've added the newline rule, just as I described on the previous slide, to our old lexer from before. Let's go all the way to the bottom, and now I'm going to make my Web page much more complicated.

webpage = """Line1
 Line2
"""

And here, I'm using triple-quoted strings, showing line1 on the first line and line2 on the second.

Unfortunately, the output shows Line1, followed by the new line. I'm going to have to go up here and remove newline from the list of things that it's possible to have in a Word--just like we had to remove space.

def t_WORD(token):
    r'[^ <>\n]+'
    return token

And now we see that Line1 starts on line 1 and Line2 starts on line 2, at character 7 from the beginning of the file: characters 0 through 4 spell Line1, character 5 is the newline, and character 6 is the leading space.
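Note that the character count here is an absolute offset from the start of the input (PLY calls it lexpos), not a true column within the line. If you want a real column number, a recipe similar to the one in the PLY documentation works: search backwards for the last newline. (`find_column` is our own helper name, not part of PLY.)

```python
def find_column(text, lexpos):
    # Column = distance from the character after the last newline (1-based).
    line_start = text.rfind('\n', 0, lexpos) + 1
    return lexpos - line_start + 1
```

For the webpage above, Line2 sits at absolute position 7, but `find_column("Line1\n Line2\n", 7)` reports column 2, because one space precedes it on its own line.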

Quiz: Crafting-input

Define a variable called webpage that holds a string that causes our lexical analyzer to produce exactly these 6 bits of output

All right. So now that we understand all of that, it's: a QUIZ!

Written a little differently than normal, but a quiz, nonetheless.

I would like you to define a variable called (webpage) that holds a string that causes our lexical analyzer to produce exactly these 6 bits of output:

Crafting Input.png

but pay careful attention to the line number offsets. Notice, for example, that this Left Angle starts on line 2, character 11.

Try it out. Define a variable, (webpage), that holds this string.

Solution

Comments

Commented Html

Just as we have to separate words in HTML and JavaScript based on white space, we also have to take comments into account. Comments in HTML serve the same purposes they do in Python: documentation or removing functionality. You can add a comment containing English text to explain what a program should be doing; you can do the same with a webpage or with a JavaScript program. Or you could comment out a function or a line to see how things behave without it. In HTML, comments start with an interesting 4-character sequence and end with a 3-character sequence: left angle, bang, hyphen, hyphen begins a comment; hyphen, hyphen, right angle ends a comment.

<!-- Comments -->

They look a bit like tags.

I think therefore <!-- This comment is not printed -->I am

Here I have an HTML fragment: "I think therefore," then a comment, then "I am." Je pense, donc je suis. We're going to see how to implement this in our lexical analyzer, but recall that our lexical analyzer was just based on regular expressions.

Commented HTML.png

I could recognize these comments with another finite state machine,

Commented HTML 2.png

and all I have to do is just merge these 2 finite state machines together conceptually. If I could have one set of rules describing comments and another set of rules describing all of my other tokens, I'll just put them together into one big machine.

Commented HTML 3.png

It might have too many states for us to be comfortable with, but it is entirely fine for a computer.

When we're processing a comment, normal rules don't apply. Let's consider a super tricky comment example.

Welcome to <b>my <!-- careful </b> -->webpage.</b>

Here we have "Welcome to <b>my," a comment that appears to close the bold tag, "webpage," and then a real closing bold tag. My question for you is how will this render? Which of these words will be bolded? Well, it turns out that when something is in a comment, we ignore it entirely. It's as if it weren't there, so even though this looks like it's closing the bold tag, it does not, and the words "my" and "webpage" will both be bolded.

Welcome to my webpage

In fact, it's almost as if everything in the comments were entirely erased and had no impact on the final rendering of the webpage at all. And now without the distracting text, it's relatively clear that "my" and "webpage" should both be bolded.
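One way to convince yourself of this erasing behavior is to strip the comments out with a regular expression before looking at the tags. This is just a quick sketch of the idea, not how our lexer actually works:

```python
import re

html = 'Welcome to <b>my <!-- careful </b> -->webpage.</b>'
# Non-greedy .*? stops at the first -->; DOTALL lets a comment span lines.
stripped = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
print(stripped)  # Welcome to <b>my webpage.</b>
```

With the comment erased, it's clear the `</b>` inside it never closed anything.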

Quiz: Commented Html

Here I've written another HTML fragment that includes a comment,

Hello <!-- "world" </b> -->confusing</b>

and the quiz for you is--in multiple, multiple choice fashion-- to tell me which of the following HTML tokens--and I'll draw them now-- could be found by our lexer. Check all that apply. Based on this string, assuming that we've added the right rules for comments to our lexer, which of the following would be found?

Commented HTML 4.png

Solution

Html Comments

So now I'm going to show you how to add HTML style comments to our lexer so that it gets the behavior that we want. Just as we had to list all of the tokens, we're going to have to list all of these possibilities for things we could be doing. Either we're breaking the input down into tokens, or we're handling comments.

These 2 possible choices are called lexer states, a word I really don't like because it's super ambiguous. We have states in finite state machines. Let's not have states elsewhere.

import ply.lex as lex

tokens = (
        'LANGLE',       # <
        'LANGLESLASH',  # </
        'RANGLE',       # >
        'SLASHRANGLE',  # />
        'EQUAL',        # =
        'STRING',       # "144"
        'WORD',         # 'Welcome' in "Welcome to my webpage."
)

t_ignore                = ' \t\v\r' # shortcut for whitespace

states = (
        ('htmlcomment', 'exclusive'),   # <!--
)

def t_htmlcomment(t):
        r'<!--'
        t.lexer.begin('htmlcomment')

def t_htmlcomment_end(t):
        r'-->'
        t.lexer.lineno += t.value.count('\n')
        t.lexer.begin('INITIAL')
        pass

def t_htmlcomment_error(t):
        t.lexer.skip(1)

def t_LANGLESLASH(t):
        r'</'
        return t

def t_LANGLE(t):
        r'<'
        return t

def t_SLASHRANGLE(t):
        r'/>'
        return t

def t_RANGLE(t):
        r'>'
        return t

def t_EQUAL(t):
        r'='
        return t

def t_STRING(t):
        r'"[^"]*"'
        t.value = t.value[1:-1] # drop "surrounding quotes"
        return t

def t_WORD(t):
        r'[^ <>]+'
        return t

webpage = "hello <!-- comment -->all"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
        tok = htmllexer.token()
        if not tok: break
        print(tok)

This particular syntax for specifying them is arbitrary. It's just a function of the library we're using, so I'm going to declare a new state called "htmlcomment," and that's exclusive. If I'm in the middle of processing an HTML comment, I can't be doing anything else. I'm not going to be finding strings or words.

Here's my rule for entering this special mode for HTML comments. If I see the beginning marker, <!--, then I want to enter this special HTML comment state and not find words or strings or numbers or tags but instead ignore everything. When I reach the end marker for an HTML comment, I want to jump back to the initial mode, whatever I was doing before, that time when I could find l angle slash and l angle and r angle. By default, that mode is called "INITIAL," all in capital letters.

I'm actually going to do one other thing at the same time. When we're in our special HTML comment mode, we won't use any of our other rules, even our super happy newline rule that's counting the line number, so what I'm going to do is call the string count function to count the number of newline characters that occur in the entire comment and add those to the current line number.

This is a little tricky, so don't worry if this doesn't make sense the first time. Now, we've said what to do when an HTML comment begins, and we've said what to do when it ends, but any other character we see in this special HTML comment mode isn't going to match one of those two rules. Anything that doesn't match one of our rules counts as an error, and what I'm going to say is just skip over that single character in the input. It's part of the comment. I don't even want to see it. This is a lot like writing "pass" except that it gathers up all of the text into one big value so that I can count the newlines later. Again, this is a pretty tricky maneuver.
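To see why the two modes work, here is a plain-Python sketch of the same idea: two states, with newlines counted even while comment text is skipped. This simulates the lexer states conceptually; it is not PLY itself.

```python
def skip_comments(text):
    # Two "states": normal text vs. inside an <!-- ... --> comment.
    kept, lineno, i, in_comment = [], 1, 0, False
    while i < len(text):
        if not in_comment and text.startswith('<!--', i):
            in_comment, i = True, i + 4      # enter the comment state
        elif in_comment and text.startswith('-->', i):
            in_comment, i = False, i + 3     # back to the initial state
        else:
            if text[i] == '\n':
                lineno += 1                  # count lines even inside comments
            if not in_comment:
                kept.append(text[i])         # comment characters are erased
            i += 1
    return ''.join(kept), lineno
```

For example, `skip_comments("hello <!-- c\n -->all")` returns `("hello all", 2)`: the comment disappears, but the newline inside it still bumps the line count.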

Now I've changed my webpage so that it says "hello," then there's an HTML comment with the word "comment" in it, and then there's the word "all," and we'll get a chance to see how this plays out. Our webpage produces exactly the 2 tokens we expect: hello, starting at position 0, and all. Excellent.

Token Counting

So given that we've just seen how to code up HTML comments, let's test out that knowledge. Suppose we have our definition of HTML comments plus a rule for WORD tokens: one or more of anything that's not a space, a left angle, or a right angle, and we return that token. Now I give you the following HTML input fragment: ob<!-- fuscation -->tuse tangle. For this blue string, with our lexer, we've got word tokens and we've got HTML comments. How many word tokens are we going to find?

Token Counting.png

Solution

JavaScript

Five Factorial

All right, so now that we've mastered lexing or breaking up into words HTML, let's turn our attention to JavaScript, the other language we'll be considering in this class. And I'm going to introduce you to JavaScript by jumping right into an example by way of a parable.

Five Factorial.png

Over here on the left we see some webpage code, apparently a webpage owned by Steven: Welcome to Steven's webpage. And Steven would really like to compute five factorial, 5 times 4 times 3 times 2 times 1. On the left we see the HTML source code, and on the right we see the result as it would render. This p tag I haven't introduced you to yet, but it means begin a new paragraph. We'll write out the words "Welcome to." We'll show "Steven" in bold. We've got this five factorial, written with a bang--the exclamation mark often means factorial in mathematics--printed in italics; you can see it slanted. But unfortunately, Steven is super sad. He can't remember the value of five factorial. Well, this is exactly the sort of thing that a programming language like JavaScript could help us out with. It can carry out computations just like Python, so we can do work in the middle of a webpage.

Let's write our first JavaScript script together. Here, this line starting with script type= "text/javascript" and then this document write line, and then ending here with this closing script tag, all of this, these 3 lines together are a JavaScript program embedded inside an HTML webpage. JavaScript programs always begin with this special script tag, and this tag has an argument because there might be multiple types of scripts out there in the universe. We've seen tag arguments before with the anchor tag. Here I have an anchor tag where the argument is a hypertext reference.

Five Factorial 2.png

Here I have a script where we're telling the web browser you should treat this as a JavaScript program. This JavaScript program is very simple. It's the equivalent of print "Hello World" in Python. JavaScript's name for the print function is document.write, which we'll sometimes just abbreviate as write. But the semantics, the meaning is largely the same. It's also worth noting that we've put parentheses around the argument to document.write almost as if it were a mathematical function. We can do that in Python. It's allowed, but often we don't. And we've ended the line with a semi-colon, whereas at the end of a Python line we often don't have a semi-colon, but again, you can put semi-colons at the end of Python lines. We just typically don't. Now we're going to try to use the full phenomenal cosmic power of JavaScript to compute five factorial. To do so, I'm going to make a recursive function called--surprise, surprise-- factorial that's going to compute the value. Let's walk through every part of this JavaScript code together.

Five Factorial 3.png

The word function means I'm declaring a function. This is very similar to def in Python. Then I give the name of the function, and then I write the arguments just like I would in Python. In Python I'd have a colon here, but JavaScript requires slightly different punctuation, this opening curly brace, and in this regard it's more like languages such as C or C++ or Java or C#, curly brace languages. Our factorial function is going to be recursive, and every recursive function needs a base case, a time when it stops. Our stopping condition is when n is 0. We could have picked n less than or equal to 0, or n less than or equal to 1, so I have an if statement that's checking that. Again, in Python this would probably look very, very similar except that we'd use a colon instead of an opening curly brace. If n is 0, we return 1, and I have a semi-colon at the end of all my statements. Then in Python, I would know that I'm done with the if statement because of the tabbing.

JavaScript doesn't use that sort of readable tabbing rule to figure out the control flow or when an if statement ends, so instead you have to explicitly close off this opening curly brace, just like you'd have to close off a tag in HTML or close off parentheses once you start them. We're going to study this a lot more as time goes by. I close off the then branch of my if. I have a semi-colon, and now I have a new return statement, and this is basically just the formula for factorial: it's n times the factorial of n - 1. This part here is a function call, in fact, a recursive function call, just like you'd expect to see in Python. I'm ending the whole thing with a semi-colon. This is the end of my function definition, and at the end of the day I print out--and the JavaScript version of print is document.write--factorial of 5, and over here on the right you can actually see it.

We've got 120, which is the correct value: we can compute five factorial using JavaScript embedded in the middle of a webpage.
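As the running comparison with Python keeps suggesting, the same recursive function in Python differs only in punctuation:

```python
def factorial(n):
    if n == 0:                       # base case: stop the recursion
        return 1
    return n * factorial(n - 1)      # 5 * 4 * 3 * 2 * 1

print(factorial(5))  # 120
```

Colons and indentation in Python; curly braces and semi-colons in JavaScript, but the same shape of computation.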

Honor-and-honour

Many of the differences between JavaScript and Python are similar to the differences between American English and British English.

Sometimes the spelling changes a bit, or the pronunciation changes a bit, but in general, they are mutually intelligible. It should look very similar, but with 1 or 2 extra letters.

JavaScript is big on curly braces. Python is big on tabbing. JavaScript is big on semi-colons. Python not so much. JavaScript really loves parentheses. Python could take them or leave them. Python could have ;, and could have (), but doesn't have to. They're somewhat optional.

Quiz: Honor-and-honour

Here I've written a new JavaScript function called "ironcrutchli."

Honor and Honour.png

This one is not recursive, but it takes a formal argument x and then does some reasoning based on x and returns values as a result. I want to make sure that you're following along with JavaScript, so I'm going to have you tell me what this function evaluates to based on a few different inputs. In this fill-in-the-blank quiz, for each one of these inputs, imagine we're calling ironcrutchli on 0, so this 0 is bound to the value of x. What would the output be, similarly for 1, 2, and 9?

Solution

Identifier

In the previous example, I had highlighted ironcrutchli and x in red. They were identifiers, which is a formal way of saying variable name or function name.

Identifiers are textual string descriptions that refer to program elements. Things like factorial, s and tmp are all identifiers. We often call them identifiers because they identify a particular value or storage location.

We're going to allow our identifiers to start with lowercase letters or also uppercase letters, and they can finish out with any number of upper or lowercase letters.

Super
WonderWoman
my_count

In WonderWoman we have another uppercase letter later in the name, and sometimes people like to use underscores to add readability. That's totally allowed in an identifier, but we'll say not at the beginning: you have to start with a letter, but after that you can have underscores. Here I've shown 6 examples of identifiers, and I would like you to use the interpreter to write an identifier token rule for these sorts of JavaScript identifiers. Try it out using the interpreter.

Identifier.png

Solution

Numbers

JavaScript also has support for numbers, and just like in Python they can be whole numbers.

They can have decimal fractional parts, or they can be negative.

12 5.6 -8.1 -1. 3.1415

I like all of those, but all of these examples here in blue at the bottom,

1.2.3.4 five jenny

1.2.3.4 is not a valid number. five is a number, but it's not written using digits. And jenny is not a number at all--although I may have her number. That would totally work. Now that we have an intuition for what numbers look like in JavaScript, there's a quiz.

Quiz: Numbers

I'd like you to define t_NUMBER so that it matches the examples in red but does not match the examples in blue. Normally the value of a token is a string; I want you to convert it to a float before returning.

Solution

The End Of The Line

Both Python and JavaScript feature comments that go until the end of the current line.

In Python, we can write a comment that extends to the end of the line by prefacing it with the # sign or the pound sign or the hash sign.

#Python Comment

JavaScript has this same idea, but instead the comments start with // and then continue to the end of the line.

//Javascript Comment

This time I'll write the rule for you. Here's a rule corresponding to comments that go to the end of the line in JavaScript.

def t_eolcomment(token):
        r'//[^\n]*'
        pass

They start with a slash and then another slash, and then you can have anything that's not our special newline character, as many of those as you want, and the maximal munch rule means we're going to eat up all of those characters that we possibly can. And then rather than returning that as a token, we throw it away because it's a comment, documentation for the user.
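You can watch maximal munch at work on this rule directly with Python's re module:

```python
import re

line = 'x = x + 1; // increment the counter'
# [^\n]* munches every character up to (but not including) the newline.
comment = re.search(r'//[^\n]*', line).group()
print(comment)  # // increment the counter
```

The match doesn't stop after `//` or after the first word; it eats everything to the end of the line, which is exactly the behavior we want for an end-of-line comment.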

Quiz: The End Of Line

Let's test our knowledge of this rule with a quiz. I've written out 4 sequences of input, and I'd like you to indicate which of these 4 sequences would yield identifier, identifier, number given the rules we've been talking about for JavaScript. You've seen how we define identifiers, how we define numbers, and what starts an end of line comment. Here are 4 different lines. Which ones give us IDENTIFIER, IDENTIFIER, NUMBER?

The End of the Line.png

Solution

It Could Be Worse

Alright, so you've just finished learning how to specify tokens for languages like HTML and JavaScript.

At this point, you might be thinking, boy, so those HTML and JavaScript designers did not make pretty tokens.

I just want to mention an experience from my own life. It could be much worse.

I once had the opportunity to work on software that did, let's say, collaborative edits. Say that a bunch of people wanted to make changes to a book like Pride and Prejudice by Jane Austen.

Suppose that you really want to make it about pandas instead of people, and I really want to make Mr. Darcy nicer because he's a bit snappish in the book as it stands.

In theory, we should be able to apply both of these edits together as long as they don't conflict and then someone could download that new version of the book and read it with happy pandas all around.

Because I didn't know much about language design at the time, I made all of these editing commands very long and very verbose, including one keyword token that now lives in infamy for me: BUT_ONLY_IF_IT_CHANGES--all one word, separated by underscores, all capital letters. It meant: if nothing would change in this scene--for example, because it doesn't have Mr. Darcy in it, so there's no way to make him happier--then don't do anything, so that people don't think they have to redownload the book or redownload that chapter. Because I didn't have any experience with language design, I made this super long token name, and now I'm stuck with it forever.

You, too, can find it on Google if you search.

So although it may seem now like HTML and JavaScript are totally arbitrary, in fact, some effort has gone into making them concise and relatively easy.

I just wanted to mention that it could be worse. I know from personal experience.

Wrap Up

Wrap Up.png

Syntax

In this lesson, we learned how to specify tokens using regular expressions.

We could break down HTML and JavaScript into words, the smallest unit of meaning.

In the next part of this class, we're going to learn how to combine those words--those tokens--into sentences, into utterances that are starting to make sense.

We're going to check them using rules of syntax, just like the rules we have in natural languages.

It's an exciting time, and it's a necessary step on our way to making an interpreter for HTML and JavaScript: a web browser. I hope you'll join me for our next lesson.

Quiz Solutions

Really

Let's go through these together. [1] is well-formed HTML: we begin the bold tag, and it ends after the word "really." This looks great. Unfortunately, in [2] we end the bold before we start it, and then we start it over here with Eric Blair. That's not going to work out well. I will show you in just a minute what that looks like. In [3] we begin the bold tag and then we have lots of space and then the word "really." It turns out that is totally fine. Web browsers use the same sort of techniques we talked about in the last unit to break up sentences like these into words based on white space. All this extra space doesn't matter. Finally, in [4] we start bold at the beginning of the sentence, so all of these words--"George Orwell was really Eric Blair"--are bolded. Notably "really" was bolded as well, so this works out. Let's go see how this plays out.

real1.png

Here we have the first option--"George Orwell was really Eric Blair"-- and really is definitely bolded. If I reverse these, it's harder to interpret.

real2.png

This bold tag closes nothing. It's ill-balanced. This makes me super unhappy. But this next one applies to "Eric Blair," and then falls off into the end of the universe. This isn't very good.

real3.png

I can put huge numbers of spaces here, and as we see, this does not influence the rendered web page at all.

real4.png

Interpreting html

Let's go through it together. [1] is actually one of the arguments to this anchor tag. It's not displayed. If the user clicks on this link, they'll go to Wikipedia.org, but they'll never see the href. This is not shown. [2] on the other hand, is not the name of a tag or the argument to a tag. It will be shown. Similarly, [5] will be shown. It'll be shown as part of a link and italicized, but it'll be there. [3] is part of the argument to this anchor tag, so it will not be shown. [6], however, will be shown, and then this [4], the italic tag, isn't shown. Instead, the next is actually slanted. Here we're taking a look at how it would render in a web browser.

interp.png

Taking Html Apart

Let's go through it together. [1] starts with a left angle bracket, which looks super promising, but then it has this slash, which we don't see reflected up here, so this one doesn't match; it could not have produced the sequence of the five elements. In [2] we have a left angle bracket, a b, a right angle bracket. Looking great. Salvador, looking good. Dali, looking great. Oh, yeah, this totally matches. In [3] we have almost the same sentence, but there's no space between salvador and dali. This is very close, but instead of getting 2 separate words at the end, it would break down into just one word at the end, salvadordali, so this one doesn't match. In [4] we start with a bold tag, have salvador and then dali, but then we have a few more characters that aren't shown in the five elements list above, so this doesn't match exactly. In [5] salvador is followed by the bold tag. That's getting the order wrong, and the order of this breakdown is really going to matter. We really need to know the order of words in a sentence. Super important, it is. Finally, in [6] over here we start with bold, and we have salvador dali again. This looks great. No problems there.

Notice the spacing in [2] and [6] were a little different. In [6] we had a space between the bold tag and salvador. In [2] we had kind of a space before b. These spaces don't matter very much.

Specifying Tokens

# Specifying Tokens

# Write code for the LANGLESLASH token to match </ in our HTML.

def t_LANGLESLASH(token):
    r"</"
    return token

Token-values

All right, let's go through the possible answers together. We're definitely going to match 1368 because it's in the language of this regular expression: it's 4 copies of 0-9 together. By default, the value of the token when it comes into us is just the string "1368". But we're going to convert it to an integer using this cast, or conversion, here, so at the end of the day, it's going to be [2]. It's not going to be [1] because we have this special conversion. It's not going to be [3] because the maximal munch rule means that we're not going to match just one digit; we match all 4 digits. Similarly, it's not going to be [4] because we match 1368. It's also not going to be the empty string. We definitely match 1368, and [2] is a good number to match.
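The rule under discussion presumably looked something like this sketch (reconstructed from the description, so the exact regular expression is an assumption):

```python
def t_NUMBER(token):
    r'[0-9]+'                       # maximal munch: matches all of 1368
    token.value = int(token.value)  # convert the string "1368" to the int 1368
    return token
```

The conversion is why the token's value is the integer 1368 rather than the string "1368".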

Quoted Strings

Let's go through the answer together. We definitely want to name our procedure t_STRING, and now the real trick to this is just writing a good regular expression. Here I've written a regular expression that starts with a double quote, ends with a double quote, and can have anything that's not a double quote--remember that super useful caret character--zero or more times.

# Quoted Strings

# Suppose a string starts with " and ends with " and contains any number of
# characters except ". Write a definition for t_STRING.

# Match Exactly:
#     "cuneiform"
#     "sumerian writing"
# Do Not Match Exactly:
#     "esc \" ape"
#     League of Nations Treaty Series

def t_STRING(token):
    r'"[^"]*"'
    return token
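We can check this regular expression against the quiz's examples. The exact-match helper below is our own, written just for the check:

```python
import re

STRING = r'"[^"]*"'

def matches_exactly(s):
    m = re.match(STRING, s)
    return m is not None and m.end() == len(s)   # must consume the whole input

assert matches_exactly('"cuneiform"')
assert matches_exactly('"sumerian writing"')
assert not matches_exactly('"esc \\" ape"')               # stops at the inner quote
assert not matches_exactly('League of Nations Treaty Series')
```

Note that on the escaped-quote example the regular expression still matches a prefix; it just doesn't match the whole string, which is why it fails the "match exactly" requirement.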

Whitespace

So let's go through a possible answer together.

# Whitespace

# Suppose a WORD is any number of characters EXCEPT < > or space.
# A WORD token should leave its value unchanged.

# Submit a definition for t_WORD.

def t_WORD(token):
    r"[^ <>]+"
    return token

Lexical Analyzer

Let's take a look at the answers. The 33 of the input is definitely going to match, we hope--we're going to get to this in a minute--our numbering rule. And then we're going to end up converting it into an integer. In [1] 33 is pretty good. is will match the word is. less will match the word less. than will match the word than. And all the white space in the middle will be dropped. This is a possible answer. In [2] we have pretty much the same thing except that we appear to be returning a space as a word. That's not going to happen because our words can't include spaces, and the white space rule would skip over it beforehand. That's not it. [3] looks a lot like the beginning, but instead of matching 33 and 55 as numbers, we appear to have matched them as words because we've returned these strings unchanged. This isn't really what we wanted, but actually, nothing prevents it from happening: our word definition matches digits just as well as our number definition, and I haven't told you how to resolve this sort of ambiguity. Oh, that word is back from the dead. [3] is also a possible answer.

Lexical Analyzer.png

Ambiguity

Remember that our definition for word was anything that's not a space, a left angle bracket or a right angle bracket, and the double quote certainly fits in there, so it may not be what we want right now, but if you are sharp and notice that this qualifies as a word under our rules, good eye.

And then 1906 is a number.

Here grace is a word, hopper is a word, and "1906" is a word, so this is 3 words rather than 2 words and a number. And Grace Hopper is of particular significance to programming language and compiler people because she's the origin of the word "bug" meaning a mistake in a program.

Ambiguity.png

Rule Order

Rule Order.png

Well, let's take a look.

Suppose our word rule comes first, just like it does here on the page in some sense.

Hello is definitely a word, but actually, world in quotes is also a word because this quotation mark is not a space, a left angle or a right angle.

If this comes first, we get word, word, which would make us very unhappy. On the other hand, if it comes after string, then we'll end up getting the behavior that we're expecting.

However, probably this whitespace rule actually has to come first lest we confuse whitespace and the body of strings.

Our word rule precludes us from including spaces as parts of words, but our string rule really does not, so we probably want our whitespace rule to come if not first, somewhere near the top.

That means that our word rule is the one that comes last.

If our final ordering is either whitespace, string, word or string, whitespace, word, we'll get the output we expect.

Let's walk through the first one and see how it goes.

We'll start with the h in hello. Does that match whitespace? No, it does not. We'll try our next rule.

Does that match string? No, it doesn't start with a double quote. Could it match word? Oh, it totally could.

And now we'll use our maximal munch power, and it will eat up the e, the l, the l, the o, the comma--but it's going to stop at the space, because our word rule can't include whitespace.

We'll see the word hello, then we'll see the space.

Does that match the whitespace rule? It does, so we'll pass over it.

Then we'll see these double quotes. Does that match the whitespace rule? No. Does it match the string rule? Yes, it does.

Maximal munch will take us all the way to the end of this token, and we get the output we expect, and the second case behaves similarly.
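The effect of rule order can be simulated with a tiny hand-written tokenizer. This is a sketch of the first-rule-wins idea only, not how PLY actually resolves priorities:

```python
import re

# Rules are tried in order: whitespace first, then string, then word.
rules = [('WHITESPACE', r' +'),
         ('STRING',     r'"[^"]*"'),
         ('WORD',       r'[^ <>]+')]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        for name, pattern in rules:          # first matching rule wins
            m = re.match(pattern, text[i:])
            if m:
                if name != 'WHITESPACE':     # whitespace is skipped
                    tokens.append((name, m.group()))
                i += m.end()
                break
        else:
            raise ValueError('no rule matches at position %d' % i)
    return tokens

print(tokenize('hello "world"'))
# [('WORD', 'hello'), ('STRING', '"world"')]
```

Move WORD above STRING in the rules list and the same input comes back as two WORD tokens instead, which is exactly the wrong ordering described above.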

Crafting-input

All right. Let's go take a look at the big reveal for how it actually started. Up here is our string: we start at character 0, on line 1, with This, then is, then the LANGLE, then the b, then the RANGLE, then webpage!. And to reverse engineer this, you might note, for example, that the LANGLE is 1 character. If it starts at character 11, and the b comes right after it at character 12, there must be no spaces between them. Similarly, since the b is 1 character and starts at character 12, with the RANGLE at character 13, there must be no spaces between the b and the RANGLE. So the real trick here is figuring out what happens after the is and before the LANGLE, and knowing that you need these 3 extra spaces--all the action's happening here.

# Crafting Input

# Define a variable called webpage that holds a string that causes our lexical
# analyzer to produce the exact output below

# LexToken(WORD,'This',1,0)
# LexToken(WORD,'is',1,5)
# LexToken(LANGLE,'<',2,11)
# LexToken(WORD,'b',2,12)
# LexToken(RANGLE,'>',2,13)
# LexToken(WORD,'webpage!',2,14)

webpage = """This is
   <b>webpage!"""

import ply.lex as lex

tokens = ('LANGLE', # <
          'LANGLESLASH', # </
          'RANGLE', # >
          'EQUAL', # =
          'STRING', # "hello"
          'WORD', # Welcome!
          )

t_ignore = ' ' # shortcut for whitespace

def t_newline(token):
    r'\n'
    token.lexer.lineno += 1
    pass

def t_LANGLESLASH(token):
    r'</'
    return token

def t_LANGLE(token):
    r'<'
    return token

def t_RANGLE(token):
    r'>'
    return token

def t_EQUAL(token):
    r'='
    return token

def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token

def t_WORD(token):
    r'[^ <>\n]+'
    return token

htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
    tok = htmllexer.token()
    if not tok: break
    print(tok)

Commented-html

Well, let's go over it together. It looks temptingly like there might be a few left angle brackets, like this one or that one, but if you look carefully, you'll see that those are both part of the comment, so it's as if they were never there, so we don't see any left angles. We do see a left angle slash over here at the end of the string. Any right angles? Yes, right here at the absolute tail end of the string, but note that this one in here did not count. Any strings? "World" looks super tempting, but again, it's inside the comment, so it does not count. Word--and I've intentionally chosen these so that they conflict a little-- word, both "hello" and "confusing" totally apply.

Commented HTML.png

Token Counting

It turns out that the answer we're looking for is 3. Ob is one and tuse is another, and it's really tempting to put obtuse together because they're only separated by a comment. You think, "Boy, if I just removed that comment, they'd be one word, obtuse." But remember, you really want to treat comments as if you were totally erasing them, so the comment leaves a gap: the lexer enters the HTML comment mode and then comes back out. Ob and tuse are really quite separate, and then we've got tangle as the third word at the end.

Honor And Honour

Let's go through the answers together. If we call ironcrutchli with x as 0, then 0 is less than 2, so we're going to return 0. If we call ironcrutchli when x is 1, 1 is less than 2, so we return 1. This function is not so exciting thus far. But now we pass in 2 for x, and 2 is not less than 2. It is equal to 2, however, so we're going to return 2 + 1, or 3. And then over here if we pass in 9, 9 is not less than 2, so we're going to return 9 + 1, or 10. Iron Crutch Li was one of the Eight Immortals of traditional Chinese mythology. Not a particularly nice guy, but he did help out the sick.
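A definition consistent with that walkthrough looks like the sketch below (reconstructed from the values traced above; the course's actual listing may phrase it differently):

```python
def ironcrutchli(x):
    # Per the walkthrough: values below 2 come back unchanged,
    # everything else gets one added.
    if x < 2:
        return x
    return x + 1

print(ironcrutchli(0))  # 0
print(ironcrutchli(2))  # 3
print(ironcrutchli(9))  # 10
```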

Honor and Honour.png

Identifier

# Identifier

# Identifiers are textual string descriptions that refer to program elements,
# such as variables and functions. Write an identifier token rule for JavaScript identifiers.

# The token rule should match:

#   factorial
#   underscore_separated
#   mystery
#   ABC

# The token rule should not match:

#   _starts_wrong
#   123

def t_IDENTIFIER(token):
    r'[a-zA-Z][a-zA-Z_]*'
    return token
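You can check the regular expression from that rule against the quiz's examples with plain `re`, outside of PLY (using `re.fullmatch`, which only succeeds if the whole string matches):

```python
import re

identifier = r'[a-zA-Z][a-zA-Z_]*'

# All four positive examples match in full.
for good in ['factorial', 'underscore_separated', 'mystery', 'ABC']:
    assert re.fullmatch(identifier, good), good

# A leading underscore or a bare number never matches the first character class.
assert re.fullmatch(identifier, '_starts_wrong') is None
assert re.fullmatch(identifier, '123') is None
```

Note this quiz pattern deliberately stays simple; real JavaScript identifiers can also contain digits after the first character.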

Numbers

# Numbers

# Write a token rule for JavaScript numbers that converts the value
# of the token to a float.

# The token rule should match:

#    12
#    5.6
#    -1.
#    3.14159
#    -8.1
#    867.5309

# The token rule should not match:

#    1.2.3.4
#    five
#    jenny

def t_NUMBER(token):
    r'-?[0-9]+(?:\.[0-9]*)?'
    token.value = float(token.value)
    return token

Let me walk you through one way to do this.

As always, the most important part of a token definition is the regular expression that it corresponds to.

Here we can optionally start with a minus sign.

Then we definitely have one or more digits, 867, 3, 1.

And then there's a big optional part.

We can have a dot and then some number of digits after that--possibly zero of them, so -1. is okay, as is 3.14159.

In 3.14159 we've got 5 trailing digits; in -1. we've got 0 of them.

And this whole dot followed by digits is optional.

But remember that the dot has special meaning in regular expressions, so I'm going to need to use a \ to escape it to say we literally mean the period and not any character.

Then we turn that string into a floating point number. Then we return it.

Just to walk through these again, this part, the optional minus sign, catches either this minus sign or the nothing that's in front of this 3.

And this part here is this 1, this 3, or 867.

This dot, literally a dot, is this dot, that dot, and I think there was one under here, and then this part is 14159, 5309, or the nothing that's here after this dot.
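The walkthrough above can be checked directly with plain `re`, again using `re.fullmatch` outside of PLY:

```python
import re

number = r'-?[0-9]+(?:\.[0-9]*)?'

# Every positive example matches in full and converts cleanly to a float.
for good in ['12', '5.6', '-1.', '3.14159', '-8.1', '867.5309']:
    assert re.fullmatch(number, good), good
    print(good, '->', float(good))

# None of the negative examples match as a whole string.
for bad in ['1.2.3.4', 'five', 'jenny']:
    assert re.fullmatch(number, bad) is None, bad
```

Note that `float('-1.')` is a perfectly valid conversion in Python, which is why the optional trailing digits cause no trouble in the token rule's `token.value = float(token.value)` line.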

End Of Line

Python and JavaScript both provide a way for the programmer to add comments at the end of a line, which will be ignored by the interpreter.

# Python end-of-line comment.
// JavaScript end-of-line comment.

Here is the definition for a JavaScript end-of-line comment.  The regular expression looks for two slashes, followed by any number of characters that are not the newline character, \n.  It will take as many of those characters as it can find, so it takes everything up to the next \n.

def t_eolcomment(token):
  r'//[^\n]*'
  pass

Which of the following will yield IDENTIFIER IDENTIFIER NUMBER?

irene joliot_curie 9.1897 // 1956
ralph emerson 1803
henry thoreau // 1817.0
// marie curie 1867

Let's go through the answers together.

Perhaps the easiest to look at is good old Ralph Waldo Emerson. Ralph is one identifier.

Emerson is one identifier. 1803 is a number. It's transcendental-tastic.

This totally matches the pattern we're looking for. By contrast, Henry is an identifier.

Thoreau is an identifier, but these 2 slashes begin an end of line comment, so we'll never see the 1817, so this one does not match.

Down here, Marie Curie 1867, this would be identifier, identifier, number, except that the whole thing is canceled out by this end of line comment that starts over here.

Then up here, her daughter, Irene, this is one identifier.

This connected by an underscore is just one more identifier. Here's a number.

We match perfectly so far.

It looks like we'd stop matching because of this fourth number, but since it's in a comment, we ignore it, so both of these work.
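The answers above can be double-checked with a small `re`-only sketch of the three rules (this mimics the quiz's rule set without the PLY machinery, using named groups and a combined pattern):

```python
import re

# Comment wins first, then number, then identifier; whitespace is skipped.
token_spec = [
    ('COMMENT',    r'//[^\n]*'),
    ('NUMBER',     r'-?[0-9]+(?:\.[0-9]*)?'),
    ('IDENTIFIER', r'[a-zA-Z][a-zA-Z_]*'),
    ('SKIP',       r'[ \t]+'),
]
master = re.compile('|'.join('(?P<%s>%s)' % pair for pair in token_spec))

def tokenize(line):
    # Return token names, dropping comments and whitespace.
    return [m.lastgroup for m in master.finditer(line)
            if m.lastgroup not in ('COMMENT', 'SKIP')]

print(tokenize('ralph emerson 1803'))            # IDENTIFIER IDENTIFIER NUMBER
print(tokenize('henry thoreau // 1817.0'))       # IDENTIFIER IDENTIFIER
print(tokenize('// marie curie 1867'))           # nothing survives the comment
```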

And Marie Curie is relatively well known as a Polish physicist and chemist.

She was the first person to hold two Nobel Prizes, quite a feat, one in physics and one in chemistry, and in fact, she named the element polonium after her home country, Poland.

Thoreau wrote "Walden," which is relatively well known, a book about being self-reliant and dependable in which he starts on the first page by borrowing an ax from his neighbor. Marie Curie's daughter, Irene, also went on to win a Nobel prize for the discovery of artificial radioactivity.

End