cs101 ยป

CS101 - Unit 4 Homework

Question 1 - Data Structures

One thing many search engines do is keep track of the number of times a searcher clicks on each link. This is useful for improving ranking results (which we will talk about in Unit 6).

Suppose we want to modify our index to keep a count of the number of times a user clicks on each url. Which data structure would be a good way to represent the index?

[ [ keyword, [ url, url, url, ... ], count ], [ keyword, [ url, url, url, ... ], count ], ... ]

[ [ keyword, [ [url, count], [url, count], ... ] ], ... ]

[ [ keyword, url, [count, count, count, ...] ], [ keyword, url, [count, count, count, ...] ], ... ]

[ [ count, keyword, [url, url, ... ]], ... ]

Question 2 - Ben Bitdiddle

Ben Bitdiddle suggests changing the index code by replacing the add_to_index and lookup procedures with the ones shown below the question.

def add_to_index(index, keyword, url):
    index.append([keyword, url])

def lookup(index, keyword):
    result = []
    for entry in index:
        if entry[0] == keyword:
            result.append(entry[1])
    return result

This changes the structure of index, but suppose the only way we use index is by calling add_to_index and lookup. How would this affect the search engine?

  1. It would produce the wrong results for some lookup queries.
  2. It would produce the same results for all queries, but lookup would sometimes be faster than the original code.
  3. It would produce the same results for all queries, but add_to_index would be faster and lookup would usually be slower than the original code.
  4. It would produce the same results and take the same amount of time for all queries

Old Code:

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # not found, add new keyword to index
    index.append([keyword, [url]])

def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

Question 3 - Networking

It is taking too long for Udacity to upload our lecture videos to YouTube because the video files are many gigabytes of data. What do we need to solve this problem?

  1. Higher Latency
  2. Lower Latency
  3. More Bandwidth
  4. Less Bandwidth
  5. More colors in the videos
  6. Less bad hair in the videos

Question 4 - Better Splitting (1 Gold Star)

Question 5 - improving the index

# The current index includes a url in the list of urls
# for a keyword multiple times if the keyword appears
# on that page more than once.

# It might be better to only include the same url
# once in the url list for a keyword, even if it appears
# many times.

# Modify add_to_index so that a given url is only
# included once in the url list for a keyword,
# no matter how many times that keyword appears.

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # not found, add new keyword to index
    index.append([keyword, [url]])

def get_page(url):
    try:
        if url == "http://www.udacity.com/cs101x/index.html":
            return '''<html> <body> This is a test page for learning to crawl!
<p> It is a good idea to
<a href="http://www.udacity.com/cs101x/crawling.html">
learn to crawl</a> before you try to
<a href="http://www.udacity.com/cs101x/walking.html">walk</a> or
<a href="http://www.udacity.com/cs101x/flying.html">fly</a>.</p></body>
</html>'''

        elif url == "http://www.udacity.com/cs101x/crawling.html":
            return '''<html> <body> I have not learned to crawl yet, but I am
quite good at  <a href="http://www.udacity.com/cs101x/kicking.html">kicking</a>.
</body> </html>'''

        elif url == "http://www.udacity.com/cs101x/walking.html":
            return '''<html> <body> I cant get enough
<a href="http://www.udacity.com/cs101x/index.html">crawling</a></body></html>'''

        elif url == "http://www.udacity.com/cs101x/flying.html":
            return '''<html>
<body>The magic words are Squeamish Ossifrage!</body></html>'''
    except:
        return ""
    return ""

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        add_to_index(index, word, url)

def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return None

#index = crawl_web("http://www.udacity.com/cs101x/index.html")
#print lookup(index,"is")
#>>> ['http://www.udacity.com/cs101x/index.html']

Question 6 - Counting Clicks (2 Gold Stars)

Question 7 - Time spent at routers

Traceroute, round-trip time, from Birmingham, England to Sundsvall, Sweden takes 75 milliseconds.

The one-way distance from Birmingham to Sundsvall is 2500 km.

The speed of light in optical fiber is 200,000 km/s (speed of data between the routers)

What is the total time spent at the routers?

____5_ ms

(There are 1,000 milliseconds in 1 second)