**These are draft notes extracted from subtitles. Feel free to improve them. Contributions are most welcome. Thank you!**


Contents

- 1 Problem Set 1: Visualization
- 1.1 01 GPA Scatter Plot
- 1.2 02 GPA Scatter Plot
- 1.3 03 Scatter Plot Regions
- 1.4 04 Scatter Plot Regions
- 1.5 05 Scatter Plot Regions 2
- 1.6 06 Scatter Plot Regions 2
- 1.7 07 Page Load Scatter Plot
- 1.8 08 Page Load Scatter Plot
- 1.9 09 Bar Chart To Scatter Plot
- 1.10 10 Bar Chart To Scatter Plot
- 1.11 11 Pie Chart To Histogram
- 1.12 12 Pie Chart To Histogram Solution
- 1.13 13 Histogram To Pie Chart
- 1.14 14 Histogram To Pie Chart Solution

Hello, my name is Adam, and I'll be your assistant instructor for Introduction to Statistics. Welcome to our first problem set, covering visualization. Let's get started. First, I'd like to ask you about a scatter plot, specifically about the relationship between GPAs, or grade point averages, in high school and college. GPAs are a measure of academic performance and range from 0, the worst, to 4, the best. Let's look at some simulated data that might describe such a relationship. A student who received the highest GPA in both, a 4 and a 4, would be right about here. And let's fill in the rest of the points. Looking at this relationship, I would like you to tell me whether the relationship between high school and college GPAs is linear and whether it is exact. Check all that apply.

And looking at this data, it certainly appears that one can see a linear relationship; perhaps something like this line describes how they relate. But there's a fair bit of variation around it, so I would say that the relationship is linear but not exact.
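As a rough illustration of "linear but not exact", the sketch below simulates GPA pairs and measures how closely they follow a line using the Pearson correlation coefficient. The noise model, the sample size, and the helper name `pearson_r` are assumptions made for this example and are not taken from the lecture's data.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: high school GPAs spread between 1 and 4.
high_school = [rng.uniform(1.0, 4.0) for _ in range(200)]

# Assumed model: college GPA follows high school GPA plus random noise,
# clipped to the valid 0..4 range.
college = [min(4.0, max(0.0, hs + rng.gauss(0, 0.4))) for hs in high_school]

r_noisy = pearson_r(high_school, college)      # linear but not exact
r_exact = pearson_r(high_school, high_school)  # exact: every point on y = x

print(f"noisy relationship: r = {r_noisy:.2f}")
print(f"exact relationship: r = {r_exact:.2f}")
```

A coefficient of exactly 1 corresponds to an exact positive linear relationship, while a value that is high but below 1 corresponds to a linear trend with variation around it.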

Now, looking at this same data, I'd like to ask you a different question. Suppose we draw a line at a 45-degree angle. This represents the relationship y = x and divides the plot into 2 regions, or 3 if we count the line itself. We have a region here above the line and a region here below the line. What I'd like you to tell me is what the position of a point relative to these 3 sections tells us about the given student's GPAs. So in which of these regions would a student with the same GPA in high school and college fall? Would it be above the line, on the line, or below the line? Select the region in which you think it is.

And such a point would be on the line because, by definition, this line contains all points where this value equals this value.

Now I'd like to ask you the same question, except for a student with a high school GPA greater than their college GPA. So for someone who did better in high school than in college, would a point for that person be above the line, on the line, or below the line?

And a point for a student who did better in high school than in college would be below the line. You can see here that, in the extreme case, someone with a college GPA of 0 could have any high school GPA above 0 and would still be below the line. But as the college GPA gets higher, the high school GPA must be higher than the college one for the point to fall into that region. And so that is the answer: below the line.
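The three regions can be written as a small comparison. This is only a sketch, and it assumes high school GPA is on the horizontal axis and college GPA on the vertical axis, which is what the answers above imply; the function name `region` is made up for this example.

```python
def region(high_school_gpa, college_gpa):
    """Locate a student relative to the line y = x.

    Assumes x = high school GPA and y = college GPA.
    """
    if college_gpa > high_school_gpa:
        return "above the line"
    if college_gpa < high_school_gpa:
        return "below the line"
    return "on the line"

print(region(3.5, 3.5))  # same GPA in both
print(region(3.8, 3.0))  # did better in high school
print(region(2.0, 0.0))  # extreme case: college GPA of 0
```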

The next question I'd like to ask you is about a different scatter plot, this one relating to web analytics. We're comparing two measurements that are fairly commonly used. One is click-through rate, which represents the fraction of people who visit a page and go on to do something, typically clicking an ad, clicking through to a course on Udacity, clicking through to an article on a news site, etc. The other is the time it takes to load a web page, the page load time. Now, let's look at some simulated data. When looking at this data, I'd like you to answer a few questions. First, do you believe these pieces of data are related? That is, for different page load times, is the click-through rate different? Then, similar to what you've been asked before, I'd like you to tell me if the relationship between these things is linear, if in fact there is a relationship, and whether the relationship is exact. Finally, I'd like you to tell me one more thing: is the relationship between these positive or negative? To recall what positive and negative relationships are, here are a couple of simplified examples. A perfectly positive linear relationship would look something like this, but this and this are also positive relationships, whereas a negative linear relationship would look like this, but this and this are also negative relationships. Take a look at this data and check all that apply.

Well, it seems there's definitely a relationship between these. The mass of points seems to move as page load time increases, so I'd say yes, they are related. The relationship doesn't really appear to be linear, though. If I draw a line to try and fit this, I can do okay for one piece of the data, but then this piece over here doesn't appear anywhere near that line. Or I could possibly draw a line like this that's consistent with a large range of points, but this part appears to be pretty far away. So it's definitely not linear, and not exact either. You can, say, pick this value: we have a whole bunch of different click-through rates for the same page load time. Now, is the relationship positive or negative? Well, we started off with a higher click-through rate over here, and it seems to only go downward as page load time increases, so we have a negative relationship. In fact, the relationship probably looks something like this.
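One way to check the direction of a relationship numerically is the sign of the covariance, and one way to spot non-linearity is to see whether equal steps in x give equal changes in y. The numbers below are invented for illustration and are not the lecture's data; `direction` is a hypothetical helper.

```python
def direction(xs, ys):
    """Return 'positive', 'negative', or 'none' from the sign of the covariance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "none"

# Hypothetical averages: page load time in seconds vs. click-through rate.
load_time = [1, 2, 3, 4]
click_through = [0.30, 0.18, 0.11, 0.07]

trend = direction(load_time, click_through)

# Drop in click-through rate for each extra second of load time.
# A linear relationship would give the same drop at every step.
drops = [a - b for a, b in zip(click_through, click_through[1:])]

print(trend)
print(drops)
```

Here the drops shrink as load time grows, which matches a curve that falls steeply at first and then flattens, rather than a straight line.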

Now, I'd like to ask you about the relationship between bar charts and scatter plots. Both of these can be used to plot two different sets of data against each other, for example, home prices and home sizes. So a bar chart might look like this. For this bar chart, I'd like you to tell me which of these scatter plots looks most similar, as if it could represent the same underlying data. Is it this one, this one, this one, or this one? Please check whichever one looks closest.

The answer is this one. They both display a negatively sloping, somewhat linear relationship. If we look at this one, the relationship looks something like this, so a bar chart would look something like this. Whereas this one would just have bars of roughly equal height, and this one would actually be the same thing, because there's just more noise but it's the same flat non-relationship. Whereas for this one, we can see we could draw bars kind of like that.
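To see how a scatter plot can be turned into a bar chart, one option is to group the points into ranges of x and use the average y in each range as the bar height. The sketch below does this with made-up points; the helper `bar_heights` and the bin edges are assumptions for this example.

```python
def bar_heights(points, bin_edges):
    """Average the y-values of the points falling in each [left, right) bin."""
    heights = []
    for left, right in zip(bin_edges, bin_edges[1:]):
        ys = [y for x, y in points if left <= x < right]
        # An empty bin gets a height of 0.0 in this sketch.
        heights.append(sum(ys) / len(ys) if ys else 0.0)
    return heights

# Hypothetical scatter points with a downward trend.
points = [(0.5, 9.0), (0.8, 8.0), (1.2, 7.0), (1.9, 6.0),
          (2.3, 4.0), (2.6, 3.0), (3.1, 2.0), (3.7, 1.0)]

heights = bar_heights(points, [0, 1, 2, 3, 4])
print(heights)  # → [8.5, 6.5, 3.5, 1.5]
```

The bars step downward from left to right, mirroring the negative slope of the points. A flat cloud of points would instead give bars of about the same height.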

Now I'd like to ask you about the relationship between a pie chart and a histogram, both of which represent count data or frequency data or, if you prefer, relative sizes. For this pie chart, I'd like you to tell me which histogram looks like it came from the same data. Is it this or this or this or is it this? Select the closest answer.

I would say the answer is this one, and the reason is that it has two clearly larger groups and two smaller groups that are about the same size. All the groups over here are pretty different sizes, and the same is true here, and all the groups here are exactly the same size.

Now, I'd like to give you a histogram and have you tell me which pie chart looks like it could have come from the same data. Is it this or this or this or is it this? Choose whichever one seems to fit best.

And I would say the answer is this one, because these quadrants are equally sized and these bars are of equal size. Whereas these three have slices that are significantly different in size, which would imply significantly different counts.
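The link between bar heights and pie slices is that each slice takes up the same share of the circle as its count takes of the total. A minimal sketch, using invented counts and a hypothetical helper `pie_angles`:

```python
def pie_angles(counts):
    """Convert histogram counts into pie slice angles in degrees."""
    total = sum(counts)
    return [360 * count / total for count in counts]

# Four equal bars give four equal quadrants.
equal_bars = pie_angles([5, 5, 5, 5])
print(equal_bars)  # → [90.0, 90.0, 90.0, 90.0]

# Two larger groups and two smaller groups of about the same size.
mixed_bars = pie_angles([8, 8, 2, 2])
print(mixed_bars)  # → [144.0, 144.0, 36.0, 36.0]
```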