IFSM 300
Name:______________

Analyzing a larger data file.
Bold edit this file in Word. Delete the non-question sections.

Generate 1000000 lines of data in the "Generate Random 'kana' data" webpage at
http://davidwills.us/ifsm300/kana_random.html
and then copy and paste the lines into Notepad and save it.

What is the size as reported by Windows File Browser, 
in KB: ______
in MB: ______
in bytes:_______________

In Notepad:
How many lines does the file have:_________
What separates the fields in each line:_______
Go to the luckiest line (777,777). What is its 6-letter word:_______
How many times does zozozo exist in the file:_____
Does 123456789 exist in the file: line#___________
  If not, does 12345678 exist in the file: line#___________
  If not, does 1234567 exist in the file: line#___________
How many lines start with 123456: ___
Notepad could change data in the file, but we won't do any.
This is about all that an editor like Notepad can do with this data.

Import the file into Excel.
Format column E so that it matches the original data from the text file
 and so that the decimal points line up.
Format the text columns to a fixed-width font like Consolas or Courier New 
 so that it is clear how many characters are in each word. 
Center column D, the single letter field.
Make a new column that is the product of the third and fifth columns,
 i.e. columns C and E. Format it the same as column E.
Add a new column H that is just 1 2 3 ... 
 (we can use this column after sorting on other columns to get back our 
 original order. This column is essentially just a row number).

On a sheet by itself:
Show the sum, average, *median, min, max, and count of the four numeric columns (A,C,E,G)
 --display this in a tabular format, on a sheet by itself, the stats formatted 
   to reflect the data they derive from.
 (Median is another kind of averge; it is the middle value: half the values
  are bigger than it, half are smaller.)
Be careful about the E and G columns averages and medians.
Use the CORREL function to calculate the correlation between a pair of columns. 
 Calculate it between columns C and E, C and G, and E and G.
 (correlation is a measure of how much, if any, two paired measures or 
  variables, or here, columns, move up or down together;
   a correlation around 0 means the two measures aren't "associated",
   a correlation near 1 means the two measures seem to move together,
   either both going up or one up one down, and so might be related somehow). 
 Put these correlation coefficients on the stats sheet.
Print the stats sheet.

Without sorting, what is the 6-letter word of the row with the 
 smallest column A value:_______  (Hint: use Find)
How many rows of column E have the median value of column E:_______
 (the leading 0 might be necessary)
How many rows of column E are greater than or equal to the average of column E:_______

Any chart of this huge amount of data won't be useful or usable.
Copy the first 1000 rows to a new sheet.
We will assume this is a random sample of the population of the data
 and so by the miracle of random sampling, information about the sample
 will be probabilistically accurate about the population, i.e all the data.
Make a scatterchart of column C so its distribution can be visualized
 and thus understood. Improve the chart's axes, (no) legend, title, markers. 
 This is a uniform distribution, the values are from everywhere in the range
   from the min to the max and all the values seem to be equally likely.
Make a scatterchart of column E so its distribution can be visualized
 and thus understood. Improve the chart. 
E clearly doesn't have values everywhere 
  but rather at particular values only, like 1, .5, .33, .25 etc.;
  its range is from 0 to 1 but its average is less than .1.
Make a scatterchart of column G so its distribution can be visualized
 and thus understood. Improve the chart.
Make a scatter chart of columns E and G. Make improvements to the chart.
 Add a Linear Trendline and Display R-squared value. 
Print the four charts.

Back to the 1M row data sheet. 
We can find out how many different values there are in a column.
 Copy the three text columns (B, D, and F) to a new sheet. 
 On that sheet, in Data tab,  Remove Duplicates of each column by itself. 
  (choose "Continue with current selection")
 How many unique values are there of column B:________
 How many unique values are there of column D:________
 How many unique values are there of column F:________
 Sort each column to see that there are no duplicates.
  (choose "Continue with current selection")
  What's the first 6-letter word:_________, the last:__________
  What's the first 4-letter word:_________, the last:__________
  How many different 6-letter words are there that start with 'to':_______
  (hint: apply a Filter to this column, "Begins with")
Copy column A's million values to this new sheet. Remove its duplicates.
 How many duplicates are there in column A:_________
Delete this sheet after you've done the above.

Save your file.
What is the size of the Excel file, in bytes: _________________ KB:_______ and MB:___