IFSM 201
Name:______________

Analyzing a larger data file.
Bold edit this file in Word. Delete the non-question sections.

Generate 1000000 (one million) lines of data in the "Generate Random 'kana' data" webpage at
http://davidwills.us/ifsm300/kana_random.html
and then copy and paste the lines into Notepad and save it.

What is the size of the file as reported by Windows File Browser, 
in KB: ______
in MB: ______
in bytes:_______________

In Notepad:
How many lines does the file have:_________
What separates the fields in each line:_______
Go to the luckiest line (777,777). What is its 6-letter word:_______
How many times does zozozo exist in the file:_____
Does 123456789 exist in the file: line#___________
  If not, does 12345678 exist in the file: line#___________
  If not, does 1234567 exist in the file: line#___________
How many lines start with 123456: ___


Import the file into Excel.            * means something NEW.
Format column E so that it matches the original data from the text file
 and so that the decimal points line up.
Format the text columns to a fixed-width *font like Consolas or Courier New 
 so that it is clear how many characters are in each word. 
Center column D, the single letter field.
Make a new column G that is the product of the third and fifth columns,
 i.e. columns C and E. Format it the same as column E.
*Add a new column H that is just 1 2 3 ... 
 (we can use this column after sorting on other columns to get back our 
 original order. This column is essentially just a [original] row number).

On a sheet by itself:
Show the sum, average, standard deviation, *median, min, max, and count of the 
 four numeric columns (A,C,E,G)
 --display this in a tabular format, on a sheet by itself, the stats formatted 
   to reflect the data they derive from.
 (Median is another kind of averge; it is the middle value: half the values
  are bigger than it, half are smaller.)
Be careful about the E and G columns averages and medians.
Use the CORREL function to calculate the *correlation coefficient, r, 
 between a pair of columns. 
 Calculate it between columns C and E, C and G, and E and G.
 (correlation is a measure of how much, if any, two paired measures or 
  variables, or here, columns, move up or down together;
   a correlation around 0 means the two measures aren't "associated",
   a correlation near 1 means the two measures seem to move together,
   either both going up or one up one down, and so might be related somehow).
 r is the same r whose square is diplayed with a linear trendline.
 Put these correlation coefficients on the stats sheet.
Print the stats sheet.

Without sorting, what is the 6-letter word of the row with the 
 smallest column A value:_______  (Hint: use Find)

How many rows of column E have the median value of column E:_______

How many rows of column E are greater than or equal to the average of column E:_______

When there are many different values in a column, the filter checkbox list might
not show all of them, so you can't rely on it.
How many rows contain zuzuzu as the 6-letter word:_______


Any chart of this huge amount of data won't be useful or usable.
Copy the first 1000 rows to a new sheet. Paste the values to avoid problems
with the formula.
We will assume this is a random sample of the population of the data
 and so by the miracle of random sampling and the central limit theorem, 
 information about the sample will be probabilistically accurate about the 
 population, i.e all the data.
Make a scatterchart of this column C so its distribution can be visualized
 and thus understood. Improve the chart's axes, (no) legend, title, markers. 
 This is a uniform distribution, the values are from everywhere in the range
   from the min to the max and all the values seem to be equally likely.
Make a scatterchart of this column E so its distribution can be visualized
 and thus understood. Improve the chart. 
E clearly doesn't have values everywhere 
  but rather at particular values only, like 1, .5, .33, .25 etc.;
  its range is from 0 to 1 but its average is less than .1.
Make a scatterchart of this column G so its distribution can be visualized
 and thus understood. Improve the chart. 
 Comparing this chart with those of columns C and E: which of C and E appears
 to have more "influence" on the pattern of G?
Print the three charts.


Back to the 1M row data sheet. 
We can find out how many different values there are in a column.
 Copy the three text columns (B, D, and F) to a new sheet. one-by-one?
 On that sheet, in Data tab,  in Data Tools, *Remove Duplicates of each column by itself. 
  (choose "Continue with current selection")
 How many unique values are there of column B:________
 How many unique values are there of column D:________
 How many unique values are there of column F:________
 Sort each column (by itself) to see that there are no duplicates.
  (choose "Continue with current selection")
  What's the first 6-letter word:_________, the last:__________
  What's the first 4-letter word:_________, the last:__________
  How many different 6-letter words are there that start with 'to':_______
  (hint: apply a Filter to this column, "Begins with")

Are column A's numbers all unique?
Copy column A's million values to this new sheet. Remove its duplicates.
 How many duplicates are there in column A:_________
Delete this sheet after you've done the above.

Save your file.
What is the size of the Excel file, in bytes: _________________ KB:_______ and MB:___