IFSM 201 Name:______________ Analyzing a larger data file. Bold edit this file in Word. Delete the non-question sections. Generate 1000000 (one million) lines of data in the "Generate Random 'kana' data" webpage at http://davidwills.us/ifsm300/kana_random.html and then copy and paste the lines into Notepad and save it. What is the size of the file as reported by Windows File Browser, in KB: ______ in MB: ______ in bytes:_______________ In Notepad: How many lines does the file have:_________ What separates the fields in each line:_______ Go to the luckiest line (777,777). What is its 6-letter word:_______ How many times does zozozo exist in the file:_____ Does 123456789 exist in the file: line#___________ If not, does 12345678 exist in the file: line#___________ If not, does 1234567 exist in the file: line#___________ How many lines start with 123456: ___ Import the file into Excel. * means something NEW. Format column E so that it matches the original data from the text file and so that the decimal points line up. Format the text columns to a fixed-width *font like Consolas or Courier New so that it is clear how many characters are in each word. Center column D, the single letter field. Make a new column G that is the product of the third and fifth columns, i.e. columns C and E. Format it the same as column E. *Add a new column H that is just 1 2 3 ... (we can use this column after sorting on other columns to get back our original order. This column is essentially just a [original] row number). On a sheet by itself: Show the sum, average, standard deviation, *median, min, max, and count of the four numeric columns (A,C,E,G) --display this in a tabular format, on a sheet by itself, the stats formatted to reflect the data they derive from. (Median is another kind of averge; it is the middle value: half the values are bigger than it, half are smaller.) Be careful about the E and G columns averages and medians. Use the CORREL function to calculate the *correlation coefficient, r, between a pair of columns. Calculate it between columns C and E, C and G, and E and G. (correlation is a measure of how much, if any, two paired measures or variables, or here, columns, move up or down together; a correlation around 0 means the two measures aren't "associated", a correlation near 1 means the two measures seem to move together, either both going up or one up one down, and so might be related somehow). r is the same r whose square is diplayed with a linear trendline. Put these correlation coefficients on the stats sheet. Print the stats sheet. Without sorting, what is the 6-letter word of the row with the smallest column A value:_______ (Hint: use Find) How many rows of column E have the median value of column E:_______ How many rows of column E are greater than or equal to the average of column E:_______ When there are many different values in a column, the filter checkbox list might not show all of them, so you can't rely on it. How many rows contain zuzuzu as the 6-letter word:_______ Any chart of this huge amount of data won't be useful or usable. Copy the first 1000 rows to a new sheet. Paste the values to avoid problems with the formula. We will assume this is a random sample of the population of the data and so by the miracle of random sampling and the central limit theorem, information about the sample will be probabilistically accurate about the population, i.e all the data. Make a scatterchart of this column C so its distribution can be visualized and thus understood. Improve the chart's axes, (no) legend, title, markers. This is a uniform distribution, the values are from everywhere in the range from the min to the max and all the values seem to be equally likely. Make a scatterchart of this column E so its distribution can be visualized and thus understood. Improve the chart. E clearly doesn't have values everywhere but rather at particular values only, like 1, .5, .33, .25 etc.; its range is from 0 to 1 but its average is less than .1. Make a scatterchart of this column G so its distribution can be visualized and thus understood. Improve the chart. Comparing this chart with those of columns C and E: which of C and E appears to have more "influence" on the pattern of G? Print the three charts. Back to the 1M row data sheet. We can find out how many different values there are in a column. Copy the three text columns (B, D, and F) to a new sheet. one-by-one? On that sheet, in Data tab, in Data Tools, *Remove Duplicates of each column by itself. (choose "Continue with current selection") How many unique values are there of column B:________ How many unique values are there of column D:________ How many unique values are there of column F:________ Sort each column (by itself) to see that there are no duplicates. (choose "Continue with current selection") What's the first 6-letter word:_________, the last:__________ What's the first 4-letter word:_________, the last:__________ How many different 6-letter words are there that start with 'to':_______ (hint: apply a Filter to this column, "Begins with") Are column A's numbers all unique? Copy column A's million values to this new sheet. Remove its duplicates. How many duplicates are there in column A:_________ Delete this sheet after you've done the above. Save your file. What is the size of the Excel file, in bytes: _________________ KB:_______ and MB:___