IFSM 300 Name:______________ Analyzing a larger data file. Bold edit this file in Word. Delete the non-question sections. Generate 1000000 lines of data in the "Generate Random 'kana' data" webpage at http://davidwills.us/ifsm300/kana_random.html and then copy and paste the lines into Notepad and save it. What is the size as reported by Windows File Browser, in KB: ______ in MB: ______ in bytes:_______________ In Notepad: How many lines does the file have:_________ What separates the fields in each line:_______ Go to the luckiest line (777,777). What is its 6-letter word:_______ How many times does zozozo exist in the file:_____ Does 123456789 exist in the file: line#___________ If not, does 12345678 exist in the file: line#___________ If not, does 1234567 exist in the file: line#___________ How many lines start with 123456: ___ Notepad could change data in the file, but we won't do any. This is about all that an editor like Notepad can do with this data. Import the file into Excel. Format column E so that it matches the original data from the text file and so that the decimal points line up. Format the text columns to a fixed-width font like Consolas or Courier New so that it is clear how many characters are in each word. Center column D, the single letter field. Make a new column that is the product of the third and fifth columns, i.e. columns C and E. Format it the same as column E. Add a new column H that is just 1 2 3 ... (we can use this column after sorting on other columns to get back our original order. This column is essentially just a row number). On a sheet by itself: Show the sum, average, *median, min, max, and count of the four numeric columns (A,C,E,G) --display this in a tabular format, on a sheet by itself, the stats formatted to reflect the data they derive from. (Median is another kind of averge; it is the middle value: half the values are bigger than it, half are smaller.) Be careful about the E and G columns averages and medians. Use the CORREL function to calculate the correlation between a pair of columns. Calculate it between columns C and E, C and G, and E and G. (correlation is a measure of how much, if any, two paired measures or variables, or here, columns, move up or down together; a correlation around 0 means the two measures aren't "associated", a correlation near 1 means the two measures seem to move together, either both going up or one up one down, and so might be related somehow). Put these correlation coefficients on the stats sheet. Print the stats sheet. Without sorting, what is the 6-letter word of the row with the smallest column A value:_______ (Hint: use Find) How many rows of column E have the median value of column E:_______ (the leading 0 might be necessary) How many rows of column E are greater than or equal to the average of column E:_______ Any chart of this huge amount of data won't be useful or usable. Copy the first 1000 rows to a new sheet. We will assume this is a random sample of the population of the data and so by the miracle of random sampling, information about the sample will be probabilistically accurate about the population, i.e all the data. Make a scatterchart of column C so its distribution can be visualized and thus understood. Improve the chart's axes, (no) legend, title, markers. This is a uniform distribution, the values are from everywhere in the range from the min to the max and all the values seem to be equally likely. Make a scatterchart of column E so its distribution can be visualized and thus understood. Improve the chart. E clearly doesn't have values everywhere but rather at particular values only, like 1, .5, .33, .25 etc.; its range is from 0 to 1 but its average is less than .1. Make a scatterchart of column G so its distribution can be visualized and thus understood. Improve the chart. Make a scatter chart of columns E and G. Make improvements to the chart. Add a Linear Trendline and Display R-squared value. Print the four charts. Back to the 1M row data sheet. We can find out how many different values there are in a column. Copy the three text columns (B, D, and F) to a new sheet. On that sheet, in Data tab, Remove Duplicates of each column by itself. (choose "Continue with current selection") How many unique values are there of column B:________ How many unique values are there of column D:________ How many unique values are there of column F:________ Sort each column to see that there are no duplicates. (choose "Continue with current selection") What's the first 6-letter word:_________, the last:__________ What's the first 4-letter word:_________, the last:__________ How many different 6-letter words are there that start with 'to':_______ (hint: apply a Filter to this column, "Begins with") Copy column A's million values to this new sheet. Remove its duplicates. How many duplicates are there in column A:_________ Delete this sheet after you've done the above. Save your file. What is the size of the Excel file, in bytes: _________________ KB:_______ and MB:___