Descriptive statistics: describe the data.
Summarize (statistics), tabulate (frequency distribution), graph (histogram, etc.)
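
A minimal Python sketch of the three activities above (summarize, tabulate, graph), using only the standard library. The data values and the variable name `pets` are made up for illustration.

```python
from collections import Counter
import statistics

# Hypothetical data: number of pets per household (made up for illustration).
pets = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]

# Summarize: a few statistics.
mean = statistics.mean(pets)
median = statistics.median(pets)
mode = statistics.mode(pets)

# Tabulate: frequency distribution (value -> count).
freq = Counter(pets)

# Graph: a crude text histogram, one '*' per observation.
for value in sorted(freq):
    print(f"{value}: {'*' * freq[value]}")

print("mean =", mean, "median =", median, "mode =", mode)
```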

Data

Data
  Qualitative (nominal, categorical): words
  Quantitative: numbers
    Discrete: integers; counts
    Continuous: real nos. (decimals); measurements

Levels of measurement
(For each level: what it is, examples, what you can do with it.)

Nominal: names, labels, categories.
  Examples: Yes/No, Agree/Disagree, Have/Have-not, Success/Failure, M/F, ...
    Marital status, state, county, zip code, major, brand, make, model, color, place,
    race, religion, party, ideology, ..., tax filing status, blood type, housing type, pet.
  Can do: Count/tally each category. Relative frequency. Mode. Bar chart.
    Chi-square tests (independence, goodness-of-fit).
    Confidence interval for a proportion (1-PropZInt).

Ordinal: orderable/rankable categories, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
  Examples: class (frosh/soph/jun/sen), trim levels, film ratings, gold/silver/bronze, letter grades, days of week, months, education level, clothing sizes, pain scales, military rank, star ratings, priority/risk levels.
    Likert scale:
      Strongly disagree / Disagree / Neutral / Agree / Strongly agree
      Very dissatisfied / Dissatisfied / Neutral / Satisfied / Very satisfied
      Poor / Fair / Good / Very good / Excellent
  Can do: Above + percentiles, median/quartiles, Spearman.

Interval: numbers: orderable, and differences between data values can be found and are meaningful. But no natural zero (meaning none of the quantity): 0 is just an arbitrary reference point.
  Examples: temperature in C or F, years/dates, shoe size, IQ/SAT, FICO, pH.
  Can do: histogram, mean, median, SD, ...
    Estimation/CI and hypothesis testing (t-test), ANOVA, correlation, linear regression.

Ratio: numbers: orderable, and differences between data values can be found and are meaningful, and there is a natural zero (meaning none of the quantity), so ratios (e.g. "twice as much") are meaningful.
  Examples: weight, height, age, length, area, volume, time, money, temperature in K, energy, BP, LDL, BMI, DJI, S&P 500.
  Can do: Above + CV (coefficient of variation), GM (geometric mean).
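
To make the levels concrete, here is a rough Python sketch of one or two permitted operations per level, using only the standard library. All data values are made up, and CV and GM are read here as coefficient of variation and geometric mean (an assumption about the abbreviations).

```python
from collections import Counter
import statistics

# All data below are made up for illustration.

# Nominal: count categories, relative frequency, mode.
blood = ["O", "A", "O", "B", "A", "O", "AB", "O"]
counts = Counter(blood)
rel_freq = {k: v / len(blood) for k, v in counts.items()}
nominal_mode = counts.most_common(1)[0][0]

# Ordinal: encode the order as ranks; then a median is meaningful.
order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
ratings = ["Good", "Fair", "Excellent", "Good", "Very good"]
ranks = [order.index(r) for r in ratings]
median_rating = order[statistics.median_low(ranks)]

# Interval: mean and SD are meaningful; ratios are not
# (20 C is not "twice as hot" as 10 C).
temps_c = [18.0, 21.5, 19.0, 23.5, 20.0]
mean_c = statistics.mean(temps_c)
sd_c = statistics.stdev(temps_c)  # sample SD

# Ratio: CV and geometric mean are also meaningful.
weights_kg = [60.0, 72.0, 81.0, 66.0, 75.0]
cv = statistics.stdev(weights_kg) / statistics.mean(weights_kg)
gm = statistics.geometric_mean(weights_kg)

print("mode:", nominal_mode, "| median rating:", median_rating)
print(f"mean temp: {mean_c:.1f} C, SD: {sd_c:.2f}")
print(f"CV: {cv:.3f}, GM: {gm:.1f} kg")
```

Note that the ordinal ranks are used only for ordering (the median), never for subtraction or averaging.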


Data "set" (though, unlike a mathematical set, it can have duplicates): consists of datums/observations/measurements/individuals/scores (all mean the same thing), e.g. weights of adults, greasiness of bags of chips, longevity of bulbs, widget regional sales, effect of pill...

Example: Population: weights of adults in country/county.
Not possible to census this. So need a non-biased, representative sample (a teaspoon of the pot of soup).
Ideal: Simple random sample (SRS): every adult equally-likely to be in the sample and every sample of that size is equally-likely.
  Bad: voluntary response, convenience sample.
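
A small sketch of drawing a simple random sample in Python, contrasted with a convenience sample. The population of ID numbers, the sample size of 25, and the seed are hypothetical choices for illustration only.

```python
import random

# Hypothetical sampling frame: ID numbers for 1000 adults.
population = list(range(1000))

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Simple random sample of size 25, drawn without replacement:
# every subset of 25 IDs is equally likely to be chosen.
srs = rng.sample(population, 25)

# Contrast: a convenience sample takes whoever is easiest to reach,
# here simply the first 25 IDs, so most adults have no chance of selection.
convenience = population[:25]

print(sorted(srs))
```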
Collect data. Measured vs self-reported (unreliable).
Calculate/derive statistic from the data: a point estimate of the parameter. But samples have uncertainty/variability so determine [confidence] interval estimate.
Inferential statistics: use probability to understand/quantify/describe uncertainty.
If have census, i.e. population is all known, no need to sample, just describe the population. Sample(s) only useful/taken/needed to estimate population parameter(s).
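
The point estimate and interval estimate described above can be sketched as follows. This is only an illustration under stated assumptions: the population is simulated (not real weights), the sample size is 100, and the interval uses the normal approximation x̄ ± z*·s/√n because the standard library has no t quantiles; for small samples a t-based interval would normally be used instead.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Simulated population of adult weights in kg (made-up parameters, not real data).
population = [rng.gauss(75, 12) for _ in range(100_000)]

# Simple random sample of n = 100 adults.
n = 100
sample = rng.sample(population, n)

# Point estimate of the population mean.
x_bar = statistics.mean(sample)

# 95% interval estimate: x_bar +/- z* * s / sqrt(n) (normal approximation).
s = statistics.stdev(sample)
z_star = NormalDist().inv_cdf(0.975)  # about 1.96
margin = z_star * s / n ** 0.5
ci = (x_bar - margin, x_bar + margin)

print(f"point estimate: {x_bar:.1f} kg")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f}) kg")

# With a census there is nothing to estimate: just compute the parameter.
mu = statistics.mean(population)
print(f"population mean (census): {mu:.1f} kg")
```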


interval: a continuous range of numbers, e.g.
[1.45, 3.7] = all x with 1.45 ≤ x ≤ 3.7

Data set mapped/transformed/normalized to z-scores: each datum x: its number of SDs from the mean:
z = (x-mean) / SD
Within ±2 is "normal": -2 < z < 2, i.e. (-2, 2).
≤-2 (-∞,-2] or ≥2 [2,∞) is "statistically significant", i.e. maybe important.
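
A short worked example of the z-score transformation in Python. The data set is made up, and the population SD (`statistics.pstdev`) is used here as an assumption; with sample data, `statistics.stdev` would be the usual choice.

```python
import statistics

# Made-up data set for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population SD; use statistics.stdev for a sample

# z = (x - mean) / SD for each datum x.
z_scores = [(x - mean) / sd for x in data]

# Flag values at least 2 SDs from the mean.
flagged = [x for x, z in zip(data, z_scores) if abs(z) >= 2]

for x, z in zip(data, z_scores):
    print(f"x = {x}, z = {z:+.2f}")
print("flagged:", flagged)
```

Here the mean is 5 and the SD is 2, so x = 9 has z = (9 - 5) / 2 = 2 and is the only value flagged.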


random stochastic aleatory chance luck mis/fortune contingent accidental peradventure fate fortuitous hazard