Statistics summarize complex events and phenomena. Consider the following statistical statement: “56% of Americans believe gun crime is worse today than it was 20 years ago.”(1) Packed within that one sentence are assumptions about the nature of “gun crime,” historical crime trends, public perceptions of crime and policing, and the reliability and validity(2) of the related survey, conducted by the Pew Research Center. Like all statistics, “56% of Americans…” summarizes a population and reduces social complexity down to a number.

Statistics, according to the late Sir Ronald A. Fisher, the pre-eminent early pedagogue of the subject, “may be regarded as (i) the study of populations, (ii) as the study of variation, (iii) as the study of methods of reduction of data.”(3) Each of these elements demands exposition.

Statistics enable us to **study populations**, or large groups of people or things. In that regard, statistics are distinct from numbers. For instance, I am approximately five feet and six inches tall, or 5’6”. These numbers represent my height in a meaningful way for folks who think in feet and inches. Yet we would not typically refer to my height, or even the collected heights of my immediate family, as statistics. Rather, if we wanted to talk about height statistics, we would focus on schools, cities, states, countries, or other geographic areas. We might focus our statistical inquiry on a subset of people within those locations (e.g., Do the heights of African American third graders vary among the 50 U.S. states?), but we’d still be thinking in large numbers. How many numbers at a minimum? While there’s no universal answer to that, 100 or more individuals, items, or instances is a good baseline for calculating statistics. Often, we are trying to make inferences about much larger populations than that.

Within those populations, we want to know how individuals differ. We **analyze variance** because it is a near-universal occurrence, and because it helps us to address social differences and problems. Consider the counterfactual, an instance in which a government spends exactly $5 on the eye care of every citizen. A bar chart reflecting the average spent per age group would look like this:

First, note that we don’t usually create bar charts to represent group averages; usually bar charts reflect simple counts, as you’ll see in the next example. Second, realize that you are unlikely to find a bar chart representing a perfectly uniform characteristic among a population. It doesn’t happen often, and researchers aren’t as interested in what we all have in common. They tend to focus on phenomena that vary within a population, such as complete and permanent blindness (e.g., per 100,000 people), as depicted in the following chart (i.e., arranged by age groups):

Consider the chart in the context of universal eye care coverage (i.e., $5 per person). Is the coverage more useful for the very young or old? Should it be suspended for 80- to 99-year-olds, because it is not helping them much? Do the blind regain their sight at their 100th birthday? These questions are silly, but they reflect an important truth about statistics: understanding their meaning requires more than facility with numbers. Statistics only make sense in context.(4) Additionally, regardless of what a simple count or percentage says about the variance within a population, it is still a reduction of the collected data.
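To make the contrast between a uniform characteristic and a varying one concrete, here is a minimal Python sketch. The $5 figure comes from the eye care example; the six age groups and the varying rates are hypothetical values invented purely for illustration, not real data.

```python
from statistics import mean, pvariance

# Uniform case from the text: every citizen receives exactly $5 of eye care.
# The choice of six age groups is an assumption made for illustration.
uniform_spending = [5.0] * 6

# Hypothetical, made-up rates (e.g., cases per 100,000 people by age group).
# These are NOT real blindness figures.
varying_rates = [10.0, 20.0, 40.0, 80.0, 160.0, 50.0]

# pvariance treats the list as the whole population, not a sample.
print("uniform:", mean(uniform_spending), pvariance(uniform_spending))
print("varying:", mean(varying_rates), pvariance(varying_rates))
```

The uniform list has a variance of zero, which is why it offers little to analyze. The varying list has a nonzero variance, and that spread is what invites questions about why the groups differ.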

Statistics **reduce data** to summary numbers that help us to grasp trends within a population. When researchers collect raw data, for instance on whether a person is blind or not, each person or item becomes a case and each survey question or research item becomes a single number. The following string of numbers reflects whether 100 individuals are completely and permanently blind (=1) or not (=0); they would typically be recorded in individual boxes or cells within a column (e.g., in Excel):

0000000100000000000000000000001000000000100000000000000000000000000000000000000000000000000000000000

A summary statistic for this group of people is 3%. If this statistic were derived from a valid, reliable, random sample(5) of a sizeable population, we might say that 3%, give or take(6), of all the people in the community are blind.
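The reduction from 100 coded cases to "3%, give or take" can be sketched in a few lines of Python. The function names are hypothetical, and the sketch makes assumptions the text does not state: a simple random sample, a 95% confidence level, and the Wilson score interval as one of several reasonable ways to express the "give or take." The string below is a stand-in with the same makeup as the recorded data (97 zeros and 3 ones); the order of the cases does not affect the result.

```python
from math import sqrt

def proportion(bits: str) -> float:
    """Share of cases coded 1 in a string of 0/1 codes."""
    return bits.count("1") / len(bits)

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Stand-in for the 100 recorded cases: 1 = blind, 0 = not blind.
cases = "0" * 97 + "1" * 3

share = proportion(cases)
low, high = wilson_interval(cases.count("1"), len(cases))
print(f"{share:.0%} blind; 95% interval roughly {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from about 1% to about 8.5%, which shows how much uncertainty a sample of 100 leaves around a small percentage.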

My gut tells me that this is an unusually high percentage. That “gut feeling,” accompanied by perennial skepticism, would lead me to interrogate the research methods used to collect and record the raw data (e.g., How was the sample selected? How were the numbers entered into the computer?), the mathematics used to calculate the statistic, and the population under study (i.e., Is this group qualitatively different from other populations?).

To learn more, check out these YLS library books:

R.A. Fisher, *Statistical Methods, Experimental Design, and Scientific Inference* (2003).

Robert M. Lawless, Jennifer K. Robbennolt, & Thomas S. Ulen, *Empirical Methods in Law* (2010).

Michael O. Finkelstein, *Basic Concepts of Probability and Statistics in the Law* (2009).

Alan C. Acock, *A Gentle Introduction to Stata* (2012).

Peter Cane & Herbert M. Kritzer (eds.), *The Oxford Handbook of Empirical Legal Research* (2010).

Ian Ayres, *Super Crunchers: Why Thinking-by-Numbers is the New Way to be Smart* (2007).

For more information about this entry, or about empirical legal studies broadly, please contact sarah.ryan@yale.edu

ENDNOTES

(1) Paul Overberg & Meghan Hoyer, *Study: Despite Drop in Gun Crime, 56% Think It’s Worse*, USA Today, May 31, 2013, http://www.usatoday.com/story/news/nation/2013/05/07/gun-crime-drops-but…

(2) Read more about reliability and validity at: http://www.socialresearchmethods.net/kb/relandval.php

(3) Ronald A. Fisher, *Statistical Methods for Research Workers* (14th ed., Hafner Publishing Company, 1973), 1.

(4) See Daniel E. Ho & Donald B. Rubin, *Credible Causal Inference for Empirical Legal Studies*, 7 Annu. Rev. Law Soc. Sci. 17 (2011).

(5) Read more about sampling at: http://psychology.ucdavis.edu/rainbow/html/fact_sample.html

(6) Mathematical give or take is expressed in many ways. Two common ways are: (i) the p-value and (ii) the confidence interval. Read more about these concepts at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2689604/