Numbers & statistics

Time spent with friends (The Pudding)

Science is powered by observational evidence.

But to be systematic, researchers will count, measure & quantify their observations—turning their data into numbers & statistics.

Researchers do this because numbers have special properties:

  • They let you compress hundreds, thousands, or millions of observations into just a handful of digits.
  • They are precise in a way that words are not.
  • You can use mathematics to operate on numbers in consistent ways.
  • You can visualise numbers in a variety of ways.
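To make the first point concrete, here is a minimal Python sketch that compresses a hundred thousand observations into four numbers. The data is synthetic: it is randomly generated, not real measurements, and the mean of 170 and spread of 10 are arbitrary assumptions.

```python
import random
import statistics

# Synthetic, made-up data: 100,000 "measurements" drawn from a normal
# distribution with an assumed mean of 170 and standard deviation of 10.
rng = random.Random(42)  # fixed seed so the result is repeatable
observations = [rng.gauss(170, 10) for _ in range(100_000)]

# Compress all of those observations into a handful of digits.
summary = {
    "count": len(observations),
    "mean": round(statistics.fmean(observations), 1),
    "median": round(statistics.median(observations), 1),
    "stdev": round(statistics.stdev(observations), 1),
}
print(summary)
```

Nobody could read a hundred thousand raw values, but anyone can read those four numbers. The price is that every individual observation disappears into the summary.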

For example, a website called HyperLeda compiles data from astronomical observations all around the world:

HyperLeda data catalogue

A random Reddit user like u/zonination can take specific observational records from HyperLeda and tabulate them in a spreadsheet like this:

And plot that data on a chart like this:

And then, using some basic maths and physics, they can calculate the approximate age of the universe and post to Reddit, where someone immediately points out they've made a typo:

Now, what does OP mean by "my value came close at 13.77 billion* years"? Close to what?

They mean their estimate is close to the estimate that is currently accepted by most astronomers. 
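The post doesn't say exactly how OP did the calculation. One common back-of-envelope method, which we are only assuming here, is the "Hubble time": if the universe had always expanded at today's rate (the Hubble constant, H0), its age would be roughly 1/H0. A minimal Python sketch, where the three H0 values are illustrative choices:

```python
# Rough "Hubble time": if the universe had always expanded at today's rate,
# its age would be about 1 / H0. This ignores how the expansion rate has
# changed over time, so it is only a back-of-envelope estimate.

KM_PER_MEGAPARSEC = 3.0857e19   # kilometres in one megaparsec
SECONDS_PER_YEAR = 3.15576e7    # seconds in a Julian year


def hubble_time_years(h0_km_s_per_mpc: float) -> float:
    """Return 1/H0 in years, for H0 given in km/s per megaparsec."""
    h0_per_second = h0_km_s_per_mpc / KM_PER_MEGAPARSEC  # units: 1/s
    return 1 / h0_per_second / SECONDS_PER_YEAR


for h0 in (67.0, 70.0, 73.0):
    print(f"H0 = {h0}: about {hubble_time_years(h0) / 1e9:.1f} billion years")
```

A simple estimate like this lands in the right neighbourhood, but the accepted figure comes from a far more detailed model of how the expansion has changed over time.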

We can check that estimate for ourselves on Wikipedia:

But there's more to that accepted estimate than meets the eye! 

There are all sorts of parameters that could be taken into consideration (and not everyone involved agrees on which ones to include):

And that's just scratching the surface of the complexity!

This is why, when writing for a general audience, scientists will often translate their numerical evidence into stories and ideas expressed through language.

For example, here's astrophysicist Katie Mack explaining some of the complexities in determining the age of the universe:

Everything in the snippet below is evidence, but only a small proportion of it uses numbers, and even those numbers are simple and high-level.

The rest of the evidence is commentary and narrative based on accepted facts. (We'll talk about accepted facts later.)

All the careful calculations of the age of the universe, based on extrapolating the current expansion back to the Big Bang, suggested the universe was somewhere in the vicinity of 10 or 12 billion years old, whereas measurements of the oldest stars in nearby ancient clusters gave a number closer to 15.

Of course, estimating the ages of stars is not always an exact science, so there was a chance that better data might show that the stars were a bit younger than they looked, shaving maybe a billion or two years off the discrepancy.

But extending the age of the universe to finish solving that problem would create an even bigger one. Making the universe older would have required scrapping the theory of cosmic inflation — one of the most important breakthroughs in the study of the early universe since the discovery of the Big Bang itself.

It would take another three years of combing through data, revising theories, and creating entirely new ways of measuring the cosmos before astronomers would find a solution that didn’t break the early universe. It just broke everything else.

In the end, the answer came down to a new kind of physics woven into the very fabric of the cosmos — one that would fundamentally change our view of the universe and completely rewrite its future.

Counting & measuring isn't limited to science. The precision of numbers is useful in many subjects and situations.

For example, there's a concept called the Bechdel Test, after cartoonist Alison Bechdel, which is one way to assess gender representation in movies.

This article describes the test:

To pass the test, the answer to all three of these questions has to be yes:

1. Are there at least two named female characters?

2. Do two of those female characters have a conversation at any point?

3. Is that conversation about anything other than a male character?

That's it.

While this seems like it should be a shockingly easy test to pass, a surprising number of films fail. In a study of 1,794 movies released from 1970 to 2013, Walter Hickey of FiveThirtyEight found that almost half failed the test.
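Part of what makes the test easy to apply to thousands of films is that it reduces to three yes/no checks. Here is a minimal Python sketch of those checks, assuming the usual wording of the test (at least two named women). The `Conversation` type and the idea of tagging each conversation as "only about a man" are illustrative assumptions; in practice, deciding those labels is the hard, human part of the measurement.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    speakers: set[str]      # names of the characters taking part
    only_about_a_man: bool  # True if the talk is solely about a male character


def passes_bechdel(named_women: set[str], conversations: list[Conversation]) -> bool:
    # Question 1: at least two named female characters.
    if len(named_women) < 2:
        return False
    # Questions 2 and 3: two of them talk to each other,
    # about something other than a male character.
    for convo in conversations:
        women_in_convo = convo.speakers & named_women
        if len(women_in_convo) >= 2 and not convo.only_about_a_man:
            return True
    return False


women = {"Ann", "Beth"}
talks = [Conversation({"Ann", "Beth"}, only_about_a_man=False)]
print(passes_bechdel(women, talks))  # True
```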

The Bechdel test started as a joke, but people have applied it to thousands of films and counted how many pass or fail the test:

Bechdel Test Over Time

Then people have taken that data and linked it to other data.

For example, this graphic combines Bechdel test scores with box office data to argue that movies that pass the Bechdel test make more money (in aggregate) than movies that fail the test:

Bechdel box office graphic

Now, one of the problems with numbers is that their apparent precision can be misleading.

We can read a number and think it's more certain than it really is.

For example, in the graph above about Bechdel and box office revenue, there is a whole chunk of money for films that are only a "dubious" pass:

What does "dubious" mean? How did the researchers define it? Did the movies really pass or not?

If those films could just as easily be said not to pass, then all that money would move to the "FAIL" category, which would then have more revenue than the "PASS" category. That would reverse the conclusion of the whole argument, so the definition is important!
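To see why the definition matters, here is a minimal Python sketch. The revenue totals are invented placeholders chosen only to show the mechanism, not numbers from the graphic: the same data supports opposite conclusions depending on one labelling decision.

```python
def higher_earner(pass_total, dubious_total, fail_total, dubious_counts_as_pass):
    """Assign the 'dubious' films to one side, then compare the two totals."""
    if dubious_counts_as_pass:
        pass_total += dubious_total
    else:
        fail_total += dubious_total
    if pass_total == fail_total:
        return "TIE"
    return "PASS" if pass_total > fail_total else "FAIL"


# Placeholder totals (in billions of dollars), NOT the figures from the graphic.
print(higher_earner(50, 15, 45, dubious_counts_as_pass=True))   # PASS
print(higher_earner(50, 15, 45, dubious_counts_as_pass=False))  # FAIL
```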

Because numbers can be misleading, anyone who is serious about counting & measuring will think very carefully about how they do it.

This video from Vox gives a great insight into how a small team of researchers approach gathering data to answer the question, "What happens to an unknown musical artist after they go viral on TikTok?"

So, we got to work and pulled all of the songs from as many playlists as we could find that were added between January and December 2020. Then we ranked the songs by their popularity on TikTok, filtering out any that got fewer than 100,000 posts. This brought us down to about 1500 songs that went viral in 2020. 

"The biggest challenge, I would say, is once we had our arms wrapped around these 1500 songs, making the decision of, is this an established artist that had a TikTok hit or is this the artist's big break? And a big break is a very subjective decision."

A lot of the artists in this list were obviously very established. Cardi B going viral on TikTok is not particularly impressive versus somebody who has never released a song before.

So we went back to Chartmetric to dig into more data points behind these songs and the artists that made them, including their Spotify monthly listeners, the number of times they've been playlisted, and the number of tracks they've released. This made it a lot easier to decide: did this artist have a career beforehand?

Eventually, after filtering out all of the established artists, we narrowed our list to a sample of 125 artists we felt hit all the marks. They all went viral on TikTok in 2020, and as far as we can determine, it was their big break.

"It doesn't actually matter how many artists we examined. There's probably thousands of new artists that went viral on TikTok. What we wanted to do was just wrap our arms around a cohort of artists that are experiencing this phenomenon and then say, what happened to them afterwards?"

What happened to these artists after they went viral was eye-opening.

You can see from that snippet how the researchers have to define exactly what they are measuring and how they intend to measure it, including:

  • What are the criteria for a viral hit?
  • What are the criteria for a new artist?
  • What data can we use to match songs and artists to these criteria?
  • How do we find that data?
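The first two questions in that list can be written down as explicit rules. Here is a minimal Python sketch: the 100,000-post threshold comes from the video, but the `Song` fields, the "established artist" cut-offs, and the sample songs are all hypothetical, since the researchers describe that call as subjective.

```python
from dataclasses import dataclass


@dataclass
class Song:
    title: str
    artist: str
    tiktok_posts: int
    # Hypothetical data points about the artist before the song went viral,
    # of the kind the researchers looked up on Chartmetric.
    monthly_listeners: int
    tracks_released: int


VIRAL_MIN_POSTS = 100_000  # the threshold described in the video

# Hypothetical cut-offs: the researchers called this a subjective decision.
ESTABLISHED_MIN_LISTENERS = 500_000
ESTABLISHED_MIN_TRACKS = 20


def is_viral(song: Song) -> bool:
    return song.tiktok_posts >= VIRAL_MIN_POSTS


def is_established(song: Song) -> bool:
    return (song.monthly_listeners >= ESTABLISHED_MIN_LISTENERS
            or song.tracks_released >= ESTABLISHED_MIN_TRACKS)


def big_break_artists(songs: list[Song]) -> list[str]:
    """Artists with a viral song who did not already have a career."""
    return sorted({s.artist for s in songs if is_viral(s) and not is_established(s)})


# Made-up example songs.
songs = [
    Song("Song A", "New Artist", 250_000, 3_000, 2),
    Song("Song B", "Famous Artist", 900_000, 20_000_000, 150),
    Song("Song C", "Unknown Artist", 40_000, 1_000, 1),
]
print(big_break_artists(songs))  # ['New Artist']
```

Writing the rules out like this doesn't make them objective, but it does make them visible, so anyone can see (and argue with) where the lines were drawn.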

Note how, at the end, they talk about their sample size: they don't need all the artists who went viral; they just need a representative sample, a group big enough to give them reliable evidence of what happened after going viral.

(And if you're interested in seeing what they found, Vox's partner in the research, The Pudding, has a great data visualisation for you to explore.)

Because numbers can seem precise and authoritative, and because audiences will often accept numbers at face value, they can be used to present misleading conclusions.

For example, based on a glance at this chart, you'd probably think the KFC Crispy Chicken Twister had half the calories of a taco or burger:

But take a look at the x-axis and you'll notice it starts at 590 calories.

That means the Twister has only 70 fewer calories than the taco and the burgers, a difference of about 10%, not 50%.
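The chart's exact values aren't given here, so this Python sketch assumes values consistent with the description: if the Twister's bar looks half as long and the gap is 70 calories, the bars above the 590 mark must be 70 and 140 calories long, which puts the items at 660 and 730 calories.

```python
def apparent_ratio(small, large, axis_start):
    """Ratio of bar lengths when the value axis starts at axis_start instead of 0."""
    return (small - axis_start) / (large - axis_start)


def true_difference(small, large):
    """Relative difference between the actual values."""
    return (large - small) / large


# Assumed values, inferred from the description rather than read off the chart.
twister, burger = 660, 730
print(apparent_ratio(twister, burger, 590))        # 0.5: the bar looks half as long
print(round(true_difference(twister, burger), 3))  # 0.096: about 10% fewer calories
```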

Or check out this famous ad from Frosted Mini-Wheats, where Kellogg claimed kids who eat Frosted Mini-Wheats have nearly 20% better attentiveness at school:

The ad's famous because the claim was so untrue that Kellogg was fined $4 million!

Here's a snippet from the court documents:

In truth and in fact, eating a bowl of Kellogg’s Frosted Mini-Wheats cereal for breakfast is not clinically shown to improve kids’ attentiveness by nearly 20%. In the clinical study referred to in respondent’s advertisements, for example, only about half the kids who ate Frosted Mini-Wheats cereal showed any improvement after three hours as compared to their pre-breakfast baseline. In addition, overall, only one in seven kids who ate the cereal improved their attentiveness by 18% or more, and only about one in nine improved by 20% or more. Therefore, the representation set forth in Paragraph 6 was, and is, false or misleading.
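Turning the court's fractions into percentages makes the gap with the ad easier to see. A short Python sketch; the fractions are as quoted, and only the conversion is ours:

```python
# Shares of kids quoted in the court documents above.
findings = {
    "showed any improvement": 1 / 2,   # "about half"
    "improved by 18% or more": 1 / 7,  # "one in seven"
    "improved by 20% or more": 1 / 9,  # "about one in nine"
}

for outcome, share in findings.items():
    print(f"{outcome}: about {share:.0%} of kids")
```

So only about 11% of the kids in the study improved by the "nearly 20%" the ad implied.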

Having said all that, numbers don't have to be real to be useful.

We use made-up numbers in the form of estimates and approximations all the time when making legitimate arguments, especially when dealing with hypothetical situations.

The question is whether or not the numbers are realistic.

For example, here is Randall Munroe from XKCD trying to figure out if you could blot out the sun with a volley of arrows:

Q. In the movie 300 they shoot arrows up into the sky and they seemingly blot out the sun. Is this possible, and how many arrows would it take? — Anna Newell

A. It’s pretty hard to make this work.
Longbow archers can fire eight to ten arrows per minute. Each arrow spends only a few seconds in the air. If an arrow’s average time over the battlefield is three seconds, then about 50 percent of all archers have arrows in the air at any given time.

Each arrow intercepts about 40 cm² of sunlight. Since archers have arrows in the air only half the time, each blocks an average of 20 cm² of sunlight.

If the archers are packed in rows, with two archers per meter and a row every meter and a half, and the archer battery is 20 rows (30 meters) deep, then for every meter of width... there will be 18 arrows in the air.

18 arrows will block only about 0.1 percent of the Sun from the firing range. We need to improve on this.

What If? (2014)

Munroe needs numbers so he can operate on them and be precise.

To get the numbers, he makes estimates, any of which could be disputed.

Then he uses cause & effect reasoning to link those numbers into a model of the archers firing arrows at a certain rate, covering a certain area of the sun, until he concludes it won't work.
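Munroe's first steps can be written as a tiny model. A minimal Python sketch, using only numbers from the excerpt (3 seconds in the air, 40 cm² per arrow, 2 archers per metre across 20 rows). The step from arrows in the air to "0.1 percent of the Sun" is elided in the excerpt, so the sketch stops before it.

```python
def fraction_in_air(arrows_per_minute, seconds_per_arrow):
    """Share of the time an archer has an arrow in the air."""
    return arrows_per_minute * seconds_per_arrow / 60


ARROW_AREA_CM2 = 40                   # sunlight blocked by one arrow
ARCHERS_PER_METRE_OF_WIDTH = 2 * 20   # 2 per metre in each row, 20 rows deep

for rate in (8, 9, 10):               # Munroe's range of arrows per minute
    share = fraction_in_air(rate, 3)  # 3 seconds over the battlefield
    print(f"{rate}/min: {share:.0%} in the air, "
          f"{ARROW_AREA_CM2 * share:.0f} cm² blocked per archer on average, "
          f"{ARCHERS_PER_METRE_OF_WIDTH * share:.0f} arrows in the air per metre of width")
```

At nine arrows a minute, the middle of Munroe's range, the model gives 18 arrows per metre of width, which happens to match his figure; at ten a minute it gives 20. Changing any one estimate changes the result, which is exactly why each estimate could be disputed.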

Numbers are interesting in that some people find them clear and compelling in a way that stories and experiences are not—while other people find them abstract and intimidating.

To play with numbers & measurements as evidence, let's try making up some dubious data.

Here's an ad for Timberland boots. The main claim is that they will protect you from frostbite.

Supplement the main claim with made-up evidence based on systematically counting, measuring, and quantifying observations.

Here's an example of using made-up stats as evidence:

In a groundbreaking study conducted by the prestigious Institute of Winter Apparel Research, it was found that 98.7% of people wearing genuine Timberland boots retained full toe wiggling capabilities in temperatures as low as -20°C, while those in imitation brands experienced a 37% decrease in toe mobility. Furthermore, in a sample size of 1,000 winter adventurers, Timberland boots scored an impressive 9.5/10 on the Thermal Comfort Scale, and maintained a 76% higher traction coefficient on icy surfaces compared to counterfeit counterparts.

In a nutshell

  • Counting & measuring can make observational evidence more precise.
  • If observations have been recorded as numbers, we can use maths to compare & transform the data in useful ways.
  • Numbers can be visualised in useful ways.
  • Counting & measuring is often much harder than it seems. Researchers need to:
    • Work out what they really want to measure.
    • Figure out how to get the measurements.
    • Decide what those measurements actually mean.
  • Counting & measuring lets us collect large amounts of data, but important information can be lost in the process.
  • Data can be incorrect, insufficient, or irrelevant.
  • Numbers can sound precise, but they can also be completely meaningless.
  • Some people are comfortable arguing over numbers, while others are intimidated by them.
  • Numbers can hide misinformation, especially if people accept them at face value and don't think about them.