Finding Data & Statistics

Sources and support for finding and using numerical, spatial, and statistical data.

Am I looking for data or statistics?

The two terms are often used interchangeably. Although there are some commonly understood distinctions, there are also grey areas: statistics are certainly a kind of data, and data are used to generate statistics.

Statistics often are:

  • facts or figures
  • time series
  • tables, charts, or graphs
  • used to support an argument
  • 'ready to use'

Data can generally be used to:

  • test hypotheses
  • generate custom tables
  • look at responses of individuals
  • run analyses in SPSS, SAS, or Stata
  • run regressions, t-tests, ANOVA, etc.

Another distinction to consider is whether you need microdata or aggregate data. Microdata is the original information, unprocessed except to protect the privacy of participants: for example, the income reported by each household, or the height and species of each tree in a park. Aggregate data is summarized and combined in some way: for example, the average income in a census block, or the number of oak trees in a city park.
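
To make the microdata/aggregate distinction concrete, here is a minimal Python sketch. The household records, the block labels, and the function name average_income_by_block are all invented for illustration; they are not drawn from any real survey.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical microdata: one record per household (values invented for illustration).
microdata = [
    {"block": "A", "income": 40000},
    {"block": "A", "income": 60000},
    {"block": "B", "income": 55000},
    {"block": "B", "income": 65000},
    {"block": "B", "income": 90000},
]

def average_income_by_block(records):
    """Collapse household-level records into one average income per census block."""
    incomes = defaultdict(list)
    for record in records:
        incomes[record["block"]].append(record["income"])
    return {block: mean(values) for block, values in incomes.items()}

# Aggregate data: one summary value per block, with individual households no longer visible.
aggregate = average_income_by_block(microdata)
print(aggregate)
```

Note that the step only works in one direction: you can always build the aggregate table from the microdata, but you cannot recover individual households from the averages.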

Get back to basics

First things first: slow down. Don't focus on the numbers in the table right away. Instead, carefully review the details around the edges: what information is given by the title or header? What are the row and column labels? Are there any footnotes or references underneath the table? All of this information can help you understand the context of the numbers that are inside the table.

Questions to ask (and answer!) when looking at numerical data or statistics:

  • What's being counted or summarized?
  • What units are being used: thousands of dollars (CAD? USD?), individual spectators, percentage change (from what?), percentage of total, etc.?
  • Who collected and/or summarized the data?
  • What questions were asked or what sources used to find, solicit, compile, collect, create the data?
  • What was the purpose of collecting the data in the first place?
  • How does all of that fit with what YOU want to do with these numbers?
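
The question about units matters because the same number can mean very different things. As a rough illustration, the Python sketch below computes a percentage change and a percentage of total; the function names and the attendance figures are made up for this example.

```python
def percentage_change(old, new):
    """Change from old to new, expressed as a percentage of the old value."""
    return (new - old) / old * 100

def percentage_of_total(part, total):
    """Share that part represents of total, as a percentage."""
    return part / total * 100

# Hypothetical attendance figures (invented for illustration).
last_year = 2000
this_year = 2500
league_total = 10000

growth = percentage_change(last_year, this_year)      # change relative to last year
share = percentage_of_total(this_year, league_total)  # share of the league-wide total
print(growth, share)
```

With these invented figures both calculations yield 25, even though one describes growth over time and the other describes a share of a whole. A table that reports "25%" without saying which is meant cannot be interpreted safely.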

Limitations to keep in mind

Many factors can affect what data is collected and why, and other factors affect what can be shared with others. A few common issues that arise with published data and statistics include 1. the need to protect privacy, 2. the effort to control for accuracy and precision, 3. mandated measurements (such as the census), and 4. pre-existing categories with which to organize the data.

1. Because of privacy concerns, some data may be restricted when the population being counted is so small that it would be possible to identify an individual person or business.

2. If there are concerns regarding the methods used to collect the data, or if it wasn't possible to confirm the accuracy of the data, it might not be made available. Some statistical calculations must meet specific criteria to be considered valid: for example, if the number of data points is too small, or if the method of obtaining the data was inconsistent, the resulting statistic isn't considered accurate and may not be published.

3. Many surveys, including the national census, are required by law or regulation; in some cases, the specific questions and responses collected are explicitly outlined by a government agency or ruling. These regulations can change over time, so the questions asked 10 or 20 years ago might not be the same as those asked today. Consequently, comparing data over time may be complicated or impossible.

4. Standardized methods and categories are often used by many groups to more easily share and compare data sets and statistics. It is convenient to use these standards, but they might not perfectly match your specific question. For example, NAICS codes are commonly used in Canada and the US to collect economic and labour statistics based on industry. Each specialized industry will have a single NAICS code that is a subset of a larger category, which is in turn part of an even larger category, etc. The hierarchical arrangement defined by NAICS might not always fit the way you would like to categorize the industry. These codes can also change over time, something to keep in mind if you are looking at statistics from different decades: NAICS 1997 had 3 codes for internet-related industries. NAICS 2012 has 57.
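
The nested structure of NAICS codes can be sketched in a few lines of Python. This assumes the usual convention that each level of the hierarchy adds one digit, from a 2-digit sector down to a 6-digit national industry. The function name naics_ancestry is hypothetical, and 511210 is used only as a sample six-digit code. One caveat: a few sectors are published as ranges of 2-digit codes (for example 31-33), which this simple prefix approach does not group together.

```python
# Level names for each prefix length, following the NAICS digit structure.
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}

def naics_ancestry(code):
    """Return (level name, code prefix) pairs from broadest to most specific."""
    if not (code.isdigit() and len(code) == 6):
        raise ValueError("expected a six-digit NAICS code as a string")
    return [(NAICS_LEVELS[n], code[:n]) for n in sorted(NAICS_LEVELS)]

# Sample code only; look up the official title in the NAICS edition you are using.
for level, prefix in naics_ancestry("511210"):
    print(f"{level}: {prefix}")
```

Because every code sits inside exactly one parent, statistics published at the 3-digit level cannot be regrouped into your own categories unless the finer 6-digit figures are also available.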

The dataset that you may want to find may not exist...yet!

Not every dataset that you might want has been collected (e.g. a list of all people named Sam who have ever owned a dog, are left-handed, don't like peanut butter, and voted in the last Canadian election). This might motivate you to collect the data yourself, or it may mean that you need to find a proxy variable or dataset. Either way, seeking data takes patience and persistence and, at times, the flexibility to alter your research question or project focus to fit the data that is available.