The 2011 Census was a large and complex undertaking and, while considerable effort was taken to ensure high standards throughout all collection and processing operations, the resulting estimates are inevitably subject to a certain degree of error. Users of census data should be aware that such error exists, and should have some appreciation of its main components, so that they can assess the usefulness of census data for their purposes and the risks involved in basing conclusions or decisions on these data.
Errors can arise at virtually every stage of the census process, from the preparation of collection materials through data processing, including the listing of dwellings and the collection of data. Some errors occur at random, and when the individual responses are aggregated for a sufficiently large group, such errors tend to cancel out. For errors of this nature, the larger the group, the more accurate the corresponding estimate. It is for this reason that users are advised to be cautious when using small area estimates. There are some errors, however, which might occur more systematically, and which result in 'biased' estimates. Because the bias from such errors is persistent no matter how large the group for which responses are aggregated, and because bias is particularly difficult to measure, systematic errors are a more serious problem for most data users than the random errors referred to previously.
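The difference between random and systematic error can be illustrated with a small simulation. The sketch below is not part of the census methodology: the error sizes, the fixed seed and the function name `mean_error` are assumptions chosen only for illustration. Each simulated response carries a random error of +1 or -1 and, optionally, a constant bias.

```python
import random

def mean_error(group_size, bias=0.0, seed=2011):
    """Average error per response in a group of `group_size` responses.

    Each response carries a random error of +1 or -1 (equally likely)
    plus a constant systematic error `bias`. All values are illustrative.
    """
    rng = random.Random(seed)
    total = sum(rng.choice((-1, 1)) + bias for _ in range(group_size))
    return total / group_size

# Compare the average error with and without a bias for growing group sizes.
for n in (10, 1_000, 100_000):
    print(n, round(mean_error(n), 4), round(mean_error(n, bias=0.1), 4))
```

As the group grows, the purely random component of the average error shrinks towards zero, while the biased version settles near the assumed bias of 0.1 rather than near zero.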
For census data in general, the principal types of error are as follows:
(a) coverage errors, which occur when dwellings or individuals are missed, incorrectly enumerated or counted more than once
(b) non-response errors, which result when responses cannot be obtained from a certain number of households and/or individuals, because of extended absence or some other reason, or when responses cannot be obtained for a certain number of questions in a complete questionnaire
(c) response errors, which occur when the respondent, or sometimes the census representative, misunderstands a census question, and records an incorrect response or simply uses the wrong response box
(d) processing errors, which can occur at various steps including data capture, when responses are transferred from the census questionnaire to an electronic format, by optical character recognition methods or key-entry operators; coding, when 'write-in' responses are transformed into numerical codes; and imputation, when a 'valid,' but not necessarily correct, response is inserted into a record by the computer to replace missing or 'invalid' data ('valid' and 'invalid' referring to whether or not the response is consistent with other information on the record).
The above types of error each have both random and systematic components. These components may be significant.
Coverage errors affect the accuracy of the census counts, that is, the sizes of the various census universes: population, families, households and dwellings. While steps have been taken to correct certain identifiable errors, the final counts are still subject to some degree of error because persons or dwellings have been missed, incorrectly enumerated in the census or counted more than once.
Missed dwellings or persons result in undercoverage. Dwellings can be missed because collection unit boundaries are misunderstood, or because they do not look like dwellings, appear uninhabitable, have recently been built or are difficult to detect. Persons can be missed when their dwelling is missed or is classified as unoccupied, because the respondent misinterprets the instructions on whom to include on the questionnaire or because the respondent was away during the census period. Some individuals may be missed because they have no usual residence and did not spend census night in a dwelling.
Dwellings or persons incorrectly enumerated or double-counted result in overcoverage. Overcoverage of dwellings can occur when structures unfit for habitation are listed as dwellings (incorrectly enumerated), when there is a certain ambiguity regarding the collection unit boundaries or when units (for example, rooms) are listed separately instead of being treated as part of one dwelling (double-counted). Persons can be counted more than once because their dwelling is double counted or because the guidelines on whom to include on the questionnaire have been misunderstood. Occasionally, someone who is not in the census population universe, such as a foreign resident or a fictitious person, may, incorrectly, be enumerated in the census. On average, overcoverage is less likely to occur than undercoverage and, as a result, counts of dwellings and persons are likely to be slightly underestimated.
For the 2011 Census, three studies are used to measure coverage error: the Dwelling Classification Survey, the Reverse Record Check Study and the Census Overcoverage Study. Only the Dwelling Classification Survey is used to adjust the census counts.
In the Dwelling Classification Survey, a sample of dwellings listed as unoccupied was revisited to verify that they were correctly classified on Census Day. In addition, dwellings whose households were classified by census collection as not having responded and where classification had not been established were revisited to confirm whether or not they were occupied on Census Day. If a dwelling of either type was occupied, the number of usual residents living there on Census Day was obtained. Subsequently, the misclassification of occupancy status of dwellings in the census counts was estimated.
Based on the results of the Dwelling Classification Survey, adjustments have been made to the final census counts to account for households and persons missed because their dwelling was incorrectly classified as unoccupied. The census counts are also adjusted for dwellings whose households were classified as non-respondent or unclassifiable. Despite these adjustments, the final counts are still subject to some undercoverage. The undercoverage tends to be higher for certain segments of the population, such as young adults (especially young adult males) and recent immigrants.
The Reverse Record Check Study is used to measure the residual undercoverage for Canada, and for each province and territory. The Census Overcoverage Study is designed to investigate overcoverage errors from persons enumerated more than once. The results of the Reverse Record Check Study and the Census Overcoverage Study, when taken together, furnish an estimate of net undercoverage.
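The way the two studies combine can be sketched as simple arithmetic. The example below is an illustration only: the function names and all figures are hypothetical, and expressing the rate relative to the census count plus net undercoverage is an assumption about how a rate might be formed, not a statement of the published method.

```python
def net_undercoverage(undercoverage, overcoverage):
    """Estimated persons missed minus estimated persons counted more than once."""
    return undercoverage - overcoverage

def net_undercoverage_rate(census_count, undercoverage, overcoverage):
    """Net undercoverage as a share of the estimated true population.

    Assumption: true population = census count + net undercoverage.
    """
    net = net_undercoverage(undercoverage, overcoverage)
    return net / (census_count + net)

# Hypothetical figures, not census results.
rate = net_undercoverage_rate(
    census_count=1_000_000, undercoverage=40_000, overcoverage=15_000
)
print(f"{rate:.2%}")
```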
While coverage errors affect the number of units in the different census universes, other errors affect the characteristics of those units.
Sometimes it is not possible to obtain a complete response from a household, even though the dwelling was identified as occupied. The household members may have been away throughout the census collection period or, in rare instances, the householder may have refused to complete the questionnaire. More frequently, the questionnaire is returned by mail or submitted through the Internet but no response is provided to certain questions. Effort is devoted to obtaining as complete a questionnaire as possible. An analysis is performed to detect significant cases of partial non-response, and follow-up interviews are attempted to get the missing information. Despite this, at the end of the collection stage, a small number of responses are still missing. Although missing responses are eliminated during processing by replacing each of them with the corresponding response from a 'similar' record, some potential for imputation error remains. This is particularly serious if the non-respondents differ in some respects from the respondents, because the procedure will then introduce a non-response bias.
Even when a response is obtained, it may not be entirely accurate. The respondent may have misinterpreted the question or may have guessed the answer, especially when answering on behalf of another, possibly absent, household member.
The respondent may also have entered the answer in the wrong place on the questionnaire. Such errors are referred to as response errors. While response errors usually arise from inaccurate information provided by respondents, they can also result from mistakes by the census representative who completed certain parts of the questionnaire, or who followed up to obtain a missing response.
The questionnaire pages are scanned and the information on the images is captured into a computer file. To monitor data capture errors and ensure that their number stays within tolerable limits, a sample of fields is selected and reprocessed, and the two captures are compared. Unsatisfactory work is identified and corrected, and appropriate feedback is provided to the system in order to minimize the recurrence of such errors.
Some of the census questions require a written response. During processing, these 'write-in' entries are given a numeric code, either through an automated system that matches them to a coded set of write-ins from previous censuses, or manually by coders. Coding errors can occur when the written response is ambiguous, incomplete, or difficult to read. A quality assurance process is used to detect coding errors and measure quality. This involves selecting and re-coding an ongoing sample of coded responses. Discrepancies between the first and second code are sent to a third coder for arbitration. Feedback on errors is provided to help reduce further occurrences.
The data are then edited: they undergo a series of computer checks to identify missing or inconsistent responses. These are replaced during the imputation stage of processing where either a response consistent with the other respondents' data is inferred or a response from a similar donor is substituted. Imputation ensures a complete database where the data correspond to the census counts and facilitate multivariate analyses. Although errors may have been introduced during imputation, the methods used have been rigorously tested to minimize systematic errors.
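Substituting a response from a similar donor can be sketched as follows. This is a deliberately simplified illustration and not the census imputation system: the function `impute_from_donor`, the field names and the similarity measure (the number of matching fields) are all assumptions made for the example.

```python
def impute_from_donor(records, target_field, match_fields):
    """Fill missing `target_field` values from the most similar complete record.

    Similarity is simply the number of `match_fields` on which two records
    agree; ties go to the earliest donor. The input records are not modified.
    """
    donors = [r for r in records if r.get(target_field) is not None]
    result = []
    for rec in records:
        if rec.get(target_field) is None and donors:
            best = max(
                donors,
                key=lambda d: sum(d.get(f) == rec.get(f) for f in match_fields),
            )
            # Copy the donor's value and flag the record as imputed.
            rec = {**rec, target_field: best[target_field], "imputed": True}
        result.append(rec)
    return result

# Hypothetical records; None marks a missing response.
records = [
    {"age_group": "25-34", "sex": "F", "tenure": "renter", "mode": "transit"},
    {"age_group": "65+", "sex": "M", "tenure": "owner", "mode": "car"},
    {"age_group": "25-34", "sex": "F", "tenure": "renter", "mode": None},
]
completed = impute_from_donor(records, "mode", ["age_group", "sex", "tenure"])
print(completed[2])
```

The imputed value is 'valid' in the sense that it is consistent with the rest of the record, but it is not necessarily the response the person would have given, which is why imputation remains a potential source of error.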
Various studies are being carried out to evaluate the quality of the responses obtained in the 2011 Census. For each question, non-response rates and edit failure rates have been calculated. These can be useful in identifying the potential for non-response errors and other types of errors. Also, tabulations from the 2011 Census have been or will be compared with corresponding estimates from previous censuses, from sample surveys (such as the Labour Force Survey) and from various administrative records (such as birth registrations and municipal assessment records). Such comparisons can indicate potential quality problems or at least discrepancies between the sources.
In addition to these aggregate-level comparisons, some micro-match studies are carried out, in which census responses are compared with another source of information at the individual record level. For certain 'stable' characteristics (such as age, sex, mother tongue), the responses obtained in the 2011 Census, for a sample of individuals, are being compared with those for the same individuals in the 2006 Census.
Confidentiality (non-disclosure) rules
The following describes the various rules used to ensure confidentiality (or non-disclosure) of individual respondent identity and characteristics. All census data are subject to confidentiality (non-disclosure) rules.
Area suppression for standard and non-standard geographic areas
Area suppression is used to remove all characteristic data for geographic areas below a specified population size.
The specified population size for all standard areas (see footnote 1) or aggregations of standard areas is 40, except for blocks, block-faces or postal codes. Consequently, no characteristics or tabulated data are to be released if the total population of the area is less than 40.
The specified population size for six-character postal codes (forward sortation area - local distribution unit [FSA-LDU]), geocoded areas and custom areas built from the block, block-face or LDU levels is 100. Consequently, no characteristics or tabulated data are to be released if the total population of the area is less than 100. Generally, blocks and individual urban block-faces (one side of the street between two intersections) will be too small to meet the above-specified population size thresholds. Where an aggregation of blocks or block-faces meets the specified population size threshold, data can be retrieved through a custom tabulation.
These specified population size thresholds are applied to 2011 Census data as well as all previous census data.
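The two population thresholds can be expressed as a short check. This is a minimal sketch of the rule as described above, not production code: the function name `is_suppressed` and the flag `built_from_small_units` are hypothetical names introduced for the example.

```python
# Thresholds taken from the text above.
STANDARD_AREA_THRESHOLD = 40       # standard areas and their aggregations
SMALL_UNIT_AREA_THRESHOLD = 100    # six-character postal codes, geocoded areas,
                                   # custom areas built from blocks, block-faces or LDUs

def is_suppressed(total_population, built_from_small_units=False):
    """Return True when all characteristic data for the area must be withheld."""
    threshold = (
        SMALL_UNIT_AREA_THRESHOLD
        if built_from_small_units
        else STANDARD_AREA_THRESHOLD
    )
    return total_population < threshold

print(is_suppressed(39))                                # standard area, below 40
print(is_suppressed(75, built_from_small_units=True))   # custom area, below 100
```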
Please refer to the section 'Postal code minimum aggregation rules' for additional rules applicable to postal code data.
Footnote 1: Refer to the Census Dictionary for more information on standard areas.
Postal code minimum aggregation rules
In addition to the confidentiality rules on disseminating census data by postal code, the following rules are applied to postal codes. These rules fall under clause 03.01 (n) of the Commercial Non-Mailing licence between Statistics Canada and Canada Post Corporation.
All requests must include batches of two or more postal codes; the only exception is for postal codes which have a zero as the second character (rural postal codes).
Groups of postal codes are to be assigned a unique classification/number (e.g. K1A 0T6, 0T7, 0T8 = Custom Area 1); under the terms of the contract listed above, clients cannot be provided with lists of postal codes, only the name specified in the client's request can be used.
All other confidentiality rules for custom extractions still apply as per Area suppression for standard and non-standard geographic areas.
Also, the following disclaimer is applicable to all postal code custom requests:
Postal code validation disclaimer: Statistics Canada makes no representation or warranty as to, or validation of, the accuracy of any postal code data submitted to Statistics Canada.
Please note these rules are applicable to historical postal code requests as well.
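The batching rule for postal code requests can be sketched as a simple validation step. The helper names below are hypothetical, and treating a request as acceptable when it contains at least two distinct postal codes, or a single rural postal code, is one reading of the rule above rather than an official implementation. The population thresholds described earlier would still have to be applied separately.

```python
def normalize(postal_code):
    """Strip spaces and upper-case a postal code such as 'k1a 0t6'."""
    return postal_code.replace(" ", "").upper()

def is_rural(postal_code):
    """Rural postal codes have a zero as their second character."""
    code = normalize(postal_code)
    return len(code) == 6 and code[1] == "0"

def batch_is_allowed(postal_codes):
    """Accept batches of two or more distinct codes, or one rural code."""
    unique = {normalize(pc) for pc in postal_codes}
    if len(unique) >= 2:
        return True
    return len(unique) == 1 and is_rural(next(iter(unique)))

print(batch_is_allowed(["K1A 0T6", "K1A 0T7", "K1A 0T8"]))
print(batch_is_allowed(["K1A 0T6"]))
```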
All counts in census tabulations are subjected to random rounding. Random rounding transforms all raw counts to random rounded counts. This reduces the possibility of identifying individuals within the tabulations.
All counts are rounded to a base of 5, meaning they will end in either 0 or 5. The random rounding algorithm employed controls the results and rounds the unit value of the count according to a predetermined frequency. The table below shows those frequencies. Note that counts ending in 0 or 5 are not changed and remain as 0 or 5.
Random rounding frequency
Unit value of count | Will round to count ending in 0 | Will round to count ending in 5
1 | 4 times out of 5 | 1 time out of 5
2 | 3 times out of 5 | 2 times out of 5
3 | 2 times out of 5 | 3 times out of 5
4 | 1 time out of 5 | 4 times out of 5
5 | Never | Always
6 | 1 time out of 5 | 4 times out of 5
7 | 2 times out of 5 | 3 times out of 5
8 | 3 times out of 5 | 2 times out of 5
9 | 4 times out of 5 | 1 time out of 5
0 | Always | Never
The random rounding algorithm uses a random seed value to initiate the rounding pattern for tables. In these routines, the method used to seed the pattern can result in the same count in the same table being rounded up in one execution and rounded down in the next.
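The frequencies in the random rounding table can be reproduced with a short routine. The sketch below is an approximation under stated assumptions: it treats each count independently and rounds up with probability equal to the remainder divided by 5, whereas the actual algorithm is described as controlling the results according to a predetermined frequency. The function name `random_round` and the seed value are hypothetical.

```python
import random

def random_round(count, rng, base=5):
    """Randomly round `count` up or down to a multiple of `base`.

    Counts that are already multiples of `base` are returned unchanged.
    """
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    # Round up with probability remainder/base, otherwise round down.
    if rng.random() < remainder / base:
        return lower + base
    return lower

rng = random.Random(12345)  # hypothetical seed; a different seed may round differently
counts = [0, 3, 5, 12, 27, 41]
print([random_round(c, rng) for c in counts])
```

With these probabilities the expected value of a rounded count equals the original count, so the rounding introduces no systematic bias. Re-running with a different seed can round the same count in the other direction, which mirrors the behaviour described above for repeated executions.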
Disclosure avoidance for statistics
Statistics (such as mean, standard error, sum, median, percentile, ratio or percentage) are not subject to random rounding. However, when shown in tabulations accompanying the counts used to calculate the statistic, their presence can result in disclosure of individuals. To prevent this, we use statistic suppression methods, or special statistic calculations.
For all quantitative variables, a statistic is suppressed if the number of actual records used in the calculation is less than 4.
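The suppression rule can be sketched for a single statistic as follows. The function name `suppressed_mean` and the use of `None` to represent a suppressed value are assumptions made for this illustration; the mean stands in for any of the statistics listed above.

```python
from statistics import mean

MIN_RECORDS = 4  # a statistic is suppressed below this number of actual records

def suppressed_mean(values):
    """Return the mean, or None when fewer than MIN_RECORDS records contribute."""
    if len(values) < MIN_RECORDS:
        return None
    return mean(values)

print(suppressed_mean([10, 20, 30]))      # three records: suppressed
print(suppressed_mean([10, 20, 30, 40]))  # four records: released
```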
Special statistic calculations
(1) The statistic value is never rounded, except for frequencies.
(2) All statistics based on ranks (medians, percentiles) are calculated the usual way and they are never rounded. We never release the minimum or the maximum of a statistic.
(3) When a sum is specified for age, then the program multiplies the unrounded average of the group in question by the rounded frequency. Otherwise, if a sum is specified for a variable other than age, the program rounds the actual sum.
(4) When a division is specified (averages, percentages, ratios, etc.), the program must apply point (3) to both numerator and denominator before it proceeds with the division.
Note: Statistics based on ranks like median and percentiles are always calculated via linear interpolations. That means that, for low count cells, these statistics are not reliable. That is the reason no additional confidentiality measures are applied to them.
Note: The average age is not altered by the rounding, because the numerator is the product of the true average and the rounded frequency, and the denominator is the same rounded frequency. The two frequencies cancel each other out, leaving the true average untouched.
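The cancellation described in this note can be checked with a small worked example. The ages, the rounded frequency and the function names below are hypothetical; only the age case is shown, since the text does not specify how sums of other variables are rounded.

```python
def released_age_sum(ages, rounded_frequency):
    """Sum of ages as released: unrounded average times rounded frequency."""
    true_average = sum(ages) / len(ages)
    return true_average * rounded_frequency

def released_age_average(ages, rounded_frequency):
    """Average age as released: released sum divided by rounded frequency."""
    if rounded_frequency == 0:
        return None  # no frequency to divide by
    return released_age_sum(ages, rounded_frequency) / rounded_frequency

# Hypothetical group of 7 records; its count could be rounded to 5 or 10.
ages = [23, 31, 35, 44, 52, 58, 61]
for rounded_frequency in (5, 10):
    print(
        rounded_frequency,
        released_age_sum(ages, rounded_frequency),
        released_age_average(ages, rounded_frequency),
    )
```

Whichever way the frequency is rounded, the released sum changes but the released average stays equal to the true average, up to floating-point precision.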