Appendices

Appendix A: Overview of Case Study Variables

No. Variable Description Level (HH = household, IND = individual)
1 REGION Region HH
2 DIST District HH
3 URBRUR Area of residence HH
4 WGTHH Individual weighting coefficient (country-specific weighting coefficient to derive individual-level indicators) HH
5 WGTPOP Population weighting coefficient (weighting coefficient to derive population-level indicators) HH
6 IDH Household unique identification HH
7 IDP Individual identification HH
8 HHSIZE Household members HH
9 GENDER Sex IND
10 REL Relationship to household head IND
11 MARITAL Marital status IND
12 AGEYRS Age in completed years IND
13 AGEMTH Age of child in completed months IND
14 RELIG Religion of household head HH
15 ETHNICITY Ethnicity IND
16 LANGUAGE Language IND
17 MORBID Morbidity last x weeks IND
18 MEASLES Child immunized against measles IND
19 MEDATT Sought medical attention IND
20 CHWEIGHTKG Weight of the child (kg) IND
21 CHHEIGHTCM Height of the child (cm) IND
22 ATSCHOOL Current school enrolment IND
23 EDUCY Highest level of education completed IND
24 EDYRS Years of education IND
25 EDYRSCURRAT Years of education for currently enrolled IND
26 SCHTYP Type of school attending IND
27 LITERACY Literacy status IND
28 EMPTYP1 Type of employment, Primary job IND
29 UNEMP1 Unemployed IND
30 INDUSTRY1 1 digit industry classification, Primary job IND
31 EMPCAT1 Employment categories, Primary job IND
32 WHOURSWEEK1 Hours worked last week, Primary job IND
33 OWNHOUSE Ownership of dwelling unit HH
34 ROOF Main material used for roof HH
35 TOILET Main toilet facility HH
36 ELECTCON Connection of electricity in dwelling HH
37 FUELCOOK Main cooking fuel HH
38 WATER Main source of water HH
39 OWNAGLAND Ownership of agricultural land HH
40 LANDSIZEHA Land size owned by household (ha) HH
41 OWNMOTORCYCLE Ownership of motorcycle HH
42 CAR Ownership of car HH
43 TV Ownership of television HH
44 LIVESTOCK Number of large-sized livestock owned HH
45 INCRMT Total amount of remittances received from remittance sending members HH
46 INCWAGE Wage and salaries (annual) HH
47 INCBONSOCALL Bonus and social allowance from wage job (annual) HH
48 INCFARMBSN Gross income from household farm businesses (annual) HH
49 INCNFARMBSN Gross income from household non-farm businesses (annual) HH
50 INCRENT Rental income (annual) HH
51 INCFIN Financial income from savings, loans, tax refunds, maturity payments on insurance HH
52 INCPENSN Pension and other social assistance (annual) HH
53 INCOTHER Other income (annual) HH
54 INCTOTGROSSHH Total gross household income (annual) HH
55 FARMEMP Farm employment HH
56 THOUSEXP Total expenditure on housing HH
57 TFOODEXP Total food expenditure HH
58 TALCHEXP Total alcohol expenditure HH
59 TCLTHEXP Total expenditure on clothing and footwear HH
60 TFURNEXP Total expenditure on furnishing HH
61 THLTHEXP Total expenditure on health HH
62 TTRANSEXP Total expenditure on transport HH
63 TCOMMEXP Total expenditure on communications HH
64 TRECEXP Total expenditure on recreation HH
65 TEDUEXP Total expenditure on education HH
66 TRESTHOTEXP Total expenditure on restaurants and hotel HH
67 TMISCEXP Total miscellaneous expenditure HH
68 TANHHEXP Total annual nominal household expenditures HH

Appendix B: Example of Blanket Agreement for SUF

Agreement between [providing agency] and [receiving agency] regarding the deposit and use of microdata

A. This agreement relates to the following microdatasets:

  1. _______________________________________________________
  2. _______________________________________________________
  3. _______________________________________________________
  4. _______________________________________________________
  5. _______________________________________________________

B. Terms of the agreement:

As the owner of the copyright in the materials listed in section A, or as duly authorized by the owner of the copyright in the materials, the representative of [providing agency] grants the [receiving agency] permission for the datasets listed in section A to be used by [receiving agency] employees, subject to the following conditions:

  1. Microdata (including subsets of the datasets) and copyrighted materials provided by the [providing agency] will not be redistributed or sold to other individuals, institutions or organisations without the [providing agency]’s written agreement. Non-copyrighted materials which do not contain microdata (such as survey questionnaires, manuals, codebooks, or data dictionaries) may be distributed without further authorization. The ownership of all materials provided by the [providing agency] remains with the [providing agency].
  2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modeling, and not for investigating specific individuals or organisations.
  3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the [providing agency].
  4. No attempt will be made to produce links between datasets provided by the [providing agency] or between [providing agency] data and other datasets that could identify individuals or organisations.
  5. Any books, articles, conference papers, theses, dissertations, reports or other publications employing data obtained from the [providing agency] will cite the source, in line with the citation requirement provided with the dataset.
  6. An electronic copy of all publications based on the requested data will be sent to the [providing agency].
  7. The [providing agency] and the relevant funding agencies bear no responsibility for the data’s use or for interpretations or inferences based upon it.
  8. Data will be stored in a secure environment, with adequate access restrictions. The [providing agency] may at any time request information on the storage and dissemination facilities in place.
  9. The [receiving agency] will provide an annual report on uses and users of the listed microdatasets to the [providing agency], with information on the number of researchers having accessed each dataset, and on the output of this research.
  10. This access is granted for a period of [provide information on this period, or state that the agreement is open ended].

C. Communications:

The [receiving agency] will appoint a contact person who will act as the sole focal point for this agreement. Should the focal person be replaced, the [receiving agency] will immediately communicate the name and contact details of the new contact person to the [providing agency]. Communications for administrative and procedural purposes may be made by email, fax or letter as follows:

Communications made by [providing agency] to [receiving agency] will be directed to:

Name of contact person:

Title of contact person:

Address of the receiving agency:

Email:

Tel:

Fax:

Communications made by [receiving agency] to [providing agency] will be directed to:

Name of contact person:

Title of contact person:

Address of the providing agency:

Email:

Tel:

Fax:

D. Signatories

The following signatories have read and agree with the Agreement as presented above:

Representative of the [providing agency]

Name ____________________________________________________

Signature _______________________________ Date ______________

Representative of the [receiving agency]

Name ____________________________________________________

Signature _______________________________ Date ______________

Source: [DuBo10]

Appendix C: Internal and External Reports for Case Studies

This appendix provides examples of internal and external reports on the anonymization process for the case studies in Section 9.1. Each internal report consists of two parts: the first covers the anonymization of the household-level variables and the second the anonymization of the individual-level variables.

Case study 1 - Internal report

SDC report (adapted from the report function in sdcMicro)

The dataset consists of 10,574 observations (i.e., 10,574 individuals in 2,000 households).

Household-level variables

Anonymization methods applied to household-level variables:

  • Removing households of size larger than 13 (29 households)
  • Local suppression to achieve 2-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE, REGION and URBRUR
  • Recoding the variable LANDSIZEHA: rounding to one digit for values smaller than 1, rounding to zero digits for other values, grouping values 5-19 and 20-40, topcoding at 40
  • PRAMming the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK
  • Noise addition (level 0.01 and 0.05 for outliers) to the income and expenditure components, replacing aggregates by sum of perturbed components

Selected (key) variables:

  • Modifications on categorical key variables: TRUE
  • Modifications on continuous key variables: TRUE
  • Modifications using PRAM: TRUE
  • Local suppressions: TRUE

Disclosure risk (household-level variables):

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (unmodified data: 103)

3-Anonymity: 104 (unmodified data: 229)

5-Anonymity: 374 (unmodified data: 489)

Percentage of observations violating

2-Anonymity: 0% (unmodified data: 5.15%)

3-Anonymity: 5.28% (unmodified data: 11.45%)

5-Anonymity: 18.7% (unmodified data: 24.45%)
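
The counts above can be read as the number of records whose combination of key-variable values occurs fewer than k times in the sample. The following Python sketch illustrates that count on made-up records, not on the case study data. The function name `count_k_violations` is illustrative and not part of sdcMicro, and the sketch ignores the special treatment of missing values.

```python
from collections import Counter

def count_k_violations(records, key_vars, k):
    """Count records whose key-variable combination occurs fewer than k times."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    freq = Counter(keys)  # sample frequency fk of each combination
    return sum(1 for key in keys if freq[key] < k)

# Toy data (hypothetical values, not taken from the case study)
toy = [
    {"URBRUR": 1, "REGION": 1, "HHSIZE": 4},
    {"URBRUR": 1, "REGION": 1, "HHSIZE": 4},
    {"URBRUR": 2, "REGION": 3, "HHSIZE": 9},
    {"URBRUR": 1, "REGION": 2, "HHSIZE": 2},
    {"URBRUR": 1, "REGION": 2, "HHSIZE": 2},
    {"URBRUR": 1, "REGION": 2, "HHSIZE": 2},
]
key_vars = ["URBRUR", "REGION", "HHSIZE"]
violations_2 = count_k_violations(toy, key_vars, k=2)
violations_3 = count_k_violations(toy, key_vars, k=3)

# Share reported for the unmodified data: 103 of 2,000 households
share_unmodified = 100 * 103 / 2000
```

In the toy data only the third record is unique, so it alone violates 2-anonymity, while the pair sharing the first combination also falls short of 3-anonymity.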

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.05161614% (~ 1.0 observations)

(unmodified data: 0.001820465% (~ 0.36 observations))

10 combinations of categories with highest risk:

  URBRUR REGION HHSIZE OWNAGLAND RELIG fk Fk
1 2 6 2 3 7 1 372.37
2 1 5 1 1 6 1 226.35
3 2 5 2 3 6 1 430.21
4 2 2 1 1 NA 1 173.05
5 2 6 1 1 5 1 80.05
6 1 6 1 3 5 1 343.27
7 2 5 1 2 NA 1 140.60
8 2 6 1 3 7 1 230.29
9 2 5 12 1 9 1 475.01
10 2 6 3 1 1 1 338.57

Disclosure risk continuous scaled variables:

Distance-based Disclosure Risk for Continuous Key Variables:

Disclosure Risk is between 0% and 100% in the modified data. In the original data, the risk is approximately 100%.

Data Utility (household-level variables):

29 households have been removed due to their household sizes

Frequencies categorical key variables

URBRUR

categories1 1 2 NA
orig 1316 684 0
categories2 1 2 NA
recoded 1299 666 6

REGION

categories1 1 2 3 4 5 6 NA
orig 324 334 371 375 260 336 0
categories2 1 2 3 4 5 6 NA
recoded 315 328 370 370 257 330 1

HHSIZE

categories1 1 2 3 4 5 6 7 8 9 10 11 12
orig 152 194 238 295 276 252 214 134 84 66 34 21
categories1 13 14 15 16 17 18 19 20 21 22 33  
orig 11 6 6 5 4 2 1 2 1 1 1  
categories2 1 2 3 4 5 6 7 8 9 10 11 12
recoded 152 194 238 295 276 252 214 134 84 66 34 21
categories2 13                      
recoded 10                      

OWNAGLAND

categories1 1 2 3 NA
orig 763 500 332 405
categories2 1 2 3 NA
recoded 735 482 310 444

RELIG

categories1 1 5 6 7 9 NA
orig 179 383 267 7 154 1010
categories2 1 5 6 7 9 NA
recoded 175 380 260 5 148 1003

Local suppressions

Number of local suppressions:

  URBRUR REGION HHSIZE OWNAGLAND RELIG
absolute 6 1 1 48 16
relative (in percent) 0.304% 0.051% 0.051% 2.435% 0.812%
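
The relative figures appear to be the absolute counts divided by the 1,971 households that remain after the 29 large households are removed. A short Python check of that arithmetic, assuming this denominator:

```python
# Households remaining after removing the 29 large households
n_households = 2000 - 29

# Absolute numbers of local suppressions as reported in the table
suppressions = {"URBRUR": 6, "REGION": 1, "HHSIZE": 1, "OWNAGLAND": 48, "RELIG": 16}

# Relative share in percent, rounded to three decimals as in the report
relative = {var: round(100 * n / n_households, 3) for var, n in suppressions.items()}
```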

Data utility of continuous scaled key variables:

Univariate summary:

Variable Min. 1st Qu. Median Mean 3rd Qu. Max.
TANHHEXP 0 0.2 1 6.689 2.421 1214
TANHHEXP.m 0 0.2 1 3.427 2 40
TFOODEXP 498 15170 17090 24340 23260 353200
TFOODEXP.m 127.1 15100 17060 23410 22110 275300
TALCHEXP 0 8438 11890 12920 13070 127900
TALCHEXP.m -209.7 8377 11880 12570 13030 124800
TCLTHEXP 0 0 0 401.7 0 85280
TCLTHEXP.m -77.53 -13.59 6.42 404.7 30.69 85280
THOUSEXP 0 121 131 733.8 672.8 28400
THOUSEXP.m -54.65 111.4 138.8 706.1 618.9 28410
TFURNEXP 0 1211 1340 2233 1970 197500
TFURNEXP.m -39.54 1198 1340 2066 1933 49230
THLTHEXP 0 153.8 167 479.8 302 17780
THLTHEXP.m -18.79 146.8 168.6 453.1 295.2 15720
TTRANSEXP 0 1 634 961 687 49650
TTRANSEXP.m -80.58 26.66 627.1 917.2 692.4 49640
TCOMMEXP 0 146 241 1158 434 91920
TCOMMEXP.m -115.2 139.1 238.3 1104 403.2 91920
TRECEXP 0 3 95 577.2 107 34000
TRECEXP.m -61.27 21.35 92.28 555.4 128.8 33960
TEDUEXP 0 0 0 123.7 0 15880
TEDUEXP.m -29.23 -5.06 1.213 121.8 9.748 15860
TRESTHOTEXP 0 154 722 2730 784 240300
TRESTHOTEXP.m -396.1 190.5 671.6 2568 872 240400
TMISCEXP 0 0 467 875.1 528 63700
TMISCEXP.m -93.39 0.7588 442.7 860.7 531.9 63680
INCTOTGROSSHH 0 444 1041 1148 1126 67420
INCTOTGROSSHH.m -24.92 446 1041 1087 1124 14940
INCRMT 5000 12400 13390 30840 24200 683900
INCRMT.m 4069 9071 17000 33040 36680 570000
INCWAGE 0 0 0 1276 0 300000
INCWAGE.m -295.1 -46.95 20.93 1261 114.4 300100
INCFARMBSN 0 9262 12950 23460 14570 683900
INCFARMBSN.m -1466 9336 12980 23420 14750 684000
INCNFARMBSN 0 0 0 3809 3900 165400
INCNFARMBSN.m -232.4 -10.69 142.6 3415 3846 160100
INCRENT 0 0 827.5 9166 7307 400000
INCRENT.m -757.4 43.89 783.7 8637 7267 394800
INCFIN 0 0 0 1783 0 120000
INCFIN.m -248.5 -56.57 11.54 1608 90.27 120000
INCPENSN 0 0 0 74.58 0 14400
INCPENSN.m -20.2 -4.591 0.1964 76.62 5.796 14380
INCOTHER 0 0 0 331.3 0 60000
INCOTHER.m -123.3 -24.78 -0.02617 331.1 26.75 60050
LANDSIZEHA 0 0 0 549.1 0 82300
LANDSIZEHA.m -126.2 -21.91 3.4 486.7 30.88 79670

Information loss:

Criteria IL1: 0.01219892

Individual-level variables

  • Modifications on categorical key variables: TRUE
  • Modifications on continuous key variables: FALSE
  • Modifications using PRAM: FALSE
  • Local suppressions: TRUE

Disclosure risk (individual-level variables):

Anonymization methods applied to individual-level variables:

  • Recoding AGEYRS from months to years for age under 1, and to ten-year intervals for age values between 15 and 65, topcoding age at 65
  • Local suppression to achieve 2-anonymity

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (unmodified data: 998)

3-Anonymity: 0 (unmodified data: 1384)

5-Anonymity: 935 (unmodified data: 2194)

Percentage of observations violating

2-Anonymity: 0% (unmodified data: 9.91%)

3-Anonymity: 0% (unmodified data: 13.75%)

5-Anonymity: 6.23% (unmodified data: 21.79%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.02% (~ 2.66 observations)

(unmodified data: 0.24% (~ 23.98 observations))

Expected Percentage of Re-identifications (hierarchical risk): 0.1% (~ 15.34 observations)

(unmodified data: 1.26% (~ 127.12 observations))

10 combinations of categories with highest risk:

  GENDER REL MARITAL AGEYRS EDUCY EDYRSATCURRAT ATSCHOOL INDUSTRY1 fk Fk
1 1 1 3 38 6 NA 0 9 1 73.31
2 1 1 3 20 1 NA 0 6 1 69.53
3 1 1 2 39 2 NA 0 5 1 54.63
4 1 1 1 36 6 NA 0 9 1 73.31
5 1 1 3 42 2 NA 0 1 1 39.58
6 0 1 6 74 1 NA 0 1 1 58.12
7 0 1 6 34 2 NA 0 1 1 57.40
8 1 1 1 26 4 NA 0 5 1 66.21
9 1 1 4 35 1 NA 0 10 1 57.13
10 1 6 1 12 1 NA 0 5 1 57.13

Data utility (individual-level variables):

Frequencies categorical key variables

GENDER

categories1 0 1 NA
orig 5197 4871 0
categories2 0 1 NA
recoded 5197 4871 0

REL

categories1 1 2 3 4 5 6 7 8 9 NA
orig 1970 1319 4933 57 765 89 817 51 63 4
categories2 1 2 3 4 5 6 7 8 9 NA
recoded 1698 1319 4933 52 765 54 817 40 63 327

MARITAL

categories1 1 2 3 4 5 6 NA
orig 3542 2141 415 295 330 329 3016
categories2 1 2 3 4 5 6 NA
recoded 3542 2141 415 295 330 329 3016

AGEYRS

categories1 0 1/12 2/12 3/12 4/12 5/12 6/12 7/12 8/12 9/12
orig 178 8 1 14 15 19 17 21 18 7
categories1 10/12 11/12 1 2 3 4 5 6 7 8
orig 5 8 367 340 332 260 334 344 297 344
categories1 9 10 11 12 13 14 15 16 17 18
orig 281 336 297 326 299 263 243 231 196 224
categories1 19 20 21 22 23 24 25 26 27 28
orig 202 182 136 146 150 137 128 139 117 152
categories1 29 30 31 32 33 34 35 36 37 38
orig 111 143 96 123 104 107 148 91 109 87
categories1 39 40 41 42 43 44 45 46 47 48
orig 89 93 58 78 72 64 84 74 48 60
categories1 49 50 51 52 53 54 55 56 57 58
orig 58 66 50 55 29 30 34 38 33 44
categories1 59 60 61 62 63 64 65 66 67 68
orig 35 36 25 33 21 15 30 18 13 29
categories1 69 70 71 72 73 74 75 76 77 78
orig 26 36 17 16 12 3 16 10 8 18
categories1 79 80 81 82 83 84 85 86 87 88
orig 11 13 5 2 7 7 7 3 2 2
categories1 89 90 91 92 93 95 NA      
orig 4 4 3 1 1 1 188      
categories2 0 1 2 3 4 5 6 7 8 9
recoded 311 367 340 332 260 334 344 297 344 281
categories2 10 11 12 13 14 20 30 40 50 60
recoded 336 297 326 299 263 1847 1220 889 554 314
categories2 65 NA                
recoded 325 188                
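
The recoded AGEYRS frequencies are consistent with the mapping sketched below in Python: ages under 1 collapse to 0, single years are kept for ages 1 to 14, ten-year bands are labelled 20, 30, 40, 50 and 60, and ages of 65 and over are top-coded at 65. The band boundaries (15-24, 25-34 and so on) are inferred from the frequencies in the table rather than stated in the report, and `recode_age` is an illustrative name.

```python
def recode_age(age_years):
    """Map an age in years (fractional for infants) to its recoded category."""
    if age_years < 1:
        return 0                      # ages recorded in months collapse to 0 years
    if age_years < 15:
        return int(age_years)         # single years kept for ages 1-14
    if age_years < 65:
        # 15-24 -> 20, 25-34 -> 30, ..., 55-64 -> 60 (inferred band labels)
        return 20 + 10 * ((int(age_years) - 15) // 10)
    return 65                         # top code

# Check the band labelled 20 against the original frequencies for ages 15-24
orig_15_to_24 = [243, 231, 196, 224, 202, 182, 136, 146, 150, 137]
band_20_total = sum(orig_15_to_24)
```

The sum of the original frequencies for ages 15 to 24 equals the 1,847 records reported for category 20, which supports the inferred boundaries.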

EDUCY

categories1 0 1 2 3 4 5 6 NA
orig 1582 4755 1062 330 139 46 104 2050
categories2 0 1 2 3 4 5 6 NA
recoded 1582 4755 1062 330 139 46 104 2050

EDYRSATCURR

categories1 0 1 2 3 4 5 6 7 8 9
orig 177 482 445 446 354 352 289 266 132 127
categories1 10 11 12 13 15 16 18 NA    
orig 143 58 46 27 18 10 54 6642    
categories2 0 1 2 3 4 5 6 7 8 9
recoded 177 482 445 446 354 352 289 266 132 127
categories2 10 11 12 13 15 16 18 NA
recoded 143 58 46 27 18 10 54 6642

ATSCHOOL

categories1 0 1 NA
orig 4696 3427 1945
categories2 0 1 NA
recoded 4696 3427 1945

INDUSTRY1

categories1 1 2 3 4 5 6 7 8 9 10 NA
orig 5300 16 153 2 93 484 95 17 70 292 3546
categories2 1 2 3 4 5 6 7 8 9 10 NA
recoded 5300 16 153 2 93 484 95 17 70 292 3546

Local suppressions

Number of local suppressions:

  GENDER REL MARITAL AGEYRS EDUCY
absolute 0 323 0 0 0
relative (in percent) 0 3.21% 0 0 0
  EDYRSATCURR ATSCHOOL INDUSTRY1    
absolute 0 0 0    
relative (in percent) 0 0 0    

Case study 1 - External report

This case study microdata set has been treated to protect confidentiality. Several methods have been applied to protect the confidentiality: removing variables from the original dataset, removing records from the dataset, reducing detail in variables by recoding and top-coding, removing particular values of individuals at risk (local suppression) and perturbing values of certain variables.

Removing variables

The released microdata set contains only a selection of the variables in the initial survey. Not all variables could be released in this SUF without breaching confidentiality rules.

Removing records

To protect confidentiality, records of households with more than 13 members were removed. Twenty-nine households out of a total of 2,000 households in the dataset were removed.

Reducing detail in variables by recoding and top-coding

The variable LANDSIZEHA was rounded to one digit for values smaller than 1, rounded to zero digits for other values, grouped for values 5-19 and 20-40 and topcoded at 40. The variable AGEYRS was recoded to ten-year age intervals for values in the age range 15-65.

Local suppression

Values of certain variables for particular households and individuals were deleted. In total, six values of the variable URBRUR, one of the REGION variable, 48 for the OWNAGLAND variable, 16 for the RELIG variable and 323 values of the variable REL were deleted.

Perturbing values

Uncertainty was introduced in the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK by using the PRAM method. This method changes a certain percentage of the values within each variable. Here invariant PRAM was used, which guarantees that the univariate tabulations stay unchanged in expectation. Multivariate tabulations may change. Unfortunately, the transition matrix cannot be published.
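
To illustrate the idea behind PRAM, the Python sketch below replaces each value of a categorical variable with a category drawn from the corresponding row of a transition matrix. The matrix, the 0/1 coding and the function name `pram` are hypothetical. This is plain PRAM, not the invariant variant used for the case study, and it does not reproduce sdcMicro's implementation.

```python
import random

def pram(values, transition, rng):
    """Replace each value by a category drawn from its row of the transition matrix."""
    out = []
    for v in values:
        row = transition[v]                      # {new_category: probability}
        cats = list(row)
        out.append(rng.choices(cats, weights=[row[c] for c in cats], k=1)[0])
    return out

# Hypothetical transition matrix for a binary variable coded 0/1
transition = {
    1: {1: 0.9, 0: 0.1},
    0: {1: 0.1, 0: 0.9},
}
rng = random.Random(42)
original = [1, 0, 1, 1, 0, 0, 1, 0]
perturbed = pram(original, transition, rng)

# With an identity matrix every value keeps its category
identity = {1: {1: 1.0, 0: 0.0}, 0: {1: 0.0, 0: 1.0}}
unchanged = pram(original, identity, rng)
```

An invariant transition matrix would additionally be chosen so that the expected frequency of each category after perturbation equals its original frequency.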

The income and expenditure variables were perturbed by adding noise (adding small random values to the original values). The noise added was 0.01 times the standard deviation in the original data and 0.05 for outliers. Noise was added to the components and the aggregates were recomputed to guarantee that the proportions of the different components did not change.
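
A minimal Python sketch of this kind of perturbation follows, assuming zero-mean Gaussian noise with a standard deviation of 0.01 times the standard deviation of the original values. The household values and the function name `add_noise` are hypothetical, and the separate noise level for outliers is not reproduced.

```python
import random
import statistics

def add_noise(values, level, rng):
    """Add zero-mean Gaussian noise with sd = level * sd(values) to each value."""
    sd = statistics.stdev(values)
    return [v + rng.gauss(0.0, level * sd) for v in values]

rng = random.Random(1)
# Hypothetical expenditure components for five households
components = {
    "TFOODEXP": [1200.0, 900.0, 1500.0, 700.0, 1100.0],
    "THLTHEXP": [100.0, 250.0, 80.0, 40.0, 300.0],
}
noisy = {name: add_noise(vals, 0.01, rng) for name, vals in components.items()}

# Recompute the aggregate as the sum of the perturbed components
total_noisy = [sum(parts) for parts in zip(*noisy.values())]
```

Recomputing the aggregate from the perturbed components keeps each household's total equal to the sum of its released components.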

Case study 2 - Internal report

SDC report (adapted from the report function in sdcMicro)

This report describes the anonymization measures for the PUF release that were taken in addition to those already applied in the first case study. Therefore, this report should be read in conjunction with the internal report for case study 1. The original dataset consists of 10,574 observations (i.e., 10,574 individuals in 2,000 households). The dataset used for the anonymization of the PUF file is the anonymized SUF file from case study 1. This dataset consists of 10,068 observations in 1,970 households. The difference is due to the removal of large households and sensitive or identifying variables in the first case study.

Household-level variables

Anonymization methods applied to household-level variables:

  • For SUF release (see case study 1):
    • Removing households of size larger than 13 (29 households)
    • Local suppression to achieve 2-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE, REGION and URBRUR
  • For PUF release:
    • Remove variables OWNAGLAND, RELIG and LANDSIZEHA
    • Local suppression to achieve 5-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE and REGION
    • PRAMming the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK
    • Create deciles for aggregate income and expenditure (TANHHEXP and INCTOTGROSSHH) and replace the actual values with the mean of the corresponding decile. Replace income and expenditure components with the proportion of original totals.
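
The decile step in the last item can be sketched in Python as follows. The sketch ranks the values, splits the ranked records into ten groups of (nearly) equal size and replaces each value with its group mean. It ignores survey weights, uses hypothetical incomes, and `decile_means` is an illustrative name, so it should be read as an approximation of the procedure rather than the exact method used.

```python
def decile_means(values):
    """Replace each value with the mean of the decile group it falls in."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # indices by ascending value
    out = [0.0] * n
    for d in range(10):
        group = order[d * n // 10:(d + 1) * n // 10]    # d-th tenth of the ranked records
        if group:
            mean = sum(values[i] for i in group) / len(group)
            for i in group:
                out[i] = mean
    return out

# Hypothetical annual household incomes (20 values from 1,000 to 20,000)
incomes = [float(x) for x in range(1000, 21000, 1000)]
recoded = decile_means(incomes)
```

Because every group is replaced by its own mean, the overall total is preserved while at most ten distinct values remain.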

Selected (key) variables:

categorical URBRUR REGION HHSIZE
continuous TANHHEXP INCTOTGROSSHH  
weight WGTPOP    
hhID not defined    
strata not defined    
  • Modifications on categorical key variables: TRUE
  • Modifications on continuous key variables: TRUE
  • Modifications using PRAM: TRUE
  • Local suppressions: TRUE

Disclosure risk (household-level variables):

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (PUF file: 0, unmodified data: 103)

3-Anonymity: 0 (PUF file: 18, unmodified data: 229)

5-Anonymity: 0 (PUF file: 92, unmodified data: 489)

Percentage of observations violating

2-Anonymity: 0.00% (PUF file: 0.00%, unmodified data: 5.15%)

3-Anonymity: 0.00% (PUF file: 0.91%, unmodified data: 11.45%)

5-Anonymity: 0.00% (PUF file: 4.67%, unmodified data: 24.45%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.0000526% (~ 0.10 observations),

PUF file: 0.0000642% (~ 0.13 observations), unmodified data: 0.001820465% (~ 0.36 observations)

11 combinations of categories with highest risk in PUF file:

  URBRUR REGION HHSIZE fk Fk
1 2 4 1 7 1152.084
2 2 4 1 7 1152.084
3 2 2 9 2 2356.926
4 2 4 1 7 1152.084
5 2 4 1 7 1152.084
6 2 4 1 7 1152.084
7 2 5 12 2 2978.454
8 2 4 1 7 1152.084
9 2 4 1 7 1152.084
10 2 5 12 2 2978.454
11 2 2 9 2 2356.926

Disclosure risk continuous scaled variables:

Distance-based Disclosure Risk for Continuous Key Variables:

Disclosure Risk is between 0% and 100% in the modified data. In the original data, the risk is approximately 100%.

Data Utility (household-level variables):

Frequencies categorical key variables

URBRUR

categories1 1 2 NA
orig 1316 684 0
categories2 1 2 NA
recoded 1280 623 67

REGION

categories1 1 2 3 4 5 6 NA
orig 324 334 371 375 260 336 0
categories2 1 2 3 4 5 6 NA
recoded 311 325 369 370 253 329 13

HHSIZE

categories1 1 2 3 4 5 6 7 8 9 10 11 12
orig 152 194 238 295 276 252 214 134 84 66 34 21
categories1 13 14 15 16 17 18 19 20 21 22 33  
orig 11 6 6 5 4 2 1 2 1 1 1  
categories2 1 2 3 4 5 6 7 8 9 10 11 12
recoded 152 194 238 295 276 252 214 134 84 66 34 21
categories2 13                      
recoded 10                      

Local suppressions

Number of local suppressions:

  URBRUR REGION HHSIZE
absolute 61 12 0
relative (in percent) 3.096% 0.609% 0.000%

Data utility of continuous scaled key variables:

Univariate summary:

  Min. 1st Qu Median Mean 3rd Qu Max.
TANHHEXP 498 15,170 17,090 24,340 23,260 353,230
TANHHEXP.m 827 14,700 17,060 23,420 22,750 83,963
INCTOTGROSSHH 5,000 12,400 13,390 30,840 24,200 683,900
INCTOTGROSSHH.m 6,353 12,390 13,400 30,250 24,240 149,561

Information loss:

Criteria IL1: 0.2422625

Disclosure risk (individual-level variables):

Anonymization methods applied to individual-level variables:

  • For SUF release (see case study 1):
    • Recoding AGEYRS from months to years for age under 1, and to ten-year intervals for age values between 15 and 65, topcoding age at 65
    • Local suppression to achieve 2-anonymity
  • For PUF release:
    • Remove variable EDYRSCURRAT
    • Recode REL to ‘Head’, ‘Spouse’, ‘Child’, ‘Other relative’, ‘Other’
    • Recode MARITAL to ‘Never married’, ‘Married/Living together’, ‘Divorced/Separated/Widowed’
    • Recode AGEYRS for values under 15 to 7
    • Recode EDUCY to ‘No education’, ‘Pre-school/ Primary not completed’, ‘Completed lower secondary or higher’
    • Recode INDUSTRY1 to ‘Primary sector’, ‘Secondary sector’, ‘Tertiary sector’

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (PUF file: 0, unmodified data: 998)

3-Anonymity: 0 (PUF file: 167, unmodified data: 1384)

5-Anonymity: 0 (PUF file: 463, unmodified data: 2194)

Percentage of observations violating

2-Anonymity: 0.00% (PUF file: 0.00%, unmodified data: 9.91%)

3-Anonymity: 0.00% (PUF file: 1.66%, unmodified data: 13.75%)

5-Anonymity: 0.00% (PUF file: 4.60%, unmodified data: 21.79%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.00% (~0.41 observations)

(PUF file: 0.02 % (~ 1.69 observations), unmodified data: 0.24% (~23.98 observations))

Expected Percentage of Re-identifications (hierarchical risk): 0.02% (~2.29 observations)

(PUF file: 0.10 % (~ 9.57 observations), unmodified data: 1.26 % (~ 127.12 observations))

10 combinations of categories with highest risk:

  GENDER REL MARITAL AGEYRS EDUCY INDUSTRY1 fk Fk
1 1 1 2 50 1 7 2 324.9275
2 0 1 3 40 3 6 2 330.0521
3 0 1 6 60 0 3 2 350.5000
4 0 1 3 40 3 6 2 330.0521
5 1 1 2 30 4 5 2 253.7431
6 1 1 2 50 1 7 2 324.9275
7 0 1 6 50 1 6 2 255.6142
8 1 1 4 40 1 10 2 175.0797
9 1 1 4 40 1 10 2 175.0797
10 1 1 3 30 1 6 2 323.4879

Data utility (individual-level variables):

Frequencies categorical key variables

GENDER

categories1 0 1 NA
orig 5,197 4,871 0
categories2 0 1 NA
recoded 5,197 4,871 0

REL

categories1 1 2 3 4 5 6 7 8 9 NA
orig 1,970 1,319 4,933 57 765 89 817 51 63 4
categories2 1 2 3 7 9 NA        
recoded 1,698 1,319 4,933 1,688 103 327        

MARITAL

categories1 1 2 3 4 5 6 NA
orig 3,542 2,141 415 295 330 329 3,016
categories2 1 2 9 NA      
recoded 3,542 2,851 659 3,016      

AGEYRS

categories1 0 1/12 2/12 3/12 4/12 5/12 6/12 7/12 8/12 9/12
orig 178 8 1 14 15 19 17 21 18 7
categories1 10/12 11/12 1 2 3 4 5 6 7 8
orig 5 8 367 340 332 260 334 344 297 344
categories1 9 10 11 12 13 14 15 16 17 18
orig 281 336 297 326 299 263 243 231 196 224
categories1 19 20 21 22 23 24 25 26 27 28
orig 202 182 136 146 150 137 128 139 117 152
categories1 29 30 31 32 33 34 35 36 37 38
orig 111 143 96 123 104 107 148 91 109 87
categories1 39 40 41 42 43 44 45 46 47 48
orig 89 93 58 78 72 64 84 74 48 60
categories1 49 50 51 52 53 54 55 56 57 58
orig 58 66 50 55 29 30 34 38 33 44
categories1 59 60 61 62 63 64 65 66 67 68
orig 35 36 25 33 21 15 30 18 13 29
categories1 69 70 71 72 73 74 75 76 77 78
orig 26 36 17 16 12 3 16 10 8 18
categories1 79 80 81 82 83 84 85 86 87 88
orig 11 13 5 2 7 7 7 3 2 2
categories1 89 90 91 92 93 95 NA      
orig 4 4 3 1 1 1 188      
categories2 7 20 30 40 50 60 65 NA    
recoded 4,731 1,847 1,220 889 554 314 325 188    

EDUCY

categories1 0 1 2 3 4 5 6 NA
orig 1,582 4,755 1,062 330 139 46 104 2,050
categories2 0 1 2 3 4 5 6 NA
recoded 1,582 4,755 1,062 330 139 46 104 2,050

INDUSTRY1

categories1 1 2 3 4 5 6 7 8 9 10 NA
orig 5,300 16 153 2 93 484 95 17 70 292 3,546
categories2 1 2 3 NA              
recoded 5,316 248 958 3,546              
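
The recoded MARITAL and INDUSTRY1 frequencies above can be reproduced by summing the original category counts under the groupings sketched below in Python. The groupings are inferred from the frequencies and are an assumption; the report does not state which original codes map to which recoded category. The helper name `aggregate` is illustrative.

```python
from collections import Counter

def aggregate(counts, mapping):
    """Sum original category counts into their recoded categories."""
    out = Counter()
    for category, n in counts.items():
        out[mapping[category]] += n
    return dict(out)

# INDUSTRY1 frequencies before recoding (NA excluded), as listed in the table
industry_orig = {1: 5300, 2: 16, 3: 153, 4: 2, 5: 93, 6: 484, 7: 95, 8: 17, 9: 70, 10: 292}
# Grouping inferred from the frequencies (an assumption)
industry_map = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3}
industry_recoded = aggregate(industry_orig, industry_map)

# MARITAL frequencies before recoding (NA excluded)
marital_orig = {1: 3542, 2: 2141, 3: 415, 4: 295, 5: 330, 6: 329}
# Grouping inferred from the frequencies (an assumption)
marital_map = {1: 1, 2: 2, 3: 2, 4: 2, 5: 9, 6: 9}
marital_recoded = aggregate(marital_orig, marital_map)
```

Both results match the recoded rows of the tables, which suggests that no records were lost or suppressed in these two variables during recoding.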

Local suppressions

Number of local suppressions:

  GENDER REL MARITAL AGEYRS EDUCY INDUSTRY1
absolute 0 0 0 91 0 0
relative (in percent) 0.00% 0.00% 0.00% 0.90% 0.00% 0.00%

Case study 2 - External report

This case study microdata set has been treated to protect confidentiality. Several methods have been applied to protect the confidentiality: removing variables from the original dataset, removing records from the dataset, reducing detail in variables by recoding and top-coding, removing particular values of individuals at risk (local suppression) and perturbing values of certain variables.

Removing variables

The released microdata set contains only a selection of the variables in the initial survey. Not all variables could be released in this PUF without breaching confidentiality rules.

Removing records

To protect confidentiality, records of households with more than 13 members were removed. Twenty-nine households out of a total of 2,000 households in the dataset were removed.

Reducing detail in variables by recoding and top-coding

The variable AGEYRS was recoded to ten-year age intervals for values in the age range 15-65 and bottom- and top-coded at 15 and 65. The variables REL, MARITAL, EDUCY and INDUSTRY1 were recoded to less detailed categories. The total income and expenditure variables were recoded to the mean of the corresponding deciles and the income and expenditure components to the proportion of the totals.

Local suppression

Values of certain variables for particular households and individuals were deleted. In total, 67 values of the variable URBRUR, 13 of the REGION variable, 91 for the AGEYRS variable and 323 values of the variable REL were deleted.

Perturbing values

Uncertainty was introduced in the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK by using the PRAM method. This method changes a certain percentage of the values within each variable. Here invariant PRAM was used, which guarantees that the univariate tabulations stay unchanged in expectation. Multivariate tabulations may change. Unfortunately, the transition matrix cannot be published.

Appendix D: Execution Times for Multiple Scenarios Tested using Selected Sample Data

[Figures not reproduced: image22, image23]

Fig. 24 Description of anonymization scenarios

References

[DuBo10] Dupriez, O., & Boyko, E. (2010). Dissemination of Microdata Files: Principles, Procedures and Practices. International Household Survey Network (IHSN).