Rounding
methods are common techniques in many statistical offices to protect sensitive
information when publishing data in tabular form. Classical versions of these
methods do not consider protection levels while searching for patterns with
minimum information loss, and therefore the so-called auditing phase is typically
required to check the protection of the proposed patterns. This paper presents
a mathematical model for the whole problem (CRP) of finding a protected pattern
with minimum loss of information, and proposes a branch-and-cut algorithm to
solve it. It also describes a new methodology closely related to the classical
Controlled Rounding methods but with several advantages. The new methodology is
named Cell Perturbation and leads to a different optimization problem (CPP)
which is simpler to solve than the previous problem. This paper presents a
cutting-plane algorithm for finding an exact solution of the new problem, which
is a pattern guaranteeing the same protection level requirements but with
smaller loss of information when compared with the classical Controlled
Rounding optimal patterns. The auditing phase is unnecessary for the solutions
generated by the two algorithms. The paper concludes with computational results
on real-world instances and discusses a modification in the objective function
to guarantee statistical properties in the solutions.
A few definitions (read the technical paper for full
details):
Table: It is a pair of sets. One set
is named "cells"; for each element we have a cell value (a
number) plus other details (weight in the loss of information, sensitive or not,
protection requirements when sensitive, etc.). The other set is named
"equations"; each element represents a mathematical formula
relating some cells (typically, each one says that a cell value should be equal
to the sum of other
cell values, but many other more general linear links are allowed). The two
sets are written in the *.txt files following a simple ASCII format used by the
tau-ARGUS software package.
Base number: It is a pre-specified number
"b", given by the statistical agency, which will be used in the
two methodologies described in this technical paper (i.e., Controlled Rounding
and Cell Perturbation). The idea of this number is that, when observing a value
"y" in a cell of the output table, an attacker will have the
information that the true value "x" in that cell is inside the
interval "[y-b,y+b]".
Due to the other published cell values, the equations, the external
bounds known on some cells, etc., the attacker can reduce this interval of
potential values, but the mathematical model underlying our algorithm will ensure
that this new interval never violates the protection requirements on the
unsafe cells. See the technical paper for full details.
Unsafe cells: This is a subset of cells on which
there are special protection requirements concerning the tightest interval that an
attacker should be allowed to compute.
Non-zeros in M: Considering the cells as
columns and the equations as rows, it is possible to create a matrix M, thus
defining the linear system "M y = 0" that a set of cell values
"y" must satisfy in order to be an additive table.
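As an illustration (with made-up numbers, not one of the library instances), a 2x2 table with row totals, column totals and a grand total gives the following matrix M; a table is additive exactly when its vector of cell values y satisfies M y = 0, and the "non-zeros in M" column in the tables below counts the nonzero entries of such a matrix:

```python
import numpy as np

# Cells, ordered as the columns of M:
# y = [a11, a12, r1, a21, a22, r2, c1, c2, t]
# where r* are row totals, c* are column totals and t is the grand total.
M = np.array([
    [1, 1, -1, 0, 0,  0,  0,  0,  0],   # a11 + a12 = r1
    [0, 0,  0, 1, 1, -1,  0,  0,  0],   # a21 + a22 = r2
    [1, 0,  0, 1, 0,  0, -1,  0,  0],   # a11 + a21 = c1
    [0, 1,  0, 0, 1,  0,  0, -1,  0],   # a12 + a22 = c2
    [0, 0,  1, 0, 0,  1,  0,  0, -1],   # r1  + r2  = t
])

# An additive table: inner values with consistent totals.
y = np.array([20, 50, 70, 8, 19, 27, 28, 69, 97])
print((M @ y == 0).all())  # True: the table satisfies M y = 0
```

Here M has 15 non-zeros; real instances in the library have tens of thousands.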
CRP: linear-relaxation solution:
This is the solution of the continuous Linear Programming relaxation of the
model given in the paper for the Controlled Rounding methodology. In other
words, the solution of the program without imposing the 0-or-1 condition on
the decision variables. Therefore, it typically contains fractional values, depending
on the structure of the table (i.e., on the matrix "M"). It has the
advantage that it can be computed faster than the integer solution of the
model.
CRP: (integer) optimal solution:
This is the integer solution of the model given in the paper for the Controlled
Rounding Problem (CRP). To obtain it, we start by computing its
linear-relaxation solution and then continue with a branching scheme where
fractional variables are either rounded down or rounded up. Each value in the
output is a multiple of the base number "b".
CPP: solution: It is the optimal solution of the
model given in the paper for the Cell Perturbation Problem (CPP). Typically it
contains fractional values, with the condition that, for each cell, the published
value is the true value plus or minus "b" times this solution value.
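The two output forms described above can be checked cell by cell; the sketch below (hypothetical helper names, not code from the paper, and checking only the per-cell form, not additivity or protection) contrasts a CRP pattern with a CPP pattern:

```python
import math

def is_crp_pattern(original, output, b):
    """A CRP output value must be the original value rounded down
    or up to a multiple of the base number b."""
    for x, y in zip(original, output):
        lo = b * math.floor(x / b)   # round down to a multiple of b
        hi = b * math.ceil(x / b)    # round up to a multiple of b
        if y not in (lo, hi):
            return False
    return True

def is_cpp_pattern(original, output, b):
    """A CPP output value may be fractional, but must stay within
    b of the original value."""
    return all(abs(y - x) <= b for x, y in zip(original, output))

print(is_crp_pattern([12, 17], [10, 20], 5))   # True
print(is_cpp_pattern([12, 17], [13.4, 16.2], 5))  # True
```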
input file name | cells | equations | unsafe cells | non-zeros in M | CRP: linear-relaxation solution | CPP: solution
 | 36570 | 36310 | 2260 | 136912 | | 
 | 2020 | 3313 | 112 | 11929 | <infeasible> | 
 | 3564 | 5484 | 224 | 19996 | | 
 | 10399 | 11362 | 1178 | 52624 | | 
 | 10733 | 17295 | 1661 | 58135 | | 
 | 6546 | 7340 | 858 | 32920 | | 
 | 5681 | 9629 | 720 | 34310 | | 
 | 34552 | 52983 | | | | 
 | 34501 | 58825 | | | | 
 | 18969 | 47675 | | | | 
The first seven tables (in *.txt) in this collection were created by Ramesh
Dandekar, and they are described in
Dandekar, R.A. (2004) "Maximum Utility-Minimum Information Loss Table
Server Design for Statistical Disclosure Control of Tabular Data", in
"Privacy in Statistical Databases", Lecture Notes in Computer Science
3050, Springer, 121--.
The last three tables (in *.txt) in
this collection were also provided by Ramesh Dandekar, and they are described
in the files "TwentyD.pdf" and "hier13d4.readme" included
in the ZIP file (CRPlib.zip).
The column "CRP: linear-relaxation solution" contains the solutions of the
linear programming relaxation of the Controlled Rounding methodology with
base number 250.
The column "CPP: solution" contains the solutions of the
Cell Perturbation methodology with base number 250.
In addition, we include a frequency table
created by Anco Hundepool (CBS, The Netherlands).
input file name | cells | equations | unsafe cells | non-zeros in M | CRP: (integer) optimal solution | CPP: (fractional) optimal solution
 | 30886 | 39800 | 10680 | 120819 | | 
 | 30886 | 39800 | 0 | 120819 | | 
The base number used was 5. The optimal CRP
table (anco.crp) is
at a distance of 42600 units from the original table (anco.txt). The optimal CPP table (anco.cpp) is at a distance of
37306.166 from the original table (anco.txt), and it has 4121 fractional cell values. If no
protection levels were required on the unsafe cells, then an optimal CRP table
(anco0.crp) would be at a distance of 42300 units
from the original table (anco0.txt), while the optimal CPP table (anco0.cpp) would be at a distance of 38146.018
and would have 3194 fractional cell values. This highlights the fact that the objective
function used in the CPP problem is not equivalent to minimizing the distance
between input and output. This was done not only to have a linear function in
the CPP model, but also (and more importantly) to avoid returning the original input table
as the final output solution. The objective function minimizes the distance
between the output and a (non-additive) table where the cell values are rounded
to the nearest multiples of the base number.
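The (non-additive) reference table mentioned above, with each cell rounded to the nearest multiple of the base number, can be sketched as follows (an illustrative helper, not code from the paper):

```python
import math

def round_to_nearest_multiple(x, b):
    """Round x to the nearest multiple of the base number b
    (ties rounded up; floor(x/b + 0.5) avoids Python's banker's rounding)."""
    return b * math.floor(x / b + 0.5)

print(round_to_nearest_multiple(12, 5))  # 10
print(round_to_nearest_multiple(13, 5))  # 15
```

Note that the resulting table is generally not additive: the rounded marginals need not equal the sums of the rounded inner cells, which is exactly why CRP and CPP solve an optimization problem instead.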
New Benchmark instances of the Controlled Rounding Problem (base=5; not
zero-restricted)
input table | upper bound | lower bound | output table
 | 3501 | 3230.9 | 
 | 7004 | 5576.1 | 
 | | | 
 | 11536 | 9913.9 | 
 | 8972 | 8415.7 | 
 | 18388 | 15931.5 | 
 | | | 
A statistical agency collects data
to be processed and published. This data is often obtained from individual
respondents under a pledge of confidentiality: statistical agencies have the
responsibility of not releasing any data or data summaries from which
individual respondent information can be revealed (sensitive data). On
the other hand, statistical agencies aim at publishing as much information as
possible. This results in a trade-off between privacy rights and information
loss, an issue of primary importance in practice. The interested reader is
referred, e.g., to Willenborg and Waal (1996) for a discussion on disclosure
control methodologies.
Cell suppression is one of the most popular
techniques for disclosure avoidance, and is typically applied to 2- or
3-dimensional tables whose entries (cells) are subject to marginal
totals. We next outline the standard complete cell suppression
technique, and then introduce an alternative version that we call partial
cell suppression. Both methodologies are illustrated by means of the following
introductory example, taken from Willenborg and Waal (1996).
 | Region A | Region B | Region C | Total
Activity I | 20 | 50 | 10 | 80
Activity II | 8 | 19 | 22 | 49
Activity III | 17 | 32 | 12 | 61
Total | 45 | 101 | 44 | 190
Fig. 1: Investment of enterprises by activity and region
Figure 1 exhibits a
table giving the investment of enterprises (in millions of guilders),
classified by activity and region. Let us assume that the information in the
cell corresponding to Activity II and Region C is confidential, hence it is
viewed as a sensitive cell to be suppressed. By using the marginal totals an
attacker interested in the disclosure of the sensitive cell can however
recompute its missing value, hence other table entries must be suppressed as
well, e.g. those of Figure 2.
 | Region A | Region B | Region C | Total
Activity I | 20 | 50 | 10 | 80
Activity II | * | 19 | * | 49
Activity III | * | 32 | * | 61
Total | 45 | 101 | 44 | 190
Fig. 2: A possible complete CSP output.
With this choice, the
attacker cannot disclose the nominal value of the sensitive cell exactly,
although he/she can still compute a range for the values of this cell which are
consistent with the published entries (those of Figure 2). Indeed, the minimum
value y-23 for the sensitive cell (2,3) can be computed by solving a linear
program in which the values yij for the missing cells (i,j) are treated as
unknowns, namely

y-23 := minimize y23
subject to
y21 + y23 = 30
y31 + y33 = 29
y21 + y31 = 25
y23 + y33 = 34
y21 >= 0, y23 >= 0, y31 >= 0, y33 >= 0
Notice that the
right-hand side values are known to the attacker, as they can be obtained as
the difference between the marginal and the published values in a row/column.
The maximum value y+23 for the sensitive cell can be computed in a perfectly
analogous way, by solving the linear program of maximizing y23 subject to the
same constraints as before. In the example, y-23=5 and y+23=30, i.e., the
sensitive information is "protected" within the protection interval
[5...30]. If this interval is considered sufficiently wide by the statistical
office, then the sensitive cell is called protected; otherwise new
complementary suppressions are needed. Notice however that the extreme values
of interval [5...30] are only attained if the cell corresponding to Activity II
and Region A takes the quite unreasonable values of 0 and 25. Bounding the cell
variation to 50% of the nominal value (say) results in the more realistic
protection interval [18...26]. The statistical office then aims at finding a
set of complementary suppressions protecting all the sensitive cells against an
attacker, and such that the loss of information associated with the suppressed
entries is minimized.
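The attacker's problem described above can be solved with any LP solver; as an illustration (not part of the original paper), the sketch below uses SciPy's linprog to recover the protection interval for cell (2,3) of Figure 2:

```python
from scipy.optimize import linprog

# Unknowns: the suppressed cells y21, y23, y31, y33 (in that order).
# Equations come from the published marginals of Figure 2.
A_eq = [
    [1, 1, 0, 0],   # y21 + y23 = 49 - 19 = 30  (row of Activity II)
    [0, 0, 1, 1],   # y31 + y33 = 61 - 32 = 29  (row of Activity III)
    [1, 0, 1, 0],   # y21 + y31 = 45 - 20 = 25  (column of Region A)
    [0, 1, 0, 1],   # y23 + y33 = 44 - 10 = 34  (column of Region C)
]
b_eq = [30, 29, 25, 34]

c = [0, 1, 0, 0]  # objective: the sensitive cell y23

# Minimum and maximum consistent values for y23
# (variables default to the bounds (0, None), i.e. nonnegativity).
y_min = linprog(c, A_eq=A_eq, b_eq=b_eq).fun
y_max = -linprog([-ci for ci in c], A_eq=A_eq, b_eq=b_eq).fun
print(y_min, y_max)  # the protection interval [5.0, 30.0] from the text
```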
This combinatorial
optimization problem is known as the (Complete) Cell Suppression Problem, or
CSP for short. CSP belongs to the class of the strongly NP-hard problems (see,
e.g., Kelly et al. 1992, Geurts 1992, Kao 1996), meaning that it is very
unlikely that an algorithm for the exact solution of CSP exists, which
guarantees an efficient (i.e., polynomial-time) performance for all possible
input instances. Previous work on CSP from the literature mainly concentrates on
2-dimensional tables with marginals. Heuristic solution procedures have been
proposed by several authors, including Cox (1980, 1995), Sande (1984), Kelly et
al. (1992), and Carvalho et al. (1994). Kelly (1990) proposed a mixed-integer
linear programming formulation involving a huge number of variables and
constraints (for instance, the formulation involves more than 20,000,000
variables and 30,000,000 constraints for a two-dimensional table with 100 rows,
100 columns and 5% sensitive entries). Geurts (1992) refined this model, and
reported computational experiences on small-size instances, the largest
instance solved to optimality being a table with 20 rows, 6 columns and 17
sensitive cells. Gusfield (1988) gave a polynomial algorithm for a special case
of the problem. Heuristics for 3-dimensional tables have been proposed in Robertson
(1994), Sande (1984), and Dellaert and Luijten (1996).
We proposed in Fischetti
and Salazar (1997) a new method capable of solving to proven optimality, on a personal
computer, 2-dimensional tables with about 250,000 cells and 10,000 sensitive
entries. An extension of this methodology resulted in an algorithm presented
in Fischetti and Salazar (1998a), capable of solving to proven optimality
real-world 3- and 4-dimensional tables as well as linked tables of reasonable
size. We addressed cell suppression in the very general context of tables whose
entries are linked by a generic system of linear equations. In Fischetti
and Salazar (1998b) we introduce and analyze an
alternative protection methodology that we call partial cell suppression. The
methodology consists of publishing intervals for some table entries, as opposed
to replacing some table entries by "missing". The published intervals
must provide a convenient set of possible values for the corresponding entries,
containing the true original value. Figure 3 exhibits a possible partial cell
suppression output for the original table in Figure 1, to be compared with the
complete cell suppression output in Figure 2.
 | Region A | Region B | Region C | Total
Activity I | [18...24] | 50 | [6...12] | 80
Activity II | [4...10] | 19 | [20...26] | 49
Activity III | 17 | 32 | 12 | 61
Total | 45 | 101 | 44 | 190
Fig. 3: A possible partial CSP output
The new method has three
important advantages over complete (i.e., standard) cell suppression. First,
the loss of information needed to meet the required protection levels is
typically significantly smaller by using the partial suppression methodology. A
second advantage is related to computational complexity. Indeed, we provide in
this paper an efficient (i.e., polynomial time) method for finding optimal
partial suppressions. A third advantage is that the methodology itself
provides, with no extra computational effort, the auditing range for each
partially suppressed entry, defined by the minimum and maximum cell values that
an attacker can compute by using the published information.
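The auditing range mentioned above can be checked for the Figure 3 output: each published interval becomes a bound on the corresponding unknown, and an LP gives the tightest range an attacker can compute for the sensitive cell. A minimal sketch with SciPy (an illustration of the auditing computation, not the paper's polynomial-time algorithm):

```python
from scipy.optimize import linprog

# Unknowns: the partially suppressed cells y11, y13, y21, y23 of Figure 3.
# Their published intervals are the attacker's bounds.
bounds = [(18, 24), (6, 12), (4, 10), (20, 26)]

# Additivity with the exactly published cells of Figure 3:
A_eq = [
    [1, 1, 0, 0],   # y11 + y13 = 80 - 50 = 30  (row of Activity I)
    [0, 0, 1, 1],   # y21 + y23 = 49 - 19 = 30  (row of Activity II)
    [1, 0, 1, 0],   # y11 + y21 = 45 - 17 = 28  (column of Region A)
    [0, 1, 0, 1],   # y13 + y23 = 44 - 12 = 32  (column of Region C)
]
b_eq = [30, 30, 28, 32]

c = [0, 0, 0, 1]  # the sensitive cell y23
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
hi = -linprog([-x for x in c], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
print(lo, hi)  # auditing range for the sensitive cell
```

For this particular table the computed range coincides with the published interval [20...26], i.e. the published interval is already exactly what an attacker can audit.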
·
Models and Algorithms for the Cell Suppression Problem, 3rd
International Seminar on Statistical Confidentiality, paper, Bled (
·
Modeling and Solving the Cell Suppression Problem for
Linearly-Constrained Tabular Data, Statistical Disclosure Protection '98,
seminar paper and revised version; Lisboa (
·
Complementary Cell Suppression for Statistical Disclosure Control in
Tabular Data with Linear Constraints, International Symposium on Linked
Employer-Employee Data, paper, Washington (U.S.A.), May 1998.
·
Models and Algorithms for the 2-Dimensional Cell Suppression Problem in
Statistical Disclosure Control, seminar paper, May 1997; revised version,
December 1997. This last version appeared in Mathematical Programming,
84 (1999) 283-312.
·
Models and Algorithms for Optimizing Cell Suppression in Tabular Data
with Linear Constraints, seminar paper, October 1998; first revised version,
June 1999; second revised version, January 2000. To appear in Journal
of the American Statistical Association.
·
Partial Cell Suppression: a New Methodology for Statistical Disclosure
Control, seminar paper, October 1998; revised version, September 1999. To
appear in Statistics and Computing.
·
"A Unified Mathematical Framework for different Statistical
Disclosure Limitation Methods for Tabular Data", work presented at the CASC workshop
in
This is a test-bed
library containing benchmarking instances for testing and comparing new
approaches for solving the Cell Suppression Problem. All data is publicly
available. Thanks to the anonymous contributors. It is our intention not only
to distribute problem data but also solutions. See appendix in Fischetti and
Salazar (1998a) for details on the file format.
Benchmark test-bed instances:
Inputs and Outputs
Statistical results solving the
Complete Cell Suppression Problem
Statistical results solving the
Partial Cell Suppression Problem
40-dimensional test case (sixty 5-D sections in 40-D space)
Table with 4+ million
non-zero cells and 10+ million equations
Kelly-Golden-Assad instance generator
Carvalho-Dellaert-Osorio instance generator
2-dimensional instance generator
3-dimensional instance generator
4-dimensional instance generator
J.J. Salazar and Larry Cox in the UNECE/ECE
meeting (Skopje, Macedonia)