This webpage contains material related to “Statistical Disclosure Control”, mainly test-bed instances for testing algorithms solving the “Controlled Rounding Problem” and the “Cell Suppression Problem”.

 

Controlled Rounding Problem


Technical Report  (PDF file):

Rounding methods are common techniques in many statistical offices to protect sensitive information when publishing data in tabular form. Classical versions of these methods do not consider protection levels while searching for patterns with minimum information loss, and therefore the so-called auditing phase is typically required to check the protection of the proposed patterns. This paper presents a mathematical model for the whole problem (CRP) of finding a protected pattern with minimum loss of information, and proposes a branch-and-cut algorithm to solve it. It also describes a new methodology closely related to the classical Controlled Rounding methods but with several advantages. The new methodology is named Cell Perturbation and leads to a different optimization problem (CPP) which is simpler to solve than the previous problem. The paper presents a cutting-plane algorithm for finding an exact solution of the new problem, which is a pattern guaranteeing the same protection level requirements but with smaller loss of information when compared with the classical Controlled Rounding optimal patterns. The auditing phase is unnecessary for the solutions generated by the two algorithms. The paper concludes with computational results on real-world instances and discusses a modification of the objective function to guarantee statistical properties in the solutions.

A few definitions (read the technical paper for full details):

Table:  It is a pair of sets. One set is named "cells"; for each element we have a cell value (a number) plus other details (weight in the loss of information, whether it is sensitive or not, protection requirements when sensitive, etc.). The other set is named "equations"; each element represents a mathematical formula relating some cells (typically, each one says that a cell value should be equal to the sum of other cell values, but many other more general linear links are allowed). The two sets are written in the *.txt files following a simple ASCII format used by the tau-ARGUS software package.

Base number: It is a pre-specified number "b", given by the statistical agency, which is used in the two methodologies described in the technical paper (i.e., Controlled Rounding and Cell Perturbation). The idea is that, when observing a value "y" in a cell of the output table, an attacker only learns that the true value "x" of that cell lies in the interval [y-b, y+b]. Using the other published cell values, the equations, the external bounds known on some cells, etc., the attacker can narrow this interval of potential values, but the mathematical model underlying our algorithm ensures that this narrowed interval never violates the protection requirements on the unsafe cells. See the technical paper for full details.

Unsafe cells: This is the subset of cells with special protection requirements on the tightest interval that an attacker should be allowed to compute.

Non-zeros in M:  Considering the cells as columns and the equations as rows, it is possible to create a matrix M, thus defining the linear system "M y = 0" that a vector of cell values "y" must satisfy in order to be an additive table.
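
As a small illustration of this definition, the sketch below builds M for a toy table and checks the additivity condition "M y = 0". The Python structures used (y and equations) are hypothetical stand-ins for an already-parsed instance; they are not the tau-ARGUS file format.

# A minimal sketch, assuming the instance has already been parsed into the
# hypothetical structures below (NOT the tau-ARGUS *.txt format itself).
import numpy as np

# cell values y_j of a toy 1x3 table with a marginal total
y = np.array([20.0, 50.0, 10.0, 80.0])

# each equation is a list of (cell index, coefficient) pairs; the usual
# "sum of inner cells minus marginal = 0" link uses coefficient -1 on the marginal
equations = [
    [(0, 1.0), (1, 1.0), (2, 1.0), (3, -1.0)],   # 20 + 50 + 10 - 80 = 0
]

# build the matrix M row by row and check M y = 0 (additivity)
M = np.zeros((len(equations), len(y)))
for i, eq in enumerate(equations):
    for j, coef in eq:
        M[i, j] = coef

print("non-zeros in M:", np.count_nonzero(M))     # the statistic reported in the tables below
print("additive table:", np.allclose(M @ y, 0.0))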

CRP: linear-relaxation solution:   This is the solution of the continuous linear programming relaxation of the model given in the paper for the Controlled Rounding methodology; in other words, the solution of the program without imposing the 0-or-1 condition on the decision variables. Therefore it typically contains fractional values, depending on the structure of the table (i.e., on the matrix "M"). It has the advantage that it can be computed faster than the integer solution of the model.

CRP: (integer) optimal solution:  This is the integer solution of the model given in the paper for the Controlled Rounding Problem (CRP). To obtain it, we start by computing its linear-relaxation solution and then continue with a branching scheme where fractional variables are either rounded down or rounded up. Each value in the output is a multiple of the base number "b".

CPP: solution: It is the optimal solution of the model given in the paper for the Cell Perturbation Problem (CPP). Typically it contains fractional values, with the condition that, for each cell, the true value is plus or minus "b" times this value in the solution.

 

CRPlib (ZIP file):

input file name    cells   equations   unsafe cells   non-zeros in M   CRP: linear-relaxation solution   CPP: solution
bts4.txt           36570       36310           2260           136912   bts4.crp                          bts4.cpp
hier13.txt          2020        3313            112            11929   <infeasible>                      hier13.cpp
hier16.txt          3564        5484            224            19996   hier16.crp                        hier16.cpp
nine12.txt         10399       11362           1178            52624   nine12.crp                        nine12.cpp
nine5d.txt         10733       17295           1661            58135   nine5d.crp                        nine5d.cpp
ninenew.txt         6546        7340            858            32920   ninenew.crp                       ninenew.cpp
two5in6.txt         5681        9629            720            34310   two5in6.crp                       two5in6.cpp
five20b.txt        34552       52983
five20c.txt        34501       58825
hier13d4.txt       18969       47675


The first seven tables (in *.txt) in this collection were created by Ramesh Dandekar, and they are described in
Dandekar, R.A. (2004) "Maximum Utility-Minimum Information Loss Table Server Design
for Statistical Disclosure Control of Tabular Data", 121-135 in Domingo-Ferrer, J. (editor),
"Privacy in Statistical Databases", Lecture Notes in Computer Science 3050, Springer.

The last three tables (in *.txt) in this collection were also provided by Ramesh Dandekar, and they are described in the files "TwentyD.pdf" and "hier13d4.readme" included in the ZIP file (CRPlib.zip).

The column "CRP: linear-relaxation solution" contains the solutions of the linear programming relaxation of the Controlled Rounding methodology with base number 250.

The column "CPP: solution" contains the solutions of the Cell Perturbation methodology with base number 250.

In addition, we include a frequency table created by Anco Hundepool (CBS, Netherlands) with a similar size to the largest one in the previous collection:

input file name    cells   equations   unsafe cells   non-zeros in M   CRP: (integer) optimal solution   CPP: (fractional) optimal solution
anco.txt           30886       39800          10680           120819   anco.crp                          anco.cpp
anco0.txt          30886       39800              0           120819   anco0.crp                         anco0.cpp

The base number used was 5. The optimal CRP table (anco.crp) is at a distance of 42600 units from the original table (anco.txt). The optimal CPP table (anco.cpp) is at a distance of 37306.166 from the original table (anco.txt) and contains 4121 fractional cell values. If no protection levels were required on the unsafe cells, then an optimal CRP table (anco0.crp) would be at a distance of 42300 units from the original table (anco0.txt), while the optimal CPP table (anco0.cpp) would be at a distance of 38146.018 and would contain 3194 fractional cell values. This highlights the fact that the objective function used in the CPP model is not equivalent to minimizing the distance between input and output. This choice was made not only to keep the CPP model linear, but also (and more importantly) to avoid returning the original input table as the final output. The objective function minimizes the distance between the output and a (non-additive) table in which each cell value is rounded to its nearest multiple of the base number.
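
The following toy sketch (hypothetical numbers, not values taken from anco.txt) illustrates the distinction made above, assuming that the CPP reference table is obtained by rounding each cell value to its nearest multiple of the base number.

# Toy illustration with made-up numbers: the CRP distance is measured against
# the original table, while the CPP objective is assumed here to measure the
# distance against the (non-additive) table obtained by rounding each cell to
# its nearest multiple of the base number b.
import numpy as np

b = 5.0
original = np.array([7.0, 12.0, 19.0, 38.0])   # hypothetical original cell values
nearest  = b * np.round(original / b)          # non-additive "nearest rounding" reference

output = np.array([5.0, 10.0, 20.0, 40.0])     # a hypothetical protected output table

print("distance to original table:", np.abs(output - original).sum())   # 7.0
print("distance to nearest-rounding table:", np.abs(output - nearest).sum())   # 0.0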



New benchmark instances of the Controlled Rounding Problem (base = 5; not zero-restricted)

input table    upper bound   lower bound   output table
Hier13.dat            3501        3230.9   Hier13.dat.sol
Hier16.dat            7004        5576.1   Hier16.dat.sol
Nine12.dat
Ninenew.dat          11536        9913.9   Ninenew.dat.sol
Two5in6.dat           8972        8415.7   Two5in6.dat.sol
Nine5d.dat           18388       15931.5   Nine5d.dat.sol
Bts4.dat




 


 

Cell Suppression Problem


Working papers and articles:

CSPlib:

Table generators:

Pictures:


A statistical agency collects data to be processed and published. This data is often obtained from individual respondents under a pledge of confidentiality: statistical agencies have the responsibility of not releasing any data or data summaries from which individual respondent information can be revealed (sensitive data). On the other hand, statistical agencies aim at publishing as much information as possible. This results in a trade-off between privacy rights and information loss, an issue of primary importance in practice. The interested reader is referred, e.g., to Willenborg and Waal (1996) for a discussion on disclosure control methodologies.

Cell suppression is one of the most popular techniques for disclosure avoidance, and is typically applied to 2- or 3-dimensional tables whose entries (cells) are subject to marginal totals. We next outline the standard complete cell suppression technique, and then introduce an alternative version that we call partial cell suppression. Both methodologies are illustrated by means of the following introductory example, taken from Willenborg and Waal (1996).

 

               Region A   Region B   Region C   Total
Activity I           20         50         10      80
Activity II           8         19         22      49
Activity III         17         32         12      61
Total                45        101         44     190

Fig. 1: Investment of enterprises by activity and region

Figure 1 exhibits a table giving the investment of enterprises (in millions of guilders), classified by activity and region. Let us assume that the information in the cell corresponding to Activity II and Region C is confidential; hence it is viewed as a sensitive cell to be suppressed. By using the marginal totals, an attacker interested in the disclosure of the sensitive cell could however recompute its missing value; hence other table entries must be suppressed as well, e.g. those of Figure 2.

 

               Region A   Region B   Region C   Total
Activity I           20         50         10      80
Activity II           *         19          *      49
Activity III          *         32          *      61
Total                45        101         44     190

Fig. 2: A possible complete CSP output.

With this choice, the attacker cannot disclose the nominal value of the sensitive cell exactly, although he/she can still compute a range of values for this cell that are consistent with the published entries (those of Figure 2). Indeed, the minimum value y⁻23 for the sensitive cell (2,3) can be computed by solving a linear program in which the values yij of the missing cells (i,j) are treated as unknowns, namely

y⁻23 := minimize y23

subject to

    y21 + y23 = 30
    y31 + y33 = 29
    y21 + y31 = 25
    y23 + y33 = 34

    y21 ≥ 0, y23 ≥ 0, y31 ≥ 0, y33 ≥ 0

Notice that the right-hand side values are known to the attacker, as they can be obtained as the difference between the marginal and the published values in a row/column. The maximum value y⁺23 for the sensitive cell can be computed in a perfectly analogous way, by solving the linear program that maximizes y23 subject to the same constraints as before. In the example, y⁻23 = 5 and y⁺23 = 30, i.e., the sensitive information is "protected" within the protection interval [5...30]. If this interval is considered sufficiently wide by the statistical office, then the sensitive cell is called protected; otherwise new complementary suppressions are needed. Notice however that the extreme values of the interval [5...30] are only attained if the cell corresponding to Activity II and Region A takes the quite unreasonable values of 0 and 25. Bounding the cell variation to, say, 50% of the nominal value results in the more realistic protection interval [18...26]. The statistical office then aims at finding a set of complementary suppressions protecting all the sensitive cells against an attacker, and such that the loss of information associated with the suppressed entries is minimized.
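
For concreteness, the small sketch below solves the attacker's two linear programs for the example of Figure 2. It uses scipy.optimize.linprog purely as an illustrative off-the-shelf solver (not the code used in the papers) and reproduces the protection interval [5...30] discussed above.

# Attacker's problem for Figure 2; variables ordered as (y21, y23, y31, y33).
from scipy.optimize import linprog

A_eq = [[1, 1, 0, 0],   # y21 + y23 = 30
        [0, 0, 1, 1],   # y31 + y33 = 29
        [1, 0, 1, 0],   # y21 + y31 = 25
        [0, 1, 0, 1]]   # y23 + y33 = 34
b_eq = [30, 29, 25, 34]
bounds = [(0, None)] * 4          # all suppressed cells are non-negative

c = [0, 1, 0, 0]                  # objective: the sensitive cell y23
y_min = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
y_max = -linprog([-ci for ci in c], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
print(y_min, y_max)               # expected: 5.0 and 30.0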

This combinatorial optimization problem is known as the (Complete) Cell Suppression Problem, or CSP for short. CSP belongs to the class of strongly NP-hard problems (see, e.g., Kelly et al. 1992, Geurts 1992, Kao 1996), meaning that it is very unlikely that an exact algorithm for CSP exists that guarantees efficient (i.e., polynomial-time) performance for all possible input instances. Previous work on CSP in the literature mainly concentrates on 2-dimensional tables with marginals. Heuristic solution procedures have been proposed by several authors, including Cox (1980, 1995), Sande (1984), Kelly et al. (1992), and Carvalho et al. (1994). Kelly (1990) proposed a mixed-integer linear programming formulation involving a huge number of variables and constraints (for instance, the formulation involves more than 20,000,000 variables and 30,000,000 constraints for a two-dimensional table with 100 rows, 100 columns and 5% sensitive entries). Geurts (1992) refined this model, and reported computational experiences on small-size instances, the largest instance solved to optimality being a table with 20 rows, 6 columns and 17 sensitive cells. Gusfield (1988) gave a polynomial algorithm for a special case of the problem. Heuristics for 3-dimensional tables have been proposed in Robertson (1994), Sande (1984), and Dellaert and Luijten (1996).

We proposed in Fischetti and Salazar (1997) a new method capable of solving to proven optimality, on a personal computer, 2-dimensional tables with about 250,000 cells and 10,000 sensitive entries. An extension of this methodology resulted in the algorithm presented in Fischetti and Salazar (1998a), capable of solving to proven optimality real-world 3- and 4-dimensional tables, as well as linked tables of reasonable size. We addressed cell suppression in the very general context of tables whose entries are linked by a generic system of linear equations. In Fischetti and Salazar (1998b) we introduce and analyze an alternative protection methodology that we call partial cell suppression. The methodology consists in publishing intervals for some table entries, as opposed to replacing some table entries by "missing". The published intervals must provide a convenient set of possible values for the corresponding entries, containing the true original value. Figure 3 exhibits a possible partial cell suppression output for the original table in Figure 1, to be compared with the complete cell suppression output in Figure 2.

 

 

               Region A    Region B   Region C    Total
Activity I     [18...24]         50   [6...12]       80
Activity II    [4...10]          19   [20...26]      49
Activity III          17         32         12       61
Total                 45        101         44      190

Fig. 3: A possible partial CSP output

The new method has three important advantages over complete (i.e., standard) cell suppression. First, the loss of information needed to meet the required protection levels is typically significantly smaller with the partial suppression methodology. A second advantage is related to computational complexity: indeed, we provide in that paper an efficient (i.e., polynomial-time) method for finding optimal partial suppressions. A third advantage is that the methodology itself provides, with no extra computational effort, the auditing range for each partially suppressed entry, defined by the minimum and maximum cell values that an attacker can compute by using the published information.
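
To illustrate the third advantage, the sketch below computes this auditing range for the sensitive cell of the Figure 3 output by solving two small linear programs over the published intervals; scipy.optimize.linprog is again used only as an illustrative off-the-shelf solver, not the implementation from the papers.

# Auditing range for Figure 3; variables ordered as (y11, y13, y21, y23), i.e.
# the four entries published as intervals; all other entries are published exactly.
from scipy.optimize import linprog

A_eq = [[1, 1, 0, 0],   # row Activity I : y11 + y13 = 80 - 50 = 30
        [0, 0, 1, 1],   # row Activity II: y21 + y23 = 49 - 19 = 30
        [1, 0, 1, 0],   # column A       : y11 + y21 = 45 - 17 = 28
        [0, 1, 0, 1]]   # column C       : y13 + y23 = 44 - 12 = 32
b_eq = [30, 30, 28, 32]
bounds = [(18, 24), (6, 12), (4, 10), (20, 26)]   # the published intervals

c = [0, 0, 0, 1]                  # sensitive cell y23
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
hi = -linprog([-ci for ci in c], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
print(lo, hi)                     # auditing range of the sensitive cell: 20.0 and 26.0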


 

Working papers and articles:

·         Models and Algorithms for the Cell Suppression Problem, 3rd International Seminar on Statistical Confidentiality, paper, Bled (Slovenia), October 1996.

·         Modeling and Solving the Cell Suppression Problem for Linearly-Constrained Tabular Data, Statistical Disclosure Protection '98, seminar paper and revised version; Lisboa (Portugal), March 1998.

·         Complementary Cell Suppression for Statistical Disclosure Control in Tabular Data with Linear Constraints, International Symposium on Linked Employer-Employee Data, paper, Washington (U.S.A.), May 1998.

·         Models and Algorithms for the 2-Dimensional Cell Suppression Problem in Statistical Disclosure Control, seminar paper, May 1997; revised version, December 1997. This last version appeared in Mathematical Programming, 84 (1999) 283-312.

·         Models and Algorithms for Optimizing Cell Suppression in Tabular Data with Linear Constraints, seminar paper, October 1998; first revised version, June 1999; second revised version, January 2000. To appear in the Journal of the American Statistical Association.

·         Partial Cell Suppression: a New Methodology for Statistical Disclosure Control, seminar paper, October 1998; revised version, September 1999. To appear in Statistics and Computing.

·         "A Unified Mathematical Framework for Different Statistical Disclosure Limitation Methods for Tabular Data", work presented at the CASC workshop in Plymouth (April 3-5, 2002).


 

CSPlib: 

This is a test-bed library containing benchmark instances for testing and comparing new approaches for solving the Cell Suppression Problem. All data is publicly available. Thanks to the anonymous contributors. It is our intention to distribute not only problem data but also solutions. See the appendix in Fischetti and Salazar (1998a) for details on the file format.

 Benchmark test-bed instances: Inputs and Outputs

Statistical results solving the Complete Cell Suppression Problem

Statistical results solving the Partial Cell Suppression Problem

40-dimensional test case (sixty 5-D sections in 40-D space)

Table with 4+ million non-zero cells and 10+ million equations


Table generators:

Benedetti instance generator

Kelly-Golden-Assad instance generator

Carvalho-Dellaert-Osorio instance generator

2-dimensional instance generator

3-dimensional instance generator

4-dimensional instance generator


Pictures:

J.J. Salazar and Larry Cox at the UNECE/ECE meeting (Skopje, Macedonia)

 
