Data privacy

Data privacy studies methods, tools, and theory to avoid the disclosure of sensitive information. Its origin is in statistics, with the goal of ensuring the confidentiality of data gathered from censuses and questionnaires. The topic was later introduced in computer science, and more particularly in data mining, where, due to the large amount of data currently available, it has attracted the interest of researchers, practitioners, and companies. In this paper we review the main topics related to data privacy and privacy-enhancing technologies.


INTRODUCTION
Data privacy and privacy-enhancing technologies (PET) study techniques and tools to avoid the unintentional disclosure of sensitive information. They have been studied in the areas of computer science and statistics. Statistical Disclosure Control (SDC) was first developed to meet the needs of statistical offices to publish data from censuses and questionnaires while avoiding confidentiality problems. Within computer science, tools for data privacy have been developed in relation to communications, security, databases, and data mining. Tools and methods related to communications and security are often classified as PET, whereas the ones related to data mining are studied in privacy-preserving data mining (PPDM). While there exist different communities focusing on different types of applications and data uses, the background and concepts, as well as some methods, are common.
In this overview, we discuss the main concepts and tools in data privacy, giving a general perspective of the field and presenting them independently of the community in which they originated. This is a broad and relatively nontechnical description intended for readers without a strong background in the field. Throughout the paper we provide several references which should allow the interested reader to gain a deeper understanding of specific topics of the field.
The structure of the paper is as follows. First, we present a classification of the methods for data privacy along different dimensions. In particular, we will see that one of the dimensions concerns the subjects involved in the data privacy process: the respondent, the owner, and the user. We then focus on user privacy, that is, methods to be implemented and used by the users of a system to ensure their own privacy. After that, we focus on respondent and owner privacy. The paper finishes with some conclusions and provides some references for further study.

CLASSIFICATION
The literature presents different taxonomies of the methods for data privacy. [1][2][3][4][5][6][7][8] In this section we review three of them. We will use them to classify and review the main methods. They are as follows.
• On whose privacy is being sought. In the whole process of data collection, data protection, and data analysis several subjects (individuals or entities) are involved. This dimension focuses on the subject whose privacy is considered the main motivation for the application of a method. Three subjects are considered: respondent, owner, and user.
• On the computations to be done. Data are protected for a certain use, that is, for the application of a certain algorithm or for some type of analysis. In this dimension, methods are distinguished according to the type of computation or analysis a data miner (or another user) will perform with the protected data. For example, before data protection we may know that clustering algorithms will be applied to the data, and this can help in the selection of an appropriate method for data protection.
• On the number of data sources. Data have to be protected, and then used. The case in which a single data set is considered differs from the case in which a collection of data sets is considered. This dimension focuses on the number of such data sets.
We discuss these dimensions in more detail in the sections that follow. Figure 1 outlines the classification.

On Whose Privacy Is Being Sought
As enumerated above, in the data privacy protection process we typically consider three subjects. They are the respondent, the owner, and the user. We describe their specific meaning below.
• The respondent. Following the terminology in SDC, the respondent is the person whose data have been collected and are in the database.
• The owner. Data from respondents are collected and stored by a company or an administration that is the holder of the data, liable for disclosure of confidential information, and possibly with an economic interest in the data. This is the owner of the data.
• The user. When a person accesses a system (e.g., a database, a search engine, or an email system) some trails are left which can be recorded in a secondary database, or sniffed and used or disclosed later on. From a certain point of view, the user is just a respondent of this secondary database. Nevertheless, we distinguish the user from a respondent when the user can take some action against the system or sniffers. That is, the user corresponds to an individual who can act to avoid disclosure of his own information, i.e., an active subject, while a respondent is a passive subject, almost a stranger to his own data.
We can consider data privacy focusing on the three subjects discussed above. That is, we can consider respondent privacy, owner privacy, and user privacy. This dimension is based on the one presented in Ref 9.
• Respondent privacy. It focuses on technologies that avoid the disclosure of sensitive information about the respondents of a database. Some specific goals of respondent privacy are to prevent data from being linked to a particular individual, to avoid increasing the knowledge about particular individuals, and to prevent someone from finding out that a particular individual's data are in a database.
• Owner privacy. It focuses on tools to avoid the disclosure of information that is relevant to the owner of the database. On the one hand, this information can be information on particular individuals, as in respondent privacy. On the other hand, this information can be knowledge that can only be inferred from the whole database. An example of the latter is when a database owner intends to publish a database while preventing third parties from mining certain rules which are of high relevance to the business.
• User privacy. As stated above, the role of users is similar to that of respondents in that their data are collected. The difference is that here users can act to protect their privacy. Therefore, user privacy focuses on tools that can be implemented and used by users, for example, to avoid leaving trails of their interests when accessing a search engine.

On the Computations To Be Done
Data are protected to be used by third parties. If we know how the data will be used, and what the third party wants to compute, we can take that into account in the data protection process. For example, we may know that a statistician wants to apply a linear regression of income with respect to age, or that a data miner will apply some unsupervised machine learning algorithm (e.g., a clustering algorithm).
In this dimension we consider three situations. We describe them below.
• Computation-driven or specific-purpose data privacy protection methods. In this case, we know which type of algorithm a researcher will apply to the data, and we can tailor the protected data to this type of method.
• Data-driven or general-purpose data privacy protection methods. This corresponds to the case in which there is only rough information or no information at all on the type of analysis to apply to the data. This is the case when data are published on the web.
• Result-driven data privacy protection methods. In this case, we know the analysis that the researcher will do on the data. Nevertheless, there is a fundamental difference with computation-driven methods because protection is not focused on the original database but on the results obtained from the analysis. That is, we want to prevent the data miner or statistician from obtaining some particular results. The case described above, in which we want to prevent a data miner from obtaining a certain association rule from the data, belongs to this case.
We can find in the literature another classification (dimension) of methods, distinguishing perturbative and cryptographic methods. Perturbative methods are those that add some kind of noise to the data. That is, they mask the original data so that the true values are no longer found. Examples are adding noise or swapping values of the data. In this way, disclosure risk is reduced at the cost of some information (or utility) loss. In contrast, cryptographic methods describe protocols so that researchers get their desired result without accessing the original data. Most perturbative methods can be considered data driven, while cryptographic methods can be considered computation driven. The latter can only be defined when we know the intended computation of the user.
We will discuss in more detail these methods later in relation to respondent and owner privacy.

On the Number of Data Sources
The number of data sources is another way to distinguish data privacy protection methods. There are methods to be applied when we only want to publish a single data set, when we want to publish different data sets (e.g., multiple tables of a single database), and when we want to obtain a computation from multiple data sets.
Methods focused on the publication of a single data set typically correspond to data-driven methods. The same case occurs when a single owner publishes several data sets (see e.g., ). A concrete case of this last scenario corresponds to the protection of stream or dynamic data. Protecting data streams can be done by producing several protected data sets on a timely basis or by providing incremental versions of the data set. [13][14][15][16] Dynamic data, which also consider deletion of elements, produce an evolving protected data set that reflects the updates (deletions and/or insertions) in the original data. 17,18 Computation-driven methods can be applied if the analysis of the data miner or statistician is known.
The case of computing a function from multiple data sets typically corresponds to computation-driven methods. In fact, most computation-driven methods focus on the following type of problem: n data owners decide to compute a function f of their data so that the only additional knowledge each of the owners gets after the computation of f is the outcome of f applied to their data. Cryptographic protocols are defined for this purpose. 19 An important advantage of cryptographic methods is that they compute the function exactly (there is no error in the outcome of the function) and they ensure complete privacy. The main inconvenience is that if the function is changed, the protocol has to be changed. This is so even in the case of a small variation in the function. So, the main disadvantage is that there is no flexibility in the function to be computed. In some scenarios there might also be computational or communication costs, because cryptographic operations have to be performed through a communication protocol in a reasonable time. In contrast, data-driven (perturbative) methods are not exact, because they decrease risk by adding some noise to the data, which perturbs the results of any analysis; in addition, they do not ensure 100% privacy. The risk depends on the amount and type of perturbation added to the data. However, these methods permit the use of the same data to compute different functions. Their main advantage is flexibility.
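As an illustration of the problem just stated, the following sketch simulates n owners computing the sum of their private values with additive secret sharing. It is a toy simulation in a single process, not a protocol taken from the references: the function names, the public modulus, and the assumption of honest-but-curious owners that do not collude are all choices made here for the example.

```python
import secrets

# Public modulus (an assumption of this sketch); it must exceed any possible sum.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that add up to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the exchange: owners share their inputs, then publish partial sums."""
    n = len(private_values)
    # shares_sent[i][j] is the share that owner i sends to owner j.
    shares_sent = [make_shares(v, n, modulus) for v in private_values]
    # Owner j adds the shares it received and publishes only this partial sum.
    partial_sums = [
        sum(shares_sent[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums add up to the exact total.
    return sum(partial_sums) % modulus


if __name__ == "__main__":
    print(secure_sum([52000, 61000, 47000]))  # 160000
```

The result is exact, as the text notes for cryptographic methods, but the construction only works for this particular f: computing, say, the maximum instead of the sum would require a different protocol.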

USER PRIVACY
As stated above, user privacy focuses on methods that can be implemented and applied by the user to ensure his/her own privacy. We can distinguish two main families of methods:
• Methods to protect the identity of the user.
• Methods to protect the data generated by the user.
Tools for anonymous communications belong to user privacy. These tools are expected to be implemented by the user to avoid the disclosure of some information related to him. For example, when a user A sends a message m to user B, A may want to prevent third parties from knowing that he is the sender of m, or to hide the content of m (or try to keep others unaware that he sent a message at all).
Tools for user privacy have also been developed in the context of querying databases or search engines. In this case, if A queries a search engine Y with query q, we may want to prevent Y from knowing who the sender of the query is, or we may accept that Y knows that A is the sender but want Y to remain unaware of the query q.
We will give some examples of these tools in the next two sections.

Methods to Protect the Identity of the User
In the context of communication, we have anonymous communication mechanisms in which the sender of the message (or, in general, the origin of the communication) is not disclosed. Mix networks, 20 onion routing, 21 and crowds 22 are examples of such systems.
In the context of querying databases, this problem is studied in anonymous database search. One approach to this problem is to allow users to submit queries on behalf of other users. Peer-to-peer user private information retrieval 23-27 (P2P UPIR) follows this approach. Cryptography is used to define communities of users and communication spaces.

Methods to Protect the Data Generated by the User
In the context of communication, cryptographic mechanisms are used to protect the content of the messages. In addition, there are systems developed to ensure unobservability, i.e., that third parties do not even know that a message is sent. Dining cryptographer networks 28 are an example. Private information retrieval (PIR) studies this type of problem in the context of querying databases. Informally, this problem can be stated as finding a way to retrieve an element of a database without the database being able to deduce which element is of interest to the user.
Information theoretic PIR addresses this problem in the setting in which there must be no privacy breach even against unlimited computing power. However, it has been proven that if we consider a single database, all information theoretic PIR schemes require Ω(n) bits of information (where n is the number of records in the database). This means (see Ref 29) that essentially the only thing the user can do to prevent the database from learning his query is to ask for a copy of the whole database.
Because of this theoretical result different alternatives have been considered in the literature. We review some of them below.
First, within information theoretic PIR, the literature considers solutions in which, instead of a single database, there are replicated copies of this database. The user then queries each of the copies differently and obtains the desired result from the answers to these queries. In this case, solutions sublinear in n exist. Some of the solutions are resistant to coalitions of databases. That is, even if a certain number of databases collude, they will not be able to find out the query of the user. See e.g., Ref 29.
Another approach is computational PIR (cPIR). In this case, a server with a limited computational capacity is considered. Refs 30 and 31 are two of the proposed solutions for cPIR.
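The idea of querying replicated copies differently can be sketched with a classic two-server construction over a database of bits. This is a minimal illustration and not one of the sublinear schemes referred to above: each query here is still a set of up to n indices, and the sketch assumes that the two servers hold identical copies and do not collude. All function names are hypothetical.

```python
import secrets


def server_answer(database_bits, index_subset):
    """What each server computes: the XOR of the bits at the requested positions."""
    answer = 0
    for i in index_subset:
        answer ^= database_bits[i]
    return answer


def make_queries(n, wanted_index):
    """Build two index sets that differ only in wanted_index."""
    # Each index is included independently with probability 1/2.
    subset_1 = {i for i in range(n) if secrets.randbelow(2) == 1}
    # The symmetric difference toggles the membership of wanted_index only.
    subset_2 = subset_1 ^ {wanted_index}
    return subset_1, subset_2


def retrieve_bit(database_bits, wanted_index):
    """Simulate the user: query both copies and combine the two answers."""
    query_1, query_2 = make_queries(len(database_bits), wanted_index)
    answer_1 = server_answer(database_bits, query_1)  # computed by server 1
    answer_2 = server_answer(database_bits, query_2)  # computed by server 2
    # Every index except wanted_index appears in both sets or in neither,
    # so its bit cancels out in the XOR of the two answers.
    return answer_1 ^ answer_2
```

Seen in isolation, each query is a uniformly random subset of indices, so a single server learns nothing about which element the user wants.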
A third approach is the use of trusted-hardware. See e.g., Refs 32 and 33 on trusted-hardware PIR.
These three approaches are based on cryptographic tools and ensure no privacy leakage. Another approach consists of methods that mask the real query within a set of other queries. This is the case of GooPIR 34 and TrackMeNot. [35][36][37] They add additional terms to the query, either at the query level or at the session level, with the goal that the server cannot distinguish the real queries of the user among the added ones. This query obfuscation approach can, however, be attacked by analyzing the user query history from the server side. In Ref 38 the authors re-identify users from their obfuscated queries by applying common classifiers and clustering techniques to the user query history.
Dissociating Privacy Agent 39,40 (DisPA) follows another approach, also to protect the data generated by the user. In this case, the system (a plug-in for Firefox) generates different identities for a given user, and then distributes the queries among the identities. The basis of this system is the consideration that what makes a user unique is the union of all his queries. Therefore, the disaggregation of queries makes it possible to keep the profile of the user unknown to the search engine. Disaggregation of queries is done according to topics, so if a user often queries about data privacy, Japanese recipes, and sports/squash, the search engine will just see three individuals: one interested in data privacy, another in Japanese recipes, and a third in sports/squash.

RESPONDENT AND OWNER PRIVACY
As described in the previous sections, respondent and owner privacy are typically implemented by the owner of a database. From our discussion of the dimensions concerning the computations to be done and the number of data sources, the following scenarios are typical in respondent and owner privacy.
• Result-driven methods (mainly used in owner privacy). Given a database D, a data mining algorithm A, and a certain knowledge K that we do not want to disclose, the goal is to modify D into D′ so that the algorithm A cannot infer K from D′. Ref 41 is an overview of this topic, and Refs 42, 43 describe algorithms for the case in which A is a rule mining algorithm.
• Computation-driven methods in the typical scenario with several data sources belonging to different data owners. This scenario corresponds to owner privacy. As described above, this type of problem is solved by defining cryptographic protocols for the specific function the owners want to compute. Ref 19 describes several computation-driven methods and Ref 44 is a survey of methods for horizontally partitioned data (i.e., different owners have data on different individuals but on the same variables).
• Computation-driven methods with a single database release. If the function is completely specified, the most common scenario is one in which researchers can access a database and send specific queries (see e.g., Ref 45). If the function is not completely specified but it is known that the user applies, e.g., clustering, then data-driven approaches would be applied, with particular emphasis on methods that behave well with respect to this use (clustering). There are studies (see e.g., Ref 46) comparing different data-driven methods with respect to clustering, supervised learning algorithms, and so on.
• Data-driven methods with either one or multiple data releases. As already mentioned in a previous section, Refs 10, 16, 11, 12 focus on data-driven approaches for multiple data releases or streaming data. Data-driven approaches for a single database are further discussed in the next section.

Data-Driven Methods
As described above, data-driven methods, usually referred to as masking methods, are appropriate when we do not know beforehand what type of analysis will be applied to the data. Given a database D, the usual way to proceed is to modify D into D′ so that the risk of disclosure decreases while at the same time we preserve the utility of D. That is, we modify D into D′ so that the disclosure risk decreases while keeping information loss as low as possible. Note that we use the term information loss as a computation-oriented counterpart of data utility: more information loss means less utility of the data once they are masked. Because these methods are not free of disclosure risk, several disclosure risk measures have been considered in the literature to quantify the risk in D′. At the same time, as the modification of D can decrease the utility of the database, some information loss measures have been defined to measure the extent of this loss. Naturally, disclosure risk decreases at the expense of some information loss. A good privacy method is then one that modifies D into D′ in such a way that both the disclosure risk and the information loss are very low.
In summary, research in data-driven methods needs to focus on masking methods, disclosure risk measures, and utility measures. In the following three sections, we discuss disclosure risk measures, information loss measures, and data masking methods.

Disclosure Risk and Some Definitions of Privacy for Data-Driven Methods
In this section, we describe several approaches to measure the degree of privacy provided by a given method. These measures are normally referred to as disclosure risk measures, or presented as privacy properties to be satisfied by the protected data.

Properties for Disclosure Risk
Data-driven methods add noise to the data to avoid disclosure. Then, we can either consider risk as a Boolean condition that is either satisfied or not satisfied, or as a measurable (non-Boolean) condition and define measures of risk.
Differential privacy 45 and k-anonymity [47][48][49][50] follow the first approach. That is, they define conditions under which we say that the file satisfies our privacy requirements. At the same time, such definitions make it possible to define algorithms that, given a privacy condition, focus only on the minimization of information loss.

Volume 4, July/August 2014
The k-anonymity property ensures that in a protected data set each record is indistinguishable from at least k − 1 other records. Or, from the point of view of re-identification, that the probability of re-identifying an individual from the data set is at most 1/k. This property is very common in statistical data such as census data, where the perturbation is applied to attributes known as quasi-identifiers (they cannot be used to re-identify an individual by themselves, but their combination might), and sensitive attributes are left without perturbation. Examples of quasi-identifiers are age, sex, or postal code, while typical sensitive attributes are salary or disease.
Consider a k-anonymous data set in which the k records sharing the same quasi-identifiers (known as an anonymity set) also have the same sensitive attribute. Although the table might not be usable to directly re-identify a given individual, it leaks information about the sensitive attribute. That is, an attacker who knows to which anonymity set an individual belongs will know the sensitive attribute of that individual. This is a well-known problem of k-anonymity (see Ref 51 for a detailed description of the problem and a discussion of current solutions). To address this issue several properties have emerged. l-diversity 52 requires at least l well-represented values of the sensitive attribute in each anonymity set. Moreover, t-closeness 53 requires the distribution of sensitive attributes in each anonymity set to be close to their distribution in the overall data set. Along the same lines, p-sensitivity has also been defined as the concrete case of l-diversity in which the number of distinct values for each sensitive attribute is at least p in each anonymity set. See Ref 54 for a review of p-sensitive k-anonymity models.
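The properties above can be checked mechanically on a table. The sketch below groups records into anonymity sets and reports the largest k for which the table is k-anonymous, together with the number of distinct sensitive values in the least diverse set. Counting distinct values is only one simple reading of "well-represented"; the attribute names and the toy table are invented for the example.

```python
from collections import defaultdict


def anonymity_sets(records, quasi_identifiers):
    """Group records that share the same values on all quasi-identifiers."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_of(records, quasi_identifiers):
    """Largest k for which the table is k-anonymous (size of the smallest set)."""
    groups = anonymity_sets(records, quasi_identifiers)
    return min(len(group) for group in groups.values())


def distinct_l_of(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any anonymity set."""
    groups = anonymity_sets(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical table with generalized quasi-identifiers.
    table = [
        {"age": "30-39", "zip": "080**", "disease": "flu"},
        {"age": "30-39", "zip": "080**", "disease": "asthma"},
        {"age": "40-49", "zip": "081**", "disease": "cancer"},
        {"age": "40-49", "zip": "081**", "disease": "cancer"},
    ]
    print(k_of(table, ["age", "zip"]))                      # 2
    print(distinct_l_of(table, ["age", "zip"], "disease"))  # 1
```

The toy table is 2-anonymous, yet its second anonymity set contains a single disease, which is exactly the attribute disclosure problem described above.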
Some generalizations of k-anonymity have been defined in the literature. For example, k-confusion 55 and probabilistic k-anonymity, 56 in which instead of requiring indistinguishable records the focus is on the probability of re-identification. k-Concealment 57 requires computationally indistinguishable records 58 (each record can be matched with k − 1 generalized records).
Differential privacy states that adding or removing an item from a data set does not significantly affect the outcome of any analysis. That is, the outcomes should be probabilistically similar. This definition of privacy has spurred a large body of literature on mechanisms to provide differential privacy, 59 but it has also raised some concerns. For example, Refs 60, 61 question the practicality of differential privacy as a general-purpose approach to data privacy.
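The informal statement above is usually made precise as follows. The notation is introduced here and is not taken from the paper: a randomized mechanism M satisfies ε-differential privacy if, for every pair of data sets D_1 and D_2 that differ in a single record and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D_1) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D_2) \in S]
```

The smaller the parameter ε, the closer the two output distributions must be, and so the less any single record can influence what an analyst observes.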

Disclosure Risk Measures
As an alternative to Boolean conditions, there are measures of disclosure risk defined under the premise that risk is not binary but a measurable condition. It then makes sense to consider different levels of risk and the trade-off between risk and the utility of the data. In this setting, the problem is not to define algorithms with the sole purpose of optimizing information loss but with the purpose of finding a good trade-off between information loss and disclosure risk. Therefore, from an optimization point of view, we have a multi-objective (two-objective) optimization problem instead of a single-objective minimization problem. The perspective of an optimization problem has been exploited in, e.g., Refs 62,63. Some of the measures of disclosure risk are based on the concept of uniqueness, and others on re-identification algorithms. Key references on disclosure risk based on uniqueness are Refs 64, 65, and those on risk based on re-identification algorithms are Refs 66-68. Disclosure risk measures based on re-identification algorithms model the scenario in which intruders use their knowledge (represented in terms of a database) to attack the published data set. In this case, the intruders will try to link their data with the data in the published data set by means of the best available technology for database integration (re-identification algorithms, schema matching, and record linkage algorithms). This approach is flexible enough to cope with a large number of scenarios. For example, disclosure risk has been studied for masked data, 69 synthetic data, 70,66 and for the case in which the intruder and the protected data are not using the same variables 71,72 or are using different terms (e.g., ontology-based record linkage in Ref 73).
As a general-purpose estimate of the disclosure risk, re-identification can be attempted on the protected data set assuming knowledge of all the attributes of the original data set, for example, by applying record linkage between the original records and the corresponding protected records. This approach was introduced in Ref 74, and has been widely used afterwards. 66,70,75,76 The percentage of re-identified records is used as a generic index of disclosure risk that can be used to compare different masking methods. 67 A parameterized record linkage makes it possible to provide an upper bound on the re-identification index by using machine learning techniques to find the optimal distance between records (the one that yields the highest re-identification index). 77 Disclosure risk based on re-identification methods can also be used to model the case in which the intruder uses information about the data masking process to attack the data. That is, if an institution publishes a data set together with information on the algorithm applied and the parameters used, this information can be used to attack the data more effectively. This has been proven to be effective in the case that data were protected using rank swapping and microaggregation. Methods resistant to this type of attack are needed for the sake of transparency. 81
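The generic index described above, the proportion of records that record linkage re-identifies, can be sketched as follows. The example uses plain Euclidean distance on numerical records and assumes that the i-th masked record is the protected version of the i-th original record. In practice attributes would normally be standardized first, and the linkage methods in the references are more elaborate. Function names and data are illustrative.

```python
import math


def nearest_original(masked_record, original_records):
    """Index of the original record closest to masked_record (Euclidean distance)."""
    return min(
        range(len(original_records)),
        key=lambda i: math.dist(masked_record, original_records[i]),
    )


def reidentification_rate(original_records, masked_records):
    """Fraction of masked records that are linked to their true original record.

    Assumes masked_records[i] is the masked version of original_records[i].
    """
    hits = sum(
        1
        for i, masked in enumerate(masked_records)
        if nearest_original(masked, original_records) == i
    )
    return hits / len(masked_records)


if __name__ == "__main__":
    # Hypothetical (age, income) records and a lightly perturbed release.
    original = [(25, 30000), (40, 52000), (61, 41000)]
    masked = [(27, 31000), (38, 50500), (58, 43000)]
    print(reidentification_rate(original, masked))  # 1.0
```

A rate of 1.0 means that every masked record is still closest to its own original, so this light perturbation offers essentially no protection against an intruder who knows the original values.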

Utility and Information Loss Measures
Utility measures are used to measure to what extent the protected database diverges from the original one for some statistics and analyses. We can measure the utility of the data once they are masked as compared to the original ones. This measure can be given in terms of the loss of information produced by the masking method: a masking method that yields a higher loss of information presents lower utility. Then, given a database D, a protected database D′, and a certain analysis f, an information loss measure is a function of the form
IL_f(D, D′) = divergence(f(D), f(D′)),
where divergence is a way to compare the results of the analysis f on D and D′. Naturally, the function divergence should be zero when D = D′, and should increase the more f(D) and f(D′) differ.
We can distinguish between generic utility (or information loss) measures and specific utility (or information loss) measures. We have specific utility measures when they focus on particular uses of the data. This would be the case if we consider clustering as a data use, and then we use clustering algorithms and functions to compare partitions to define an information loss measure. This is the case in Ref 46. Otherwise, we have generic utility measures when we, e.g., aggregate some statistics of the data. This latter approach is used in Refs 82, 2.
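A minimal sketch of an information loss measure built from an analysis f and a divergence is given below. The choice of the absolute difference as divergence, and of the mean and the variance as analyses, are assumptions made for the example rather than the measures used in the references.

```python
import statistics


def information_loss(original, masked, analysis, divergence):
    """Compare the result of the same analysis on the original and masked data."""
    return divergence(analysis(original), analysis(masked))


def absolute_difference(a, b):
    """A simple divergence for scalar results: zero when both results coincide."""
    return abs(a - b)


if __name__ == "__main__":
    # Hypothetical ages before and after masking.
    ages = [23, 35, 41, 58, 62]
    masked_ages = [25, 33, 44, 55, 62]
    # The masking happens to preserve the mean, so this loss is zero ...
    print(information_loss(ages, masked_ages, statistics.mean, absolute_difference))
    # ... but it changes the variance, so this loss is positive.
    print(information_loss(ages, masked_ages, statistics.pvariance, absolute_difference))
```

The same masked file can therefore show no loss for one use and a noticeable loss for another, which is why specific measures are tied to a particular data use while generic ones aggregate several statistics.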

Masking Methods
Data masking methods are typically classified into three main classes. See Refs 3, 4 for detailed descriptions of the methods.
• Perturbative methods. Given a database D, these methods modify the database by adding some noise to D. This can be modeled as D′ = D + ε, where ε is the noise. There are several perturbative methods. The simplest one is noise addition, where the error added to D follows a normal distribution. The most important methods are noise addition, 83 multiplicative noise, 84 microaggregation (applicable to all types of data), 85 rank swapping (for data in ordinal or numerical scales), 86 and PRAM (for ordinal or categorical scales). 87
• Non-perturbative methods. Given a database, these methods modify the database by changing the level of detail of the data without introducing errors into the data. One masking method is generalization, which replaces a category by a more general one (e.g., town is replaced by county); another one is suppression (which can be considered equivalent to a generalization to the most general category); and finally we have discretization in the case of numerical data (again, a kind of generalization).
• Synthetic data generators. Instead of publishing the original data, we generate a model of the data and then replace the original values by the outcomes of the model. This approach can be considered as a kind of perturbative method.
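To make the classes above concrete, the sketch below implements two perturbative methods for a single numerical attribute (noise addition and a simple univariate microaggregation) and one non-perturbative method (generalization of an age into an interval). These are simplified variants written for illustration, not the algorithms of the cited references. The parameter names and default values are assumptions.

```python
import random
import statistics


def add_noise(values, relative_sd=0.1, rng=None):
    """Perturbative: add zero-mean normal noise scaled to the data's deviation."""
    rng = rng or random.Random()
    sd = relative_sd * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sd) for v in values]


def microaggregate(values, k=3):
    """Perturbative: replace each value by the mean of its group of sorted neighbours.

    Groups hold at least k values (a single smaller group if len(values) < k).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    masked = [0.0] * n
    start = 0
    while start < n:
        # The last group absorbs the leftovers so that no group is smaller than k.
        end = n if n - start < 2 * k else start + k
        group = order[start:end]
        group_mean = statistics.fmean([values[i] for i in group])
        for i in group:
            masked[i] = group_mean
        start = end
    return masked


def generalize_age(age, width=10):
    """Non-perturbative: replace an exact age by the interval that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


if __name__ == "__main__":
    salaries = [21, 25, 24, 60, 58, 61, 90]
    print(microaggregate(salaries, k=3))
    print(add_noise(salaries, relative_sd=0.1, rng=random.Random(0)))
    print(generalize_age(37))  # 30-39
```

Microaggregation keeps the overall mean of the attribute while making each released value shared by at least k records, whereas generalization introduces no error at all and only reduces the level of detail.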
All the methods described here have been used and compared in terms of their trade-off between information loss and disclosure risk (defined in terms of re-identification algorithms). When differential privacy is used as the standard to ensure low risk, the most common masking method is the addition of Laplace noise. See e.g., Ref 45 for details. When k-anonymity is used as the standard for risk, the most common masking methods are generalization and suppression. See e.g., Refs 88, 89, 50 for details. Note that such methods focus on numerical data for differential privacy and on categorical data for k-anonymity.
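The use of Laplace noise mentioned above can be sketched for the simplest numerical query, a count. Adding or removing one record changes a count by at most 1, so noise drawn from a Laplace distribution with scale 1/ε is the standard calibration for ε-differential privacy. The function names, the record layout, and the sampling method (the difference of two exponential variates) are choices made for this illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Release a count perturbed with Laplace noise calibrated to epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical records; the query counts respondents older than 50.
    people = [{"age": a} for a in (23, 35, 41, 58, 62, 70)]
    print(noisy_count(people, lambda r: r["age"] > 50, epsilon=0.5))
```

A smaller ε gives a larger noise scale, and therefore stronger protection at the cost of a less accurate answer, which is the risk and utility trade-off discussed throughout this section.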

DISCUSSION
For details on the topics presented in this paper, the reader can consult the following books 3,4,8,19 and also the material on the web page. 90 Refs 3, 4, 8 follow an SDC perspective, while Ref 19 follows a PPDM perspective. In addition, Refs 5, 6 focus on some specific topics. Ref 5 is a survey on the use of information fusion techniques in data privacy, mainly focusing on the use of aggregation functions and record linkage techniques. Ref 6 focuses on the use of explicit knowledge (either in the data privacy protection process or in re-identification).
We have discussed the main topics related to data privacy. Although the discussion is general and independent of the type of data used, research is not. Initial research in the field focused on standard databases with either numerical or categorical data. Further research has been done on longitudinal/time-series data, 91 and there are more recent trends on data privacy for (search) logs, [92][93][94][95][96] locations, 97,98 and graphs. 99,100 Research in these areas follows the same lines discussed here. There is research on online social networks that focuses on respondent and owner privacy, while other research focuses on the user's perspective (i.e., user privacy). There are perturbative approaches (e.g., to avoid re-identification) and non-perturbative approaches (e.g., to achieve k-anonymity) for online social networks, and also results on achieving differential privacy in online social networks. Similarly, there are such lines of research in location privacy and in methods for search logs.
In any case, the development of methods has to take into account the specificities of the data. Ignoring them can cause disclosure as an intruder can use such vulnerabilities to attack the data. Some of the scandals 101,102 in privacy have been due to a lack of understanding of these specificities (e.g., different logs from the same person, which alone are not sensitive, can be combined to re-identify this person).
The data privacy research and application field is gaining popularity, and there is a growing community interested in advancing it. There are open issues, and some research areas are especially active: data privacy techniques for very large data sets, including stream data, are becoming important as data processing capabilities rapidly increase. Moreover, the interest of other research areas in data privacy is also becoming very relevant; examples are machine learning and game theory.

CONCLUSION
In this paper we have presented a review of the main techniques related to data privacy. We have presented the main dimensions along which data privacy protection methods can be classified, enumerated some of these methods, and discussed the main concepts in the area.