Papers on Packet Trace Sanitization:
------------------------------------
@article{ xu-design,
    author = "Jun Xu and Jinliang Fan and Mostafa Ammar and Sue B. Moon",
    title = "On the Design and Performance of Prefix-Preserving IP Traffic Trace Anonymization",
    journal = "Internet Measurement Workshop (San Francisco, CA, USA: 2001)",
    pages = "263--266",
    year = "2001",
    url = "citeseer.nj.nec.com/462352.html" }
This paper focuses on the problem of prefix-preserving IP anonymization, in which IP addresses with the same k-bit prefix are mapped to anonymized addresses that share the same k-bit prefix. If some anonymized addresses have been compromised such that their raw addresses are known, then for any anonymized address y, if the longest match between y and a compromised address is k bits long, the first k+1 bits of y are revealed. This is because the (k+1)th bit of y cannot be mapped to the same bit as the (k+1)th bit of the compromised address. So the highest security possible is to make the remaining (32-k-1) bits completely random. They introduce two metrics in order to plot the amount of information leaked from a trace as the number of compromised addresses increases. The first metric is U, the sum of all unknown bits. If there are N unique addresses in the trace, U is 32*N if no addresses have been compromised. However, since we are preserving prefixes, certain bits of different IP addresses must be the same. If we only count these bits once, this gives us the second metric, C, the compressed number of unknown bits. Breaking up U into U1, U2, U3, and U4 corresponding to the number of unknown bits for the first 8 bits, next 8 bits, and so on, they show that U4, U3, U2, U1 drop at increasingly faster rates.

They then describe how TCPdpriv works. As it anonymizes a trace, it stores pairs of raw and anonymized addresses. To anonymize a new raw address, it goes through all of the existing pairs to find the longest match, and then creates the anonymized address by using the same prefix, and a pseudo-random generator for the rest of the bits. There are three disadvantages, corresponding to three ways researchers may want to collect and anonymize traces. First, for a large amount of trace data, we may want a router to spit out anonymized packets during real-time, but the method used by TCPdpriv requires too much memory for this to be feasible. Secondly, we may want different traces to be anonymized in parallel, such that the same raw IP address that appears in both traces would be consistently mapped to the same anonymized address. However, this is not possible unless traces are processed serially, since the anonymization depends on the order that addresses appear in the trace. Thirdly, if our trace is very large, we cannot break it up and process it in parallel for the same reason above.

The authors come up with a cryptography-based approach to resolve the drawbacks of TCPdpriv. It removes the dependency on the order that IP addresses appear in the trace by using a function. The prefixes are still vulnerable if some addresses have been compromised as before, but the rest of the bits are scrambled, and their randomness is as good as the cryptographic strength of the hash functions used as part of the randomization. This is good enough to ensure security from computationally constrained adversaries, and is the best security that can be offered given that prefixes must be preserved. It seems that this method is more computationally expensive, but is more efficient since large amounts of data can be processed in parallel.


@article{
author = "Jun Xu and Jinliang Fan and Mostafa Ammar and Sue B. Moon",
title = "Prefix-Preserving IP Address Anonymization: Measurement-based Security Evaluation and a New Cryptography-based Scheme",
journal = "ICNP 2002",
year = "2002",
url="ipmon.sprint.com/pubs_trs/pubs/ sbmoon/icnp02_security.pdf" }

This paper is a revised version of the paper above. In this paper, the authors explain more thoroughly the geometric interpretation of prefix-preserving anonymization in the form of a binary anonymized address tree. A prefix-preserving anonymization function can be viewed as specifying a binary variable for each non-leaf node, which tells us whether the function "flips" the original bit or not. They go on to prove a theorum that shows that this method of prefix-preserving anonymization is the only method. They improve on the previous version of this paper by looking at possible attacks that their method could be vulnerable to, which falls into two categories: cryptographic-based attacks in which the cryptographic key is broken, and semantic-based attacks in which information can be inferred without breaking the key. To measure the effects of attacks, they use the following metrics: number of unknown compressed bits C, number of unknown uncompressed bits U, and the number of addresses with exactly i known bits, Fi. They show the results of compromising addresses randomly and with a greedy algorithm. They then look at the effects of two "obvious" forms of semantic attacks: frequency analysis and compromising all DNS server addresses. Using frequency analysis, it is possible to infer the IP addresses of popular web sites since they occur frequently in the trace. They find that frequency analysis tends to affect the most significant bits, and so does not seem to be a serious threat to the trace they used. IP addresses of DNS servers can be inferred due to their hierarchical structure, assuming that the attacker has sufficient knowledge of the DNS server hierarchy. It would seem that DNS server addresses are a large enough fraction of total addresses such that if all the DNS server addresses were compromized, the anonymization would break. However, DNS server addresses are not random. Compromising addresses at random is much more effective than if they are not, so fewer addresses would be revealed with this attack than we might expect. Other semantic attacks that they mention for further research are active attacks, such as the injection of packets into a network during a trace, port scanning which can help attackers recognize anonymized hosts that appear in a trace, and inference from routing table information, which is sometimes anonymized and released together with the anonymized trace.

See also Crypto-PAN, the implementation of this algorithm
http://www.cc.gatech.edu/computing/Telecomm/cryptopan/


@article{ peuhkuri-method,
    author = "Markus Peuhkuri",
    title = "A Method to Compress and Anonymize Packet     Traces",
    journal = "Internet Measurement Workshop (San Francisco, California, USA: 2001)",
    pages = "257--261",
    year = "2001",
    ISBN = "1-58113-435-5"
    url = "citeseer.nj.nec.com/462175.html" }
This paper addresses two problems associated with using packet traces: large packet volume and privacy issues. The problem of large packet volume is dealt with by using a compression scheme at the level of traffic flows.

In the compression scheme, the following fields of the IP header are used to identify a flow: source address, destination address, protocol no., and port no.(for TCP, UDP packets only). The combination of these fields is hashed to a value used to index a table. If the hash value is not already in the table, a new entry is created, and a new flow record is written to the output stream.

Privacy is addressed by using a cryptographic-based approach, which is similar to the paper above, but seems to be more complex, with an additional lookup table and database needed. It first takes a maximum 1024 byte security key, whose sole purpose is to be used by MD5 to generate a 129-bit Blowfish key. Every time an IP address is seen in a packet, in the IP header or elsewhere, it is scrambled. To do this, a table is consulted to find the "host part", the number of bits to be scrambled, while the "network part" is left unencrypted. The "original IP address" (unencrypted 32 bits?) is concatenated with a "32-bit block of the key" (probably the Blowfish key, not clear which of the 128 bits is used) and encrypted into a 64-bit value, referred to as the "encrypted value", using Blowfish in electronic codebook mode, and the low part of the encrypted value is used as an index to the hash table. If there is no entry at the location or the encrypted value is different, a record is written into the packet output stream with the top 24 bits of real address and encrypted value, total 96 bits (not sure how they get this number, since 24+64=88). Each IP address in a datagram is replaced with a value, which has the topmost 8 bits from the original address and the lowest ceiling(log2(tablesize)) bits from the encrypted value. When the compressed and scrambled stream is expanded, the encrypted values are used as a key to the database where the random IP addresses within each network are generated. If there is no database entry for the encrypted value, a new free random IP address within the network is selected and used as a replacement for the encrypted value.

This scheme could have been described more clearly by assigning names, so that it is clear what is being referred to, since there are so many keys and encrypted values. Also, we are not given any insight into the reasoning behind the steps of the scheme, or at least the general idea of why it works. There does not appear to be software that implements this anonymization scheme from my searches on the web.


@inproceedings{ paxson97why,
    author = "Vern Paxson and Sally Floyd",
    title = "Why We Don't Know How To Simulate the Internet",
    booktitle = "Winter Simulation Conference",
    pages = "1037-1044",
    year = "1997",
    url = "citeseer.nj.nec.com/paxson97why.html" }
Better to use Source-level trace-driven simulation if in a different context, due to adaptive congestion control.


A High-Level Programming Environment for Packet Trace Anonymization and Transformation 
Ruoming Pang (Princeton University), Vern Paxson (ICSI)
To appear in ACM SIGCOMM 2003, Karlsruhe (Germany, August 2003)
http://www.cs.princeton.edu/~rpang/cv0305.txt
(Not yet available on the web)

@article{ author = "Jeffrey Mogul", title = "Trace anonymization misses the point", abstract = "Presentation during WWW 2002 panel on Web Measurements" year = "2002", url = "www2002.org/presentations/mogul-n.pdf" }
This is a presentation on the problems associated with trace anonymization. Two main problems are finding a balance between usable traces and protecting privacy, and verifying that the anonymizer does not have errors. A solution that is proposed is to "move the code to the data", where code is shipped to the ISP, which executes the code. The code and results would then be placed on the Internet Traffic Archive. Only the results would be returned, and not the anonymized inputs. Common Log Format and similar formats have certain problems: they leave out information such as timestamps, lack first-class support for proxies and caches, and are hard to parse. It seems that whether this approach can be used depends on whether ISP's and corporations have any motivation to cooperate with researchers.


@article{ california-proceedings,
author = "Simon Patarin and Mesaac Makpangou",
title = "Pandora: A Flexible Network Monitoring Platform",
journal = "Proceedings of 2000 USENIX Annual Technical Conference (San Diego, CA, USA, June 18-23, 2000)",
year = "2000",
url = "citeseer.nj.nec.com/551208.html" }

This paper describes Pandora, a network monitoring platform that captures packets using passive techniques, and is able to process all packets in real-time. Pandora is designed with the aim of being flexible and extensible. It is designed as a stack of monitoring components, each handling a smaller task, as well as control components. Pandora preserves user privacy by allowing the flexibility of controlling the anonymization policy. The authors did not focus on how they chose to implement anonymization in Pandora.