@inproceedings{ bro-anonymizer-sigcomm03,
    author = "Ruoming Pang and Vern Paxson",
    title = "A High-level Programming Environment for Packet Trace Anonymization and Transformation",
    booktitle = "ACM SIGCOMM, (Karlsruhe, Germany: 2003)",
    year = "2003",
    url = "www.cs.princeton.edu/~rpang/bro-anonymizer-sigcomm03.pdf" }

This paper presents a tool for anonymizing both packet headers and packet payloads. It does this by reassembling packets to recreate the original application-level data, anonymizing that data, and then converting it back into trace data. The paper is not primarily focused on making traces suitable for testing IDS's, so some modifications to what it describes may be needed for this purpose.

For example, payload reassembly may fail if there is a DoS attack present in the trace which tries to crash a router with malformed fragments that cannot be reassembled correctly. The reassembly step must be able to deal with such packets and preserve them in the trace to make it realistic.
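
The fallback described above can be sketched as follows. This is a minimal illustration in Python, not the paper's implementation: the `Fragment` type and the `try_reassemble` and `process` functions are hypothetical names, and real IP reassembly has more cases (timeouts, duplicate fragments, fragment identification) than are handled here. The point is only that fragments which cannot be reassembled are passed through unchanged instead of being dropped.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    offset: int   # byte offset of this fragment within the datagram
    data: bytes   # fragment payload
    more: bool    # True if more fragments follow (the MF flag)


def try_reassemble(fragments):
    """Return the reassembled payload, or None if the fragments are inconsistent."""
    ordered = sorted(fragments, key=lambda f: f.offset)
    if not ordered:
        return None
    expected = 0
    payload = b""
    for i, frag in enumerate(ordered):
        if frag.offset != expected:
            return None  # gap or overlap between fragments
        is_last = i == len(ordered) - 1
        if frag.more == is_last:
            return None  # MF flag disagrees with the fragment's position
        payload += frag.data
        expected += len(frag.data)
    return payload


def process(fragments):
    """Reassemble if possible; otherwise keep the raw fragments for the output trace."""
    payload = try_reassemble(fragments)
    if payload is None:
        # Assumption: malformed fragments are preserved verbatim so that the
        # attack remains visible to an IDS reading the output trace.
        return ("raw", list(fragments))
    return ("reassembled", payload)
```

Whether preserved fragments still need some header anonymization is a separate question that this sketch does not address.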

Also, when the application-level data is converted back into a trace, packet checksums and lengths are recalculated to give correct values. If the original network traffic is not well-formed, its malformed fields will therefore be lost. This is important in testing IDS's because it affects the measured false positive rate. Moreover, attacks present in the original trace could be lost. For example, IP checksums could be used as a covert channel for communication between compromised DDoS machines. One way to solve this problem is to log every original packet that has an incorrect field value, so that these values can be incorporated into the output trace. However, one positive aspect of making the output trace well-formed is that it removes the risk of OS fingerprinting, which could be used to identify individuals. We could try to remove the problem of OS fingerprinting by keeping the identity of the network unknown, but this may not be very effective since there could be ways to deduce the network, especially if it is known that only a few networks of a certain size allow traces to be taken. Moreover, since for our purpose we use prefix-preserving IP address anonymization, an attacker could match the topology with the real network. The number of hosts in the trace could be scaled down, if there is a way to ensure that the relevant properties of the traffic are preserved.
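
As an illustration of the logging idea, the following Python sketch recomputes an IPv4 header checksum and records a log entry when the stored value differs. The function names and the log record layout are assumptions made for this example; the paper does not define such a log. The checksum itself is the standard one's-complement sum over the 16-bit words of the header with the checksum field set to zero.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over 16-bit words (checksum field must be zeroed)."""
    if len(header) % 2:
        header += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def check_header(index: int, header: bytes, log: list) -> int:
    """Return the correct checksum; append a log entry if the stored one is wrong."""
    stored = struct.unpack("!H", header[10:12])[0]  # bytes 10-11 hold the checksum
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    computed = ipv4_checksum(zeroed)
    if stored != computed:
        # Hypothetical log record: enough to restore the bad value later.
        log.append({"packet": index, "field": "ip_checksum",
                    "original": stored, "correct": computed})
    return computed
```

A later pass could read the log and write the original incorrect values back into the regenerated packets. Because that reintroduces field values chosen by the sender, the fingerprinting concern discussed above would need to be weighed again.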

To counter known-text attacks, they discuss the concept of "knowledge representation," where information is anonymized differently depending on the context. We may not be able to use this since some IDS's look for certain strings in payloads.

The "filter-in" policy that they use is a good idea, since it makes it less likely that private information is accidentally left in the trace. This also allows us to specify strings to be left in the trace which IDS's look for to identify attacks.
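
A filter-in policy with an allow-list of IDS-relevant strings might look like the Python sketch below. This is not the paper's mechanism, which operates on parsed application-level data; here the payload is treated as raw bytes, the `filter_in` function is a hypothetical name, and everything not explicitly allowed is overwritten with a fixed mask byte so that payload length is preserved.

```python
import re


def filter_in(payload: bytes, keep, mask: bytes = b"X") -> bytes:
    """Keep only bytes covered by an allow-listed string; mask everything else."""
    out = bytearray(mask * len(payload))
    keep = [k for k in keep if k]
    if not keep:
        return bytes(out)  # nothing is allowed through
    # Try longer strings first so the longest allowed match wins.
    alternatives = sorted(keep, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(k) for k in alternatives))
    for match in pattern.finditer(payload):
        out[match.start():match.end()] = match.group()
    return bytes(out)
```

For example, `filter_in(b"GET /cgi-bin/phf?user=alice", [b"/cgi-bin/phf"])` keeps the attack path and masks the method and the query string. The default is to hide data, so a string forgotten from the allow-list costs detection accuracy but not privacy.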

The authors address the use of anonymized traces for testing IDS's, and state that whether an attack survives anonymization depends on its characteristics and on how it is detected. It seems that we need to find out specifically which attack characteristics must be preserved. Simply filtering according to the presence of certain strings, which they mention, may not be good enough to preserve attacks, since these strings can be easily changed by attackers. This also depends on whether the IDS works by examining packet fields or by examining behaviors of streams.

In this paper, the traffic was modified as little as possible, and only the contents of packets were changed. Since we have different constraints on what properties of the trace need to be preserved, we can go further and add, remove, or shuffle existing streams of packets, though we may have some constraints on changing the order of streams since an IDS may be stateful.
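
One possible form of stream shuffling is sketched below in Python, under strong simplifying assumptions: each packet is a `(timestamp, flow_id, data)` tuple, flows are independent of one another, and only the start times of whole flows are permuted. The `shuffle_flows` function is a hypothetical name. Packet order and spacing inside each flow are kept, which is the minimum a stateful IDS would need; dependencies between flows (for example a control connection and the data connection it opens) are not handled.

```python
import random
from collections import defaultdict


def shuffle_flows(packets, seed=0):
    """Permute flow start times while preserving timing within each flow."""
    flows = defaultdict(list)
    for ts, flow_id, data in sorted(packets, key=lambda p: p[0]):
        flows[flow_id].append((ts, data))

    starts = [pkts[0][0] for pkts in flows.values()]
    new_starts = starts[:]
    random.Random(seed).shuffle(new_starts)  # seeded for repeatable output

    out = []
    for (flow_id, pkts), new_start in zip(flows.items(), new_starts):
        shift = new_start - pkts[0][0]  # move the whole flow by one offset
        out.extend((ts + shift, flow_id, data) for ts, data in pkts)
    out.sort(key=lambda p: p[0])
    return out
```

Reusing the original set of start times keeps the overall arrival pattern of new flows unchanged, though the load at any given instant can differ because flows of different lengths trade places.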