Pitfalls to Avoid
1. We cannot assume that the ratio of signal (attack traffic) to noise (background traffic and response of systems to the attack) is known.
2. We cannot assume that there will be a period of normal traffic that can be sampled by the defense mechanism before an attack starts.
3. Testing must be carried out consistently, so it is not a good idea to allow developers to adapt the data used for testing arbitrarily.
4. A single set of test data cannot guarantee accurate results, since the sample size is too small.
5. If synthesized data is used, it must be validated using appropriate metrics to prove that it is a realistic model of the actual network.
6. In modelling normal background traffic, we must take into account that real network traffic is not well-behaved due to poor implementations of network protocols.
7. We need to consider data rate and how it varies over time.
8. Our model of attack traffic must be realistic relative to the actual network with regard to the proportion of total traffic that is attack traffic and the mix of different attacks.
9. The topology of our test bed must be representative of the actual network topology.
10. We must be able to show that the artificial environment did not bias the results.
11. The taxonomy of attacks we use should be from the perspective of the defense mechanism, not the attacker. Attacks could be categorized according to the layer in the protocol stack, the protocol used, or whether a protocol handshake must be completed in order to carry out the attack.
12. The method used to analyze the results must be justified by taking into account its underlying assumptions, as well as any biases it may exhibit.
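13. We must try to avoid the unit of analysis problem: rates such as the false positive rate change depending on whether they are counted per packet, per connection, per session, or per attack, so the unit must be chosen deliberately and stated.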
14. We need to come up with a reasonable, consistent, and clear measure of what constitutes a satisfactory false positive rate.
15. Other objectives of evaluating defense mechanisms are to develop an understanding of how specific defense mechanisms have been designed, as well as to be able to compare classes of defense mechanisms which work on the same principles. An automated method of testing for known vulnerabilities may not be well-suited to fulfilling these objectives, so we may need to develop a taxonomy of defense mechanisms in addition to our taxonomy of attacks.
16. We must know the security policy of the network, since some activities such as scanning can be a precursor to both legitimate and illegitimate activities. Other activities may be only associated with attacks, but whether the initial activity itself is allowed depends on how strict the security policy is. Clearly defining what is considered an attack will help us measure the false positive rate more accurately.
17. Evaluations of how well our tests predict what happens in real networks will be useful in determining how good our own testing methodology is. Also, using a cost-benefit analysis to weigh the costs and overhead of using our testing methodology against the benefits derived from it will tell us if we have succeeded in making the testing framework practical enough to be used on a large scale.
18. The ROC method of plotting the true positive rate against the false positive rate does not tell us where the weakness lies in the defense mechanism, so this method and others similar to it should be avoided.
19. It would be worthwhile to create a program to automate the generation of different kinds of attacks, which could be incorporated into our testing methodology.
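
Items 13 and 14 can be illustrated with a small sketch showing how the same detector output yields different false positive rates depending on the unit of analysis. Everything here is an assumption made for illustration: the trace in PACKETS is invented, the function names are hypothetical, and the rule that a connection counts as attacked or alerted if any of its packets is so marked is only one possible convention.

```python
from collections import defaultdict


def false_positive_rate(labels, alerts):
    """Return false positives divided by the number of benign units.

    labels[i] is True if unit i is an attack; alerts[i] is True if the
    defense mechanism raised an alert on unit i.
    """
    benign_alerts = [alert for label, alert in zip(labels, alerts) if not label]
    if not benign_alerts:
        raise ValueError("no benign units to measure against")
    return sum(benign_alerts) / len(benign_alerts)


def per_connection(packets):
    """Collapse per-packet records into per-connection labels and alerts.

    Assumed convention: a connection is an attack if any of its packets
    is, and is alerted if any of its packets raised an alert.
    """
    is_attack = defaultdict(bool)
    alerted = defaultdict(bool)
    for conn_id, attack, alert in packets:
        is_attack[conn_id] |= attack
        alerted[conn_id] |= alert
    ids = sorted(is_attack)
    return [is_attack[i] for i in ids], [alerted[i] for i in ids]


# Hypothetical trace: (connection id, is_attack, alert_raised) per packet.
PACKETS = [
    ("c1", False, False), ("c1", False, False),
    ("c1", False, True), ("c1", False, False),
    ("c2", False, False), ("c2", False, False),
    ("c3", True, True), ("c3", True, False),
]

# Unit of analysis: the packet (6 benign packets, 1 false alert).
packet_fpr = false_positive_rate(
    [p[1] for p in PACKETS], [p[2] for p in PACKETS]
)

# Unit of analysis: the connection (2 benign connections, 1 falsely alerted).
conn_labels, conn_alerts = per_connection(PACKETS)
connection_fpr = false_positive_rate(conn_labels, conn_alerts)

print(f"per-packet FPR:     {packet_fpr:.3f}")
print(f"per-connection FPR: {connection_fpr:.3f}")
```

On this made-up trace, one spurious alert gives a false positive rate of 1/6 when counted per packet and 1/2 when counted per connection. This is why a target false positive rate means little unless the unit it is counted over is stated alongside it.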