Jean, following is a White Paper for the Workshop on Research Directions for the Next Generation Internet.

Thank you,
Dave Wiltzius
LLNL
wiltzius@llnl.gov
(510) 422-1551

Workshop on Research Directions for the Next Generation Internet

Exploiting Reliable Switched Networks to Gain TCP Performance

March 27, 1997

Dave P. Wiltzius
Lawrence Livermore National Laboratory L-60
Livermore, CA 94550
P: 510.422.1551
F: 510.423.8715
E: wiltzius@llnl.gov

Summary of Proposal

High performance networks are often implemented using reliable media and switching. This is also true of interconnects for massively parallel processors (MPPs) and, more recently, clustered SMPs. Some of these MPP and clustered SMP machines are unable to achieve much of the media speed when using TCP/IP. One reason for this is the TCP checksum, which requires additional computation and typically also additional data movement.

We are proposing that, during packet transmission, the network interface calculate a CRC over the entire UDP/IP or TCP/IP packet and append it to the end of the packet. This CRC-protected packet may then traverse a heterogeneous switched fabric to the receiver. The receiver's network interface would then calculate the CRC and compare it to the CRC calculated by the sender. If they match, the packet has arrived intact and the TCP checksum (and any associated overhead) may be avoided.

Note that as long as the CRC-protected IP packet remains on the switched fabric (even when it crosses to different media, such as from the interconnect to HIPPI), the packet is never modified. Only routing requires that the packet be modified (i.e., the IP header's time-to-live (TTL) field is decremented and the IP header checksum adjusted).
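The append-and-compare step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the proposal does not name a CRC polynomial or trailer width, so the 32-bit CRC from Python's zlib module and the 4-byte big-endian trailer are assumptions, and the function names are hypothetical. In the proposal this work is done by network interface hardware, not host software. The last function shows why a conventional routed hop is a problem: decrementing the TTL changes the packet, so a trailer computed by the sender no longer matches.

```python
import zlib

CRC_LEN = 4  # assumed trailer width; the proposal does not fix a CRC size


def append_crc(packet: bytes) -> bytes:
    """Sender side: compute a CRC over the whole IP packet and append it."""
    crc = zlib.crc32(packet) & 0xFFFFFFFF
    return packet + crc.to_bytes(CRC_LEN, "big")


def crc_intact(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the sender's."""
    if len(frame) < CRC_LEN:
        return False
    packet, trailer = frame[:-CRC_LEN], frame[-CRC_LEN:]
    return (zlib.crc32(packet) & 0xFFFFFFFF) == int.from_bytes(trailer, "big")


def strip_crc(frame: bytes) -> bytes:
    """Remove the trailer to recover the original IP packet."""
    return frame[:-CRC_LEN]


def decrement_ttl(frame: bytes) -> bytes:
    """Mimic a routed hop: the TTL is byte 8 of the IPv4 header.

    The matching header checksum adjustment is omitted for brevity.
    """
    modified = bytearray(frame)
    modified[8] = (modified[8] - 1) & 0xFF
    return bytes(modified)


if __name__ == "__main__":
    # A made-up 20-byte IPv4 header (TTL = 64) followed by a small payload.
    header = bytes([0x45, 0, 0, 28, 0, 0, 0, 0, 64, 17, 0, 0,
                    10, 0, 0, 1, 10, 0, 0, 2])
    frame = append_crc(header + b"payload!")
    print(crc_intact(frame))                 # unmodified on the fabric
    print(crc_intact(decrement_ttl(frame)))  # modified by a routed hop
```

As long as the packet stays on the switched fabric unmodified, the receiver's comparison succeeds and the TCP checksum can be skipped. Once a hop rewrites the header, the trailer would have to be recalculated.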
We are also proposing that routing over a heterogeneous switched fabric avoid the IP header adjustment of TTL and header checksum - a reasonable proposal that will be justified if a subsequent paper is written.

If the receiver does not take advantage of this CRC, then the packet is received and processed just like any other IP packet. Similarly, if the destination requires that the packet leave the heterogeneous switched fabric, then the router would forward it, recalculating the CRC when transmitting, if possible.

Scenarios

A reasonable scenario is as follows: An interior node of an MPP or clustered SMP would send a UDP/IP or TCP/IP packet over the interconnect switched fabric to an I/O node. The I/O node (a special node on the MPP or clustered SMP) would forward the packet onto a HIPPI switched fabric. The packet would then be received by the I/O node of another MPP or clustered SMP, or by a visualization server, or by an archival storage server.

With higher bandwidth WANs, it would be possible to configure a SONET path with IP/SONET at each endpoint. HIPPI switches with an IP/SONET interface could then be used to bridge between geographically distant HIPPI system area networks using IP/SONET. IP/SONET could also be used to bridge between the switched fabrics on two geographically distant LANs.

Motivation

Analysis of TCP/IP overheads abounds in the data communication literature. Still, many vendors fail to implement TCP/IP protocol stacks that perform well on faster media. Some of this may be alleviated with faster processors and more kernel buffer space - not entirely satisfying solutions, since the user is being denied some resources (CPU, memory) to gain access to others (network performance).

There is a large investment in existing networking applications. Hence it is important to users that the TCP/IP application programming interface (API) be preserved. This is even more compelling when commercial software is taken into account.
Additionally, TCP/IP is available on nearly every computer platform. So using TCP/IP is quite attractive, particularly compared to proposing a "lightweight" protocol (which would present another implementation challenge for vendors).

Observations

This proposal for a CRC-protected IP packet results from several observations in the high performance computing arena:

1) Switched networks are often found in high-end computing environments. This includes interconnects (e.g., DEC's Memory Channel, IBM's SP2, SGI's SPIDER), system area networks (e.g., HIPPI-800 and HIPPI-6400, ATM, Fibre Channel), and even wide-area networks (e.g., ATM, and SONET if provisioned paths are included).
2) There is more interest in IP/SONET as wide-area network (WAN) bandwidth demands outpace ATM products, and a corresponding waning of interest in ATM local area networks (LANs) and end-to-end ATM applications. Reasonably sized WAN backbones can be built using routers meshed with SONET paths (running IP/SONET) instead of ATM circuits.
3) Associated with the availability of high performance WANs, there is growing interest in distance computing. In this scenario local and remote computational resources are connected via a high performance WAN.
4) Few platforms achieve high performance with TCP/IP.

We present the following observations pertaining to this CRC-protected IP packet proposal:

1) The CRC would be calculated by the network interface during transmission. Likewise, it would be calculated by the network interface during reception, and a CRC-protected packet would then be placed in a separate "IP input" queue for special handling. Hence the hardware support is in the network interface.
2) CRC-capable hosts would interoperate with hosts that do not participate in this CRC protection scheme.
3) The CRC would save on checksum calculations for all IP payloads - UDP, TCP and others.
4) No changes to the TCP/IP API - and hence applications - will be required (unlike most zero-copy TCP/IP implementations).
5) We envision that modifications to the operating system required to take advantage of this CRC feature would be very reasonable, and probably separate from the "production" TCP/IP stack.
6) Most high performance networks are switched, including WANs.
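The receive-side handling in the list above (a separate "IP input" queue for CRC-protected packets, with ordinary packets processed as usual) can be sketched as follows. Several details here are assumptions rather than part of the proposal. The crc_protected flag stands in for whatever mechanism tells the interface that a trailer is present, which the proposal does not specify. The 32-bit zlib CRC is an assumed choice. Dropping a packet whose CRC does not match is an assumed policy. The class and field names are hypothetical.

```python
import zlib
from collections import deque
from dataclasses import dataclass

CRC_LEN = 4  # assumed trailer width


@dataclass
class ReceivedFrame:
    data: bytes
    crc_protected: bool  # hypothetical flag; detection method is unspecified


class IpInput:
    """Two input queues: the normal stack and the CRC-verified fast path."""

    def __init__(self) -> None:
        self.normal_queue: deque = deque()  # full TCP/UDP checksum applies
        self.crc_queue: deque = deque()     # transport checksum may be skipped
        self.dropped = 0

    def receive(self, frame: ReceivedFrame) -> str:
        if not frame.crc_protected:
            # Non-participating sender: process like any other IP packet.
            self.normal_queue.append(frame.data)
            return "normal"
        if len(frame.data) >= CRC_LEN:
            packet, trailer = frame.data[:-CRC_LEN], frame.data[-CRC_LEN:]
            if zlib.crc32(packet) == int.from_bytes(trailer, "big"):
                self.crc_queue.append(packet)  # trailer stripped
                return "crc"
        # Assumed policy: discard on mismatch and leave recovery to TCP.
        self.dropped += 1
        return "dropped"
```

Keeping the fast path in its own queue reflects the idea that the operating system changes could live apart from the "production" TCP/IP stack, while hosts that do not participate continue to use the normal queue unchanged.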