OP2P (Open Peer to Peer Interface) Wishbone Aurora Bridge


Category: Communication Controller

Created: November 10, 2011

Updated: January 27, 2020

Language: VHDL

Other project properties

Development Status: Beta

Additional info: FPGA proven, Specification done

WishBone compliant: Yes

WishBone version: n/a

License: LGPL


Open Peer to Peer Interface, Wishbone to Aurora Bridge (OP2P).
This interface logic has been designed to provide a very high-performance, multi-lane, multi-gigabit, fully non-transparent (independent address spaces), peer-to-peer (no master/slave or root-complex/endpoint relationships) communication link in which the rest of the communication stack is implemented in hardware. It can be used for both cable and backplane links. The aim of the project is to provide a network-like, high-bandwidth, flexible, serial-I/O-based replacement for originally PCI-based multi-processor and storage systems.
The destination of a transaction is specified with a 16-bit ID, which is made up of a 10-bit chassis ID and a 6-bit slot ID. For point-to-point cable links we can use ID=0, which means the packet is intended for the device receiving it (the link partner). The OP2P protocol supports multi-hop mesh topologies where not every card has a direct connection to every other card (as it would in a full mesh), and a device receiving a packet with a non-matching destination ID will forward the packet on another port to reach the intended recipient. This is called distributed switching: no switch cards are needed in the system/network. The system or network can be backplane-based, cable-based, or a mixture of the two. There are similarities with PCI-express in the way packets are handled, but without the limitations on speed and on the number of non-transparent ports per device, and without the master-slave relationships. There are also similarities with Ethernet, but without the excessive software overhead and the inflexibility of link width and speed.
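The addressing and forwarding rule described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the VHDL implementation: the bit ordering (chassis ID in the upper 10 bits, slot ID in the lower 6 bits) and the function names are assumptions made for illustration, since the text above does not specify them.

```python
# Behavioural sketch of OP2P destination IDs (not the actual VHDL logic).
# ASSUMPTION: chassis ID sits in the upper 10 bits, slot ID in the lower 6.
CHASSIS_BITS = 10
SLOT_BITS = 6
LINK_PARTNER_ID = 0  # ID=0: the packet is for whichever device receives it


def pack_id(chassis: int, slot: int) -> int:
    """Combine a 10-bit chassis ID and a 6-bit slot ID into a 16-bit ID."""
    if not 0 <= chassis < (1 << CHASSIS_BITS):
        raise ValueError("chassis ID must fit in 10 bits")
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot ID must fit in 6 bits")
    return (chassis << SLOT_BITS) | slot


def unpack_id(dest_id: int) -> tuple[int, int]:
    """Split a 16-bit ID back into (chassis, slot)."""
    return dest_id >> SLOT_BITS, dest_id & ((1 << SLOT_BITS) - 1)


def route(dest_id: int, own_id: int) -> str:
    """Decide what a receiving port does with a packet.

    'accept'  : the packet is for this device (matching ID, or ID=0).
    'forward' : the ID does not match, so the packet is sent out on
                another port towards the intended recipient.
    """
    if dest_id == LINK_PARTNER_ID or dest_id == own_id:
        return "accept"
    return "forward"


if __name__ == "__main__":
    me = pack_id(chassis=3, slot=5)
    print(hex(me), unpack_id(me))
    print(route(LINK_PARTNER_ID, me), route(me, me), route(pack_id(3, 6), me))
```

The choice of outgoing port for a forwarded packet (the routing table of a multi-port bridge) is deliberately left out, because it is not described here.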
This IP core is only one port. It implements a higher (transaction) layer of the communication stack, while the lower (physical) layer is implemented inside the Xilinx Aurora interface IP (using various types of the Xilinx multi-gigabit serial transceivers) generated in the Xilinx CoreGenerator program. The OP2P interface was developed to provide a low-latency, low-software-overhead board-to-board communication interface. It is basically a “Buffer-Copy” interface: it copies data from a DRAM memory buffer on one board to a memory buffer on another board, initiated by a command which specifies the address locations within both the source and the target buffers. The buffers should be memory mapped within the system address spaces of the boards independently (PCI/PCIe devices).
It is based on PCI-express, with certain modifications: all ports are non-transparent and peer-to-peer, and the interface supports packet forwarding in indirect mesh connections without intervention from the on-board system processor (usually a high-performance x86 processor such as Intel Core-x, Xeon…). This interface cannot be used to replace a master-peripheral type PCI system, since it requires more intelligence in a peripheral card, it is not compatible with the PCI Plug&Play BIOS/software, and all of its ports are non-transparent.
The host (x86) processor does not read/write data directly from/to the OP2P port. Instead, it issues a command (by filling 5 FIFOs with transaction parameters) that lets the OP2P port logic move the data from/to the local DRAM buffer. A complete bridge/switch (FPGA chip logic) would consist of multiple OP2P ports with a local DRAM buffer, and the host (x86 processor) would read/write that DRAM buffer directly instead of reading/writing the ports.

About the Xilinx Aurora Interface IP Core:
This core is used as the physical layer logic for the OP2P interface. It implements packet frame generation/detection, flow control, error detection and link initialization. The clocking architecture used is plesiochronous, meaning that the reference clock signal does not have to be distributed to each device; instead, the devices use clock compensation packets inserted into the data stream.
Electrical Characteristics: The actual silicon hardware electrical interface is implemented using the Xilinx FPGA's built-in multi-gigabit serial transceivers. The OP2P electrical characteristics are therefore the same as the Xilinx transceiver characteristics. These parameters are documented in the chosen FPGA device's datasheets and Characterization Report documents, all of which are available from the Xilinx website. Xilinx normally characterizes its transceivers against interface standards like SATA/PCIe, against electrical standards like the various OIF (Optical Internetworking Forum) CEI (Common Electrical I/O) documents, and sometimes against media like CAT-5/6 UTP and other cables. Different Xilinx FPGAs have different types of built-in transceivers, for example the GTP, GTX and GTH, each with its own maximum speed limit. As of Q4 2011, the FPGA built-in transceivers have maximum limits ranging from 3.1 Gbit/s to 28 Gbit/s on every differential pair.

This code is a proof of concept only. The data bandwidth is limited, since the on-chip data buses and processing are 32 bits wide. The Spartan-6 (-2 speed grade) device is able to run the interface with an on-chip parallel bus at around 100 MHz, but since the valid PLL settings do not allow a parallel bus speed in this range, the on-chip logic runs at 75 MHz with a serial line rate of 1.5 Gbit/s. On a newer 7-series FPGA (Kintex-7, Virtex-7), it would run faster.
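As a rough cross-check of these numbers, the short Python calculation below computes the theoretical bandwidth of the 32-bit, 75 MHz on-chip bus and compares it with the payload rate of the x2 serial link at 1.5 Gbit/s per lane. The 8b/10b line coding (8 data bits carried in 10 line bits) is an assumption, not something stated in this description; the function names are illustrative only.

```python
# Back-of-the-envelope check of the proof-of-concept clocking figures.
BUS_WIDTH_BITS = 32            # on-chip data bus width
BUS_CLOCK_HZ = 75_000_000      # on-chip logic clock
LINE_RATE_BPS = 1_500_000_000  # serial line rate per lane
LANES = 2                      # x2 link used for initial debugging


def bus_bandwidth_bps(width_bits: int, clock_hz: int) -> int:
    """Theoretical parallel-bus bandwidth: one full word per clock cycle."""
    return width_bits * clock_hz


def serial_payload_bps(line_rate_bps: int, lanes: int,
                       data_bits: int = 8, coded_bits: int = 10) -> int:
    """Payload rate of the serial link after line coding.

    ASSUMPTION: 8b/10b coding, i.e. 8 payload bits per 10 line bits.
    """
    return line_rate_bps * lanes * data_bits // coded_bits


if __name__ == "__main__":
    bus = bus_bandwidth_bps(BUS_WIDTH_BITS, BUS_CLOCK_HZ)
    link = serial_payload_bps(LINE_RATE_BPS, LANES)
    print(f"parallel bus: {bus / 1e9:.1f} Gbit/s")
    print(f"serial link : {link / 1e9:.1f} Gbit/s")
```

Under that coding assumption the two figures coincide at 2.4 Gbit/s, which is consistent with the 75 MHz bus clock being tied to the 1.5 Gbit/s line rate.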
For simplicity, this code drives the on-chip Wishbone parallel bus with one dword per transaction instead of using bursts, which limits the data bandwidth. For production boards it will therefore have to be implemented with burst support. The burst logic has to divide the incoming/outgoing packet payload data into smaller Wishbone-bus bursts, up to a maximum specified size. For example, a 1 Kbyte (256-Dword) packet would be loaded onto the Wishbone bus in eight consecutive 32-Dword (128-byte) bursts. This would increase the on-chip parallel bus performance to 60-80% of the theoretical bandwidth (based on the width and clock frequency) of the parallel bus. Without burst support, the real bandwidth is about 10-25% of the theoretical limit. Implementing bursts in the existing VHDL code should be simpler than increasing the bus width, although it would still require many changes and a lot of debugging with the ChipScope Pro logic analyser.
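The burst splitting described above, and the effect of the quoted efficiency ranges on a 32-bit, 75 MHz bus, can be sketched in Python as follows. This is a model of the arithmetic only, not of the Wishbone signalling; the function names and the (offset, length) representation of a burst are hypothetical.

```python
# Model of splitting a packet payload into Wishbone bursts (arithmetic only).
DWORD_BYTES = 4


def split_into_bursts(payload_dwords: int, max_burst_dwords: int = 32):
    """Return a list of (start_offset, length) pairs, both in dwords.

    Every burst is at most max_burst_dwords long; the last one may be shorter.
    """
    if payload_dwords < 0 or max_burst_dwords <= 0:
        raise ValueError("sizes must be positive")
    bursts = []
    offset = 0
    while offset < payload_dwords:
        length = min(max_burst_dwords, payload_dwords - offset)
        bursts.append((offset, length))
        offset += length
    return bursts


def effective_mbytes_per_s(width_bits: int, clock_hz: int, percent: int) -> int:
    """Effective bandwidth in MB/s at a given percentage of the theoretical rate."""
    theoretical_bytes_per_s = width_bits * clock_hz // 8
    return theoretical_bytes_per_s * percent // 100 // 1_000_000


if __name__ == "__main__":
    bursts = split_into_bursts(256)  # the 1 Kbyte example packet
    print(len(bursts), "bursts of", bursts[0][1] * DWORD_BYTES, "bytes")
    for pct in (10, 25, 60, 80):
        print(f"{pct}% -> {effective_mbytes_per_s(32, 75_000_000, pct)} MB/s")
```

For the 32-bit bus at 75 MHz (300 MB/s theoretical), the quoted ranges correspond to roughly 30-75 MB/s without bursts and 180-240 MB/s with bursts.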

Target Devices:
The Xilinx Spartan-6 LXT XC6SLX45T device (for a 1.5 Gbit/s x2 link using CAT-6 UTP) was used for initial debugging. Theoretically, the core is suitable for the series-5/6/7 FPGAs. The aurora-5.x cores have the same TRN interface that was used in the reference design. The newer version of the core has only an AXI-bus interface, which would require a partial redesign of the OP2P core. For series-7 FPGAs, CoreGenerator only allows the latest Aurora cores, with an AXI parallel interface and 64/128-bit buses, which are incompatible with the current design of the OP2P core. To use the interface on these series-7 FPGAs we would need a 128-bit bus anyway, to avoid an on-chip performance bottleneck and to take advantage of the available serial I/O bandwidth with x4 serial ports. An optimal device for backplane applications could be the Xilinx Kintex XC7K355T-2FFG901 (10.3 Gbit/s 4x4, for 6U VPX) or the Kintex XC7K160T-2FFG676 (10.3 Gbit/s 4x1, for 3U VPX), but any Xilinx Kintex or Virtex series FPGA would be suitable.
This design includes the Aurora interface core, which was generated by the Xilinx CoreGenerator. All the VHD files were copied here, including the ones from the "Reference Design" folder. This file is the top level source of the module, and is not generated by CoreGen. Search for all VHD files in all subfolders, then copy them all.
Two files had to be modified:
The last one has its filename and module name also modified, not only the internal logic.

Project status:
The OP2P port IP core is fully functional and was tested on a PCIe card with a Xilinx Spartan-6 FPGA, using a CAT-6 UTP cable at 1.5 Gbit/s. The performance of the on-chip Wishbone bus is limited by the simplified design (no Wishbone burst transactions are implemented). This will have to be improved (by adding Wishbone burst support) before the core is used in a product.

Please check the SVN for the source code, the reference design code and the documentation.
If you are planning to use this core for product development (not a student project), then please drop me an email about it: buenos@opencores.org