Why use FPGA?


Author: Hu Wei · Publish Time: 2021-12-15 · Origin: Electronic Theory


This article is reproduced from Electronic Theory (author: Hu Wei).

In recent years, FPGAs have been appearing more and more often. For example, Bitcoin mining uses FPGA-based mining machines, and Microsoft has stated that it will use FPGAs to "replace" CPUs in its data centers. In fact, the FPGA is no stranger to professionals and has long been in wide use, but most people still do not know it well and have many questions about it: What exactly is an FPGA? Why use it? Compared with CPUs, GPUs, and ASICs (dedicated chips), what are its characteristics?

Today, with these questions in hand, let's demystify the FPGA together.


Why use FPGA?

As we all know, Moore's Law for general-purpose processors (CPUs) is in its twilight years, while machine learning and Web services are growing exponentially in scale.

People use custom hardware to accelerate common computing tasks, but the fast-changing industry demands that this custom hardware be reprogrammable to perform new kinds of computing tasks.

The FPGA, or Field Programmable Gate Array, is a hardware-reconfigurable architecture.

FPGAs have been used for many years as a low-volume alternative to application-specific integrated circuits (ASICs). In recent years, however, they have been deployed at scale in the data centers of companies such as Microsoft and Baidu, providing powerful computing capability and sufficient flexibility at the same time.



▲Comparison of performance and flexibility of different architectures


Why is the FPGA fast? Largely because its peers make it look good.

Both the CPU and the GPU are von Neumann architectures: instructions are decoded and executed, and memory is shared. The reason FPGAs are more energy-efficient than CPUs and even GPUs is, in essence, the architectural benefit of having no instructions and no shared memory.

In the von Neumann architecture, because an execution unit (such as a CPU core) may execute arbitrary instructions, it needs instruction memory, a decoder, arithmetic units for the various instructions, and branch/jump handling logic. Because the control logic for an instruction stream is complex, there cannot be too many independent instruction streams. The GPU therefore uses SIMD (Single Instruction, Multiple Data) to let multiple execution units process different data in lockstep; the CPU also supports SIMD instructions.
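The SIMD idea, one instruction stream driving many lanes of data in lockstep, can be sketched in plain Python (a toy illustration; `simd_execute` and its lambdas are hypothetical, not any real ISA):

```python
# SIMD sketch: one instruction stream drives many lanes in lockstep.
# Each lane holds different data, but every lane executes the same op.
def simd_execute(instructions, lanes):
    """Apply each instruction (a function) to every lane's data, in order."""
    for op in instructions:
        lanes = [op(x) for x in lanes]  # all lanes do the same thing at once
    return lanes

# Four lanes processing different data under a single instruction stream.
program = [lambda x: x + 1, lambda x: x * 2]
print(simd_execute(program, [0, 1, 2, 3]))  # → [2, 4, 6, 8]
```

Each lane computed (x + 1) * 2 on its own datum; none of the lanes could branch independently, which is exactly the constraint the text describes.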


The function of each logic unit in an FPGA is determined at reprogramming (configuration) time, so no instructions are needed.

Memory in the von Neumann architecture serves two purposes: saving state, and communication between execution units.

Because memory is shared, access arbitration is required; and to exploit access locality, each execution unit has a private cache, so cache coherence must be maintained among the execution units.


For saving state, the registers and on-chip memory (BRAM) in an FPGA belong to their own control logic, with no unnecessary arbitration or caching.

For communication, the connections between each FPGA logic unit and its neighbors are determined at reprogramming (configuration) time; there is no need to communicate through shared memory.


Enough lofty talk; how does the FPGA actually perform? Let's look at compute-intensive tasks and communication-intensive tasks separately.


Examples of compute-intensive tasks include matrix operations, image processing, machine learning, compression, asymmetric encryption, and Bing search ranking. For such tasks, the CPU generally offloads the work to the FPGA. The Altera (it is apparently called Intel now, but I am still used to calling it Altera...) Stratix V FPGA we currently use has integer multiplication performance roughly equivalent to a 20-core CPU, and floating-point multiplication performance roughly equivalent to an 8-core CPU, but an order of magnitude lower than a GPU. The next-generation FPGA we are about to use, Stratix 10, will carry more multipliers and hardened floating-point units, which in theory could match the computing power of today's top GPU compute cards.



▲FPGA's integer multiplication capability (estimated; DSPs not used, estimated from logic resource usage)



▲FPGA's floating-point multiplication capability (estimated; soft core for float16, hard core for float32)



In the data center, the core advantage of FPGA over GPU is latency.

For tasks like Bing search sorting, to return search results as quickly as possible, it is necessary to reduce the delay of each step as much as possible.

If you use a GPU for acceleration, then to make full use of the GPU's computing power the batch size cannot be too small, and latency will be on the order of milliseconds.


With an FPGA, only microseconds of PCIe latency are needed (our current FPGAs are used as PCIe accelerator cards).

In the future, after Intel launches Xeon + FPGA connected via QPI, the delay between CPU and FPGA can be reduced to less than 100 nanoseconds, which is no different from accessing main memory.


Why does FPGA have so much lower latency than GPU?

This is essentially a difference in architecture.

FPGAs have both pipeline parallelism and data parallelism, whereas GPUs have almost only data parallelism (their pipeline depth is limited).

For example, suppose processing a data packet takes 10 steps. An FPGA can build a 10-stage pipeline in which different stages process different packets; each packet is done after passing through all 10 stages, and as soon as a packet finishes it can be output immediately.


The GPU's data-parallel approach instead builds 10 compute units, each also processing a different packet, but all units must march in lockstep doing the same thing (SIMD, Single Instruction, Multiple Data). This requires 10 packets to be input together and output together, which increases input/output latency.


When tasks arrive one by one rather than in batches, pipeline parallelism can achieve lower latency than data parallelism. Therefore, FPGAs have inherent advantages in latency over GPUs for streaming computing tasks.
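The latency gap between the two styles can be made concrete with a toy timing model (my own illustrative assumptions: each stage or step takes one time unit, and packets arrive one per time unit):

```python
# Toy model: 10-stage pipeline (FPGA-style) vs 10-wide batch (GPU-style).
# Assumption: each stage/step takes 1 time unit; packets arrive 1 per unit.

def pipeline_latency(n_packets, stages=10):
    """Streaming pipeline: packet i enters at time i, exits `stages` later."""
    return [i + stages for i in range(n_packets)]  # completion time per packet

def batch_latency(n_packets, batch=10, steps=10):
    """SIMD batch: wait until `batch` packets have arrived, process together."""
    done = []
    for i in range(n_packets):
        batch_end = ((i // batch) + 1) * batch  # arrival of last packet in batch
        done.append(batch_end + steps)
    return done

# The pipeline finishes the first packet at t=10; the batch engine must wait
# for all 10 packets before starting, so the first finishes only at t=20.
print(pipeline_latency(10)[0], batch_latency(10)[0])  # → 10 20
```

Under these assumptions the pipeline's first result appears twice as fast, and the gap grows with batch size, which is why streaming workloads favor the FPGA.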



▲Compute-intensive tasks: order-of-magnitude comparison of CPU, GPU, FPGA, and ASIC (taking 16-bit integer multiplication as an example; the numbers are only order-of-magnitude estimates)

ASICs are unbeatable in throughput, latency, and power consumption, but Microsoft did not adopt them, for two reasons:

Data center computing tasks are flexible and ever-changing, while ASICs have high R&D costs and long cycles. By the time a batch of accelerator cards for one neural network has finally been deployed at scale, another neural network may have become more popular and the money is wasted. An FPGA can update its logic function in just a few hundred milliseconds; this flexibility protects the investment. In fact, Microsoft's current way of using FPGAs is already very different from the original plan.

A data center is leased to different tenants. If some machines have neural-network accelerator cards, some have Bing search accelerator cards, and some have network-virtualization accelerator cards, task scheduling and server operations and maintenance become very troublesome. Using FPGAs keeps the data center homogeneous.


Next, let's look at communication-intensive tasks.

Compared with compute-intensive tasks, communication-intensive tasks do little processing per input datum; often each datum is output again after only simple processing, so communication itself becomes the bottleneck. Symmetric encryption, firewalls, and network virtualization are all communication-intensive examples.

▲Communication-intensive tasks, the order of magnitude comparison of CPU, GPU, FPGA, ASIC (taking 64-byte network packet processing as an example, the number is only an order of magnitude estimate)


For communication-intensive tasks, FPGAs have greater advantages over CPUs and GPUs.

In terms of throughput, the transceivers on an FPGA can be connected directly to 40 Gbps or even 100 Gbps network cables and process packets of any size at wire speed; a CPU, by contrast, must receive packets from a NIC before processing them, and many NICs cannot handle small 64-byte packets at wire speed. Although high performance can be obtained by installing multiple NICs, the number of PCIe slots a CPU and motherboard support is often limited, and NICs and switches are themselves expensive.
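To see why 64-byte packets at wire speed are so demanding, here is the standard back-of-envelope calculation (on the wire each Ethernet frame carries roughly 20 extra bytes: 8 bytes of preamble/SFD plus a 12-byte inter-frame gap):

```python
# Packet rate a 40 Gbps link demands for minimum-size (64-byte) frames.
def line_rate_pps(link_gbps, frame_bytes, overhead_bytes=20):
    """Packets per second at line rate, counting per-frame wire overhead."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_frame

mpps = line_rate_pps(40, 64) / 1e6
print(f"{mpps:.1f} Mpps")  # → 59.5 Mpps
```

Nearly 60 million packets per second, i.e. under 17 ns per packet, is far beyond what a CPU core can sustain, but a natural fit for a hardware pipeline.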


In terms of latency, the NIC passes packets up to the CPU, and the CPU sends them back to the NIC; even with a high-performance packet-processing framework such as DPDK, the latency is 4 to 5 microseconds. A more serious problem is that a general-purpose CPU's latency is not stable: under high load, for example, the forwarding delay may rise to tens of microseconds or more (as shown in the figure below), and clock interrupts and task scheduling in modern operating systems add further uncertainty.




▲Comparison of forwarding latency between ClickNP (FPGA), a Dell S6000 switch (commercial switch chip), Click+DPDK (CPU), and Linux (CPU); error bars show the 5th and 95th percentiles. Source: [5]

Although a GPU can also process packets at high throughput, a GPU has no network port, so packets must first be received by the NIC before the GPU can touch them. Throughput is therefore limited by the CPU and/or the NIC, to say nothing of the GPU's own latency.


So why not put these network functions into a NIC, or use a programmable switch? Because the flexibility of ASICs is still lacking.

Although programmable switch chips are becoming ever more powerful, such as Tofino with its support for the P4 language, ASICs still cannot do complex stateful processing, for instance a custom encryption algorithm.


In summary, the main advantage of FPGAs in the data center is stable and extremely low latency, which suits both streaming compute-intensive tasks and communication-intensive tasks.


Microsoft's practice of deploying FPGA

In September 2016, Wired magazine published the report "Microsoft bets the future on FPGAs" [3], recounting the past and present of the Catapult project.

Immediately afterwards, Doug Burger, the head of the Catapult project, demonstrated FPGA-accelerated machine translation alongside Microsoft CEO Satya Nadella at the Ignite 2016 conference.


The total computing power demonstrated was 1.03 million T ops/s, i.e. 1.03 Exa-ops/s, equivalent to roughly 100,000 top-of-the-line GPU compute cards. The power consumption of one FPGA (including on-board memory, network interfaces, etc.) is about 30 W, adding only one-tenth to the server's total power consumption.
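A quick sanity check of those figures (the ~10 tera-ops per top GPU card is my own assumption for a circa-2016 card, not a number from the source):

```python
# Unit check: 1.03 million tera-ops/s = 1.03 exa-ops/s, and at an assumed
# ~10 tera-ops per top GPU card that is about 100,000 cards' worth.
total_tops = 1.03e6            # tera-ops per second, as demonstrated
exa_ops = total_tops / 1e6     # 1 exa = 10^6 tera
gpu_tops = 10                  # ASSUMED throughput of one top GPU card
equivalent_gpus = total_tops / gpu_tops
print(exa_ops, int(equivalent_gpus))  # → 1.03 103000
```

So the "equivalent to 100,000 top GPU cards" claim is an order-of-magnitude statement, consistent under this assumption.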


▲Demonstration on Ignite 2016: 1 Exa-op (10^18) machine translation computing power per second


Microsoft's deployment of FPGAs has not been all smooth sailing. The question of where to deploy the FPGAs went through roughly three stages:

(1) A dedicated cluster packed with FPGAs

(2) One FPGA per machine, connected through a dedicated network

(3) One FPGA per machine, placed between the NIC and the switch, sharing the server's network


▲Three stages of Microsoft FPGA deployment

The first stage was a dedicated cluster packed with FPGA accelerator cards, like a supercomputer built out of FPGAs.

The picture below shows the earliest BFB experiment board: 6 FPGAs on one PCIe card, with 4 such PCIe cards inserted into each 1U server.


▲The earliest BFB experiment board with 6 FPGAs on it.

Note the company name. In the semiconductor industry, as long as the volume is large enough, the price of a chip approaches the price of sand. Rumor has it that precisely because that company refused to offer "the price of sand," Microsoft chose the other company.

Of course, both companies' FPGAs are now present in the data center field. As long as the scale is large enough, worries about FPGAs being expensive become unnecessary.


▲The earliest BFB experiment board, with 4 FPGA cards inserted in the 1U server.

This supercomputer-like deployment means a dedicated cabinet full of servers carrying 24 FPGAs each, like the one pictured above (left in the figure below).

There are several problems with this approach:

(1) FPGAs in different machines cannot communicate, so the problem size an FPGA can tackle is limited by the number of FPGAs in a single server;

(2) Other machines in the data center must funnel tasks to this one cabinet, creating in-cast congestion and making network latency hard to stabilize;

(3) The dedicated FPGA cabinet is a single point of failure: if it breaks, nothing gets accelerated;

(4) Servers fitted with FPGAs are custom-built, complicating cooling, operations, and maintenance.



▲Three ways to deploy FPGA, from centralized to distributed.

A less radical approach is to deploy one server full of FPGAs on one side of each cabinet (middle of the figure above). This avoids problems (2) and (3) above, but still does not solve (1) and (4).


In the second stage, to preserve the homogeneity of the servers in the data center (also an important reason for not using ASICs), one FPGA was inserted into each server (right in the figure above), and the FPGAs were connected through a dedicated network. This is the deployment described in Microsoft's ISCA'14 paper.



▲Open Compute Server is in the rack.



▲Internal view of Open Compute Server. The red box is where the FPGA is placed.



▲Open Compute Server after inserting FPGA.



▲The connection and fixation between FPGA and Open Compute Server.


The FPGA used is an Altera Stratix V D5, with 172K ALMs, 2,014 M20K on-chip memory blocks, and 1,590 DSP blocks. The board carries one 8 GB DDR3-1333 memory, one PCIe Gen3 x8 interface, and two 10 Gbps network interfaces. The FPGAs within a cabinet are connected through a dedicated network: one group of 10G ports is wired into rings of 8, the other group of 10G ports into rings of 6, with no switch used.
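The ring wiring can be sketched programmatically (an illustrative reconstruction of the topology described above, assuming 48 FPGAs per cabinet; the exact Catapult cabling may differ):

```python
# Sketch of the intra-rack FPGA network: one 10G port per FPGA wired into
# rings of 8, the other into rings of 6, with no switch in between.
def ring_links(n_nodes, group):
    """Return point-to-point links when nodes are wired in rings of `group`."""
    links = []
    for start in range(0, n_nodes, group):
        members = list(range(start, start + group))
        for i, a in enumerate(members):
            links.append((a, members[(i + 1) % group]))  # next neighbor in ring
    return links

rings_of_8 = ring_links(48, 8)   # 6 rings of 8
rings_of_6 = ring_links(48, 6)   # 8 rings of 6
print(len(rings_of_8), len(rings_of_6))  # → 48 48
```

Each ring of k nodes needs exactly k cables, so both groupings use 48 links per cabinet; the two ring sizes give each FPGA two disjoint neighborhoods without any switch ports.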



▲Network connection mode between FPGAs in the cabinet.


Such a cluster of 1,632 servers and 1,632 FPGAs doubled the overall performance of Bing's search-result ranking (in other words, it saved half the servers).

As shown in the figure below, every 8 FPGAs are threaded into a chain, communicating over the aforementioned 10 Gbps dedicated network. The 8 FPGAs each play their own role: some extract features from documents (yellow), some compute feature expressions (green), and some compute document scores (red).




▲FPGA accelerates Bing's search and sorting process.



▲FPGA not only reduces the delay of Bing search, but also significantly improves the stability of the delay.



▲ Both the local and remote FPGA can reduce the search delay, and the communication delay of the remote FPGA is negligible compared to the search delay.



The deployment of FPGA in Bing was successful, and the Catapult project continued to expand within the company.

Within Microsoft, the cloud computing Azure department has the most servers.

The problem the Azure department urgently needed to solve was the cost of network and storage virtualization. Azure sells virtual machines to customers and must provide network functions such as firewalls, load balancing, tunneling, and NAT for each VM's network. Because cloud storage is physically separate from the compute nodes, data must also travel over the network from storage nodes, being compressed and encrypted along the way.

In the era of 1 Gbps networks and mechanical hard drives, the CPU overhead of network and storage virtualization was negligible. But as networks reach 40 Gbps and a single SSD's throughput reaches 1 GB/s, the CPU gradually becomes too weak.

For example, a Hyper-V virtual switch can only handle about 25 Gbps of traffic and cannot reach 40 Gbps line speed, and its performance is even worse with small packets; for AES-256 encryption and SHA-1 signing, each CPU core can only handle 100 MB/s, one-tenth the throughput of an SSD.
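The arithmetic behind "the CPU gradually becomes too weak" is simple and worth spelling out, using the figures just quoted:

```python
# How many CPU cores does it take to encrypt+sign a 40 Gbps stream at
# ~100 MB/s per core (the per-core AES-256 + SHA-1 figure quoted above)?
def cores_needed(link_gbps, mb_per_s_per_core):
    """CPU cores required to keep up with a link at a given per-core rate."""
    bytes_per_s = link_gbps * 1e9 / 8
    return bytes_per_s / (mb_per_s_per_core * 1e6)

print(cores_needed(40, 100))  # → 50.0
```

Fifty cores just for crypto on one 40 Gbps link is more than most servers have in total, which is why the data plane had to leave the CPU.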



▲The number of CPU cores required for the network tunnel protocol and firewall to process 40 Gbps.


In order to accelerate network functions and storage virtualization, Microsoft deployed FPGAs between the network card and the switch.

As shown in the figure below, each FPGA has 4 GB of DDR3-1333 DRAM and is connected to its CPU socket through two PCIe Gen3 x8 interfaces (physically a PCIe Gen3 x16 slot, but since the FPGA has no x16 hard core it is logically treated as two x8). The physical network card (NIC) is an ordinary 40 Gbps card, used only for communication between the host and the network.
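Bandwidth-wise, the "two x8 instead of one x16" arrangement gives up nothing, which a quick calculation shows (PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding; the per-direction raw figure below ignores protocol overheads):

```python
# Raw per-direction bandwidth: two logical x8 links vs one x16 link.
def pcie_gen3_gbps(lanes):
    """Usable Gbps per direction for PCIe Gen3 (8 GT/s, 128b/130b encoding)."""
    return lanes * 8 * (128 / 130)

two_x8 = 2 * pcie_gen3_gbps(8)
one_x16 = pcie_gen3_gbps(16)
print(round(two_x8, 1), round(one_x16, 1))  # → 126.0 126.0
```

About 126 Gbps either way, so the split is purely a consequence of the FPGA lacking an x16 hard core, not a bandwidth compromise.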




▲Azure server deployment FPGA architecture.



The FPGA (a SmartNIC) virtualizes a network card for each virtual machine, and the VM accesses this virtual NIC directly through SR-IOV. The data-plane functions originally in the virtual switch are moved into the FPGA, so the VM needs neither the CPU nor the physical NIC to send and receive packets. This not only frees up CPU resources for sale but also improves VM network throughput (to 25 Gbps) and reduces network latency between VMs in the same data center by a factor of 10.





▲Accelerated architecture of network virtualization. Source: [6]


This is the third-generation architecture used by Microsoft to deploy FPGAs, and it is also the architecture currently adopted for large-scale deployments of "one FPGA per server".

The original motivation for having the FPGA reuse the host network was to accelerate networking and storage; the more far-reaching effect was to extend the network connecting the FPGAs to the scale of the entire data center, making it a genuinely cloud-scale "supercomputer."


In the second-generation architecture, the network connections between FPGAs were confined to a single rack: a private FPGA interconnect is hard to scale up, and forwarding through the CPU costs too much.


In the third-generation architecture, FPGAs communicate through LTL (Lightweight Transport Layer). Latency within a rack is under 3 microseconds; within 8 microseconds an FPGA can reach 1,000 FPGAs; and within 20 microseconds it can reach every FPGA in the same data center. Although the second-generation architecture has lower latency within a group of 8 machines, it can reach only 48 FPGAs over its network. To support such wide-ranging inter-FPGA communication, LTL in the third-generation architecture also supports the PFC flow-control protocol and the DCQCN congestion-control protocol.



▲Vertical axis: LTL delay, horizontal axis: the number of FPGAs that can be reached. Source: [4]



▲The logic modules inside the FPGA: each Role is user logic (such as DNN acceleration, network-function acceleration, or encryption), and the surrounding logic handles communication among Roles and between Roles and peripherals. Source: [4]





▲The data center acceleration plane formed by FPGA is between the network switching layer (TOR, L1, L2) and traditional server software (software running on the CPU). Source: [4]


FPGAs interconnected through a high-bandwidth, low-latency network form a data center acceleration plane between the network switching layer and traditional server software.


In addition to the network and storage virtualization acceleration required by every server that provides cloud services, the remaining resources on the FPGA can also be used to accelerate Bing search, deep neural network (DNN) and other computing tasks.


For many types of applications, as the distributed FPGA accelerator scales out, the performance improvement is superlinear.

Take CNN inference as an example. With only one FPGA, the on-chip memory cannot hold the whole model, so the model weights must be fetched from DRAM again and again, and the performance bottleneck is DRAM. If there are enough FPGAs, each FPGA can take one layer of the model, or a few features within a layer, so that the model weights fit entirely in on-chip memory. This eliminates the DRAM bottleneck and lets the FPGA's compute units run at full tilt.


Of course, splitting too finely increases communication overhead. The key to partitioning tasks across a distributed FPGA cluster is balancing computation against communication.
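The DRAM-bottleneck argument reduces to a capacity question: how many FPGAs until the weights fit on-chip? A back-of-envelope sketch (the model size and per-chip BRAM figures below are my own illustrative assumptions, not measured Catapult numbers):

```python
# How many FPGAs before a model's weights fit entirely in on-chip memory?
import math

def fpgas_needed(model_mb, bram_mb_per_fpga):
    """Minimum FPGA count so each shard of the model fits in one chip's BRAM."""
    return math.ceil(model_mb / bram_mb_per_fpga)

model_weights_mb = 100   # ASSUMED total CNN weight size
bram_per_fpga_mb = 8     # ASSUMED usable on-chip memory per FPGA
print(fpgas_needed(model_weights_mb, bram_per_fpga_mb))  # → 13
```

Below that count, every inference pays the DRAM round trip; at or above it, the bottleneck disappears, which is where the superlinear scaling comes from, until inter-FPGA communication overhead takes over.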



▲From neural network model to FPGA on HaaS. Using the parallelism within the model, different layers and features of the model are mapped to different FPGAs. Source: [4]

At the MICRO'16 conference, Microsoft proposed the concept of Hardware as a Service (HaaS): hardware as a schedulable cloud service, making centralized scheduling, management, and large-scale deployment of FPGA services possible.






▲Hardware as a Service (HaaS). Source: [4]

From the first generation's dedicated server clusters packed with FPGAs, to the second generation's FPGA accelerator-card clusters connected by a dedicated network, to today's large-scale FPGA cloud that reuses the data center network, three ideas have guided our route:

Hardware and software are not a mutual substitution relationship, but a cooperative relationship;

Must have flexibility, that is, the ability to be defined by software;


Must have scalability (scalability).

The role of FPGA in cloud computing

Finally, I will talk about my personal thinking about the role of FPGA in cloud computing. As a third-year doctoral student, my research at Microsoft Research Asia tried to answer two questions:

What role should FPGA play in a cloud-scale network interconnection system?

How to program the FPGA + CPU heterogeneous system efficiently and expandably?

My main regret about the FPGA industry is that the mainstream way of using FPGAs in data centers, from the Internet giants other than Microsoft to the two major FPGA vendors and academia, is to treat the FPGA as a compute-intensive accelerator card, like a GPU. But is the FPGA really suited to doing a GPU's job?

As mentioned earlier, the biggest difference between FPGA and GPU lies in the architecture. FPGA is more suitable for streaming processing that requires low latency, and GPU is more suitable for processing large batches of homogeneous data.


Because so many people plan to use FPGAs as compute accelerator cards, the high-level programming models released by the two major FPGA vendors are also based on OpenCL, imitating the GPU's shared-memory batch-processing model. For the CPU to give the FPGA a task, it first puts the data into the DRAM on the FPGA board and then tells the FPGA to start; the FPGA writes the result back into DRAM and then notifies the CPU to fetch it.


The CPU and FPGA could communicate efficiently over PCIe directly, so why the detour through on-board DRAM? Perhaps it is an engineering issue. We found that writing DRAM, starting the kernel, and reading DRAM back through OpenCL takes 1.8 milliseconds, whereas communicating via PCIe DMA takes only 1 to 2 microseconds.
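In round numbers, the gap just quoted spans roughly three orders of magnitude:

```python
# Ratio of the two round-trip latencies quoted above.
opencl_roundtrip_us = 1800   # write DRAM + start kernel + read DRAM, via OpenCL
pcie_dma_us = 1.5            # midpoint of the 1-2 us PCIe DMA range quoted
print(round(opencl_roundtrip_us / pcie_dma_us))  # → 1200
```

A 1,200x penalty for the DRAM detour is exactly the kind of overhead that erases the FPGA's latency advantage.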

Communication between multiple kernels in OpenCL is even worse: the default mechanism there is also shared memory.

As this article noted at the beginning, the fundamental architectural reason FPGAs are more energy-efficient than CPUs and GPUs is that they have no instructions and no shared memory. Using shared memory for inter-kernel communication is unnecessary when the communication is sequential (a FIFO suffices). Moreover, the DRAM on an FPGA board is generally much slower than the DRAM on a GPU.


Therefore we proposed the ClickNP network programming framework [5], which uses channels instead of shared memory for communication between execution units (elements/kernels) and between execution units and host software.


Applications that need shared memory can also be implemented on top of pipelines; after all, CSP (Communicating Sequential Processes) and shared memory are theoretically equivalent. ClickNP is currently a framework built on OpenCL and is constrained by using C to describe hardware (though HLS is indeed far more productive than Verilog). The ideal hardware description language will probably not be C.
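The channel idea behind ClickNP can be sketched in the CSP style using plain Python threads and queues (an analogy only; real ClickNP elements compile to hardware pipelines, and the element names here are hypothetical):

```python
# CSP-style sketch: elements communicate through FIFO channels, not shared
# memory. Each element owns its state and only exchanges messages.
import threading
from queue import Queue

def element_double(inp, out):
    """Element: doubles each value and forwards it downstream."""
    while True:
        x = inp.get()
        if x is None:            # end-of-stream marker
            out.put(None)
            return
        out.put(x * 2)           # forward over the channel, no shared state

def element_sum(inp, results):
    """Element: accumulates the stream, emits one result at end-of-stream."""
    total = 0
    while True:
        x = inp.get()
        if x is None:
            results.append(total)
            return
        total += x

c1, c2, results = Queue(), Queue(), []
threads = [threading.Thread(target=element_double, args=(c1, c2)),
           threading.Thread(target=element_sum, args=(c2, results))]
for t in threads:
    t.start()
for v in [1, 2, 3]:
    c1.put(v)
c1.put(None)
for t in threads:
    t.join()
print(results)  # → [12]  (each value doubled, then summed)
```

Note that neither element ever reads the other's state; all coordination, including shutdown, travels through the channels, which is what lets such elements map onto independent hardware pipelines connected by FIFOs.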




▲ClickNP uses channels to communicate between elements. Source: [5]



▲ClickNP uses channels to communicate between FPGA and CPU. Source: [5]



For low-latency streaming processing, communication matters most.

However, because of limited parallelism and operating-system scheduling, the CPU is inefficient at communication and its latency is unstable.

Moreover, communication inevitably involves scheduling and arbitration. The CPU's scheduling and arbitration performance is limited by single-core speed and inefficient inter-core communication, whereas hardware is very well suited to this kind of repetitive work. My PhD research therefore casts the FPGA as the "grand steward" of communication: whether between servers, between virtual machines, between processes, or between the CPU and storage devices, all communication can be accelerated by the FPGA.


What makes the FPGA strong also makes it weak: the absence of instructions is at once its advantage and its handicap.

Every distinct operation occupies some amount of FPGA logic resources. If the tasks are complex and not repetitive, they will occupy a great deal of logic, much of it sitting idle; in that case a von Neumann processor is the better choice.


Many data center tasks exhibit strong locality and repetitiveness: one part is the networking and storage required by the virtualization platform, which is communication; the other part lies in customers' computing tasks, such as machine learning and encryption/decryption.


We first use FPGAs for the communication they are best at. In the future, FPGAs may also be leased to customers as compute accelerator cards, as AWS does.

Whether for communication, machine learning, or encryption/decryption, the algorithms are complicated. Trying to replace the CPU entirely with FPGAs would inevitably waste a great deal of FPGA logic and raise development costs. A more practical approach is for FPGA and CPU to work together: the parts with strong locality and repetitiveness go to the FPGA, and the complicated parts go to the CPU.


As we use FPGAs to accelerate more and more services such as Bing search and deep learning, as the data planes of basic components such as network virtualization and storage virtualization come under FPGA control, and as the "data center acceleration plane" formed by FPGAs becomes the sky between the network and the servers... it seems the FPGA will take charge, with the computing tasks on the CPU becoming fragmented and driven by the FPGA. In the past we were CPU-centric, offloading repetitive computing tasks to the FPGA; in the future, will we become FPGA-centric, offloading complex computing tasks to the CPU? With the advent of Xeon + FPGA, will the ancient SoC find new youth in the data center?

"Across the memory wall and reach a fully programmable world."


References:

[1] Large-Scale Reconfigurable Computing in a Microsoft Datacenter https://www.microsoft.com/en-us/research/wp-content/uploads/2014/06/HC26.12.520-Recon-Fabric-Pulnam-Microsoft-Catapult.pdf

[2] A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA'14 https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf

[3] Microsoft Has a Whole New Kind of Computer Chip—and It’ll Change Everything

[4] A Cloud-Scale Acceleration Architecture, MICRO'16 https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf

[5] ClickNP: Highly Flexible and High-performance Network Processing with Reconfigurable Hardware - Microsoft Research

[6] Daniel Firestone, SmartNIC: Accelerating Azure's Network with FPGAs on OCS servers.








