Beware of scams impersonating Jump Trading Group. We only communicate through our official accounts.
 Firedancer
 Thinking
 Connect
Improving Twiddle Access for Number Theoretic Transforms: The Beginnings
Kaveh Aasaraai
Kevin J Bowers
Emanuele Cesena
Rahul Maganti
Nicolas Stalder
Javier Alejandro Varela
Nov 29 2022 _ 6 min read
Introduction
As blockchains attempt to onboard new users and applications, primitives that can increase transaction throughput while maintaining a high degree of decentralization will play a central role in the evolution of crypto.
Zeroknowledge proofs (ZKPs) represent a new generation of these primitives that are poised to challenge existing blockchain architecture and enable the use of resourceintensive applications and highdemand ecosystems.
Number Theoretic Transforms (NTTs) play a critical role in the generation of ZKPs. Several notable ZK proof systems, including Plonky2 and other STARKbased proof system rely heavily on NTTs. These computations often contribute to a significant portion of the latency related to proof generation.
We focused on NTTs of size \(N = 2^{18}, N = 2^{24}\) over the Goldilocks Field for FPGA (a 64bit field built to help speed up NTT operations on CPUs).
💡 Key Results:
\(2^{18}\) in 1.137 ms
\(2^{24}\) in 101.589 ms
We're excited to present CycloneNTT, a novel architecture for solving for large scale NTTs on FPGA.
The paper and the corresponding code will be released soon but for now, we present the highlevel details and some background on what led us down these different optimization paths.
What is a Number Theoretic Transform?
Background
NTTs are fundamentally a variation of the Fast Fourier Transform (FFT). When FFTs are performed over finite fields, we call them NTTs. FFTs are used extensively across a broad swath of applications, from the wireless antennas in your phone to the A/C unit in your car. Their impact on modern computing is almost unparalleled, having been largely responsible for the birth of digital signal processing. As a result, there have been numerous attempts over the last half century to improve both the asymptotic and concrete efficiency of the FFT.
NTTs are increasingly showing up in blockchain and zeroknowledge applications. Accelerating NTT operations can have a significant impact on proof generation times, helping to address key bottlenecks limiting the widespread use of ZKPs more broadly in Web3.
While NTTs of relatively small sizes have been explored before, computing NTTs over large datasets is a relatively challenging task and a somewhat unexplored topic in current literature.
There are two types of inputs to an NTT of size \(N\) — a vector of coefficients and a vector of twiddles (or roots of unity). If you’re a math geek, you can check out a more indepth explanation of finite fields and roots of unity here. Simply put, a root of unity is a field element that when raised to the power of some positive integer gives you 1.
Structure of an NTT / FFT
From a hardware perspective, an NTT is a sequence of interleaved butterfly operations on the data.
 At each layer, we pass the data through a network of butterflies (more details on this in the following post) but just think of it as some arbitrary operation, or even your favorite mathematical operation.
 The input of each layer depends on the output of the previous layer, so there’s a data dependency — for sake of this post, let’s assume we process one layer at a time.
Figure 1 provides a visual representation of a size8 NTT.
Design
We had three key design goals in mind for CycloneNTT:
 Minimize communication between the CPU / host and the FPGA
 Support large datasets that do not fit into onchip SRAM
 Avoid random access to offchip memory because this is more costly
The performance enhancements we developed stemmed from two main areas:
 Leveraging Algorithmic Improvements for FFT
 Streaming Data from External Memory
Because of the size of the NTT, we couldn’t fit all the points and twiddles directly on the onchip memory (SRAM); we had to use external memory (DRAM) to store all the data.
The highlevel design works as follows:

Initialize
a. Precompute all the twiddles and store in a table
b. Sort them in bitreversed order
c. Send the twiddles over from the host to the FPGA 
Send
a. Send input values to the FPGA 
Compute
a. Run the following loop:
i. Load the twiddles into the input FIFOs from external (DRAM) to onchip memory (SRAM)
ii. Process the twiddles via the subNTT module
iii. Send the results back to the output FIFOs
iv. Jump back to step (i) 
Receive
a. Send values from the FPGA to the host
Modular Reductions
The Goldilocks Field
While our design and architecture is flexible and can be adapted to any field, CycloneNTT is implemented over the Goldilocks Field.
The Goldilocks field is a 64bit field that enables efficient modular reductions on 64bit machines. In particular, computing a single reduction is equivalent to multiplying by a 32bit integer. The details of the structure of this field are outofscope for this document but a more indepth explanation can be found here.
The other advantage of using this field is that it is a relatively small field and helps minimize the amount of storage required to store points and corresponding twiddles.
Architecture
In this section, we discuss the architecture in more depth, including the algorithmic improvements to the FFT we leveraged and our approach to implementing this on hardware.
gNTT: MemoryOptimized NTT
In our search for faster, memoryoptimized FFT algorithms, we ran across a paper entitled Improving Twiddle Access for Fast Fourier Transforms, coincidentally coauthored by none other than Jump’s Chief Science Officer, Kevin Bowers!
The algorithm described in this paper significantly reduces the number of memory accesses necessary to compute an FFT with a small but impactful change. We posted a PR explaining the optimization in more detail to the Plonky2 repo here.
💡 This optimization yielded a 1015% improvement!
Bowers, et al. introduced the gFFT algorithm. Their key observation is that it is possible to rearrange the computation and drastically reduce the number of memory accesses to the twiddle factors by a logarithmic factor! When the input size $N$ is really large, this can a make a big difference in performance.
💡 The gFFT algorithm proposed by Bowers, et al. reduces twiddle accesses from the CooleyTukey O(N log N) to O(N)
Streaming all the Way
For every butterfly computation, we need two numbers / coefficients and one twiddle factor that needs to be read from offchip memory.
Streaming Twiddles
In this architecture, we precompute all the twiddle factors required for the entire datasets. We the stream the twiddles in along with associated points to compute the butterflies.
For each layer, the number of twiddles required is cut in half (e.g. for an NTT of size 8, we need 8 twiddle for the first layer, 4$twiddles for the next layer, and so on).
To ensure that we don’t want random access to memory, we order the twiddles beforehand so that they can be processed one after another.
FIFOs for Combining
A core component of this architecture was developing a hardwareoptimized quasistreaming of the gFFT algorithm. To manage dataflows and memory accesses efficiently, we construct a set of FIFOs / queues.
At each layer, we compute several butterflies in parallel. Instead of pushing the output to back into the same data structure (contiguous piece of memory), we effectively “pingpong” the outputs between two FIFOs so that after computing layer, we can get the inputs for the next layer more easily.
See the diagram below to see how the inputs can be combined.
Note that for a multilayer streaming architecture, this can be extended to multiple FIFOs (number of FIFOs should be even).
More layers
We explored architectures that process multiple layers at at time. Experiments with 1, 3, and 6layers showed that the 3layer approach yields the best latency vs. power tradeoff.
We implemented our architecture both on an AWS F1 platform with DDR (Double Data Rate RAM for external memory and the Xilinx CC1100 board with (High Bandwidth Memory) HBM memory.
We’ll describe the multilayer architecture later in a subsequent post.
Conclusion
Our work on the fundamental aspects of ZK proof generation stems from a desire to not only contribute to a ZK future, but to build a longterm path to sustainability for crypto as a whole.
If you want to talk hardware acceleration for proof systems or ZKs more generally, connect with us on Twitter @rahulmaganti_, @0x0ece, or @nickraystalder!
Share
Stay up to date with the latest from Jump_
More articles
Disclaimer
The information on this website and on the Brick by Brick podcast or Ship Show Twitter spaces is provided for informational, educational, and entertainment purposes only. This information is not intended to be and does not constitute financial advice, investment advice, trading advice, or any other type of advice. You should not make any decision – financial, investment, trading or otherwise – based on any of the information presented here without undertaking your own due diligence and consulting with a financial adviser. Trading, including that of digital assets or cryptocurrency, has potential rewards as well as potential risks involved. Trading may not be suitable for all individuals. Recordings of podcast episodes or Twitter spaces events may be used in the future.