Beware of scams impersonating Jump Trading Group. We only communicate through our official accounts.

Firedancer

Terms of Use_Privacy Policy_Disclaimers_

Improving Twiddle Access for Number Theoretic Transforms: The Beginnings

Kaveh Aasaraai

Kevin J Bowers

Emanuele Cesena

Rahul Maganti

Nicolas Stalder

Javier Alejandro Varela

Zero-Knowledge Infrastructure Research

Nov 29 2022 _ 6 min read

Improving Twiddle Access for Number Theoretic Transforms: The Beginnings

Introduction

As blockchains attempt to onboard new users and applications, primitives that can increase transaction throughput while maintaining a high degree of decentralization will play a central role in the evolution of crypto.

Zero-knowledge proofs (ZKPs) represent a new generation of these primitives that are poised to challenge existing blockchain architecture and enable the use of resource-intensive applications and high-demand ecosystems.

Number Theoretic Transforms (NTTs) play a critical role in the generation of ZKPs. Several notable ZK proof systems, including Plonky2 and other STARK-based proof system rely heavily on NTTs. These computations often contribute to a significant portion of the latency related to proof generation.

We focused on NTTs of size $N = 2^{18}, N = 2^{24}$ over the Goldilocks Field for FPGA (a 64-bit field built to help speed up NTT operations on CPUs).

💡 Key Results:
$2^{18}$ in 1.137 ms
$2^{24}$ in 101.589 ms

We're excited to present CycloneNTT, a novel architecture for solving for large scale NTTs on FPGA.

The paper and the corresponding code will be released soon but for now, we present the high-level details and some background on what led us down these different optimization paths.

What is a Number Theoretic Transform?

Background

NTTs are fundamentally a variation of the Fast Fourier Transform (FFT). When FFTs are performed over finite fields, we call them NTTs. FFTs are used extensively across a broad swath of applications, from the wireless antennas in your phone to the A/C unit in your car. Their impact on modern computing is almost unparalleled, having been largely responsible for the birth of digital signal processing. As a result, there have been numerous attempts over the last half century to improve both the asymptotic and concrete efficiency of the FFT.

NTTs are increasingly showing up in blockchain and zero-knowledge applications. Accelerating NTT operations can have a significant impact on proof generation times, helping to address key bottlenecks limiting the widespread use of ZKPs more broadly in Web3.

While NTTs of relatively small sizes have been explored before, computing NTTs over large datasets is a relatively challenging task and a somewhat unexplored topic in current literature.

There are two types of inputs to an NTT of size $N$ — a vector of coefficients and a vector of twiddles (or roots of unity). If you’re a math geek, you can check out a more in-depth explanation of finite fields and roots of unity here. Simply put, a root of unity is a field element that when raised to the power of some positive integer gives you 1.

Structure of an NTT / FFT

From a hardware perspective, an NTT is a sequence of interleaved butterfly operations on the data.

At each layer, we pass the data through a network of butterflies (more details on this in the following post) but just think of it as some arbitrary operation, or even your favorite mathematical operation.
The input of each layer depends on the output of the previous layer, so there’s a data dependency — for sake of this post, let’s assume we process one layer at a time.

Figure 1 provides a visual representation of a size-8 NTT.

Figure 1: diagram of 8-input butterfly network

Design

We had three key design goals in mind for CycloneNTT:

Minimize communication between the CPU / host and the FPGA
Support large datasets that do not fit into on-chip SRAM
Avoid random access to off-chip memory because this is more costly

The performance enhancements we developed stemmed from two main areas:

Leveraging Algorithmic Improvements for FFT
Streaming Data from External Memory

Because of the size of the NTT, we couldn’t fit all the points and twiddles directly on the on-chip memory (SRAM); we had to use external memory (DRAM) to store all the data.

The high-level design works as follows:

Initialize
a. Precompute all the twiddles and store in a table
b. Sort them in bit-reversed order
c. Send the twiddles over from the host to the FPGA
Send
a. Send input values to the FPGA
Compute
a. Run the following loop:
i. Load the twiddles into the input FIFOs from external (DRAM) to on-chip memory (SRAM)
ii. Process the twiddles via the sub-NTT module
iii. Send the results back to the output FIFOs
iv. Jump back to step (i)
Receive
a. Send values from the FPGA to the host

Modular Reductions

The Goldilocks Field
While our design and architecture is flexible and can be adapted to any field, CycloneNTT is implemented over the Goldilocks Field.

The Goldilocks field is a 64-bit field that enables efficient modular reductions on 64-bit machines. In particular, computing a single reduction is equivalent to multiplying by a 32-bit integer. The details of the structure of this field are out-of-scope for this document but a more in-depth explanation can be found here.

The other advantage of using this field is that it is a relatively small field and helps minimize the amount of storage required to store points and corresponding twiddles.

Architecture

In this section, we discuss the architecture in more depth, including the algorithmic improvements to the FFT we leveraged and our approach to implementing this on hardware.

gNTT: Memory-Optimized NTT

In our search for faster, memory-optimized FFT algorithms, we ran across a paper entitled Improving Twiddle Access for Fast Fourier Transforms, coincidentally co-authored by none other than Jump’s Chief Science Officer, Kevin Bowers!

The algorithm described in this paper significantly reduces the number of memory accesses necessary to compute an FFT with a small but impactful change. We posted a PR explaining the optimization in more detail to the Plonky2 repo here.

💡 This optimization yielded a 10-15% improvement!

Bowers, et al. introduced the gFFT algorithm. Their key observation is that it is possible to re-arrange the computation and drastically reduce the number of memory accesses to the twiddle factors by a logarithmic factor! When the input size $N$ is really large, this can a make a big difference in performance.

💡 The gFFT algorithm proposed by Bowers, et al. reduces twiddle accesses from the Cooley-Tukey O(N log N) to O(N)

Streaming all the Way

For every butterfly computation, we need two numbers / coefficients and one twiddle factor that needs to be read from off-chip memory.

Streaming Twiddles

In this architecture, we precompute all the twiddle factors required for the entire datasets. We the stream the twiddles in along with associated points to compute the butterflies.

For each layer, the number of twiddles required is cut in half (e.g. for an NTT of size 8, we need 8 twiddle for the first layer, 4$twiddles for the next layer, and so on).

To ensure that we don’t want random access to memory, we order the twiddles beforehand so that they can be processed one after another.

FIFOs for Combining

A core component of this architecture was developing a hardware-optimized quasi-streaming of the gFFT algorithm. To manage data-flows and memory accesses efficiently, we construct a set of FIFOs / queues.

At each layer, we compute several butterflies in parallel. Instead of pushing the output to back into the same data structure (contiguous piece of memory), we effectively “ping-pong” the outputs between two FIFOs so that after computing layer, we can get the inputs for the next layer more easily.

See the diagram below to see how the inputs can be combined.

Figure 2: FIFO architecture for streaming twiddle factors through the butterfly operator.

Note that for a multi-layer streaming architecture, this can be extended to multiple FIFOs (number of FIFOs should be even).

More layers

We explored architectures that process multiple layers at at time. Experiments with 1, 3, and 6-layers showed that the 3-layer approach yields the best latency vs. power tradeoff.

We implemented our architecture both on an AWS F1 platform with DDR (Double Data Rate RAM for external memory and the Xilinx CC1100 board with (High Bandwidth Memory) HBM memory.

We’ll describe the multi-layer architecture later in a subsequent post.

Conclusion

Our work on the fundamental aspects of ZK proof generation stems from a desire to not only contribute to a ZK future, but to build a long-term path to sustainability for crypto as a whole.

If you want to talk hardware acceleration for proof systems or ZKs more generally, connect with us on Twitter @rahulmaganti_, @0x0ece, or @nickraystalder!

Stay up to date with the latest from Jump_

SAFU: Creating a Standard for Whitehats

Whitehats and DeFi protocols need a shared understanding of security policy. We propose the SAFU - Simple Arrangement for Funding Upload - as a versatile and credible way to let whitehats know what to...

Oct 24 2022 _ 17 min

Shelby: Decentralized Storage Designed to Serve

Samir Islam | @dragonfrond Decentralized, byzantine fault-tolerant technology (blockchain) promised us a shared world computer, unshackled from any single gatekeeper, and accessible to all. This dist...

Jun 24 2025 _ 5 min

Huckleberry: IBC Event Hallucinations

This blog post describes a vulnerability in ibc-go, the reference implementation of the Interblockchain Communication Protocol (IBC) used by most Cosmos blockchains

Sep 06 2023 _ 4 min

Disclaimer

The information on this website and on the Brick by Brick podcast or Ship Show Twitter spaces is provided for informational, educational, and entertainment purposes only. This information is not intended to be and does not constitute financial advice, investment advice, trading advice, or any other type of advice. You should not make any decision – financial, investment, trading or otherwise – based on any of the information presented here without undertaking your own due diligence and consulting with a financial adviser. Trading, including that of digital assets or cryptocurrency, has potential rewards as well as potential risks involved. Trading may not be suitable for all individuals. Recordings of podcast episodes or Twitter spaces events may be used in the future.