Accelerating Multi-Scalar Multiplication in Hardware… and Software
Javier Alejandro Varela
Zero-Knowledge proofs have many amazing properties, but today, they suffer from a fundamental issue — they’re expensive and difficult to produce. Multi-scalar multiplication (MSM) is a core component of this process and often the primary bottleneck. In this article, we describe our approach to this problem and our efforts in accelerating the path toward a ZK future.
We just published a preprint on CycloneMSM, our FPGA-accelerated implementation of elliptic-curve MSM. Specifically, we focused on accelerating MSM of N=2^26 points on the elliptic curve BLS12-377.
The best software implementation we could find (gnark-crypto v0.6.1) on the fastest server available to us (18-core Intel i9 at 4.8GHz) computes a large N=2^26 MSM in 24s. Our FPGA computes the same result in 5.66s on an AWS F1 instance that uses a ~5-year-old FPGA model, and computes a N=2^22 MSM in under a second. What’s also very interesting, we think, is that the techniques we developed for hardware can also be applied to software, leading to a 10-20% speedup over the existing MSM implementation in gnark-crypto.
<1s for 2^22 size MSM
5.66s for 2^26 size MSM
For all the technical details, including all math and algorithms, we refer to the paper. In this post, we want to provide a bit of the back story behind this result, as well as some high-level pointers to learn more before diving into the research work. The code will be open-sourced soon.
All the techniques we used are pretty well-known in the world of elliptic curve cryptography. You can think of implementing MSM as implementing a multi-layer library, where every layer introduces a new abstraction.
At the top level, we have the bucket method (aka Pippenger algorithm). This can be further split into three phases: bucket accumulation, bucket aggregation, and final result aggregation. Originally, we only wanted to perform bucket accumulation in hardware, and do the aggregations on the host, in parallel threads. That turned out to be too slow, at least on the AWS F1 with only 4 real cores at 2.3GHz. Our final solution includes bucket aggregation in the FPGA in a kind of creative way (see next section).
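The three phases are easiest to see in code. Here is a minimal sketch of the bucket method, with one big hedge: to stay self-contained, the group is modeled by plain integers under addition (so "point addition" is just `+`), while the windowing and bucket logic is the real content. The window width `c` and scalar size are toy parameters of our own choosing.

```python
def bucket_msm(scalars, points, c=4, scalar_bits=16):
    """Toy bucket-method MSM: computes sum(s_i * P_i) with c-bit windows.

    The group is modeled by integers under addition; in a real library
    every `+` below is an elliptic-curve point addition.
    """
    num_windows = (scalar_bits + c - 1) // c
    window_sums = []
    for w in range(num_windows):
        # Phase 1: bucket accumulation -- route each point into the
        # bucket selected by its c-bit digit for this window.
        buckets = [0] * (1 << c)              # bucket 0 stays unused
        for s, p in zip(scalars, points):
            digit = (s >> (w * c)) & ((1 << c) - 1)
            if digit:
                buckets[digit] += p           # group addition
        # Phase 2: bucket aggregation -- sum_k k*B_k via two running sums.
        running, total = 0, 0
        for k in range((1 << c) - 1, 0, -1):
            running += buckets[k]
            total += running
        window_sums.append(total)
    # Phase 3: final aggregation -- combine windows, most significant first.
    acc = 0
    for ws in reversed(window_sums):
        acc = acc * (1 << c) + ws             # c doublings, then one add
    return acc
```

Phase 1 is the part that dominates the work and is what we offload to the FPGA; phases 2 and 3 are cheap by comparison but serialize poorly, which is why moving aggregation off the host mattered.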
At a layer below, we have the arithmetic on the elliptic curve. Because BLS12-377 supports it, we chose to use extended Twisted Edwards coordinates. By incorporating a little trick, we can do a curve addition in just seven field multiplications, which means we need fewer FPGA resources to build the full curve addition pipeline.
Finally, at the lowest layer, there’s the field arithmetic at 377 bits — in particular field multiplication. In software, the best implementation is in Montgomery representation using CIOS. Alternative methods include doing a 377x377 integer multiplication, followed by a Montgomery reduction (which can be done with two more integer multiplications). Integer multiplications can be done by combining the schoolbook, Karatsuba and Toom-Cook methods. Factoring all these options in, we had a lot of combinations to try out.
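To make the "multiply, then Montgomery-reduce" alternative concrete, here is a big-integer sketch of Montgomery multiplication via full-width REDC. This is only the math: a production implementation like gnark-crypto interleaves multiply and reduce word by word (CIOS) in constant time. `P` is the 377-bit BLS12-377 base-field modulus; `R = 2^384` is one 64-bit word above it.

```python
# Montgomery multiplication via full-width REDC (illustration only; a
# real implementation works word by word, in constant time).
P = int(
    "0x01ae3a4617c510eac63b05c06ca1493b1a22d9f300f5138f1ef3622fba0948"
    "00170b5d44300000008508c00000000001", 16)
R_BITS = 384                       # R = 2^384 > P
R = 1 << R_BITS
P_INV_NEG = (-pow(P, -1, R)) % R   # -P^(-1) mod R, precomputed once

def redc(t):
    """Montgomery reduction: returns t * R^(-1) mod P, for t < P*R."""
    m = (t * P_INV_NEG) % R        # only the low half in a word-level version
    u = (t + m * P) >> R_BITS      # t + m*P is divisible by R by construction
    return u - P if u >= P else u

def mont_mul(a, b):
    """Product of a and b, both given in Montgomery form (x*R mod P)."""
    return redc(a * b)
```

Keeping operands in Montgomery form means the division by `R` is a plain bit shift, which is exactly what makes this cheap both in software and in hardware. (The modular inverse via three-argument `pow` needs Python 3.8+.)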
Result #1: Fully Pipelined EC Adder at 250MHz
We built a fully pipelined elliptic-curve adder in extended Twisted Edwards coordinates that runs at 250MHz and uses seven field multipliers.
Let’s unpack the previous sentence, starting with the fully pipelined adder. Fully pipelined is the hardware equivalent of having no loops or jumps in software. Input bits come in at time t0, they progress through the circuit at every clock cycle, and they reach the output at time t0+L, where L is the latency of the pipeline. The big advantage of a fully pipelined adder is that you can start a new point addition at every clock cycle (with a minor caveat that we’re going to discuss later, see if you can spot it).
Now, we mentioned our pipeline runs at 250MHz. This means that we can start a new point addition every 4ns (and get the result after about 100 cycles, i.e. 400ns). In the software world, we’re used to CPUs that run at GHz, and a curve addition probably takes even a bit less than 400ns, let’s say 350ns. But the CPU is busy for the whole time, so to do N curve additions you need N * 350ns. On an FPGA, with a fully pipelined adder, you can input a new addition at every cycle, so the total time is (N + 100) * 4ns =~ N * 4ns, which is about 2 orders of magnitude faster!
Exercise. Slightly simplifying, the bucket method requires computing 16 · 2^26 curve additions. On one FPGA at 250MHz, this should take no less than 16 * 2^26 * 4ns ≈ 4.3s. On an 18-core CPU, assuming perfect parallelization, it should take no less than 16 * 2^26 * 350ns / 18 ≈ 20.88s. There we go: our 5.66s on FPGA vs 24s for gnark on the 18-core Intel i9.
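These back-of-envelope bounds can be reproduced in a few lines (the 350ns software addition cost is the assumption stated above, not a measurement):

```python
# Lower bounds for the exercise above.
N = 2**26
WINDOWS = 16           # windows of the bucket method
FPGA_CYCLE = 4e-9      # 250MHz -> one accepted addition every 4ns
CPU_ADD = 350e-9       # assumed software curve-addition cost
CORES = 18

fpga_lower_bound = WINDOWS * N * FPGA_CYCLE
cpu_lower_bound = WINDOWS * N * CPU_ADD / CORES
print(f"FPGA >= {fpga_lower_bound:.2f}s, CPU >= {cpu_lower_bound:.2f}s")
# -> FPGA >= 4.29s, CPU >= 20.88s
```

The measured 5.66s and 24s sit close to these bounds, which tells us both implementations are reasonably near their respective throughput limits.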
We want to stress that 250MHz is an achievement we’re pretty proud of. In many research papers we see results at 125MHz and claims that performance scales linearly at 250 or even 500MHz. But operating at higher frequencies introduces additional complexities that are by no means trivial! In our experience, a prototype at 125MHz took about one month of work. Increasing to 250MHz took over four months and pretty much required a complete redesign.
Our pipeline implements addition in extended Twisted Edwards coordinates using seven field multipliers, as shown in the figure below.
Naturally, the fewer field operations we use, the simpler the circuit, and the easier it is to run at higher frequencies. We had to try a LOT of combinations, especially at the field multiplier level, to achieve this working design.
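For reference, here is a sketch of unified addition in extended twisted Edwards coordinates (a = −1), following the "add-2008-hwcd-3" formulas from the Explicit-Formulas Database. The toy curve below (p = 13, d = 2) is our own illustration, not BLS12-377. Counting multiplications: with a general second operand this costs 8M plus a multiplication by the constant 2d; if the second point has Z2 = 1 and is stored precomputed as (Y2−X2, Y2+X2, 2d·T2), the same formulas drop to 7 field multiplications — the count used in the pipeline.

```python
# Unified twisted Edwards addition (a = -1) in extended coordinates
# (X, Y, Z, T) with T = X*Y/Z, on the toy curve -x^2 + y^2 = 1 + d*x^2*y^2.
P_MOD = 13   # toy prime; -1 is a square mod 13, as a = -1 requires
D = 2        # toy curve coefficient, a non-square mod 13

def ted_add(p1, p2):
    X1, Y1, Z1, T1 = p1
    X2, Y2, Z2, T2 = p2
    A = (Y1 - X1) * (Y2 - X2) % P_MOD
    B = (Y1 + X1) * (Y2 + X2) % P_MOD
    C = 2 * D * T1 * T2 % P_MOD     # precomputable as (2d*T2) * T1: 1M
    Dv = 2 * Z1 * Z2 % P_MOD        # free when Z2 = 1 (mixed addition)
    E, H = (B - A) % P_MOD, (B + A) % P_MOD
    F, G = (Dv - C) % P_MOD, (Dv + C) % P_MOD
    return (E * F % P_MOD, G * H % P_MOD, F * G % P_MOD, E * H % P_MOD)

def to_affine(p):
    X, Y, Z, _ = p
    zinv = pow(Z, P_MOD - 2, P_MOD)
    return (X * zinv % P_MOD, Y * zinv % P_MOD)
```

Because the formulas are unified, the same circuit handles both additions and doublings — one less special case for the pipeline to worry about.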
One non-obvious constraint we found is the number of buckets we can use, which we had to cap at 2^15. Buckets are stored in SRAM, SRAM blocks are physically distributed across the device surface, and we need to physically wire the inputs and outputs of our pipeline to all these SRAM blocks. As it turns out, that’s a lot of wires! And when we tried 2^16 buckets, we never got a working design at 250MHz.
For more technical details on our FPGA design we refer to the paper.
Result #2: Delayed Scheduler
The adder pipeline can begin a new point addition at every clock cycle, which at 250MHz means every 4ns. The challenge now is to keep the pipeline as busy as possible.
If we look at software libraries, they process the N points in the exact order they are given as input. Not surprisingly, this is not the best approach for hardware. In hardware, we introduced the concept of a scheduler that reorders the points to maximize the usage of the adder pipeline.
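Here is the caveat from earlier: while an addition into bucket b is still in flight in the pipeline, no new addition may target b, or the second addition would read a stale bucket value. The sketch below is a toy cycle-accurate model of this idea, our own illustration rather than the exact hardware logic: points whose bucket is busy are set aside on a delay queue and retried once the bucket frees up.

```python
from collections import deque

def schedule(bucket_ids, latency=8):
    """Toy delayed scheduler: returns (issue order, total cycles).

    `bucket_ids` lists the target bucket of each incoming point, in
    input order. An issued addition occupies its bucket for `latency`
    cycles; conflicting points are delayed and retried later.
    """
    pending = deque(bucket_ids)   # points in input order
    delayed = deque()             # points set aside on a conflict
    busy = {}                     # bucket -> cycle when it frees up
    order, cycle = [], 0
    while pending or delayed:
        busy = {b: t for b, t in busy.items() if t > cycle}  # retire
        if delayed and delayed[0] not in busy:
            b = delayed.popleft()            # retry a delayed point first
        elif pending:
            b = pending.popleft()
            if b in busy:                    # conflict: set aside
                delayed.append(b)
                b = None
        else:
            b = None                         # nothing issuable: bubble
        if b is not None:
            busy[b] = cycle + latency
            order.append(b)
        cycle += 1
    return order, cycle
```

With distinct buckets the pipeline issues one addition per cycle; with repeated buckets, points get reordered around the conflicts, which is exactly the effect we want from the hardware scheduler.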
What is perhaps a bit surprising is that the same scheduler can also solve a software problem: implementing MSM in affine coordinates using batch inversion. This is a technique that’s relatively well known in theory. We see discussions about it in the various libraries as well, yet, to our knowledge, it had never been implemented successfully.
By working with the FPGA first, and with gnark-crypto after, we learned that building an efficient scheduler has its own subtleties. The scheduler itself is not complicated; in fact, it’s the most naive one we could think of. It’s successful because it does the bare minimum to reshuffle points (in particular, it minimizes memory writes on the host), so its impact on the computation time is negligible.
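The ingredient that makes affine-coordinate MSM viable is batch inversion (Montgomery's trick): n field inversions for the price of one inversion plus 3(n−1) multiplications. A minimal sketch over a toy prime field of our choosing:

```python
# Batch inversion (Montgomery's trick) over a toy prime field.
P = 2**13 - 1  # 8191, a toy prime; any prime field works the same way

def batch_inverse(xs):
    """Invert every element of xs (all nonzero) with a single inversion."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):             # prefix[i+1] = x_0 * ... * x_i
        prefix[i + 1] = prefix[i] * x % P
    inv = pow(prefix[n], P - 2, P)         # the one real inversion (Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):         # peel off one factor at a time
        out[i] = prefix[i] * inv % P
        inv = inv * xs[i] % P
    return out
```

The catch is that a whole batch of affine additions must be independent (no two additions touching the same bucket) before their denominators can be inverted together, and that is precisely the guarantee the scheduler provides.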
In the conclusions of the paper, we highlighted a few problems that are still bugging us; let’s add a bit more color here. If you’re interested in any of these, feel free to connect with us: we’re really open to collaborations.
Let’s go faster: largest sub-second.
We showed that our current implementation computes a N=2^22 MSM in sub-second. At least in theory, if we had to compute N=2^23 elliptic curve (EC) additions, times 16 windows, perfectly scheduling 1 addition every 4ns, that’d be 2^23 * 16 * 4ns = 537ms. This suggests that we should be able to compute a N=2^23 MSM in sub-second so… we should do it. By a similar reasoning, a N=2^24 MSM doesn’t seem possible in sub-second, at least as long as the FPGA runs at 250MHz… yet this looks like another very interesting challenge.
Let’s go bigger: N=2^36 and beyond.
The largest MSM we could find published in a research paper is for N=2^35. Computing this MSM took several days, and the authors say they “stopped at these sizes only due to time constraints”. Again, in theory, we should be able to compute a N=2^36 MSM in just a few hours, paying only $1.65/hour for an on-demand F1 instance. Definitely worth a try!
Let’s go smarter.
Last but not least, we’re still a bit dissatisfied with the scheduler. We spent quite some time thinking about efficient schedulers, but at the end of the day, the best one we were actually able to implement in a performant way is the simplest: the delayed scheduler. The main source of inefficiency is memory writes on the host. So, we definitely look forward to achieving a smarter, yet still efficient, scheduler. For this, we don’t even need to touch the hardware: we can first experiment in gnark-crypto or arkworks.