Firedancer Reliability Efforts
Richard Patel
Dec 05 2022
TL;DR
Backed by decades of experience developing global trading systems operating at far greater scale than public blockchains, Jump seeks to improve Solana network reliability by building an independent validator client. We present Firedancer’s modular architecture and other security practices to enable robust high-throughput validator operation.
Introduction
Solana mainnet block production has stalled on four occasions, requiring manual intervention by hundreds of validator operators to perform a recovery. This slew of outages has called the reliability of the Solana network into question.
However, we believe that the Solana protocol itself does not require any fundamental redesign. Rather, downtime can be attributed to failures in software modules causing consensus issues — some of which have also affected other blockchain networks in the past.
Enter: Firedancer, a fully independent consensus node for the Solana network built by the R&D team led by Kevin Bowers.
Firedancer is built for reliability: it has a modular architecture, minimal dependencies, and an extensive testing plan, and it will operate at a capacity limited only by hardware.
We recognize that this is a difficult effort, but it’s not the first time Jump has built highly-reliable global networks. We have industry-leading experience building best-in-class trading infrastructure in traditional markets that demand performance measured in nanoseconds.
Timeline
To understand Solana network reliability, let's first look at the causes of the four historical outages.
- 2021-09-14: 17h outage caused by a transaction flood (peer-to-peer interface). Postmortem by Jump Crypto
- 2022-04-30: 7h outage caused by a transaction flood (peer-to-peer interface). Postmortem by Solana Foundation
- 2022-06-01: 4.5h outage caused by a chain split bug (execution layer). Postmortem by Solana Foundation
- 2022-09-30: 9h outage caused by a fork choice rule bug (consensus layer). Postmortem by Solana Foundation
One critical non-engineering component has held up, namely validator governance. We find that the validator set is both sufficiently decentralized and responsive: Solana's Nakamoto coefficient stands at 31, and hundreds of validators have collectively coordinated a network recovery in as little as 8 hours.
We present several of these common failure causes and use each to motivate a design decision in Firedancer that addresses the problem.
High Performance Packet Processing
The peer-to-peer interface is made up of transport protocols driving the Solana protocol over physical networks.
A historical weakness of Solana’s peer-to-peer interface is the lack of congestion control for incoming transactions. Transaction floods halted the network on 2021-09-14 (17 hours) and 2022-04-30 (7 hours).
Flow Control
Recent design changes and network upgrades have made the validator more robust against floods of incoming transactions.
Solana’s peer-to-peer interface has been upgraded to adopt QUIC, a modern Internet standard (RFC 9000). Serving as the transport layer of HTTP/3, QUIC also finds widespread adoption across service providers and commercial DDoS protection services.
Many data centers use specialized hardware that mitigates DDoS attacks for common communication protocols like TCP and QUIC. And because QUIC lets the validator apply per-connection flow control, spamming the same transaction no longer increases its probability of confirmation, removing the economic incentive behind transaction floods.
Network Processing
Even with QUIC, Solana aims to scale up to the gigabit per second realm, pushing the existing validator to its limits. If a validator fell behind on packet processing, consensus messages would inevitably get lost.
To give a sense of scale, Solana nodes currently operate at ~0.2 Gbps. The largest recorded spike on a Jump node was just over 40 Gbps.
As such, performance of networking code directly translates into reliability when designing the peer-to-peer interface.
Firedancer introduces a novel message-passing framework for composing an application from highly performant C "tiles". To run faster than standard kernel networking allows, Firedancer bypasses the kernel using AF_XDP, reading packets directly from network interface buffers.
The tile system facilitates various high-performance computing concepts such as NUMA awareness, cache locality optimization, lockless concurrency, and large page sizes. We’ll spare you the technical details (for now), but first results show that Firedancer can reliably accept and filter about 100 Gbps of network traffic on a typical Solana validator server.
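Firedancer's actual tile framework is considerably more elaborate, but the core lockless message-passing idea can be sketched as a single-producer/single-consumer ring buffer. All names below are hypothetical illustrations, not Firedancer APIs:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch of a lockless single-producer/single-consumer
   queue, the kind of primitive a tile pipeline can be built on.
   Capacity is a power of two so index wrap-around is a cheap mask. */
#define QUEUE_CAP 1024u

typedef struct {
  _Atomic uint32_t head;             /* next slot the consumer reads  */
  _Atomic uint32_t tail;             /* next slot the producer writes */
  uint64_t         slots[QUEUE_CAP];
} spsc_queue_t;

static int spsc_push(spsc_queue_t *q, uint64_t msg) {
  uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
  uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
  if (tail - head == QUEUE_CAP) return 0;      /* full: caller drops or retries */
  q->slots[tail & (QUEUE_CAP - 1u)] = msg;
  atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
  return 1;
}

static int spsc_pop(spsc_queue_t *q, uint64_t *out) {
  uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
  uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
  if (head == tail) return 0;                  /* empty */
  *out = q->slots[head & (QUEUE_CAP - 1u)];
  atomic_store_explicit(&q->head, head + 1u, memory_order_release);
  return 1;
}
```

Because each queue has exactly one writer and one reader, no locks are needed; the acquire/release atomics are enough to make messages visible across cores.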
The modular architecture of Firedancer tiles should yield an appreciable improvement in Solana network resiliency.
Modular Architecture
To understand why the tile system is naturally more resilient to failures than a monolithic design, let's take a look at CVE-2021-3449, a critical "0-day" denial-of-service vulnerability in the widely used OpenSSL library that crashed affected TLS servers.
Failure Domains
The Node.js and NGINX web servers were affected in much the same way. However, when the crash payload was sent to both, NGINX resumed serving requests just milliseconds after the crash, while most other web servers went down entirely.
Crash Logs: Node.js vs NGINX
This is owed to NGINX’s multi-process architecture. Like Firedancer (albeit simpler), it runs a root process and multiple subprocesses to handle network traffic. Once a subprocess crashes, it is immediately revived by the root process which is not directly exposed to untrusted inputs.
In Firedancer, robustness hinges on having small, redundant tiles with independent failure domains (or, as programmers call it, a "small blast radius").
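The supervisor pattern behind this resilience can be sketched in a few lines: a root process that never touches untrusted input forks a worker to do the risky work, and revives it if it crashes. This is a minimal illustration, not NGINX's or Firedancer's actual code:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a worker that runs `work` and then exits cleanly. */
static pid_t spawn_worker(void (*work)(void)) {
  pid_t pid = fork();
  if (pid == 0) { work(); _exit(0); }   /* child: do the risky work */
  return pid;                            /* parent: remember the pid */
}

/* Returns 0 once the worker exits cleanly, -1 if the restart
   budget runs out. The root process itself never parses input,
   so a crash in the worker cannot take the whole service down. */
static int supervise(void (*work)(void), int max_restarts) {
  pid_t pid = spawn_worker(work);
  for (;;) {
    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return 0;
    if (max_restarts-- <= 0) return -1;
    fprintf(stderr, "worker crashed, restarting\n");
    pid = spawn_worker(work);            /* revive within milliseconds */
  }
}

/* A stand-in worker that completes without crashing. */
static void demo_work(void) { /* e.g. process one packet batch */ }
```

A real supervisor would also rate-limit restarts and sanitize the workspace the worker left behind, but the isolation principle is the same.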
Zero-Downtime Upgrades
Firedancer is designed to replace and upgrade every running tile within seconds without validator downtime.
Tiles store validator state in workspaces (shared memory objects) which exist as long as a machine is powered on. Upon restarting, each tile simply picks up processing where it left off.
Thanks to binary stability in the C runtime model, this mechanism also works across most software upgrades.
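As a rough illustration of how state survives a process restart, here is a sketch of a reattachable "workspace". Firedancer uses shared memory objects; for simplicity this hypothetical version maps a file instead, but the reattach-and-resume idea is the same:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical workspace: a tiny slab of state that outlives the
   process writing to it, so a restarted tile can pick up where its
   predecessor left off. */
typedef struct { uint64_t last_processed_seq; } workspace_t;

static workspace_t *workspace_attach(const char *path) {
  int fd = open(path, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return NULL;
  if (ftruncate(fd, (off_t)sizeof(workspace_t)) != 0) { close(fd); return NULL; }
  void *mem = mmap(NULL, sizeof(workspace_t), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
  close(fd);                 /* the mapping stays valid after close */
  return mem == MAP_FAILED ? NULL : (workspace_t *)mem;
}
```

A second call to `workspace_attach` (from a restarted process) maps the same backing object and sees the state the first writer left, which is the property that enables restart-without-resync.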
Comparatively, the Rust validator has to fully shut down ahead of upgrades because the lack of ABI stability makes it impossible to implement on-the-fly upgrades in pure Rust. The compiler may arbitrarily change the layout of data structures when the rustc version or internal dependencies change.
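In C, by contrast, struct layout is dictated by the platform ABI and can even be pinned down with compile-time assertions, so two binaries built from the same header agree on every byte. A hypothetical sketch (the struct and its fields are illustrative, not Firedancer's real types):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared-memory record. Because the C ABI fixes the
   layout, an old and a new binary can exchange this struct; the
   static assertions turn any accidental layout change into a build
   error instead of a silent incompatibility. */
typedef struct {
  uint64_t slot;        /* offset 0  */
  uint32_t shred_idx;   /* offset 8  */
  uint32_t flags;       /* offset 12 */
  uint8_t  payload[48]; /* offset 16 */
} shred_hdr_t;

_Static_assert(offsetof(shred_hdr_t, slot)      == 0,  "ABI drift");
_Static_assert(offsetof(shred_hdr_t, shred_idx) == 8,  "ABI drift");
_Static_assert(offsetof(shred_hdr_t, flags)     == 12, "ABI drift");
_Static_assert(offsetof(shred_hdr_t, payload)   == 16, "ABI drift");
_Static_assert(sizeof(shred_hdr_t)              == 64, "ABI drift");
```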
Finally, we aim to drastically reduce the vote downtime caused by validator upgrades, currently around ten minutes.
Runtime Correctness
Our final stop on the road to reliability is the consensus and execution layers. (The lines between consensus and execution are somewhat blurred in Solana, as both live in the same blockchain network.)
Every public blockchain, including Solana, uses a variation of the following steps.
- Ordering: Pack incoming transactions into blocks.
- Consensus: Determine the canonical fork (attestation/voting on PoS or mining on PoW).
- Execution: Apply each transaction and validate resulting state.
It is critical that these steps happen deterministically across the network.
Simply put, every step must have identical outcomes on every node. One of the innovations of public blockchains is the ability to independently verify any data without trust assumptions. So, by design, any diverging behavior causes a chain split.
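A toy model makes the invariant concrete: two nodes that apply the same ordered transactions to the same starting state must end up with bit-identical state, which is what comparing state hashes checks. The types and the FNV-1a hash below are stand-ins for Solana's real accounts state and Merkle hashing:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy ledger state and transaction. */
typedef struct { uint64_t balance[4]; } state_t;
typedef struct { uint8_t from, to; uint64_t amount; } txn_t;

/* Deterministic execution: identical inputs always produce
   identical outputs, including the rejection path. */
static void apply_txn(state_t *s, txn_t t) {
  if (s->balance[t.from] < t.amount) return;  /* reject: insufficient funds */
  s->balance[t.from] -= t.amount;
  s->balance[t.to]   += t.amount;
}

/* FNV-1a over the raw state bytes stands in for a real Merkle hash. */
static uint64_t state_hash(const state_t *s) {
  const uint8_t *p = (const uint8_t *)s;
  uint64_t h = 14695981039346656037ull;
  for (size_t i = 0; i < sizeof *s; i++) { h ^= p[i]; h *= 1099511628211ull; }
  return h;
}
```

If any node's `apply_txn` diverged, even in a rejection edge case, its state hash would disagree with the rest of the network and the chain would split.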
Blockchain systems – from Solana to Tendermint-based chains – have continued to be affected in recent times by bugs in ordering, consensus, and, most commonly, execution (1, 2).
Bitcoin (execution bug in 2013) and Ethereum (execution bugs in 2020, 2021) have had partial outages and caused erroneous transaction confirmations. Block production has continued, however, as Nakamoto consensus-based systems favor availability over consistency.
Other proof-of-stake systems such as Solana do the opposite: when a fork fails to meet the supermajority of stake for an extended period, block production stops.
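This liveness rule reduces to a simple stake check. As a toy illustration (the real fork-choice logic is far more involved), block production can continue only while votes for a fork represent a supermajority, conventionally two thirds, of total stake:

```c
#include <assert.h>
#include <stdint.h>

/* Toy supermajority check: voted/total >= 2/3, kept in integer
   arithmetic to stay exact. Assumes stakes are well below
   UINT64_MAX/3 so the multiplications cannot overflow. */
static int supermajority_reached(uint64_t voted_stake, uint64_t total_stake) {
  return 3u * voted_stake >= 2u * total_stake;
}
```

When no fork clears this threshold for an extended period, a consistency-favoring chain like Solana halts rather than confirm blocks the network might later disagree on.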
Luckily, network outages are preventable. The classes of vulnerabilities are well known and we have enough engineering resources to identify and address them. In fact, most chain split bugs have been caught during testing before they ever reached a live network.
Jump has developed extensive testing frameworks for its core trading infrastructure and looks forward to applying lessons learned to Firedancer.
Testing Roadmap
By far, the biggest security/reliability effort is a rigorous testing roadmap.
Test Networks
Integration testing is crucial when introducing a new validator to mainnet. As such, Jump and Solana Labs will run test networks made up of both validator clients.
While it is impossible to comprehensively mirror the nature of mainnet, we can still test individual properties to analyze safety. Taking inspiration from the successful Ethereum merge procedure, test networks will be subjected to various attacks and failures such as duplicate nodes, failing network links, packet floods, consensus violations, and more.
Just like we test high-frequency trading systems at Jump, these networks are going to be subjected to significantly higher loads than any realistic scenario on mainnet.
Fuzz Testing
We complement manual reviews, fixtures, and real-world tests with automated fuzzing to try to weed out particularly hidden bugs. This process involves running thousands of automated test inputs per second on code targets, then running coherence checks on the execution result. Coverage-guided fuzzing engines such as LLVM libFuzzer and AFL++ optimize for code coverage (i.e. creating a diverse corpus of inputs that reaches as much code as possible).
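As a sketch of what such a target looks like, here is a hypothetical libFuzzer-style entry point for a packet parser. The parser itself is a stand-in; the `LLVMFuzzerTestOneInput` interface is the real libFuzzer convention:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet parser under test. Returns the
   declared payload length on success, -1 on a malformed packet. */
static int parse_packet(const uint8_t *data, size_t size) {
  if (size < 4) return -1;                       /* too short for a header */
  uint32_t len = (uint32_t)data[0] | ((uint32_t)data[1] << 8)
               | ((uint32_t)data[2] << 16) | ((uint32_t)data[3] << 24);
  if (len > size - 4) return -1;                 /* declared length overruns buffer */
  return (int)len;
}

/* libFuzzer calls this entry point with thousands of generated inputs
   per second; any crash, hang, or failed assertion becomes a report. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  int n = parse_packet(data, size);
  /* coherence check: a successful parse never claims more payload
     than the input actually contains */
  assert(n < 0 || n <= (int)size - 4);
  return 0;
}
```

The coverage-guided engine mutates inputs toward unexplored branches, so the out-of-bounds `len` path above would be exercised within seconds even though a human might forget to test it.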
EVMFuzz revealed at least five instances of unsound behavior in Ethereum, proving its effectiveness at finding critical blockchain vulnerabilities. Each of these could have caused a chain split on Ethereum.
The Solana protocol has naturally evolved alongside the validator written by Solana Labs; up to this point, much of the protocol has been implementation-defined.
Firedancer defines fuzzing targets for every component that accepts untrusted user inputs, including the P2P interface (parsers) and the SBF virtual machine. OSS-Fuzz, which integrates nicely with Firedancer's rules_fuzzing build targets, continually maintains fuzz coverage as the code changes. To date, OSS-Fuzz has found over 40,000 bugs in various open-source projects, including Chromium and GCC.
Specifications
We are working to produce specification documents to define the Solana protocol. In the end, one should be able to create a Solana validator just by looking at the documentation, not the Rust validator code.
Aside from helping Firedancer stay compatible in the long run, we hope that comprehensive documentation enables other teams to build their own clients. Ultimately, Solana and the Sealevel VM should grow to become an open standard governed by a diverse community of core contributors.
Supply Chain Security
We want to be confident in every line of code that the validator executes. This requirement forces us to consider not only our own code but also external sources.
In fact, many Node.js, Rust, and even Go projects are known to have large dependency trees which make up the vast majority of source code. This is driven by modern package managers – tools that trivialize sharing and using open-source modules.
Example: Geth, the majority Ethereum execution client, pulls in over 250 other Go modules as dependencies. Meanwhile, the Solana Rust implementation depends on around 750 Rust modules.
# Don't believe us? Verify it!
curl -sS 'https://raw.githubusercontent.com/ethereum/go-ethereum/master/go.sum' | awk '{ print $1 }' | sort -u | wc -l
271
For node software, software supply chain security is critical, as a compromise in a single package affects an entire validator. The main defense is to review and regularly update every external dependency (possibly millions of lines of code for large dependency trees).
Build System
To avoid many of the problems that come with a complex supply chain, Firedancer instead focuses on simplicity and security.
The Firedancer build system is designed around the following set of rules.
- Use a minimal number of external dependencies.
- Treat not just code, but all tools involved in the build process as dependencies.
- Pin every dependency (including code compilers) to an exact version.
- Isolate the system environment from build steps (portability).
Regardless of environment, the build system produces byte-by-byte reproducible outputs.
For those reasons, we chose the Bazel build system, which puts emphasis on hermeticity. Bazel is also trusted by companies like SpaceX and BMW for high-reliability software.
Conclusion
Learning from past network events, we are well-equipped to tackle the various modes of failure moving forward. Would Firedancer have stopped previous outages?
Possibly, but we didn’t design Firedancer by just looking in the rearview mirror. Firedancer is not a short-term patch or a bug fix. While it may not stop every future issue, we believe Firedancer will enhance the resiliency and robustness of the Solana network.
Meanwhile, the research arm of the project is working to support an ultra-high-bandwidth mode of operation using hardware acceleration (FPGAs) and custom networks. Firedancer will push the bounds of computing to offer scale and reliability to the Solana community. We are ready for the future.