
Firedancer Reliability Efforts

Richard Patel

Dec 05 · 9 min read


TL;DR

Backed by decades of experience developing global trading systems operating at far greater scale than public blockchains, Jump seeks to improve Solana network reliability by building an independent validator client. We present Firedancer’s modular architecture and other security practices to enable robust high-throughput validator operation.

Introduction

Solana mainnet block production has stalled on four occasions, each requiring manual intervention by hundreds of validator operators to perform a recovery. This slew of outages has called the reliability of the Solana network into question.

However, we believe that the Solana protocol itself does not require any fundamental redesign. Rather, downtime can be attributed to failures in software modules causing consensus issues — some of which have also affected other blockchain networks in the past.

Enter: Firedancer, a fully independent consensus node for the Solana network built by the R&D team led by Kevin Bowers.

Firedancer is built to be reliable, with a modular architecture, minimal dependencies, an extensive testing plan, and operation at a capacity limited only by hardware.

We recognize that this is a difficult effort, but it’s not the first time Jump has built highly-reliable global networks. We have industry-leading experience building best-in-class trading infrastructure in traditional markets that demand performance measured in nanoseconds.

Timeline

To understand Solana network reliability, let’s first look at the reasons for the four historical outages.

One critical non-engineering component has held up, namely validator governance. We find that the validator set is both sufficiently decentralized and responsive: Solana’s Nakamoto coefficient stands strong at 31, and hundreds of validators have collectively coordinated a network recovery in just 8 hours.

We present several of these common failure causes and use each to motivate a design decision in Firedancer that addresses the problem.

High Performance Packet Processing

The peer-to-peer interface is made up of transport protocols driving the Solana protocol over physical networks.

A historical weakness of Solana’s peer-to-peer interface is the lack of congestion control for incoming transactions. Transaction floods halted the network on 2021-09-14 (17 hours) and 2022-04-30 (7 hours).

Flow Control

Recent design changes and network upgrades have made the validator more robust against floods of incoming transactions.

Solana’s peer-to-peer interface has been upgraded to adopt QUIC, a modern Internet standard (RFC 9000). Serving as the transport layer of HTTP/3, QUIC also finds widespread adoption across service providers and commercial DDoS protection services.

Many data centers use specialized hardware that mitigates DDoS attacks for common communication protocols like TCP and QUIC. Spamming the same transaction will not increase the probability of confirmation with QUIC, thereby removing the economic incentive behind transaction floods.

Network Processing

Even with QUIC, Solana aims to scale up to the gigabit per second realm, pushing the existing validator to its limits. If a validator fell behind on packet processing, consensus messages would inevitably get lost.

To give a sense of scale, Solana nodes currently operate at ~0.2 Gbps. The largest recorded spike on a Jump node was just over 40 Gbps.

As such, performance of networking code directly translates into reliability when designing the peer-to-peer interface.

Firedancer introduces a novel message-passing framework for composing an application from highly performant C “tiles” – faster than the standard Linux kernel networking stack allows. To achieve this, Firedancer bypasses kernel networking using AF_XDP, reading packets directly from network interface buffers.

Multiple C tiles reading off of network interface buffers.

The tile system facilitates various high-performance computing concepts such as NUMA awareness, cache locality optimization, lockless concurrency, and large page sizes. We’ll spare you the technical details (for now), but first results show that Firedancer can reliably accept and filter about 100 Gbps of network traffic on a typical Solana validator server.

The modular architecture of Firedancer tiles should yield an appreciable improvement in Solana network resiliency.

Modular Architecture

💡
Unlike the Solana Rust validator which runs as a single process, Firedancer consists of many individual C processes called tiles.

To understand why the tile system is naturally more resilient to failures compared to a monolithic design, let’s take a look at CVE-2021-3449, a critical “0-day” vulnerability in the commonly-used OpenSSL server.

Failure Domains

The Node.js and NGINX web server implementations were affected in much the same way. However, when the crash payload was sent to both, NGINX continued to serve requests just milliseconds after the crash, whereas Node.js – like most other web servers – crashed outright.

Crash Logs: Node.js vs NGINX

$ docker logs -f cve-2021-3449-node &
server started
sending initial ClientHello
connected
sending malicious ClientHello
malicious handshake failed, exploit might have worked: EOF

Thread 1 "node" received signal SIGSEGV, Segmentation fault.
Container stopped.
The Node.js container stops and requires a restart.
$ docker logs -f cve-2021-3449-nginx &
sending initial ClientHello
connected
sending malicious ClientHello
malicious handshake failed, exploit might have worked: EOF

2021/03/27 03:24:40 [alert] 7: worker process 8 exited on signal 11 (core dumped)

# Still alive?
sending another ClientHello
connected
NGINX logs a crash but still responds to requests.

This is owed to NGINX’s multi-process architecture. Like Firedancer (albeit simpler), it runs a root process and multiple subprocesses to handle network traffic. Once a subprocess crashes, it is immediately revived by the root process which is not directly exposed to untrusted inputs.

In Firedancer, robustness hinges on having small, redundant tiles with independent failure domains (or as programmers call it, a ”small blast radius”).

Zero-Downtime Upgrades

Firedancer is designed to replace and upgrade every running tile within seconds without validator downtime.

Tiles store validator state in workspaces (shared memory objects) which persist as long as the machine is powered on. Upon restarting, each tile simply picks up processing where it left off.

Thanks to binary stability in the C runtime model, this mechanism also works across most software upgrades.

💡
What is an application binary interface (ABI)?
The ABI is the binary interface that system processes adhere to, similar to the APIs provided by network services.

Comparatively, the Rust validator has to fully shut down ahead of upgrades because lack of ABI stability makes it impossible to implement on-the-fly upgrades in pure Rust. The compiler might arbitrarily change the layout of data structures when changing the rustc version or internal dependencies.

Finally, we aim to drastically reduce the vote downtime caused by validator upgrades, currently around ten minutes.

Runtime Correctness

Our final stop on the road to reliability is the consensus and execution layers. (The lines between consensus and execution are somewhat blurred in Solana, as both live within the same blockchain network.)

Every public blockchain, including Solana, uses a variation of the following steps.

  1. Ordering: Pack incoming transactions into blocks.
  2. Consensus: Determine the canonical fork (attestation/voting on PoS or mining on PoW).
  3. Execution: Apply each transaction and validate resulting state.

It is critical that these steps happen deterministically across the network.

Simply put, every step must have identical outcomes on every node. One of the innovations of public blockchains is the ability to independently verify any data without trust assumptions. So, by design, any diverging behavior causes a chain split.

Blockchain systems – from Solana to Tendermint-based chains – have continued to be affected by bugs in ordering, consensus, and, most commonly, execution (1, 2).

Bitcoin (an execution bug in 2013) and Ethereum (execution bugs in 2020 and 2021) have suffered partial outages and erroneous transaction confirmations. Block production continued, however, as Nakamoto-consensus-based systems favor availability over consistency.

Other proof-of-stake systems such as Solana do the opposite: when no fork gathers a supermajority of stake for an extended period, block production stops.

👉
Refer to our writeup from last year on safety vs liveness.

Luckily, network outages are preventable. The classes of vulnerabilities are well known and we have enough engineering resources to identify and address them. In fact, most chain split bugs have been caught during testing before they ever reached a live network.

Jump has developed extensive testing frameworks for its core trading infrastructure and looks forward to applying lessons learned to Firedancer.

Testing Roadmap

By far, the biggest security/reliability effort is a rigorous testing roadmap.

Test Networks

Integration testing is crucial when introducing a new validator to mainnet. As such, Jump and Solana Labs will run test networks made up of both validator clients.

While it is impossible to comprehensively mirror the nature of mainnet, we can still test individual properties to analyze safety. Taking inspiration from the successful Ethereum merge procedure, test networks will be subjected to various attacks and failures such as duplicate nodes, failing network links, packet floods, consensus violations, and more.

Just like we test high-frequency trading systems at Jump, these networks are going to be subjected to significantly higher loads than any realistic scenario on mainnet.

Fuzz Testing

We complement manual reviews, fixtures, and real-world tests with automated fuzzing to try to weed out particularly hidden bugs. This process involves running thousands of automated test inputs per second on code targets, then running coherence checks on the execution result. Coverage-guided fuzzing engines such as LLVM libFuzzer and AFL++ optimize for code coverage (i.e. creating a diverse corpus of inputs that reaches as much code as possible).

EVMFuzz revealed at least five instances of unsound behavior in Ethereum, proving its effectiveness at finding critical blockchain vulnerabilities. Each of these could have caused a chain split on Ethereum.

The Solana protocol has naturally evolved alongside the validator written by Solana Labs. As a result, much of the protocol has so far been implementation-defined.

Firedancer defines fuzzing targets for every component that accepts untrusted user inputs, including the P2P interface (parsers) and the SBF virtual machine. OSS-Fuzz continually exercises these targets as the code changes and integrates nicely with Firedancer's rules_fuzzing build targets. To date, OSS-Fuzz has found over 40,000 bugs in various open-source projects, including Chromium and GCC.

Specifications

We are working to produce specification documents to define the Solana protocol. In the end, one should be able to create a Solana validator just by looking at the documentation, not the Rust validator code.

Mind-map of major Solana network components

Aside from helping Firedancer stay compatible in the long run, we hope that comprehensive documentation enables other teams to build their own clients. Ultimately, Solana and the Sealevel VM should grow to become an open standard governed by a diverse community of core contributors.

Supply Chain Security

We want to be confident in every line of code that the validator executes. This requirement forces us to consider not only our own code but also external sources.

In fact, many Node.js, Rust, and even Go projects are known to have large dependency trees in which external code makes up the vast majority of the source. This is driven by modern package managers – tools that trivialize sharing and reusing open-source modules.

Example: Geth, the majority Ethereum execution client, pulls in over 250 other Go modules as dependencies. Meanwhile, the Solana Rust implementation depends on around 750 Rust modules.

# Don't believe us? Verify it!

curl -sS 'https://raw.githubusercontent.com/ethereum/go-ethereum/master/go.sum' | awk '{ print $1 }' | sort -u | wc -l
271

For node software, supply chain security is critical: a compromise of a single package can affect the entire validator. The only real mitigation is to review and regularly update every external dependency – possibly millions of lines of code for large dependency trees.

Build System

To avoid many of the problems that come with a complex supply chain, Firedancer instead focuses on simplicity and security.

The Firedancer build system is designed around the following set of rules.

  1. Use a minimal number of external dependencies.
  2. Treat not just code, but all tools involved in the build process as dependencies.
  3. Pin every dependency (including code compilers) to an exact version.
  4. Isolate the system environment from build steps (portability).

Regardless of environment, the build system produces byte-by-byte reproducible outputs.

For those reasons, we chose the Bazel build system, which puts emphasis on hermeticity. Bazel is also trusted by companies like SpaceX and BMW for high-reliability software.
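For illustration, pinning an external dependency in Bazel looks roughly like this (the name, URL, and checksum are placeholders, not actual Firedancer dependencies):

```starlark
# WORKSPACE (illustrative): every external dependency is pinned by exact
# version and content hash, so the build is hermetic and reproducible.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "some_dependency",                              # hypothetical
    urls = ["https://example.com/some-dependency-1.2.3.tar.gz"],
    sha256 = "<sha256 of the pinned archive>",             # placeholder
    strip_prefix = "some-dependency-1.2.3",
)
```

If the downloaded archive’s hash ever differs from the pinned one, the build fails rather than silently pulling in changed code.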

Conclusion

Learning from past network events, we are well-equipped to tackle the various modes of failure moving forward. Would Firedancer have stopped previous outages?

Possibly, but we didn’t design Firedancer by just looking in the rearview mirror. Firedancer is not a short-term patch or a bug fix. While it may not stop every future issue, we believe Firedancer will enhance the resiliency and robustness of the Solana network.

Meanwhile, the research arm of the project is working to support an ultra-high-bandwidth mode of operation using hardware acceleration (FPGAs) and custom networks. Firedancer will push the bounds of computing to offer scale and reliability to the Solana community. We are ready for the future.


Contributors

Richard Patel

Richard is a Software Developer at Jump Crypto focused on Solana core engineering. He currently maintains the Firedancer validator client. Previously, he was a research engineer at Blockdaemon.



© 2022 Jump Crypto. All Rights Reserved.