Welcome to the first in a series of posts about the design and implementation of a custom ASIC for computing verifiable delay functions, or VDFs. This post will provide a brief overview of VDFs, discuss the goals of the VDF hardware project, and lay out the foundations of the current design.
About VDFs
A verifiable delay function (VDF) is a cryptographic primitive that yields cryptographically verifiable time delays through sequential computation. One of the use cases of VDFs is generating public and unbiasable randomness without a trusted third party. This randomness can be used to enable more secure and scalable consensus protocols, make proofs-of-replication more efficient, and more. VDFs were first defined in a June 2018 paper by Stanford professor of cryptography Dan Boneh and others. Verifiable delay functions are defined as functions that require a prescribed amount of inherently sequential computation to evaluate, and whose output can be quickly verified via a short proof. An older ‘timelock puzzle’ construction by Rivest, Shamir, and Wagner also uses sequential computation to create time delays and encrypt data ‘into the future’. In these RSA-based ‘timelock’ schemes, the sequential computation that must be done is modular squaring. More information about VDFs can be found at https://vdfresearch.org/.
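The sequential core shared by these RSA-based schemes, repeated modular squaring, can be sketched in a few lines. This is an illustrative toy with a tiny modulus of my own choosing; a real deployment would use a ~2048-bit RSA modulus:

```python
# Toy sketch of the sequential core of an RSA-based timelock / VDF:
# t iterations of modular squaring, y = x^(2^t) mod N. Each squaring
# depends on the previous result, so the work cannot be parallelized.
def vdf_evaluate(x, t, N):
    y = x % N
    for _ in range(t):
        y = (y * y) % N  # the one operation the hardware must make fast
    return y

# Illustrative parameters only; a real modulus would be ~2048 bits.
N = 3233  # 61 * 53, toy modulus
print(vdf_evaluate(5, 10, N) == pow(5, 2 ** 10, N))  # True
```

Evaluating the function takes t squarings in sequence, while the verifier checks the result with a short proof rather than redoing the work.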
Open VDF Project Goals
This project, as part of the VDF Alliance, began in early 2019 to evaluate the feasibility of developing fast, open hardware for computing verifiable delay functions. While open source software powers much of our core internet security today, open source hardware is still a nascent ecosystem, and open source hardware built for cryptography is even more niche. The goal of the Open VDF project is to provide common building blocks for future cryptographic accelerators, share information about hardware design, and advance the state of the art in hardware acceleration of VDFs. Source code, as well as performance results from synthesis or other sources, will be shared where possible. We welcome and encourage feedback and participation from the community on improving the design and advancing the project.
Design Targets
Before discussing the current Open VDF design, it is important to understand the design goals of the custom hardware. Efficient computation of VDFs requires low-latency modular squaring to evaluate the delay function and high-throughput modular multiplication to generate the delay function proof. This initial blog series will focus solely on low-latency modular squaring. Lowering the latency of the VDF evaluation so that it approaches the assumed minimum allows for improved user experience and security when used in blockchain protocols. This gap between current state-of-the-art hardware and the assumed lower bound is known as amax. In addition to being ‘fast’, the VDF processor must also be practical and affordable for home users.
Current Design
The current architecture under consideration is based on the Öztürk multiplier, which was presented in this MIT VDF Day talk. This modular squaring unit (MSU) is composed of two main operations:
Squaring: The design uses a schoolbook polynomial multiplication method to generate the square. There are many existing resources for learning more about this method, including here and also Algorithm 7 here.
Reduction: We utilize a look-up table based reduction algorithm as explained in Algorithms 8 and 9 here. For our design, instead of a d-bit look-up table, we utilize a 1-bit look-up table approach. Also, for a more balanced critical path, instead of reducing the top ~2048 bits of the squaring output, we reduce the most significant ~1024 bits and the least significant ~1024 bits using our look-up table based approach, utilizing Montgomery reduction for the least significant bits.
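As a rough software model of the 1-bit look-up table idea (my own sketch with toy bit widths, not the project's RTL): for every bit position above the modulus width, precompute 2^i mod N; a wide value is then reduced by summing the table entries selected by its set high bits. The trailing `% N` here stands in for the redundant-form arithmetic the hardware uses instead:

```python
# Hypothetical software model of 1-bit look-up table reduction.
# Widths and modulus are toy values for illustration only.
def make_table(N, width):
    # One entry per bit position above the modulus width: 2^i mod N.
    return {i: pow(2, i, N) for i in range(width, 2 * width)}

def reduce_with_table(x, N, width):
    table = make_table(N, width)
    acc = x & ((1 << width) - 1)      # low bits pass through unchanged
    for i in range(width, 2 * width):
        if (x >> i) & 1:
            acc += table[i]           # fold each set high bit back in
    return acc % N                    # final correction; the hardware keeps
                                      # this running sum in redundant form

N = 65521                             # toy 16-bit modulus
x = 1234567890                        # a double-width "product" to reduce
print(reduce_with_table(x, N, 16) == x % N)  # True
```

The 1-bit tables trade memory for simpler indexing: each set bit selects exactly one precomputed residue, so the reduction becomes a large summation that maps naturally onto compressor trees.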
The data flows from top to bottom in this drawing through the following steps:
Partial product generation — In binary representation the partial products are formed by AND’ing the input coefficients according to the schoolbook squaring algorithm.
Partial product accumulation — The green, white and red triangles represent compressor trees (CSAs) that accumulate the partial products. The red triangles need to be as fast as possible to feed into reduction below. The middle white portion can take more time as it only feeds into the final output, keeping it out of the critical path.
Carry propagate adder (CPA) — Sums the carry and sum redundant outputs of the compressor trees into a single result.
Reduction lookup table reads — After squaring, the resulting ~4k-bit product needs to be reduced back to 2k bits. Precomputed reduction tables are indexed using the top 1k bits and bottom 1k bits, with each bit producing a 2048-bit value. This results in ~2k polynomials that need to be summed with the output of the white portion of the triangle to produce a result in redundant form.
Reduction compression trees (turquoise box) — Since we are summing many partial products together, the resulting polynomial coefficients grow in width, up to about 29 bits.
CPA — Sums the carry and sum outputs of the reduction compression trees and partially reduces the wide coefficients from the previous step by carrying bits above bit 16 into the coefficient one degree higher.
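The squaring portion of this flow can be mimicked end to end in software. The sketch below uses toy parameters of my own choosing (4-bit coefficients, coefficient-wise products rather than bit-level AND'ed partial products, no pipelining) and mirrors only the shape of the dataflow, not the hardware's timing or the reduction step:

```python
W = 4  # toy coefficient width in bits (illustrative)

def to_coeffs(x, n):
    # Split x into n coefficients of W bits each.
    return [(x >> (W * i)) & ((1 << W) - 1) for i in range(n)]

def csa_3to2(a, b, c):
    # 3:2 compressor: three operands in, sum/carry pair out, with no
    # carry propagation (s + carry == a + b + c).
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def square(x, n):
    coeffs = to_coeffs(x, n)
    # Partial product generation (schoolbook): one term per (i, j) pair.
    pps = [(coeffs[i] * coeffs[j]) << (W * (i + j))
           for i in range(n) for j in range(n)]
    # Partial product accumulation: compressor tree keeps the total in
    # redundant sum/carry form, avoiding long carry chains.
    while len(pps) > 2:
        s, c = csa_3to2(pps[0], pps[1], pps[2])
        pps = pps[3:] + [s, c]
    # Carry-propagate adder resolves the final sum/carry pair.
    return pps[0] + pps[1]

x = 0xBEEF
print(square(x, 4) == x * x)  # True
```

In hardware the while loop collapses into a fixed tree of compressors, so the depth grows only logarithmically with the number of partial products; the single CPA at the end is the only full carry propagation.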
Upcoming Posts
In the upcoming posts we will dive into more detail on designing a low latency MSU by exploring:
Öztürk multiplier design
Carry propagate adders (CPA)
Carry save adders (CSA) and compressor trees
Partial product generation (PPG)
Ideal coefficient bitwidth
Complete MSU design
Source Code
Source code for the designs presented here is available on GitHub at supranational/hardware and is licensed under the Apache 2 license.