Open VDF: Partial Product Generation

In this blog we’ll look at partial product generation for the Öztürk multiplier. We’ve considered three techniques for partial product generation:

  1. Use small (17-bit) multipliers to generate partial products that feed into compressor trees. This was the approach used in the FPGA-based design leveraging the high speed DSPs.

  2. Use small (17-bit) multipliers as above, but using Booth encoding.

  3. Use bitwise partial product generation with AND gates.

The last approach is theoretically the best for ASIC designs from a complexity standpoint. However, since our goal is to achieve the lowest possible latency, we wanted to rule out the possibility that a more structured design, or an alternative encoding scheme, could end up producing a better result in practice.

As in the previous posts, we’ll take the approach of analyzing this discrete component of the design to guide our decisions on the best way to construct the full modular squaring unit. In this instance we will look at using multipliers and carry-save adder trees to construct the squaring circuit.

Partial Product Generation

In our last post we introduced a large integer multiplier construction using the “schoolbook” approach. We reuse that concept here for partial product generation.

When using the polynomial multiplier design, one option is to divide the 2048-bit input into 16-bit digits. When using the redundant representation, we arrive at 130 digits, each 17 bits in length. We can then multiply these 17-bit digits to form partial products, which are summed down the columns through the use of compressor trees.

An example is shown for squaring 4 digits in the following drawing. Recall that we can optimize for squaring by doubling (shifting left 1 bit) the duplicated partial products shown in light purple.

Partial product triangle for squaring a 4-digit operand, with the duplicated partial products shown in light purple.
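
To make this concrete, here is a minimal Python model of digit-based schoolbook squaring with the doubling optimization. For simplicity it uses a plain 16-bit digit split rather than the 130-digit redundant form, and the constant and function names below are illustrative only, not taken from the actual RTL.

DIGIT_BITS = 16    # plain (non-redundant) digit width used for this illustration
NUM_DIGITS = 128   # 2048 / 16

def to_digits(x):
    """Split x into NUM_DIGITS digits of DIGIT_BITS each."""
    mask = (1 << DIGIT_BITS) - 1
    return [(x >> (i * DIGIT_BITS)) & mask for i in range(NUM_DIGITS)]

def schoolbook_square(x):
    """Schoolbook squaring: diagonal partial products plus doubled off-diagonal ones."""
    d = to_digits(x)
    acc = 0
    for i in range(NUM_DIGITS):
        acc += (d[i] * d[i]) << (2 * i * DIGIT_BITS)            # diagonal terms
        for j in range(i + 1, NUM_DIGITS):
            acc += (d[i] * d[j]) << ((i + j) * DIGIT_BITS + 1)  # duplicated terms, doubled by a 1-bit shift
    return acc

x = (1 << 2048) - 12345
assert schoolbook_square(x) == x * x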

This approach works very well on FPGAs since it makes good use of the built in high-speed DSP blocks. On one popular FPGA platform, the DSPs support up to 26-bit by 17-bit multiplication.

For an ASIC we can break this down even further and consider each digit in the polynomial to be one bit wide. In this case, the schoolbook algorithm is the same, but partial product generation is reduced to a single AND gate per pair of terms. The result is many more partial products and therefore larger compressor trees to sum them.
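
As a sketch of what the bitwise formulation looks like (illustrative Python only, not the RTL), each partial product is the AND of two input bits and lands in the column given by the sum of their bit positions:

def bitwise_square_columns(a, nbits):
    """Count the single-bit partial products falling into each output column."""
    bits = [(a >> i) & 1 for i in range(nbits)]
    cols = [0] * (2 * nbits)
    for i in range(nbits):
        for j in range(nbits):
            cols[i + j] += bits[i] & bits[j]   # one AND gate per (i, j) pair
    return cols

# Weighting each column by its bit position recovers the square.
a = 0b101101
assert sum(c << k for k, c in enumerate(bitwise_square_columns(a, 6))) == a * a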

In concrete terms, we found the bitwise squaring approach resulted in the deepest column with 1,224 entries to sum. Using a tree, and including the PPG AND gate, results in a depth of ⌈log₂(1224)⌉ + 1 = 12 levels in the critical path. In pseudo code, disregarding for now the single AND gate:

A = [x0, x1, …, x1223]  # 1,224 single-bit partial products (PPG AND gate outputs)
sum = 0
for i in range(len(A)):
    sum = sum + A[i]

Alternatively, with 17-bit digits, the PPG multiplier has to sum 34 inputs and the compressor tree has to sum 130 partial products, leading to a total depth of ⌈log₂(34) + log₂(130)⌉ = 13 levels. In pseudo code:

C = [A[n] * B[n] for n in range(130)]  # 34-bit products from the 17x17 PPG multipliers
sum = 0
for i in range(130):
    sum = sum + C[i]
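
The depth arithmetic above can be reproduced directly, assuming a binary reduction tree with ⌈log₂⌉ levels per summation stage:

from math import ceil, log2

def tree_depth(n):
    """Levels in a binary tree that sums n inputs."""
    return ceil(log2(n))

bitwise = tree_depth(1224) + 1            # 1,224-entry column plus the PPG AND gate -> 12
digit17 = ceil(log2(34) + log2(130))      # 34-input multiplier tree plus 130-entry column -> 13
print(bitwise, digit17)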

In the following sections we present the results for these two approaches.

Booth Encoding

In addition to a standard representation multiplier, we also considered a Booth encoded multiplier.

Booth’s multiplication algorithm was originally developed to perform signed multiplication. It can also have performance advantages, particularly when the operands contain consecutive ones or zeros. We won’t go into detail on the algorithm as there is abundant material online. Examples include Booth’s multiplication algorithm and Booth’s original paper on the topic.
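
For a flavor of what the encoding does, here is a minimal radix-4 Booth recoding sketch in Python. It is illustrative only; the hardware selects 0, ±a, or ±2a as a partial product for each recoded digit rather than summing integers, and the function names are ours, not from the designs.

def booth_radix4_digits(b, nbits=17):
    """Recode an unsigned nbits-wide multiplier into radix-4 digits in {-2, -1, 0, 1, 2}."""
    bp = b << 1                          # implicit 0 appended below the LSB
    digits = []
    for g in range(nbits // 2 + 1):      # one signed digit per overlapping 3-bit window
        w = (bp >> (2 * g)) & 0b111
        digits.append((w & 1) + ((w >> 1) & 1) - 2 * (w >> 2))
    return digits

def booth_multiply(a, b, nbits=17):
    """Each recoded digit contributes digit * a at an even bit position."""
    return sum((d * a) << (2 * g) for g, d in enumerate(booth_radix4_digits(b, nbits)))

assert booth_multiply(0x1ABCD, 0x1F00F) == 0x1ABCD * 0x1F00F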

Our purpose in exploring it here is to see if it offers any advantage for low latency designs.

Study Results

We considered four implementations for partial product generation:

  1. 17x17 bit multiplier PPG followed by compressors coded using “+”

  2. 17x17 bit Booth encoded multiplier PPG followed by compressors coded using “+”

  3. 17x17 bit Booth encoded multiplier PPG followed by compressors coded using explicit trees

  4. 1x1 bit PPG using AND gates followed by compressors coded using “+”. This design is architected specifically to minimize critical path gate depth for a low latency ASIC.

We will begin by identifying the best approach to building a 17 bit x 17 bit multiplier (implementations #1 through #3), and then compare this result to a 1 bit x 1 bit PPG approach using AND gates.

17 bit x 17 bit Multipliers

We first considered the 17x17 bit multiplier on its own by looking at three designs:

  1. 17x17 bit multiplier coded with “*”

  2. 17x17 bit Booth encoded multiplier coded with “+” for internal accumulators

  3. 17x17 bit Booth encoded multiplier coded with bit level compressors for internal accumulators (see the sketch below)
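
For reference, the “bit level compressors” in design 3 are explicitly instantiated 3:2 compressors (carry-save adders). A behavioral Python sketch of that style of reduction, purely to illustrate the structure the RTL spells out by hand:

def csa_3to2(x, y, z):
    """One 3:2 compressor layer over whole words: three inputs become a sum word and a carry word."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # majority carries, shifted into the next column
    return s, c

def csa_reduce(terms):
    """Reduce many terms with 3:2 compressors, leaving a single carry-propagate add at the end."""
    terms = list(terms)
    while len(terms) > 2:
        s, c = csa_3to2(terms[0], terms[1], terms[2])
        terms = terms[3:] + [s, c]
    return sum(terms)

assert csa_reduce([3, 5, 7, 11, 13]) == 39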

The multiplier results are shown in the following chart:

Comparison of 17x17 multiplier PPA vs. multiplier baseline coded using “*”

In this chart, latency refers to the lowest latency at which timing closes. For the area and power measurements, each design style is measured at iso-latency for an aggressive latency target. Individual component sizes are chosen to be representative of what would be needed for a full 2048-bit modular squaring design.

From these results it is clear that “*” is the preferred approach for building a multiplier. This is the multiplier we will assume as we look at the full squaring operation in the next section.

PPG with AND gates vs. 17x17 Multipliers

For the entire squaring operation we took the best multiplier from above, paired it with a 130 element compressor, and then compared it to a single large compressor tree to assess the relative performance:

  1. 1224 element compressor tree coded with “+”

  2. 17x17 multiplier coded with “*”, followed by a 130 element compressor tree

PPA results are shown in the following chart:

PPG using a 1224 element compressor tree vs. a 17x17 multiplier plus 130 element compressor tree

Again we find that the simpler approach produces the best results: it is better to build a single large compressor tree per polynomial digit than to use a two-level design built from 17x17 bit multipliers.

These results align with those from our earlier CSA blog where we looked at composing large CSA trees from smaller component trees. At a high level, this is not surprising. We’ve essentially replaced some of the component trees from the earlier blog with small multipliers, and as a result some of the same inefficiencies exist. Namely, the synthesis tool is no longer able to optimize over the entire tree. This means that most, or all, of the sub-components are suboptimal for the particular circumstances where they are instantiated.

Note that we did not take into account the required PPG AND gate. However, we expect it to be a small incremental cost and will confirm this when we look at the entire modular squaring unit.

Conclusions

In this blog we looked at using various styles of multipliers to generate partial products for the modular squaring operation. The results reinforce some of our conclusions from earlier blogs, specifically:

  • Explicitly using an alternative multiplier encoding style (Booth) produced worse results than letting the tools optimize

  • We consistently see that a simpler coding style for arithmetic operations (“+”, “*”) leads to better results when using modern ASIC EDA tools

The only remaining question before building the entire modular squaring unit is what bit width should be used for the polynomial coefficients. When designing for an FPGA we were constrained by the size of the available DSP resources. However, this is no longer the case on an ASIC, where we’re free to build the components that we need. In the next blog we’ll discuss the tradeoffs involved in various bit widths and perform an analysis to identify the optimal choice to take forward.

Source Code

Source code for the designs presented here is available on GitHub at supranational/hardware and is licensed under the Apache 2 license.

Acknowledgements

We’d like to provide a special acknowledgement to Synopsys® for their support of this work.