Fast Monoid CRC32
The monoid hashing described in Parallel and Incremental CRCs is useful in some scenarios. A monoid hash requires a binary operation that combines the hashes of two snippets, with the result being equivalent to hashing the concatenation of those two snippets. Additionally, there is an identity element that, when combined with any hash, yields the same hash.
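Concretely, writing $h$ for the hash, $\Vert$ for concatenation, and $e$ for the identity, the laws are $\mathrm{combine}(h(a), h(b)) = h(a \Vert b)$ and $\mathrm{combine}(e, x) = x = \mathrm{combine}(x, e)$; associativity of combine then mirrors the associativity of concatenation.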
However, the naive algorithm described in the original post comes nowhere near the limits of modern hardware. It's not fast enough!
To illustrate this point more clearly, let’s start with a core abstraction for fast monoid CRC32 calculation.
type t

val fold_bytes: [u8] → t
val combine: t → t → t
val identity: t
We use fold_bytes: [u8] → t instead of from_byte: u8 → t. While the latter is more general, many architectures provide hardware CRC32 instructions, which make fold_bytes over a whole byte slice far more efficient.
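For instance, on AArch64 with the CRC extension, a byte-at-a-time fold costs a single instruction per byte. The sketch below is only an illustration, not the post's actual primitive (the function name is mine); it uses CRC-32C, matching the constants that appear later, and the usual pre/post inversion convention:

#include <stdint.h>
#include <stddef.h>
#include <arm_acle.h> /* compile with -march=armv8-a+crc */

/* Illustrative hardware fold: one crc32cb instruction per byte.
   Production code would consume 8 bytes at a time with __crc32cd. */
uint32_t crc32c_bytes(const uint8_t *p, size_t n) {
    uint32_t crc = ~0u;                    /* standard pre-inversion */
    while (n--) crc = __crc32cb(crc, *p++);
    return ~crc;                           /* standard post-inversion */
}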
Regarding the abstract type t, a reasonable and compact representation consists of two polynomials, each fitting into a 32-bit integer. These store the standard CRC32 value and an additional value that encodes the byte length $n$ as $x^{8n} \bmod P$.
type r32 = u32 (* r32 means a bit-(r)eflected 32-bit polynomial *)
type t = {
  crc: r32;
  len: r32;
}
The fold_bytes operation defined for this type is straightforward.
let fold_bytes data =
  let crc = crc32 data in
  let len = xnmodp @@ length data in
  { crc; len }
Here, crc32 and xnmodp are both primitives used in ordinary CRC32 calculation. A brilliant implementation can be found at Fast CRC32.
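To make these primitives concrete, here is a portable software sketch modeled on zlib's multmodp/x2nmodp. It assumes the reflected representation used throughout this post (bit 31 holds the coefficient of $x^0$, so 0x80000000 is the polynomial 1), the reflected CRC-32C polynomial to match the constants later in this post, and that xnmodp takes a byte count; the names and conventions here are my assumptions, not the original post's code:

#include <stdint.h>

#define POLY 0x82f63b78u /* reflected CRC-32C polynomial */

/* Multiply two reflected polynomials modulo P. In the reflected
   representation, a right shift multiplies by x. */
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t p = 0;
    for (uint32_t m = 0x80000000u; m; m >>= 1) {
        if (a & m) p ^= b;
        b = (b & 1) ? (b >> 1) ^ POLY : b >> 1;
    }
    return p;
}

/* x^(8n) mod P by square-and-multiply, where n is a byte count. */
static uint32_t xnmodp(uint64_t n) {
    uint32_t r = 0x80000000u; /* the polynomial 1 */
    uint32_t b = 0x00800000u; /* x^8, one byte's worth of shift */
    while (n) {
        if (n & 1) r = multmodp(r, b);
        b = multmodp(b, b);
        n >>= 1;
    }
    return r;
}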
According to the original post, the combination implements the identity $\mathrm{crc}(a \Vert b) = \mathrm{crc}(a) \cdot x^{8|b|} \oplus \mathrm{crc}(b) \pmod{P}$:
let combine m n =
  let crc = addp32 (mulp32 m.crc n.len) n.crc in
  let len = mulp32 m.len n.len in
  { crc; len }
This time, we use two primitives: addp32 (add 32-bit polynomials) and mulp32 (multiply 32-bit polynomials modulo P). Addition of polynomials is simply xor. Polynomial multiplication, however, is more complex: while there are corresponding hardware instructions (clmul on x86-64 and pmull on AArch64), the widened product must still be reduced modulo the CRC32 polynomial P.
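Putting this together in C, combine is tiny; mulp32 can be either the portable multmodp sketched above or the hardware clmulr32 shown below (the struct and names are mine):

#include <stdint.h>

typedef struct { uint32_t crc, len; } crc_monoid_t; /* hypothetical name */

/* combine(m, n): shift m.crc past n's length (a multiply by
   n.len = x^(8|n|) mod P), then add n.crc; lengths multiply. */
static crc_monoid_t combine(crc_monoid_t m, crc_monoid_t n) {
    crc_monoid_t r;
    r.crc = multmodp(m.crc, n.len) ^ n.crc; /* addp32 is xor */
    r.len = multmodp(m.len, n.len);
    return r;
}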
For background, Galois field instructions on 2021 CPUs suggests using the crc32 instruction itself for the reduction. However, I chose another approach that performs the reduction explicitly (still a contribution by @corsix!). Following the steps in https://crypto.stackexchange.com/a/114412/126155 (sorry, I don't want to replicate the content in this post!), we end up with the following code (on AArch64):
#include <stdint.h>
#include <arm_neon.h>

/* Carry-less multiply of the low 64-bit lanes of a and b. */
uint64x2_t clmul_lo(uint64x2_t a, uint64x2_t b) {
    uint64x2_t r;
    __asm("pmull %0.1q, %1.1d, %2.1d\n" : "=w"(r) : "w"(a), "w"(b));
    return r;
}

/* Carry-less multiply of the high 64-bit lanes, xor'ed with c. */
uint64x2_t clmul_hi_e(uint64x2_t a, uint64x2_t b, uint64x2_t c) {
    uint64x2_t r;
    __asm("pmull2 %0.1q, %2.2d, %3.2d\neor %0.16b, %0.16b, %1.16b\n"
          : "=w"(r), "+w"(c)
          : "w"(a), "w"(b));
    return r;
}

/* Multiply two reflected 32-bit polynomials and reduce modulo P. */
uint32_t clmulr32(uint32_t a, uint32_t b) {
    uint64x2_t aa = vdupq_n_u64(a), bb = vdupq_n_u64(b);
    /* Shift by one bit to realign the reflected product. */
    uint64x2_t j = vshlq_u64(clmul_lo(aa, bb), vdupq_n_s64(1));
    uint64x2_t k;
    {
        /* Reduction constant from the linked construction. */
        static const uint64_t __attribute__((aligned(16)))
            k_[] = {0x90d3d871bd4e27e2ull};
        k = vld1q_dup_u64(k_);
    }
    uint64x2_t l = clmul_lo(j, k);
    {
        static const uint64_t __attribute__((aligned(16)))
            k_[] = {0xfffffffe00000000ull, (uint64_t)-1ll};
        k = vld1q_u64(k_);
    }
    uint64x2_t n = vandq_u64(l, k);
    {
        /* Constant derived from the reflected CRC-32C polynomial. */
        static const uint64_t __attribute__((aligned(16)))
            k_[] = {0x82f63b7880000000ull};
        k = vld1q_dup_u64(k_);
    }
    uint64x2_t hi = clmul_lo(n, k);
    uint64x2_t shl = vextq_u64(hi, vdupq_n_u64(0), 1);
    uint64x2_t r = clmul_hi_e(n, k, shl);
    return (uint32_t)vgetq_lane_u64(r, 0);
}
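As a sanity check (my assumption, not from the original post): if clmulr32 and the portable multmodp sketched earlier implement the same reflected multiply-mod-P, they must agree on arbitrary inputs:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical property test; assumes clmulr32 and multmodp are
   both in scope and use the same polynomial. */
int main(void) {
    srand(42);
    for (int i = 0; i < 1000000; i++) {
        uint32_t a = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t b = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        if (clmulr32(a, b) != multmodp(a, b)) {
            printf("mismatch: a=%08x b=%08x\n", a, b);
            return 1;
        }
    }
    puts("ok");
    return 0;
}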
Benchmark data shows it is efficient enough for my application.
Now, let's conclude with the definition of the identity, the last piece of the fast monoid CRC32.
let identity = {
  crc = 0;
  len = 0x80000000;
}
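This is exactly fold_bytes applied to the empty string: the CRC32 of zero bytes is 0, and $x^0 \bmod P$ is the polynomial 1, which the reflected representation encodes as 0x80000000. Unfolding combine identity n gives crc = addp32 (mulp32 0 n.len) n.crc = n.crc and len = mulp32 identity.len n.len = n.len; the right-identity law follows symmetrically.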