poly1305: modify s390x assembly to implement MAC interface
The vector (vx) implementation has been updated to read in the
state and update it - as opposed to being a single shot function.
This has allowed the new MAC interface to be implemented.

For performance reasons s390x uses a larger buffer than the generic
implementation. There is a relatively high fixed cost to read the
state, calculate the key coefficients and serialize the state, so
it makes sense to buffer more blocks before calling it.

For now I've had to remove the faster VMSL implementation. It is
too complex for me to update in time for Go 1.15. At some point
I'd like to revisit it but for now it looks like using the MAC
interface is more of a win than using VMSL.

The benchmarks show considerable improvements when using the MAC
interface. The Sum benchmarks show slowdown due to a combination
of the removal of the VMSL implementation and also the added
overhead from splitting the summation function into multiple parts.

poly1305:

name              old speed      new speed      delta
64                1.33GB/s ± 0%  0.80GB/s ± 1%   -39.51%  (p=0.000 n=16+20)
1K                4.04GB/s ± 0%  2.97GB/s ± 0%   -26.46%  (p=0.000 n=19+19)
2M                5.32GB/s ± 1%  3.63GB/s ± 0%   -31.76%  (p=0.000 n=20+19)
64Unaligned       1.33GB/s ± 0%  0.80GB/s ± 0%   -39.80%  (p=0.000 n=19+18)
1KUnaligned       4.09GB/s ± 1%  2.94GB/s ± 0%   -28.23%  (p=0.000 n=19+18)
2MUnaligned       5.33GB/s ± 1%  3.52GB/s ± 0%   -34.04%  (p=0.000 n=20+19)
Write64           1.03GB/s ± 1%  1.49GB/s ± 1%   +44.34%  (p=0.000 n=20+20)
Write1K           1.21GB/s ± 0%  3.24GB/s ± 0%  +169.02%  (p=0.000 n=20+17)
Write2M           1.24GB/s ± 1%  3.63GB/s ± 0%  +192.36%  (p=0.000 n=20+19)
Write64Unaligned  1.04GB/s ± 1%  1.50GB/s ± 0%   +44.16%  (p=0.000 n=19+14)
Write1KUnaligned  1.21GB/s ± 0%  3.20GB/s ± 0%  +164.55%  (p=0.000 n=20+16)
Write2MUnaligned  1.24GB/s ± 1%  3.51GB/s ± 0%  +183.96%  (p=0.000 n=20+19)

chacha20poly1305 (this vs. using generic MAC interface - post CL 206977):

name         old speed      new speed      delta
Open-64       147MB/s ± 2%   156MB/s ± 1%   +6.15%  (p=0.000 n=20+19)
Seal-64       151MB/s ± 0%   164MB/s ± 1%   +8.86%  (p=0.000 n=19+16)
Open-64-X     104MB/s ± 2%   111MB/s ± 1%   +6.24%  (p=0.000 n=20+20)
Seal-64-X     109MB/s ± 2%   111MB/s ± 1%   +2.11%  (p=0.000 n=20+19)
Open-1350     555MB/s ± 0%   751MB/s ± 1%  +35.19%  (p=0.000 n=20+20)
Seal-1350     557MB/s ± 0%   759MB/s ± 0%  +36.23%  (p=0.000 n=20+20)
Open-1350-X   517MB/s ± 1%   683MB/s ± 1%  +31.97%  (p=0.000 n=20+20)
Seal-1350-X   511MB/s ± 0%   683MB/s ± 0%  +33.77%  (p=0.000 n=18+19)
Open-8192     672MB/s ± 0%  1013MB/s ± 0%  +50.65%  (p=0.000 n=19+19)
Seal-8192     674MB/s ± 0%  1018MB/s ± 0%  +50.98%  (p=0.000 n=18+20)
Open-8192-X   663MB/s ± 0%   979MB/s ± 0%  +47.57%  (p=0.000 n=20+20)
Seal-8192-X   658MB/s ± 0%   985MB/s ± 0%  +49.62%  (p=0.000 n=18+20)

name         old allocs/op  new allocs/op  delta
Open-64          0.00           0.00          ~     (all equal)
Seal-64          0.00           0.00          ~     (all equal)
Open-64-X        0.00           0.00          ~     (all equal)
Seal-64-X        0.00           0.00          ~     (all equal)
Open-1350        0.00           0.00          ~     (all equal)
Seal-1350        0.00           0.00          ~     (all equal)
Open-1350-X      0.00           0.00          ~     (all equal)
Seal-1350-X      0.00           0.00          ~     (all equal)
Open-8192        0.00           0.00          ~     (all equal)
Seal-8192        0.00           0.00          ~     (all equal)
Open-8192-X      0.00           0.00          ~     (all equal)
Seal-8192-X      0.00           0.00          ~     (all equal)

chacha20poly1305 (this vs. using asm Sum interface - pre CL 206977):

name         old speed      new speed      delta
Open-64       144MB/s ± 0%   156MB/s ± 1%    +8.16%  (p=0.000 n=20+19)
Seal-64       150MB/s ± 0%   164MB/s ± 1%    +9.35%  (p=0.000 n=20+16)
Open-64-X     104MB/s ± 1%   111MB/s ± 1%    +6.15%  (p=0.000 n=19+20)
Seal-64-X     109MB/s ± 1%   111MB/s ± 1%    +1.43%  (p=0.000 n=19+19)
Open-1350     702MB/s ± 1%   751MB/s ± 1%    +6.98%  (p=0.000 n=20+20)
Seal-1350     715MB/s ± 0%   759MB/s ± 0%    +6.09%  (p=0.000 n=19+20)
Open-1350-X   642MB/s ± 0%   683MB/s ± 1%    +6.37%  (p=0.000 n=19+20)
Seal-1350-X   639MB/s ± 0%   683MB/s ± 0%    +6.98%  (p=0.000 n=20+19)
Open-8192     994MB/s ± 0%  1013MB/s ± 0%    +1.85%  (p=0.000 n=20+19)
Seal-8192    1.00GB/s ± 0%  1.02GB/s ± 0%    +1.90%  (p=0.000 n=20+20)
Open-8192-X   965MB/s ± 0%   979MB/s ± 0%    +1.43%  (p=0.000 n=19+20)
Seal-8192-X   962MB/s ± 0%   985MB/s ± 0%    +2.39%  (p=0.000 n=20+20)

name         old allocs/op  new allocs/op  delta
Open-64          1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-64          1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Open-64-X        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-64-X        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Open-1350        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-1350        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Open-1350-X      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-1350-X      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Open-8192        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-8192        1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Open-8192-X      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
Seal-8192-X      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)

Updates golang/go#25219.

Change-Id: Ib491e3a47b6b3ec8bbbe1f41f7bf42ad82f5c249
Reviewed-on: https://go-review.googlesource.com/c/crypto/+/219057
Run-TryBot: Michael Munday <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Filippo Valsorda <[email protected]>
mundaym committed Apr 29, 2020
1 parent 729f1e8 commit 4b2356b
Showing 9 changed files with 548 additions and 1,222 deletions.
2 changes: 1 addition & 1 deletion poly1305/mac_noasm.go
@@ -2,7 +2,7 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.

-// +build !amd64,!ppc64le gccgo purego
+// +build !amd64,!ppc64le,!s390x gccgo purego

 package poly1305

4 changes: 3 additions & 1 deletion poly1305/poly1305.go
@@ -26,7 +26,9 @@ const TagSize = 16
 // 16-byte result into out. Authenticating two different messages with the same
 // key allows an attacker to forge messages at will.
 func Sum(out *[16]byte, m []byte, key *[32]byte) {
-	sum(out, m, key)
+	h := New(key)
+	h.Write(m)
+	h.Sum(out[:0])
 }

 // Verify returns true if mac is a valid authenticator for m with the given key.
38 changes: 35 additions & 3 deletions poly1305/poly1305_test.go
@@ -6,6 +6,7 @@ package poly1305

 import (
 	"crypto/rand"
+	"encoding/binary"
 	"encoding/hex"
 	"flag"
 	"testing"
@@ -15,9 +16,10 @@ import (
 var stressFlag = flag.Bool("stress", false, "run slow stress tests")

 type test struct {
-	in  string
-	key string
-	tag string
+	in    string
+	key   string
+	tag   string
+	state string
 }

 func (t *test) Input() []byte {
@@ -48,9 +50,33 @@ func (t *test) Tag() [16]byte {
 	return tag
 }

+func (t *test) InitialState() [3]uint64 {
+	// state is hex encoded in big-endian byte order
+	if t.state == "" {
+		return [3]uint64{0, 0, 0}
+	}
+	buf, err := hex.DecodeString(t.state)
+	if err != nil {
+		panic(err)
+	}
+	if len(buf) != 3*8 {
+		panic("incorrect state length")
+	}
+	return [3]uint64{
+		binary.BigEndian.Uint64(buf[16:24]),
+		binary.BigEndian.Uint64(buf[8:16]),
+		binary.BigEndian.Uint64(buf[0:8]),
+	}
+}
+
 func testSum(t *testing.T, unaligned bool, sumImpl func(tag *[TagSize]byte, msg []byte, key *[32]byte)) {
 	var tag [16]byte
 	for i, v := range testData {
+		// cannot set initial state before calling sum, so skip those tests
+		if v.InitialState() != [3]uint64{0, 0, 0} {
+			continue
+		}
+
 		in := v.Input()
 		if unaligned {
 			in = unalignBytes(in)
@@ -140,6 +166,9 @@ func testWriteGeneric(t *testing.T, unaligned bool) {
 			input = unalignBytes(input)
 		}
 		h := newMACGeneric(&key)
+		if s := v.InitialState(); s != [3]uint64{0, 0, 0} {
+			h.macState.h = s
+		}
 		n, err := h.Write(input[:len(input)/3])
 		if err != nil || n != len(input[:len(input)/3]) {
 			t.Errorf("#%d: unexpected Write results: n = %d, err = %v", i, n, err)
@@ -165,6 +194,9 @@ func testWrite(t *testing.T, unaligned bool) {
 			input = unalignBytes(input)
 		}
 		h := New(&key)
+		if s := v.InitialState(); s != [3]uint64{0, 0, 0} {
+			h.macState.h = s
+		}
 		n, err := h.Write(input[:len(input)/3])
 		if err != nil || n != len(input[:len(input)/3]) {
 			t.Errorf("#%d: unexpected Write results: n = %d, err = %v", i, n, err)
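
The InitialState test helper in this file takes a 24-byte state written as big-endian hex and returns it as three 64-bit limbs, least-significant limb first, which is the layout macState.h uses. A minimal standalone sketch of that conversion follows. The name decodeState is hypothetical, and unlike the test helper it returns an error rather than panicking:

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeState is a hypothetical standalone version of the test helper.
// It parses a big-endian hex string of exactly 24 bytes and returns the
// limbs least-significant first, so the encoded value equals
// h[0] + h[1]*2^64 + h[2]*2^128.
func decodeState(s string) ([3]uint64, error) {
	if s == "" {
		// An empty string means "start from the zero state".
		return [3]uint64{}, nil
	}
	buf, err := hex.DecodeString(s)
	if err != nil {
		return [3]uint64{}, err
	}
	if len(buf) != 3*8 {
		return [3]uint64{}, errors.New("incorrect state length")
	}
	return [3]uint64{
		binary.BigEndian.Uint64(buf[16:24]), // lowest 64 bits are the last 8 bytes
		binary.BigEndian.Uint64(buf[8:16]),
		binary.BigEndian.Uint64(buf[0:8]), // highest 64 bits are the first 8 bytes
	}, nil
}

func main() {
	h, err := decodeState("000000000000000300000000000000020000000000000001")
	if err != nil {
		panic(err)
	}
	fmt.Println(h) // prints [1 2 3]
}
```

Reading the hex string left to right gives the most significant bytes first, which is why the slices are consumed in reverse order.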
3 changes: 2 additions & 1 deletion poly1305/sum_generic.go
@@ -41,7 +41,8 @@ func newMACGeneric(key *[32]byte) macGeneric {
 // the value of [x0, x1, x2] is x[0] + x[1] * 2⁶⁴ + x[2] * 2¹²⁸.
 type macState struct {
 	// h is the main accumulator. It is to be interpreted modulo 2¹³⁰ - 5, but
-	// can grow larger during and after rounds.
+	// can grow larger during and after rounds. It must, however, remain below
+	// 2 * (2¹³⁰ - 5).
 	h [3]uint64
 	// r and s are the private key components.
 	r [2]uint64
18 changes: 0 additions & 18 deletions poly1305/sum_noasm.go

This file was deleted.

72 changes: 54 additions & 18 deletions poly1305/sum_s390x.go
@@ -2,38 +2,74 @@
 // Use of this source code is governed by a BSD-style
 // license that can be found in the LICENSE file.

-// +build go1.11,!gccgo,!purego
+// +build !gccgo,!purego

 package poly1305

 import (
 	"golang.org/x/sys/cpu"
 )

-// poly1305vx is an assembly implementation of Poly1305 that uses vector
+// updateVX is an assembly implementation of Poly1305 that uses vector
 // instructions. It must only be called if the vector facility (vx) is
 // available.
 //go:noescape
-func poly1305vx(out *[16]byte, m *byte, mlen uint64, key *[32]byte)
+func updateVX(state *macState, msg []byte)

-// poly1305vmsl is an assembly implementation of Poly1305 that uses vector
-// instructions, including VMSL. It must only be called if the vector facility (vx) is
-// available and if VMSL is supported.
-//go:noescape
-func poly1305vmsl(out *[16]byte, m *byte, mlen uint64, key *[32]byte)
+// mac is a replacement for macGeneric that uses a larger buffer and redirects
+// calls that would have gone to updateGeneric to updateVX if the vector
+// facility is installed.
+//
+// A larger buffer is required for good performance because the vector
+// implementation has a higher fixed cost per call than the generic
+// implementation.
+type mac struct {
+	macState
+
+	buffer [16 * TagSize]byte // size must be a multiple of block size (16)
+	offset int
+}

-func sum(out *[16]byte, m []byte, key *[32]byte) {
-	if cpu.S390X.HasVX {
-		var mPtr *byte
-		if len(m) > 0 {
-			mPtr = &m[0]
+func (h *mac) Write(p []byte) (int, error) {
+	nn := len(p)
+	if h.offset > 0 {
+		n := copy(h.buffer[h.offset:], p)
+		if h.offset+n < len(h.buffer) {
+			h.offset += n
+			return nn, nil
 		}
-		if cpu.S390X.HasVXE && len(m) > 256 {
-			poly1305vmsl(out, mPtr, uint64(len(m)), key)
+		p = p[n:]
+		h.offset = 0
+		if cpu.S390X.HasVX {
+			updateVX(&h.macState, h.buffer[:])
 		} else {
-			poly1305vx(out, mPtr, uint64(len(m)), key)
+			updateGeneric(&h.macState, h.buffer[:])
 		}
-	} else {
-		sumGeneric(out, m, key)
 	}
+
+	tail := len(p) % len(h.buffer) // number of bytes to copy into buffer
+	body := len(p) - tail          // number of bytes to process now
+	if body > 0 {
+		if cpu.S390X.HasVX {
+			updateVX(&h.macState, p[:body])
+		} else {
+			updateGeneric(&h.macState, p[:body])
+		}
+	}
+	h.offset = copy(h.buffer[:], p[body:]) // copy tail bytes - can be 0
+	return nn, nil
+}
+
+func (h *mac) Sum(out *[TagSize]byte) {
+	state := h.macState
+	remainder := h.buffer[:h.offset]
+
+	// Use the generic implementation if we have 2 or fewer blocks left
+	// to sum. The vector implementation has a higher startup time.
+	if cpu.S390X.HasVX && len(remainder) > 2*TagSize {
+		updateVX(&state, remainder)
+	} else if len(remainder) > 0 {
+		updateGeneric(&state, remainder)
+	}
+	finalize(out, &state.h, &state.s)
 }
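
The Write method above only hands data to the update function in multiples of the 256-byte buffer, so the fixed per-call cost of the vector code is spread over at least 16 blocks. The buffering can be tried in isolation with the sketch below. Everything in it is an assumption for illustration: blockBuffer and its update callback are hypothetical stand-ins for mac and updateVX/updateGeneric, and no MAC is actually computed.

```go
package main

import "fmt"

// bufSize matches the 16 * TagSize buffer used in the diff (16 blocks of 16 bytes).
const bufSize = 256

// blockBuffer is a hypothetical stand-in for the mac type. It only tracks
// how the input is split up; it does not compute a MAC.
type blockBuffer struct {
	buffer [bufSize]byte
	offset int
	update func(p []byte) // stand-in for updateVX / updateGeneric
}

func (b *blockBuffer) Write(p []byte) (int, error) {
	nn := len(p)
	if b.offset > 0 {
		// Top up the partially filled buffer first.
		n := copy(b.buffer[b.offset:], p)
		if b.offset+n < len(b.buffer) {
			b.offset += n
			return nn, nil // still not full, nothing to process yet
		}
		p = p[n:]
		b.offset = 0
		b.update(b.buffer[:])
	}

	tail := len(p) % len(b.buffer) // bytes to keep for later
	body := len(p) - tail          // bytes to process now, a multiple of bufSize
	if body > 0 {
		b.update(p[:body])
	}
	b.offset = copy(b.buffer[:], p[body:])
	return nn, nil
}

func main() {
	var calls []int
	b := &blockBuffer{update: func(p []byte) { calls = append(calls, len(p)) }}

	b.Write(make([]byte, 100)) // buffered only, no update call
	b.Write(make([]byte, 700)) // 156 bytes fill the buffer, then 512 body + 32 tail

	fmt.Println(calls, b.offset) // prints [256 512] 32
}
```

The bytes left in the buffer are handled by Sum, which in the diff falls back to the generic code when two or fewer blocks remain.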