Redundancy and Optimization of tANS Entropy Encoders

Abstract—Nowadays, entropy encoders are part of almost all data compression methods, with the Asymmetric Numeral Systems (ANS) family of entropy encoders having recently risen in popularity. Entropy encoders based on the tabled variant of ANS are known to provide varying performance depending on their internal design. In this paper, we present a method that calculates encoder redundancies in almost linear time, which translates in practice to thousand-fold speedups in redundancy calculations for small automatons, and allows redundancy calculations for automatons with tens of millions of states that would otherwise be prohibitive. We also address the problem of improving tabled ANS encoder designs by employing the aforementioned redundancy calculation method in conjunction with a stochastic hill climbing strategy. The proposed approach consistently outperforms state-of-the-art methods in tabled ANS encoder design. For automatons of twice the alphabet size, experimental results show redundancy reductions of around 10% over the default initialization method and over 30% over random initialization.

While most of the interest in ANS from the data compression community has focused on practical efforts, other relevant papers have been published recently, such as approximate methods for calculating the efficiency of encoders [27], [28], an improved probability approximation strategy [29], or their use as massively parallel decoders [30].
In this manuscript, we study Tabled ANS (tANS), which is an ANS variant that is well suited for hardware implementations [31], [32], [33]. It employs a finite state machine that transitions from state to state as symbols are encoded, which is certainly not a new idea [34]. However, the ANS theory enables the creation of effective encoding tables for such encoders. The underlying idea of this ANS variant is that the encoder produces integer-length codewords for non-integer amounts of information, while state transitions are employed to carry over fractional bits of information to subsequent codewords.
In particular, we focus our study on efficiently measuring the redundancy produced by tANS encoders, and on optimizing tANS encoders to minimize their implementation footprint or to improve their coding efficiency. We introduce a novel redundancy measurement strategy that exploits the structure imposed by tANS encoders on state transition matrices. This strategy achieves almost-linear time complexity for redundancy calculations when combined with an efficient calculation of the average codeword length for all states. In addition, we employ the proposed redundancy calculation technique to optimize tANS automatons. We do so by exploring automaton permutations through a stochastic hill climbing method. We expect these efforts to translate into, for example, less area being allocated to a hardware implementation, smaller coding tables sent as side information, or compression methods that can dynamically determine adequate encoder sizes. Many compression techniques incorporate entropy encoders with suitably high efficiency. In this work, we aim at providing better tANS encoders at equivalent efficiency levels, but with lower hardware costs. Particularly relevant to this research is the optimization strategy for tANS encoders in [35]. Further comparison with this method is provided in the following sections.
The rest of this paper is organized as follows. In what remains of this introduction we present some background information on tANS. Afterward, we provide an efficient method to calculate the redundancy of tANS encoders in Section II, and we show a method to optimize tANS encoders in Section III. In Section IV we discuss our experimental results, and conclusions are drawn in Section V.

A. Tabled Asymmetric Numeral Systems
In this section we provide the necessary background on how tANS encoders operate and establish some common notation. For the underlying rationale behind tANS, see [2]. See Table V for a description of each symbol.
Let an $n$-symbol discrete memoryless source have an underlying alphabet $A = \{0, \ldots, n-1\}$. For symbol $s \in A$, let $p_s$ be the associated occurrence probability, which we assume to be finitely representable (otherwise, see [29]).
A tANS encoder (or decoder) of size $m$ is a deterministic finite-state automaton that transitions from state to state as symbols are being encoded (or decoded). Such an automaton is defined by its key vector, $f = f_0 f_1 \ldots f_{m-1} \in A^m$, which uniquely describes the encoding and decoding functions that the automaton employs (i.e., how symbols are mapped to codewords and which state transitions occur). Let $m_s$ be the number of occurrences of symbol $s$ in $f$, with $\sum_{s \in A} m_s = m$. For example, given $A = \{0, 1, 2\}$, the key $f = 00210$ defines an automaton of size $m = 5$, with $m_0 = 3$ and $m_1 = m_2 = 1$.
Given the current state of the automaton, represented as an integer $x \in I = \{m, \ldots, 2m-1\}$, the symbol $s$ is encoded as follows:
1) First, state $x$ is renormalized to $x' \in I_s = \{m_s, \ldots, 2m_s - 1\}$ by removing as many least-significant bits from $x$ as necessary (i.e., a bitwise right shift operation), and pushing them to a stack (least-significant bit first).
2) Then, an encoding function $C_s : I_s \to I$ is employed to produce the resulting state $C_s(x') = y$.
Hence, the encoding process results in a state transition from $x \in I$ to $y \in I$ and some bits pushed onto a stack.
The resulting compressed file contains only the contents of the stack, the final state of the automaton, and, if it cannot be deduced, the number of symbols encoded. The initial state of the automaton need not be included in the compressed file, and can be chosen arbitrarily. However, an initial state $x = m$ is expected to produce fewer renormalization bits for the first encoded symbol (renormalizing any initial state larger than $m$ produces at least as many least-significant bits, and possibly more).
Decoding operations are performed in reverse order, with the peculiarity that the sequence of decoded symbols is produced in reversed order. To decode a symbol, a decoding function $D : I \to (A, I_s)$ is employed to produce a decoded symbol $s$ and a state $x' \in I_s$. Then, state $x'$ is renormalized by popping as many bits as necessary from the stack and appending them as least-significant bits to $x'$, so that the resulting $x$ is in $I$.
Algorithms 1 and 2 below describe the procedures to encode and decode one symbol, respectively. The encoding and decoding functions are uniquely determined by the key $f$, as discussed in more detail in what follows.
Algorithm 1 Algorithm for encoding one symbol.
Algorithm 2 Algorithm for decoding one symbol.
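For illustration, the following minimal Python sketch implements both procedures, assuming the coding tables have been precomputed from the key $f$ as dictionaries C (mapping $(s, x')$ to $y$) and D (mapping $y$ to $(s, x')$); these names, and the representation of the stack as a list of bits, are illustrative choices rather than the paper's implementation.

def encode_symbol(x, s, C, ms, stack):
    # Renormalize: shift out least-significant bits until x' is in I_s.
    while x >= 2 * ms[s]:
        stack.append(x & 1)   # least-significant bit is pushed first
        x >>= 1
    return C[(s, x)]          # y = C_s(x'), the new automaton state

def decode_symbol(y, D, m, stack):
    s, x = D[y]               # (s, x') = D(y)
    # Renormalize: pop bits back as least-significant bits until x is in I.
    while x < m:
        x = (x << 1) | stack.pop()
    return s, x

Note that the bits popped during decoding are exactly those pushed last during encoding, which is why the decoded symbol sequence is produced in reversed order.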
As an example, the encoding and decoding functions for the automaton with key $f = 10211011$ are provided in Table I. For this key, $m_0 = 2$, $m_1 = 5$, $m_2 = 1$, and $m = 8$. To encode symbol $s = 0$ in this example, suppose that the current state of the automaton is $x = 8$. Renormalizing $x$ into $x' \in I_0 = \{m_0, \ldots, 2m_0 - 1\} = \{2, 3\}$ requires taking the two least-significant bits from $x$, resulting in $x' = 2$. Applying $C_0$ to $x'$ yields $y = 9$, which is the new state of the automaton. The two least-significant bits "00" are pushed to the stack.
A key $f$ unambiguously defines the functions $C_s$ and $D$ as follows:
• The first element of the ordered pair produced by the decoding function is obtained directly from the symbols, in order, contained in $f$. That is, the symbol decoded from state $x \in I$ is $f_{x-m}$. The second element of the ordered pair, $x'$, is given after the definition of $C_s$ below.
• The values in the coding table for $C_s$ are, in order, the states that have the symbol $s$ as the first element of the ordered pair in the coding table for $D$.
• Destination states in $D$ (the second element of the ordered pair) are inverse to those in $C_s$, i.e., $D(x) = (s, x')$ where $x = C_s(x')$.
It is well known that low encoding redundancies are obtained for keys where $m_s \approx m\,p_s$ [2]. However, as others have shown [35] and we further show, lower redundancies can be achieved by taking into account the order of symbols in $f$. An effective key construction method is described in [2], which we use as a baseline. We reproduce that method in Algorithm 3, with slight modifications to ensure that all symbols appear at least once in the key. In the algorithm, heap tuples are compared lexicographically, and the 'pop' operation returns the smallest element.
Algorithm 3 Algorithm for baseline key creation.
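Since only the caption of Algorithm 3 is reproduced here, the following Python sketch shows a plausible heap-based construction in the spirit of [2]: tuples are compared lexicographically and 'pop' returns the smallest one, so each symbol $s$ is selected roughly once every $1/p_s$ positions, yielding $m_s \approx m\,p_s$. The exact tuple contents, and the safeguard ensuring that every symbol appears at least once, may differ from the paper's Algorithm 3.

import heapq

def baseline_key(m, p):
    # One tuple per symbol; heappop returns the lexicographically smallest.
    heap = [(0.5 / p[s], s) for s in range(len(p))]
    heapq.heapify(heap)
    f = []
    for _ in range(m):
        v, s = heapq.heappop(heap)
        f.append(s)
        # Re-push with priority increased by 1/p_s so symbol s is selected
        # roughly once every 1/p_s positions in the key.
        heapq.heappush(heap, (v + 1.0 / p[s], s))
    return f  # the minimum-occurrence safeguard is omitted in this sketch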

II. EFFICIENT REDUNDANCY CALCULATION
As for any entropy encoder, the redundancy of a tANS encoder for a given source can be obtained as the difference between the average codeword length produced by the encoder and the entropy of the source. Thus, less redundant encoders produce smaller compressed files.
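For concreteness, given the average codeword length $L$ (obtained as described below) and the symbol probabilities, the redundancy in bits per symbol can be computed as in the following minimal Python sketch (function and variable names are illustrative):

import math

def tans_redundancy(L, p):
    # Redundancy in bits per symbol: average codeword length minus entropy.
    entropy = -sum(ps * math.log2(ps) for ps in p)
    return L - entropy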
While redundancy calculation for tANS is well understood, the straightforward calculation of average codeword lengths is a computationally expensive procedure, which prevents its use in important applications.In particular, straightforward calculation can become infeasible for large automatons, empirical redundancy studies, and most importantly, data-driven key optimization procedures.
It is worth noting that a redundancy approximation for tANS was provided by Duda [2] and later refined by Yokoo [28]. However, these approximations may present notable divergences from the true redundancy, or may fail to distinguish the best of several tANS encoders. See Fig. 1 for an example where Duda's redundancy approximation yields the same result for two different automatons, and where the redundancy is significantly underestimated around $p_1 \approx 0.2$. As seen in the figure, the same statement is true for Yokoo's approximation. In this section we describe a procedure to efficiently calculate average codeword lengths and, in turn, the redundancy of tANS encoders.
The average codeword length of a tANS encoder can be written as
$$L = \sum_{x \in I} P(x)\, \ell_x, \qquad (2)$$
where $P(x)$ is the probability of the automaton being in state $x$ at a given instant, and $\ell_x$ is the per-state average codeword length.
State transitions can be modeled as a Markov process for which $P(x)$ is the stationary probability of being in state $x$. These probabilities can be obtained as the dominant unit left eigenvector of the transition matrix $P$ (i.e., the dominant unit right eigenvector of $P^T$). Here we consider only irreducible aperiodic Markov chains, so that unique stationary probabilities are guaranteed to exist. In particular, stationary probabilities for automatons with reducible Markov chains may not be unique, i.e., there can be more than one stationary distribution for a given automaton. On the other hand, a stationary probability may fail to exist for a periodic Markov chain. While outside the scope of this article, if necessary, an approximate solution can be found for these cases by adding a small constant to all the elements of the stochastic matrix, thus ensuring irreducibility and aperiodicity [36].
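For reference, the stationary probabilities can be approximated naively by applying the power method to a dense transition matrix, as in the NumPy sketch below (the fixed iteration count is an illustrative choice). Each iteration costs $O(m^2)$, which Section II-B reduces to $O(m)$ per iteration.

import numpy as np

def stationary_dense(PT, iters=1000):
    # Naive power method on a dense m-by-m matrix P^T; O(m^2) per iteration.
    m = PT.shape[0]
    v = np.full(m, 1.0 / m)    # uniform starting vector
    for _ in range(iters):
        v = PT @ v             # one power-method step
        v /= v.sum()           # renormalize to a probability vector
    return v                   # approximates P(x) for x in I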
Regarding per-state average codeword lengths, these can be obtained as
$$\ell_x = \sum_{s \in A} p_s\, C(s, x), \qquad (3)$$
where $C(s, x) = \lfloor \lg(x/m_s) \rfloor$ is the number of bits pushed to the stack by Algorithm 1 when encoding symbol $s$ in state $x$. Here, and in what follows, note that $\lg$ denotes the base-two logarithm. It is well known that, for each symbol $s$, $C(s, x)$ can only take two values, which are consecutive integers, and that a threshold $t_s$ on $x$ suffices to distinguish the correct value [2], [35]. Specifically,
$$C(s, x) = \begin{cases} k_s & \text{if } x < t_s, \\ k_s + 1 & \text{otherwise,} \end{cases} \qquad (4)$$
where $k_s = \lfloor \lg(m/m_s) \rfloor$ and $t_s = 2^{k_s + 1} m_s$.
In what follows we show how to obtain $\ell_x$ and $P(x)$ in a computationally efficient manner, to facilitate the calculation of Eq. 2 (as illustrated in Fig. 2). First we describe how to calculate per-state average codeword lengths directly from the automaton size, symbol occurrence counts in the automaton key, and symbol probabilities. Afterwards, we describe how to obtain state probabilities by exploiting the structure of the state transition matrix of a tANS encoder. We obtain a compact representation of this matrix, which we then use in a modified power method to obtain its dominant eigenvector.

A. Per-state average codeword length
While a straightforward approach to calculating all $\ell_x$ values through (3) and (4) requires $O(mn)$ operations, we show a method that obtains per-state average codeword lengths in $O(m)$ by employing finite differences. Note that throughout the document we assume that arithmetic operations have $O(1)$ complexity, including exponentiation and logarithms.
Applying a forward difference operator (i.e., $\Delta_x g(x) = g(x+1) - g(x)$) to (3) yields
$$\Delta_x \ell_x = \sum_{s \in A} p_s\, \Delta_x C(s, x). \qquad (5)$$
Given symbol $s$, the value of $\Delta_x C(s, x)$ is 1 only for a single $x$ value (namely $x = t_s - 1$), which implies that each term in the summation in (5) is non-zero for only a single $x$ value. Given this fact, we can obtain all values of $\Delta_x \ell_x$ at once by iterating over alphabet symbols and only calculating the non-zero terms in the summation. Values of $\ell_x$ can then be found through cumulative summation of the $\Delta_x \ell_x$ values, except for $\ell_m$, which needs to be obtained from (3) directly. This method is formalized in Algorithm 4.
Regarding the complexity of the algorithm, given that $|A| \le |I|$, it can be seen that the complexity is dominated by the first and last loops. Each loop performs $m$ individual operations, and thus the algorithm complexity is $O(m)$.
Algorithm 4 Per-state average codeword length calculation.
Input: $m$; $m_s$, $p_s$ $\forall s \in A$. Output: $\ell_x$ $\forall x \in I$.
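Under the reconstruction of (3)-(5) given above, a minimal NumPy sketch of this $O(m)$ procedure follows (names are illustrative):

import numpy as np

def per_state_lengths(m, ms, p):
    # delta[x - m] holds the forward difference ell_{x+1} - ell_x.
    delta = np.zeros(m)
    ell_m = 0.0  # ell at state x = m, obtained from (3) directly
    for s in range(len(ms)):
        k = int(np.floor(np.log2(m / ms[s])))  # k_s
        t = (1 << (k + 1)) * ms[s]             # threshold t_s
        ell_m += p[s] * k                      # C(s, m) = k_s, since m < t_s
        if t - 1 < 2 * m - 1:                  # the jump at x = t_s lies inside I
            delta[t - 1 - m] += p[s]
    # Cumulative summation recovers every ell_x from the differences.
    return ell_m + np.concatenate(([0.0], np.cumsum(delta[:-1])))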

B. Efficient calculation of the state transition matrix and its dominant eigenvector
Having already seen how to obtain $\ell_x$, in this section we proceed to obtain $P(x)$ by exploiting the particular structure of the transition matrix $P$ of tANS automatons.
The re-normalization process in tANS creates structure and regularity in the automaton transition matrix, which we can exploit. It can easily be seen that, in Algorithm 1, multiple consecutive input values of $x$ produce the same output value (the same state transition), due to the floor operation applied in the re-normalization step.
For example, we show in Fig. 3 the transition matrix $P$ and key $f$ for a three-symbol automaton, assuming $p_0 = 0.2$, $p_1 = 0.3$, and $p_2 = 0.5$. It is readily apparent that row $i$ of $P^T$ is formed by runs of value $p_{f_i}$, which occasionally wrap around. I.e., runs containing the probability of the symbol in the same row of $f^T$, which occasionally reach the right-most column and continue on the left-most column. We can employ this fact to efficiently create equivalent compact representations of the transition matrices of tANS automatons. In addition, it is possible to observe that runs for the same symbol are successive and of equal length, up to the wrapping point, where their length is halved, but we do not exploit this for our purposes.
We can represent row $y - m$ of $P^T$, associated with destination state $y \in I$, as $M_y = (\alpha, \beta, p)$, where the interval $[\alpha, \beta)$ specifies a run of origin states $x \in I$ that lead to $y$ with probability $p$. For runs that wrap around, the end of the run $\beta$ is increased by $m$ so that $\alpha < \beta$, which simplifies notation in further steps. In the example of Fig. 3, there are 15 states, labeled 15 through 29. We then have $M_{15} = (20, 22, 0.2)$ for the top row (row 0) of $P^T$, and $M_{27} = (28, 31, 0.3)$ for the third row from the bottom (row 12). We can find all states $x \in I$ with destination state $y \in I$, starting from the re-normalized state $x' \in I_s$ such that $C_s(x') = y$, as follows. We look for all $x \in I$ that re-normalize to $x'$, i.e., $x = x' 2^k + z$, with $0 \le z \le 2^k - 1$ and $k, z \in \mathbb{Z}$. For $z = 0$ we obtain a run start by solving $x' 2^k + 0 \ge m$, which yields $k = \lceil \lg(m/x') \rceil$, as only one value for $k$ is possible [2]. Similarly, for $z = 2^k - 1$ we obtain a run end by solving $x' 2^k + (2^k - 1) + 1 \ge m$, which yields $k = \lceil \lg(m/(x'+1)) \rceil$. Hence, $\alpha = x' 2^{\lceil \lg(m/x') \rceil}$, and either $\beta = (x'+1)\, 2^{\lceil \lg(m/(x'+1)) \rceil}$ or $\beta = (x'+1)\, 2^{\lceil \lg(m/(x'+1)) \rceil} + m$, depending on whether there is a wrap-around or not. This yields $M_y = (\alpha, \beta, p)$.
Hereafter we refer to this representation, $M$, as a compact transition matrix, which Algorithm 5 obtains in $O(m)$ time complexity, given that each of the $\sum_{s \in A} m_s = m$ runs is derived in $O(1)$.
Algorithm 5 Compact transition matrix creation.
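A Python sketch of this construction follows. It assumes, per the definition of the coding tables in Section I-A, that for each symbol $s$ the re-normalized states $x' \in I_s$ are assigned in increasing order to the destination states $y$ whose key element is $s$; floating-point logarithms are used for brevity, though an integer-only formulation is possible.

import math
from collections import Counter

def compact_matrix(m, key, p):
    ms = Counter(key)
    nxt = dict(ms)   # next unassigned x' in I_s = {m_s, ..., 2m_s - 1}
    M = []
    for s in key:    # destination states y = m, m + 1, ... in key order
        xp = nxt[s]  # re-normalized state x' with C_s(x') = y
        nxt[s] += 1
        alpha = xp << math.ceil(math.log2(m / xp))
        beta = (xp + 1) << math.ceil(math.log2(m / (xp + 1)))
        if beta <= alpha:  # the run wraps past state 2m - 1
            beta += m
        M.append((alpha, beta, p[s]))
    return M  # M[y - m] = (alpha, beta, prob)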
Now we see that we can employ the power method on a compact transition matrix to produce its dominant eigenvector (i.e., the vector with elements $P(x)$ $\forall x \in I$). It is well known that the power method can be employed by iterating over $v^T \leftarrow P^T v^T$ to produce a dominant eigenvector in $O\!\left(\frac{R}{-\lg |\lambda_2|}\right)$ iterations, where $R$ is the required precision and $\lambda_2$ is the second-largest eigenvalue [37] (see [38] for further estimation of this eigenvalue). In what follows, we show that, by exploiting the transition matrix structure, we can obtain the dominant eigenvector in $O\!\left(\frac{m R}{-\lg |\lambda_2|}\right)$ complexity. To do so, we show how to obtain the result of the matrix-vector multiplication $w^T = P^T v^T$ in the power method in $O(m)$ instead of $O(m^2)$.
For non-wrapping runs, the $i$'th element of $w^T$ can be formulated as
$$w_i = p \sum_{x=\alpha}^{\beta - 1} v_{x - m}, \qquad (8)$$
with $\alpha$ and $\beta$ being the closed and open boundaries of the run in row $i$ of the transition matrix, as described previously. For wrapping runs we can consider
$$w_i = p \left( \sum_{x=\alpha}^{2m - 1} v_{x - m} + \sum_{x=2m}^{\beta - 1} v_{x - 2m} \right). \qquad (9)$$
Due to the particular definition of $\beta$, both summations in (9), as well as (8), are equivalent to
$$w_i = p \sum_{j=\alpha - m}^{\beta - m - 1} u_j,$$
where $u = (v_0, \ldots, v_{m-1}, v_0, \ldots, v_{m-1})$. Thanks to this, arbitrary summations of contiguous $u_j$ elements can be obtained in $O(1)$ by subtracting two elements from a pre-calculated cumulative sum of the elements in $u$.
The resulting method to obtain stationary state probabilities for tANS automatons is presented in Algorithm 6. As previously discussed, Algorithm 6 converges in $O\!\left(\frac{R}{-\lg |\lambda_2|}\right)$ iterations, with an $O(m)$ cost per iteration. Thus, the complexity of Algorithm 6 is $O\!\left(\frac{m R}{-\lg |\lambda_2|}\right)$. This complexity dominates those of the previous steps, and results in the total complexity of the tANS redundancy calculation.
Algorithm 6 Compact power method.
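The following NumPy sketch illustrates the compact power method; the iteration cap and convergence test are illustrative choices rather than the paper's exact stopping rule.

import numpy as np

def stationary_compact(M, m, iters=10_000, tol=1e-12):
    alpha = np.array([run[0] for run in M]) - m  # run starts, 0-based offsets
    beta = np.array([run[1] for run in M]) - m   # run ends (up to 2m for wraps)
    prob = np.array([run[2] for run in M])
    v = np.full(m, 1.0 / m)                      # initial probability vector
    for _ in range(iters):
        u = np.concatenate((v, v))               # doubled vector handles wraps
        csum = np.concatenate(([0.0], np.cumsum(u)))
        w = prob * (csum[beta] - csum[alpha])    # w_i = p * sum of v over the run
        w /= w.sum()                             # renormalize
        if np.max(np.abs(w - v)) < tol:
            return w
        v = w
    return v                                     # P(x) for x = m, ..., 2m - 1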
At this point we have presented a method to obtain the average codeword length of a tANS automaton and thus its redundancy in a very efficient manner, which allows us to explore the optimization of automatons.

III. REDUNDANCY REDUCTION
The basic principle behind ANS dictates that, when encoding a symbol, an encoder should transition to a next state that is approximately $1/p_s$ times larger than the current state (i.e., $y \approx x/p_s$), in order to accommodate the extra $\lg(1/p_s)$ bits of information [2]. When this is translated to tANS, it suggests that the number of occurrences of each symbol in tANS keys should be $m_s \approx m\,p_s$. However, in addition to symbol frequencies, the order of these symbols in the key also plays an important role in tANS performance, as larger states emit more bits to reach the same $I_s$ during re-normalization, and not all automaton states are equally probable [2].
For example, Fig. 4 shows the redundancy of all two-symbol automatons whose keys are sorted in ascending order, which we refer to as ordered automatons. It can be seen that, as $m$ increases, the redundancies obtained by even very large ordered automatons seem to stagnate, still having relative redundancies significantly larger than those of automatons with non-ordered symbols. Note how the curves for $m = 1000$, $m = 10000$, and $m = 100000$ are nearly identical. This might have interesting implications for the redundancies achievable by the rANS variant of ANS, which for redundancy purposes is equivalent to ordered tANS automatons, as it seems to suggest that rANS encoders may not converge to zero redundancy as automatons grow, or that they do so at a much slower rate than non-ordered tANS automatons.
In the remainder of this section we describe how to obtain well-performing orders for the elements of tANS automaton keys. The approach, which is based on hill climbing and sketched in code below, is as follows: given an initial automaton key, a succession of pseudo-random pairwise permutations of key elements is considered. Whenever an element permutation reduces the automaton redundancy, the permutation is applied to the key; otherwise, the permutation is discarded. Algorithm 3 can be employed to create initial automaton keys that have appropriate symbol frequencies.
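A minimal sketch of the hill climbing loop follows, assuming a function redundancy(f, p) that evaluates a key with the method of Section II (the function name, iteration budget, and seeding are illustrative):

import random

def optimize_key(f, p, redundancy, iters=50_000, seed=0):
    rng = random.Random(seed)
    f = list(f)
    best = redundancy(f, p)
    for _ in range(iters):
        i, j = rng.randrange(len(f)), rng.randrange(len(f))
        if f[i] == f[j]:
            continue                  # swapping equal symbols is a no-op
        f[i], f[j] = f[j], f[i]       # try the pairwise permutation
        r = redundancy(f, p)
        if r < best:
            best = r                  # keep the improving swap
        else:
            f[i], f[j] = f[j], f[i]   # otherwise revert it
    return f, best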
Traditionally, measuring the effects of any replacement or permutation of the key elements has been computationally challenging. In particular, obtaining the dominant eigenvector after a change in a state transition matrix is a difficult and well-studied problem [37], [39]. Existing solutions are significantly more computationally expensive than the straightforward application of our redundancy calculation method, or are simply not very effective. For example, employing an approximate eigenvector as the starting point for the power method may seem promising, but is not as effective as intuition would suggest [37]. Hence, in what follows, we directly employ the proposed redundancy calculation method to evaluate the effect of a key change.
Informed decisions could also be employed to explore the space of automaton keys. For example, the estimator
$$E = -P(x)\,\ell_x - P(y)\,\ell_y + P(x)\,\ell_y + P(y)\,\ell_x = (P(x) - P(y))(\ell_y - \ell_x)$$
could be employed to guide the selection of keys to be evaluated. This estimator is the redundancy variation after the elements of a key associated with states $x$ and $y$ are swapped, under the assumption that state probabilities will remain unaltered. Recently, Dubé and Yokoo proposed a similar approach in [35], where key elements are rearranged according to the probability of their associated automaton state, in descending order. In their state-of-the-art proposal this approach is repeated multiple times, and the key producing the least redundancy is finally selected.
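Under the unaltered-probabilities assumption, the estimator above costs $O(1)$ per candidate swap, as in this sketch (0-based offsets into the $P(x)$ and $\ell_x$ vectors of Section II; names are illustrative):

def swap_estimator(P_vec, ell, i, j):
    # Estimated redundancy change for swapping key elements i and j
    # (states x = m + i and y = m + j), assuming P stays unaltered.
    return (P_vec[i] - P_vec[j]) * (ell[j] - ell[i])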
The motivation of such an approach is to reduce complexity by reducing the number of keys to be evaluated. Given the low cost of evaluating keys via the proposed method, such approaches may not be necessary. Indeed, directly exploring pseudo-random permutations reaches more key variations than those suggested by heuristics and approximations, and ultimately achieves lower redundancies.
Fig. 5 shows redundancy reductions versus iterations for three different frequency tables, for our proposed method as compared to the informed approach proposed by Dubé and Yokoo. It is worth emphasizing that for both approaches the plots indicate the actual redundancy reduction, calculated via the algorithms proposed earlier in this paper, without employing any approximations or heuristics. In each case, the key size $m$ is set to approximately $5n$. It can be observed that the informed approach yields significant redundancy reductions within very few iterations, with little or no further improvement provided by subsequent iterations. In contrast, for pseudo-random permutations, significant redundancy reductions are only obtained after a moderate number of iterations, but a larger redundancy reduction is eventually obtained.

IV. EXPERIMENTAL RESULTS
In this section we first present the corpus employed to test the described methods, followed by experimental results regarding the performance of the method proposed in Section II, and results regarding the optimization method described in Section III.
In order for the experimental results to be applicable in practical scenarios, a data corpus has been carefully curated.
The corpus, detailed in Table II, is representative of the expected input to the entropy encoder of a general-purpose compression method. Three additional synthetic distributions are included in the corpus.
Regarding the performance of the method presented in Section II, in addition to the previous theoretical time complexity discussion, several wall-clock measurements are reported here. All results are for single-threaded execution (including BLAS routines). We have also developed a 'reference' implementation for comparison purposes, where the power method is employed but none of the other complexity reduction strategies described in Section II are used. As compared with this reference implementation, the proposed method yields substantial execution time improvements. Although the reference implementation is not as optimized as the proposed one, the difference in wall-clock performance is around three to four orders of magnitude. For $m \ge 10^5$, results for the reference method have not been produced, due to its quadratic memory requirements. In addition, the proposed technique has been checked against this reference implementation for correctness, with additional checks against LAPACK routines for the power method.
Regarding the performance of the optimization method presented in Section III, results are presented in Fig. 6 and Table IV.Results in Fig. 6 show the relation between automaton size and its redundancy for two different frequency tables.
Results for a 'default' automaton (Algorithm 3) are compared with the proposed 'optimized' method, and with three random permutations of the 'default' automaton. It can be observed that random permutations tend to yield poor results, whereas the optimized method provides improvements over the default method. For some frequency tables, such as that in Fig. 6(a), larger automatons consistently yield smaller redundancies. However, for the frequency table presented in Fig. 6(b) and others, the redundancy obtained by default automatons may significantly increase as their size is increased. Thus, even for cases where automaton optimization may not be worth the extra computational effort, carefully evaluating automaton size can yield important benefits.
In Table IV, comprehensive results are presented for all frequency tables of all data sets. It is worth mentioning that automaton keys that are close in size to the number of alphabet symbols may present fewer opportunities to improve performance by permuting key elements (e.g., the improvement obtained for the proba02 table is negligible for $m \approx 1.1n$). However, these keys may also fail to mimic the frequency table. This is particularly relevant for the highly biased table in the proba80 data set, where an automaton with $m \approx 1.1n = 7.7$ cannot effectively represent the required occurrence probabilities. Nevertheless, as automaton sizes are increased, redundancy reductions become substantial, and even more so as compared to randomly selected permutations of an automaton key.

V. CONCLUSIONS
This paper addresses the problem of efficiently obtaining tANS automaton redundancies, and the optimization of said automatons. We describe a method to efficiently obtain automaton redundancies, and we do so by efficiently obtaining per-state average codeword lengths through a forward difference operator, and by exploiting the structure of the state transition matrix to obtain the stationary probabilities of the automaton states. Our proposed method operates in almost-linear time complexity, $O\!\left(\frac{m R}{-\lg |\lambda_2|}\right)$, and is three to four orders of magnitude faster in practice than a straightforward implementation. In addition, we show that we can employ the proposed redundancy calculation method in a stochastic hill climbing optimization process to improve tANS encoder designs. Experimental results with a novel corpus of frequency tables obtained from realistic data compression processes indicate that automatons are consistently improved over other informed strategies of automaton optimization. In particular, results show that the improvements increase as automatons grow larger.

VI. ACKNOWLEDGMENTS
We would like to thank Charles Bloom for his well-thoughtout "rants" blog.

A. Notation
A summary of the symbols employed through this manuscript is included in Table V.

Figure 1 :
Figure 1: Example where Duda's and Yokoo's redundancy approximations yield identical results for two different automatons, having keys $f_1$ and $f_2$, which in reality have significantly different redundancy profiles. Additionally, both methods dramatically underestimate redundancy for $p_1 \approx 0.2$.

Figure 2 :
Figure 2: Diagram of the redundancy calculation method.

Figure 4 :
Figure 4: Redundancy achieved with automatons of a given size, where keys are restricted to be in order.

Figure 6 :
Figure 6: Redundancy as a function of automaton size.

Table I:
Encoding and decoding functions for the automaton defined by the key $f = 10211011$.
Table III reports on the performance of the proposed redundancy calculation method. Results are produced on an AMD Threadripper 1950X with 3200 MHz DDR4 RAM, employing a hybrid implementation of higher-level Python 3 code and lower-level elements in either NumPy or custom C code.

Table II :
Details of the corpus of frequency tables employed in the experimental results. The minimum, average, and maximum numbers of symbols are reported for each set of frequency tables.

Table III :
Wall-clock measurements of the proposed method (measured in seconds).The first table of each dataset is employed for the measurements.Invalid automaton sizes (m < n) are denoted by '-'.

Table IV :
Redundancy reductions (in percent) over the default and random key construction methods. Automaton sizes are set as multiples of the alphabet size. Results are averaged over all frequency tables in each data set. For each result, 50 thousand iterations are used.

Table V :
Symbol list.
$I$: Interval $\{m, \ldots, 2m - 1\}$
$I_s$: Interval $\{m_s, \ldots, 2m_s - 1\}$
$k_s$, $t_s$: Codeword-length and threshold values in $C$ for symbol $s$
$L$: Automaton average codeword length
$\ell_x$: Average codeword length for state $x$
$u_i$, $v_i$, $w_i$: $i$'th elements of $u$, $v$, $w$, respectively
$x$, $x'$, $y$: Automaton states