Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” #2

Rucknium · 2024-10-25T01:11:06Z

This review is for the 21 October 2024 version of Goodell, Salazar, & Slaughter (2024). “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero.”

Non-technical summary of paper

This paper proposes a statistical hypothesis test as a classifier to guess which transactions on the Monero blockchain may be self-send “churn” transactions, which are a conjectured strategy to increase a real transaction's anonymity set beyond the standard ring size of 16. The test has a high probability of misclassification. The test assumes that a churning user is using a best-case churning delay rule, which is to wait to churn by a random amount of time where the randomness is the standard Monero decoy selection algorithm so that the real spend blends in as much as possible with the decoys. The test attempts to detect if a ring's age distribution is closest to the decoy selection algorithm, which would suggest the transaction is a churn, or if it has a distribution that reflects “normal use” by average Monero users. The ability of this test to correctly classify transactions is limited by the fact that only a single ring member would have this “normal use” distribution. And the test does not know which of the ring members may have this “normal use” distribution because that is equivalent to knowing which ring member is the real spend.

The paper is an important contribution to the understanding of Monero's privacy properties, but it leaves many churning questions unanswered. The test does not attempt to guess which transaction a churn transaction is actually linked with. In other words, the test does not try to guess the real spend in suspected churn transactions, which would be an attempt to defeat a user's churning strategy. The paper does not attempt to model a local-neighborhood transaction graph where adversarial observers have partial information about users' real spends in rings, as in the Eve-Alice-Eve (EAE) and related attacks. And the paper does not answer questions about whether churn transactions that consolidate a large number of outputs are helpful or harmful to user privacy.

Technical summary of paper

This paper proposes and rigorously parameterizes a statistical hypothesis test as a classifier to guess which transactions on the Monero blockchain may be self-send “churn” transactions, which are a conjectured strategy to increase a real transaction's anonymity set beyond the standard ring size of 16. The paper supposes that churn transactions follow a plausible strategy for minimizing the probability that the real spend in a ring will be correctly guessed by an adversary. The strategy is to delay a churn transaction by a random amount of time according to the time delay distribution of decoy ring members drawn by the standard wallet software.

In the test, the null hypothesis is that all of a transaction's ring members follow the age distribution of the standard wallet software's decoy distribution. The alternative hypothesis is that one of the ring members, out of 16, does not follow standard wallet software's decoy distribution. If the null hypothesis is not rejected, the decision is made to label the transaction as likely a churn transaction. If the null hypothesis is rejected, the transaction is labeled as likely an “ad hoc” transaction, a term the paper uses to describe transactions that users make in normal use of the Monero as peer-to-peer electronic cash. Since churn transactions are likely rare (and may not even exist in the strict form that the paper proposes), the null and alternative hypotheses are in opposite positions compared to the usual case. They are in these positions because the test presumes to know exact information about the form of churn transactions, but limited information about “ad hoc” transactions.

The paper states that the test is uniformity most powerful (UMP), which is a nice, tidy, technical property for a test to have. It has this properly because it fulfills the requirements of the Neyman-Pearson lemma. Because the test is UMP, there is no test in its class that is more powerful for every parameter value in the alternative hypothesis's parameter space. There is no need to look for more powerful tests when the statistical practitioner's information set is limited to the assumptions of this test. Nevertheless, the actual power of the test in this setting is low. The power is approximately equal to its size: “Thus, in the first case, if our basic test has significance α, then our related test will have significance approximately α, indicating that our basic test also has power α.” If the size of the test is set to the conventional 0.05, its power will also be approximately 0.05. Power of 0.05 is far below the conventional minimum desirable power of 0.8 in many fields of study (Britt & Weisburd 2010 and Dumas-Mallet et al. 2017). The test's probability of misclassification is high. The main reasons for this are that essentially the test is using information from only a single observation (the one real spend in a ring of 16) and the test does not have specific information about the alternative hypothesis for this observation.

The paper is an important contribution to the analysis of churning, an important research question that has evaded rigorous analysis for years. The paper takes pains to account for tedious practical issues like the true dynamic, nonparametric nature of the decoy selection algorithm and what happens when there is a delay between transaction construction and broadcast. Nonetheless, the paper has a limited scope. A complete picture of churning's effectiveness against de-anomymization attacks will require more research.

Major points:

In Section 3.5. “Computing PMF”, the paper says “The function $f$ is generally not easy to derive in a closed-form solution, even with access to default wallet source code.” And in Section 7.0.1. “Future Work”, the paper says “Future work may include any of the following. Theoretically modeling the PMF $f$ ....” It may not be easy to accomplish, but fortunately this has been done in Rucknium (2023), based on documentation provided by @jeffro256 in Pull Request 9024.[1] An implementation in R is available as get.dsa.pmf() in this comment.
Due to the test's low power, the test has a very high rate of misclassification. When the test is set at the conventional significance level of 0.05, its power is approximately 0.05. Then, the probability of Type II error is 1 - 0.05 = 95 percent. That means that 95 percent of “ad hoc” transactions will be incorrectly classified as churns. This needs to be made more clear to the reader.
The paper does not specify a metric of user privacy nor a mathematical definition of the adversary's objective. Should the user be more concerned about Type I or Type II errors? What about the adversary? How would an adversary set the size of the test? Is a metric like effective anonymity set as in Serjantov & Danezis (2003) or the metrics of the large differential privacy literature appropriate in this setting, and what is the effect of this classification technique on the metrics?
“The existence of these tests represent a tremendous security threat for users of TPSASAs, even when these tests have low power.” In my opinion, the case for this statement has not been made. “Tremendous” is too strong. The test may represent a tremendous security threat, but the paper has not yet shown this. There is no analysis of a reduction in a user's anonymity set, a differential privacy analysis, or really any hard numbers to justify this strong statement.
The paper states “Our tests have no practical utility when user behavior is very well-described by the default wallet distribution, whether due to low prevalence of ad hoc transactions or similarity between user behavior and the default wallet distributions.” It may be worth backing up this statement with citations to previous work that urges the default wallet distribution be set as close to the use behavior distribution as possible. A list of references with relevant quotes is conveniently available on the first page of Rucknium (2022). Changes to the decoy distribution can be done at hard fork boundaries to avoid wallet software fingerprinting, but users can autonomously decide to churn if doing so improves their privacy.
In Section 6.5 “Too Many People Have to Churn”, the paper says “In an environment with high prevalence (of ad hoc transactions), low power tests are nevertheless useful; positive predictive value can remain high until prevalence drops below a critical threshold.” This statement leave out the fact that the positive predictive value is the share of transactions labeled as “ad hoc” transactions that are actually “ad hoc” transactions. This is the downside of the way the test is structured: “ad hoc” transactions are the alternative hypothesis. Since the adversary is trying to detect churn transactions instead of “ad hoc” transaction, the high positive predictive value in this setting is not very helpful for the adversary.
The discussion of the hypothesis test does not mention that there are multiple nonstandard decoy selection algorithms in custom wallet software. The hypothesis test applied to transactions constructed with this software would likely reject the null hypothesis almost always, not because the null is necessarily false, but because every drawn decoy would be drawn from a nonstandard decoy distribution. More information on this issue can be found in Monero Reseach Lab Issue 99, Hammad & Victor (2024), and ringxor.
Section 5.1 “Basic Attack” is interesting, but seems to suggest that churning actually is an effective strategy, contrary to statements elsewhere in the paper. The algorithm only tries to guess the real spend in “ad hoc” transactions. When it arrives at a churn transaction, the algorithm accepts that a churn is a black hole from which information about the real spend cannot be extracted. Since a large number of “ad hoc” transactions would be classified as churns, the algorithm would encounter many black holes.
Section 5.1 “Basic Attack” describes a “rolling” procedure where multiple dependent hypotheses are tested sequentially. In general, testing in this manner will give you tests that are not the correct size because of multiple/sequential hypothesis testing issues.
The “maximum likelihood estimate for $i^{*}$ ” is a classifier for the real spend in an “ad hoc” truncation. This estimator does not use any information about the real spend age distribution. The accuracy of this classifier could be compared with the MAP Decoder attack of Aeeneh et al. (2021), which requires the the real spend age distribution. Its Proposition 4 states “The decision rule in (11) is the optimal decision rule in terms of error probability when using $Y$ [set of ring member ages] as the input information.” Aeeneh et al. (2021) notes “We showed that if the distribution used for selecting decoy TXOs are known, then a malicious party can significantly weaken the privacy features of CryptoNote by tracing back the transactions. However, our studies on Monero, a CryptoNote-based blockchain, show that users are using different source codes each having a different algorithms for mixin selection. Fortunately, since there is no distribution that a strong majority of users are using for mixin selection, an attacker cannot derive the distribution of truly spent transactions and arrange the second attack on Monero.” Ongoing work partially described in Rucknium (2022) seeks to overcome this problem of multiple unknown decoy selection distributions.
At two points the paper gives an example when the test's power is 0.15 and its size is 0.05. Since the paper says that the test's power is approximately equal to its size, the examples should use numbers that are consistent with that approximate equality.
Section 6.2 “Effect of Binning” suggests that binning does not mitigate the threat of the paper's main hypothesis test. That may be true, but the reader should be informed that binning is not designed to defend against this type of analysis. Instead, it is supposed to act as a defense-in-depth against attacks against the real spend anonymity when the age of the real spend is guessed. Ronge et al. (2021) discusses a “hybrid” between a “partitioning” and “mimicking” decoy selection algorithm. The hybrid algorithm corresponds to a binning algorithm.
Section 6.1. “Low Sensitivity to Anonymity Set Size” states that increasing the ring size by one unit has a small negative effect on the test's power. The paper should give specific numbers, e.g. raising ring size from 16 to 17 and from 16 to 32, to provide the reader with a complete picture.

Minor points:

In Section 3.1 “Ad Hoc Anonymity Sets”, the paper says “For each $1 \leq i \leq n$ such that $i \neq i^{*}$, independently sample $y_{i} \in Y$ with PMF $f$ employed by the default wallet software, re-sampling in the case of collisions.” The default wallet software draws more than $n$ possible decoys because some outputs may be time locked. The effect of this on the actual distribution is probably negligible, however. See Pull Request 9023 and especially this comment.
The paper states “Default wallet distributions $f$ are not unimodal in practice, because they are locally sensitive to blockchain density.” The paper does mot make clear why the unimodal property is important in this setting.
The paper states “However, this attack is not particularly efficient” and “Since matching algorithms can be made very efficient, this is approach is likely much more efficient than the naı̈ve attack presented in Section ??, and with similar performance.” It should be clarified whether “efficient” is meant in the computational sense or in the statistical sense.
Use $\mathbb{1}\lbrace\rbrace$ instead of $I()$ for the indicator function for clearer notation.
Fix ?? broken references in the LaTeX code.
Typographical errors: “expoit” in the abstract. “moding” in Section 4.2 “Model Selection”. Missing word in “However, the procedure requires taking $L · m · n$ samples of from the wallet distribution $f$...” in Section 4.1. “Suggested Empirical Approach”. Missing word in “An upper bound on the size α a level of the test” in Section 2.

Note that I have not checked the paper's algebraic manipulations for correctness.

Notes

[1] The description in Rucknium (2023) says “The indices are counted from the first output in the 11th most recent block, starting from index 1.” Version 0.13.0.2 to 0.18.2.1 of the standard wallet software used the 11th most recent block. Version 0.18.2.2 and after used the 10th most recent block. See https://www.getmonero.org/2023/06/08/10block-old-decoy-selection-bug.html

References

Aeeneh, Chervinski, Yu, & Zlatanov (2021). “New attacks on the untraceability of transactions in cryptonote-style blockchains.” 2021 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), 1–5.

Britt & Weisburd (2010). “Statistical Power.” In: Piquero, A., Weisburd, D. (eds) Handbook of Quantitative Criminology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-77650-7_16

Dumas-Mallet, Button, Boraud, Gonon, & Munafò (2017) “Low statistical power in biomedical science: a review of three human research domains.” R. Soc. open sci. 4: 160254. http://dx.doi.org/10.1098/rsos.160254

Hammad, & Victor (2024). “Monero Traceability Heuristics: Wallet Application Bugs and the Mordinal-P2Pool Perspective.”

Ronge, Egger, Lai, Schröder, & Yin (2021). “Foundations of Ring Sampling.” Proceedings on Privacy Enhancing Technologies, 2021(3), 265–288.

Rucknium (2022). “Fully Specified Estimation Plan for Optimal Static Parametric Estimation of Arbitrary Distributions (OSPEAD).” https://github.com/Rucknium/OSPEAD/releases/tag/v0.1.0

Rucknium (2023). “Closed-form Expression of Monero’swallet2 Decoy Selection Algorithm.” https://github.com/Rucknium/misc-research/blob/main/Monero-Decoy-Selection-Closed-Form/pdf/monero-decoy-selection-closed-form.pdf

Serjantov & Danezis (2003). “Towards an Information Theoretic Metric for Anonymity”. In R. Dingledine & P. Syverson (Eds.), Privacy Enhancing Technologies (Vol. 2482, pp. 41–53). Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36467-6_4

The text was updated successfully, but these errors were encountered:

plowsof mentioned this issue Oct 26, 2024

Monero Community Workgroup Meeting: October 26th 15:00 UTC monero-project/meta#1095

Closed

DiosDelRayo mentioned this issue Oct 28, 2024

Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” DiosDelRayo/DiosDelRayo.github.io#128

Open

Rucknium mentioned this issue Oct 31, 2024

Monero Research Lab Meeting - Wed 30 October 2024, 17:00 UTC monero-project/meta#1102

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” #2

Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” #2

Rucknium commented Oct 25, 2024

Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” #2

Rucknium's review of “Uniformly Most Powerful Tests for Ad Hoc Transactions in Monero” #2

Comments

Rucknium commented Oct 25, 2024

Non-technical summary of paper

Technical summary of paper

Major points:

Minor points:

Notes

References