
wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) #7821

Merged · 1 commit · Aug 20, 2021

Conversation

@j-berman (Collaborator) commented Jul 30, 2021

Overview

When the wallet selects decoys using the gamma distribution, the expected distribution is supposed to be fit to the chain tip, but it is instead being fit to a point 10 blocks before the tip (the unlock time). The fix in this PR simply shifts the result of the gamma forward by the expected duration of the unlock time. If the gamma produces an output younger than the unlock time, the fix places that output in a random block among the first 50 spendable blocks (while still factoring in block density). I assumed that outputs younger than the unlock time suggested by the gamma should still be factored into the distribution, because one would expect outputs selected by the gamma in this range to be spent soon after they unlock: someone who spent an output after 1 block back when that was allowed would likely spend that output soon after it unlocks today.
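For illustration, here is roughly what the fixed selection logic looks like. This is a simplified Python sketch only, not the actual wallet2.cpp implementation; sample_gamma_age_in_blocks, cumulative_outputs_per_block, and the constants are illustrative stand-ins:

import random

DEFAULT_UNLOCK_BLOCKS = 10   # stands in for CRYPTONOTE_DEFAULT_TX_SPENDABLE_AGE
RECENT_WINDOW = 50           # spendable blocks considered when the gamma picks too recent an output

def pick_output_index(sample_gamma_age_in_blocks, cumulative_outputs_per_block, chain_height):
    # 1. Sample an output age (in blocks) from the gamma, measured from the chain tip.
    age_in_blocks = int(sample_gamma_age_in_blocks())

    if age_in_blocks < DEFAULT_UNLOCK_BLOCKS:
        # 2. The pick is younger than the unlock time: re-slot it into a random block
        #    among the RECENT_WINDOW most recent spendable blocks, weighted by how
        #    many outputs each of those blocks contains (block density).
        newest_spendable = chain_height - DEFAULT_UNLOCK_BLOCKS
        candidates = range(newest_spendable - RECENT_WINDOW + 1, newest_spendable + 1)
        weights = [cumulative_outputs_per_block[h + 1] - cumulative_outputs_per_block[h]
                   for h in candidates]
        block = random.choices(candidates, weights=weights)[0]
    else:
        block = chain_height - age_in_blocks

    # 3. Pick an output index uniformly within the chosen block.
    #    (The real code also retries on empty blocks and already-picked outputs.)
    lo = cumulative_outputs_per_block[block]
    hi = cumulative_outputs_per_block[block + 1]
    return random.randrange(lo, hi)

Here cumulative_outputs_per_block is assumed to be a list where entry h is the total number of outputs created before block h, so adjacent differences give per-block output counts.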

The material negative impact of this issue still appears to be contained to the very earliest spents; however, deeper analysis would be needed to arrive at harder figures.

Reasoning behind fitting the gamma from the chain tip

The decoy selection algo applies the gamma distribution starting 10 blocks prior to the chain tip (because of the unlock time), but it appears the gamma distribution should be applied starting at the chain tip.

As mentioned in "The fix" of #7807, this was my first thought as to the cause, but I figured it might be incorrect: the gamma suggests a non-negligible chance of spending an output between 1 and 9 blocks old, which I assumed was implausible because of the unlock time, so there had to be some other explanation for the issue. But it turns out the gamma distribution in the paper was fit to data from a time when the unlock time wasn't enforced by consensus, and as a talk by isthmus demonstrated, some wallets didn't follow the convention at the time. Therefore, there were some outputs on chain spent before 10 blocks that the gamma distribution factors in.

I reached out to one of the authors of the paper to sanity check this, and the above does seem to be the case. See the final section of their analysis, where the chain data is fit to the gamma; specifically:

def spendTimes():
    # Cypher query against a Neo4j graph of the chain: for each input with
    # mixin > 0 whose real spend is identified in the graph (it has a SPENDS
    # edge), return the time difference between the block containing the
    # spending transaction (b2) and the block containing the spent output (b1).
    query = """
      MATCH (i:Input)<-[:TX_INPUT]-(tx:Transaction)-[:IN_BLOCK]->(b2:Block)
      WHERE i.mixin > 0 AND (i)-[:SPENDS]->()
      WITH i, b2.timestamp as ts
      MATCH (b1:Block)<-[:IN_BLOCK]-(:Transaction)-[:TX_OUTPUT]->(:Output)<-[:SPENDS]-(i)
      RETURN ID(i) as iId, ts - b1.timestamp as timeDiff
    """
    # 'graph' and 'to_data_frame' are defined elsewhere in the original analysis
    # (a Neo4j connection and a helper converting the result to a DataFrame).
    df = to_data_frame(graph.run(query))
    return df

Analysis of the fix

See this comment for an explanation of how I arrived at a window of 50 for placing outputs produced by the gamma that are more recent than the unlock time. It also provides analysis of the fix's impact.

The analysis below is outdated, but I'm keeping it here for posterity so the flow of discussion in the comments below makes more sense.

Results of the fix

I used this code to simulate get_outs with the fix in this PR, and plotted against the current:

[Chart: 10 block unlock fix vs current (1)]

[Chart: 10 block unlock fix vs current (2)]

You can see in the above charts that the very earliest outputs are selected at a higher frequency with the fix. As you move further right, outputs are selected at a marginally lower frequency, until the two algorithms hit a steady state of selecting outputs at roughly equivalent rates.

The numbers explaining the above observation

The gamma's expected probability of a spent output between 0 and 10 blocks old is ~2.1%.

Between 1 and 10 blocks, it's still ~2.1% (since 0 to 1 block is negligible).

Between 10 and 11 blocks, it's ~0.3%.

Between 11 and 20 blocks, it's ~2.3%.

Between 20 and 30 blocks, it's ~2.2%.
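Figures of this kind can be roughly reproduced from the gamma itself. Below is a sketch assuming the parameters used by the paper and the wallet (shape 19.28, rate 1.61 over ln(output age in seconds)) and a 120-second block time; exact values will differ somewhat depending on how ages are discretized and whether block density is factored in:

from math import log
from scipy.stats import gamma

SHAPE, RATE = 19.28, 1.61      # gamma over ln(output age in seconds)
BLOCK_TIME = 120               # seconds per block

def prob_spent_between(block_a, block_b):
    # Probability mass the gamma assigns to an output spent between
    # block_a and block_b blocks of age.
    dist = gamma(a=SHAPE, scale=1.0 / RATE)
    lo = dist.cdf(log(block_a * BLOCK_TIME)) if block_a > 0 else 0.0
    return dist.cdf(log(block_b * BLOCK_TIME)) - lo

for a, b in [(0, 10), (10, 11), (11, 20), (20, 30)]:
    print(f"{a:>2}-{b:<2} blocks: {prob_spent_between(a, b):.3%}")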

With the fix, we would expect outputs less than 10 blocks old produced by the gamma to be placed in the first spendable block; thus, we should expect 2.1% + 0.3% of outputs to be selected between 10 and 11 blocks. However, the current algorithm suggests that between 10 and 11 blocks, close to 0% of outputs would be selected as decoys (before factoring in density). Thus, between 10 and 11 blocks, the current algorithm under-selects decoys by about 2.4%. This means that outputs observed in the 10-to-11-block age range are more likely to be real spents. In practice, however, because the decoy selection algorithm factors in density (and therefore likely selects some decoys in this age range), and because we've only observed ~0.5% of outputs in this age range, it appears that only a very small percentage of real spents would have been identifiable.

Between 11 and 20 blocks, the current algorithm is expected to match the gamma's 1-to-10-block range of 2.1%. As noted, the gamma's expected probability of a spent output between 11 and 20 blocks is 2.3%. Thus, the current algorithm only slightly under-selects outputs in this age range, which means nothing definitive is likely to be gained from observing outputs in rings in this age range (ignoring the impact of block density).

Between 20 and 30 blocks, the current algorithm is expected to match the gamma's 10-to-20-block range of 0.3% + 2.3%. As noted, the gamma's expected probability for this range is 2.2%. Therefore, the current algorithm over-selects decoys in this interval, and outputs spent in this range are overly protected, i.e. even less likely to be deducible as real. The same goes for the range between 30 and 250 blocks.

Beyond 250 blocks, the current and fixed algorithms are roughly equivalent.

Conclusion

As first reported, the impact appears mostly contained to the very earliest spents: outputs spent right when they unlock that were created in blocks smaller than average. The initial maximum estimate of 1% of transactions affected seems corroborated by the above findings. A deeper analysis factoring in block density would be necessary to arrive at harder figures.

Appendix: impact on tx uniformity

As a result of the fix, approximately 1 in 5 rings (1 - [1 - (2.1% + 0.3%)]^10 = 21% ≈ 1 in 5) is now expected to include at least 1 very early decoy. With this information, someone would likely be able to guess with a higher degree of certainty that a transaction is coming from a fixed wallet; however, I do not see a practical vulnerability stemming from this knowledge.

Edit 1: small change to select unspendable output from first spendable block randomly
Edit 2: corrected charts for small change
Edit 3: updated for using a window of 50

@j-berman (Collaborator, Author)

@UkoeHB Made a small change for when the gamma spits out an output more recent than the unlock time. In that case, it now selects a random value between 0 and 120, rather than scaling the gamma result down to between 0 and 120, because the scaling-down approach biases toward 60 and 120. I don't see why it should have a bias like that; all outputs in that initial 0-to-120 range should be equally likely to be picked when the selection falls into that else branch.

@j-berman j-berman changed the title Apply gamma distribution from chain tip when selecting decoys (#7807) wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) Jul 30, 2021
@j-berman (Collaborator, Author) commented Aug 3, 2021

While working through the impact of tx uniformity in #7798, I recognized a potential area of concern as a result of this PR's impact on tx uniformity. I don't think it's a reason to hold this PR back, but figured it's worth sharing to better understand this risk.

If you assume that 99% of clients update, and 1% do not, then the 1% non-updated clients may stick out.

Assume that a user constructs a transaction with 20 rings and not a single one has a very early output in it. The chances of an updated client doing that are very low ([1 - 21%]^20 = 0.9%), so it is safe to assume that if a tx is constructed in this way, it is from a non-updated client. Assume a user then spends outputs created in that tx in similar fashion (many rings, non-updated client). That would then be very strong evidence that the outputs in the ring coming from the original non-updated client tx are real.

Such a circumstance seems extraordinarily rare. I can, however, see it harming a user who is very late to update their wallet, but I cannot see it impacting other users who do not update. And considering that current users are being harmed today by the implications of this issue, it needs to be patched for those users immediately.

@j-berman (Collaborator, Author)

Latest development

In monero-dev IRC, @luigi1111 suggested smoothing out the simulation of the decoy selection algorithm and plotting that, in order to get a better idea of how the patch would alter the algorithm. After graphing a smooth simulation of the decoy selection algorithm, I observed that my initially proposed fix alters the shape of the distribution such that it seems to perform marginally worse for outputs 15-100 blocks old:

[Chart: patch #7821 alone + window of 1]

I went back to the drawing board and tried many different combinations of fixes to arrive at what I believe is the safest, do-no-harm approach: one that patches the issue at hand while simultaneously providing sufficient protection for the earliest spents.

The current proposed patch

  1. Apply the gamma from the chain tip, same as initially proposed.

  2. When the gamma picks an output younger than the unlock time, replace that output with a randomly selected output from the 50 most recent spendable blocks.

The justification for continuing to factor in gamma picks younger than the unlock time is still that people who spent outputs very quickly back when the gamma was observed would likely still be spending outputs relatively quickly today. The justification for using a window of 50 to slot them in is that it empirically seems to perform well:

[Chart: patch #7821 alone + window of 50]

Math backing up the decision

As Rucknium mentioned to me in a 1-on-1 IRC conversation, and as used in Miller et al., the Kolmogorov-Smirnov test quantifies the distance between an observed distribution and an expected one. As such, it can be used to quantify how well the proposed patch would perform compared to the current decoy selection algorithm: the lower the distance given by the Kolmogorov-Smirnov statistic, the better the algorithm is at matching the observed distribution.
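For reference, the K-S statistic is simply the largest absolute gap between the two empirical CDFs:

D = \max_x \left| F_{\mathrm{observed}}(x) - F_{\mathrm{expected}}(x) \right|

which is exactly what the script further below computes block by block.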

If my math is correct (see below), the K-S statistic for the current decoy selection algorithm is 0.0167, while the K-S statistic of the current proposed fix is 0.0071. Thus, the patch is a material improvement over the current.

Finally, the K-S statistic of the patch I initially proposed (dumping outputs in the first spendable block) is 0.0190. Thus, the fix I'm proposing now is a material improvement over the patch I initially proposed.

Math to recreate K-S Statistic

Download the following CSV, then run the following Python script:

import csv

n_observed = 0
n_current_algo = 0
n_initial_proposed_fix = 0
n_current_proposed_fix = 0

cumulative_sum_observed = []
cumulative_sum_current_algo = []
cumulative_sum_initial_proposed_fix = []
cumulative_sum_current_proposed_fix = []

with open('output_age_data.csv','r') as csvfile:
    rows = csv.reader(csvfile, delimiter = ',')
    
    # get n outputs for each distribution type & cumulative sums
    idx = 0
    for row in rows:
        if idx > 0:
            observed         = int(row[1])
            current_algo     = int(row[2])
            initial_proposed = int(row[3])
            current_proposed = int(row[4])

            n_observed              += observed
            n_current_algo          += current_algo
            n_initial_proposed_fix  += initial_proposed
            n_current_proposed_fix  += current_proposed
            
            if idx == 1:
                cumulative_sum_observed.append(observed)
                cumulative_sum_current_algo.append(current_algo)
                cumulative_sum_initial_proposed_fix.append(initial_proposed)
                cumulative_sum_current_proposed_fix.append(current_proposed)
            else:
                cumulative_sum_observed.append(observed + cumulative_sum_observed[len(cumulative_sum_observed) - 1])
                cumulative_sum_current_algo.append(current_algo + cumulative_sum_current_algo[len(cumulative_sum_current_algo) - 1])
                cumulative_sum_initial_proposed_fix.append(initial_proposed + cumulative_sum_initial_proposed_fix[len(cumulative_sum_initial_proposed_fix) - 1])
                cumulative_sum_current_proposed_fix.append(current_proposed + cumulative_sum_current_proposed_fix[len(cumulative_sum_current_proposed_fix) - 1])
               
        idx += 1

cdf_observed = []
cdf_current_algo = []
cdf_initial_proposed_fix = []
cdf_current_proposed_fix = []

# iterate over cumulative sums and divide by sum to get F(x) for each block
idx = 0
for cumulative_observed in cumulative_sum_observed:
    cdf_observed.append(cumulative_sum_observed[idx] / n_observed)
    cdf_current_algo.append(cumulative_sum_current_algo[idx] / n_current_algo)
    cdf_initial_proposed_fix.append(cumulative_sum_initial_proposed_fix[idx] / n_initial_proposed_fix)
    cdf_current_proposed_fix.append(cumulative_sum_current_proposed_fix[idx] / n_current_proposed_fix)

    idx += 1

# iterate over F(x)'s and get abs(F(observed) - F(x))
set_of_diffs_current_algo = []
set_of_diffs_initial_proposed_fix = []
set_of_diffs_current_proposed_fix = []

idx = 0
for f_of_observed in cdf_observed:
    set_of_diffs_current_algo.append(abs(f_of_observed - cdf_current_algo[idx]))
    set_of_diffs_initial_proposed_fix.append(abs(f_of_observed - cdf_initial_proposed_fix[idx]))
    set_of_diffs_current_proposed_fix.append(abs(f_of_observed - cdf_current_proposed_fix[idx]))

    idx += 1

# get the max of each
print("K-S statistic for current algorithm:    ", max(set_of_diffs_current_algo))
print("K-S statistic for initial proposed fix: ", max(set_of_diffs_initial_proposed_fix))
print("K-S statistic for current proposed fix: ", max(set_of_diffs_current_proposed_fix))

@j-berman (Collaborator, Author) commented Aug 13, 2021

Latest Development

In monero-dev IRC, @luigi1111 requested to see what using a window of 20 would look like, and @Gingeropolous requested to see simulations run over different epochs, as well as comparisons to the gamma distribution (ignoring block density).

  • I re-ran the simulations using windows of 1, 3, 10, 20, 40, 50, 75, and 100 (I had previously looked at 1, 3, 40, 50, and 100 locally before settling on 50).

  • I re-ran the analysis over different hard fork intervals using @neptuneresearch's suggestion as a guide.

  • I included the gamma on each plot to allow the reader to form a basis of comparison.

After doing the above, I still lean towards the current window of 50 as best achieving the sanest, least-potential-for-harm approach that I described in IRC as follows:

the idea of this fix is to get something out the door that does a good enough job to cover the earliest spent issue, and doesn't cause any potential harm. I can't imagine that getting [the decoy selection algorithm to select outputs that are] marginally closer to the [observed output age distribution on-chain] causes harm (and may help), but moving further away could hurt.

As discussed in IRC, beyond this PR, we could continue with deeper research toward an even stronger solution that takes another crack at the assumptions laid out in Miller et al. and factors in observed chain data since then. That research will take a fair amount of time to complete, and this PR's do-no-harm approach offers a solid patch until it is finished. Also, shoutout to @Rucknium, an applied statistician by trade who has some excellent ideas and has offered to contribute in this area :)

Results simulating different windows, over different epochs

v14 (2210720 - 2413735)

[Chart: different_windows_v14 PDF]

First general takeaways from this chart

The current decoy selection algorithm (orange) appears to be under-selecting outputs relative to the observed output ages in rings (blue), as is apparent from the large triangular gap between the orange line and the blue. Basically, many more outputs are observed on-chain over that age range than the current decoy selection algorithm would produce, which is visualized as the triangular gap.

Here are 3 potential causes for this triangular gap:

  1. A popular wallet is not implementing the same decoy selection algorithm, and is suggesting too recent of outputs.

  2. A nefarious actor with many outputs is modifying their client to select more recent outputs.

  3. Recent real outputs unaccounted for by the decoy selection algorithm are appearing in that range.

After doing a fair amount of investigating, I lean strongly toward 3 for a host of reasons, and I can dive further into my reasoning if asked; for the sake of staying focused, I'll hold off. I will continue with the assumption that 3 is most likely (that the decoy selection algorithm, as is, is likely under-selecting recent real outputs).

Continuing with that assumption, it makes sense to try to arrive at a solution that bridges the gap between the current decoy selection algorithm and the observed distribution.

Additionally, it is apparent that the gamma distribution (green line) is not suitable on its own as the distribution to fit to and treat as "the source of truth". That is, if we were to match the decoy selection algorithm identically to the gamma (which is what the wallet did before block density was factored in), the algorithm would likely perform even worse, selecting even fewer outputs in the range shown in the chart (since green is a fair amount below orange, which is already below blue).

Comparing the different windows

The window of 20 seems to perform well between ages 11 and 30 (by "well", I mean it bridges the gap from the current algorithm to the observed on-chain data, which counts as "well" under the assumption above that the gap is caused by the algorithm missing real outputs). However, around age 30, it shifts below the current decoy selection algorithm, which is itself already below the observed on-chain data. Thus, it starts to marginally under-select outputs, performing marginally worse than the current algorithm. That, to me, felt like potential for harm (it is why I initially chose 50 over 40: you can see that 40 also starts to move slightly below the orange line later on).

The potential for harm does, however, seem very small. Here are the Kolmogorov-Smirnov statistics for each distribution when comparing to observed on-chain data (recall, smaller means the distance to observed is lower, and therefore it performs "better" based on the above assumption that getting closer to observed is the desirable outcome):

Distribution type              |   K-S Stat
-------------------------------------------------
 Current decoy selection algo  |   0.02104
 Normal gamma                  |   0.03239
 Window 1                      |   0.01946
 Window 3                      |   0.01598
 Window 10                     |   0.00897
 Window 20                     |   0.00890
 Window 40                     |   0.00896
 Window 50                     |   0.00886
 Window 75                     |   0.01286
 Window 100                    |   0.01549

Given the above K-S stats, it seems you can't really go wrong with a window between 10 and 50. Thus, combining visual analysis with the K-S stat seems a prudent route to arrive at a window of 50.

v12 (1978433 - 2210000)

[Chart: different_windows_v12 PDF]


Distribution type              |   K-S Stat
-------------------------------------------------
 Current decoy selection algo  |   0.02520
 Normal gamma                  |   0.03955
 Window 1                      |   0.01745
 Window 3                      |   0.01299
 Window 10                     |   0.01152
 Window 20                     |   0.01148
 Window 40                     |   0.01149
 Window 50                     |   0.01279
 Window 75                     |   0.01715
 Window 100                    |   0.01988

This epoch seems to exhibit very similar properties to v14, and the same analysis above applies.

v11 (1788720 - 1978433)

[Chart: different_windows_v11 PDF]

Distribution type              |   K-S Stat
-------------------------------------------------
 Current decoy selection algo  |   0.04025
 Normal gamma                  |   0.02125
 Window 1                      |   0.04278
 Window 3                      |   0.04281
 Window 10                     |   0.04272
 Window 20                     |   0.04281
 Window 40                     |   0.04275
 Window 50                     |   0.04282
 Window 75                     |   0.04284
 Window 100                    |   0.04273

In this interval, it appears the plain gamma applies best, and there isn't much to go off of here in the way of deciding between the different windows. I believe this interval looks like this because the code to factor in block density was not released until July 17, 2019, circa block 1881000, and that code wasn't released as part of a hard fork. So I don't think it makes much sense to use this interval in the analysis of how to modify the current approach.

average_output_time shift from 2 to 1 (2383730 - 2413735)

[Chart: different_windows_average_output_time_shift PDF]

Distribution type              |   K-S Stat
-------------------------------------------------
 Current decoy selection algo  |   0.03077
 Normal gamma                  |   0.04961
 Window 1                      |   0.02540
 Window 3                      |   0.02550
 Window 10                     |   0.02558
 Window 20                     |   0.02534
 Window 40                     |   0.02540
 Window 50                     |   0.02536
 Window 75                     |   0.02562
 Window 100                    |   0.02561 

As highlighted in #7798, the decoy selection algorithm is currently missing recent outputs by a factor of nearly 2x at one step of the selection calculation, because of an integer truncation issue: since block 2383730, average_output_time has been truncated down to 1 when it should really be around 1.9. This epoch demonstrates how the current decoy selection algorithm now under-selects recent outputs by a wider margin as a result (the integer truncation issue has its largest impact in this epoch).
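To illustrate the truncation itself, here is a Python sketch with made-up counts (the real calculation lives in wallet2.cpp, and #7798 covers the actual fix):

# Illustrative numbers only (not the actual wallet2.cpp calculation)
DIFFICULTY_TARGET_V2 = 120      # seconds per block
blocks_in_span       = 100
outputs_in_span      = 6300     # hypothetical: ~63 outputs per block

average_output_time = DIFFICULTY_TARGET_V2 * blocks_in_span / outputs_in_span
print(average_output_time)       # ~1.90 seconds per output
print(int(average_output_time))  # truncates to 1 -- nearly a 2x error in the per-output time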

There isn't much extra to add from this epoch in the way of choosing between the windows, however.

Conclusion

I still believe a window of 50 offers the least-potential-for-harm approach: it won't start to marginally under-select outputs between ages 30 and 100 (as a smaller window appears to do), and it will still select a decent-sized share of younger outputs, as appears to be desired.

My data + code to plot diagrams and calculate K-S statistics

Output age data for each epoch

Unzip the following csv's:

v14 (2210720 - 2413735).zip
v12 (1978433 - 2210000).zip
v11 (1788720 - 1978433).zip
average_output_time shift from 2 to 1 (2383730 - 2413735).zip

My code to produce the CSVs is a bit messy, but I'm happy to clean it up and share if desired. I heavily modified blockchain_usage.cpp to write the files (the result doesn't really resemble blockchain_usage.cpp at all).

Python to plot diagrams

import csv
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

plots = [
    'v14 (2210720 - 2413735)',
    'v12 (1978433 - 2210000)', 
    'v11 (1788720 - 1978433)', 
    'average_output_time shift from 2 to 1 (2383730 - 2413735)'
]

for plot in plots:
    headers = []
    values = []

    with open(plot + '.csv', 'r') as csvfile:

        rows = csv.reader(csvfile, delimiter = ',')
    
        idx = 0
        for row in rows:
            if idx == 0:
                for h in row:
                    headers.append(h)

                i = 0
                while (i < len(row)):
                    values.append([])
                    i += 1

            elif idx > 0:
                i = 0
                for v in row:
                    values[i].append(int(v))
                    i += 1

            idx += 1
  
    aggregate_sums = []

    for distr in values:
        aggregate_sum = 0
        for value in distr:
            aggregate_sum += value

        aggregate_sums.append(aggregate_sum)
    
    probabilities = []
    idx = 0
    for distr in values:
        probabilities.append([])
        for value in distr:
            probabilities[idx].append(value / aggregate_sums[idx])
        idx += 1

    ax = plt.axes()

    idx = 1
    while (idx < len(headers)):
        blocks = values[0]

        line_style = '-'
        if idx > 3:
            line_style = '--'

        ax.plot(blocks, probabilities[idx], line_style, label=headers[idx])

        idx += 1

    plt.legend()
    plt.xscale('log')

    ax.xaxis.set_major_formatter(mticker.StrMethodFormatter('{x:.0f}'))
    ax.xaxis.set_minor_formatter(mticker.NullFormatter())

    plt.xlim(left=10, right=110)
    plt.ylim(top=.0075, bottom=0)

    plt.title('Output age [' + plot + ']')

    plt.ylabel('Probability (frequency of output age / total number of outputs)')
    plt.xlabel('Output Age (blocks)')
    plt.show()

Python to calculate K-S statistics

import csv

plots = [
    # 'output_age_data',
    'v14 (2210720 - 2413735)',
    'v12 (1978433 - 2210000)', 
    'v11 (1788720 - 1978433)', 
    'average_output_time shift from 2 to 1 (2383730 - 2413735)'
]

for plot in plots:
    headers = []
    cumulative_sums = []

    with open(plot + '.csv', 'r') as csvfile:

        rows = csv.reader(csvfile, delimiter = ',')
    
        # get cumulative number of outputs at each block, for each distribution
        idx = 0
        for row in rows:
            if idx == 0:
                for h in row:
                    headers.append(h)

                i = 1 # skip "Blocks" column, counting outputs in other columns
                while (i < len(row)):
                    cumulative_sums.append([int(0)])
                    i += 1
            else:
                i = 1 # skip "Blocks" column
                while (i < len(row)):
                    prior_cumulative_sum_index = len(cumulative_sums[i - 1]) - 1
                    prior_cumulative_sum = cumulative_sums[i - 1][prior_cumulative_sum_index]

                    current_cumulative_sum = prior_cumulative_sum + int(row[i])

                    cumulative_sums[i - 1].append(current_cumulative_sum)
                    i += 1

            idx += 1

    # get cdf's for each distribution
    cdfs = []

    i = 0
    for cumulative_sum_distr in cumulative_sums:
        cdfs.append([])

        for cumulative_sum in cumulative_sum_distr:
            num_outputs_observed = cumulative_sum_distr[len(cumulative_sum_distr) - 1]
            cdfs[i].append(cumulative_sum / num_outputs_observed)

        i += 1

    # iterate over F(x)'s and get abs(F(observed) - F(x))
    set_of_diffs = []
    cdf_observed = cdfs[0]

    i = 0
    for cdf in cdfs:
        if i > 0:
            set_of_diffs.append([])

            j = 0
            for cumulative_percentage in cdf:
                set_of_diffs[i - 1].append(abs(cdf_observed[j] - cumulative_percentage))
                j += 1

        i += 1

    # get the max of each to get K-S statistic
    i = 2
    print("BLOCK RANGE: ", plot, "\n")
    print("%-30s |   %-6s" % ("Distribution type", "K-S Stat"))
    print("-------------------------------------------------")
    for diff_set in set_of_diffs:
        ks_stat = max(diff_set)
        print("%-30s |   %1.5f" % (headers[i], ks_stat))
        i += 1

    print("\n*****************\n")

Edit: as requested by @Rucknium in monero-dev IRC, made the charts consistent probability densities with the same x- and y-axis cutoffs. I also noticed a small, inconsequential off-by-1 issue when pulling the normal gamma distribution data and fixed it.

@luigi1111 (Collaborator)

I agree with the analysis on the surface, and with the conclusion that something in the 10-50 range is the "least wrong" of the current set. I have no particular objection to 50 as the choice. In any case, I'll give this some additional time for further percolation and input, if any.

@j-berman j-berman changed the title wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) [WIP] wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) Aug 17, 2021
@j-berman j-berman changed the title [WIP] wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) wallet2: apply gamma distribution from chain tip when selecting decoys (#7807) Aug 18, 2021
@Rucknium

From a statistical perspective, I support the latest version. What is accomplished here is "thickening" the probability density function of the selection algorithm in the section closest to zero. This more closely mimics the observed distribution of mixins + real spends. However, in the near future it is crucial that we consider moving away from the current selection algorithm that is based on Moser et al. 2018. I have some ideas about how to accomplish this.

@UkoeHB (Contributor) left a comment:

Thank you!

- matches the paper by Miller et al.: apply the gamma from the chain tip, rather than from after the unlock time
- if the gamma produces an output more recent than the unlock time, the algo packs that output into one of the first 50 spendable blocks, respecting the block density factor
j-berman added a commit to j-berman/monero that referenced this pull request Oct 27, 2021
- select_outputs.gamma: decrease expected median for recent changes to algorithm (monero-project#7821) & re-attempt test with 10x larger sample size if the test fails on first try
- select_outputs.density: allow a wider deviation from chain data to selected data for larger blocks, and smaller deviation for smaller blocks (the allowed deviation is proportional to size now) + test some other sensible heuristics
- select_outputs.same_distribution: allow slightly larger average deviation from picks to chain data
- still not perfect, but only deterministic tests can be perfect