-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support sampling for synthetic event group spec #1425
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status: 2 of 3 files reviewed, 7 unresolved discussions (waiting on @uakyol)
a discussion (no related file):
Forgive my lack of understanding on how all the math works, but how does this interact with VidSamplingInterval
from MeasurementSpec
? Are they unrelated?
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r1 (raw file):
populationSpec: SyntheticPopulationSpec, syntheticEventGroupSpec: SyntheticEventGroupSpec, randomSeed: Long = 0L,
Pass a Kotlin Random object defaulting to Random
instead of passing a seed. This way the caller can pass a pre-seeded Random or some non-default Random impl.
Code quote:
randomSeed: Long = 0L,
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 190 at r1 (raw file):
private fun SyntheticEventGroupSpec.FrequencySpec.hasOverlaps(): Boolean { return this.vidRangeSpecsList
nit: unnecessary qualification
Code quote:
this.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 191 at r1 (raw file):
private fun SyntheticEventGroupSpec.FrequencySpec.hasOverlaps(): Boolean { return this.vidRangeSpecsList .toList()
Isn't this already a List?
Code quote:
.toList()
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 194 at r1 (raw file):
.flatMap { vidRangeSpec: VidRangeSpec -> listOf( RangePoint(vidRangeSpec.vidRange.start, true),
I feel like there's a better way to implement this that can take advantage of Kotlin ranges. See https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.ranges/-open-end-range/ and how we implemented our own overlaps
method for a range of Instant.
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 116 at r1 (raw file):
// A map of `non_population_fields` from `SyntheticPopulationSpec` to // their values. map<string, FieldValue> non_population_field_values = 3;
Since this message is already in use the field should not be renumbered.
Code quote:
3
src/test/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGenerationTest.kt
line 386 at r1 (raw file):
} fun `generateEvents fails when overlapping vidRanges exist`() {
nit: be more specific than "fails"
Suggestion:
generateEvents throws IllegalArgumentException when vid ranges overlap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 5 unresolved discussions (waiting on @SanjayVas)
a discussion (no related file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Forgive my lack of understanding on how all the math works, but how does this interact with
VidSamplingInterval
fromMeasurementSpec
? Are they unrelated?
They are completely unrelated. After these vids are generated here, they are hashed to the [0,1] interval.
VidSamplingInterval
from MeasurementSpec
is an interval in [0,1] it selects a bunch of them and based on the size of the interval, reach result is scaled to approximate what would have been if they weren't sampled.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r1 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Pass a Kotlin Random object defaulting to
Random
instead of passing a seed. This way the caller can pass a pre-seeded Random or some non-default Random impl.
ah right. My mind got twisted, I thought that wouldn't provide the same vids for the same random object but rest of this func is deterministic and not based on anything on the requisition, it should work. Thank you.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 191 at r1 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Isn't this already a List?
Done.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 194 at r1 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I feel like there's a better way to implement this that can take advantage of Kotlin ranges. See https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.ranges/-open-end-range/ and how we implemented our own
overlaps
method for a range of Instant.
ah, actually I didn't need what I did nor do I need Ranges. I simplified the implementation wdyt?
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 116 at r1 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Since this message is already in use the field should not be renumbered.
ah thanks!
Previously, uakyol wrote…
On second look, my sentence is hard to parse, let me rephrase : After these vids are generated here, they are hashed to the [0,1] interval.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @uakyol)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 194 at r1 (raw file):
Previously, uakyol wrote…
ah, actually I didn't need what I did nor do I need Ranges. I simplified the implementation wdyt?
Can still be a little clearer by having the overlapping predicate in any
rather than a transformation in zipWithNext
.
vidRangeSpecsList
.map { it.vidRange }
.sortedBy { it.start }
.zipWithNext()
.any { (first, second) -> first.overlaps(second) }
Feel free to either add an extension for overlaps
or inline the impl.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
populationSpec: SyntheticPopulationSpec, syntheticEventGroupSpec: SyntheticEventGroupSpec, random: Random = Random(0L),
Don't specify a default seed.
Suggestion:
Random
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
populationSpec: SyntheticPopulationSpec, syntheticEventGroupSpec: SyntheticEventGroupSpec, random: Random = Random(0L),
Just a double-check on determinism here. These synthetic specs must always produce the same events regardless of even what language the generator is implemented in. I'm pretty sure that means that it cannot use randomness anywhere in its implementation as a result.
Code quote:
random: Random
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 194 at r1 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Can still be a little clearer by having the overlapping predicate in
any
rather than a transformation inzipWithNext
.vidRangeSpecsList .map { it.vidRange } .sortedBy { it.start } .zipWithNext() .any { (first, second) -> first.overlaps(second) }
Feel free to either add an extension for
overlaps
or inline the impl.
Done.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Just a double-check on determinism here. These synthetic specs must always produce the same events regardless of even what language the generator is implemented in. I'm pretty sure that means that it cannot use randomness anywhere in its implementation as a result.
So, that basically is the whole purpose of this PR. This approach changes that assumption. The alternative is to generate a massive SyntheticEventGroupSpec that contains non overlapping vidranges that are sampled from a larger vid range. This new spec will be :
- Much larger
- Will only approximate uniform sampling because it samples ranges rather than individual vids
- Will itself need to be generated randomly
We will need randomness to generate realistic intersections between EDPs, it is not feasible to hand craft realistic data and vid intersection between EDPs. If we don't do it here, than we will have to do it in another spec that will be parsed and randomly generates this.
Fwiw, if you do not specify any sample_size, then there is no randomness.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Don't specify a default seed.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @kungfucraig)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
I personally think this is a no-go then. Synthetic generation was added as part of the larger effort to guarantee determinism for event queries, i.e. you can look at the spec and know exactly what events you will get.
Will itself need to be generated randomly
I think this part is fine. As far as the other two points, this is why I was suggesting that we may need a new spec proto definition that e.g. does not rely on knowing the VID ranges from the population spec.
CC @kungfucraig to see if his opinion differs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @kungfucraig and @SanjayVas)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I personally think this is a no-go then. Synthetic generation was added as part of the larger effort to guarantee determinism for event queries, i.e. you can look at the spec and know exactly what events you will get.
Will itself need to be generated randomly
I think this part is fine. As far as the other two points, this is why I was suggesting that we may need a new spec proto definition that e.g. does not rely on knowing the VID ranges from the population spec.
CC @kungfucraig to see if his opinion differs.
To get on the same page : That new spec proto definition will have use randomness in its implementation. My idea was to use this spec so that it will have randomness in its implementation if and only if the "sample_size" is set. Wdyt of documenting this more clearly in the proto and going with this?
I am suggesting this because any SyntheticEventGroupSpec without sample_size will still satisfy your point of guarantee determinism for event queries, i.e. you can look at the spec and know exactly what events you will get.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @kungfucraig)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
I am suggesting this because any SyntheticEventGroupSpec without sample_size will still satisfy your point
This isn't sufficient. Any synthetic event generation process must be deterministic always.
That new spec proto definition will have use randomness in its implementation
To clarify what I mean by spec proto definition, I mean the definition of an alternative to the current SyntheticEventGroupSpec or alternative submessages (e.g. with oneofs). The implementation of such must be deterministic. Any randomness would have to be applied in generating the spec itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @kungfucraig and @SanjayVas)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
This isn't sufficient. Any synthetic event generation process must be deterministic always.
Ah, gotcha on this
The implementation of such must be deterministic. Any randomness would have to be applied in generating the spec itself.
I don't think that is possible. In order to do that, we would need to list the actual vids that the spec will generate in the spec itself. In other words, we need to generate the vids randomly and put them in a spec and then just read the spec and output those vids. Which makes it not a spec but the data itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 3 files reviewed, 4 unresolved discussions (waiting on @kungfucraig)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
I don't think that is possible. In order to do that, we would need to list the actual vids that the spec will generate in the spec itself.
I'm not convinced yet that this is true. e.g. perhaps specifying multipliers or offsets. Regardless, the task is to have deterministic event generation from compressed specs. This means restricting what the synthetic generation is capable of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r2.
Reviewable status: 2 of 3 files reviewed, 5 unresolved discussions (waiting on @SanjayVas and @uakyol)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 116 at r3 (raw file):
// Number of vids sampled uniformly without replacement from vid_range. // If this is 0, no sampling is done and all the vids in range are taken. int32 sample_size = 3;
Per offline conversation @uakyol will add a seed and rng type to the spec proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 5 unresolved discussions (waiting on @kungfucraig and @SanjayVas)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 50 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I don't think that is possible. In order to do that, we would need to list the actual vids that the spec will generate in the spec itself.
I'm not convinced yet that this is true. e.g. perhaps specifying multipliers or offsets. Regardless, the task is to have deterministic event generation from compressed specs. This means restricting what the synthetic generation is capable of.
updated based on offline conversation
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 116 at r3 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
Per offline conversation @uakyol will add a seed and rng type to the spec proto.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r4, all commit messages.
Reviewable status: 1 of 3 files reviewed, 7 unresolved discussions (waiting on @kungfucraig and @uakyol)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 47 at r4 (raw file):
* @param populationSpec specification of the synthetic population * @param syntheticEventGroupSpec specification of the synthetic event group * @param random object to be used in sampling vids when sampleSize is provided in specification
This param no longer exists.
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 58 at r4 (raw file):
"Expecting KOTLIN_RANDOM rng type, got ${syntheticEventGroupSpec.rngType}" } val random = Random(syntheticEventGroupSpec.randomSeed)
Don't we need to reinitialize this for each VidRangeSpec? The spec doesn't have any guarantee on the order that the ranges are sampled, and changing that order would affect the random output.
Regardless, whatever we go with needs to be documented in the spec definition so there's no ambiguity.
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 138 at r4 (raw file):
repeated DateSpec date_specs = 2; // Random seed to be fed into the random number generator to sample vids.
Document that this is required if any VidRangeSpec specifies a sample_size.
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 149 at r4 (raw file):
} // Activation state of this `Account`. Output-only.
Wrong comment copy-pasted from somewhere. Also document that this is required if random_seed
is specified.
Code quote:
// Activation state of this `Account`. Output-only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @SanjayVas and @uakyol)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 121 at r8 (raw file):
Previously, uakyol wrote…
sorry, I couldn't understand this comment. vidRangeSpec.sampleSize should be smaller than vidRangeSequence.size() for there to be sampling. This if statement covers the case for when sampling is not requested. We treat if the sampleSize field is set to default value "0" then there should be no sampling.
nvm. What I was saying is not actually relevant here as I take a second look.
Should we throw an error if sampleSize is larger than size of vidRangeSequence?
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
We actually have one in the works as we need it for HMSS, but we want to roll out with a single impl for now since this is blocking 0.5.0. We don't strictly need the language-agnostic one until someone wants to implement this in another language, perhaps when we're adding conformance testing.
I'd prefer to roll out our own since a language specific one will almost certainly be deprecated in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @stevenwarejones and @uakyol)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, stevenwarejones (Steven Ware Jones) wrote…
I'd prefer to roll out our own since a language specific one will almost certainly be deprecated in the future.
I agree that a language-agnostic one would likely fully supplant a language-specific one, but we don't actually need the language-agnostic one any time soon. We'd only need for conformance testing if we plan on using sampling for those specs.
More importantly, waiting on the language-agnostic one would push 0.5.0 back a few weeks. Getting this in is critical for Origin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @kungfucraig, @SanjayVas, and @stevenwarejones)
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 121 at r8 (raw file):
Previously, stevenwarejones (Steven Ware Jones) wrote…
nvm. What I was saying is not actually relevant here as I take a second look.
Should we throw an error if sampleSize is larger than size of vidRangeSequence?
Thanks for catching, I thought I was checking this. Added check and test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @SanjayVas and @stevenwarejones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas and @uakyol)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I agree that a language-agnostic one would likely fully supplant a language-specific one, but we don't actually need the language-agnostic one any time soon. We'd only need for conformance testing if we plan on using sampling for those specs.
More importantly, waiting on the language-agnostic one would push 0.5.0 back a few weeks. Getting this in is critical for Origin.
Lets not make it part of the proto then. We can add it to the proto later once we support a platform-independent method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @stevenwarejones)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, stevenwarejones (Steven Ware Jones) wrote…
Lets not make it part of the proto then. We can add it to the proto later once we support a platform-independent method.
It's specifically because we intend to support different RNG implementations that it must be in the spec definition. Or more accurately, the fact that the RNG implementation affects the output it must be in the spec definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas and @uakyol)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
It's specifically because we intend to support different RNG implementations that it must be in the spec definition. Or more accurately, the fact that the RNG implementation affects the output it must be in the spec definition.
Then I'd prefer to roll out something very simple - roll out our own - I don't think it should take weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 2 unresolved discussions (waiting on @kungfucraig, @SanjayVas, and @stevenwarejones)
src/main/proto/wfa/measurement/api/v2alpha/event_group_metadata/testing/simulator_synthetic_data_spec.proto
line 147 at r8 (raw file):
Previously, stevenwarejones (Steven Ware Jones) wrote…
Then I'd prefer to roll out something very simple - roll out our own - I don't think it should take weeks.
updated based on offline conversation and used java.util.Random
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r10, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kungfucraig and @SanjayVas)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r10, 1 of 1 files at r12, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas and @stevenwarejones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r12, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion
src/main/kotlin/org/wfanet/measurement/loadtest/dataprovider/SyntheticDataGeneration.kt
line 135 at r12 (raw file):
return vidRangeSequence } return vidRangeSequence.shuffled(random).take(vidRangeSpec.sampleSize)
an integrator would need to know what shuffle algorithm to use as well. I'd prefer we just stick with Collections.shuffle
for that
Support sampling for synthetic event group spec
Consider the following problem for generating synthetic data for 3 Edps that represents a realistic intersection
Using the current VidRange field
Edp1 range -> (0,100)
Edp2 range -> (50,150) [At this point we have 50 intersection for Edp1 and 2]
Edp3 range -> ?? [How should we decide on this]
Here are 2 possible solutions :
This PR is the implementation of the first option