Hypernetwork - Variable Dropout Structure #4288
Conversation
This uses shared options, which can be changed async. HN Release should be done with this option OFF, unless they're planning to allow others to continue training from it.
Tested with all of my own HNs.
@enn-nafnlaus Typically dropout is applied after the activation, but for LayerNorm (or other norms) the order doesn't really matter; in practice it's used both ways, and I'd say both are practical. Ideally we'd want a very general way to handle ANY HN structure, not the current approach; I'm still waiting for a brilliant idea there. In practice, for classification tasks the recommended values are between 0.2 and 0.5, but for RL or transfer learning the values vary, usually between 0 and 0.35. If you have a large dataset (possibly containing general images too), the dropout ratio does not have to be large; dropout is meant to avoid overfitting on smaller datasets. Personally I use small values close to the input and bigger values toward the output, for example [0, 0.1, 0.2, 0]: you drop only a small amount of the actual inputs, but a larger amount of the hidden-layer connections.
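A minimal sketch of what per-layer dropout after the activation could look like, assuming a hypothetical `build_mlp` helper (not the actual webui code); `layer_structure` entries are treated as multipliers of the base dimension, and `dropout_structure` follows the [0, 0.1, 0.2, 0] convention above, with the first and last entries kept at zero.

```python
import torch.nn as nn

def build_mlp(dim, layer_structure, dropout_structure, activation=nn.ReLU):
    # layer_structure:   multipliers of `dim`, e.g. [1, 2, 2, 1]
    # dropout_structure: per-layer dropout probabilities of the same length,
    #                    e.g. [0, 0.1, 0.2, 0] (first and last stay 0)
    assert len(layer_structure) == len(dropout_structure)
    layers = []
    for i in range(len(layer_structure) - 1):
        layers.append(nn.Linear(int(dim * layer_structure[i]),
                                int(dim * layer_structure[i + 1])))
        if i + 1 < len(layer_structure) - 1:      # no activation/dropout after the output layer
            layers.append(activation())           # dropout goes after the activation
            if dropout_structure[i + 1] > 0:
                layers.append(nn.Dropout(p=dropout_structure[i + 1]))
    return nn.Sequential(*layers)
```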
As per this: it seems we're applying far, far too large dropouts given the small size of our hypernetworks; the smaller the network, the lower the dropout should be. A 5-layer network might ideally use, say, a mere 1% dropout. Again, I think it's very important that we set reasonable defaults, and that these at least be suggested to the user, if not outright chosen behind the scenes.
The dropout ratio depends highly on the representation of our data. If we assume a shallowly decomposed latent space, the dropout ratio should be bigger; if we assume a very critical, sparsely decomposed latent space, we need a smaller dropout ratio. The problem is: do we know the appropriate ratio? Well, no... we can only rely on common practice. But note that the dropout rate should be lower the closer it is to the input. The only certain thing I can say is not to use a dropout ratio that is too high, like 0.5. I'll suggest [0, 0.05, 0.15, 0] as the default structure, which matches "small for input, big for hidden layers".
So, I'm trying this out now. Thanks so much for adding this. :) That said, the implementation is rather weird. Basically, we have to lead with a dummy zero that doesn't actually mean anything, to account for the fact that there is one fewer dropout site than the number of layers? Why not just omit it altogether and have the number of dropout sites be one less than the number of layers, as it actually is? I had to open up the code to figure out what was going on. I'd advise having the dropout specification be what's actually used (no leading dummy zero; that will just confuse people, as it did me), and having better documentation in the UI. Anyway, thanks a bunch for adding this! :)
Note: this PR continues from here
We were using a fixed 0.3 dropout probability; it was just an example, and should be converted to a variable value.
This patch includes support for previous HNs made with or without dropout.
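A rough sketch of what the backward-compatibility handling could look like when loading older hypernetworks; the field names (`dropout_structure`, `use_dropout`) and the fixed-0.3 fallback are assumptions based on the description above, not the actual file format.

```python
def infer_dropout_structure(hn_state, layer_structure):
    # Hypothetical loader shim for older hypernetwork files.
    if "dropout_structure" in hn_state:            # new-style file: explicit per-layer list
        return hn_state["dropout_structure"]
    if hn_state.get("use_dropout", False):         # old-style file: fixed 0.3 on hidden layers
        return [0.0] + [0.3] * (len(layer_structure) - 2) + [0.0]
    return [0.0] * len(layer_structure)            # file from before dropout support: no dropout
```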
Someone might argue that dropout is not meaningful - but no, I have a working example trained with dropout.
Prompt information:
1girl, golden hair, masterpiece, looking at viewer, school uniform
Steps: 34, Sampler: Euler a, CFG scale: 6.5, Seed: 2272754403, Size: 512x512, Model hash: 925997e9
This example does not say whether dropout is a good or bad way to do it.
UI change
Users will be able to input dropout probabilities if they want.
If the dropout structure is empty or the 'use dropout' option is False, dropout won't be applied.
The first and last values must be zero.
All values must be in [0, 1).
The dropout structure length must match the layer structure length (see the sketch below).
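A minimal sketch of the validation rules listed above; the function name and the "empty means no dropout" fallback are illustrative assumptions, not the actual UI code.

```python
def validate_dropout_structure(dropout_structure, layer_structure, use_dropout):
    # Returns a usable per-layer dropout list, or raises on invalid input.
    if not use_dropout or not dropout_structure:
        return [0.0] * len(layer_structure)          # dropout disabled entirely
    if len(dropout_structure) != len(layer_structure):
        raise ValueError("Dropout structure length must match layer structure length.")
    if dropout_structure[0] != 0 or dropout_structure[-1] != 0:
        raise ValueError("First and last dropout values must be zero.")
    if any(not (0 <= p < 1) for p in dropout_structure):
        raise ValueError("All dropout values must be in [0, 1).")
    return list(dropout_structure)
```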