Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parameter optimisation algorithm #11

Closed
jeffersonfparil opened this issue Feb 1, 2024 · 12 comments
Closed

Improve parameter optimisation algorithm #11

jeffersonfparil opened this issue Feb 1, 2024 · 12 comments

Comments

@jeffersonfparil
Copy link
Owner

It's starting to seem like the parameter spaces are not smooth and unimodal. See the sensitivity analysis currently running in perf.md.

@jeffersonfparil
Copy link
Owner Author

We can try the search from both ends, i.e. from one end then the other with the stopping criteria remaining the same, i.e. if stepping further increases MAE then we stop and use the lowest parameter value yielding the lowest MAE. We do this from smallest to larges parameter value, then again from the largest to the smallest value. Finally, among the two "optimal" parameter values we select the one which yields the smaller MAE.

@jeffersonfparil
Copy link
Owner Author

jeffersonfparil commented Feb 1, 2024

The previous proposal does not fully appreciate the complexity in the parameter spaces! The trends in MAE across one parameter space is unlikely to be the same across all the levels of all the other parameters! This means that the current simplistic coordinate-descent-like optimisation may prove to be suboptimal.

However, we can perform a similar coordinate-descent-like optimisation where we start from each corner of the 4-dimensional parameter space! Do we zigzag from each corner? Currently, what we do is start from one corner then move along a single dimension, then once the minimum MAE is found, we fix this dimension at the "optimum" value, and jump to the next dimension, and so on. Can we efficiently lick all 4 dimensions before each step from each corner?

Also, just a hopeful sidenote: Despite these, maybe the current output is too granular, and maybe as we get more info it all smooths out and the parameter spaces become more or less smooth and unimodal?

jeffersonfparil added a commit that referenced this issue Feb 2, 2024
tested optimisation modifcations and resolcing issue #11
@jeffersonfparil
Copy link
Owner Author

My optimisation revision is resulting to worse concordances and MAE! See commit 91b023b for comparison.

@jeffersonfparil
Copy link
Owner Author

jeffersonfparil commented Feb 5, 2024

Testing optimisation changes here. Main changes:

  • optimisations of first corr, then dist, then l, and finally k
  • performing forwards then backwards pass per parameter space

@jeffersonfparil
Copy link
Owner Author

  • adding parameter space exploration from the middle forwards and backwards up to the ends see commit aec01bc

@jeffersonfparil
Copy link
Owner Author

  • going across all 4 parameter spaces twice in order of corr, dist, l and k
  • also increasing the break threshold from f64::EPSILON to 0.0001
  • see commit 8c10911

@jeffersonfparil
Copy link
Owner Author

  • drafting optimisation per locus, i.e. optimising for the minimum loci correlation and maximum pool distances per missing data point

@jeffersonfparil
Copy link
Owner Author

The most recent merge from dev:

  • optimisation per locus across min_loci_corr and max_pool_dist only
  • outputting intermediate imputation files which will be helpful for large datasets
  • double parallelisation across chunks of the genome via std::thread::scope and per chunk across all alleles/loci via Zip

@jeffersonfparil
Copy link
Owner Author

To address the glacial pace at which the new optimisation algorithm proceeds, we added an additional inner-loop-break condition, i.e. when the last 3 consecutive MAEs are the same.

@jeffersonfparil
Copy link
Owner Author

jeffersonfparil commented Feb 12, 2024

Still too slow!

  • reducing the number of steps per parameter space to explore
  • using &[T] instead of &Vec<T>
  • using format! instead of + to concatenate strings (expensive and have to be owned to store in variables)

@jeffersonfparil
Copy link
Owner Author

jeffersonfparil commented Feb 13, 2024

  • improving computation time by skipping involved imputation (and optimisation) of the last allele per locus and just using the additive inverse of the sum of the imputed allele frequencies prior to fixing over-flows --> did not improve time noticeably and so removed

@jeffersonfparil
Copy link
Owner Author

  • looking like we have the first stable version I am happy with
  • Main things I have stuck with:
    • using default parameters: min_loci_corr=0.9, max_pool_dist=0.1, min_l_loci=20, and min_k_neighbours=5
    • using optimisation per locus where we have min_l_loci and min_k_neighbours fixed and we only explore the finitely upper-bounded min_loci_corr and max_pool_dist parameter spaces ranging from 0.0 to 1.0 and stepping 0.1 at a time.
    • performing independent imputation per allele per locus and then correcting for under- and over-flows is like an extra imputation replicate which is improving our accuracy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant