Training a VQ-VAE for DNA-sequences for stable diffusion #16

Closed
lucapinello opened this issue Oct 15, 2022 · 15 comments · Fixed by #101

@lucapinello
Collaborator

lucapinello commented Oct 15, 2022

Current notebook here: https://github.com/pinellolab/DNA-Diffusion/blob/latent-space-representation/vq_vae_diffusion.ipynb

@lucapinello lucapinello changed the title Training a VQ-VAE for DNA-sequences for stable diffusion NOTEBOOK-PROTOTYPING:Training a VQ-VAE for DNA-sequences for stable diffusion Oct 15, 2022
@lucapinello lucapinello changed the title NOTEBOOK-PROTOTYPING:Training a VQ-VAE for DNA-sequences for stable diffusion Training a VQ-VAE for DNA-sequences for stable diffusion Oct 15, 2022
@lucapinello
Collaborator Author

@sg134 can you write here where you are with this and how people can contribute/help you?

@mihirneal
Collaborator

@lucapinello Would love to work on this; however, I don't understand why we need to work with VQ-VAE. Shouldn't we directly prototype with DDPMs?

@lucapinello
Collaborator Author

The idea is to derive a good embedding for DNA sequences so we can later explore stable diffusion. Right now we are diffusing directly on the one-hot encoding of the DNA sequences.
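
As a rough illustration of what a one-hot encoding of a DNA sequence looks like, here is a minimal sketch. The helper names, the A/C/G/T channel order, and the choice to map unknown bases such as N to an all-zero row are assumptions for illustration, not taken from the notebook.

```python
# Hypothetical helpers: names and the A/C/G/T channel order are assumptions.
ALPHABET = "ACGT"


def one_hot_encode(seq):
    """Return one row of length 4 per nucleotide.

    Bases outside ALPHABET (e.g. N) become an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


def one_hot_decode(rows):
    """Invert one_hot_encode by taking the argmax of each row.

    All-zero rows decode to N.
    """
    out = []
    for row in rows:
        if any(row):
            out.append(ALPHABET[row.index(max(row))])
        else:
            out.append("N")
    return "".join(out)
```

Because the decoder takes an argmax, it can also be applied to real-valued rows, which is one simple way to turn a diffusion or decoder output back into a nucleotide string.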

@sg134

sg134 commented Oct 17, 2022

Hi, just saw this (sorry). As Luca mentioned, we hope to represent the DNA sequences in a smaller latent space and pursue latent diffusion. To that end, if you have any other model suggestions for encoding the sequences into a representation (another VAE variant, for example), feel free to suggest and implement them; we don't know yet whether VQ-VAE is the best model for this dataset. I started with this model because it was used in the DALL-E paper. Currently, some of the next steps planned for the VQ-VAE are:

  1. Modify the architecture to improve the reconstruction accuracy of nucleotides in the dataset
  2. This is the big one that I've been stuck on: "interpreting" the codebook embeddings. Does the codebook carry any information about TF binding motifs or key features that differentiate binding patterns across cell types?
  3. Down the line, we'd also probably need to clean up the code and modify it a bit so that it's easy to combine this code with the diffusion code into a unified pipeline.
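
To make items 1 and 2 concrete, here is a minimal sketch of the two quantities involved: the nearest-codebook lookup at the core of a VQ-VAE, and a per-nucleotide reconstruction accuracy. The function names and the plain-list data layout are assumptions for illustration; the notebook's actual implementation may differ.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    "Nearest" is measured by squared Euclidean distance.
    """
    indices = []
    for z in latents:
        dists = [sum((zi - ei) ** 2 for zi, ei in zip(z, e)) for e in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def nucleotide_accuracy(original, reconstructed):
    """Fraction of positions where the reconstructed base matches the original."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    if not original:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(original, reconstructed))
    return matches / len(original)
```

The indices returned by `quantize` are the discrete codes that item 2 asks how to interpret, and `nucleotide_accuracy` is one simple number to track while changing the architecture for item 1.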

@lucapinello @LucasSilvaFerreira Is there anything else to include or clarify?

@mihirneal
Collaborator

gotcha. I'd like to work on this issue. Can you assign it to me?

@LucasSilvaFerreira
Collaborator

@mihirneal and @sg134 I would recommend that you guys create a subgroup to explore it together. @sg134 already has some code, and it would be nice if he can guide you through it. I think it will be nice to have a (latent) stable diffusion model working on these sequences.

@mateibejan1
Collaborator

@sg134 let me know if I can help with the VQ-VAE.

@sg134

sg134 commented Oct 19, 2022

Hi @mihirneal & @mateibejan1, is it possible that we can allocate a few minutes during the sprint meeting to discuss the VQ-VAE code and next steps for others who are interested as well?

@mihirneal
Collaborator

Yeah, that’s what I had in mind as well.

@mateibejan1
Collaborator

Sure, I'll put together a meeting plan. We'll start with a retrospective on what was done in sprint 1, then talk about current tasks, and finally what we'll do next. Does that work for your schedule, @sg134?

@noahweber1
Collaborator

@sg134 please contact me when you see this.

[email protected]

thanks

@sg134

sg134 commented Nov 30, 2022

@noahweber1 messaged you on Discord.

@noahweber1
Collaborator

Summary of what we agreed upon and the next steps:

  1. I take over for a couple of weeks until Sameer comes and we close the task off.
  2. I refactor and clean up the code.
  3. Any improvements in accuracy I can squeeze out.
  4. Any adjustments to the architecture.
  5. Explainability of the inference, i.e. checking that the latent representations actually make sense when inspected manually.
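
One possible starting point for the manual inspection in item 5 is to group the sequence windows that land on each codebook index and look at the most frequent ones. This is only a sketch under a strong assumption: each latent position is treated as covering a fixed, non-overlapping window of `window` nucleotides, which ignores the wider receptive field of a real convolutional encoder. The function name and arguments are hypothetical.

```python
from collections import Counter, defaultdict


def kmers_per_code(sequences, code_indices, window):
    """Count which sequence windows are assigned to each codebook index.

    sequences:    list of DNA strings
    code_indices: one list of codebook indices per sequence, one per latent position
    window:       assumed number of nucleotides covered by each latent position
    """
    table = defaultdict(Counter)
    for seq, codes in zip(sequences, code_indices):
        for pos, code in enumerate(codes):
            kmer = seq[pos * window:(pos + 1) * window]
            if len(kmer) == window:  # skip truncated windows at the sequence end
                table[code][kmer] += 1
    return table
```

Calling `table[code].most_common(10)` for each code then gives a short list of k-mers to compare by eye against known TF binding motifs.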

@github-actions

github-actions bot commented Mar 3, 2023

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Mar 3, 2023
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Mar 11, 2023
@github-project-automation github-project-automation bot moved this from Notebook Prototyping to Done in DNA-diffusion Mar 11, 2023
@cameronraysmith cameronraysmith moved this from Done to Archive in DNA-diffusion Mar 11, 2023
@cameronraysmith cameronraysmith added this to the 0.0.1 milestone Mar 12, 2023
@github-actions github-actions bot removed the stale label Mar 13, 2023
@github-project-automation github-project-automation bot moved this from Archive to Done in DNA-diffusion Mar 21, 2023
Projects
Status: Done

7 participants