
Pretrained model for bulk low-quality data #236

Open
avantikalal opened this issue Apr 5, 2021 · 6 comments

@avantikalal

Hi @avantikalal, in cases where noisy samples have coverage >= that of the clean samples, should users always forgo training and use your pretrained model, nvidia:atac_bulk_lowqual_20m_20m?

For example:
[image: example coverage comparison]

If the pretrained model is not successful in reducing noise, are there training parameters that should be considered when constructing a custom model?

Thanks for the help! AtacWorks looks like a game changer.

Originally posted by @umasstr in #221 (comment)

avantikalal commented Apr 5, 2021

Hi @umasstr, for best results we recommend training your own model if you have matched low- and high-quality data available. The noisy and clean data can have any coverage; what matters is that the noisy data used for training has coverage similar to the noisy data to which you intend to apply the model.
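
As a rough way to compare coverage between two noisy datasets, something along these lines works (a generic samtools sketch, not an AtacWorks command; the file names are illustrative):

# mean per-base depth of each noisy BAM (samtools depth prints chrom, pos, depth)
$ samtools depth -a noisy_train.bam | awk '{sum += $3} END {print sum/NR}'
$ samtools depth -a noisy_target.bam | awk '{sum += $3} END {print sum/NR}'

If the two means differ substantially, a model trained on the first is unlikely to transfer well to the second.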

umasstr commented Apr 5, 2021

Thanks for following up. Consistent with your comment above, I found the pretrained model to be incompatible with my dataset and with some sample ENCODE datasets. The custom-trained model worked well, but there's a constant, low-level background in the output (green below).

Purple: original (MACS, as directed in #221)
Green: custom model
Orange: atac_bulk_lowqual_20m_20m

[image: genome browser view of the three tracks]

For reasons unknown, it looks like your pipeline requires a non-zero value at every coordinate: more than 25M intervals have a signal of 1.0, making up the majority of the bedGraph.

$ cut -f4 2296_infer.track.bedGraph | sort | uniq -c | head -1

26356100 1.0

Unfortunately, simply subtracting 1 from all signal values doesn't produce an ideal track, though it could be useful for analytical purposes. Any idea how I can get rid of this artifact produced by custom models? I can file a new issue if you'd like.
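
For reference, the subtraction described above can be done along these lines (the output file name is illustrative):

# subtract the constant 1.0 baseline and drop intervals that end up at zero
$ awk 'BEGIN{OFS="\t"} {v = $4 - 1; if (v > 0) print $1, $2, $3, v}' 2296_infer.track.bedGraph > 2296_infer.sub1.bedGraph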

avantikalal commented Apr 5, 2021

Great to see the custom model working well overall, though I agree the low-level background is a problem. The model shouldn't require a nonzero output at every coordinate. Could you share the config file generated by your model training run?
Also, what is the downstream task? Are you primarily interested in using the denoised track, the peak calls, or both?

umasstr commented Apr 5, 2021

Indeed, the output is pretty impressive, especially if I set the track bounds to hide the 1.0 background:

[image: denoised track with adjusted bounds]

Here is a link to the training config

@avantikalal

One thing that might help is to train your model using the Poisson loss function for regression instead of the MSE/Pearson loss functions. This has given us better regression results and will become the default regression loss in the future.

To train a model using Poisson loss, you can set --mse_loss equal to 0, --pearson_loss equal to 0, and --poisson_loss equal to 1.
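
For example, a sketch of the full invocation (the entry point and config flag here are assumptions about your setup; only the three loss flags come from the suggestion above):

$ python main.py train --config your_train_config.yaml --mse_loss 0 --pearson_loss 0 --poisson_loss 1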

If you do try this, please let us know if it gives better results!

umasstr commented May 12, 2021

Hi @avantikalal, I tried the parameters above, but the results did not look good.

[image: results with Poisson loss]

I am a little concerned that I have to retrain a model for each noisy BigWig: none of the prebuilt models work with my data, and none of my samples' models work on the others.
