
Add continuous value (pLDDT and PPL) support for Curriculum Learning #58

Merged 7 commits into OpenBioML:main on Jan 11, 2024

Conversation

@Leo-T-Zang (Contributor) commented Oct 8, 2023

  1. Enhanced the curriculum learning strategy to handle continuous values (such as PPL and pLDDT) in addition to categories; we noticed the existing mechanism actually supports continuous values.
  2. Testing script:
    • Adapted the existing sequence-length test script to validate the new PPL strategy.

Both pLDDT and PPL values should be pre-computed and saved within the dataset.
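[Editorial note] A minimal sketch of this pre-computation step, assuming a Hugging Face `datasets.Dataset` with a `sequence` column; `compute_ppl` and `compute_plddt` below are hypothetical placeholders for the actual model calls, not functions from this repo:

```python
from datasets import Dataset

def compute_ppl(seq: str) -> float:
    # Placeholder: in practice, score the sequence with a pretrained LM
    # (e.g., ESM2 or Tranception)
    return float(len(seq))  # dummy value for illustration only

def compute_plddt(seq: str) -> float:
    # Placeholder: in practice, predict the structure with ESMFold and
    # average the per-residue confidences
    return 50.0  # dummy value for illustration only

ds = Dataset.from_dict({"sequence": ["MKTAYIAKQRQISFVK", "GSHMVLSPADKTNVKA"]})
# Add the two CL columns once, up front, so training only reads them
ds = ds.map(lambda ex: {"ppl": compute_ppl(ex["sequence"]),
                        "plddt": compute_plddt(ex["sequence"])})
ds.save_to_disk("precomputed_cl_dataset")
```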

@pascalnotin (Collaborator) left a comment:

Hi @Leo-T-Zang - nice work!
A couple of questions/comments before approving:

  1. What is the role of the new fields in the yaml files (e.g., ppl_category, plddt_category)? They do not seem to be used in the rest of the code.
  2. We probably want to generate a sequence of amino acids here instead :)
  3. I would add two small scripts in dataset.py to compute the perplexity metric based on ESM2, and the pLDDT based on ESMFold, with a separate test to ensure they return correct values (a rough sketch follows below). In practice we will then run these scripts on the training datasets to pre-compute the values needed by the CL scheme you implemented.
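[Editorial note] A rough sketch of what such a perplexity helper could look like, assuming the Hugging Face transformers port of ESM2; the checkpoint name and the masked-scoring approach are assumptions, not this repo's code:

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

@torch.no_grad()
def esm2_pseudo_perplexity(sequence: str) -> float:
    """Mask each residue in turn, score it with the masked LM, and
    exponentiate the mean negative log-likelihood."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    nlls = []
    for i in range(1, ids.shape[1] - 1):  # skip the <cls>/<eos> special tokens
        masked = ids.clone()
        target = masked[0, i].item()
        masked[0, i] = tokenizer.mask_token_id
        logits = model(input_ids=masked).logits
        nlls.append(-torch.log_softmax(logits[0, i], dim=-1)[target].item())
    return float(torch.exp(torch.tensor(nlls).mean()))

# pLDDT could be obtained analogously via the transformers port of ESMFold
# (facebook/esmfold_v1), averaging the per-residue confidences it returns;
# this is an assumption about the API, to be checked against the docs.
```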

@pascalnotin (Collaborator) commented:

Another comment as I'm re-reading the test routine: we should check that the variable of interest (e.g., ppl) is ordered throughout training across mini-batches (not within a mini-batch).
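[Editorial note] A minimal sketch of such a check, assuming each mini-batch exposes the CL column as a list of values; the function and column names are illustrative, not this repo's API:

```python
def assert_ordered_across_batches(batches, column="ppl"):
    """Check that the CL variable is ordered across mini-batches: batch-level
    means must be non-decreasing, while order within a batch is unconstrained."""
    batch_means = [sum(batch[column]) / len(batch[column]) for batch in batches]
    assert all(a <= b for a, b in zip(batch_means, batch_means[1:])), \
        f"CL column '{column}' is not ordered across mini-batches"
```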

@talkhanz (Contributor) commented Oct 11, 2023

@pascalnotin I have a question. My PR computed length inside the batch_set_curriculum_learning_column function, but this PR assumes the CL column has already been precomputed and stored in the dataset before batch_set_curriculum_learning_column is called.

For example, my sequence-length strategy computes the negative of the length inside the batch_set function:

```python
result[curriculum_learning_column_name] = [-len(x) for x in result[input_column_name]]
```

whereas Leo's PR relies on the perplexity/pLDDT values already being precomputed. The line below indexes the ppl/plddt column under the assumption that it has already been computed:

```python
result[curriculum_learning_column_name] = [-x for x in result[strategy]]
```

To bring my PR in line with Leo's, we may need to make this assumption (the CL column is precomputed) apply to all strategies.

How do you want to go about it?

  • Assume the column is precomputed (S1), or
  • Do the precomputation of the CL column within the batch_set function, as is currently the case with strategy='sequence_length' (S2); a sketch follows after this list.
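[Editorial note] To make the trade-off concrete, a hypothetical sketch of S2, where the metric is computed inside the function for every strategy; `compute_metric` is a stand-in for the actual PPL/pLDDT routines, not this repo's API:

```python
def batch_set_curriculum_learning_column(result, input_column_name,
                                         curriculum_learning_column_name,
                                         strategy, compute_metric=None):
    if strategy == "sequence_length":
        # Cheap metric derived directly from the input, computed on the fly
        result[curriculum_learning_column_name] = [
            -len(x) for x in result[input_column_name]]
    else:
        # S2 for ppl/plddt: compute via an external model call (e.g., ESM2,
        # ESMFold) instead of assuming a precomputed column
        assert compute_metric is not None, \
            "pass a PPL/pLDDT scoring function for non-length strategies"
        result[curriculum_learning_column_name] = [
            -compute_metric(x) for x in result[input_column_name]]
    return result
```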

Personally, it makes sense for the time being to go with S2: it makes the computation explicit, whereas assuming the column already exists could cause reproducibility issues, even though the latter is computationally cheaper. But I'm all ears for what both @pascalnotin and @Leo-T-Zang have to say.

With S2 we will currently recompute the metric every time; it would make sense to compute it only once and store the precomputed dataset somewhere like HF/Dropbox/Google Drive, so we won't need to recompute these metrics on every run.

Also, my apologies for generating random DNA sequences rather than amino acids!

@talkhanz (Contributor) replied:

> Another comment as I'm re-reading the test routine: we should check that the variable of interest (e.g., ppl) is ordered throughout training across mini-batches (not within a mini-batch).

I can make this change in a subsequent PR

@pascalnotin (Collaborator) commented:

That's a good point. I assumed things would be different for sequence length vs. other CL strategies because the input length can be computed straightforwardly, whereas other strategies are more involved (i.e., they call separate models such as Tranception to compute the perplexity or ESMFold to compute the pLDDT). But since these more complex strategies can also be handled via the batch_set function (as per your point S2), it's probably easier to go with that.

@pascalnotin (Collaborator) left a comment:

Merging this PR based on our pre-Christmas discussion: we will pre-compute the values to be used for Curriculum Learning and store them together with the overarching cluster mapping file. This mapping file will thus contain the cluster name, the cluster representative sequence, a pointer to the on-disk location where all sequences in that cluster are stored, and the pLDDT and PPL of the cluster representative sequence computed with a pretrained model (e.g., ESM2 or Tranception).
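[Editorial note] For illustration, one row of such a mapping file might look as follows; the field names and values are assumptions, not the final schema:

```python
mapping_row = {
    "cluster_name": "cluster_00042",
    "representative_sequence": "MKTAYIAKQRQISFVK",
    "sequences_path": "/data/clusters/cluster_00042.fasta",  # all member sequences
    "plddt": 71.3,  # pLDDT of the representative (e.g., from ESMFold)
    "ppl": 8.42,    # perplexity of the representative (e.g., ESM2 or Tranception)
}
```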

@pascalnotin merged commit 31dcd8f into OpenBioML:main on Jan 11, 2024.
@pascalnotin (Collaborator) commented:

@Leo-T-Zang @talkhanz @jamaliki for reference ^^
