Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CarveMe doesn’t satisfy the original protein complex definitions during the building of GPRs #182

Open
lazzarigioele opened this issue Jun 15, 2023 · 0 comments

Comments

@lazzarigioele
Copy link

Using carveme v1.5.2.

My opinion reading your code is that there is no real check whether a particular protein complex, as it is defined in a particular model for a particular reaction, is really satisfied or not. This in my opinion can lead to wrong GPRs.

I report here below your code contained in carveme/reconstruction/scoring.py , converting the table 'gene_scores' in 'protein_scores', and finally converting 'protein_scores' in 'reaction_scores'. Then I try to give a practical example.

From gene to protein scores:

def merge_subunits(genes):
    genes = genes.dropna()
    if len(genes) == 0:
        return None
    else:
        protein = ' and '.join(sorted(genes))
        if len(genes) > 1:
            return '(' + protein + ')'
        else:
            return protein

def merge_subunit_scores(scores):
    return scores.fillna(0).mean()

protein_scores = gene_scores.groupby(['protein', 'reaction', 'model'], as_index=False) \
        .agg({'query_gene': merge_subunits, 'score': merge_subunit_scores})
protein_scores.rename(columns={'query_gene': 'GPR'}, inplace=True)

From protein to reaction scores:

def merge_proteins(proteins):
    proteins = set(proteins.dropna())
    if not proteins:
        return None
    gpr_str = ' or '.join(sorted(proteins))
    if len(proteins) > 1:
        return '(' + gpr_str + ')'
    else:
        return gpr_str

def merge_protein_scores(scores):
    return scores.max(skipna=True)

reaction_scores = protein_scores.groupby(['reaction'], as_index=False) \
        .agg({'GPR': merge_proteins, 'score': merge_protein_scores}).dropna()
avg_score = reaction_scores.query('score > 0')['score'].median()
reaction_scores['normalized_score'] = (reaction_scores['score'] / avg_score).apply(lambda x: round(x, 1))

Suppose now this is my 'gene_scores' table:

query_gene BiGG_gene score gene protein reaction model
NaN STM_v1_0.STM0970 NaN G_STM0970 P_STM0970+STM0973 R_PFL STM_v1_0
gene_7500 STM_v1_0.STM0843 784.0 G_STM0843 P_STM0843+STM0844 R_PFL STM_v1_0
gene_18818 STM_v1_0.STM0973 864.0 G_STM0973 P_STM0970+STM0973 R_PFL STM_v1_0
NaN STM_v1_0.STM0970 NaN G_STM0970 P_STM0970+STM3241 R_PFL STM_v1_0
gene_18818 STM_v1_0.STM3241 846.0 G_STM3241 P_STM0970+STM3241 R_PFL STM_v1_0
NaN STM_v1_0.STM4114 NaN G_STM4114 P_STM4114+STM4115 R_PFL STM_v1_0
NaN STM_v1_0.STM4115 NaN G_STM4115 P_STM4114+STM4115 R_PFL STM_v1_0
gene_13306 STM_v1_0.STM0844 156.0 G_STM0844 P_STM0843+STM0844 R_PFL STM_v1_0

Applying the above code would result in this 'protein_score' table:

protein reaction model GPR score
P_STM0843+STM0844 R_PFL STM_v1_0 (gene_13306 and gene_7500) 470.0
P_STM0970+STM0973 R_PFL STM_v1_0 gene_18818 432.0
P_STM0970+STM3241 R_PFL STM_v1_0 gene_18818 423.0
P_STM4114+STM4115 R_PFL STM_v1_0 None 0.0

And finally in this 'reaction_score' table:

reaction GPR score normalized_score
R_PFL ((gene_13306 and gene_7500) or gene_18818) 470.0 1.0

As you can read , the final GPR is ((gene_13306 and gene_7500) or gene_18818).
The member “(gene_13306 and gene_7500)” is correct, as the both the components of the complex P_STM0843+STM0844 were well matched by Diamond.

Instead, the member “gene_18818” should not appear in the final GPR in my opinion, because neither the complex P_STM0970+STM0973 nor the complex P_STM0970+STM3241 were satisfied, since Diamond didn’t really catch the gene STM_v1_0.STM0970 (score = Nan).

This simplified example pretend to work with a 'gene_scores' table having just 1 model with 1 reaction, anyway I think it is enough to highlight the issue.
With a real-life 'gene_scores' table, there is the chance that other models containing the same reaction can make the final GPR actually correct, balancing the errors. But it is just a chance...

Maybe it could be useful to have a dedicated option, like carve --strictgpr, if the user wants the original protein complex definitions to be strictly satisfied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant