Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph response formatting fails when no pipeline data is in the database #367

Open
1 task done
alyssadai opened this issue Oct 25, 2024 · 1 comment
Open
1 task done
Labels
bug:functional Functional defects resulting from feature changes. Epic A collection of issues that are related by topic and can be addressed together.

Comments

@alyssadai
Copy link
Contributor

alyssadai commented Oct 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Expected Behavior

No response

Current Behavior

When submitting any cohort query to an n-API v0.4.0

  • in non-aggregated mode,
  • with a graph with old JSONLD data (no pipeline metadata),

the query results in an Internal server error.

The same issue does not occur if the n-API is run in aggregated mode, or for a graph database containing at least one subject with pipeline metadata.

Error message

Error in the n-API container logs:

TypeError: DataFrame.reset_index() got an unexpected keyword argument 'name'

This error is misleading, as it gives the impression that the relevant section of code:

api/app/api/crud.py

Lines 222 to 232 in 9913736

session_completed_pipeline_data = (
pipeline_grouped_data.groupby(
["sub_id", "session_id", "session_type"]
)
.apply(
lambda x: dict(
zip(x["pipeline_name"], x["pipeline_version"])
)
)
.reset_index(name="completed_pipelines")
)

is using a non-existent or deprecated argument / has some inherent syntax error.

However, what's actually happening is that the code assumes reset_index() is operating on a pd.Series (which DOES have the name argument). But something is going wrong in the logic for session_completed_pipeline_data such that it's producing a pd.DataFrame instead (which DOESN'T have the name argument for reset_index()).

Source of problem

Earlier in the code, when pipeline_grouped_data is constructed:

api/app/api/crud.py

Lines 202 to 220 in 9913736

pipeline_grouped_data = (
group.groupby(
[
"sub_id",
"session_id",
"session_type",
"pipeline_name",
],
dropna=True,
)
.agg(
{
"pipeline_version": lambda x: list(
x.dropna().unique()
)
}
)
.reset_index()
)

we are dropping NaNs during the groupby, meaning that when there are no pipeline names in the data, we get an empty dataframe like:

Empty DataFrame
Columns: [sub_id, session_id, session_type, pipeline_name, pipeline_version]
Index: []

as a result, when we then try to run groupby again on this object to construct session_completed_pipeline_data, that has no effect and still returns a pd.DataFrame, causing the unexpected keyword error when we then try to run reset_index() on it.

If we instead set dropna=False in the groupby when constructing pipeline_grouped_data, there is no longer an error, but the resulting completed_pipelines field for single subject-session looks like this in the response:

        "completed_pipelines": {
          "null": []
        }

Environment

  • OS:
  • Python/Node version:

How to reproduce

No response

Anything else?

Some considerations

Why this wasn't caught by our tests

  • This section of code is not currently covered by any tests - i.e., we don't assert over the new pipeline fields in an n-API response based on a non-aggregated (mocked or real) result from the graph (i.e., checking that the graph response is re-formatted correctly on the API side)
  • We do have a mocked response from the graph that we currently use in some tests of the API response formatting:

    api/tests/conftest.py

    Lines 111 to 168 in 9913736

    @pytest.fixture
    def mock_post_query_to_graph():
    """Mock post_query_to_graph function that returns toy aggregate data containing a dataset with no modalities for testing."""
    def mockreturn(query, timeout=5.0):
    return {
    "head": {
    "vars": [
    "dataset_uuid",
    "dataset_name",
    "dataset_portal_uri",
    "sub_id",
    "image_modal",
    "pipeline_name",
    "pipeline_version",
    ]
    },
    "results": {
    "bindings": [
    {
    "dataset_uuid": {
    "type": "uri",
    "value": "http://neurobagel.org/vocab/12345",
    },
    "dataset_portal_uri": {
    "type": "literal",
    "value": "https://rpq-qpn.ca/en/researchers-section/databases/",
    },
    "sub_id": {"type": "literal", "value": "sub-ON95534"},
    "dataset_name": {"type": "literal", "value": "QPN"},
    },
    {
    "dataset_uuid": {
    "type": "uri",
    "value": "http://neurobagel.org/vocab/12345",
    },
    "dataset_portal_uri": {
    "type": "literal",
    "value": "https://rpq-qpn.ca/en/researchers-section/databases/",
    },
    "sub_id": {"type": "literal", "value": "sub-ON95535"},
    "dataset_name": {"type": "literal", "value": "QPN"},
    "image_modal": {
    "type": "uri",
    "value": "http://purl.org/nidash/nidm#T1Weighted",
    },
    "pipeline_version": {
    "type": "literal",
    "value": "7.3.2",
    },
    "pipeline_name": {
    "type": "uri",
    "value": "https://github.com/nipoppy/pipeline-catalog/tree/main/processing/freesurfer",
    },
    },
    ]
    },
    }
    • However, even if we did use this, it would not have caught the error because:
      a. It assumes a dataset with info containing pipeline metadata
      b. It is an aggregated response

To avoid similar issues

  • We need tests that assert over the API response given all possible graph database types allowed by Neurobagel, i.e. datasets with:
    i. ONLY phenotypic data
    ii. ONLY phenotypic + bids data (no derivatives)
    iii. ONLY phenotypic + derivatives data (no BIDS)
    iv. Phenotypic + bids + derivatives
  • We may want to simplify our existing code for reformatting graph responses so that they are more explicit, rather than complex lambda funcs or chaining multiple methods together (harder to debug)
@alyssadai alyssadai added the type:bug Defects in shipped code and fixes for those defects label Oct 25, 2024
@alyssadai alyssadai added bug:functional Functional defects resulting from feature changes. and removed type:bug Defects in shipped code and fixes for those defects labels Oct 25, 2024
@alyssadai alyssadai moved this to Backlog in Neurobagel Oct 25, 2024
@alyssadai
Copy link
Contributor Author

Also possibly related to #303

@surchs surchs added the Epic A collection of issues that are related by topic and can be addressed together. label Oct 28, 2024
@surchs surchs removed the status in Neurobagel Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug:functional Functional defects resulting from feature changes. Epic A collection of issues that are related by topic and can be addressed together.
Projects
Status: No status
Development

No branches or pull requests

2 participants