Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Combined query to fetch dimension values #1487

Merged
merged 6 commits into from
Jun 21, 2024

Conversation

bprusinowski
Copy link
Collaborator

@bprusinowski bprusinowski commented May 2, 2024

Closes #1470

This PR is an exploration of the feasibility to fetch values for multiple dimensions in a single SPARQL query.

Constrains

  • We need to be able to filter each dimension individually, due to cascading filters behavior. Could be achieved with FILTER(IF(?dimensionIri = <A>, ?dimensionIri = <a> && ?dimensionIri = <a>, ?dimensionIri))
  • We need to be able to unversion dimension values per individual dimension (can't unversion values outside of individual SELECT queries, as sometimes schema:sameAs is not used to indicate unversioned values, this is only the case when a dimension is versioned).

we need to combine individual queries into a big one (at least that's my current assumption).

Copy link

vercel bot commented May 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
visualization-tool ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 14, 2024 6:35am

@Rdataflow
Copy link
Contributor

@bprusinowski can you share an example of a less performing query? thank you

@bprusinowski
Copy link
Collaborator Author

bprusinowski commented May 2, 2024

@Rdataflow sure, this is a query to fetch dimension values for every dimension in the Bathing water quality cube (takes ~12s). Compare this to queries fired on PROD: https://visualize.admin.ch/en/create/lmbY5klvYJAm?dataSource=Prod&flag__debug=true&flag__server-side-cache.disable=true (takes 1.4s for 5 queries, one per each dimension, fired in parallel).

The above example is for a non-filtered query (fetches all dimension values), but we also need to be able to filter and unversion each dimension separately (see PR description). This is why I think we need to combine individual queries like this – let me know if it's enough information or if you see some other direction to try.

@Rdataflow
Copy link
Contributor

@bprusinowski it seems those OPTIONALs for schema:version have a negative impact. You may try to further parallelize using a query pattern of

{ 
    # versioned case
    ...
    ?dimension schema:version ?version 
    ... 
} UNION { 
    # nonversioned case
    ... 
    FILTER NOT EXISTS { ?dimension schema:version ?version } 
    ... 
}

i.e. like https://s.zazuko.com/HenZ6K

HTH for self guiding the next steps 👍

@bprusinowski
Copy link
Collaborator Author

bprusinowski commented May 6, 2024

Hey @Rdataflow, I optimized the query based on your suggestion and fixed some issues, it seems to fetch the correct values now.

However it looks like it's less performing than the current approach of parallel queries, both for smaller and bigger cubes. You can see some examples for Photovoltaikanlagen (TEST, PR) and NFI: Change (TEST, PR) cubes (both using INT data source to not rely on cached endpoint and with disabled server-side cache).

It looks like the new approach is ~2x slower than what we currently have on TEST. I will take a deeper look to see if there's something obvious to optimize, but it would be great if you could also take a look in case you have a bit of time.

Let me know what you about the whole direction of combining the queries 👀

@Rdataflow
Copy link
Contributor

@bprusinowski unfortunately those /create/ links won't endure... do you have some permalinks maybe?

@Rdataflow
Copy link
Contributor

Rdataflow commented May 6, 2024

i.e. those /create/new?from=<published> would work IIUC

@bprusinowski
Copy link
Collaborator Author

@Rdataflow of course – I updated the links, should be ok now (I blame it on Monday morning 🤦 😅)

@Rdataflow
Copy link
Contributor

@bprusinowski what happens if you drop #pragma evaluate on everywhere? with the query in the current form this might help. curious to see...

@bprusinowski
Copy link
Collaborator Author

@Rdataflow I did some tests to fire the query "for the first time", to avoid some apparent caching on LINDAS side (did this by modifying the query to e.g. remove retrieval of color, so the query looks like new); it still looks like #pragma improves the situation a bit (1.8s for pragma vs 2.2s without) for the Traffic noise pollution cube. The timings change if I "comment out" other properties, but the delta seems to always be in favor of #pragma...

@Rdataflow
Copy link
Contributor

Rdataflow commented May 6, 2024

@bprusinowski the query pattern still would profit of some minor optimization steps...
the 3rd and 4th UNION block of each dimension may be optimized using a inner SELECT to prevent ?obs ?dimensionIri ?versionedValue from evaluation in case of empty due to FILTER NOT EXISTS { ... } returning empty set

this might look like i.e.

# 3rd UNION block on a dimension
  {
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          ?dimension schema:version ?version .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
      ?observation ?dimensionIri ?versionedValue .
      ?versionedValue schema:sameAs ?unversionedValue .
    }
  }
  UNION
  {
# 4th UNION block on a dimension
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          FILTER NOT EXISTS {
            ?dimension schema:version ?version .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      ?observation ?dimensionIri ?versionedValue .
      BIND(?versionedValue AS ?unversionedValue)
    }
  }

then as the innermost SELECT ?observation becomes emtpy (true for many dimensions) the UNION blocks are faster now 😄

nb: or in case you prefer the #pragmas transfer those pragmas to the inner SELECT ?observation then

@bprusinowski
Copy link
Collaborator Author

bprusinowski commented May 7, 2024

Thanks again @Rdataflow for optimizing the query 💯

Unfortunately it looks that it's still significantly less performing that the ones we have on TEST / INT / PROD. See this combined query that takes 12-13s – the same cube on TEST takes ~7-8s to load values for every dimension when fired separately.

I think we might reach out to Zazuko, seeing that the approach we currently try doesn't seem to improve things – does it sound good? Maybe I miss some additional context, but knowing that we'll use a cached endpoint that will already offload a load of computing power from Stardog, I am not sure if it's worth it to sacrifice 50% of performance (assuming is scales linearly 😅 – but even if not, an overhead of 4s for NFI cubes if noticeable) just to send a smaller number of queries.

Let me know what you think @Rdataflow :)

cc @sosiology @adintegra

@Rdataflow
Copy link
Contributor

Rdataflow commented May 26, 2024

@bprusinowski the proposed query obviously misses to constrain the dimensionIri to the relevant dimension only - therefore it suffers heavily degraded performance

see comments inline
https://s.zazuko.com/23hyB45

nb: regarding perf on TEST see VSHN SBAR-1122 and comment inline

cc @sosiology @adintegra

@bprusinowski bprusinowski marked this pull request as ready for review June 14, 2024 07:05
@bprusinowski bprusinowski requested a review from ptbrowne as a code owner June 14, 2024 07:05
@bprusinowski bprusinowski merged commit a7d9a19 into main Jun 21, 2024
5 of 7 checks passed
@bprusinowski bprusinowski deleted the refactor/combined-components-values-query branch June 21, 2024 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework Components query to query multiple dimensions at once
2 participants