refactor: Combined query to fetch dimension values #1487
@bprusinowski can you share an example of a slower query? Thank you |
@Rdataflow sure, this is a query to fetch dimension values for every dimension in the Bathing water quality cube (takes ~12s). Compare this to the queries fired on PROD: https://visualize.admin.ch/en/create/lmbY5klvYJAm?dataSource=Prod&flag__debug=true&flag__server-side-cache.disable=true (takes 1.4s for 5 queries, one per dimension, fired in parallel). The above example is for a non-filtered query (it fetches all dimension values), but we also need to be able to filter and unversion each dimension separately (see PR description). This is why I think we need to combine individual queries like this – let me know if that's enough information or if you see some other direction to try. |
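For readers following the thread, the two strategies being compared can be sketched roughly as below. This is only an illustration, not the project's code: `RunQuery`, `fetchInParallel` and `combineWithUnion` are hypothetical names, and the per-dimension blocks are assumed to be complete SPARQL group patterns built elsewhere.

```typescript
// Hypothetical signature for whatever client actually sends SPARQL to the endpoint.
type RunQuery<Row> = (query: string) => Promise<Row[]>;

// Strategy 1 (current behaviour described above): one query per dimension,
// all fired in parallel, results concatenated afterwards.
async function fetchInParallel<Row>(
  queries: string[],
  runQuery: RunQuery<Row>
): Promise<Row[]> {
  const results = await Promise.all(queries.map((query) => runQuery(query)));
  return ([] as Row[]).concat(...results);
}

// Strategy 2 (explored in this PR): a single query whose WHERE clause is the
// UNION of the per-dimension blocks.
function combineWithUnion(selectClause: string, blocks: string[]): string {
  const body = blocks
    .map((block) => `  {\n${block}\n  }`)
    .join("\n  UNION\n");
  return `${selectClause}\nWHERE {\n${body}\n}`;
}
```

With strategy 2 the endpoint receives one request instead of one per dimension, but the whole result is only available once the slowest block has finished, which is the trade-off discussed in this conversation.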
@bprusinowski it seems those {
# versioned case
...
?dimension schema:version ?version
...
} UNION {
# nonversioned case
...
FILTER NOT EXISTS { ?dimension schema:version ?version }
...
} i.e. like https://s.zazuko.com/HenZ6K HTH for self guiding the next steps 👍 |
Hey @Rdataflow, I optimized the query based on your suggestion and fixed some issues; it seems to fetch the correct values now. However, it looks like it performs worse than the current approach of parallel queries, both for smaller and bigger cubes. You can see some examples for the Photovoltaikanlagen (TEST, PR) and NFI: Change (TEST, PR) cubes (both using the INT data source to not rely on the cached endpoint, and with server-side cache disabled). It looks like the new approach is ~2x slower than what we currently have on TEST. I will take a deeper look to see if there's something obvious to optimize, but it would be great if you could also take a look in case you have a bit of time. Let me know what you think about the whole direction of combining the queries 👀 |
@bprusinowski unfortunately those /create/ links won't endure... do you have some permalinks maybe? |
i.e. those |
@Rdataflow of course – I updated the links, should be ok now (I blame it on Monday morning 🤦 😅) |
@bprusinowski what happens if you drop |
@Rdataflow I did some tests to fire the query "for the first time", to avoid some apparent caching on LINDAS side (did this by modifying the query to e.g. remove retrieval of color, so the query looks like new); it still looks like |
@bprusinowski the query pattern could still benefit from some minor optimization steps... this might look like, i.e.
# 3rd UNION block on a dimension
{
  SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
  WHERE {
    {
      SELECT ?observation
      WHERE {
        VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
        <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
        ?dimension sh:path ?dimensionIri .
        ?dimension schema:version ?version .
        FILTER NOT EXISTS {
          ?dimension sh:in ?in .
        }
        <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
      }
    }
    VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
    ?observation ?dimensionIri ?versionedValue .
    ?versionedValue schema:sameAs ?unversionedValue .
  }
}
UNION
{
  # 4th UNION block on a dimension
  SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
  WHERE {
    {
      SELECT ?observation
      WHERE {
        VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
        <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
        ?dimension sh:path ?dimensionIri .
        FILTER NOT EXISTS {
          ?dimension sh:in ?in .
        }
        FILTER NOT EXISTS {
          ?dimension schema:version ?version .
        }
        <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
      }
    }
    ?observation ?dimensionIri ?versionedValue .
    BIND(?versionedValue AS ?unversionedValue)
  }
}
then as the innermost nb: or in case you prefer the |
Thanks again @Rdataflow for optimizing the query 💯 Unfortunately, it looks like it's still significantly slower than the ones we have on TEST / INT / PROD. See this combined query that takes 12-13s – the same cube on TEST takes ~7-8s to load values for every dimension when the queries are fired separately. I think we might reach out to Zazuko, seeing that the approach we are currently trying doesn't seem to improve things – does that sound good? Maybe I'm missing some additional context, but knowing that we'll use a cached endpoint that will already offload a lot of computation from Stardog, I am not sure it's worth sacrificing 50% of performance (assuming it scales linearly 😅 – but even if not, an overhead of 4s for NFI cubes is noticeable) just to send a smaller number of queries. Let me know what you think @Rdataflow :) |
@bprusinowski the proposed query obviously fails to constrain the dimensionIri to the relevant dimension only, and therefore it suffers from heavily degraded performance; see comments inline. nb: regarding perf on TEST see VSHN SBAR-1122 and comment inline |
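One reading of this comment, sketched below: because the inner subquery projects only `?observation`, the binding of `?dimensionIri` made inside it is not visible in the outer pattern, so the outer pattern has to repeat the `VALUES` constraint (as the "3rd UNION block" example earlier in the thread does). The helper name `buildVersionedBlock` is hypothetical, and the `cube:`, `sh:` and `schema:` prefixes are assumed to be declared elsewhere in the final query.

```typescript
// Hypothetical helper: builds the block for one versioned dimension,
// constraining ?dimensionIri both inside the subquery and in the outer pattern.
function buildVersionedBlock(cubeIri: string, dimensionIri: string): string {
  const values = `VALUES ?dimensionIri { <${dimensionIri}> }`;
  return [
    "SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue",
    "WHERE {",
    "  {",
    "    SELECT ?observation",
    "    WHERE {",
    `      ${values}`,
    `      <${cubeIri}> cube:observationConstraint/sh:property ?dimension .`,
    "      ?dimension sh:path ?dimensionIri .",
    "      ?dimension schema:version ?version .",
    "      FILTER NOT EXISTS { ?dimension sh:in ?in . }",
    `      <${cubeIri}> cube:observationSet/cube:observation ?observation .`,
    "    }",
    "  }",
    "  # The subquery projects only ?observation, so ?dimensionIri is bound again here.",
    `  ${values}`,
    "  ?observation ?dimensionIri ?versionedValue .",
    "  ?versionedValue schema:sameAs ?unversionedValue .",
    "}",
  ].join("\n");
}
```

Without the second `VALUES` line, `?observation ?dimensionIri ?versionedValue` would match every property of every observation rather than only the one dimension of interest.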
Closes #1470
This PR is an exploration of the feasibility to fetch values for multiple dimensions in a single SPARQL query.
Constraints
FILTER(IF(?dimensionIri = <A>, ?dimensionIri = <a> && ?dimensionIri = <a>, ?dimensionIri))
schema:sameAs
is not used to indicate unversioned values, this is only the case when a dimension is versioned). We need to combine individual queries into a big one (at least that's my current assumption).