
Proof of concept for SPARQL UPDATE #916

Open
wants to merge 30 commits into master

Conversation

hannahbast
Member

This is not yet meant for production but as a proof of concept. At the time of the creation of this PR, there are commands insert and delete for inserting or deleting a single triple via the API. The triple is located in each of the permutations, and the location is stored in an internal data structure (of a new class DeltaTriples). There are already extensive unit tests for this functionality, and it seems to work correctly.

There is no code yet that actually takes the inserted or deleted triples into account when processing queries. But that is actually relatively little work (when reading a block from a permutation, just augment the result by the delta triples stored for that block). The hard part, it seems, is the location.
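
For illustration, an insertion or deletion might be issued like this (a hedged example: the port and the exact triple encoding are assumptions; only the parameter names `insert` and `delete` come from this PR):

```
curl -G http://localhost:7001 \
  --data-urlencode 'insert=<http://example.org/s> <http://example.org/p> <http://example.org/o>'

curl -G http://localhost:7001 \
  --data-urlencode 'delete=<http://example.org/s> <http://example.org/p> <http://example.org/o>'
```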

Hannah Bast added 6 commits March 15, 2023 21:11
1. Add support for URL parameters "insert=..." and "delete=..." for
   inserting or deleting a single triple.

2. Stub of new class `DeltaTriples` that maintains the set of inserted
   and deleted triples.

3. First working implementation of a method `findTripleInPermutation`
   that for a given triple and a given permutation, finds the matching
   block in that permutation and the right position in that block.
There is now a test that checks for all existing triples whether the
found location is correct (by checking `id2` and `id3` at the found
position in the found block; note that a block does not store an
explicit `id1` for a given position).

The `findTripleInPermutation` method is still (very) inefficient in that
it goes through the complete relation metadata in order to find the
sequence of `id1`s relevant for a block. This will be fixed in the next
commit.

Note: the previous commit lacked the new files `DeltaTriples.h`,
`DeltaTriples.cpp`, and `DeltaTriplesTest.cpp`.
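
To illustrate the idea (a minimal sketch, not the actual QLever code; all names and types below are simplified assumptions, and in particular real blocks do not store `id1` explicitly), locating a triple in a permutation amounts to a binary search over the block metadata, followed by a binary search within the decompressed block:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = uint64_t;
using Triple = std::array<Id, 3>;  // triple in the order of the permutation

struct BlockMetadata {
  Triple lastTriple;  // last triple contained in the block
};

struct TripleLocation {
  size_t blockIndex;  // block that contains (or would contain) the triple
  size_t rowInBlock;  // position of (or for) the triple inside that block
};

// Find the first block whose last triple is >= `t`, then the first row in
// the decompressed block that is >= `t`. `decompress` stands in for reading
// and decompressing a block from disk. (Edge case of `t` being larger than
// all triples omitted for brevity.)
template <typename Decompress>
TripleLocation locateTriple(const Triple& t,
                            const std::vector<BlockMetadata>& blocks,
                            Decompress decompress) {
  auto blockIt = std::lower_bound(
      blocks.begin(), blocks.end(), t,
      [](const BlockMetadata& b, const Triple& t) { return b.lastTriple < t; });
  size_t blockIndex = static_cast<size_t>(blockIt - blocks.begin());
  std::vector<Triple> rows = decompress(blockIndex);
  size_t rowInBlock = static_cast<size_t>(
      std::lower_bound(rows.begin(), rows.end(), t) - rows.begin());
  return {blockIndex, rowInBlock};
}
```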
1. Refactored `DeltaTriples::locateTripleInAllPermutations` (the central
   method of this class).

2. Wrote a test that checks all triples that are contained in the index
   as well as a slightly modified version of each triple that is not in
   the index. The test checks that each triple has been located at
   exactly the right position in all permutations.

   (This is harder than it seems because a lot of things can go wrong,
   and we do not have the relation `Id`s for the blocks explicitly, but
   only implicitly via the relation metadata.)

3. The method `locateTripleInAllPermutations` now inserts the results
   into proper data structures that can then be used conveniently in an
   index scan (writing that code is the next step, but it should be
   relatively straightforward now).
1. The internal data structure for each permutation now stores a
   `std::set` (which is ordered) for the triples located at a certain
   position of a certain block.

2. For each triple that is inserted or deleted, remember the iterator
   into the corresponding `std::set` for each permutation. This is
   important for undoing an insertion or deletion (for example, when
   re-inserting a previously deleted triple).

3. The unit tests now test insertion and deletion directly. The
   following cases are covered now: deletion of triples that exist in
   the index, insertion of triples that did not exist in the index,
   deletion of previously inserted triples, and re-insertion of
   previously deleted triples.

4. Move the visualization code (that shows the contents of a particular
   block where a triple has been located) out of the implementation of
   the `DeltaTriples` class and into `DeltaTriplesTest`.
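
A minimal sketch of these data structures (names and layout are assumptions, not the actual code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <tuple>

using Id = uint64_t;

struct LocatedTriple {
  size_t blockIndex;  // block of the permutation where the triple belongs
  size_t rowInBlock;  // exact position inside that block
  Id id1, id2, id3;   // the triple, in the order of the permutation
  bool isInsertion;   // true for inserted triples, false for deleted ones
  bool operator<(const LocatedTriple& other) const {
    return std::tie(blockIndex, rowInBlock, id1, id2, id3) <
           std::tie(other.blockIndex, other.rowInBlock, other.id1, other.id2,
                    other.id3);
  }
};

// One ordered set of located triples per block, so that they can be merged
// into the result of a block scan in a single pass.
using LocatedTriplesPerBlock = std::map<size_t, std::set<LocatedTriple>>;

// For each inserted or deleted triple, keep the iterator into the set of
// each of the six permutations. `std::set` iterators stay valid under
// insertion and erasure of other elements, so undoing an operation (e.g.,
// re-inserting a previously deleted triple) is just an erase via the
// stored iterators.
struct DeltaTripleHandle {
  std::array<std::set<LocatedTriple>::iterator, 6> locations;
};
```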
@codecov

codecov bot commented Mar 19, 2023

Codecov Report

Patch coverage: 85.29% and project coverage change: -1.97% ⚠️

Comparison is base (df93d56) 74.07% compared to head (48578d3) 72.11%.

❗ Current head 48578d3 differs from pull request most recent head 8910f61. Consider uploading reports for the commit 8910f61 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #916      +/-   ##
==========================================
- Coverage   74.07%   72.11%   -1.97%     
==========================================
  Files         254      247       -7     
  Lines       23997    24389     +392     
  Branches     3021     3156     +135     
==========================================
- Hits        17776    17587     -189     
- Misses       5010     5449     +439     
- Partials     1211     1353     +142     
| Impacted Files | Coverage Δ |
|---|---|
| src/global/Id.h | 84.84% <ø> (ø) |
| src/index/CompressedRelation.h | 100.00% <ø> (ø) |
| src/index/ConstantsIndexBuilding.h | 100.00% <ø> (ø) |
| src/index/Index.h | 100.00% <ø> (+25.00%) ⬆️ |
| src/index/IndexImpl.h | 57.14% <0.00%> (-9.53%) ⬇️ |
| src/index/IndexMetaData.h | 67.56% <ø> (+1.85%) ⬆️ |
| src/util/AllocatorWithLimit.h | 95.16% <ø> (ø) |
| src/index/Index.cpp | 46.39% <22.22%> (-3.61%) ⬇️ |
| src/index/DeltaTriples.cpp | 68.90% <68.90%> (ø) |
| src/index/DeltaTriples.h | 83.33% <83.33%> (ø) |

... and 9 more

... and 88 files with indirect coverage changes


It complained about an assignment in the condition of an `if`.
hannahbast pushed a commit that referenced this pull request Mar 19, 2023
1. The block size used to be `1 << 23` (over 8M), which is too large,
   since we always need to decompress at least one whole block, even
   when reading only a few triples. It's now 100'000, which still has a
   relatively small overall space consumption.

2. Add member `_col2LastId` to block data because we need it for the
   delta triples (#916).
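
A simplified sketch of the block metadata with the new member (only `_col2LastId` is from the commit message; the surrounding fields are assumptions, reusing the `Id` type from the sketches above):

```cpp
struct CompressedBlockMetadata {
  size_t _numRows;
  Id _col0FirstId, _col0LastId;  // range of first-column ids in this block
  Id _col1FirstId, _col1LastId;  // range of second-column ids
  Id _col2LastId;                // new: last third-column id, so that the last
                                 // *complete* triple of a block is known and a
                                 // delta triple can be located exactly, even
                                 // at a block boundary, without decompressing
};
```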
Hannah Bast added 8 commits March 21, 2023 13:25
1. Each permutation now has information about the delta triples per
   block and where exactly they are located in that permutation.

2. As a proof of concept, the LOG already shows the number of delta
   triples found in the blocks relevant for the scan.
1. Improved the names. In particular, `TripleWithPosition` is now called
   `LocatedTriple`, and analogously for the related classes.

2. Put `LocatedTriple` and the associated classes in their own file
   `LocatedTriple.h`. Also moved a function there that used to be a
   helper lambda in one of the `DeltaTriples` methods.

TODO: The `locatedTripleInPermutation` method should also be moved to
`LocatedTriple.h` and there should be a `test/LocatedTripleTest` with
all the associated tests.
1. The code for locating triples in an individual permutation is no
   longer in the (already too big) class `DeltaTriples` but in separate
   classes in the files `LocatedTriples.{h,cpp}`. The corresponding
   tests are also in a new file `test/LocatedTriplesTest.cpp` now. Used
   the opportunity to improve the code in several respects.

2. First attempt at writing a function that, for a given block, merges
   the relevant delta triples into it. Tried to do it in place without
   using an extra array, until I realized that this leads to a very hard
   algorithmic problem, which most likely cannot be solved efficiently
   in practice. At least it's clear now that the best approach is to
   first decompress the block into a temporary array and then merge that
   temporary array and the relevant delta triples into the pre-allocated
   portion of the result `IdTable`. That's what I will implement next,
   and it shouldn't be hard.

3. When testing the merging, it helps to output the columns of an
   `IdTable`. For that, values like `VocabIndex:15` are rather
   impractical to read, so I abbreviated these to use a one-letter
   prefix, as in `V:15`.
There is now a method `LocatedTriplesPerBlock::mergeTriples` that merges
the delta triples into a given (possibly partial) block. There are unit
tests for the following cases: full block with unrestricted `id1` and
`id2`, full block with restricted `id1`, partial block with restricted
`id1`, partial block with restricted `id1` and `id2`, and the latter
with only a single column.

Removed the previous complicated version that tried to do it in place.
Implemented it and wrote some unit tests.

TODO: If the first block of the scan is incomplete, delta triples are
currently ignored (with a warning). There is no problem in principle
with adding this, but it first needs some refactoring of the original
code to avoid code duplication.
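
A minimal sketch of such a merge (a simplified stand-in for `LocatedTriplesPerBlock::mergeTriples`, reusing the `Triple` and `LocatedTriple` types from the sketches above; the real method writes into a pre-allocated portion of the result `IdTable` rather than a fresh vector):

```cpp
#include <cstddef>
#include <set>
#include <vector>

// `blockRows` is the decompressed (possibly partial) block, `deltas` the
// located triples for this block; both are sorted in permutation order.
std::vector<Triple> mergeTriples(const std::vector<Triple>& blockRows,
                                 const std::set<LocatedTriple>& deltas) {
  std::vector<Triple> result;
  result.reserve(blockRows.size() + deltas.size());
  size_t row = 0;
  for (const LocatedTriple& lt : deltas) {
    // Copy the rows of the block up to the position of this delta triple.
    while (row < lt.rowInBlock && row < blockRows.size()) {
      result.push_back(blockRows[row++]);
    }
    if (lt.isInsertion) {
      result.push_back({lt.id1, lt.id2, lt.id3});  // insert the new triple
    } else if (row < blockRows.size()) {
      ++row;  // deletion: skip the row that holds this triple
    }
  }
  // Copy the remaining rows after the last delta triple.
  while (row < blockRows.size()) {
    result.push_back(blockRows[row++]);
  }
  return result;
}
```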
joka921 pushed a commit that referenced this pull request Apr 4, 2023
1. The block size used to be 1 << 23 (over 8M), which is too large, since we always need to decompress at least one whole block, even when reading only a few triples. It's now 500'000, which still has a relatively small overall space consumption.

2. Add member _col2LastId to block data because we will need it for #916 and it's nice if our live indexes already have this information so that we can play around with this PR without having to rebuild indexes.

3. Renamed the data members in `CompressedRelation.h` such that they have `trailingUnderscores_`.

4. Unrelated fix in IndexTestHelpers.h: The test TTL file was not deleted after the test, now it is.
Hannah Bast added 8 commits April 4, 2023 12:31
The variant of `CompressedRelation::scan` with two `Id`s fixed now also
considers delta triples. This was significantly more complicated than
for the variant with only one `Id` fixed and required quite a bit of
refactoring.

TODO: For the first incomplete block when only a single `Id` is fixed,
delta triples are still not considered. That should be easy to add
though.
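
For the variant with two `Id`s fixed, one plausible reading of the above (a hedged sketch with the assumed types from earlier; the actual refactoring in this commit may differ) is that only the located triples of a block that match both fixed ids take part in the merge:

```cpp
// Restrict the located triples of a block to those matching the two fixed
// ids of the scan before merging them into the block's rows.
std::set<LocatedTriple> matchingDeltas(const std::set<LocatedTriple>& deltas,
                                       Id id1, Id id2) {
  std::set<LocatedTriple> result;
  for (const LocatedTriple& lt : deltas) {
    if (lt.id1 == id1 && lt.id2 == id2) result.insert(lt);
  }
  return result;
}
```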
There was still one case missing: the possibly incomplete block at the
beginning when only a single `Id` is fixed. Now delta triples are also
considered for these blocks.

TODO: The unit test works for a permutation with a single relation, but
there is still a problem when there are multiple relations.
I am not convinced though that it was a bug.
@sonarqubecloud

SonarCloud Quality Gate failed.

Bug: E (1 Bug)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (84 Code Smells)

No Coverage information
0.0% Duplication

Hannah Bast added 2 commits May 24, 2023 22:19
[incomplete, intermediate commit so that I can switch branches]
One test in `LocatedTriplesTest` still fails because writing the files
for a PSO permutation and then reading from it no longer works as it
did. I hope that Johannes or Julian can help me.
hannahbast pushed a commit that referenced this pull request Jun 9, 2023
This is the first part of a series of PRs split off from the large
proof-of-concept PR #916,
which realizes SPARQL 1.1 Update.
@sonarqubecloud

sonarqubecloud bot commented Jun 9, 2023

SonarCloud Quality Gate failed.

Bug: E (1 Bug)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (79 Code Smells)

No Coverage information
0.0% Duplication

Qup42 added a commit to Qup42/qlever that referenced this pull request Apr 29, 2024
@arcangelo7

arcangelo7 commented May 20, 2024

Hi Hannah,

Thank you for the updates on the proof of concept for SPARQL UPDATE. For the OpenCitations project, the update functionality has become a priority as we are looking to migrate our OpenCitations Meta data to Qlever. We would like to inquire about the current progress on this feature.

Specifically, we are interested in understanding if it is possible to add triples via SPARQL using the insert and delete commands mentioned. From the PR description, it is not clear if these operations can be performed directly via SPARQL or if they are limited to inserting or deleting one triple at a time through the API.

Could you please provide more details on this? Any updates or clarifications on the state of the SPARQL UPDATE feature would be greatly appreciated.

@Spothedog1

Also following up here: I saw this issue, which sort of indicates that this feature is implemented; however, when I try to run an insert, I get a "This SPARQL Endpoint is read-only" message.

I assume that since this PR is still open, the feature is not implemented yet, but some confirmation would be helpful.
