nquads silently drops lines containing quotes #72

Closed
kortschak opened this issue Jul 19, 2014 · 1 comment

@kortschak

If you apply the patch below, or an equivalent, and run cayley http --dbpath=30kmoviedata.nt, 137 lines are reported as dropped. Every one of them contains at least one quote mark (single or double).

    diff --git a/nquads/nquads.go b/nquads/nquads.go
    index 1f534b0..c555cc2 100644
    --- a/nquads/nquads.go
    +++ b/nquads/nquads.go
    @@ -16,6 +16,7 @@ package nquads

     import (
            "bufio"
    +       "fmt"
            "io"
            "strings"

    @@ -185,11 +186,13 @@ func ReadNQuadsFromReader(c chan *graph.Triple, reader io.Reader) {
                            continue
                    }
                    triple := Parse(line)
    -               line = ""
                    if triple != nil {
                            nTriples++
                            c <- triple
    +               } else {
    +                       fmt.Printf("dropped line: %q\n", line)
                    }
    +               line = ""
            }
            glog.Infoln("Read", nTriples, "triples")
            close(c)
@kortschak

This appears to be merely an issue with the 30kmoviedata.nt dataset (see below), combined with the parser swallowing error conditions. An upcoming PR from me should fix it: it will add an error return for cases like this and correct the malformed 30kmoviedata.nt lines.

    $ grep "Weird Al" 30kmoviedata.nt
    ":/en/the_weird_al_yankovic_video_library" "name" "The "Weird Al" Yankovic Video Library" .
    ":/en/weird_al_yankovic_live" "name" ""Weird Al" Yankovic Live!" .
    ":/en/weird_al_yankovic" "name" ""Weird Al" Yankovic" .
    ":/en/weird_al_yankovic_the_ultimate_collection" "name" ""Weird Al" Yankovic: The Ultimate Collection" .
    ":/en/weird_al_yankovic_the_ultimate_video_collection" "name" ""Weird Al" Yankovic: The Ultimate Video Collection" .
    ":/en/weird_al_yankovic_the_videos" "name" ""Weird Al" Yankovic: The Videos" .

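For anyone hitting the same thing with their own data, a mechanical fix for lines like the ones above might look roughly like this. It is only a sketch, not the code from the eventual PR: the function name escapeObjectQuotes is made up for illustration, and it assumes every line has the shape "subject" "predicate" "object" . with unescaped double quotes occurring only inside the object.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeObjectQuotes is a hypothetical helper, not part of cayley.
// Assumption: the line looks like
//     "subject" "predicate" "object" .
// and only the object may contain unescaped double quotes.
// It returns the repaired line and true, or the input and false
// when the line does not have that shape.
func escapeObjectQuotes(line string) (string, bool) {
	trimmed := strings.TrimSpace(line)
	if !strings.HasSuffix(trimmed, ".") {
		return line, false
	}
	rest := strings.TrimSpace(strings.TrimSuffix(trimmed, "."))

	// Peel off the subject and predicate, which are assumed to be
	// quoted strings without embedded quotes.
	var head []string
	for i := 0; i < 2; i++ {
		if !strings.HasPrefix(rest, `"`) {
			return line, false
		}
		end := strings.Index(rest[1:], `"`)
		if end < 0 {
			return line, false
		}
		head = append(head, rest[:end+2])
		rest = strings.TrimLeft(rest[end+2:], " \t")
	}

	// Whatever remains is the object, including its outer quotes.
	if len(rest) < 2 || rest[0] != '"' || rest[len(rest)-1] != '"' {
		return line, false
	}
	inner := rest[1 : len(rest)-1]

	var b strings.Builder
	for i := 0; i < len(inner); i++ {
		c := inner[i]
		switch {
		case c == '\\' && i+1 < len(inner):
			// Keep existing escape sequences untouched.
			b.WriteByte(c)
			b.WriteByte(inner[i+1])
			i++
		case c == '"':
			b.WriteString(`\"`)
		default:
			b.WriteByte(c)
		}
	}
	return head[0] + " " + head[1] + ` "` + b.String() + `" .`, true
}

func main() {
	in := `":/en/weird_al_yankovic" "name" ""Weird Al" Yankovic" .`
	out, ok := escapeObjectQuotes(in)
	fmt.Println(ok, out)
}
```

Because existing backslash escapes are passed through unchanged, running it a second time over an already repaired line leaves that line as it is. Lines that do not fit the assumed shape are returned unmodified and would still need a manual look.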
kortschak added a commit to kortschak/cayley that referenced this issue Jul 22, 2014
Fixes issue cayleygraph#72

This change simplifies interactions with parsing N-Quads and makes
reading datasets more robust. Changes made while here also improve
performance:

benchmark           old ns/op     new ns/op     delta
BenchmarkParser     1058          667           -36.96%

We still use string concatenation which I'm not wildly happy about, but
I think this can be left for a later change.

Initial changes towards idiomatic error handling have been made. More
significant changes are needed, but these have subtle design implications
and need to be thought about more.

30kmoviesdata.nt.gz has been altered to properly escape double quotes.
This was done mechanically and with manual curation to pick up
stragglers.