Roundtrip differences #6
Re-checked roundtrip as of 9eab7ec
Comparing nodes file
Comparing edges file
Confirming Krusty::wd_to_neo4j.py
I ran the module and uploaded the CSV (wikibase dump) into the neo4j server.
Data counts
commands used for nodes files
commands used for edges files
commands used for nodes in neo4j
MATCH (n)
RETURN count(n)
commands used for edges in neo4j
MATCH ()-->() RETURN count(*)
sparql queries for wikibase for nodes
SELECT (COUNT(DISTINCT ?s) AS ?sc)
WHERE{ ?s wdt:P8 ?o }
sparql queries for wikibase for edges
Comparing neo4j networks in the browser
Comparing nodes file
MATCH (n)
RETURN DISTINCT labels(n),
count(*) AS NumberOfEntities, reduce(keys = [], keys_n in collect(keys(n)) | keys + filter(k in keys_n WHERE NOT k IN keys)) as EntityAttributes
ORDER BY NumberOfEntities DESC
To change in nodes file @stuppie
Comparing edges file
MATCH ()-[r]-()
RETURN DISTINCT type(r),
count(*) AS NumberOfRelationships, reduce(keys = [], keys_r in collect(keys(r)) | keys + filter(k in keys_r WHERE NOT k IN keys)) as EntityAttributes
ORDER BY NumberOfRelationships DESC
To change in edges file @stuppie
I ran a roundtrip, from the files here, and then dumped them out (to nodes_out.csv and edges_out.csv).
Comparing nodes file
Looking only at the IDs
$ cut -f1 -d, nodes_out.csv | sort > nodes_out_id.csv
$ cut -f1 -d, ngly1_concepts.csv | sort > ngly1_concepts_sort.csv
$ diff nodes_out_id.csv ngly1_concepts_sort.csv
Result: everything is there except for the 4 items with huge IDs (#2)
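The same ID comparison can be sketched in Python, which also reports which side each missing ID is on. This is only a minimal sketch, assuming both files are comma-separated with the ID in the first column (as the cut commands above imply); read_ids and compare_ids are hypothetical helper names, not part of Krusty. Unlike cut, csv.reader respects quoted fields, so results could differ if an ID column contains quoted commas.

```python
import csv


def read_ids(path):
    """Return the set of first-column values from a comma-separated file."""
    with open(path, newline="") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def compare_ids(out_path, in_path):
    """Return (IDs only in the input, IDs only in the output), each sorted."""
    out_ids = read_ids(out_path)
    in_ids = read_ids(in_path)
    return sorted(in_ids - out_ids), sorted(out_ids - in_ids)
```

For example, compare_ids("nodes_out.csv", "ngly1_concepts.csv") would return the IDs lost in the roundtrip as the first element and any unexpected new IDs as the second.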
Comparing edges file
$ cut -f1-3 -d, edges_out.csv | sort > edges_out_id.csv
$ cut -f1-3 -d, ngly1_statements.csv | grep -v ",None," | sort > ngly1_statements_id.csv
$ wc -l edges_out_id.csv ngly1_statements_id.csv
786913 edges_out_id.csv
791161 ngly1_statements_id.csv
We're missing 4248 lines...
Which subj IDs am I missing?
$ diff -U0 =(cut -f1 -d, edges_out_id.csv) =(cut -f1 -d, ngly1_statements_id.csv) | grep -E "^\+" | uniq -c
Missing 2827 from HGNC:6914 and 1402 from HGNC:8031, which we know.
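The per-subject tally can also be reproduced without diff. The sketch below is an approximation under stated assumptions: it takes the already-filtered edges_out_id.csv and ngly1_statements_id.csv files produced above, treats the first column as the subject ID, and uses a multiset difference rather than a line diff. The function names are hypothetical.

```python
import csv
from collections import Counter


def subject_counts(path):
    """Count how many rows each first-column (subject) ID has in a file."""
    with open(path, newline="") as handle:
        return Counter(row[0] for row in csv.reader(handle) if row)


def missing_per_subject(out_path, in_path):
    """Return [(subject, n)] where the input has n more rows than the output.

    Counter subtraction keeps only positive differences, so subjects that
    are fully present in the output do not appear in the result.
    """
    missing = subject_counts(in_path) - subject_counts(out_path)
    return missing.most_common()
```

The counts from missing_per_subject("edges_out_id.csv", "ngly1_statements_id.csv") should sum to the difference between the two wc -l figures if the output contains no rows that are absent from the input.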
What are the 19 others?
$ diff -U0 =(grep -v HGNC:6914 edges_out_id.csv | grep -v HGNC:8031 | cut -f2 -d,) =(grep -v HGNC:6914 ngly1_statements_id.csv | grep -v HGNC:8031 | cut -f2 -d,) | grep -E "^\+" | uniq -c
The rdf:type issue: #5
I know about colocalizes_with and contributes_to (NuriaQueralt/ngly1-graph#3)
For the other two, these look like weird edge cases. For example
There are two lines for the same edge in the input file: one has no ref, one does. In wikidata they become one statement, so on output we end up with one line instead of two. This isn't an issue, as we aren't actually missing anything.
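One way to list such cases up front is to look for edges that occur more than once in the input once everything past the third column is ignored. This is a rough sketch, assuming the first three columns identify the edge (as the cut -f1-3 commands above suggest) and that references live in later columns; duplicated_triples is a hypothetical helper name.

```python
import csv
from collections import Counter


def duplicated_triples(path):
    """Return {(col1, col2, col3): count} for edges that occur more than once.

    Only the first three columns are compared, so rows that differ solely
    in later columns (for example, in their reference) count as duplicates.
    """
    with open(path, newline="") as handle:
        counts = Counter(
            tuple(row[:3]) for row in csv.reader(handle) if len(row) >= 3
        )
    return {triple: n for triple, n in counts.items() if n > 1}
```

Each entry with count n would be expected to account for n - 1 "missing" lines in the output, since the duplicates collapse into a single edge.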