lang-links option while compiling wiki #30
Comments
Interesting.
It doesn't have to be in one direction; there's no reason why you can't have both 'water' -> 'вода' and 'вода' -> 'water'. It's also not necessary to go via the XDXF format, unless you want to use the output with other XDXF programs. In any case, your issue description doesn't seem to describe an actual issue or request. What are you suggesting?
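To illustrate the point about directions, here is a minimal sketch, assuming Python 3 and a `links` mapping shaped like the one in the issue description. The function name `all_directions` is made up for this example; it emits every ordered pair of languages, so both 'water' -> 'вода' and 'вода' -> 'water' come out of the same record.

```python
from itertools import permutations


def all_directions(links):
    """Yield (source_wiki, headword, target_wiki, translation) for every ordered pair."""
    # Sorting keeps the output order stable regardless of dict ordering.
    for (src, head), (dst, trans) in permutations(sorted(links.items()), 2):
        yield src, head, dst, trans


# Sample record taken from the issue description.
links = {'enwiki': 'water', 'ruwiki': 'вода'}
for row in all_directions(links):
    print(row)
```

With more than two languages in a record, this produces one row per ordered language pair, so the output grows quadratically with the number of links.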
Yes, of course.
If there's some work to be done then the issue better stay open :) You mentioned you have some code written, care to share?
I had some memory leak issues, after which I found this article:
import json

import page_parser  # third-party wiki dump parser; this script is Python 2


def yourCallback(page):
    # Each Wikidata page body is JSON with an 'entity' id and a 'links' map.
    try:
        j = json.loads(page.text)
        links = j['links']
        if 'trwiki' in links:
            # The Turkish title is the headword; other languages become references.
            print '<ar><k>' + links['trwiki'].encode('utf-8') + '</k>'
            print ' <def>' + j['entity'] + '</def>'
            if 'ruwiki' in links:
                print ' <def>ruwiki: <kref>' + links['ruwiki'].encode('utf-8') + '</kref></def>'
            if 'enwiki' in links:
                print ' <def>enwiki: <kref>' + links['enwiki'].encode('utf-8') + '</kref></def>'
            print '</ar>'
    except Exception:
        # Skip pages whose text is not in the expected JSON shape.
        pass


page_parser.parseWithCallback("wikidatawiki-20130505-pages-meta-current.xml", yourCallback)

Final step: replace all ampersand symbols & with &amp;
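The ampersand replacement could also be done while each entry is built, rather than as a separate pass over the output. A minimal sketch, assuming Python 3 and the standard library's `xml.sax.saxutils.escape`; the helper `make_article` and the sample headword are hypothetical, not part of the script above.

```python
from xml.sax.saxutils import escape


def make_article(headword, entity_id):
    """Return one <ar> element with XML-special characters escaped."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    return '<ar><k>%s</k><def>%s</def></ar>' % (escape(headword), escape(entity_id))


print(make_article('Tom & Jerry', 'Q12345'))
```

Escaping at this point also covers `<` and `>` in titles, which a plain search-and-replace for `&` would miss.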
Wikidata is migrating interlanguage wiki links from individual articles into a central database to ease maintenance.
Here is detailed information: https://en.wikipedia.org/wiki/Wikipedia:Wikidata
Now all links are stored in the Wikidata database, which is about 15 GB in size.
The article structure is something like this:
Q12345
enwiki: 'water'
ruwiki: 'вода'
and so on. Each article now has its own number, starting with Q.
I have written a small script to parse that dump and create a separate XDXF dictionary in "one direction", for example ENG -> RUS, FR. It is a sort of English-to-Russian dictionary based on Wikipedia.
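As a rough illustration of the "one direction" idea, the sketch below (Python 3) picks one wiki as the source of headwords and collects the titles from the chosen target wikis. The function name `one_direction` and the in-memory `records` dict are assumptions made for this example; the real dump is far too large to load this way and would need to be streamed.

```python
def one_direction(records, source, targets):
    """Map each source-language title to its translations in the target wikis."""
    dictionary = {}
    for links in records.values():
        if source not in links:
            continue  # no headword in the source language
        translations = {t: links[t] for t in targets if t in links}
        if translations:
            dictionary[links[source]] = translations
    return dictionary


# Sample record in the shape described above (Q number -> wiki -> title).
records = {
    'Q12345': {'enwiki': 'water', 'ruwiki': 'вода'},
}
print(one_direction(records, 'enwiki', ['ruwiki', 'frwiki']))
```

Entries with no title in any target wiki are skipped, so the resulting dictionary only contains headwords that actually have a translation.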