MTNI-267 ⁃ Import data from old extension to new DB and data model #19
Also important to note here: this function should only run once, and it MUST somehow be verified that the transfer really happened. We should not lose content at this stage. |
It would make sense to allow the import of an external JSON file with the respective data. This would make it possible to do testing with larger data sets, as well as to create and upload backups. Opened a new issue, #37, for that to keep the work separate. However, the "storing of JSON" mechanism could possibly be a shared component. |
@arpitgogia There is no data dump functionality in the old extension, right? Or at least I cannot see one.

Data Import/Restore

To meet this issue's requirements, I think it would need some minor changes to the old extension. The two ways I can think of are:

1. add a data dump feature to the old extension, producing a file that the user then imports into the new extension
2. expose the old extension's data to the new extension via the external messaging protocols, so the new extension can pull it in and convert it directly
2 may be the simplest as it's not user-facing, but it would require the user to have both extensions installed to do the import + conversion. 1 would require a user to dump the file in the old extension, then interact with a file upload in the new extension (I could hook it in with the Restore process of #37 and detect the restore file type, or something). 2 may also be a problem for FF users of the old extension. Not sure if it's possible in Firefox without the web extension polyfill, which the old extension doesn't seem to be using? (may be my lack of understanding here and it is possible)

Data conversion

For convenience, in the old extension (correct me if I'm getting things wrong) the data model looks like this:
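Roughly something like the following sketch - all field names and shapes here are guesses pieced together from the discussion below (index keyed by visit time, one page object per visit, a separate list of bookmarked URLs), not the actual code:

```js
// Hypothetical reconstruction of the old extension's local storage shape.
// Every key and field name here is an assumption for illustration only.
const oldLocalStorage = {
  index: ['1498550400000', '1498550500000'],     // keys ordered by visit time
  '1498550400000': {                             // one "page object" per visit
    url: 'https://example.com/article',
    title: 'Example article',
    text: 'Extracted page text...',
    time: 1498550400000,
  },
  bookmarkUrls: ['https://example.com/article'], // marks pages imported as bookmarks
};
```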
Could easily convert this to minimal page docs in the new extension model:
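A sketch of what such a minimal conversion could look like; the ID scheme and field names are assumptions rather than the real model:

```js
// Hypothetical conversion of one old-extension page object into a minimal
// page doc in the new (Pouch) model. ID convention and fields are assumed.
function toMinimalPageDoc(oldPage) {
  return {
    _id: `page/${oldPage.time}`, // assumed ID convention
    url: oldPage.url,
    content: {
      title: oldPage.title,
      fullText: oldPage.text,
    },
  };
}
```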
Some remarks:
Let me know if I'm overlooking something that is in the old extension. |
@poltak No, there is no data dump mechanism in the old extension as such. When we were shifting to pouch, I manually iterated over all the documents in local storage and inserted them into pouch. |
I suggest we iterate through local storage and use a wrapper over import history/bookmark module.
This doesn't address the problem of getting the local storage data from the old ext to the new ext though. AFAIK it's not possible to access local storage from other extensions, the same as with cross-domain access. It needs some sort of interface via the external messaging protocols, hence the #2 suggestion above.
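For reference, a rough sketch of what that cross-extension handover could look like via the external messaging API; the extension ID and message shape are made up, and the old extension would need a small change to register the listener:

```js
// In the NEW extension: ask the old extension to hand over its stored data.
// OLD_EXT_ID and the message/response shapes are illustrative only.
const OLD_EXT_ID = 'abcdefghijklmnopabcdefghijklmnop'; // hypothetical

chrome.runtime.sendMessage(OLD_EXT_ID, { type: 'DUMP_DATA' }, response => {
  if (chrome.runtime.lastError || !response) {
    console.error('Old extension not reachable');
    return;
  }
  convertOldData(response.data); // placeholder for the conversion entry point
});

// In the OLD extension's background page: respond to external requests
// with a dump of its local storage contents.
chrome.runtime.onMessageExternal.addListener((msg, sender, sendResponse) => {
  if (msg.type === 'DUMP_DATA') {
    sendResponse({ data: Object.assign({}, localStorage) });
  }
});
```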
Sent from my iPhone
On 27 Jun 2017, at 12:38, Arpit Gogia <[email protected]> wrote:
… I suggest we iterate through local storage and use a wrapper over import history/bookmark module.
|
My bad, I forgot that this will be a totally fresh install, and not a major update to the old one. |
@poltak @arpitgogia Keep in mind that this data transfer is only needed for users who already have the old extension installed. That means on update it will temporarily have 2 databases - local storage and pouch - and we'd have to transfer the DB elements 1 by 1. Also we'd have to do a duplicate check, as currently each visit creates its own "page object" in the old extension. What we would need to do is create just 1, then do a duplicate check every time a new one is added. If it is a duplicate, we would just add a visit element, not a new page object. |
@poltak so my assumption was correct, it won't be a fresh install. Hence we'll have access to the localStorage. Therefore it is easier to iterate over it and use a wrapper over import history/bookmark module. This way redundant page objects and visit docs can also be handled. |
Apologies, guys; seems like I misunderstood that it's gonna be released as an update over the old extension. Looking back, it's even in the OP:
Well that simplifies things :) Yes, we could go with @arpitgogia's idea to hook it into the imports process. That would mean the user has to do it manually? There's also a There is also the |
Alright got a little proof-of-concept of the conversion logic working on
1 gives a memory overhead of

The mapping of conversion logic in 3.2 involves:
So for each page data in the old extension, it will output:
One main assumption in this algorithm: all the old extension's page data is reachable through its index.
If it's not, then the only other algorithm I can think of is iterating over everything in local storage, meaning linear space complexity in the number of stored page docs (not good). I think this is a safe assumption to make, as anything not indexed wouldn't be searchable in the old extension, hence not really of concern/importance to the user? @arpitgogia maybe you can shed some light on this?

Plan to put some sort of deduping logic in there. @oliversauter seems happy to do simple URL deduping, which seems alright as the

One other cool thing is I can set the

Blacklist conversion is also done, although that's all very simple and doesn't need much of a discussion.

Note that it's hard to properly test this with real data. At the moment I am resorting to manually copying over local storage data from the old extension into the new one. May spend some time to come up with a data set that I can test with better (even just getting a number of page data from the old extension and duplicating them should suffice, before the deduping logic is there). The main thing I want to measure is the algorithm's performance on larger sets, as the visit conversion could be slow (same as pre-imports). |
If it is the history API you are thinking of, you may not get all the visits from there. |
Yes, the history API may or may not contain all visits, but it's another port of call for visits; if they're there, great, if not, no worries. But yes, we could actually just create a barebones visit doc from the old page data as well (since it's almost like a combined visit + page doc, thinking in relation to the new model), especially for dupes. Good point!
From what I understand, it's quite different if we just care about the URLs. We can skip a lot of unneeded complexity that happens in the deduping framework. This is the next step though, so it's to be thought about more. So after some sort of deduping is implemented, the output for each page data should be more like:
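Roughly something along these lines - the doc shapes and field names are assumptions for illustration:

```js
// Hypothetical output for one old-extension page object after URL deduping:
// either a fresh page doc plus its visit (and bookmark) docs, or - if the URL
// was already converted - just an extra visit doc pointing at the existing page.
const convertedOutput = {
  pageDoc: {            // omitted when a page doc for this URL already exists
    _id: 'page/1498550400000',
    url: 'https://example.com/article',
    content: { title: 'Example article', fullText: '...' },
  },
  visitDocs: [
    {
      _id: 'visit/1498550400000',
      page: { _id: 'page/1498550400000' },
      visitStart: 1498550400000,
    },
  ],
  bookmarkDocs: [],     // filled when the URL appears in the old bookmarkUrls array
};
```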
|
Where do you see the difference? From what I thought about in #17, we would also just check the URL in the simplest form of deduping in the beginning, so we could combine both here. (It happens only once anyhow for the transfer.)
Forgot to ask before: where does the bookmark element come from? That is, where does the list/field 'bookmarkUrls' come from? |
@poltak The index is maintained by visit time in the old extension, if I remember correctly. In any case, when you pull an object from the old ext.'s storage and insert it into the DB, the |
Because the deduping framework compares things like page text and title; URL is not a concern as it's not fixed. If we want to dedup via URL here, it can be done by a simple query.
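A minimal sketch of such a query, assuming the pouchdb-find plugin is available and converted page docs carry a `url` field:

```js
// URL-level dedup check (sketch): returns the existing page doc for this URL,
// or undefined if no page with that URL has been converted yet.
// Note: without an index on `url`, pouchdb-find will fall back to a full scan.
async function findExistingPage(db, url) {
  const { docs } = await db.find({
    selector: { url },  // matches docs whose `url` field equals the given URL
    limit: 1,
  });
  return docs[0];
}
```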
In the old extension data model, |
Basic URL deduping on the batch-level is done now, along with a minimal visit doc that can be converted from the old ext page data. The algorithm is now:
Conversion algorithm (3.4) would now look like:

3.3 ensures that the DB query happens on the batch level, once per batch rather than for every page data. As the map happens asynchronously for every item in the batch, without any enforced order, Pouch can get grumpy when many things try to query it around the same time, as we've seen. 3.2 ensures URL uniqueness within any given batch, as 3.3 is more like a previous-batch check. A lot of doc formatting logic is brought over from

Another thing is local storage clean-up, I suppose. At the end of the conversion process, it's probably wise to remove all the old data, although that should only happen if everything went alright.

Tested with some larger data sets today, but there is the problem of local storage being limited to 10MB (or is it 5?), hence I can only test with a few thousand sample page data. @arpitgogia What did the old extension do in regards to that limit? AFAIK it cannot be exceeded per domain, but I've heard of people having up to 1GB of storage in the old ext.

@oliversauter Do you have any opinions about when the data conversion should happen once the extension gets updated? There is the automatic route (listening on the extension update event), which allows it to happen without the user knowing and lets them search their old data almost straight away. Or it could maybe happen when a user triggers imports (maybe the pre-imports stage, if the old data is detected), but that means when the extension updates the overview will be blank, etc. |
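A rough sketch of how 3.2 and 3.3 could fit together; the function names and pouchdb-find usage are my assumptions, not the actual code:

```js
// Batch-level dedup sketch: 1) dedupe URLs within the batch itself (3.2),
// 2) a single query against already-converted page docs per batch (3.3),
// so the DB is hit once per batch instead of once per page object.
async function dedupeBatch(db, batch) {
  const seen = new Set();
  const uniqueInBatch = batch.filter(page => {
    if (seen.has(page.url)) return false; // duplicate within this batch
    seen.add(page.url);
    return true;
  });

  // One query per batch for URLs converted in *previous* batches
  // (assumes the pouchdb-find plugin and a `url` field on page docs).
  const { docs } = await db.find({
    selector: { url: { $in: [...seen] } },
  });
  const alreadyConverted = new Set(docs.map(doc => doc.url));

  return uniqueInBatch.filter(page => !alreadyConverted.has(page.url));
}
```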
I wonder what the purpose of the bookmarkURLs array was. Does it have all the data we would also get from the Bookmarks API? Otherwise it might be better to use this array simply as a way to get the missing data from the Bookmarks API. This way we won't have to check each URL, which could be a lot.
I would do it straight away on update. Since this all can take a while (including indexing), we could add a notification into the address bar, in place of the first element, saying "We just made a major update to the WorldBrain extension. It will take some minutes for the update to complete. Learn more >>". They won't know about the overview anyhow, so for them the mode of entry is only the address bar.
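If the automatic route is taken, the hook could be as simple as something like this (`convertOldData` is a placeholder name, not the actual entry point):

```js
// Sketch of the "straight away on update" route: run the conversion when the
// extension is updated, but not on fresh installs.
chrome.runtime.onInstalled.addListener(details => {
  if (details.reason === 'update') {
    convertOldData().catch(err =>
      console.error('Old data conversion failed', err),
    );
  }
});
```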
Couldn't we have the deletion process in the batch? Meaning: check if the URL has been transferred and then delete it right away from local storage? |
It seems to be to mark page data as being for a bookmark, as opposed to history. If I install the extension and import bookmarks, I can see that array being filled out.
Yes, all the same data you get from the bookmarks API apart from the in-browser bookmark ID. That is left out of the converted minimal visit doc as well (because it's not available). The bookmarks API doesn't seem to have a way of querying bookmarks on non-ID attributes anyway, so the only way to use that at this stage would mirror how it's being used in imports, hence kinda redundant. So we either continue to create bookmark docs here, or we ignore them and let the user sort that out in imports.
Great idea! I'll find out how to do that.
Yes much better to have it in there at the end of each batch. |
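That per-batch clean-up could look roughly like this; a sketch only, with assumed key names and doc shapes:

```js
// Per-batch clean-up sketch: only remove old-extension entries whose docs were
// successfully written to Pouch, so nothing is lost if a write fails.
async function writeBatchAndCleanup(db, convertedDocs, oldStorageKeys) {
  const results = await db.bulkDocs(convertedDocs);
  const allOk = results.every(res => res.ok);

  if (allOk) {
    // Old extension keeps its data in localStorage, so remove keys directly.
    oldStorageKeys.forEach(key => localStorage.removeItem(key));
  }
  return allOk;
}
```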
Alright, this seems to be working quite well now. Added in some little clean-up bits, a user notification with progress count (here is what that looks like), and all seems to be working great with my test data.

Would be nice to actually test it in a real environment (i.e., installing the old extension, importing data, then triggering an extension update to our current version), but I'm not exactly sure how that could be done. I'll have a think, as without seeing it work automatically, it's a bit scary having such an important process unconfirmed to run without problems. If all our old users auto-update and it doesn't run, well, that won't be very fun.

If we have a blog post or something explaining the update later, I should be able to add that into the notification so it opens when you click it. And of course we should revisit this just before release, to see if there are any obvious problems that have been missed. |
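The progress notification could be done with something along these lines; the notification ID and icon path are placeholders, and Chrome's 'progress' notification type is assumed (Firefox only supports 'basic'):

```js
// Sketch of a conversion progress notification using the notifications API.
function updateConversionNotification(done, total) {
  chrome.notifications.create('old-data-conversion', {
    type: 'progress',                       // Chrome-only; use 'basic' for Firefox
    iconUrl: '/img/worldbrain-logo.png',    // placeholder path
    title: 'WorldBrain update in progress',
    message: `Converted ${done} of ${total} pages from the old extension`,
    progress: Math.round((done / total) * 100),
  });
}
```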
I have an idea how to test it in a real environment:
|
So I just tried that out, and it seems to trigger the load, even without running a command, just by updating and reloading the code. The notification pops up, but it throws a bunch of errors and does not really progress.
|
Oh, and this conversion is triggered every time I "reload" the extension; maybe we need a flag so it is not triggered anymore once the thing is complete. |
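A sketch of such a run-once flag, assuming the webextension polyfill's promise-based storage API; the key name is made up:

```js
// Simple "run once" guard so reloading the extension does not re-trigger
// the conversion once it has completed successfully.
async function convertOnce() {
  const { oldDataConverted } = await browser.storage.local.get('oldDataConverted');
  if (oldDataConverted) {
    return; // conversion already completed on a previous run
  }
  await convertOldData(); // placeholder for the conversion entry point
  await browser.storage.local.set({ oldDataConverted: true });
}
```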
Bit hacky, but seems to work for the purpose of testing with the exact data from the old extension :) Found two minor issues with assumptions about the local storage state (index was doubly nested, bookmark array slightly different shape, somehow), which have been fixed up, and it seems to work great being invoked manually with the old extension's data. Now hopefully that

Those errors you encountered aren't related to this, but still very important overall as we would have ignored them otherwise. They seem to be with the

But there is also the

I'll try to leave the old extension running for a bit today to import part of my history, and see how the conversion process runs with more data (and how it handles the storage limit; whether it removes the data or just leaves it and throws an error). |
AFAIK localstorage is a Web SQL DB, so this attribute is important in the old extension to make the DB in localstorage possible. |
No, local storage isn't a Web SQL DB. However, I found the answer. Apparently the restrictions in standard

So yes, will have to add that.
Oh, I thought so because the content data was stored in SQLite on Chrome. Okidoki :)
@oliversauter Yes, that reload button does trigger the

Set the

Testing by importing data on the old ext, manually replacing the files with the new ext build, then triggering the |
Really slow after trying yesterday with more data I got throughout the day. Had to convert 6,440 old ext page data and it took ~10 mins to produce 35,079 docs in the new model (mostly visits). Didn't like this at all.

Did a bit of a performance profile and narrowed it down to the URL deduping logic (DB query on page docs with URL field). This was an unindexed query, so obviously it was going slower and slower as the conversion process advanced (should be linear, and a lot of the time worst case - no match found). Put a temp index on page doc URLs for the duration of the conversion process, which in theory allows these deduping queries to be done in logarithmic time (log N, where N is the number of page docs already converted). Also cannot do a single query per batch, as it needs an

Testing with 2k old ext page data this morning, it was able to finish in 11 secs, which is great! Going to test later today with much more data again and hopefully see it not noticeably slowing down for later batches. |
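The temp index handling could look roughly like this, assuming the pouchdb-find plugin; the wrapper name is a placeholder:

```js
// Sketch: create a temporary index on page doc URLs for the dedup queries,
// then drop it again once the one-off conversion has finished.
async function withTempUrlIndex(db, runConversion) {
  await db.createIndex({ index: { fields: ['url'] } });
  try {
    await runConversion();
  } finally {
    // Remove the temp index so it doesn't linger after conversion.
    const { indexes } = await db.getIndexes();
    const urlIndex = indexes.find(
      idx => idx.ddoc && idx.def.fields.some(field => 'url' in field),
    );
    if (urlIndex) {
      await db.deleteIndex(urlIndex);
    }
  }
}
```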
Yep, tested today with 6.5k and it took ~33 secs. Again with 2.5k tonight: ~18 s. Working great with a concurrency of 15 pages. Good stuff. The output looks fine. Leaving them not marked as stubs for now (so they won't be scheduled for imports), but will revisit this before release as it might be nice to schedule them for more filling out (metadata, favicon, etc.), but we need to make sure imports won't overwrite the old page text and title. To be merged after a self-walkthrough of the code. |
Background Information
The old extension stores data about websites (title, keywords, etc.) in local storage, whereas the new extension stores it in PouchDB, with a different data model.
When we update the code from the old extension to the new one, we have to make sure that all the previous data is transferred to the new DB and data model, and that entries are deduplicated (1 page object per URL).
Previous work to understand old data model:
WorldBrain/Legacy-Research-Engine#101
Caution: In some extensions there is about 1GB of data, and there will be many duplicates in the system. This means the import process may need to be executed in batches, and each element needs to be checked before it is stored to see whether one is already available. We should not add duplicates.
Also we need to make sure we have a visit item for each entry.
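As a very rough outline of that batched, dedup-checked transfer, see the sketch below; every function name in it is a placeholder, not the actual implementation:

```js
// Illustrative outline only: process old pages in batches, check for an
// existing page doc per URL, and always produce a visit doc per entry.
async function importOldData(db, oldPages, batchSize = 10) {
  for (let i = 0; i < oldPages.length; i += batchSize) {
    const batch = oldPages.slice(i, i + batchSize);

    for (const oldPage of batch) {
      const existing = await findExistingPage(db, oldPage.url); // dedup check
      if (existing) {
        await db.put(createVisitDoc(oldPage, existing._id));    // just add a visit
      } else {
        const pageDoc = createPageDoc(oldPage);
        await db.bulkDocs([pageDoc, createVisitDoc(oldPage, pageDoc._id)]);
      }
    }
  }
}
```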
Inspiration for that process could be gotten from @poltak's PR: #25
Also, in the WebMemex upstream there has been work happening towards a simpler data model, which should be taken into consideration: WebMemex/webmemex-extension#101