
MTNI-267 ⁃ Import data from old extension to new DB and data model #19

Closed
blackforestboi opened this issue May 5, 2017 · 29 comments
@blackforestboi (Member) commented May 5, 2017

Background Information

The old extension stores data about websites (title, keywords, etc.) in local storage, whereas the new extension stores it in PouchDB with a different data model.

When we update the code from the old extension to the new one, we have to make sure that all the previous data is transferred to the new DB and data model, and that entries are deduplicated (1 page object per URL).

Previous work to understand old data model:
WorldBrain/Legacy-Research-Engine#101

Caution: Some installations hold about 1GB of data, and there will be many duplicates in the system. This means the import process may need to be executed in batches, and each element needs to be checked before it is stored to see whether one already exists. We should not add duplicates.
We also need to make sure there is a visit item for each entry.

Inspiration for that process could be gotten from @poltak's PR: #25

Work has also been happening in the WebMemex upstream towards a simpler data model, which should be taken into consideration: WebMemex/webmemex-extension#101



@blackforestboi (Member Author)

It is also important to note that this function should only run once, and it MUST somehow be verified that the transfer really happened. We should not lose content at this stage.

@blackforestboi (Member Author)

It would make sense to allow the import of an external JSON file with the respective data. This would make it possible to do testing with larger data sets, as well as to create and upload backups.

Opened a new issue (#37) for that to keep the work separate. However, the "storing of JSON" mechanism could possibly be a shared component.

@poltak (Member) commented Jun 27, 2017

@arpitgogia There is no data dump functionality in the old extension, right? Or at least I cannot see one.

Data Import/Restore

To meet this issue's requirements, I think it would need some minor changes to the old extension. The two ways I can think of are:

  1. add a data dump button to old extension that dumps all page data from local storage to a file (could be done in a similar way I proposed in Upload Test Data Set #37)
  2. develop a really simple API (maybe sends chunks of data from local storage on request) using cross-extension messaging

2 may be the simplest as it's not user-facing, but it would require the user to have both extensions installed to do the import + conversion. 1 would require a user to dump the file in the old extension, then interact with a file upload in the new extension (I could hook it into the Restore process of #37 and detect the restore file type, or something).

2 may also be a problem for FF users of the old extension. Not sure if it's possible in Firefox without the web extension polyfill, which the old extension doesn't seem to be using? (Maybe my lack of understanding here, and it is possible.)

Data conversion

For convenience, in the old extension (correct me if I'm getting things wrong) the data model looks like this:

```js
{
  text: string, // the page text
  time: number, // timestamp of when it was visited, or bookmarked in the bookmark case
  title: string, // page title, or bookmark title
  url: string,
}
```

Could easily convert this to minimal page docs in the new extension model (a rough sketch follows the list):

  • text could be doc.content.fullText
  • title could be doc.title + doc.content.title
  • time could be set as part of doc._id (generated from timestamp + nonce values)
  • url could be doc.url
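
A rough sketch of that mapping, assuming the field names above; the `page/` + timestamp + nonce `_id` format here is just for illustration, not the actual new-model implementation:

```js
// Sketch only: convert one old-extension page entry into a minimal page doc.
// The `page/` prefix and nonce-based _id format are assumptions for illustration.
function convertOldPageData(oldPage) {
    const nonce = Math.random().toString(36).slice(2, 8)

    return {
        _id: `page/${oldPage.time}/${nonce}`, // visit/bookmark time encoded into the doc ID
        url: oldPage.url,
        title: oldPage.title,
        content: {
            title: oldPage.title,
            fullText: oldPage.text,
        },
    }
}
```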

Some remarks:

  • doesn't seem to be able to differentiate between bookmarks and history (so bookmark docs in new model may not be restorable)
  • there's nothing corresponding to visit docs, which the new model relies on, so there may have to be a post-restore stage where we iterate through all the restored page data URLs and generate visit docs for them (shouldn't be too hard; still may take a while for big inputs - same as imports)

Let me know if I'm overlooking something that is in the old extension.

@bohrium272 (Member) commented Jun 27, 2017

@poltak no, there is no data dump mechanism in the old extension as such. When we were shifting to Pouch, I manually iterated over all the documents in local storage and inserted them into Pouch.
I suggest we iterate through local storage and use a wrapper over the import history/bookmark module to make documents according to the new data model. That would be pretty convenient.
Also, we should keep this as a user-blocking activity, as in: warn the user that this is going on, along the lines of "You must absolutely let this happen or you'll lose your data". Kind of brute force, but this is one way to ensure preservation of the data.

@poltak (Member) commented Jun 27, 2017 via email

@bohrium272 (Member)

My bad, I forgot that this will be a totally fresh install, and not a major update to the old one.
In which case I think #2 would work well. It will of course involve having both extensions installed together. Approach 1 is undesirable if the database is large (which it would be if someone has been using it since the last release), so approach 2 qualifies. I think making the user aware of this process can lead to a smooth transition.

@blackforestboi (Member Author)

@poltak @arpitgogia

Keep in mind that this data transfer is only used by users who already have the old extension installed.
In case of an update, what changes for them is the code of their current installation, so they won't have 2 extensions installed. (Updates come directly from the Chrome store.)

This means that on update, the extension will temporarily have 2 databases - local storage and Pouch - and we'd have to transfer the DB elements 1 by 1.

Also, we'd have to do a duplicate check, as currently each visit creates its own "page object" in the old extension. What we would need to do is create just 1, then do a duplicate check every time a new one is added.

If it is a duplicate, we would just add a visit element, not a new page object.
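
In code, that per-entry decision could look roughly like this (a sketch with hypothetical helper names, assuming PouchDB as the target DB):

```js
// Hypothetical helpers: findPageDocByUrl(), createPageDoc(), createVisitDoc();
// `db` is assumed to be the new extension's PouchDB instance.
async function importOldEntry(oldPage) {
    const existingPageDoc = await findPageDocByUrl(oldPage.url)

    if (existingPageDoc) {
        // Duplicate URL: only add a visit element pointing at the existing page object
        await db.put(createVisitDoc(oldPage, existingPageDoc))
    } else {
        // First occurrence of this URL: create the page object plus its first visit
        const pageDoc = createPageDoc(oldPage)
        await db.bulkDocs([pageDoc, createVisitDoc(oldPage, pageDoc)])
    }
}
```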

@bohrium272 (Member)

@poltak so my assumption was correct, it won't be a fresh install. Hence we'll have access to localStorage. Therefore it is easier to iterate over it and use a wrapper over the import history/bookmark module. This way redundant page objects and visit docs can also be handled.

@poltak (Member) commented Jun 27, 2017

Apologies, guys; seems like I misunderstood that it's gonna be released as an update over the old extension. Looking back, it's even in the OP:

When we update the code from the old extension to the new one

Well that simplifies things :) yes could go with @arpitgogia's idea to hook it into the imports process. That would mean the user has to do it manually?

There's also a browser.runtime.onInstalled event to detect ext updates that could trigger something in UI state. Maybe like an initial options page state that, if old extension data is detected, blocks the options page until the user chooses to do the conversion or skip it? Some extensions do the new tab load to set up things on install as well. Both seem intrusive and annoying compared to putting it into imports, but it is an important process for anyone that is migrating to the new extension and wants to continue using it as they did (it will be weird for them on the day when it automatically updates and then they try to search and see nothing there).
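
For reference, detecting the update could look something like this (a minimal sketch; `convertOldData()` is a hypothetical entry point, not an existing function):

```js
// Minimal sketch; convertOldData() is a hypothetical entry point for the conversion.
browser.runtime.onInstalled.addListener(details => {
    // 'update' fires when the extension is updated to a new version
    if (details.reason === 'update') {
        convertOldData()
    }
})
```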

There is also the blacklist key in local storage, which conflicts with the new extension. I'll have to think about what to do with that as well. It may be simple to convert over (the new data model for that is just an array of strings), but it will need to be handled somehow on update, as it will throw a TypeError in the UI code: it currently gets loaded into redux when the options page loads.

@poltak (Member) commented Jun 29, 2017

Alright got a little proof-of-concept of the conversion logic working on feature/old-data-conversion. It addresses the "how" rather than the "when"; that's still yet to be decided. The algorithm follows. Note that the index in the original extension's storage contains the keys to other areas of storage holding the page data (so you can do something like: localStorage.get(index[0]) to do a constant-time lookup).

  1. read index (Array<string>) and bookmarkUrls (Array<string>) from local storage into memory
  2. split index into batches of constant size (arbitrarily set to 10 now; to be optimised later)
  3. for each batch of index keys:
    3.1. read all batch-associated page data from local storage into memory
    3.2. map conversion logic over page datas, returning all visit/page/bookmark docs associated with given batch
    3.3. bulk insert all docs from current batch into DB
    3.4. continue to next batch

1 gives a memory overhead of (indexLength * c1) + (numBookmarks * c2), hence linear in the index size (assuming that is larger than the bookmark count; I'd say so in most cases). c1 and c2 are just constants for whatever the average size of an entry in those arrays is (say between 5-30 bytes; it's not really important). This is required.
2 ensures a constant memory overhead (on the page data) by doing it batched, rather than reading in everything from local storage and iterating over it.
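
A sketch of the batched loop in steps 1-3 above, under the same caveats (helper names are hypothetical, not the actual branch code):

```js
// Hypothetical helpers: fetchPageData() and convertPageData() (the latter sketched
// further below); `db` is assumed to be the new extension's PouchDB instance.
async function convertInBatches(batchSize = 10) {
    // 1. read index and bookmarkUrls from local storage into memory
    const { index = [], bookmarkUrls = [] } = await browser.storage.local.get([
        'index',
        'bookmarkUrls',
    ])

    // 2 + 3. process the index keys in constant-size batches
    for (let i = 0; i < index.length; i += batchSize) {
        const batchKeys = index.slice(i, i + batchSize)
        const pageDatas = await fetchPageData(batchKeys) // 3.1 read batch page data
        const docs = await Promise.all(
            pageDatas.map(data => convertPageData(data, bookmarkUrls)), // 3.2 map conversion logic
        )
        await db.bulkDocs([].concat(...docs)) // 3.3 bulk insert all docs from this batch
    }
}
```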

The mapping of conversion logic in 3.2 involves (sketched after the list):

  1. convert old page data to minimal page doc
  2. grab all visits from browser for page's URL and convert to visit docs
  3. create bookmark doc, if page's URL appears in bookmarkUrls
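
Roughly, as a sketch (the doc builders are hypothetical, and the visit lookup here assumes the browser history API):

```js
// Same assumptions as the earlier sketches; createVisitDoc() and createBookmarkDoc()
// are hypothetical doc builders, and visits come from the WebExtensions history API.
async function convertPageData(oldPage, bookmarkUrls) {
    // 1. minimal page doc from the old page data
    const pageDoc = convertOldPageData(oldPage)

    // 2. visit docs from the browser's stored visits for this URL
    const visits = await browser.history.getVisits({ url: oldPage.url })
    const visitDocs = visits.map(visit => createVisitDoc(visit, pageDoc))

    // 3. bookmark doc, only if this URL is marked as a bookmark
    const docs = [pageDoc, ...visitDocs]
    if (bookmarkUrls.includes(oldPage.url)) {
        docs.push(createBookmarkDoc(oldPage, pageDoc))
    }

    return docs
}
```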

So for each page data in old extension, it will output:

  • 1 page doc
  • 0/1 bookmark docs
  • 0+ visit docs

One main assumption in this algorithm:

  • all needed page data is indexed

If it's not, then the only other algorithm I can think of is iterating over everything in local storage, meaning space complexity linear in the number of stored page docs (not good). I think this is a safe assumption to make, as anything not indexed wouldn't be searchable in the old extension, hence not really of a user's concern/important? @arpitgogia maybe you can shed some light on this?

Plan to put some sort of deduping logic in there. @oliversauter seems happy to do simple URL deduping, which seems alright as the index looks ordered (we'll convert and take the latest one in the batching process and ignore the rest at later stages).

One other cool thing is I can set the isStub flag on the converted page docs so they can be scheduled in a later imports process to fill them out more. This may or may not be wanted as the imports process gets what is at a URL now; the old extension data will quite possibly contain different data in text that will be overwritten. Maybe can add in a check to imports later to do something if there is already text?

Blacklist conversion is also done, although that's all very simple and doesn't need much of a discussion.

Note that it's hard to properly test this with real data. At the moment I am resorting to manually copying over local storage data from the old extension into the new one. May spend some time to come up with a data set that I can test with better (even just getting a number of page data from old extension and duplicating them should suffice, before deduping logic is there). Main thing I want to measure is the algorithm's performance on larger sets, as the visit conversion could be slow (same as pre-imports).

@blackforestboi (Member Author)

grab all visits from browser for page's URL and convert to visit docs

If it is the history API you are thinking of, you may not get all the visits from there.
I think it may go hand in hand with the deduping: whenever you find a URL in the old extension data that is already imported, you could just add a visit doc.
Since it is sorted (by time?), it seems like the first occurrence is the most recent version, and all subsequent versions could then just be visits.
Not sure if this deduping process is different from the one used when visiting, as we would not update an existing doc, but only add a new visit.
If we start from the oldest doc and then work in the direction of the newest, we could use the same dedupe logic as when an actual visit is happening.

@poltak (Member) commented Jun 29, 2017

If it is the history API you are thinking of, you may not get all the visits from there.
I think it may go hand in hand with the deduping: whenever you find a URL in the old extension data that is already imported, you could just add a visit doc.

Yes, the history API may or may not contain all visits, but it's another port of call for visits; if they're there, great, if not, no worries. But yes, we actually could just create a barebones visit doc from the old page data as well (since it's almost like a combined visit + page doc, thinking in relation to the new model), especially for dupes. Good point!

Not sure if this deduping process is different from the one used when visiting, as we would not update an existing doc, but only add a new visit.

From what I understand, quite different if we just care about the URLs. We can skip a lot of unneeded complexity that happens in the deduping framework. This is the next step though, so to be thought about more.

So after some sort of deduping is implemented, the output for each page data should be more like:

  • 0/1 page docs
  • 0/1 bookmark docs
  • 1+ visit docs

@blackforestboi (Member Author)

From what I understand, quite different if we just care about the URLs. We can skip a lot of unneeded complexity that happens in the deduping framework.

Where do you see the difference? From what I thought about in #17, we would also just check the URL in the simplest form of deduping in the beginning, so we could combine both here. (It happens only once anyhow for the transfer.)

0/1 bookmark docs

Forgot to ask before: where does the bookmark element come from? Specifically, where is the list/field 'bookmarkUrls' from?

@bohrium272 (Member)

@poltak the index is maintained by visit time in the old extension, if I remember correctly. In any case, when you pull an object from the old ext's storage and insert it into the DB, the insertion operation should ensure the doc (page/visit?) gets indexed.

@poltak (Member) commented Jun 30, 2017

Where do you see the difference?

Because the deduping framework compares things like page text and title; URL is not a concern as it's not fixed. If we want to dedup via URL here, it can be done by a simple query.

where does the bookmark element come from? Specifically, where is the list/field 'bookmarkUrls' from?

In the old extension data model, bookmarkUrls is just a list of URLs in local storage that represent bookmarks in the standard page data (all on different local storage keys, via timestamp). Hence for each page data, when they're being converted, if that page data's url field appears in that list, it's a bookmark and a minimal bookmark doc (new model) can be generated (will have title, url, time as part of _id, and a reference to the generated page doc).

@poltak (Member) commented Jun 30, 2017

Basic URL deduping on the batch-level is done now, along with a minimal visit doc that can be converted from the old ext page data. The algorithm is now:

  1. read index (Array) and bookmarkUrls (Array) from local storage into memory
  2. split index into batches of constant size (arbitrarily set to 10 now; to be optimised later)
  3. for each batch of index keys:
    3.1. read all batch-associated page data from local storage into memory
    3.2. unique them by URL
    3.3. grab page docs with the matching URLs from DB
    3.4. map conversion logic over page datas, returning all visit/page/bookmark docs associated with given batch
    3.5. bulk insert all docs from current batch into DB
    3.6. continue to next batch

Conversion algorithm (3.4) would now look like:

  1. check for match in matching page docs against current page data
  2. convert old page data to minimal page doc unless matching page doc exists (use that instead)
  3. convert page data to minimal visit doc
  4. grab all visits from browser for page's URL and convert to visit docs
  5. create bookmark doc, if page's URL appears in bookmarkUrls

3.3 ensures that the DB query happens on the batch level, once per batch rather than for every page data in the batch. As the map happens asynchronously for every item in the batch, without any enforced order, Pouch can get grumpy when many things try to query it around the same time, as we've seen.
The main purpose of this check is to see, for a given old ext page data, if a page doc with the same URL was converted earlier in the index (in a previous batch). If so, don't make a new page doc, just reuse (for references from the new visit docs, and possible bookmark doc).

3.2 ensures URL uniqueness within any given batch, as 3.3 is more like a previous batch check.
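
A rough sketch of 3.2 and 3.3 under those constraints (assuming pouchdb-find on `db`; field names are illustrative):

```js
// Assumes pouchdb-find is available on `db`; field names are illustrative.
async function dedupeBatch(pageDatas) {
    // 3.2: keep only one page data per URL within this batch
    const byUrl = new Map()
    for (const data of pageDatas) {
        if (!byUrl.has(data.url)) {
            byUrl.set(data.url, data)
        }
    }

    // 3.3: single query per batch for page docs already converted in earlier batches
    const { docs: existingPageDocs } = await db.find({
        selector: { url: { $in: [...byUrl.keys()] } },
    })

    return { uniquePageDatas: [...byUrl.values()], existingPageDocs }
}
```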

A lot of doc formatting logic is brought over from feature/imports for now as it uses the same logic for creating docs, hence it's a bit messy, but that can all go once merged.

Another thing is local storage clean up, I suppose. At the end of the conversion process, it's probably wise to remove all the old data. Although it should only happen if everything went alright.

Tested with some larger data sets today, but there is the problem of local storage being limited to 10MB (or is it 5?), hence I can only test with a few thousand sample page datas. @arpitgogia What did the old extension do in regard to that limit? AFAIK it cannot be exceeded per domain, but I've heard of people having up to 1GB of storage in the old ext.

@oliversauter do you have any opinions about when the data conversion should happen once the extension gets updated? There is the automatic route (listening on the extension update event), which allows it to happen without the user knowing and allows them to search their old data almost straight away. Or it could maybe happen when a user triggers imports (maybe at the pre-imports stage, if old data is detected), but that means when the extension updates the overview will be blank, etc.

@blackforestboi (Member Author)

read index (Array) and bookmarkUrls (Array) from local storage into memory

I wonder what the purpose of the bookmarkURLs array was. Does it have all the data we would also get from the Bookmarks API? Otherwise it might be better to use this array simply to get the missing data from the Bookmarks API. This way we won't have to check each URL, which could be a lot.

do you have any opinions about when the data conversion should happen once the extension gets updated?

I would do it straight away on update. Since this all can take a while (including indexing), we could add a notification into the address bar, in place of the first element, saying "We just made a major update to the WorldBrain extension. It will take some minutes for the update to complete. Learn more >>"

They won't know about the overview anyhow, so for them the mode of entry is only the address bar.

Another thing is local storage clean up, I suppose.

Couldn't we have the deletion process in the batch? Meaning: check that the URL has been transferred and then delete it right away from local storage?

@poltak (Member) commented Jun 30, 2017

I wonder what the purpose of the bookmarkURLs array was.

It seems to be to mark page data as being for a bookmark, as opposed to history. If I install the extension and import bookmarks, I can see that array being filled out.

Does it have all the data we would also get from the Bookmarks API? Otherwise it might be better to use this array simply to get the missing data from the Bookmarks API

Yes, all the same data you get from the bookmarks API apart from the in-browser bookmark ID. That is left out of the converted minimal visit doc as well (because it's not available). The bookmarks API doesn't seem to have a way of querying bookmarks on non-ID attributes anyway, so the only way to use that at this stage would mirror how it's being used in imports, hence kinda redundant. So we either continue to create bookmark docs here, or we ignore them and let the user sort that out in imports.

we could add a notification into the address bar, instead of the first element

Great idea! I'll find out how to do that.

Couldn't we have the deletion process in the batch?

Yes much better to have it in there at the end of each batch.
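
Something like this, reusing the batch loop's `batchKeys` from the earlier sketch:

```js
// At the end of each batch, once its docs are confirmed inserted:
await browser.storage.local.remove(batchKeys) // drop the converted keys from old local storage
```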

@poltak (Member) commented Jul 2, 2017

Alright, seems to be working quite well now. Added in some little cleanup bits, user notification with progress count (here is what that looks like) and all seems to be working great with my test data. Would be nice to actually test it in a real environment (i.e., installing the old extension, importing data, then triggering an extension update to our current version), but I'm not exactly sure how that could be done. I'll have a think, as without seeing it work automatically, it's a bit scary to have such an important process without confirmation that it will run without problems. If all our old users auto update and it doesn't run, well that won't be very fun.

If we have a blog post or something explaining the update later, I should be able to add that into the notification so it opens when you click it.

And of course we should revisit this just before release, to see if there are any obvious problems that have been missed.

@blackforestboi (Member Author)

I have an idea how to test it in a real environment:

  • Load the old extension as a temporary extension
  • Import data
  • Replace the files with the new extension files
  • Run the command to start the import in the command line (as a "new install" cannot be triggered this way)

@blackforestboi (Member Author) commented Jul 2, 2017

So I just tried that out, and it seems to trigger the load even without running a command, just by updating and reloading the code. The notification pops up, but it throws a bunch of errors and does not really progress:

```
Uncaught (in promise) Object { message: "QUOTA_BYTES quota exceeded" }

background.js:146386 Uncaught (in promise) SyntaxError: Unexpected token o in JSON at position 1
    at JSON.parse (<anonymous>)
    at _callee$ (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:146386:81)
    at tryCatch (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124289:40)
    at Generator.invoke [as _invoke] (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124527:22)
    at Generator.prototype.(anonymous function) [as next] (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124341:21)
    at step (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:10145:30)
    at chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:10156:13
    at <anonymous>

index.js:58 Uncaught (in promise) SyntaxError: Unexpected token o in JSON at position 1
    at JSON.parse (<anonymous>)
    at _callee$ (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:146386:81)
    at tryCatch (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124289:40)
    at Generator.invoke [as _invoke] (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124527:22)
    at Generator.prototype.(anonymous function) [as next] (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:124341:21)
    at step (chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:10145:30)
    at chrome-extension://kiebiimobhokoejjlgnbblfaimfbhaai/background.js:10156:13
    at <anonymous>
```

@blackforestboi (Member Author)

Oh, and this conversion is triggered every time I "reload" the extension. Maybe we need a flag so it is not triggered anymore once the thing is complete.
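
Something like this could guard against re-runs (a sketch; the flag key name is made up for illustration):

```js
// Sketch; the 'oldDataConverted' key name is made up for illustration.
async function maybeRunConversion() {
    const { oldDataConverted } = await browser.storage.local.get('oldDataConverted')

    if (oldDataConverted) {
        return // conversion already completed on an earlier run
    }

    await convertOldData() // hypothetical conversion entry point
    await browser.storage.local.set({ oldDataConverted: true })
}
```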

@poltak (Member) commented Jul 3, 2017

I have an idea how to test it in a real environment:

Bit hacky but seems to work for the purpose of testing with the exact data from the old extension :) Found two minor issues with assumptions about the local storage state (index was doubly nested, bookmark array slightly different shape, somehow), which have been fixed up and it seems to work great being invoked manually with the old extension's data. Now hopefully that onInstalled extension update event actually triggers in the background... Would still like to test that before release.

Those errors you encountered aren't related to this, but still very important overall as we would have ignored them otherwise. They seem to be with the blacklist in local storage: old ext and new ext share the same storage key, but they differ in shape. New ext will automatically try to parse that value from local storage on opening of options page + any page visit, and assumes the shape conforms to the new model. Good thing is, assuming that the conversion runs on update as designed, the blacklist should get converted almost instantly (it's a very simple operation compared to converting the page data).

But there is also the "QUOTA_BYTES quota exceeded" I can see in your error. This is thrown when local storage exceeds its 5/10MB limit. This goes back to the question of how the old extension can store so much data? I can see the old extension has the unlimitedStorage permission in the manifest, which the new ext doesn't have, however the docs for this permission state that it's only for "Web SQL Database" or "application cache". @arpitgogia Do you know anything about this, or if the old ext uses these features? It's really important to have the new extension follow the same data permissions so this error doesn't get thrown and result in possible data loss in the update.

I'll try to leave the old extension running for a bit today to import part of my history, and see how the conversion process runs with more data (and how it handles the storage limit; whether it removes the data or just leaves it and throws an error).

@blackforestboi (Member Author)

Do you know anything about this, or if the old ext uses these features? It's really important to have the new extension follow the same data permissions so this error doesn't get thrown and result in possible data loss in the update.

AFAIK localStorage is a Web SQL DB, so this attribute is important in the old extension to make the DB in localStorage possible.
Whenever we update the extension data (including the manifest) without adding this unlimitedStorage attribute, it will likely throw that error. I would add the attribute to the manifest file just in case.

@poltak (Member) commented Jul 3, 2017

No, local storage isn't a Web SQL DB. However, I found the answer. Apparently the restrictions on standard Window.localStorage are different from the web extension API's browser.storage.local. Putting the unlimitedStorage permission in the manifest will allow us to have unlimited local storage via that API: https://developer.chrome.com/extensions/storage#property-local

So yes, will have to add that.
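
For reference, the relevant manifest entry would look something like this (other fields omitted):

```json
{
  "permissions": ["storage", "unlimitedStorage"]
}
```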

@blackforestboi (Member Author)

No local storage isn't a Web SQL DB

Oh, I thought that because the content data is stored in SQLite on Chrome. Okidoki :)

@poltak (Member) commented Jul 6, 2017

@oliversauter Yes, that reload button does trigger the onInstalled event to simulate an update, which is nice for testing! :) Good find

Set the unlimitedStorage permission, as detailed in the last comment, to match the old ext. Now that imports is merged in, I was able to clean up a bit of the shared logic too.

Testing by importing data on the old ext, manually replacing the files with the new ext build, then triggering the onInstalled event via reload to automatically invoke the conversion seems to work great. Still only able to test a small amount of data, as the imports process on the old ext isn't working very well for me, so I think I'm going to leave this running in my browser all day and see how it fares with more data later. Nothing expected to change, just more interested in seeing how the time it takes changes (should be linear) and verifying constant memory overhead.

@poltak (Member) commented Jul 7, 2017

Really slow after trying yesterday with more data I got throughout the day. Had to convert 6440 old ext page data and it took ~10mins to produce 35,079 docs in the new model (mostly visits). Didn't like this at all. Did a bit of a performance profile and narrowed it down to the URL deduping logic (DB query on page docs with URL field). This was an unindexed query, so obviously it was going slower and slower as the conversion process advanced (it should be linear, and a lot of the time worst case - no match found).

Put a temp index on page doc URLs for the duration of the conversion process, which in theory allows these deduping queries to be done in logarithmic time (log N, where N is the number of page docs already converted). Also, we cannot do a single query per batch, as it needs an $in: [] operator, which isn't indexable. Replaced this with a log N time query per page.
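
Roughly, with pouchdb-find (a sketch; index and field naming are assumptions):

```js
// Sketch via pouchdb-find; index and field naming are assumptions.
const setupUrlIndex = () => db.createIndex({ index: { fields: ['url'] } }) // temp index for the conversion

// Per-page dedupe lookup: roughly log N once the index is in place
async function findExistingPageDoc(url) {
    const { docs } = await db.find({ selector: { url } })
    return docs.length ? docs[0] : undefined
}
```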

Testing with 2k old ext page data this morning, it was able to finish in 11 secs, which is great! Going to test later today with much more data again and hopefully see it not noticeably slowing down for later batches.

@poltak (Member) commented Jul 7, 2017

Yep, tested today with 6.5k and it took ~33secs. Again with 2.5k tonight: ~18s. Working great with a concurrency of 15 pages. Good stuff. The output looks fine. Leaving them not marked as stubs for now (so they won't be scheduled for imports), but will revisit this before release as it might be nice to schedule them for more filling out (metadata, favicon, etc.), but we need to make sure imports won't overwrite the old page text and title. To be merged after a self-walkthrough of the code.

@poltak poltak closed this as completed Jul 8, 2017
@blackforestboi blackforestboi changed the title Import data from old extension to new DB and data model MTNI-267 ⁃ Import data from old extension to new DB and data model Apr 19, 2018