config.json and log.json can get out of sync, preventing startup of the credential manager #14

jchartrand · 2023-07-31T17:17:07Z

If I pass a badly formed VC to the allocateStatus method, it fails, but it leaves the repos in a state that produces an error when I try to restart. I think the problem is that when allocating a status, the manager first updates the config (in particular, it increments the number of credentials issued) and saves this back to GitHub:

credential-status-manager-git/src/credential-status-manager-base.ts

Lines 215 to 218 in 076921b

    
    // update status config data
    configData.credentialsIssued = credentialsIssued;
    configData.latestList = latestList;
    await this.updateConfigData(configData);

Then, when it goes on to issue the credential, that step fails, so the credential never gets issued and the log doesn't get updated. That leaves the number of credentials issued in the config ('credentialsIssued') one greater than the number of credentials in the log, which I think is what later causes the error when restarting, specifically because of this check:

credential-status-manager-git/src/credential-status-manager-base.ts

Line 530 in 076921b

const hasProperLogEntries = credentialIdsUnique.length === credentialsIssued;

Fundamentally I think the problem is that the two operations (updateConfig and updateLog) aren't atomic and so the log and config can get out of whack. I could imagine this causing other problems too.

If 'credentialsIssued' in the config is only used to know when the current list is full, so that a new list needs to be created, then maybe it would be better to just calculate the number of credentials issued from the log?
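To illustrate the suggestion above, here is a minimal sketch of deriving the count from the log rather than storing it. The `LogEntry` shape, the `statusListId` field, and both function names are assumptions made for illustration; the real log schema in this library may differ.

```typescript
// Hypothetical log entry shape; the real log file schema may differ.
interface LogEntry {
  credentialId: string;
  statusListId: string;
}

// Count the distinct credentials recorded in the log for one status list.
// Using a Set means repeated entries for the same credential count once.
function countIssued(log: LogEntry[], statusListId: string): number {
  const ids = new Set<string>();
  for (const entry of log) {
    if (entry.statusListId === statusListId) {
      ids.add(entry.credentialId);
    }
  }
  return ids.size;
}

// A list is considered full once the derived count reaches its capacity.
function isListFull(log: LogEntry[], statusListId: string, capacity: number): boolean {
  return countIssued(log, statusListId) >= capacity;
}
```

With this approach there is no separate counter to drift away from the log, at the cost of reading the log whenever the count is needed.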

kezike · 2023-08-04T18:34:55Z

At the moment, I am leaning toward adding a credential validation check early in allocateStatus. If that fails then neither the config nor the log file would be updated.
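A rough sketch of what that early check could look like, assuming a simple structural test rather than full VC validation. `validateCredential`, `allocateStatusSketch`, and the in-memory `State` are hypothetical stand-ins, not the library's actual code, and requiring a string `id` is an assumption of this sketch.

```typescript
// In-memory stand-in for the config counter and the log (hypothetical).
interface State {
  credentialsIssued: number;
  log: string[];
}

// Structural check only; this is not a full Verifiable Credential validation.
function validateCredential(credential: unknown): string[] {
  if (typeof credential !== 'object' || credential === null || Array.isArray(credential)) {
    return ['credential must be a JSON object'];
  }
  const vc = credential as Record<string, unknown>;
  const errors: string[] = [];
  if (vc['@context'] === undefined) errors.push('missing @context');
  const types: unknown[] = Array.isArray(vc.type) ? vc.type : [vc.type];
  if (!types.includes('VerifiableCredential')) errors.push('type must include VerifiableCredential');
  if (vc.issuer === undefined) errors.push('missing issuer');
  // Assumption: the sketch needs a string id to key the log entry.
  if (typeof vc.id !== 'string') errors.push('missing id');
  return errors;
}

// Validate before touching any state, so a bad input changes nothing.
function allocateStatusSketch(state: State, credential: unknown): void {
  const errors = validateCredential(credential);
  if (errors.length > 0) {
    throw new Error(`invalid credential: ${errors.join('; ')}`);
  }
  const vc = credential as Record<string, unknown>;
  state.credentialsIssued += 1;
  state.log.push(vc.id as string);
}
```

Because the check runs first, a malformed credential is rejected before the counter or the log is modified.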

With this solution, there is still a valid argument to be made regarding consistency and fault tolerance in the face of network failure. Many of the operations in this library involve updating files both in the same repo and in separate repos (i.e., status credential in the main repo and config or log file in the metadata repo). Considering that this library will likely be used in large scale course distributions, such as Harvard's CS50, we may want to consider a mechanism that implements transaction rollback when one operation of a logical transaction is interrupted.

@dmitrizagidulin your thoughts are welcome here.

kezike · 2023-08-07T16:31:19Z

I have done a bit more thinking on this over the last couple of days and I think I have a solution that is an extension of the locking/serialization work that I have already done with this library.

The main update would be to check core invariants in the relationship between the three status documents (status credential, log file, config file) prior to releasing the lock in transactional functions. If any of the invariants fail, restore the value of these three documents, release the lock to allow other processes a chance to execute, and try again. The main code delta would involve storing the current values of these documents in memory at the beginning of all transactional functions and restoring/retrying as needed.
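As a minimal sketch of the snapshot, check, restore, and retry idea described above: the code below works on in-memory documents, is synchronous for brevity, and omits locking entirely. `StatusDocs`, `invariantsHold`, and `runTransaction` are hypothetical names, and the only invariant shown is the log/config count check quoted earlier in this issue.

```typescript
// Hypothetical in-memory view of the three status documents.
interface StatusDocs {
  statusCredential: { encodedList: string };
  config: { credentialsIssued: number };
  log: { credentialId: string }[];
}

// One example invariant: unique credential IDs in the log match the config counter.
function invariantsHold(docs: StatusDocs): boolean {
  const uniqueIds = new Set(docs.log.map((entry) => entry.credentialId));
  return uniqueIds.size === docs.config.credentialsIssued;
}

// Run an operation; if it throws or breaks an invariant, restore and retry.
function runTransaction(
  store: { docs: StatusDocs },
  operation: (docs: StatusDocs) => void,
  maxAttempts = 3
): void {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Deep copy of the documents taken before the operation starts.
    const snapshot: StatusDocs = JSON.parse(JSON.stringify(store.docs));
    try {
      operation(store.docs);
      if (invariantsHold(store.docs)) {
        return;
      }
    } catch (error) {
      // A failed step is handled like a failed invariant: restore and retry.
    }
    store.docs = snapshot;
    // In the real design, the lock would be released here before retrying.
  }
  throw new Error(`transaction failed after ${maxAttempts} attempts`);
}
```

Note that the snapshot here lives in process memory, so it would not survive a crash of the machine running the code.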

@jchartrand @dmitrizagidulin thoughts on this approach?

jchartrand · 2023-08-07T16:48:10Z

One potential problem might still be that if a write is made to one of the files (e.g., config) and then the network cuts out (e.g., GitHub goes down) before you can also write to the other files (e.g., log), and the network stays down for a while, then you might not be able to revert the write to the config. Worse might be that you write to the config file and then the machine on which the code is running crashes before you can write to the log, so again you can't revert the write to the config.

Or is your lock actually in the GitHub repo itself, so that you can fix things up on restart? Would another option be to combine the config file and log file into a single file, so you'd have a single atomic write?

If the revocation list itself (the bit vector) gets out of whack, that seems a little less important because at least it won't prevent the whole system from continuing to run. At worst it might mean that a position didn't get revoked when it should have.
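For the single-file option raised above, a minimal sketch might look like the following. The `StatusData` shape and the function names are assumptions for illustration; the point is only that the counter and the log entry change together in one document, which is then written once.

```typescript
// Hypothetical combined document holding what config and log held separately.
interface StatusData {
  latestList: string;
  credentialsIssued: number;
  log: { credentialId: string }[];
}

// Produce the next document with counter and log updated together.
// The input is not mutated, so a failed write leaves the old value intact.
function recordIssuance(data: StatusData, credentialId: string): StatusData {
  return {
    ...data,
    credentialsIssued: data.credentialsIssued + 1,
    log: [...data.log, { credentialId }],
  };
}

// One serialized payload means one file write instead of two.
function serialize(data: StatusData): string {
  return JSON.stringify(data, null, 2);
}
```

A caller would compute the next document with `recordIssuance` and write the output of `serialize` in a single request, so there is no window in which one file is updated and the other is not.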


kezike · 2023-08-10T20:19:53Z

Resolution: we will apply my proposal from above, plus a modification that addresses @jchartrand's concerns: instead of temporarily saving the repo data locally, we will temporarily save it in the metadata repo, such that a disrupted client service that finds this data on restart understands that it needs to restore the repo data to this previous state. Additionally, we will combine the config and log files to prevent them from getting out of sync, per @jchartrand's recommendation.
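A rough sketch of the recovery flow described in this resolution, using a `Map` as a stand-in for the metadata repo. The file paths and function names are hypothetical, and the real implementation (see the linked pull requests) may differ in structure and naming.

```typescript
// In-memory stand-in for the metadata repo: file path -> file content.
type Repo = Map<string, string>;

// Hypothetical file names, chosen for illustration only.
const DATA_PATH = 'status-data.json';
const SNAPSHOT_PATH = 'snapshot.json';

// Save the current data in the repo itself before starting a transaction.
function beginTransaction(repo: Repo): void {
  const current = repo.get(DATA_PATH);
  if (current === undefined) {
    throw new Error(`${DATA_PATH} not found`);
  }
  repo.set(SNAPSHOT_PATH, current);
}

// A finished transaction removes the snapshot.
function commitTransaction(repo: Repo): void {
  repo.delete(SNAPSHOT_PATH);
}

// On startup, a leftover snapshot means a transaction was interrupted:
// restore the previous state and clear the snapshot.
function recoverOnStartup(repo: Repo): boolean {
  const snapshot = repo.get(SNAPSHOT_PATH);
  if (snapshot === undefined) {
    return false;
  }
  repo.set(DATA_PATH, snapshot);
  repo.delete(SNAPSHOT_PATH);
  return true;
}
```

Because the snapshot is stored alongside the data rather than in process memory, a service that crashes mid-transaction can still detect and undo the partial update when it restarts.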

alexfigtree · 2024-02-21T15:21:03Z

Tested.

alexfigtree · 2024-02-21T20:23:32Z

Keeping open until after deployment

alexfigtree · 2024-06-21T21:04:20Z

Deployed to both Google Play and App Store (release 2.1.0-build80), closing ticket.

kayaelle added this to DCC Engineering Jul 31, 2023

kezike self-assigned this Aug 4, 2023

kayaelle moved this to Backlog in DCC Engineering Aug 9, 2023

kezike moved this from Backlog to In Progress in DCC Engineering Aug 16, 2023

This was referenced Aug 25, 2023

Add fault recovery mechanism #18

Merged

Merge config and log files #17

Merged

kayaelle moved this from In Progress to To Do (Current sprint) in DCC Engineering Nov 1, 2023

alexfigtree moved this from To Do (Current sprint) to Release Ready in DCC Engineering Feb 21, 2024

alexfigtree closed this as completed Feb 21, 2024

alexfigtree reopened this Feb 21, 2024

alexfigtree moved this from Release Ready to Done (Deployed) in DCC Engineering Jun 21, 2024

alexfigtree closed this as completed Jun 21, 2024
