Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

".registry.bad" files are created for actually valid registry snapshots #1581

Closed
mwallner opened this issue May 28, 2018 · 20 comments
Closed

Comments

@mwallner
Copy link
Member

mwallner commented May 28, 2018

I got the warning that this registry snapshot is broken:

<?xml version="1.0" encoding="utf-8"?>
<registrySnapshot xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <user>S-1-5-21-2087680783-807360899-2087665911-3127</user>
  <keys>
    <key installerType="Unknown" displayName="GPL Ghostscript" displayVersion="9.19">
      <RegistryView>Registry32</RegistryView>
      <KeyPath>HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\GPL Ghostscript 9.19</KeyPath>
      <DefaultValue />
      <InstallLocation><![CDATA[]]></InstallLocation>
      <UninstallString><![CDATA["C:\Program Files (x86)\gs\gs9.19\uninstgs.exe"]]></UninstallString>
      <HasQuietUninstall>false</HasQuietUninstall>
      <Publisher><![CDATA[Artifex Software Inc.]]></Publisher>
      <InstallDate />
      <InstallSource><![CDATA[]]></InstallSource>
      <Language />
      <Version />
      <VersionMajor />
      <VersionMinor />
      <SystemComponent>false</SystemComponent>
      <WindowsInstaller>false</WindowsInstaller>
      <NoRemove>false</NoRemove>
      <NoModify>true</NoModify>
      <NoRepair>true</NoRepair>
      <ReleaseType />
      <ParentKeyName />
    </key>
  </keys>
</registrySnapshot>

After renaming the file from .registry.bad to .registry and re-ran choco info the emmitted warning disappeared. (which would suggest the file is actually ok)

Related to #1505 and #1540

@mwallner
Copy link
Member Author

mwallner commented May 28, 2018

thinking of it this might actually be related to this: #1264 (edit: to be more exact this: GH-1258)

  • I've been having problems with corrupted chocolatey.config files in the past, because choco.exe is run quite often (and hence sometimes simultaneously) in my environment

@mwallner
Copy link
Member Author

extends GH-1580

@mwallner
Copy link
Member Author

until this is fixed, following snipped may help fixing those invalid created files:

gci C:\ProgramData\chocolatey\.chocolatey\ -Recurse ".registry.bad" | % { Move-Item $_.FullName $(Join-Path $_.Directory.FullName ".registry") }

choco info

@bcurran3
Copy link

bcurran3 commented May 28, 2018

`I've been having problems with corrupted chocolatey.config files in the past, because choco.exe is run quite often (and hence sometimes simultaneously) in my environment

This is one of the reasons why I created chocolatey-preinstaller-checks.extension which is currently on forced hold and being needlessly debated.

@bcurran3
Copy link

bcurran3 commented May 30, 2018

Just another comment, I brought up this issue and started experimenting/triaging it as well, unfortunately the conversation got no results.

https://groups.google.com/forum/#!topic/chocolatey/IaQ7_vLX-3s

The follow snippet was the weirdest of the behaviors:

I had this problem on 2 or 3 of my servers for the last month or two and finally decided to try to clean it up. Uninstalling the package would make the error go away, but reinstalling it fresh brought the error back. So I went and manually deleted the related directory and then re-installed the packages and all is well.

@mwallner
Copy link
Member Author

@bcurran3 - I'm really thinking that there could be a synchronization issue going on here, the thing is .. I can't really get it to reproduce by 'forcing it' - with the corrupt chocolatey.config it's been easy - spin up 1000x "choco list -o" simultaneously (via PowerShell Jobs) - BOOM broken config file.

  • that's been fixed by introducing the mutex on reading/writing the config file

But this (forcing it into a wrong behavior) doesn't seem to be possible with the .registry.bad files...

@bcurran3
Copy link

bcurran3 commented Jul 28, 2018

@mwallner

Luckily this problem has gone away for me. I'm not sure why. I'd like to say https://chocolatey.org/packages/chocolatey-preinstaller-checks.extension (BTW: there's a new configurable 0.0.2 up on my GitHub repo) fixed the problem for me, but I have no proof. I do know my Chocolatey Pro license expired 05/25/2018. So that's an obvious change in my environment. I could hypothesize that it's related, but have no way to prove/test/etc without buying a new Pro license. So it's just a theory.

Out of curiosity are you running a paid version of Chocolatey? I see the Customer tag on this issue, I'm assuming you are. We might be narrowing the problem down...

EDIT: v0.0.2.1 is current: https://chocolatey.org/packages/chocolatey-preinstaller-checks.extension

@mwallner
Copy link
Member Author

Hey @bcurran3 - I'm still experiencing this issue, but I'm currently "evading" by running the PowerShell 1-liner that I've suggested above.

Yes, I'm using Chocolatey 4 Business, so there may be a connection to that.
@gep13 - I think that's actually a good point we haven't been talking about yet..

@Mikaela
Copy link

Mikaela commented Aug 3, 2018

I am sometimes seeing this issue and getting it fixed with the oneliner (#1581 (comment)) and I am using the open source edition judging by the pricing page and not having paid anything.

@mwallner
Copy link
Member Author

mwallner commented Aug 3, 2018

uh, thanks @Mikaela - than this is clearly not caused by utilizing licensed choco features.

@Mikaela
Copy link

Mikaela commented Aug 3, 2018

You're welcome. I should probably add some additional information in case it turns out to be relevant.

I am using Windows 10 Insider Slow ring, but have seen this on family PC running stable branch.

I also have https://chocolatey.org/packages/choco-upgrade-all-at-startup installed as I don't trust my family to run Chocolatey or GUI and I guess it's possible that it is attempting to run simultaneously when I run Choco by hand as I have funny habit of checking updates with package managers as soon as I boot machines.

@bcurran3
Copy link

bcurran3 commented Aug 3, 2018

If you're running a choco command while a choco upgrade all is going on....you're bound to have problems. It's not supported and well documented that you'll have problems. It's on the Chocolatey 1.0 milestone list to be fixed.

Make sure you read my summary of known issues at:
https://chocolatey.org/packages/chocolatey-preinstaller-checks.extension/0.0.1

The first link in the thread above will default to the -pre release. The 0.0.1 release (still being ignored and in moderation purgatory) has a MUCH more detailed description of the problems and links to the GitHub choco issues.

I'd recommend you try the 0.0.2 pre-release (same but configurable) from my GitHub rep:
https://github.com/bcurran3/ChocolateyPackages/blob/master/chocolatey-preinstaller-checks.extension/chocolatey-preinstaller-checks.extension.0.0.2-pre01.nupkg
At the very least you can set it to WARN you when choco.exe is already running.

FYI: You'll have to manually install it. (Download .nupkg, cinst name of .nupkg)

EDIT: v0.0.2.1 is current: https://chocolatey.org/packages/chocolatey-preinstaller-checks.extension

@ferventcoder
Copy link
Member

So I think I know what may be going on here, but I'm not sure and want to add some logging for 0.10.12 release to see if it brings the issue to light.

@ferventcoder
Copy link
Member

ferventcoder commented Mar 13, 2019

So here is my theory,

There is a try that decides if the xml file should be renamed to .bad done at

packageInformation.RegistrySnapshot = _registryService.read_from_file(_fileSystem.combine_paths(pkgStorePath, REGISTRY_SNAPSHOT_FILE));

Once it fails attempting to read the file, it renames it to .bad. This is at

catch (Exception)
{
FaultTolerance.try_catch_with_logging_exception(
() =>
{
if (_config.RegularOutput) this.Log().Warn(deserializationErrorMessage);
// rename the bad registry file so that it isn't processed again
_fileSystem.move_file(_fileSystem.combine_paths(pkgStorePath, REGISTRY_SNAPSHOT_FILE), _fileSystem.combine_paths(pkgStorePath, REGISTRY_SNAPSHOT_BAD_FILE));
},
"Unable to read registry snapshot file for {0} (located at {1})".format_with(package.Id, _fileSystem.combine_paths(pkgStorePath, REGISTRY_SNAPSHOT_FILE)),
throwError: false,
logWarningInsteadOfError: true,
isSilent: true
);
}

But as you dig into the code, you realize there are several factors that could cause an error with a perfectly valid .registry file. For one, it only attempts to deserialize the file one time.

public Registry read_from_file(string filePath)
{
if (!_fileSystem.file_exists(filePath))
{
return null;
}
return _xmlService.deserialize<Registry>(filePath, 1);
}

public XmlType deserialize<XmlType>(string xmlFilePath, int retryCount)

What kinds of errors can we see from that code?

  • TimeoutException while waiting on the file to come out of the mutex.
  • InvalidOperationException in the XML deserialization
  • Anything surrounding Exception

So if there is a lock on the file that doesn't let go in time, boom, renamed to bad registry file.

To start with, you can search your logs for when you get the warning about the registry and see if right before that the mutex threw a timeout. Something like:
"Maximum tries of 1 reached. Throwing error"
"Error deserializing response of type Registry:
Timeout waiting for exclusive access to value."
"A corrupt .registry file exists at ..."

I'm going to add some more logging around exactly when we decide to move the file and log the error so it's more explicit into why this is done.

Probably the right answer here is to look at using GlobalMutex with a guid per file we are accessing, so things can move faster and synchronize across faster. This is trickier as we need to be able to uniquely set a GUID per file AND the same GUID per file each time.

ferventcoder added a commit to ferventcoder/choco that referenced this issue Mar 13, 2019
Add reference to documentation on Escape, remove variable, and fix
possible null reference exception with a conversion "to_string()" after
the Escape method.
ferventcoder added a commit to ferventcoder/choco that referenced this issue Mar 13, 2019
Previously, the attempt to deserialize a file was only run once. This
should be attempted at least twice to ensure that it can be a bit more
robust in the check.
ferventcoder added a commit to ferventcoder/choco that referenced this issue Mar 13, 2019
Previously, when a .registry file was moved to .registry.bad, there
wasn't an indication of what all might have caused it to be set that
way. Provide some logging based on exactly what causes it, and provide
the entire error output instead of just the message for now.
ferventcoder added a commit that referenced this issue Mar 13, 2019
* ticket/1581/registry_snapshots:
  (maint) formatting
  (GH-1581) Log error that caused .registry.bad file
  (maint)(log) update message about number of tries
  (GH-1581) try deserialization at least twice
  (GH-1581) Fix possible null exception
ferventcoder added a commit that referenced this issue Mar 13, 2019
* stable:
  (maint) formatting
  (GH-1581) Log error that caused .registry.bad file
  (maint)(log) update message about number of tries
  (GH-1581) try deserialization at least twice
  (GH-1581) Fix possible null exception
@ferventcoder
Copy link
Member

So we've added more logging and tolerance around this for 0.10.12.

@bcurran3
Copy link

If not already done, please check to see if the .registry file is already open during processing.

@ferventcoder ferventcoder modified the milestones: 0.10.12, 0.10.13, 0.10.14 Mar 14, 2019
@ferventcoder
Copy link
Member

@bcurran3 that's already done.

@ferventcoder
Copy link
Member

ferventcoder commented May 20, 2019

@mwallner has this happened since we put in the fix for 0.10.12?

@ferventcoder
Copy link
Member

I haven't heard any grumbles of this occurring anywhere else, so the fault tolerance change up for multiple tries probably fixed the issue. Marking as resolved for 0.10.14, but will note in release notes this actually went in at 0.10.12.

@mwallner
Copy link
Member Author

mwallner commented Dec 4, 2019

@mwallner has this happened since we put in the fix for 0.10.12?

not seeing this issue anymore (yet we're already on 0.10.15 ...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants