Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(5xx): should return 4xx when file not found #1291

Merged

Conversation

kishore03109
Copy link
Contributor

@kishore03109 kishore03109 commented Apr 11, 2024

Problem

Described in linear:
missing media files in pages leads to a 5xx when the resource name itself cannot be processed

as an example, consider a link (some_link)[/%2fa/b] -> this resolves to //a/b, which is not processable by our backend as it results in an error like so: Error when getting Git blob hash: Error: HEAD:images/technews//tvws-part2.JPG and in turn causes a 5xx.

this 5xx is caused because GitFileService (iirc) will try to determine the blob hash (see if can find the file on disk) and in doing so, it will get an error (as the file doesn't exist), leading to the 500.

see here for an example where this occurs

in order to reproduce this:

  • go to any site with an image
  • create or click on an existing markdown page
  • in the page, add an image
  • open network console
  • update the link to be /images/%2f
  • there should be a 5xx in the network console.

Additionally, we also see a fatal: path '.git' does not exist in 'HEAD' in the logs every time the workspace is refreshed.
This is because we rely on fileStats to get a list of directories, but forget to remove special directories such as .git.

Solution

Reason for this is that git under the hood does a file path normalision for checking if the file exists in local disk, but does not when checking when it exists in head.

consider these three examples:

Screenshot 2024-04-11 at 12 58 41 PM

We use fileStats to see if a file exists before passing it to GetGitBlobHash. issue with this is that while the file could exist in disk and still return a correct folder stats, this should still return a file not found error as the path given by the user is still incorrect. (am opting for this rather than normalising the path in the function GetGitBlobHash to keep the implementation of GetGitBlobHash lean and not think about potential path traversal attacks.)

for the special files, we simply do a filter to ignore files like .git. We know that the class of errors with the strings exists on disk and Error when getting Git blob hash mostly comes from the .git folder by looking at dd logs here.

ie all instances of exists on disk and Error when getting Git blob hash errors in prod for the past two weeks came from trying to access the .git folders.

Breaking Changes

  • Yes - this PR contains breaking changes
    • Details ...
  • No - this PR is backwards compatible with ALL of the following feature flags in this doc

Before & After Screenshots

BEFORE:
Screenshot 2024-04-11 at 12 51 22 PM

AFTER:
Screenshot 2024-04-11 at 12 52 17 PM

Tests

in order to reproduce this:

  • go to any site with an image
  • create or click on an existing markdown page
  • in the page, add an image
  • open network console
  • update the link to be /images/%2f
  • there should be a 4xx in the network console.
  • go into datadog, there should not be any instance of fatal: path '.git' exists on disk, but not in 'HEAD'

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @kishore03109 and the rest of your teammates on Graphite Graphite

@kishore03109 kishore03109 marked this pull request as ready for review April 11, 2024 05:14
@kishore03109 kishore03109 requested a review from a team April 11, 2024 05:14
@kishore03109 kishore03109 force-pushed the 04-11-fix_5xx_should_return_4xx_when_file_not_found branch from 7679174 to 4a2610a Compare April 11, 2024 05:15
@@ -333,6 +333,12 @@ export default class GitFileSystemService {
)

if (error instanceof GitError) {
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you know why the request even contains things like //filename?

are these actually broken links that cms tries to resolve?

Copy link
Contributor Author

@kishore03109 kishore03109 Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you know why the request even contains things like //filename?

we have a markdown editor, so there are possibility for typo + the logs suggest this has occurred at least once.
we dont debounce too, to this could occur while the user is typing the correct string.

are these actually broken links that cms tries to resolve?

yes, intuitively should be able to resolve them gracefully by 4xx rather than 5xx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add comments for this check here so that we don't lose context of when this error statements occur?

@kishore03109 kishore03109 requested a review from harishv7 April 12, 2024 00:32
Copy link
Contributor

@seaerchin seaerchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also have tests out for this PR since it's fixing a bug and we want to avoid a regression?

Comment on lines +337 to +340
error.message.includes("fatal: path") &&
error.message.includes("exists on disk, but not in 'HEAD'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker for now, because we can raise a PR to improve this + bug is fixed here but i wonder if error.message is a robust enough property for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like best effort, since this is directly reliant on the error message that git returns us
do have an alternative in mind?

@@ -1121,7 +1127,11 @@ export default class GitFileSystemService {
)
)
.andThen((directoryContents) => {
const resultAsyncs = directoryContents.map((directoryItem) => {
const IGNORED_FILES = [".git"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the pattern we might wanna aim for here is to respect the filesystem's convention! in here, this means that hidden files/folders (prefixed by a `.) should be ignored rather than a specific instance.

wwdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i rather keep this as .git since this actually is not the file system convention, but rather the .git system's convention. since they dont have a direct mapping, opting for the stricter check here.

@kishore03109 kishore03109 force-pushed the 04-11-fix_5xx_should_return_4xx_when_file_not_found branch from 4a2610a to 6b1f857 Compare April 16, 2024 01:13
@kishore03109 kishore03109 merged commit 9f10ec1 into develop Apr 16, 2024
12 checks passed
@mergify mergify bot deleted the 04-11-fix_5xx_should_return_4xx_when_file_not_found branch April 16, 2024 01:19
@dcshzj dcshzj mentioned this pull request Apr 17, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants