
Repository indexer clogs with file with multi-byte character sets #7809

Closed

guillep2k opened this issue Aug 10, 2019 · 5 comments

guillep2k commented Aug 10, 2019

  • Gitea version (or commit ref): release/v1.9
  • Git version: 2.22.0
  • Operating system: Linux - CentOS 7
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

When using the repository indexer, files with multi-byte character sets don't get indexed correctly. This happens when characters look like valid utf-8 code points but are not. Once a bad sequence is encountered, the rest of the file is indexed as a single token; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer gets the first half of the file OK, and the rest as one "word" which is 50KB long (and certainly not searchable).

To reproduce this issue, files with the following content can be tested using utf-8 and Latin1 character sets:

sailorvenus
áéíóú
sailormoon

Note: to test this properly, the files must be committed through git, not Gitea's web interface.

Searching for sailorvenus returns results, as it is the first word. In the Latin1-encoded file the rest of the context is garbled:
(screenshot: search results for sailorvenus, showing garbled context for the Latin1-encoded file)

Searching for sailormoon returns no results from the Latin1-encoded file, as the indexing of the rest of that file is garbled:
(screenshot: search results for sailormoon, with no match from the Latin1-encoded file)
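To make the failure mode concrete, here is a standalone snippet (not Gitea code) showing that the Latin1 bytes for áéíóú look like text but fail Go's utf-8 validation:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // "áéíóú" encoded as Latin1: one byte per character.
    latin1 := []byte{0xE1, 0xE9, 0xED, 0xF3, 0xFA}
    // The same text encoded as UTF-8: two bytes per character.
    utf8Text := []byte("áéíóú")

    fmt.Println(utf8.Valid(latin1))   // false
    fmt.Println(utf8.Valid(utf8Text)) // true
}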

@guillep2k (Member, Author)

I think some kind of encoding fallback could be used, perhaps pre-set in app.ini.
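Something along these lines, maybe (the section and key below are purely hypothetical, just to sketch the idea):

[indexer]
; hypothetical: encoding to assume when repository content is not valid utf-8
REPO_INDEXER_FALLBACK_ENCODING = ISO-8859-1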

@silverwind (Member)

Sounds like a bug in Bleve to me.


lafriks commented Aug 10, 2019

I think that, just like we currently detect the encoding and convert to utf-8 for display, we need to do the same before handing content to bleve.
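A minimal sketch of that conversion step, assuming the content has already been detected as Latin1 (ISO-8859-1); golang.org/x/text/encoding/charmap is used here purely for illustration and is not necessarily what the display path uses:

package main

import (
    "fmt"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    // Latin1 bytes for "áéíóú"; invalid as utf-8, so they must be converted
    // before the content ever reaches bleve.
    latin1 := []byte{0xE1, 0xE9, 0xED, 0xF3, 0xFA}

    utf8Bytes, err := charmap.ISO8859_1.NewDecoder().Bytes(latin1)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(utf8Bytes)) // prints áéíóú as valid utf-8
}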

@guillep2k (Member, Author)

Sounds like a bug in Bleve to me.

@silverwind, it's more about the way it's being used, but yes, it's not very robust when invalid data is fed to it. This is the current set of filters Gitea instantiates for the repository indexer:

const unicodeNormalizeName = "unicodeNormalize"

// addUnicodeNormalizeTokenFilter registers a custom token filter that applies
// Unicode NFC normalization to every token.
func addUnicodeNormalizeTokenFilter(m *mapping.IndexMappingImpl) error {
    return m.AddCustomTokenFilter(unicodeNormalizeName, map[string]interface{}{
        "type": unicodenorm.Name,
        "form": unicodenorm.NFC,
    })
}

[...]
    textFieldMapping := bleve.NewTextFieldMapping()
    textFieldMapping.IncludeInAll = false
    docMapping.AddFieldMappingsAt("Content", textFieldMapping)

    mapping := bleve.NewIndexMapping()
    // Analyzer chain: unicode tokenizer, then NFC normalization, lowercasing
    // and de-duplication of tokens.
    if err = addUnicodeNormalizeTokenFilter(mapping); err != nil {
        return err
    } else if err = mapping.AddCustomAnalyzer(repoIndexerAnalyzer, map[string]interface{}{
        "type":          custom.Name,
        "char_filters":  []string{},
        "tokenizer":     unicode.Name,
        "token_filters": []string{unicodeNormalizeName, lowercase.Name, unique.Name},
    }); err != nil {
        return err
    }
    mapping.DefaultAnalyzer = repoIndexerAnalyzer
    mapping.AddDocumentMapping(repoIndexerDocType, docMapping)
    mapping.AddDocumentMapping("_all", bleve.NewDocumentDisabledMapping())
[...]

And then the queue is filled with:

	fileContents, err := git.NewCommand("cat-file", "blob", update.BlobSha).
		RunInDirBytes(repo.RepoPath())
	if err != nil {
		return err
	} else if !base.IsTextFile(fileContents) {
		return nil
	}
	indexerUpdate := indexer.RepoIndexerUpdate{
		Filepath: update.Filename,
		Op:       indexer.RepoIndexerOpUpdate,
		Data: &indexer.RepoIndexerData{
			RepoID:  repo.ID,
			Content: string(fileContents),
		},
	}
	return indexerUpdate.AddToFlushingBatch(batch)

The indexer is passed the original data nonchalantly, even if it's effectively binary.

This code was probably copied from the issue indexer, and issue texts are always utf-8 encoded.

I agree with @lafriks: detecting the encoding is the way to go, but that only goes so far. I'd also add a filter to deal with invalid cases, because if a single invalid code point gets through, the index gets filled with garbage.
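Just to illustrate the kind of filter I mean (a standalone sketch, not an actual patch; sanitizeUTF8 is a made-up name), replacing every invalid byte with U+FFFD so that one bad sequence can't swallow the rest of the file:

package main

import (
    "fmt"
    "unicode/utf8"
)

// sanitizeUTF8 replaces each byte that is not part of a valid utf-8 sequence
// with the replacement character U+FFFD and keeps scanning.
func sanitizeUTF8(data []byte) string {
    var out []rune
    for len(data) > 0 {
        r, size := utf8.DecodeRune(data)
        if r == utf8.RuneError && size == 1 {
            // Invalid byte: substitute it instead of giving up.
            out = append(out, utf8.RuneError)
        } else {
            out = append(out, r)
        }
        data = data[size:]
    }
    return string(out)
}

func main() {
    // "sailorvenus", a Latin1-encoded "áéíóú", then "sailormoon".
    input := append([]byte("sailorvenus\n"), 0xE1, 0xE9, 0xED, 0xF3, 0xFA)
    input = append(input, []byte("\nsailormoon\n")...)
    fmt.Printf("%q\n", sanitizeUTF8(input))
}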

I'll try to look into this in a couple of days. I'm very glad I've finally found the reason my indexes were only partially useful.

@guillep2k (Member, Author)

Fixed by #7814

go-gitea locked and limited conversation to collaborators Nov 24, 2020