
Repository indexer clogs with file with multi-byte character sets #7809

Closed

guillep2k opened this issue Aug 10, 2019 · 5 comments

guillep2k commented Aug 10, 2019

  • Gitea version (or commit ref): release/v1.9
  • Git version: 2.22.0
  • Operating system: Linux - CentOS 7
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

When using the repository indexer, files with multi-byte character sets don't get indexed correctly. This happens when characters look like valid utf-8 code points but are not. Once a bad sequence is encountered, the rest of the file is indexed as a single token; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer gets the first half of the file OK, and the rest as one "word" which is 50KB long (and certainly not searchable).

To reproduce this issue, files with the following content can be tested using utf-8 and Latin1 character sets:

sailorvenus
áéíóú
sailormoon

Note: to test this properly, the files must be committed through git, not Gitea's web interface.

Searching for sailorvenus returns results, as it is the first word. In the Latin1-encoded file the rest of the context is garbled:
(screenshot: search results for sailorvenus, showing garbled context for the Latin1-encoded file)

Searching for sailormoon returns no results from the Latin1-encoded file, as the indexing of the rest of that file is garbled:
(screenshot: search results for sailormoon, with no match from the Latin1-encoded file)
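To make the failure mode concrete, here is a standalone snippet (not Gitea code) showing that the Latin1 bytes for áéíóú look like text but fail Go's utf-8 validation:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // "áéíóú" encoded as Latin1: one byte per character.
    latin1 := []byte{0xE1, 0xE9, 0xED, 0xF3, 0xFA}
    // The same text encoded as UTF-8: two bytes per character.
    utf8Text := []byte("áéíóú")

    fmt.Println(utf8.Valid(latin1))   // false
    fmt.Println(utf8.Valid(utf8Text)) // true
}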

@guillep2k (Member, Author)

I think some kind of encoding fallback could be used, perhaps pre-set in app.ini.
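Something along these lines, maybe (the section and key below are purely hypothetical, just to sketch the idea):

[indexer]
; hypothetical: encoding to assume when repository content is not valid utf-8
REPO_INDEXER_FALLBACK_ENCODING = ISO-8859-1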

@silverwind (Member)

Sounds like a bug in Bleve to me.


lafriks commented Aug 10, 2019

I think that, just like we currently detect the encoding and convert to utf-8 for display, we need to do the same before handing content to bleve.
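A minimal sketch of that conversion step, assuming the content has already been detected as Latin1 (ISO-8859-1); golang.org/x/text/encoding/charmap is used here purely for illustration and is not necessarily what the display path uses:

package main

import (
    "fmt"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    // Latin1 bytes for "áéíóú"; invalid as utf-8, so they must be converted
    // before the content ever reaches bleve.
    latin1 := []byte{0xE1, 0xE9, 0xED, 0xF3, 0xFA}

    utf8Bytes, err := charmap.ISO8859_1.NewDecoder().Bytes(latin1)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(utf8Bytes)) // prints áéíóú as valid utf-8
}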

@guillep2k (Member, Author)

Sounds like a bug in Bleve to me.

@silverwind, it's more about the way it's being used, but yes, it's not very robust when invalid data is fed to it. This is the current set of filters Gitea instantiates for the repository indexer:

const unicodeNormalizeName = "unicodeNormalize"

// addUnicodeNormalizeTokenFilter registers a custom token filter that applies
// Unicode NFC normalization to every token.
func addUnicodeNormalizeTokenFilter(m *mapping.IndexMappingImpl) error {
    return m.AddCustomTokenFilter(unicodeNormalizeName, map[string]interface{}{
        "type": unicodenorm.Name,
        "form": unicodenorm.NFC,
    })
}

[...]
    textFieldMapping := bleve.NewTextFieldMapping()
    textFieldMapping.IncludeInAll = false
    docMapping.AddFieldMappingsAt("Content", textFieldMapping)

    mapping := bleve.NewIndexMapping()
    // Analyzer chain: unicode tokenizer, then NFC normalization, lowercasing
    // and de-duplication of tokens.
    if err = addUnicodeNormalizeTokenFilter(mapping); err != nil {
        return err
    } else if err = mapping.AddCustomAnalyzer(repoIndexerAnalyzer, map[string]interface{}{
        "type":          custom.Name,
        "char_filters":  []string{},
        "tokenizer":     unicode.Name,
        "token_filters": []string{unicodeNormalizeName, lowercase.Name, unique.Name},
    }); err != nil {
        return err
    }
    mapping.DefaultAnalyzer = repoIndexerAnalyzer
    mapping.AddDocumentMapping(repoIndexerDocType, docMapping)
    mapping.AddDocumentMapping("_all", bleve.NewDocumentDisabledMapping())
[...]

And then the queue is filled with:

	fileContents, err := git.NewCommand("cat-file", "blob", update.BlobSha).
		RunInDirBytes(repo.RepoPath())
	if err != nil {
		return err
	} else if !base.IsTextFile(fileContents) {
		return nil
	}
	indexerUpdate := indexer.RepoIndexerUpdate{
		Filepath: update.Filename,
		Op:       indexer.RepoIndexerOpUpdate,
		Data: &indexer.RepoIndexerData{
			RepoID:  repo.ID,
			Content: string(fileContents),
		},
	}
	return indexerUpdate.AddToFlushingBatch(batch)

The indexer is passed the original data nonchalantly, even if it's effectively binary.

This code was probably copied from the issue indexer, and issue texts are always utf-8 encoded.

I agree with @lafriks: detecting the encoding is the way to go, but that only goes so far. I'd also add a filter to deal with invalid cases, because if a single invalid code point gets through, the index gets filled with garbage.
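Just to illustrate the kind of filter I mean (a standalone sketch, not an actual patch; sanitizeUTF8 is a made-up name), replacing every invalid byte with U+FFFD so that one bad sequence can't swallow the rest of the file:

package main

import (
    "fmt"
    "unicode/utf8"
)

// sanitizeUTF8 replaces each byte that is not part of a valid utf-8 sequence
// with the replacement character U+FFFD and keeps scanning.
func sanitizeUTF8(data []byte) string {
    var out []rune
    for len(data) > 0 {
        r, size := utf8.DecodeRune(data)
        if r == utf8.RuneError && size == 1 {
            // Invalid byte: substitute it instead of giving up.
            out = append(out, utf8.RuneError)
        } else {
            out = append(out, r)
        }
        data = data[size:]
    }
    return string(out)
}

func main() {
    // "sailorvenus", a Latin1-encoded "áéíóú", then "sailormoon".
    input := append([]byte("sailorvenus\n"), 0xE1, 0xE9, 0xED, 0xF3, 0xFA)
    input = append(input, []byte("\nsailormoon\n")...)
    fmt.Printf("%q\n", sanitizeUTF8(input))
}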

I'll try to look into this in a couple of days. I'm very glad I've finally found the reason my indexes were only partially useful.

@guillep2k (Member, Author)

Fixed by #7814

go-gitea locked and limited conversation to collaborators Nov 24, 2020