add docs about db migration
umputun committed Jan 2, 2025
1 parent 5a6baf8 commit 176e0a0
Showing 2 changed files with 42 additions and 3 deletions.
35 changes: 34 additions & 1 deletion README.md
@@ -141,9 +141,42 @@ Using words that mix characters from multiple languages is a common spam techniq

This option is disabled by default. If `--space.enabled` is set or `env:SPACE_ENABLED` is true, the bot checks whether the message contains abnormal spacing. Such spacing is a common spam technique that splits the message into many shorter parts to avoid detection. The check calculates the ratio of spaces to the total number of characters in the message, as well as the ratio of short words to all words. Thresholds for this check can be set with:
- `--space.short-word` (default:3) - the maximum length of a short word
- `--space.ratio` (default:0.3) - the ratio of spaces to all characters in the message
- `--space.short-ratio` (default:0.7) - the ratio of short words to all words in the message

### Database migration for samples (spam and ham), stop words, and excluded tokens (v1.16.0+)

Starting with version 1.16.0, the bot uses a fully database-driven architecture instead of multiple text files. The previously separate files for spam/ham samples, stop words, and excluded tokens are now stored in the database alongside other bot data.

#### Migration Control

The migration process can be controlled using the `--convert` parameter, which accepts the following values:
- `enabled` (default): Performs migration during startup if needed, then continues normal operation
- `disabled`: Skips all migration; the data must already be present in the database
- `only`: Performs migration and exits immediately after completion; useful for maintenance tasks
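
The three modes above can be read as a small decision table. The following is a minimal sketch of that table, not the bot's actual code: the `plan` type, the `planFor` function, and the rejection of unknown values are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// convertMode mirrors the three documented values of --convert.
type convertMode string

const (
	convertEnabled  convertMode = "enabled"
	convertDisabled convertMode = "disabled"
	convertOnly     convertMode = "only"
)

// plan is a hypothetical type describing what happens at startup.
type plan struct {
	Migrate       bool // run the text-file to database migration if needed
	ExitAfterward bool // stop right after migration instead of serving
}

// planFor maps a --convert value to a startup plan.
// Rejecting unknown values is an assumption of this sketch.
func planFor(mode string) (plan, error) {
	switch convertMode(mode) {
	case convertEnabled:
		return plan{Migrate: true}, nil
	case convertDisabled:
		return plan{}, nil
	case convertOnly:
		return plan{Migrate: true, ExitAfterward: true}, nil
	default:
		return plan{}, errors.New("unknown convert mode: " + mode)
	}
}

func main() {
	for _, m := range []string{"enabled", "disabled", "only"} {
		p, err := planFor(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-8s migrate=%v exit=%v\n", m, p.Migrate, p.ExitAfterward)
	}
}
```

Keeping the decision in one place means the startup code only has to check two booleans, whichever mode was requested.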

#### Migration Process

During the first startup after upgrading to v1.16.0 or later, the bot automatically:
1. Migrates all existing data from text files to the database.
2. Renames the processed files to `*.loaded` to prevent duplicate loading.
3. Continues operation using only the database for all data access.

New installations come with all necessary samples and configuration preloaded in the database, eliminating the need for separate text files.

If a user renames any `*.loaded` files back to their original `.txt` extension, the bot will detect them during the next startup and perform a fresh migration. This process:
1. Clears the corresponding dataset in the database (e.g., spam samples, stop words).
2. Loads the content from the renamed files.
3. Renames the files to `*.loaded` again.

This behavior allows resetting and reloading specific datasets if needed while maintaining database consistency.

The database-driven architecture offers several benefits:
- Simplified data management through a single storage solution.
- Improved performance with optimized database access.
- Enhanced reliability by eliminating file I/O operations.
- Easier system migration by transferring a single `tg-spam.db` file.

### Admin chat/group

Optionally, the user can specify the admin chat/group name or ID. In this case, the bot sends a message to the admin chat as soon as a spammer is detected. Admins can see all the spam and all banned users, and can also unban a user, confirm a ban, or get the results of spam checks by clicking a button directly on the message.
10 changes: 8 additions & 2 deletions app/main.go
@@ -754,6 +754,10 @@ func migrateSamples(ctx context.Context, opts options, samplesDB *storage.Sample
// migrateDicts runs migrations from legacy dictionary text files to db, if needed
func migrateDicts(ctx context.Context, opts options, dictDB *storage.Dictionary) error {
	migrateDict := func(file string, dictType storage.DictionaryType) (*storage.DictionaryStats, error) {
		if opts.Convert == "disabled" {
			log.Print("[DEBUG] dictionary migration disabled")
			return &storage.DictionaryStats{}, nil
		}
		if _, err := os.Stat(file); err != nil {
			log.Printf("[DEBUG] dictionary file %s not found, skip", file)
			return &storage.DictionaryStats{}, nil
@@ -780,7 +784,7 @@ func migrateDicts(ctx context.Context, opts options, dictDB *storage.Dictionary)
		return errors.New("dictionary db is nil")
	}

	// migrate stop-words if files exist
	stopWordsFile := filepath.Join(opts.Files.SamplesDataPath, stopWordsFile)
	s, err := migrateDict(stopWordsFile, storage.DictionaryTypeStopPhrase)
	if err != nil {
@@ -800,7 +804,9 @@ func migrateDicts(ctx context.Context, opts options, dictDB *storage.Dictionary)
		log.Printf("[INFO] excluded tokens loaded: %s", s)
	}

	if s.TotalIgnoredWords > 0 || s.TotalStopPhrases > 0 {
		log.Printf("[DEBUG] dictionaries migration done: %s", s)
	}
	return nil
}
