# HuggingFaceTokenizers.jl


Rudimentary Julia bindings for 🤗 Tokenizers, providing fast and easy-to-use tokenization through Python interop.

## Installation

From the Julia REPL, enter Pkg mode with `]` and add the package using the URL:

```
add https://github.com/MurrellGroup/HuggingFaceTokenizers.jl
```
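Equivalently, the package can be added from a script via the Pkg API (a minimal sketch, assuming the repository URL above):

```julia
using Pkg
Pkg.add(url="https://github.com/MurrellGroup/HuggingFaceTokenizers.jl")
```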

## Usage

### Loading a Tokenizer

You can load a tokenizer either from a pre-trained model or from a saved file:

```julia
using HuggingFaceTokenizers

# Load a pre-trained tokenizer
tokenizer = from_pretrained(Tokenizer, "bert-base-uncased")

# Alternatively specify revision and auth token
tokenizer = from_pretrained(Tokenizer, "bert-base-uncased", "main", nothing)

# Or load from a file
tokenizer = from_file(Tokenizer, "path/to/tokenizer.json")
```

### Basic Operations

#### Single Text Processing

```julia
# Encode a single text
text = "Hello, how are you?"
result = encode(tokenizer, text)
println("Tokens: ", result.tokens)
println("IDs: ", result.ids)

# Decode back to text
decoded_text = decode(tokenizer, result.ids)
println("Decoded: ", decoded_text)
```

#### Batch Processing

```julia
# Encode multiple texts at once
texts = ["Hello, how are you?", "I'm doing great!"]
batch_results = encode_batch(tokenizer, texts)

# Each result contains tokens and ids
for (i, result) in enumerate(batch_results)
    println("Text $i:")
    println("  Tokens: ", result.tokens)
    println("  IDs: ", result.ids)
end

# Decode multiple sequences at once
ids_batch = [result.ids for result in batch_results]
decoded_texts = decode_batch(tokenizer, ids_batch)
```
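
As a quick sanity check, the decoded batch can be compared with the original inputs (a minimal sketch; decoded text may not match the input exactly, since e.g. bert-base-uncased lowercases its input):

```julia
# Compare each original text with its decoded round trip
for (original, decoded) in zip(texts, decoded_texts)
    println(original, " -> ", decoded)
end
```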

### Saving a Tokenizer

```julia
# Save the tokenizer to a file
save(tokenizer, "path/to/save/tokenizer.json")
```
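
To confirm the round trip, the saved file can be reloaded with `from_file` (a minimal sketch using only the calls shown above):

```julia
# Reload the saved tokenizer and check that it tokenizes identically
reloaded = from_file(Tokenizer, "path/to/save/tokenizer.json")
@assert encode(reloaded, "Hello, how are you?").ids == encode(tokenizer, "Hello, how are you?").ids
```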
