Absorb ssplit-cpp and clean headers towards packaging #13

Merged
merged 37 commits into from
Oct 4, 2023

Conversation

jerinphilip
Owner

@jerinphilip jerinphilip commented Oct 2, 2023

Change simplifies submodule dependencies towards packaging for distribution. Path being taken is:

  1. Use google/sentencepiece, which should correspond to libsentencepiece-dev, instead of browsermt/sentencepiece. There are some disparities in the training code, but this should be fine for inference. It does not appear possible to rely on libsentencepiece-dev or equivalents, because these do not export SentencePieceText (see the relevant SentencePiece API docs).
  2. browsermt/ssplit-cpp is absorbed here into {Splitter,Regex}.{hh,cc}.
  3. The 8-bit GEMM provider is enabled using mozilla/gemmology, with CMake pointed at it as an include directory. Since gemmology is header-only and used only in QMM.cc, its headers stay private and do not need to be distributed. This adds an xsimd dependency, which appears to be available via repositories.
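
For illustration, item 3 could look roughly like the CMake fragment below. This is a minimal sketch, not the build configuration from this PR: the target name `translator`, the `src/QMM.cc` path, the `GEMMOLOGY_INCLUDE_DIR` variable, and the `3rd-party/gemmology` location are all hypothetical. It assumes xsimd is installed with its CMake package config, which provides the `xsimd` target.

```cmake
cmake_minimum_required(VERSION 3.14)
project(packaging_sketch CXX)

# Hypothetical library target; the real target name is not stated in this PR.
add_library(translator STATIC src/QMM.cc)

# Assumed location of the header-only mozilla/gemmology checkout.
set(GEMMOLOGY_INCLUDE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/3rd-party/gemmology"
    CACHE PATH "Directory containing the gemmology headers")

# PRIVATE: the headers are visible only while compiling this target's own
# sources (QMM.cc here), so they are not added to the include path of
# anything that links against the library.
target_include_directories(translator PRIVATE "${GEMMOLOGY_INCLUDE_DIR}")

# gemmology builds on xsimd; locate an installed copy through its CMake
# package config and use it for this target's compilation only.
find_package(xsimd REQUIRED)
target_link_libraries(translator PRIVATE xsimd)
```

Keeping the include directory `PRIVATE` is what lets the gemmology headers stay out of the installed header set, since no public header of the library needs to include them.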

Links

@jerinphilip jerinphilip marked this pull request as ready for review October 2, 2023 18:03
@jerinphilip jerinphilip changed the title Switch to google/sentencepiece Prepare for packaging Oct 3, 2023
@jerinphilip jerinphilip changed the title Prepare for packaging Towards packaging Oct 3, 2023
@jerinphilip jerinphilip changed the title Towards packaging Absorb ssplit-cpp and clean headers towards packaging Oct 4, 2023
@jerinphilip jerinphilip merged commit 9920fbd into main Oct 4, 2023
4 checks passed
@jerinphilip jerinphilip deleted the google-spiece branch October 4, 2023 07:49
jerinphilip added a commit that referenced this pull request Oct 4, 2023
Change simplifies submodule dependencies towards packaging for
distribution. Path being taken is:

1. Use `google/sentencepiece`, which should correspond to
`libsentencepiece-dev`, instead of `browsermt/sentencepiece`. There are
some disparities in the training code, but this should be fine for
inference. It does not appear possible to rely on `libsentencepiece-dev`
or equivalents, because these do not export `SentencePieceText` (see the
relevant SentencePiece API docs).
2. `browsermt/ssplit-cpp` is absorbed here into
`{Splitter,Regex}.{hh,cc}`.
3. The 8-bit GEMM provider is enabled using `mozilla/gemmology`, with
CMake pointed at it as an include directory. Since gemmology is
header-only and used only in `QMM.cc`, its headers stay private and do
not need to be distributed. This adds an `xsimd` dependency, which
appears to be available via repositories.

Links

- SentencePiece API
https://github.com/google/sentencepiece/blob/8cbdf13794284c30877936f91c6f31e2c1d5aef7/src/sentencepiece_processor.h#L387-L401
- google/sentencepiece https://github.com/google/sentencepiece
- browsermt/sentencepiece https://github.com/browsermt/sentencepiece
- browsermt/ssplit-cpp https://github.com/browsermt/ssplit-cpp
- mozilla/gemmology https://github.com/mozilla/gemmology
- xsimd https://github.com/xtensor-stack/xsimd
- libsentencepiece-dev
https://packages.debian.org/stable/libdevel/libsentencepiece-dev

Pull-Request: #13