Fix issue with markdown translation where code blocks were split across chunks #21

Merged
merged 1 commit into from
Sep 29, 2024

skytin1004
Collaborator

@skytin1004 skytin1004 commented Sep 29, 2024

Purpose

Problem

During Markdown translation, code blocks and inline code were sometimes split across multiple chunks, which caused errors or incomplete translations, especially in longer documents with many code snippets. In some cases, inline code and code blocks also contained variables or syntax that triggered errors in the Semantic Kernel translation process.
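To make the failure mode concrete, here is a minimal sketch of how a chunker that is unaware of code fences can cut a fenced block in half. The splitter shown (`naive_chunks`, a fixed number of lines per chunk) and the helper `has_unclosed_fence` are hypothetical illustrations, not the project's actual chunking code, which this PR description does not show.

```python
FENCE = "`" * 3  # built programmatically so the example can live inside a fenced block


def naive_chunks(text: str, max_lines: int) -> list[str]:
    """Hypothetical splitter: cut the document every max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def has_unclosed_fence(chunk: str) -> bool:
    """A chunk with an odd number of fence lines contains only part of a code block."""
    fence_lines = sum(line.lstrip().startswith(FENCE) for line in chunk.splitlines())
    return fence_lines % 2 == 1


doc = "\n".join([
    "Intro paragraph.",
    FENCE + "python",
    "x = 1",
    "print(x)",
    FENCE,
    "Closing paragraph.",
])

chunks = naive_chunks(doc, max_lines=3)
broken = [has_unclosed_fence(chunk) for chunk in chunks]
```

With three lines per chunk, the opening fence lands in the first chunk and the closing fence in the second, so both chunks carry an unbalanced fence when they are sent for translation.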

Solution

  • Before the document is sent for translation, code blocks (```) and inline code (``) are replaced with placeholders such as @@CODE_BLOCK_x@@ and @@INLINE_CODE_x@@.
  • After translation completes, the placeholders are replaced with the original code blocks and inline code.
  • Because each code block or inline code span is reduced to a single placeholder, it can no longer be split between chunks, so translations are more accurate and code formatting stays intact. (Solves #18, "Code block and chunk splitting issue in Markdown translation".)
  • It also prevents Semantic Kernel errors that could be triggered by variables or specific syntax inside code blocks during translation. (Solves #11, "Translation fails due to Semantic Kernel tokenization error".)
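The placeholder round trip described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the function names `mask_code` and `unmask_code`, the regular expressions, and the numbering scheme (one shared counter for both placeholder kinds) are all hypothetical. Only the placeholder formats @@CODE_BLOCK_x@@ and @@INLINE_CODE_x@@ come from the description.

```python
import re

FENCE = "`" * 3  # built programmatically so the example can live inside a fenced block

# Assumed patterns: fenced blocks (non-greedy, spanning newlines) and single-backtick spans.
CODE_BLOCK_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_CODE_RE = re.compile(r"`[^`\n]+`")


def mask_code(markdown: str) -> tuple[str, dict[str, str]]:
    """Replace code with placeholders; return the masked text and a lookup table."""
    placeholders: dict[str, str] = {}

    def stash(kind: str):
        def _replace(match: re.Match) -> str:
            key = f"@@{kind}_{len(placeholders)}@@"
            placeholders[key] = match.group(0)
            return key
        return _replace

    # Mask fenced blocks first so their contents are never matched as inline code.
    masked = CODE_BLOCK_RE.sub(stash("CODE_BLOCK"), markdown)
    masked = INLINE_CODE_RE.sub(stash("INLINE_CODE"), masked)
    return masked, placeholders


def unmask_code(translated: str, placeholders: dict[str, str]) -> str:
    """Put the original code back in place of each placeholder."""
    for key, original in placeholders.items():
        translated = translated.replace(key, original)
    return translated


doc = "Run `pip install x` first.\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
masked, table = mask_code(doc)

# Stand-in for the real translation call: only the prose is changed.
translated = masked.replace("Run", "Ejecuta").replace("first", "primero").replace("Done", "Listo")
restored = unmask_code(translated, table)
```

Here `masked` contains no backticks at all, so the translator and the chunker only ever see prose plus opaque tokens, and `restored` carries the original code verbatim. One caveat of this sketch: it relies on the translation step returning each placeholder unchanged.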

Testing

The solution was tested on documents with extensive code snippets to confirm that code blocks and inline code remain intact during translation and are no longer split across chunks. It was also confirmed that this approach prevents the Semantic Kernel errors previously caused by specific variables or syntax inside code blocks.

In addition, the Phi-3 CookBook file 06.E2ESamples/E2E_Phi-3-FineTuning_PromptFlow_Integration.md, which is around 1200 lines long and previously failed to translate cleanly, was translated without any chunk loss or formatting errors. (Solves #18)

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[x] No

Type of change

[x] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

…ss chunks

- Resolved an issue that caused code blocks (```) and inline code (``) to be split across chunks during markdown translation.
- Implemented new logic to replace code blocks and inline code with placeholders before translation.
- Restored the original code from the placeholders after translation, preventing translation issues related to broken code blocks.
@skytin1004 skytin1004 self-assigned this Sep 29, 2024
@skytin1004 skytin1004 added the bug Something isn't working label Sep 29, 2024
@skytin1004 skytin1004 changed the title Fix issue with markdown translation where code blocks were split acro… Fix issue with markdown translation where code blocks were split across chunks Sep 29, 2024
@skytin1004
Collaborator Author

I have reviewed the changes and everything looks good.

@skytin1004 skytin1004 merged commit ba135e8 into Azure:main Sep 29, 2024
1 check passed