_x000D_ kind of value in string cell should be unescaped #469

Open
yorkz1994 opened this issue Sep 24, 2024 · 4 comments

@yorkz1994
Take this Excel cell as an example: its value is a multi-line string.
Running the code below to print the cell value:

use calamine::{Reader, Xlsx};

fn main() {
    let mut wb: Xlsx<_> = calamine::open_workbook("Book1.xlsx").unwrap();
    let ws = wb.worksheet_range("Sheet1").unwrap();
    let data = ws.get_value((0, 0)).unwrap();
    dbg!(data); // print the raw value of the first cell (row 0, column 0)
}

Output:

[src/main.rs:7:5] data = String(
    "ABC_x000D_\r\nDEF",        
)

Expected output:

[src/main.rs:7:5] data = String(
    "ABC\r\nDEF",        
)

The Go excelize library handles this correctly.
Reference
Book1.xlsx

@jmcnamara

If it helps, here is how rust_xlsxwriter encodes these characters in the opposite direction (when writing):

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/xmlwriter.rs#L204-L248

And here is a test file with each of the characters from 0..127:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx

However, as mentioned in the Reference link, you also need to handle escaped literal strings, which are prefixed by _x005F_. For example, a string stored as _x005F_x0000_ in /xl/sharedStrings.xml would be displayed in Excel as _x0000_.

There is a test file for strings like that here:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings02.xlsx
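
As a rough illustration of the reading direction discussed above, here is a minimal sketch of a decoder for `_xHHHH_` escapes. It is not code from calamine, quick_xml, or rust_xlsxwriter: the names `decode_xhhhh` and `escape_at` are hypothetical, and the sketch assumes a lowercase `x`, exactly four hex digits in either case, and one escape per code unit (it does not combine surrogate pairs). Because each escape is consumed in one left-to-right pass and the output is never rescanned, `_x005F_` decodes to a plain underscore and the text after it stays literal.

```rust
/// Return the character encoded by an `_xHHHH_` token starting at byte `i`,
/// or `None` if there is no well-formed token at that position.
fn escape_at(bytes: &[u8], i: usize) -> Option<char> {
    let tok = bytes.get(i..i + 7)?;
    if tok[0] != b'_' || tok[1] != b'x' || tok[6] != b'_' {
        return None;
    }
    if !tok[2..6].iter().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let hex = std::str::from_utf8(&tok[2..6]).ok()?;
    let code = u32::from_str_radix(hex, 16).ok()?;
    // Surrogate code units are not valid `char`s; they are left as literal text.
    char::from_u32(code)
}

/// Decode every `_xHHHH_` escape in `input` in a single pass.
fn decode_xhhhh(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if let Some(ch) = escape_at(bytes, i) {
            out.push(ch);
            i += 7; // the token is pure ASCII, so this stays on a char boundary
        } else {
            // Copy one whole UTF-8 character unchanged.
            let ch = input[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}
```

A caller could apply `decode_xhhhh` to the string value of a cell after reading it, for example turning `ABC_x000D_` into `ABC` followed by a carriage return.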

@yorkz1994

@jmcnamara

Thanks, this information is very useful. I checked the code, and it seems that only _x00HH_ literals are escaped.
If other valid _xHHHH_ literals are left unescaped, Excel will no longer treat them as literal text when it reads the file.
For example, if the literal text *_x597D_* is written without escaping, Excel reads it back as *好*, whereas we expect *_x597D_*.
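
To make the point concrete, here is a minimal sketch of the writing direction that protects any literal `_xHHHH_` sequence, not only `_x00HH_`. It is not the actual rust_xlsxwriter code: the names `encode_xhhhh` and `is_escape_at` are hypothetical, and the choice of which control characters to encode (everything below U+0020 except tab and newline) is an assumption made for illustration.

```rust
/// True if a well-formed `_xHHHH_` token starts at byte `i`.
fn is_escape_at(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i..i + 7) {
        Some(t) => {
            t[0] == b'_'
                && t[1] == b'x'
                && t[6] == b'_'
                && t[2..6].iter().all(|b| b.is_ascii_hexdigit())
        }
        None => false,
    }
}

/// Encode a string for storage so that it survives a round trip through Excel.
fn encode_xhhhh(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let ch = input[i..].chars().next().unwrap();
        if ch == '_' && is_escape_at(bytes, i) {
            // A literal `_xHHHH_`: store its leading underscore as `_x005F_`
            // so that a reader does not decode the sequence.
            out.push_str("_x005F_");
        } else if (ch as u32) < 0x20 && ch != '\t' && ch != '\n' {
            // Assumed rule: control characters are stored as `_x00HH_`.
            out.push_str(&format!("_x{:04X}_", ch as u32));
        } else {
            out.push(ch);
        }
        i += ch.len_utf8();
    }
    out
}
```

With this sketch, the literal text `_x597D_` is stored as `_x005F_x597D_`, while the character 好 itself is written unchanged.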

@jmcnamara

jmcnamara commented Sep 26, 2024

> If other valid _xHHHH_ literals are left unescaped, Excel will no longer treat them as literal text when it reads the file.

You are correct. That is a bug in rust_xlsxwriter. :-| Update: fixed.

@jmcnamara

I had a look at submitting a patch for this, but it looks like the escaping is handled in quick_xml. I then looked at using quick_xml::escape::unescape_with(), but that seems intended for entities rather than general unescaping (as far as I can see).

I could look into it a bit more, but overall I don't know if it is worth it. The escape _x000D_ (\r) is probably the only one that a general user would encounter, and maybe they could just handle it themselves. @tafia, if you think it is worth fixing, let me know, along with how and where you think it should be fixed, and I can look a bit more.
