Add support for parsing files under `extracted/` #46

inquisitivecrystal · 2021-07-11T02:35:59Z

Rust needs the ability to parse extracted/DerivedNumericValues.txt as part of rust-lang/rust#84056. This adds parsing support for that file and all the other files under extracted/.
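For readers unfamiliar with the file, here is a minimal sketch of parsing one data line of `extracted/DerivedNumericValues.txt`. It is not the code from this PR and does not use ucd-parse's API. The names `NumericValueRow` and `parse_numeric_value_line` are hypothetical, and the field layout (codepoint or range; decimal value; an empty field; the value as an integer or fraction) is an assumption based on the published file.

```rust
/// One parsed data line: an inclusive codepoint range plus its numeric value
/// as a numerator/denominator pair. (Hypothetical type, for illustration.)
#[derive(Debug, PartialEq)]
pub struct NumericValueRow {
    pub start: u32,
    pub end: u32,
    pub numerator: i64,
    pub denominator: i64,
}

/// Parses a single line. Returns `None` for blank lines, comment-only lines,
/// and lines that do not match the assumed four-field layout.
pub fn parse_numeric_value_line(line: &str) -> Option<NumericValueRow> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    // Assumed layout: codepoints ; decimal ; (empty) ; rational
    let fields: Vec<&str> = data.split(';').map(str::trim).collect();
    if fields.len() != 4 {
        return None;
    }

    // The first field is either a single codepoint or `START..END`, in hex.
    let (start, end) = match fields[0].split_once("..") {
        Some((lo, hi)) => (
            u32::from_str_radix(lo, 16).ok()?,
            u32::from_str_radix(hi, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(fields[0], 16).ok()?;
            (cp, cp)
        }
    };

    // The last field is an integer such as `7` or a fraction such as `-1/2`.
    let (numerator, denominator) = match fields[3].split_once('/') {
        Some((n, d)) => (n.parse::<i64>().ok()?, d.parse::<i64>().ok()?),
        None => (fields[3].parse::<i64>().ok()?, 1),
    };

    Some(NumericValueRow { start, end, numerator, denominator })
}
```

The sketch reads the exact rational field rather than the decimal one, since the decimal field is only an approximation for values such as 1/3.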

inquisitivecrystal · 2022-02-03T20:40:59Z

@BurntSushi Sorry for the nag, but can I ask for an update on this? If this is too large, I'd be happy to separate out the portion that we actually need, extracted/DerivedNumericValues.txt, to make it easier to review.

BurntSushi

Thanks! LGTM.

Closes #46

inquisitivecrystal · 2022-07-14T08:10:59Z

Thanks so much for merging this. I really appreciate it, especially as I know things have been so busy for you. I'm glad rust-lang/rust#84056 is finally unblocked! 🎉

BurntSushi · 2022-07-14T12:46:20Z

No problem and sorry it took so long! Incidentally, I didn't realize this was blocking work for std (although I now see you did link it in your initial comment, whoops). Is ucd-generate used to generate the Unicode tables for std? I didn't know about that.

inquisitivecrystal · 2022-07-14T20:51:05Z

Yep, the unicode-table-generator tool used to make the standard library's unicode tables uses ucd-parse for parsing. It does its own table generation, because the space requirements of the standard library are a bit bespoke.

I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

BurntSushi · 2022-07-15T12:10:23Z

@inquisitivecrystal Ah interesting. I bet some of those space saving tricks would be useful for regex-syntax too. See #30 and #39 for some ideas in this direction.

> I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

Ug sorry. Yeah, my brain has been a pile of mush for the past couple of years. I've only recently just started coming back up for air and getting more time to devote to projects.

BurntSushi · 2022-07-15T12:13:41Z

To elaborate a bit more on regex-syntax, basically, the tables are read when compiling a regex and not when searching. While regex compilation still needs to be reasonably fast, it would be acceptable to make using the Unicode tables slower if it allowed us to make the tables smaller. As it stands currently, I've basically invested no work or time in shrinking the tables at all. They are just sorted sequences of codepoint ranges. regex-syntax embeds a considerable amount of Unicode data (which can be disabled using Cargo features at least), but all of it is included by default. So space savings there would be a huge win.
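As a rough illustration of the "sorted sequences of codepoint ranges" representation described above, here is a minimal sketch of a membership test using binary search. The table `ASCII_ALNUM` and the function `contains` are made up for this example and are not taken from regex-syntax or ucd-generate.

```rust
use std::cmp::Ordering;

/// A tiny made-up table: inclusive ranges, sorted and non-overlapping.
pub const ASCII_ALNUM: &[(char, char)] = &[('0', '9'), ('A', 'Z'), ('a', 'z')];

/// Returns true if `c` falls inside any range of `table`.
/// Relies on the table being sorted with no overlapping ranges.
pub fn contains(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if end < c {
                // The whole range lies below `c`: search to the right.
                Ordering::Less
            } else if start > c {
                // The whole range lies above `c`: search to the left.
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

Lookup is O(log n) in the number of ranges, but each range costs two full `char` values (8 bytes), which is where the space goes for large properties.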

BurntSushi · 2022-07-15T12:15:44Z

So what I'm trying to say is that if rustc has these super optimized/compressed formats for codepoint tables, it could be worth porting them to ucd-generate. With that said, it can be frustrating to rely on an external project for such a key thing inside of std. But, I wanted to throw it out there that there is almost certainly demand for the Herculean efforts being made elsewhere. :-)

thomcc · 2022-07-15T20:10:58Z

FWIW, the smallest tables I know of are in the libunicode that ships with QuickJS (https://bellard.org/quickjs/), which manages to fit all boolean properties, general categories, scripts, and script extensions in around 40kb. Many of them require an unpacking step, but several can be modified to have an index (and the code for that is already in the repo). It's worth taking a look; the basic idea is just to use a chunked RLE on most of the tables.

It's considerably smaller than the tables that libstd uses, but also slower, even with the index.
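To make the "chunked RLE plus index" idea concrete, here is a minimal sketch for a single boolean property. It is not libunicode's actual format: `RleTable`, `CHUNK`, and the layout are assumptions chosen for clarity, and a real implementation would pack run lengths into variable-width bytes instead of `u32`s to get the space savings.

```rust
/// Runs per index entry. Must be even so every chunk starts on an "absent" run.
const CHUNK: usize = 4;

/// Hypothetical run-length table for one boolean property.
pub struct RleTable {
    /// Alternating run lengths: absent, present, absent, present, ...
    runs: Vec<u32>,
    /// First codepoint covered by each chunk of `CHUNK` runs.
    index: Vec<u32>,
}

impl RleTable {
    /// Builds the table from sorted, non-overlapping inclusive ranges.
    pub fn from_ranges(ranges: &[(u32, u32)]) -> RleTable {
        let mut runs = Vec::new();
        let mut next = 0u32; // first codepoint not yet covered by a run
        for &(start, end) in ranges {
            runs.push(start - next); // gap before the range (may be 0)
            runs.push(end - start + 1); // codepoints inside the range
            next = end + 1;
        }

        // Record where each chunk starts so lookups can skip whole chunks.
        let mut index = Vec::new();
        let mut pos = 0u32;
        for chunk in runs.chunks(CHUNK) {
            index.push(pos);
            pos += chunk.iter().sum::<u32>();
        }
        RleTable { runs, index }
    }

    /// Returns true if `cp` has the property.
    pub fn contains(&self, cp: u32) -> bool {
        // Find the last chunk whose first codepoint is <= cp.
        let chunk_no = match self.index.partition_point(|&s| s <= cp) {
            0 => return false, // empty table
            n => n - 1,
        };

        // Walk the runs of that chunk only.
        let first = chunk_no * CHUNK;
        let last = (first + CHUNK).min(self.runs.len());
        let mut pos = self.index[chunk_no];
        for (i, &len) in self.runs[first..last].iter().enumerate() {
            pos += len;
            if cp < pos {
                // Odd positions within a chunk are "present" runs.
                return i % 2 == 1;
            }
        }
        false // past the final range
    }
}
```

The trade-off matches the one described above: the index keeps each lookup to a binary search plus at most `CHUNK` additions, which is still more work than a plain binary search over ranges.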

Add support for parsing files under extracted/

7e79e21

inquisitivecrystal force-pushed the extracted branch from bccda70 to 7e79e21 on July 11, 2021

inquisitivecrystal changed the title ~~Add support for parsing values in extracted/~~ Add support for parsing files under extracted/ Jul 11, 2021

BurntSushi approved these changes Jul 5, 2022

BurntSushi pushed a commit that referenced this pull request Jul 5, 2022

ucd-parse: add support for parsing files under 'extracted/'

1381427

Closes #46

BurntSushi pushed a commit that referenced this pull request Jul 5, 2022

ucd-parse: add support for parsing files under 'extracted/'

c61ae95

Closes #46

BurntSushi closed this in 6341645 Jul 5, 2022

inquisitivecrystal mentioned this pull request Jul 14, 2022

Chinese numerals are not recognized by char::is_numeric rust-lang/rust#84056

Closed
