Add support for parsing files under extracted/ #46

Closed


inquisitivecrystal
Contributor

@inquisitivecrystal inquisitivecrystal commented Jul 11, 2021

Rust needs the ability to parse extracted/DerivedNumericValues.txt as part of rust-lang/rust#84056. This adds parsing support for that file and all the other files under extracted/.
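For readers unfamiliar with the file, here is a rough sketch of what parsing one data line of extracted/DerivedNumericValues.txt involves. It is hand-rolled and does not use or reproduce the ucd-parse API; the `NumericValue` type and `parse_line` function are hypothetical names, and the field layout (codepoint or range; decimal value; empty field; rational value; trailing `#` comment) is assumed from the published UCD file.

```rust
/// One parsed record. Hypothetical type for illustration only;
/// this is not the type that ucd-parse exposes.
#[derive(Debug, PartialEq)]
struct NumericValue {
    start: u32,       // first codepoint of the range
    end: u32,         // last codepoint of the range (inclusive)
    numerator: i64,   // numerator of the rational value
    denominator: i64, // denominator (1 for whole numbers)
}

/// Parse a single line, returning None for comments, blank lines,
/// or anything that does not match the assumed layout.
fn parse_line(line: &str) -> Option<NumericValue> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    let fields: Vec<&str> = data.split(';').map(str::trim).collect();
    if fields.len() < 4 {
        return None;
    }

    // Field 0 is either a single codepoint or a `start..end` range, in hex.
    let (start, end) = match fields[0].split_once("..") {
        Some((a, b)) => (
            u32::from_str_radix(a, 16).ok()?,
            u32::from_str_radix(b, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(fields[0], 16).ok()?;
            (cp, cp)
        }
    };

    // Field 3 is the value as an integer or a fraction such as `1/4`.
    let (numerator, denominator): (i64, i64) = match fields[3].split_once('/') {
        Some((n, d)) => (n.parse().ok()?, d.parse().ok()?),
        None => (fields[3].parse().ok()?, 1),
    };

    Some(NumericValue { start, end, numerator, denominator })
}
```

Calling `parse_line` on each line of the file and discarding the `None` results would yield the records; a real parser would report malformed lines instead of silently skipping them.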

@inquisitivecrystal inquisitivecrystal changed the title Add support for parsing values in extracted/ Add support for parsing files under extracted/ Jul 11, 2021
@inquisitivecrystal
Contributor Author

@BurntSushi Sorry for the nag, but can I ask for an update on this? If this is too large, I'd be happy to separate out the portion that we actually need, extracted/DerivedNumericValues.txt, to make it easier to review.

Owner

@BurntSushi BurntSushi left a comment


Thanks! LGTM.

@inquisitivecrystal
Contributor Author

Thanks so much for merging this. I really appreciate it, especially as I know things have been so busy for you. I'm glad rust-lang/rust#84056 is finally unblocked! 🎉

@BurntSushi
Owner

No problem and sorry it took so long! Incidentally, I didn't realize this was blocking work for std (although I now see you did link it in your initial comment, whoops). Is ucd-generate used to generate the Unicode tables for std? I didn't know about that.

@inquisitivecrystal
Contributor Author

Yep, the unicode-table-generator tool used to make the standard library's unicode tables uses ucd-parse for parsing. It does its own table generation, because the space requirements of the standard library are a bit bespoke.

I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

@BurntSushi
Owner

@inquisitivecrystal Ah interesting. I bet some of those space saving tricks would be useful for regex-syntax too. See #30 and #39 for some ideas in this direction.

> I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

Ug sorry. Yeah, my brain has been a pile of mush for the past couple of years. I've only recently just started coming back up for air and getting more time to devote to projects.

@BurntSushi
Owner

To elaborate a bit more on regex-syntax, basically, the tables are read when compiling a regex and not when searching. While regex compilation still needs to be reasonably fast, it would be acceptable to make using the Unicode tables slower if it allowed us to make the tables smaller. As it stands currently, I've basically invested no work or time in shrinking the tables at all. They are just sorted sequences of codepoint ranges. regex-syntax embeds a considerable amount of Unicode data (which can be disabled using Cargo features at least), but all of it is included by default. So space savings there would be a huge win.
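As a rough illustration of the "sorted sequences of codepoint ranges" layout described above, a membership test can be a binary search over inclusive ranges. This is a minimal sketch, not code from regex-syntax or ucd-generate: the `ASCII_ALNUM` table and the `range_table_contains` function are hypothetical, and the table holds a few hand-written ASCII ranges instead of generated Unicode data.

```rust
use std::cmp::Ordering;

/// Hypothetical table of sorted, non-overlapping, inclusive codepoint ranges.
/// Illustrative values only: ASCII digits, uppercase letters, lowercase letters.
const ASCII_ALNUM: &[(u32, u32)] = &[(0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A)];

/// Return true if `c` falls inside any range of `table`.
/// Relies on the table being sorted and non-overlapping.
fn range_table_contains(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(start, end)| {
            if end < cp {
                // This range lies entirely before the codepoint.
                Ordering::Less
            } else if start > cp {
                // This range lies entirely after the codepoint.
                Ordering::Greater
            } else {
                // start <= cp <= end, so the codepoint is in this range.
                Ordering::Equal
            }
        })
        .is_ok()
}
```

Lookup is logarithmic in the number of ranges, but each range costs two full `u32` values (8 bytes here), which is the kind of space overhead the comment above is talking about.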

@BurntSushi
Owner

So what I'm trying to say is that if rustc has these super optimized/compressed formats for codepoint tables, it could be worth porting them to ucd-generate. With that said, it can be frustrating to rely on an external project for such a key thing inside of std. But, I wanted to throw it out there that there is almost certainly demand for the Herculean efforts being made elsewhere. :-)

@thomcc
Contributor

thomcc commented Jul 15, 2022

FWIW, the smallest tables I know of are in libunicode, part of QuickJS (https://bellard.org/quickjs/), which manages to fit all boolean properties, general categories, scripts, and script extensions in around 40kb. Many of them require an unpacking step, but several can be modified to have an index (and the code for that is already in the repo). It's worth taking a look; the basic idea is just to use a chunked RLE on most of the tables.

It's considerably smaller than the tables that libstd uses, but also slower, even with the index.
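To make the run-length idea concrete, here is a simplified sketch of encoding a boolean property as alternating run lengths. It is not libunicode's actual format: it has no chunking, no index, and no compact byte packing, and the names `encode_runs` and `rle_contains` are hypothetical. It only shows why lookups get slower, since finding a codepoint means walking the runs from the start.

```rust
/// Encode sorted, non-overlapping, inclusive ranges as alternating run lengths.
/// Runs alternate "not in set", "in set", "not in set", ... starting at codepoint 0.
/// Assumes every range end is below u32::MAX, which holds for Unicode codepoints.
fn encode_runs(ranges: &[(u32, u32)]) -> Vec<u32> {
    let mut runs = Vec::with_capacity(ranges.len() * 2);
    let mut next = 0u32; // first codepoint not yet covered by a run
    for &(start, end) in ranges {
        runs.push(start - next); // gap before this range (may be 0)
        runs.push(end - start + 1); // length of the range itself
        next = end + 1;
    }
    runs
}

/// Test membership by walking the runs from the beginning.
/// This is linear in the number of runs; an index over chunks of runs
/// would let a lookup skip most of the walk.
fn rle_contains(runs: &[u32], c: char) -> bool {
    let mut remaining = c as u32;
    let mut in_set = false; // the first run is always a "not in set" run
    for &len in runs {
        if remaining < len {
            return in_set;
        }
        remaining -= len;
        in_set = !in_set;
    }
    // Past the last run: everything after the final range is outside the set.
    false
}
```

Because the run lengths are usually much smaller than the codepoints themselves, a real implementation can store most of them in one or two bytes, which is where the space savings come from.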

3 participants