Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to TinyStr #7

Merged
merged 1 commit into from
Aug 10, 2019
Merged

Switch to TinyStr #7

merged 1 commit into from
Aug 10, 2019

Conversation

zbraniecki
Copy link
Owner

This is a port of https://github.com/projectfluent/fluent-locale-rs/pull/8/files to unic-langid.

It's a WIP patch, but it works and passes all tests. I think there are couple more optimizations possible and I'd also like to get feedback from the author on my attempt to separate try_new from new_unchecked.

Performance looks good, except of the unchecked:

     Running /Users/zbraniecki/projects/unic-locale/target/release/deps/canonicalize-de07a657eb9bbd6d
Gnuplot not found, disabling plotting
langid_canonicalize     time:   [3.1782 us 3.1955 us 3.2135 us]                                 
                        change: [-38.116% -37.401% -36.763%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

Gnuplot not found, disabling plotting
     Running /Users/zbraniecki/projects/unic-locale/target/release/deps/langid-c1dada57b49294d8
Gnuplot not found, disabling plotting
language_identifier_from_str                                                                             
                        time:   [512.60 ns 513.77 ns 515.01 ns]
                        change: [-84.009% -83.919% -83.827%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) low mild
  4 (4.00%) high mild

language_identifier_from_parts                                                                            
                        time:   [293.03 ns 294.23 ns 295.60 ns]
                        change: [-89.997% -89.930% -89.862%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

language_identifier_from_parts_unchecked                                                                            
                        time:   [220.00 ns 221.03 ns 222.26 ns]
                        change: [+49.424% +50.201% +50.996%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

Gnuplot not found, disabling plotting
     Running /Users/zbraniecki/projects/unic-locale/target/release/deps/parser-9a3dedd695250831
Gnuplot not found, disabling plotting
language_identifier_parser                                                                            
                        time:   [432.51 ns 433.59 ns 434.83 ns]
                        change: [-85.793% -85.707% -85.617%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

language_identifier_parser_casing                                                                            
                        time:   [432.55 ns 433.62 ns 434.78 ns]
                        change: [-85.559% -85.467% -85.376%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild

Gnuplot not found, disabling plotting

This patch so far is backwards compatible which is super nice. But I'm now wondering if it should return TinyStr from get_* methods (which, I assume, would make the unchecked constructor cheaper/faster), but it would be a breaking change and would let people operate on TinyStr rather than str. Not sure if it's worth it...

@zbraniecki zbraniecki force-pushed the tinystr branch 2 times, most recently from 53a46e5 to 60c3bc0 Compare August 7, 2019 18:27
@zbraniecki
Copy link
Owner Author

@raphlinus - hi! would you have time to take a look at this and verify if I'm doing the right thing?

I'm concerned about the unchecked perf and also about the Ord on TinyStr8.

Copy link

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments inline.

Unchecked performance is an interesting question - the reason why it was good before is that you allowed Cow<'static, str> so there's basically no work to be done. Since these are static strings, there is a potential strategy, which is to use the integers directly. One thing to explore is whether aggressive inlining of the TinyStr constructor would let the compiler do this. If const fn gets more powerful, then it might be possible to write the code carefully to be const fn.

Another possibility to explore, if identifiers from static sources are important, is a macro.

pub fn new_unchecked(text: &str) -> TinyStr8 {
unsafe {
let mut word: u64 = 0;
copy_nonoverlapping(text.as_ptr(), &mut word as *mut u64 as *mut u8, text.len());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be possible to do this without unsafe, using a [u8; 8] buffer and then u64::from_ne_bytes, which is present as of 1.32.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I'm trying to do that, but I failed to find a way to get [u8; 8] out of a &str. What API should I use?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevermind, I'll actually go with dumping numbers and new_unchecked will take it back.

fn cmp(&self, other: &TinyStr8) -> Ordering {
//XXX: Is that correct?
// I see that `nedis` < `macos` otherwise - see tests :(
other.0.cmp(&self.0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect what you want to do here is self.0.to_be().cmp(other.0.to_be()).

assert!(TinyStr8::try_new("abcXYZ").unwrap().is_all_ascii_alpha());
}

#[test]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely needs more cases: strings of different lengths, etc.

pub struct TinyStr4(NonZeroU32);

impl TinyStr8 {
pub fn new_unchecked(text: &str) -> TinyStr8 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe as written, so should be pub unsafe fn.

@zbraniecki
Copy link
Owner Author

Another possibility to explore, if identifiers from static sources are important, is a macro.

This is exactly for a macro - see https://github.com/zbraniecki/unic-locale/blob/master/unic-langid-macros-impl/src/lib.rs#L12 and https://github.com/zbraniecki/unic-locale/blob/master/unic-langid-macros/examples/langid-macro.rs

I'd like to not make it slower if possible. I'm wondering if I could just have a method that returns the structure as TinyStr that can be passed to from_parts_unchecked just for the macro use.

@raphlinus
Copy link

raphlinus commented Aug 8, 2019

You should be able to do this by having a TinyStr constructor that takes an integer, and having your macro compute the integer for the string. Then performance would be spectacular.

One thing to watch out for is endianness: I recommend the TinyStr constructor always take the string in little-endian, so do .from_le() to swap on a BE system. It's very difficult to test on big-endian systems, as they're rare (but do exist).

Also keep in mind, that constructor needs to be unsafe, because otherwise you can deref the resulting TinyStr and get an invalid string.

@zbraniecki
Copy link
Owner Author

I updated this patch to be based on #8. The unchecked performance is now on par with the previous solution! :)

@zbraniecki zbraniecki force-pushed the tinystr branch 4 times, most recently from 53b1a32 to 47f6efc Compare August 9, 2019 17:03
@zbraniecki zbraniecki marked this pull request as ready for review August 9, 2019 17:04
@zbraniecki
Copy link
Owner Author

Per emilio recommendation I also switched variants from Vec<_> to Box<[_]>. I was hoping it'll unblock production of LanguageIdentifier in const fn, but unfortunately that's still blocked on, this time on Box::new.

I could try to bring ToTokens into TinyStr, but I'd like to avoid setting this as a dependency on tinystr crate so I think for now I'll give up on langid! working in static.

@Manishearth
Copy link
Collaborator

Wait, why do you need ToTokens on TinyStr? In the macro you can just expand to an invocation of new_unchecked, no?

@zbraniecki
Copy link
Owner Author

Wait, why do you need ToTokens on TinyStr? In the macro you can just expand to an invocation of new_unchecked, no?

Yeah, but then I need to produce:

    let variants: Vec<_> = variants
        .into_iter()
        .map(|v| quote!($crate::TinyStr8::new_unchecked(#v)))
        .collect();
    let variants = quote!(Box::new([#(#variants,)*]));

Am I wrong?

@Manishearth
Copy link
Collaborator

You'll need to expose an as_int() -> Option<u64> function on it and use it in the macro

@Manishearth
Copy link
Collaborator

Never mind there's already an Into impl

@Manishearth
Copy link
Collaborator

What's wrong with that approach?

@zbraniecki
Copy link
Owner Author

What's wrong with that approach?

When I try to do:

static LANGID: LanguageIdentifier = langid!("en-US")

I am getting an error:

error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
 --> unic-langid/examples/simple-langid.rs:6:37
  |
6 | static LANGID: LanguageIdentifier = langid!("en-US");

@Manishearth
Copy link
Collaborator

Ah. That could be fixed if we used SmallVec as well I guess. I guess you still can't allocate in const?

@Manishearth
Copy link
Collaborator

Anyway you can always put these in lazy_static

@zbraniecki
Copy link
Owner Author

Ah. That could be fixed if we used SmallVec as well I guess. I guess you still can't allocate in const?

Why would it be fixed? SmallVec constructor also is not const.

@zbraniecki
Copy link
Owner Author

@Manishearth - I think this is big enough to warrant a review on its own before I pile up more. The TinyStr crate is in review in a separate issue so this is only about the second commit.

@Manishearth
Copy link
Collaborator

We can add one for constructing from arrays :)

@zbraniecki
Copy link
Owner Author

yes pls :)

@Manishearth
Copy link
Collaborator

Is there a reason we're vendoring TinyStr? Is this the canonical location? May be worth maintaining in a separate repo

@zbraniecki
Copy link
Owner Author

Is there a reason we're vendoring TinyStr? Is this the canonical location? May be worth maintaining in a separate repo

No preference. If you think it's worth it, I can separate out.

@Manishearth
Copy link
Collaborator

That would be ideal, thanks 😄

Copy link
Collaborator

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the tinystr code, but this looks good!

script: Option<Cow<'static, str>>,
region: Option<Cow<'static, str>>,
variants: Vec<Cow<'static, str>>,
language: Option<TinyStr8>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment saying that these are guaranteed to be <=4

@zbraniecki zbraniecki force-pushed the tinystr branch 3 times, most recently from 81c1bef to 80462ce Compare August 9, 2019 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants