Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force double quotes on integers and null in string #112

Merged
merged 1 commit into from
Jul 4, 2022

Conversation

jnystad
Copy link
Contributor

@jnystad jnystad commented Jun 30, 2022

Fixes #106

Also fixes quoting for integer values in strings.

I guess the double.TryParse was an optimization to avoid parsing long string values if unnecessary? If so, this might impact performance when serializing a lot of longer string values. Might add similar parsing attempts for bools and integers (null check is trivial), but not sure if C# parsers match the YAML spec exactly.

@xoofx
Copy link
Owner

xoofx commented Jun 30, 2022

Hm, I think that it was to avoid Schema.TryParse which is a lot more costly than the early exit double.TryParse

@jnystad
Copy link
Contributor Author

jnystad commented Jul 1, 2022

Yes, that's what I thought. Any suggestion to fix? Perhaps parsing a substring would be sufficient.

@nickcampau
Copy link

I just ran into an issue that was caused by this issue yesterday. To my utter amazement this PR had just been opened a few hours before. I want to shoutout @jnystad 🥇 your timing couldn't have been more perfect. I was literally cloning down the repo to look into fixing this issue when I saw this PR.

I'd like to suggest that this PR not wait for any further performance enhancement modifications since it addresses the main issue of YAML non-compliance. Instead a new PR be created to improve the performance (if possible) to investigate those concerns.

@xoofx
Copy link
Owner

xoofx commented Jul 1, 2022

I'd like to suggest that this PR not wait for any further performance enhancement modifications since it addresses the main issue of YAML non-compliance. Instead a new PR be created to improve the performance (if possible) to investigate those concerns.

This behavior has been around for the past 5 years, so there is no urgency to land this PR.

Yes, that's what I thought. Any suggestion to fix? Perhaps parsing a substring would be sufficient.

Could you run a benchmark before/after using BenchmarkDotNet on a significant yml file to see the impact?

@xoofx xoofx added the bug label Jul 3, 2022
@jnystad
Copy link
Contributor Author

jnystad commented Jul 4, 2022

Given the customizable nature of SharpYaml, and specifically since Schema in SerializerContext can be overridden I think any optimizations added here may come into conflict with custom Schema handling and therefore do not belong here.


Also, I'm not sure I have the overview of the library to be certain I'm creating useful performance tests.

A preliminary investigation using Stopwatch and loops indicate that the actual Serialize step has a negligible performance cost compared to initializing a new Serializer (i.e. creating a new Serializer and running Serialize takes ~500ms on a large object while the next thousand with the same Serializer takes ~5ms in total).

Also, I don't think the regexes used by the real parser are significantly slower than any other check one can do, at least not in a scale that matters. They probably return fast enough regardless of string length considering the limited matches they produce.

But as I said, I don't know the internal mechanics enough to determine if I'm actually testing this properly (as in, is there some caching where feeding the same deserialized object to the same Serializer somehow skips the actual serialization?).


Regarding the urgency, I wonder how existing users of SharpYaml and YamlDotNet (which has the same flaw with at least integer values) live with this. Perhaps they override the default behaviour (like I have done where I'm currently using SharpYaml until this fix is merged).

@xoofx
Copy link
Owner

xoofx commented Jul 4, 2022

A preliminary investigation using Stopwatch and loops indicate that the actual Serialize step has a negligible performance cost compared to initializing a new Serializer (i.e. creating a new Serializer and running Serialize takes ~500ms on a large object while the next thousand with the same Serializer takes ~5ms in total).

Afair, It is recommended to use a Serializer per thread and keep it around, as it is storing metadata information that you really don't want to recompute.

Also, I don't think the regexes used by the real parser are significantly slower than any other check one can do, at least not in a scale that matters. They probably return fast enough regardless of string length considering the limited matches they produce.

Right, this could be also further optimized with .NET 7+ regex optimizations in the future.


Anyway, thanks for the fix, I'm gonna merge it. The regression in perf might be negligeable and if someone notice it, we will have a candidate to fix it 😅

@xoofx xoofx merged commit a8fcaa5 into xoofx:master Jul 4, 2022
@jnystad
Copy link
Contributor Author

jnystad commented Jul 5, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The string "null" should be deserialized as a string, not as null
3 participants