Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support unit tests for Arrays and remove a ToString overload, #1114 #1121

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

paulirwin
Copy link
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Adds unit tests for the Arrays class from Apache Harmony, removes Arrays.ToString<T>(T[], IFormatProvider)

Fixes #1114

Description

This adds the tests from Apache Harmony for many of the Arrays methods, and creates some Lucene.NET-specific tests for the remaining methods.

During writing these tests, I went to write a test for Arrays.ToString<T>(T[], IFormatProvider) using fr-FR as the culture so that it would produce something different than the invariant culture, and it produced problematic output. i.e. passing new float[] { 1.1f, 2.2f } results in the string "[1,1, 2,2]" due to that culture's decimal comma. This is clearly invalid output, as you cannot tell the difference between the list delimiter and the decimal comma. I looked into what Java does, and it does not have such an overload. It always uses the invariant format with a decimal point, even if your current locale is set to one with a decimal comma. Given that this method was only used in one place, to build a debugging/diagnostic string in FacetResult, I figured that it would be better to match Java's behavior here and remove this overload, using the invariant culture instead. This way we don't have to create tests for what is certainly buggy behavior that is only used in one place in our codebase and only for diagnostic purposes. Additionally, this invariant output more closely matches what you would write in code as well, better catering the output for that purpose.

@paulirwin paulirwin added the is:enhancement New feature or request label Jan 23, 2025
@paulirwin paulirwin requested a review from NightOwl888 January 23, 2025 04:40
/// <param name="array">The array to convert.</param>
/// <param name="provider">A <see cref="IFormatProvider"/> instance that supplies the culture formatting information.</param>
/// <returns>The converted array string.</returns>
public static string ToString<T>(T[] array, IFormatProvider provider)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a bug, it is a feature. In .NET, the default behavior is to use culture-sensitive formatting, but in Java the default is culture invariant. In most cases (ToString() methods), the output should be displayed in the culture of the current thread, as that is what .NET users expect to happen. So, technically, the default overload is the one that is broken by hard coding J2N.Text.StringFormatter.InvariantCulture instead of using J2N.Text.StringFormatter.CurrentCulture. But changing it is going to require a review and should probably be left to be done in another issue that considers culture-sensitivity as a whole after we have completed #924.

There are cases where we need to override this behavior, such as when there are tests that check the output of ToString(). Since the test framework randomly chooses the culture of the current thread, we can specify to use the invariant culture in those tests. I would say that this is one of those cases. We can leave the proper formatting of specific cultures up to the J2N tests to catch and only check the basic formatting that these methods do in the Lucene.NET codebase.

FYI - J2N.Text.StringFormatter is aware of collections and will do structural formatting on those if passed as the IFormatProvider. It also lowercases boolean true and false and ensures there is at least 1 zero in the decimal place position on floating point types. So, it is a closer match to Java behavior than using CultureInfo.InvariantCulture. But there is nothing at all in .NET or in Java like J2N.Text.StringFormatter.CurrentCulture. It is an attempt to make a Java-like format that is localized to the current thread. But we should leave testing of that for J2N to work out.

Copy link
Contributor Author

@paulirwin paulirwin Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, in FacetResult (the only usage), the formatted array is an array of strings, so culture has no effect either way, making this more of a theoretical discussion. But assuming that it was i.e. floating point numbers in other usages in the future, or that this code might become part of J2N for other users...

I disagree that this behavior is correct or not a bug. There is a difference between formatting just a number as a string in a way that might be useful to end users (in which case a culture is necessary, i.e. displaying in a UI or for subsequent round-trip parsing) and formatting an array. Apart from perhaps some math-inclined nerds like myself, array syntax is only useful for debugging and diagnostic purposes, not for end users. (Also, it is not a proper serialization, in which case you might want to use something like JSON instead, which also does not support decimal commas.) And if you do get that array printed in i.e. a log message, it is not useful for it to have a decimal comma for floating-point numbers if that is how your culture is configured. The primary benefit of formatting an array as a string is for programmers (or those that can understand data structures) to diagnose the values in an array, or perhaps even to copy/paste into i.e. a unit test. If you did that with values formatted with a decimal comma, that is not correct in C# or most programming languages. Java is correct here in using the invariant culture, and I do not see a good reason why our diagnostic messages should differ from Java in this regard (again, assuming someone adds a use for this method on numbers in the future).

And that's one problem even if there's just one number in the array. The second problem is the list separator, which is hardcoded to ", " (although correctly, IMO, per the above, but it's a problem if you're formatting the values with a culture). When the culture uses a decimal comma, this breaks the formatting of the list due to ambiguity. This theoretically could be fixed if we had a CultureInfo parameter (instead of IFormatProvider) by using TextInfo.ListSeparator, but AFAICT there's no way to get that from IFormatProvider. Regardless, even if we could, and say it was ';' for that culture, the formatted array string of "[1,1; 2,2]" is not close to being valid in most programming languages, which is the main benefit to the reader of the output of Arrays.ToString, either from a cognitive understanding of the diagnostic or to actually put into code like a unit test. Say you had 1000 items in the array. That would require extensive manual editing to put into a unit test to repeat behavior from that diagnostic string, whereas the string "[1.1, 2.2]" would at least be valid syntax as-is for double[] or object[]. (Edit: using C# collection literals, anyways. It would just require a trivial change of brackets to braces for array initializer syntax in older C# or Java.) Of course there are easy examples where this would not be valid C#/Java, but it at least can get you close for many primitive types.

There is a good reason why Java's Arrays.toString is culture-invariant: because Java code is. I think we should match the same behavior now and going forward for formatting Arrays as a string, which again is only useful for diagnostic purposes as if it were code, and not serialization or end-user use.

ChatGPT perhaps said it best when asked:

Java's Arrays.toString(...) methods are not locale-aware because they are designed to provide a consistent, straightforward string representation of arrays for debugging and logging purposes, rather than being formatted for user-facing output. Locale-sensitive formatting is typically handled by dedicated classes like NumberFormat or MessageFormat, keeping Arrays.toString(...) lightweight and predictable across different environments.

@@ -117,7 +117,7 @@ public string ToString(IFormatProvider? provider)
sb.Append("dim=");
sb.Append(Dim);
sb.Append(" path=");
sb.Append(Arrays.ToString(Path, provider));
sb.Append(Arrays.ToString(Path));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this. See my comment in Arrays.ToString(T[], IFormatProvider).

for (int i = 0; i < d.Length; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly this value
assertTrue("Failed to fill float array correctly", d[i] == float.MaxValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.SingleToSortableInt32(d[i]) == NumericUtils.SingleToSortableInt32(float.MaxValue)

for (int i = 0; i < d.Length; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly this value
assertTrue("Failed to fill double array correctly", d[i] == double.MaxValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.DoubleToSortableInt64(d[i]) == NumericUtils.DoubleToSortableInt64(double.MaxValue)

for (int i = 0; i < 400; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly not this value
assertTrue("Filled elements not in range", d[i] != val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.DoubleToSortableInt64(d[i]) != NumericUtils.DoubleToSortableInt64(val)

for (int i = 400; i < d.Length; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly this value
assertTrue("Failed to fill double array correctly", d[i] == val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.DoubleToSortableInt64(d[i]) == NumericUtils.DoubleToSortableInt64(val)

for (int i = 0; i < 400; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly not this value
assertTrue("Filled elements not in range", d[i] != val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.SingleToSortableInt32(d[i]) != NumericUtils.SingleToSortableInt32(val)

for (int i = 400; i < d.Length; i++)
{
// ReSharper disable once CompareOfFloatsByEqualityOperator - we're looking for exactly this value
assertTrue("Failed to fill float array correctly", d[i] == val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to how JIT screws this up on x86 in .NET Framework, we should compare raw bits.

NumericUtils.SingleToSortableInt32(d[i]) == NumericUtils.SingleToSortableInt32(val)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is:enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Support unit tests for Arrays
2 participants