Quantizer test fails locally #597

Open
elshize opened this issue Dec 22, 2024 · 1 comment

elshize commented Dec 22, 2024

This is strange: the test passes in CI but recently started failing on my local machine. It fails for 32 bits with a max of 1.0. Looks like a floating-point error, but I don't know what is different now; needs investigating.

elshize added the bug label on Dec 22, 2024
elshize self-assigned this on Dec 22, 2024

elshize commented Dec 22, 2024

Not sure why the issue only appears sometimes, but I think the current solution is just not correct. The correct types and order of operations to get the expected values at 0 and MAX are:

LinearQuantizer::LinearQuantizer(float max, std::uint8_t bits)
    : m_range((1UL << bits) - 1U), m_max(max) {
    if (max <= 0.0) {
        throw std::runtime_error(
            fmt::format("Max score for linear quantizer must be positive but {} passed", max)
        );
    }
    if (bits > 32 or bits < 2) {
        throw std::runtime_error(fmt::format(
            "Linear quantizer must take a number of bits between 2 and 32 but {} passed", bits
        ));
    }
}

auto LinearQuantizer::operator()(float value) const -> std::uint32_t {
    if (value < 0 || value > m_max) {
        throw std::invalid_argument(
            fmt::format("quantized value must be between 0 and {} but {} given", m_max, value)
        );
    }
    // This is always in the [0, 1] range, since 0 <= value <= m_max.
    auto normalized_value = static_cast<double>(value / m_max);
    return static_cast<std::uint32_t>(normalized_value * (m_range - 1)) + 1;
}

This removes scale (which may introduce some errors imo) and casts the normalized value to double, so that it doesn't lose precision once it's multiplied by the range.
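To make the endpoint behavior concrete, here is a minimal sketch of the same arithmetic as a free function. The name `quantize` and the `constexpr` form are assumptions made for illustration; they are not part of the actual class. It needs C++14 or later. The last check shows one plausible way a single-precision multiplication could go wrong at 32 bits, though the issue does not confirm this is the cause of the local failure.

```cpp
#include <cstdint>

// Hypothetical free-function version of the quantizer above, for illustration only.
// Assumes 0 <= value <= max, max > 0, and 2 <= bits <= 32; no validation is done here.
constexpr auto quantize(float value, float max, std::uint8_t bits) -> std::uint32_t {
    // Shift in 64 bits so that bits == 32 is well-defined on every platform.
    const auto range = static_cast<std::uint32_t>((std::uint64_t{1} << bits) - 1U);
    // Normalize to [0, 1], then widen to double before scaling.
    const auto normalized = static_cast<double>(value / max);
    // double represents every 32-bit integer exactly, so the product stays
    // within [0, range - 1] and the cast back to std::uint32_t is safe.
    return static_cast<std::uint32_t>(normalized * (range - 1)) + 1;
}

// Endpoints for 32 bits and a max of 1.0: 0 maps to 1, max maps to 2^32 - 1.
static_assert(quantize(0.0F, 1.0F, 32) == 1U, "value 0 maps to 1");
static_assert(quantize(1.0F, 1.0F, 32) == 4294967295U, "value max maps to range");

// Assuming IEEE 754 single precision with round-to-nearest: range - 1 = 2^32 - 2
// is not representable as float and rounds up to 2^32, which no longer fits in
// std::uint32_t.
static_assert(static_cast<float>(4294967294U) == 4294967296.0F,
              "2^32 - 2 rounds to 2^32 in float");
```

With this mapping the quantized values fall in [1, range], so 0 is never produced for an in-range input.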
