Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement time-domain convolution and use it for integers #545

Merged
merged 11 commits into from
Nov 5, 2024

Conversation

martinholters
Copy link
Member

This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.

In particular, for "small" u and v, time-domain convolution would be more efficient and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue:

julia> conv(zeros(0), zeros(0))
ERROR: ArgumentError: invalid Array dimensions

julia> conv(zeros(0), zeros(1))
ERROR: InexactError: trunc(Int64, -Inf)

julia> conv(zeros(1), zeros(0))
ERROR: InexactError: trunc(Int64, -Inf)

The time-domain convolution form this PR does the right thing:

julia> conv(zeros(Int, 0), zeros(Int, 0))
Int64[]

julia> conv(zeros(Int, 0), zeros(Int, 1))
Int64[]

julia> conv(zeros(Int, 1), zeros(Int, 0))
Int64[]

Another point worth considering is whether to use time-domain convolution for more input types. Consider

julia> conv(zeros(BigFloat, 1), zeros(BigFloat, 1))
ERROR: type BigFloat not supported

julia> conv(zeros(Rational{Int}, 1), zeros(Rational{Int}, 1))
1-element Vector{Float64}:
 0.0

For BigFloat, this would change from throwing an error to working, but potentially slow, which I'd consider an improvement. For Rational, I wonder whether doing the computation on Rationals (and also returning Rationals) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for Float32 and Float64 input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.

Closes #411, fixes #410.

Copy link

codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 98.50746% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.83%. Comparing base (1dae6a3) to head (34f42b2).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/dspbase.jl 98.48% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
+ Coverage   97.81%   97.83%   +0.01%     
==========================================
  Files          18       19       +1     
  Lines        3204     3226      +22     
==========================================
+ Hits         3134     3156      +22     
  Misses         70       70              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wheeheee
Copy link
Member

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Copy link
Member

@wheeheee wheeheee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe Val(N) would help with type stability

src/dspbase.jl Outdated Show resolved Hide resolved
src/dspbase.jl Outdated Show resolved Hide resolved
@martinholters
Copy link
Member Author

martinholters commented Feb 23, 2024

maybe Val(N) would help with type stability

It does for N>10, which may not be too relevant, but then again ... why not?

@martinholters
Copy link
Member Author

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Hm. Export tdconv and fdconv (names subject to bike-shedding) and have conv do some best-effort choosing among them? Or add kwarg to conv to force a choice instead of the additional exports?

@wheeheee
Copy link
Member

I kind of like the names tdconv and fdconv, but maybe adding a kwarg is better, since people might also want to choose the overlap-save algorithm and that's one more function.

src/dspbase.jl Outdated Show resolved Hide resolved
@martinholters
Copy link
Member Author

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

The algorithm set to :auto means "select based on problem size", which is the default for types known to be reasonably FFT-able (Float32, Float64, and their Complex versions). For other types the default is :direct. This lets user opt into :auto for other types at their own risk, but provides a sane default.

@wheeheee
Copy link
Member

wheeheee commented Feb 26, 2024

I took reference to FastConv.jl (licensed under the "MIT License (Expat)") and wrote a version with an explicit loop over CartesianIndices for 1-based arrays (unsure about offset indices). Here it is:

function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i)+size(k, i)-1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end

Benchmarks:

julia> x = rand(100); y = rand(100); out = rand(199);

julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  9.000 μs   38.000 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     9.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.216 μs ± 602.871 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █    █                                            ▃     ▂
  ▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁█ █
  9 μs         Histogram: log(frequency) by time      10.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.510 μs    5.720 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.583 μs ± 101.580 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █                                                   ▂
  ▄▁▁█▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▄▁▁█▁▁▂ ▂
  1.51 μs         Histogram: frequency by time        1.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@martinholters
Copy link
Member Author

Yep. I used the generator-based version because it allowed to derive the output type based on the actual values. With the latest change from _conv_td to _conv_td!, a simple loop like you have provided is probably better. Once we're sure about the API, I'll try to optimize, probably by switching to such a loop.

@wheeheee
Copy link
Member

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

@martinholters
Copy link
Member Author

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

I was primarily referring to the keyword-based API. Once that's settled, the organization of the conv methods can be streamlined. And the heuristic when to choose time vs frequency domain is mostly a placeholder for now, that needs adjustment either way, but that's somewhat orthogonal.

So let's discuss API. Some random thoughts I have gathered while working on this:

  • We should allow the user to explicitly select time or frequency domain. I'm not entirely convinced explicit distinction between "simple FFT" and "overlap-save" is needed. Would there be a use case where a caller needs to decide this?
  • We should have conv(u,v) do something reasonable by default. To me, that means at least to do small convolutions for non-FFT-able datatypes in time domain and large convolutions of FFT-able datatype in frequency domain. For large integer convolutions, it is not clear whether the long runtime of time domain or the rounding issues of frequency domain are the lesser evil, so an automatic mode that only considers problem size but ignores input types might also be desirable.
  • For the poly-algorithm conv, the result type should be independent of the algorithm, and I guess the only thing take really makes sense is eltype(conv(u,v)) == promote_type(eltype(u), eltype(v)).
  • If we decide to implement conv as first allocating the output and then calling a conv! function, we should consider making the latter public.
  • The implementation in the first commit here uses the collect machinery to derive the output eltype based on the actual values. That may have applications in rare situations, but I wouldn't want to make the API more complicated just to allow this.
  • We might want to be a bit careful with swapping u and v in the light of non-commutative number types.

I think the current kwargs approach addresses the first three items, although choosing between :auto and :direct based on the eltype might come a bit surprising. I'm not sure I like that. The alternative would be two different auto-like keywords, I guess.

@wheeheee
Copy link
Member

wheeheee commented Feb 27, 2024

My opinion, for what it's worth:

  • I initially thought it possible in rare cases that the slower algorithm is chosen, or that it could be platform-dependent, so that it could be nice to give users a choice by exposing os vs fft. As this doesn't seem to be the case, confusing users with these options seems pointless and without other benefits, I take back that suggestion.
  • I think many would find conv! welcome, totally in favour of that.
  • WRT swapping arguments, since AFAIK there isn't a trait or a general way to determine commutativity, we should probably define another union of types that are commutative and where swapping is legal, for correctness? Along with documentation, @warn for non-explicitly-allowed types could be warranted since that can really slow things down, though I'm not sure if @warn does bad things even on the slow path...
  • On integer convolutions: I'm on the side again of :direct by default, so path 2 in Integer convolutions have rounding errors #410, with another warning if the inputs aren't reasonably small. This would obviate the need for another keyword argument, if I understand you correctly, and another benefit, as was mentioned in that issue, is conforming with SciPy's behaviour.
  • Additionally, not sure if when conv! is in use, and the output array is of a wider type (although again, not too sure how to determine that), we should promote the elements from u and v first before multiplying and adding.

@martinholters
Copy link
Member Author

I'm no fan of @warnings, unless there is also an option to say "I know what I'm doing, no need to flood me with warnings". And including that option would make the API somewhat clunky, in my opinion.

Otherwise, this is slowly getting shape -- partially in code, partially in my head for now. One thing I haven't made up mind about is how conv! should deal with offset axes. A short recap for conv:

  • If all input axes are OneTos, so will be the output axes, and if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $1,\ldots,M+N-1$ as a OneTo (for every axis).
  • If there is at least one non-OneTo axis, and if the input indices are $0,\ldots,M-1$ and $0,\ldots,N-1$, the output indices will be $0,\ldots,M+N-2$ (for every axis). Shifts in the input are reflected in the output, so if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $2,\ldots,M+N$. This directly matches the $y[n]=\sum v[m]\cdot u[n-m]$ definition.

The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to conv! matches what conv returns, all is well, but what if not. If the output has a OneTo axis, just storing the results starting at 1 seems ok, I'd say. If the output and inputs have offsets, but they don't "match", if all required indices are present in the output, storing the result there and zeroing the rest sounds reasonable. If the result does not fit into the output, a BoundsError seems warranted. A silent truncation might also be considered. But what if all inputs are OneTos but the output is not? Put the results starting at 2, starting at 1, or error?

Of course, erroring for unclear cases would open the possibility to decide later on, so that might be the safest route for now.

@wheeheee
Copy link
Member

wheeheee commented Feb 28, 2024

Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider * potentially non-commutative, I suspect we could still retain the performance benefits of swapping u and v by simply changing the iteration order of the CartesianIndices.

However, as it is, there is the potential for type instability if the axes of each CartesianIndices are not of the same type, although union splitting may automatically take care of that. If this is implemented, I think extracting the kernel out and using a function barrier would at least look nicer, if it isn't strictly necessary. Scratch that, I see that they are all UnitRanges now...

we should promote the elements from u and v first before multiplying and adding.

I also just realized that muladd automatically promotes first.

@galenlynch
Copy link
Member

These all seem like great changes!

@martinholters
Copy link
Member Author

This turns out to be a but more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the working of the algorithm kwarg also may need some more consideration. Current status:

  • :direct chooses time-domain direct convolution.
  • :fft chooses an FFT-based algorithm; whether simple or overlap-save is chosen automatically. Should we allow the caller to choose?
  • :auto picks the faster (well, presently modulo the shortcomings of the super-simple heuristic) of those.

The default for algorithm is chosen based on the (output) eltype: :auto if it is Float32, Float64 or a Complex of those, :direct otherwise. Another option would be to have another keyword, say :fastest, to do what :auto does now, and have :auto be what the default does now (i.e. being equal to :direct or :fastest depending on eltype). Would the latter be better?

Another alternative would be to replace the algorithm keyword with something like allow_fft_based::Boolean where false corresponds to current :direct and true to current :auto, with the default again decided by eltype. Would that be preferable?

@galenlynch
Copy link
Member

I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution to this and figure out which alg to choose, I found it very hard to come up with heuristics since dimensionality, size of the problem, element type, number of threads available, and architecture (eg avx instructions available) all come into play. Doing a best effort selection and letting users who really care override seems better imo.

@martinholters
Copy link
Member Author

That would then include the option the choose between simple FFT and overlap-save, right? (I had this in an earlier iteration, can revive.)

@mbaz
Copy link
Contributor

mbaz commented Feb 29, 2024

I agree with giving users the ability to choose specific algorithms. Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

@wheeheee
Copy link
Member

wheeheee commented Mar 1, 2024

I think the second option with :auto replacing the current default behaviour seems safer.
Perhaps a simplified general-user-facing conv(u, v; mode=:auto(/:unsafe_fast)) would be friendlier, with :unsafe_fast enabling conversion to floats and FFT for numbers that aren't IEEE floats / complex.
And, for those who need it, put the additional, more involved options in conv!(out, u, v; alg), also defining an allocate_output for convenience, where one can choose a specific algorithm, :fft, or the options in conv. The different keyword arguments would be confusing though.

@martinholters
Copy link
Member Author

Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

Do you mean fft behave like the present auto, i.e. also allowing time-domain? That would be a bit surprising, so probably not. So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

@wheeheee I don't think allowing different kwargs in conv and conv! for similar purposes would help simplify the interface for users in any way.

@mbaz
Copy link
Contributor

mbaz commented Mar 1, 2024

So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

Yes, that's what I meant.

test/dsp.jl Outdated Show resolved Hide resolved
@martinholters
Copy link
Member Author

Another pair of eyes proof-reading the added warning to the docstring would be welcome. Then, this should be ready to (squash!)-merge.

src/dspbase.jl Outdated Show resolved Hide resolved
@wheeheee
Copy link
Member

Just a side note that from 1.10 -> 1.11 the tests take something like 0.8x more time, which imo is a pretty serious performance regression. Can't be certain but probably not this PR's fault, since it's present in all sections. Locally, it's not that bad, but still slightly worse.

@martinholters
Copy link
Member Author

Just a side note that from 1.10 -> 1.11 the tests take something like 0.8x more time, which imo is a pretty serious performance regression. Can't be certain but probably not this PR's fault, since it's present in all sections. Locally, it's not that bad, but still slightly worse.

Julia 1.11.0 had some compile time (and other latency) regressions, which apparently have only been partially fixed in 1.11.1. I'm inclined to blame those, given that the tests are typically pretty compile-heavy. So with some luck, things will improve with upcoming Julia versions. If not, we should inspect this more closely.

src/dspbase.jl Outdated
Comment on lines 703 to 708
calc_index_offset(ao::Base.OneTo, au::Base.OneTo, av::Base.OneTo) = 1
calc_index_offset(ao::Base.OneTo, au::AbstractUnitRange, av::AbstractUnitRange) = # first(au) + first(av) - 1
throw(ArgumentError("output must have offset axes if the input has"))
calc_index_offset(ao::AbstractUnitRange, au::Base.OneTo, av::Base.OneTo) = # 2
throw(ArgumentError("output must not have offset axes if none of the inputs has"))
calc_index_offset(ao::AbstractUnitRange, au::AbstractUnitRange, av::AbstractUnitRange) = 0
Copy link
Member

@wheeheee wheeheee Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just ran into some trouble while testing whether the PR would partially resolve #497. Briefly, this part throws an ArgumentError for SOneTo axes, although they are not offset axes. Should we consider using Base.has_offset_axes for this part instead of testing whether axes are OneTo?
edit: ok, I guess this would mean a 1-based OffsetArray would be treated like an Array. Maybe there could be more documentation on what is meant by "offset axes"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is unfortunate. I'll think about it while I'm away from the computer for the rest of the week.

Copy link
Member Author

@martinholters martinholters Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In brief, the problem manifests itself as:

conv(@SVector([1,1]), @SVector([1,2,3]))
ERROR: MethodError: no method matching similar(::SVector{2, Int64}, ::Type{Int64}, ::Tuple{UnitRange{Int64}})

or, after using OffsetArrays, as

julia> conv(@SVector([1,1]), @SVector([1,2,3]))
4-element OffsetArray(::Vector{Int64}, 2:5) with eltype Int64 with indices 2:5:
 1
 3
 5
 3

which may also come unexpected (note the offset axis). The result one would actually desire here is

4-element SVector{4, Int64} with indices SOneTo(4):
 1
 3
 5
 4

This makes me really skeptical of the approach of treating anything unknown as having an offset, as that may lock us into unwanted behavior to avoid breakage later on.

So my current thought is: Bail on anything not OneTo. But do so via a function to which additional methods can be added to specify the desired behavior. Then we have the option of providing package extensions e.g. for StaticArrays making SOneTo behave in the no-offset way and for OffsetArrays making IdOffsetRange do the offset thing. Yes, that means any unsupported axis type will yield an error. But I'd prefer that to doing something arbitrary to later find out that something else would have been better. I think I will try that approach with StaticArrays and OffsetArrays as trial balloons to see how it feels. Opinions so far?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first draft of the extension-based implementation looks promising to me. Especially for StaticArrays, that is probably the only way to get it type-stable. Downside is that this won't work for Julia older than 1.9. But maybe we want to stop supporting anything before 1.10 anyways?

@martinholters
Copy link
Member Author

Okay, pushed a version utilizing package extensions. The StaticArrays part is just a proof of concept for now and needs tests, but might better be added in a separate PR later.

@martinholters
Copy link
Member Author

@wheeheee are you ok with this in the present shape? Then, once #575 has landed, I'll remove the last commit with the StaticArrays stuff (to reappear in a separate PR) and merge. Ok with you?

@wheeheee
Copy link
Member

Sounds good! Not familiar with package extensions, so maybe it would be better if there's another reviewer...

@martinholters martinholters merged commit 7652e2d into master Nov 5, 2024
11 checks passed
@martinholters martinholters deleted the mh/tdconv branch November 5, 2024 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants