Tutorial notebook DAMP #920

Open · wants to merge 17 commits into main

Conversation

@NimaSarajpoor (Collaborator) commented Oct 3, 2023

This PR is a replacement for PR #872.

This PR is to address #606.


I will start by implementing the original algorithm, as proposed in the paper, as closely as possible. Then, I will add notes on how to optimize the code. I will consider the MATLAB code too.


Notes

  • In contrast to the original paper, the MATLAB code currently uses the following logic to compute the initial chunksize for backward processing: 2 ^ next_pow2(8 * m); note the factor of 8.

  • As a follow-up to the previous point: what happens if m is already a power of two? (See the small illustration below.)
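
For reference, here is a small illustration (a hypothetical helper written for this discussion, not the notebook's code) of the usual next-power-of-two rule:

import math

def next_pow2(x):
    # the smallest power of two that is >= x, i.e. 2 ^ nextpow2(x) in MATLAB terms
    return int(2 ** math.ceil(math.log2(x)))

next_pow2(8 * 50)  # 512
next_pow2(64)      # 64 -- if m is already a power of two, it is returned unchanged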


@NimaSarajpoor (Collaborator, Author) commented:

@seanlaw
For now, I have only implemented the backward processing algorithm as proposed in the DAMP paper. I have commented on the parts that can be changed to enhance the code.

@codecov (bot) commented Oct 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (857063c) 98.93% compared to head (5c110a1) 98.93%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #920   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14292    14292           
=======================================
  Hits        14140    14140           
  Misses        152      152           


Review comment on docs/Tutorial_DAMP.ipynb (@@ -0,0 +1,378 @@):

@NimaSarajpoor (Collaborator, Author) commented Oct 14, 2023

Line #24.        For the given index i, segmentize the array np.arange(i) into 

Is there a better way to describe the task of this function? How about:

"Given the index i, divide the array np.arange(i) into chunks. Ensure the last chunk's size is chunksize_init, with the preceding chunks doubling in size. The output consists of (start, stop) indices of each chunk."

Also, we may add:

The left-most chunk always starts at 0.
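
For concreteness, here is a minimal sketch of that behaviour (a hypothetical helper written for illustration, not the notebook's implementation; the name get_chunks_range and the return format are assumptions):

import numpy as np

def get_chunks_range(i, chunksize_init):
    # Split np.arange(i) into chunks whose sizes, read from right to left, are
    # chunksize_init, 2 * chunksize_init, 4 * chunksize_init, ..., with the
    # left-most chunk truncated so that it always starts at 0.
    # Returns an array of (start, stop) pairs ordered from left to right.
    ranges = []
    stop = i
    size = chunksize_init
    while stop > 0:
        start = max(stop - size, 0)
        ranges.append((start, stop))
        stop = start
        size = 2 * size
    return np.array(ranges[::-1])

# Example: get_chunks_range(20, 4) -> [[ 0  8] [ 8 16] [16 20]]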



@seanlaw (Contributor) commented Oct 16, 2023

This function seems weird. It's not clear why you need the last chunk to end up with size chunksize_init.

Something feels backwards here. The intent is hard to follow.

@seanlaw (Contributor) commented:

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

@NimaSarajpoor (Collaborator, Author) commented Oct 25, 2023

Something feels backwards here

Well, it is backward (I mean... the process itself is backward). But I trust your lens! Please allow me to think more and see if I can come up with something more elegant.

Also, do we actually need to call the function multiple times? Or is it possible to call the function once and then shift the indices somehow?

I wanted to do that, but I decided to start simple. What we can do is keep shifting the start / stop indices of all chunks by one in each iteration. So, for instance, if I get the chunks for the subsequences in T[:idx], then I can find the chunks for T[:(idx+1)] by just shifting the start / stop indices of the chunks by one to the right. We just need to keep track of the left-most chunk, though, as its length can become a power of two at some point; in that case, the number of chunks will be increased by one.
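
A rough sketch of this shifting idea (my own illustration, not the PR's code; it assumes chunks_range is a 2D integer array of (start, stop) pairs ordered left to right, with at least two chunks):

import numpy as np

def shift_chunks_right(chunks_range):
    # Go from the chunks of T[:idx] to the chunks of T[:(idx + 1)]: shift every
    # (start, stop) boundary one step to the right while keeping the left-most
    # chunk pinned at 0. Once the left-most chunk grows beyond twice the size
    # of its right neighbour, split it, so the number of chunks grows by one.
    chunks_range = chunks_range + 1
    chunks_range[0, 0] = 0
    leftmost_len = chunks_range[0, 1] - chunks_range[0, 0]
    neighbour_len = chunks_range[1, 1] - chunks_range[1, 0]
    if leftmost_len > 2 * neighbour_len:
        split = chunks_range[0, 1] - 2 * neighbour_len
        chunks_range = np.vstack(
            [[0, split], [split, chunks_range[0, 1]], chunks_range[1:]]
        )
    return chunks_range

# e.g. shift_chunks_right(np.array([[0, 16], [16, 24], [24, 28]]))
# -> [[ 0  1] [ 1 17] [17 25] [25 29]]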

@NimaSarajpoor (Collaborator, Author) commented Oct 25, 2023

[FYR... before I forget]
Another approach is to ALWAYS OFFSET FROM THE QUERY INDEX idx by s, 2s, 4s, 8s, and so on, where s is a power of two. The good thing is that the difference between any two consecutive offsets is STILL a power of two. This may result in cleaner code (see the small illustration below). IIRC, I think I saw something similar in the MATLAB code before. Going to check that as well.
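
A tiny illustration of this offset bookkeeping (my reading of the idea, not actual code; s and idx are arbitrary example values):

s = 256       # a power of two, e.g. next_pow2(m)
idx = 10_000  # query index

boundaries = []
offset = s
while idx - offset > 0:
    boundaries.append(idx - offset)  # idx - s, idx - 2s, idx - 4s, ...
    offset *= 2
boundaries.append(0)  # the left-most chunk is truncated at 0

# consecutive boundaries are s, 2s, 4s, ... apart, so every chunk (except
# possibly the left-most one) has a power-of-two length
chunks = list(zip(boundaries[1:], boundaries[:-1]))  # ordered right to left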

@NimaSarajpoor (Collaborator, Author) commented Oct 29, 2023

The MATLAB code uses the following lines for updating the start / stop of each chunk:

% Query is at index `i`, i.e. `query = T(i:i+SubsequenceLength-1)`
% X is the chunk size, initially set to `2 ^ nextpow2(8 * SubsequenceLength)`

X_start = i - X + 1 + (expansion_num * SubsequenceLength);
X_end = i - (X / 2) + (expansion_num * SubsequenceLength);

% and then
approximate_distance = min( real(MASS_V2(T(X_start:X_end), query)));
X = X * 2;
expansion_num = expansion_num + 1;

The term (expansion_num * SubsequenceLength) accounts for the elements of the last subsequence in the new chunk. To keep the chunk length unchanged, X_start is shifted by the same amount.


Note 1:
According to the paper / MATLAB code, the length of the chunk, T in core.mass(Q, T), (NOT the number of subsequences) should be a power of two.

Note 2:
The reason for using a power-of-two chunk length is to speed up the MASS algorithm (according to the authors).

So, I did a quick check to see how much having a power-of-two length affects the performance.

import time

import numpy as np
import stumpy
from stumpy import core

seed = 0
np.random.seed(seed)

T = np.random.rand(10_000_000)
m = 50

T, M_T, Σ_T, T_subseq_isconstant = stumpy.core.preprocess(T, m)

query_idx = 600000
start = 0  # `start` was not defined in the original snippet; assuming every chunk starts at index 0
t_lst = []
for stop in range(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1):
    time_start = time.time()

    QT = core.sliding_dot_product(
        T[query_idx : query_idx + m],
        T[start : stop],
    )

    D = core._mass(
        T[query_idx : query_idx + m],
        T[start : stop],
        QT=QT,
        μ_Q=M_T[query_idx],
        σ_Q=Σ_T[query_idx],
        M_T=M_T[start : stop - m + 1],
        Σ_T=Σ_T[start : stop - m + 1],
        Q_subseq_isconstant=T_subseq_isconstant[query_idx],
        T_subseq_isconstant=T_subseq_isconstant[start : stop - m + 1],
    )

    time_end = time.time()
    t_lst.append(time_end - time_start)

And then I plot it:

import matplotlib.pyplot as plt

indices = np.arange(2 ** 20 - 1000 - 1, 2 ** 20 + 1000 + 1)
indices = indices[2:]  # drop the first two measurements
t_lst = t_lst[2:]

idx = np.flatnonzero(indices == 2 ** 20)[0]

plt.figure(figsize=(20, 5))
plt.scatter(indices[idx], t_lst[idx], color='r', label='chunksize = 2 ** 20')
plt.plot(indices[idx-200 : idx+200], t_lst[idx-200 : idx+200])
plt.ylabel('running time')
plt.legend()
plt.show()
[image: running time vs. chunk size, with chunksize = 2 ** 20 highlighted]

Well, it seems the running time for the chunk size 2 ** 20 is low (not necessarily the lowest), but it should be okay. To remain faithful to the paper, I am going to keep a power-of-two length for each chunk.

@NimaSarajpoor (Collaborator, Author) commented:

@seanlaw
I have provided a few comments. Please let me know what you think.

Review comment on docs/Tutorial_DAMP.ipynb (@@ -0,0 +1,569 @@):

@NimaSarajpoor (Collaborator, Author) commented Nov 19, 2023

Line #28.        while start < chunk_stop:

Is this code readable? I tried to vectorize it but couldn't figure out how. If the number of subsequences in each chunk were a power of two, I think I would be able to vectorize it. However, according to the paper, the size of the chunk itself should be a power of two; in other words, the number of subsequences is 2 ** num - m + 1.

Since I couldn't find a cleaner solution, I am going with naive_get_range_damp for now.



@seanlaw (Contributor) commented Nov 20, 2023

I think I need to see more examples of what the inputs/outputs are supposed to be in order to understand what is expected. Right now, I'm very confused. Can you give me some concrete examples and any gotchas (i.e., when things aren't perfectly square)?

Review comment on docs/Tutorial_DAMP.ipynb (@@ -0,0 +1,569 @@):

@NimaSarajpoor (Collaborator, Author) commented Nov 19, 2023

Line #67.            PL[start : stop - m + 1] = np.minimum(PL[start : stop - m + 1], D)

Previously, we had the variable pruned, which was a boolean vector (of length `len(T)-m+1`), where pruned[i] is True if the i-th subsequence is pruned. And, instead of the line above (i.e. line #67), we had:

mask = np.argwhere(D < bsf) + start

pruned[mask] = True

Recall that the paper's original algorithm does not compute PL[i] if pruned[i] is True. It just skips it. In that case, the original algorithm sets PL[i] as follows:

PL[i] = PL[i-1]

which makes it difficult to find the correct index of the discord. The MATLAB code does a hack instead:

PL[i] = PL[i-1] - 0.000001

and this does not seem to be a clean solution.

So, instead, I am updating the (approximate) matrix profile PL by using the computed distance profile D. This helps us avoid the hack, and I think it should not be computationally more expensive compared to np.argwhere(D < bsf).




Review comment on docs/Tutorial_DAMP.ipynb (@@ -0,0 +1,569 @@):

@NimaSarajpoor (Collaborator, Author) commented Nov 19, 2023

Lines 60-68 update the chunks_range by shifting it by one to the right in each iteration. I feel the code is not that clean! Still trying to figure out if there is a better way of updating the chunks range. Any thoughts?



@NimaSarajpoor (Collaborator, Author) commented Nov 19, 2023

@seanlaw

Would you please check out the comments and let me know what you think? In particular, I need your support to figure out:
(1) whether there is a better / cleaner way to get the chunks range (see the function naive_get_range_damp)
(2) whether there is a better way to update the chunks range in each iteration of the main function DAMP

Note [regarding the exclusion zone]:
Also, as discussed, I am considering excl_zone (instead of the hardcoded value m) to make the result align with what one obtains using stumpy.stump.

@seanlaw (Contributor) commented Nov 20, 2023

@NimaSarajpoor Let's start with these comments first before we dive into DAMP.

@NimaSarajpoor (Collaborator, Author) commented Nov 26, 2023

A few updates:

(1) The proposed get_damp_range function is added to the notebook.

(2) The running time of the DAMP algorithm increases drastically when m becomes larger. See the new example 3 in the notebook. The running time of naive DAMP (stumpy-based DAMP) is ~5 sec, while the running time of the DAMP algorithm is ~55 sec. I think it is mostly related to the computation of the sliding dot product using core.sliding_dot_product. If we use the NUMBA JIT-compiled function core._sliding_dot_product instead, the running time becomes ~41 sec. And, if we parallelize core._sliding_dot_product (by passing parallel=True to the njit decorator and using prange), the running time becomes ~12 sec. Still, this is longer than the stumpy-based naive DAMP.

  • Since I do not want to do any premature optimization, I am avoiding making any changes.
  • It might be worth checking the paper to see whether the authors compare DAMP with a parallelized STOMP algorithm. Is DAMP really better than stumpy.stump in a batch case?

(3) An experimental top-k DAMP code is provided on the supporting webpage. I tried to make some changes to our current DAMP to reflect the top-k code. I did a few tests and I think it works (i.e. the results are the same as the top-k discords from naive DAMP). Btw, I did not push these changes to this PR.

How it works:
In short, we just set bsf (the best-so-far score) to the k-th discord score, rather than the top-1 (see the sketch below). In contrast to the stumpy-based naive DAMP, we need to set k at the beginning of the algorithm.
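
A minimal sketch of that idea (my assumption of how it works, not the experimental code from the webpage):

import numpy as np

k = 3
topk = np.full(k, -np.inf)  # the k largest discord scores seen so far, kept in descending order

def update_topk(topk, score):
    # insert `score` if it beats the current k-th best and restore descending order
    if score > topk[-1]:
        topk[-1] = score
        topk[::-1].sort()  # in-place sort of the reversed view => descending order
    return topk

# inside DAMP, the backward-processing pruning threshold would then be
# bsf = topk[-1]  # the k-th best-so-far discord score, instead of the top-1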

(4) Is only ONE discord interesting?
I am not sure... but I FEEL that it should be useful in an online case (i.e. streaming data) when a user wants to see whether the current subsequence (containing the latest data) is an anomaly when compared with a set of usual patterns (e.g. see GOLDEN DAMP in the paper).

@seanlaw (Contributor) commented Nov 27, 2023

I think it is mostly related to the computation of sliding dot product using core.sliding_dot_product

I think it is worth profiling the code to see which function most of the time is being spent in. Does the paper offer any clues on what the timings should be? How does the Matlab code perform?

@NimaSarajpoor (Collaborator, Author) commented Nov 29, 2023

I think it is worth profiling the code to see which function most of the time is being spent in.

# example 

import numpy as np

seed = 100
np.random.seed(seed)

T = np.random.rand(100000)
m = 50
split_index = 5000

I profiled the computing time for:

  • the backward process, the forward process, and the chunks-update block.
  • Also, within both the backward process and the forward process, I profiled the computing time of QT (the dot product part) and core._mass.
# computing time [in seconds]

# in DAMP
backward_time:  49.28337216377258
forward_time:  8.625611543655396
chunkupdate_time:  0.16403627395629883

-----

# considering total calls of `_backward_process`
# QT is regarding the dot product part

backward_time_QT:  47.81996726989746
backward_time_mass:  0.7662274837493896

-----

# considering total calls of `_forward_process`
# QT is regarding the dot product part

forward_time_QT:  8.326124429702759
forward_time_mass:  0.1750011444091797

How does the Matlab code perform?

I ran DAMP version 2.0, provided on the supporting website, using the same input as Example 3 (of our notebook in this PR). The output matches (see below) and the running time is ~56 sec (which is close to our current implementation of DAMP).

# DAMP version 2.0 MATLAB code 

Results:
Pruning Rate: 0.011047
Predicted discord score/position: 29.4233/6111

* If you want to suppress the outputs, please call DAMP using the following format:
>> [discord_score,position] = DAMP_2_0(T,SubsequenceLength,location_to_start_processing,'enable_output',false)

Elapsed time is 56.641462 seconds.

Note:
In contrast to the original algorithm proposed in the paper (and our implementation), the MATLAB code sets the initial chunk size to next_pow2(8 * m)


Does the paper offer any clues on what timings should be?

Working on it...

@seanlaw (Contributor) commented Nov 29, 2023

The running time of naive DAMP (stumpy-based DAMP) is ~5 sec

I didn't realize that STUMPY was that performant even for n = 100_000! Out of curiosity, I tried it myself on my M2 Macbook Air and also saw ~4.46s. According to our old performance chart (in the README), it used to take ~19s on a much older 16-core Intel machine. This is exciting to see!

Update: According to the site for the paper, the first example shown says:

Real-time Time Series Anomaly Search
This video shows time series anomaly search in real time, on a cheap commodity computer: Intel® Core™ i5-8500 CPU @ 3.00 GHz

The dataset is:
Eight million datapoints long
17.36 hours of wall clock time
85,056 heartbeats, which have been annotated by a combination of algorithms and cardiologists.

The search took 22.3 seconds, which is 2,802 times faster than real-time.

For a CPU that was released in Q2 of 2018 and has only 6 cores/threads, 22.3 seconds seems really fast for 8 million data points. I'd be curious if we could reproduce these results using their Matlab code or our DAMP code.

I ran DAMP version 2.0 provided on the supporting website using the same input of Example 3 (of our notebook in this PR). The output matches (see below) and the running time is ~56 sec (which is close to our current implementation of DAMP)

Thank you. At least this confirms that we're not doing anything horribly wrong compared with the MATLAB code? But is this really scalable? After you've investigated, it might be good to check with Eamonn and his team.

In contrast to the original algorithm proposed in the paper (and our implementation), the MATLAB code sets the initial chunk size to next_pow2(8 * m)

Presumably, this shouldn't change the performance much?

@NimaSarajpoor (Collaborator, Author) commented Dec 1, 2023

didn't realize that STUMPY was that performant even for n = 100_000!
This is exciting to see!

Indeed exciting!

For a CPU that was released in Q2 of 2018 has only 6 cores/threads, 22.3 seconds seems really fast for 8 million data points. I'd be curious if we could reproduce these results using their Matlab code or our DAMP code.

According to this video, the split_index seems to be 2500, and the window size is 80. However, according to this, the window size seems to be set to 94. The data itself is not a 1D array. It is a 2D array (a matrix), with two rows, and each row has ~10M data points (not 8 million)

But is this really scalable? After you've investigated, it might be good to check with Eamonn and his team.

I will get to this after the investigation.


Before I forget, I am going to share a couple of observations. We may discuss/ignore them later after trying out the case you suggested above.

(1) Regarding the trillion-experiment data:
The data and the code are provided at the bottom of this page:
https://sites.google.com/view/discord-aware-matrix-profile/dataset#h.dwws21dk9mej

The data is a txt file containing values, one per line. The value returned by the wc -l command is 118-540-512, which is around 100M (so not a trillion?)

The data is about ~3GB. The maximum size that can be uploaded to the MATLAB online server is 256 MB. So, I think I need to split it into several files, each with a maximum size of 256MB, and then reconstruct the full time series in MATLAB after uploading the files there.

(2) I think the performance can be affected by the location of the discord. So, in addition to the running time, it is worth reporting the location of the discord as well.

@seanlaw (Contributor) commented Dec 1, 2023

I am going to share a couple of observations

Excellent. Please let me know if there is anything I can do to help support you!

@NimaSarajpoor (Collaborator, Author) commented Dec 2, 2023

According to this video, the split_index seems to be 2500, and the window size is 80. However, according to this, the window size seems to be set to 94. The data itself is not a 1D array. It is a 2D array (a matrix), with two rows, and each row has ~10M data points (not 8 million)

I couldn't find the time series with 8M data points. The mit_long_term_ecg14046.mat data provided for Fig. 12 on the supporting webpage has two rows, each with 10M data points. In the following cases, I considered the first row of this data, and I just considered the first 8M data points. The split_idx is set to 2500. Both the Python and MATLAB codes were run on the MATLAB online server. Thank you @seanlaw for setting this up!

Case 1-1:

| m=80               | Python DAMP | MATLAB DAMP 2.0 |
|--------------------|-------------|-----------------|
| discord loc        | 973932      | 973933          |
| discord dist       | 10.433      | 10.4333         |
| running time [sec] | 282.23      | 1044.47         |

Case 1-2:

| m=94               | Python DAMP | MATLAB DAMP 2.0 |
|--------------------|-------------|-----------------|
| discord loc        | 3711        | 3712            |
| discord dist       | 11.254      | 11.254          |
| running time [sec] | 428         | 1162            |

Note 1: Indexing in MATLAB starts at 1.
Note 2: The MATLAB code uses DAMP 2.0, which is slightly different:

  • In the beginning of the algorithm, the program computes the full distance profile for the first ~ 16 * m subsequences located AFTER split_idx (with a mix of forward_process). Then it starts using the backward / forward processes for the remaining test points (i.e. the points that are located AFTER split_idx + 16 * m)
  • Also, the initial chunksize in the backward process of DAMP 2.0 is the minimum power of two that is $\ge 8m$

So, if I have data with some repetitive pattern every, say, 2 * m period, then approximately 6 * m subsequences are processed in the first chunk (almost) unnecessarily(?). However, it is possible to just early-abandon the neighbours of a subsequence after considering just the first 2 * m. So, I am trying to understand the logic behind 8m vs m. I think it all comes down to this point:

Is the running time of MASS(Q, T1 U T2) (roughly) equal to that of MASS(Q, T1) + MASS(Q, T2)?

If yes... then we may even avoid doubling the chunksize in each iteration of the backward process, and just go backward using the same chunksize! (A quick way to check this empirically is sketched below.)
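
One quick, hedged way to probe that question empirically (my own snippet, not part of the notebook):

import time

import numpy as np
from stumpy import core

np.random.seed(0)
T = np.random.rand(2 ** 21 + 200)
m = 100
Q = T[:m]
n = 2 ** 20  # a power-of-two chunk length

# MASS(Q, T1 U T2): one call on a single chunk of length 2n
t0 = time.time()
core.mass(Q, T[: 2 * n])
t_union = time.time() - t0

# MASS(Q, T1) + MASS(Q, T2): two calls on chunks of length ~n
# (the second chunk overlaps the first by m - 1 points so that no subsequence is lost)
t0 = time.time()
core.mass(Q, T[:n])
core.mass(Q, T[n - m + 1 : 2 * n])
t_split = time.time() - t0

print(t_union, t_split)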

Next step
I can remove the first part of DAMP 2.0 that computes the full distance profile for ~ 16 * m subsequences, and then change 8m to m for the initial chunksize, or change m to 8m in our Python DAMP, and compare again...

[Update]
Okay... I modified the MATLAB code to avoid computing the full distance profile for the first 16*m subsequences located AFTER split_idx. I also changed the initial chunksize from next_pow2(8m) to next_pow2(m).

Now, the computing time of MATLAB in Case 1-1 becomes 152 seconds.
And, the computing time of MATLAB in Case 1-2 becomes 292 seconds.

[Update 2]
Adding the njit decorator with fastmath reduces the running time of Python DAMP to 44 seconds for the m=80 case and to 72 seconds for the m=94 case!
(Note: I had to use core._sliding_dot_product instead of core.sliding_dot_product to allow the backward / forward process functions to work with the njit decorator)


I think it is important to test the performance when m is large. So, I tested it for m=500.

| m=500              | MATLAB DAMP 2.0 | MATLAB DAMP | Python DAMP with njit |
|--------------------|-----------------|-------------|-----------------------|
| discord loc        | 3691            | 3691        | 3690                  |
| discord dist       | 28.216          | 28.216      | 28.216                |
| running time [sec] | TBD             | 1196        | 427                   |

@seanlaw (Contributor) commented Dec 3, 2023

@NimaSarajpoor Thanks for running the comparison. From what I can tell, the Python implementation seems to be around the same performance as Matlab (maybe a little faster)? What should be the next steps? Do we have questions that we should pose to Eamonn (especially with regards to reproducibility)?

@NimaSarajpoor (Collaborator, Author) commented Dec 8, 2023

@seanlaw
I tried to use a stumpi-like concept (i.e. shifting QT and updating it as we consider new data points in T):

QT_new[1:] = self._QT[:l] - T_new[:l] * t_drop + T_new[self._m :] * t

However, I think:
(1) it can take more time to implement (as one needs to carefully track the slice of T that has been processed and bypass the QT computation without re-computing the chunks range), and
(2) I have been doing premature optimization!

So, for now, I just added @njit decorators.
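
For reference, this is the decoration pattern (a minimal, hedged illustration with a made-up function, not the notebook's actual backward / forward functions):

import numpy as np
from numba import njit

@njit(fastmath=True)
def _sliding_dot_product_dot(Q, T):
    # np.dot-based sliding dot product: JIT-friendly, but O(n * m)
    m = Q.shape[0]
    out = np.empty(T.shape[0] - m + 1)
    for i in range(out.shape[0]):
        out[i] = np.dot(Q, T[i : i + m])
    return out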


From what I can tell, the Python implementation seems to be around the same performance as Matlab (maybe a little faster)?

Correct. I just added the @njit decorator with the fastmath flag (but no parallelism), and its performance becomes faster than both MATLAB DAMP and MATLAB DAMP 2.0.

What should be the next steps?

(1) I think we can start reproducing a couple of figures of the paper.
(2) We can think about when it makes sense to use DAMP instead of stump (because stump uses parallel computing).
(3) Add param k (for top-k), and enhance the code accordingly.

Do we have questions that we should pose to Eamonn (especially with regards to reproducibility)?

I can think of three questions:

  • Where is that 8M data? The provided dataset has two time series, each with 10M data points.
  • Has DAMP been compared with other options like stumpi or stump (which uses parallel computing)?
  • Most examples provided on the supporting webpage, except two, have a small m. I was wondering whether the performance has been tested for large m, particularly when there is less periodic behaviour in the data.

Regarding the last item (i.e. the performance of DAMP when the window size is large):
To do a quick comparison, I considered the first 100k data points provided here.

Then:

# python
import numpy as np
from scipy.io import loadmat

data = np.array(loadmat('./DAMP_data/mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)

l = 100000  # length of data
split_idx = 2500
m_values = range(50, 1001, 50)  # window sizes tested
[image: running time vs. window size m (first 100k data points, split_idx = 2500)]

@seanlaw (Contributor) commented Dec 8, 2023

(1) I think we can start reproducing a couple of figures of the paper.
(2) We can think about when it makes sense to use DAMP instead of stump (because stump uses parallel computing).
(3) Add param k (for top-k), and enhance the code accordingly.

Cool! I think reproducing the figures/performance found in the paper is key.

Where is that 8M data? Because, the provided dataset has two time series each with 10M data points.

We should definitely get some clarification for this.

Has DAMP been compared with other options like stumpi, or stump (which uses parallel computing) ?

Note that Eamonn likely has no idea what stumpi or stump is since we've renamed the original functions. Instead, refer to stompi and stomp/mpx

Most examples provided on the supporting webpage except two have small m. I was wondering if the performance is tested for large m particularly when there is less periodic behaviour in the data.

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code? If "yes", then we should definitely ask Eamonn and his students.

@NimaSarajpoor (Collaborator, Author) commented Dec 14, 2023

For the sliding dot product between a query, Q, and a time series T, I tried MATLAB (fft) and Python (fft / njit). The computing times for these three cases are shown below.

[image: sliding dot product computing time vs. window size: MATLAB (fft), Python (SciPy fft), Python (njit)]

Notice that Python with njit seems to depend on the window size, whereas Python fft (with SciPy) does not. However, to decorate the backward / forward functions with the njit decorator (to speed up computation), the sliding dot product cannot take advantage of the scipy fft, as Numba does not support it!

Most examples provided on the supporting webpage except two have small m. I was wondering if the performance is tested for large m particularly when there is less periodic behaviour in the data.

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code?

Now that we have checked the performance of the sliding dot product, I will check the MATLAB DAMP performance for different window sizes, as you suggested. Will provide an update.

@seanlaw
What do you think? Please let me know if you have any suggestion.

@seanlaw (Contributor) commented Dec 14, 2023

Notice that Python with njit seems to be dependent on the window size. However, that is not the case for Python fft (with SciPy). However, to decorate the backward / forward functions with njit decorator (to speed up computation), the sliding dot product cannot take advantage of scipy fft as Numba does not support that!

Yikes, our njit version is pretty bad. When you say "scipy fft", are you referring to scipy.signal.convolve? Note that convolve requires executing an FFT 3 times, so please make sure that you are doing the same thing everywhere and not simply a single "FFT". Additionally, I vaguely remember reading that scipy.signal.convolve will attempt to use heuristics to choose the fastest algorithm (based on the length of the array). I believe that a different algo is used when the length is over 500, and there may be an even better algorithm for longer time series. I think it would be worthwhile testing longer time series.

Can you please share the code that you used?

I think it might be worthwhile posting any comparisons to both the numba discourse channel to see if numba can get closer (either by dot or by fft/convolve) and also post to the scipy Github issues.

Update

It appears that numba now supports np.convolve according to a quick search? This might be relevant to compare

Note, at some point, we may want to implement our own parallel convolve (FFT) function from scratch without relying on numpy or scipy, which is likely using a much slower np.dot underneath the hood.

@NimaSarajpoor (Collaborator, Author) commented Dec 14, 2023

are you referring to scipy.signal.convolve? Note that convolve requires executing an FFT 3 times so please make sure that you are doing the same thing everywhere and not simply a single "FFT"

Yes... Below, I am showing what I meant by each case :

  • MATLAB fft --> The part of MATLAB MASS function that computes the sliding dot product with help of fft
  • Python scipy fft --> core.sliding_dot_product, which uses scipy.signal.convolve
  • Python njit --> core._sliding_dot_product, which uses njit on top of np.dot

Btw, I think it makes sense that the computing time in the last case depends on the window size m as it simply uses np.dot. So, this computation has $\mathcal{O}({nm})$ time complexity. However, computing the sliding dot product via fft trick is said to have $\mathcal{O}({n\log{n}})$ time complexity.

Can you please share the code that you used?

All the following three cases were run on the MATLAB online server.

Case: [MATLAB] sliding dot product from MASS algorithm

% MATLAB

t_lst = [];

idx = 1000000;
T = load("mit_long_term_ecg14046.mat");
T = T.mit_long_term_ecg_14046(1, 1:idx);
n = length(T);

m_values = 99:1:1000;
for m = m_values
    Q = T(1:m);

    t = tic;
    Q = Q(end:-1:1); % reverse the query
    Q(m+1:n) = 0;    % append zeros

    X = fft(T);
    Y = fft(Q);
    Z = X.*Y;
    z = ifft(Z);

    t_lst(end+1) = toc(t);
end

% later, the slice `z(m:n)` is used to compute the distance profile

Case: [Python] core.sliding_dot_product(Q, T), which uses scipy.signal.convolve

import time

import numpy as np
from scipy.io import loadmat
from stumpy import core

idx = 1_000_000
T = np.array(loadmat('./mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)[:idx]

t_lst = []
for m in range(99, 1000):
    Q = T[:m]
    t1 = time.time()
    core.sliding_dot_product(Q, T)
    t2 = time.time()
    t_lst.append(t2 - t1)

Case: [Python] core._sliding_dot_product(Q, T), which uses njit decorator on top of np.dot

import time

import numpy as np
from scipy.io import loadmat
from stumpy import core

idx = 1_000_000
T = np.array(loadmat('./mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)[:idx]

t_lst = []
for m in range(99, 1000):
    Q = T[:m]
    t1 = time.time()
    core._sliding_dot_product(Q, T)
    t2 = time.time()
    t_lst.append(t2 - t1)

I vaguely remember reading scipy.signal.convolve will attempt to use heuristics to choose the fastest algorithm (based on the length of the array).

Right. I read the doc. It has a parameter called method that is set to 'auto' by default, which automatically chooses between direct convolution and FFT-based convolution.

I think it would be worthwhile testing longer time series.

Will try it out. Note that this will affect the case that uses scipy.signal.convolve. Still, we cannot decorate it with njit (unless we use np.convolve with Numba, as you suggested), which seems to have a considerable impact on reducing the running time of DAMP. I will try to check the performance of DAMP without the njit decorator and using core.sliding_dot_product. I GUESS that its running time should not depend on m, but it cannot outperform the case where DAMP's private functions are decorated with njit.

It appears that numba now supports np.convolve according to a quick search? This might be relevant to compare

Interesting! I will see if I can easily/safely replace scipy convolve with numpy convolve, and then compare again.

Note, at some point, we may want to implement our own parallel convolve (FFT) function from scratch without relying on numpy or scipy, which is likely using a much slower np.dot underneath the hood.

I understand. In fact, I already tried rocket-fft, which supports Numba for fft, and then I tried to do what MATLAB code does in python using np.fft. However, I did not notice any improvement in the performance. I did not put much time into it though.

I think it might be worthwhile posting any comparisons to both the numba discourse channel to see if numba can get closer (either by dot or by fft/convolve) and also post to the scipy Github issues.

@seanlaw
... get closer to the MATLAB's performance? Will do so 👍

@seanlaw (Contributor) commented Dec 14, 2023

I was curious as to why Matlab's FFT might be faster and came across this post.

@NimaSarajpoor (Collaborator, Author) commented:

I was curious as to why Matlab's FFT might be faster and came across this post.

Also: https://www.reddit.com/r/Julia/comments/5wel0j/fft_speed_vs_matlab/

@seanlaw (Contributor) commented Dec 14, 2023

Also: https://www.reddit.com/r/Julia/comments/5wel0j/fft_speed_vs_matlab/

In 2017, I experimented with FFTW and PyFFTW but didn't notice any significant performance difference, while noting that they were much harder to use (due to the need to make "plans"). I understand that FFTW uses a ton of tricks and optimizations underneath the hood that could help make DAMP faster but, after a quick search, it looks like FFTW's license (GPL) is incompatible with STUMPY's (BSD-3) license. At some point, we may just need to accept our performance-dependency trade-off.

Maybe there's a nice way to detect whether pyfftw+FFTW3 are installed and, if so, hand off the calculations. Otherwise, default to scipy.signal.convolve. Or simply set a config.FFTW = True flag to force STUMPY to monkey patch itself or something like that (a rough sketch of this detection idea is included below). Regardless, we should create a new GitHub issue.
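
A rough sketch of that detection idea (my own illustration, assuming pyfftw's numpy-compatible interface; this is not STUMPY's actual mechanism):

import numpy as np
from scipy.signal import convolve

try:
    import pyfftw.interfaces.numpy_fft as pyfftw_fft
    HAS_FFTW = True
except ImportError:
    HAS_FFTW = False

def sliding_dot_product(Q, T):
    m, n = Q.shape[0], T.shape[0]
    if HAS_FFTW:
        # hand the FFTs off to pyfftw's numpy-compatible interface
        Qr = np.zeros(n)
        Qr[:m] = Q[::-1]  # reversed query, zero-padded to length n
        z = pyfftw_fft.ifft(pyfftw_fft.fft(T) * pyfftw_fft.fft(Qr)).real
        return z[m - 1 : n]
    # otherwise fall back to scipy, which picks 'direct' vs 'fft' on its own
    return convolve(Q[::-1], T, mode="valid")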

@NimaSarajpoor (Collaborator, Author) commented:

At some point, we may just need to accept our performance-dependency trade off.

I understand. The best option is to see if we can improve the performance with the current dependencies. If not, then...

Maybe there's a nice way to detect whether pyfftw+FFTW3 are installed and, if so, hand off the calculations. Otherwise, default to scipy.signal.convolve. Or simply set a config.FFTW = True flag to force STUMPY to monkey patch itself or something like that.


Regardless, we should create a new Github issue.

That makes sense. I created issue #938 to continue our conversation regarding the performance of the sliding dot product.

@NimaSarajpoor (Collaborator, Author) commented Dec 18, 2023

We recently noticed that Python DAMP's running time increases as the window size, m, gets larger and larger. To investigate whether MATLAB DAMP behaves the same way, we first started by comparing MATLAB MASS (with the fft trick) and Python _mass. We noticed that the performance of stumpy.core.sliding_dot_product, which uses scipy.signal.convolve (the fft trick), seems to be independent of the window size. This is what we want, as the MATLAB MASS function shows similar behaviour. However, this function cannot be njit-decorated. An alternative function, stumpy.core._sliding_dot_product, is njit-decorated; however, it uses np.dot for the sliding dot product. This can result in a high computing time when m is large (see the following experiments).

(1) Comparing the "sliding dot product" functions
#920 (comment)

(2) Comparing Python naive (i.e. with STUMP using multiple threads) vs Python DAMP (using one thread but njit-decorated with fastmath; we used core._sliding_dot_product)
#920 (comment)

Based on your test of this, DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code? If "yes", then we should definitely ask Eamonn and his students.

Recall that the current version of MATLAB DAMP provided on the supporting webpage has some differences compared to the original algorithm proposed in the paper. To that end, I am going to show the results for both versions. To reproduce the following results, here is the code:

% MATLAB

running_time_lst_comp = [];
discord_lst_comp = [];

m_values = 50:50:1000;

idx = 100000;
T = load("mit_long_term_ecg14046.mat");
T = T.mit_long_term_ecg_14046(1, 1:idx);
location_to_start_processing = 2500;
for m = m_values
    t = tic;
    
    [discord_score,position] = DAMP(T, m, location_to_start_processing,'lookahead', m);
    
    running_time_lst_comp(end+1) = toc(t);
    discord_lst_comp(end+1) = position;
end

(1) MATLAB DAMP Original
where the initial chunksize is next_pow2(m).

The following figure shows the running time of the original DAMP for different window sizes. Groups of adjacent data points with the same color have the same next_pow2(m) value, which is the initial chunksize (for moving backward) and the lookahead size (for the forward process).

[image: MATLAB DAMP (original) running time vs. window size]

One reason for the ascending trend within a group of window sizes sharing the same next_pow2(m) can be the fact that the initial chunk contains fewer and fewer subsequences as m gets larger. For instance, for both m=300 and m=500, the value of next_pow2(...) is 512. In the former case (i.e. m=300), we have 213 subsequences in the initial chunk. However, in the latter case (i.e. m=500), we have only 13 subsequences. So, maybe it is more efficient to consider more subsequences per chunk (?)... let's see the next case!

(2) MATLAB DAMP v2
where the initial chunksize is next_pow2(16*m).

[image: MATLAB DAMP 2.0 running time vs. window size]

Compared to the previous case, we can see a slightly larger SLOPE in the running time for a group of adjacent data points. One reason can be that, for the same window size, DAMP v2 needs to compare the query subsequence against more subsequences, as each chunk is larger in this version of DAMP.

Now, let's get back to our question.

DAMP seems pretty bad as window size increases. Can you confirm that this is also true when using the MATLAB code?

According to my experiments, I think it is true.


Let's compare MATLAB DAMP with Python DAMP (the njit version), for different window sizes, by running the latter on the MATLAB online server.

[image: MATLAB DAMP vs. Python DAMP (njit) running time vs. window size]

Now, let's compare again but with larger window sizes, [3500, 4000, 4500]

[image: the same comparison for window sizes 3500, 4000, 4500]

We can now see the impact of higher time complexity of sliding dot product in Python.


Finally, let's just focus on MATLAB DAMP and consider a wide range of values, np.arange(250, 4500+250, 250), for the window size m to better see the impact of the window size on the performance of MATLAB DAMP. The split index is set to 20_000.

Groups of adjacent data points with similar color are associated with window sizes that have the same next_pow2(m) value.

[image: MATLAB DAMP running time for m = 250 to 4500 (split index 20_000)]

As an aside, I should mention that the MATLAB code has the following block, which gives a warning and suggests a window size when the value provided by the user is either too small or too large:

% MATLAB

if SubsequenceLength <= 10 || SubsequenceLength > 1000
    [autocor,lags] = xcorr(T,3000,'coeff');
    [~,ReferenceSubsequenceLength] = findpeaks(autocor(3010:4000),lags(3010:4000),'SortStr','descend','NPeaks',1);
    ReferenceSubsequenceLength(isempty(ReferenceSubsequenceLength))=1000;
    ReferenceSubsequenceLength = floor(ReferenceSubsequenceLength);

    disp("WARNING: ");
    disp("The subsequence length you set may be too large or too small");
    disp(strcat("For the current input time series, we recommend setting the subsequence length to ",num2str(ReferenceSubsequenceLength)));
    fprintf("------------------------------------------\n\n");
end
  • Why 1000 as the upper bound for SubsequenceLength?
  • Why 3000 as the maximum number of lags?

What do the above results mean for us?

(1) The performance depends on the size of m, even in MATLAB DAMP.

(2) The sliding dot product without the fft trick has O(nm) time complexity. We can see the high running time of Python DAMP when m is large.

@seanlaw (Contributor) commented Dec 18, 2023

Thanks @NimaSarajpoor

@Kotrix commented Jan 18, 2024

Thanks for the good work @NimaSarajpoor

Do you plan to add anything else, or is DAMP ready to roll?

One comment:

The data is a txt file, containing values, each in one line. The value returned by wc -l command is 118-540-512, which is around 100M (so not trillion?)

From slides here: https://docs.google.com/presentation/d/1tQQfKuKrOa3j5-WJ9peoYfZGOnJbBcl9/edit#slide=id.p31

we count a trial successful if the top-1 discord is found in the first 100,000 datapoint (created by [e]), rather than from the appended ninety-nine million nine hundred thousand datapoints. We conducted ten thousand such trials, and found no false positives.
100 million times 10 thousand is 1 trillion

@NimaSarajpoor (Collaborator, Author) commented Jan 19, 2024

@Kotrix

Do you plan to add anything else or the DAMP is ready to roll?

Not ready yet. We are currently focusing on implementing a numba-friendly Fast Fourier Transform (FFT), which is an important component of DAMP (see #938 and #939 to track the progress on FFT).

One comment:

The data is a txt file, containing values, each in one line. The value returned by wc -l command is 118-540-512, which is around 100M (so not trillion?)

From slides here: https://docs.google.com/presentation/d/1tQQfKuKrOa3j5-WJ9peoYfZGOnJbBcl9/edit#slide=id.p31

we count a trial successful if the top-1 discord is found in the first 100,000 datapoint (created by [e]), rather than from the appended ninety-nine million nine hundred thousand datapoints. We conducted ten thousand such trials, and found no false positives.
100 million times 10 thousand is 1 trillion

Ahhh.... good catch! Thanks! 😄
