[WIP] Discovering Discords of arbitrary length using MERLIN #417 #505

NimaSarajpoor · 2021-12-28T03:42:23Z

In this PR, we would like to implement the MERLIN algorithm, which discovers discords of arbitrary length. Although a MATLAB implementation of the faster version of MERLIN is available on MERLIN: SUPPORT, for now we implement the original version as proposed in the MERLIN paper.

What I have done so far:

Add Introduction
Explain the advantage / disadvantage of MatrixProfile for discovering discords
Explain two core ideas of MERLIN algorithm
Implement the first phase of DRAG (DRAG is the first part of MERLIN)

NOTE:
(1) I have already implemented the MERLIN algorithm and enhanced it a little. MERLIN and STUMPY's matrix profile give EXACTLY the same output for the discord indices and their distances to their nearest neighbors (NN). The discords' NN indices are the same in almost all cases.

(2) In terms of performance, MERLIN outperforms STUMPY for long time series.
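
For reference, the matrix-profile route to the top discord can be sketched without any library: compute each subsequence's z-normalized distance to its nearest non-trivial neighbor, then pick the subsequence whose nearest-neighbor distance is largest. This is a brute-force illustration only, not STUMPY's or the notebook's implementation; the function names, the toy data, and the exclusion-zone denominator of 4 are assumptions made for the example.

```python
import math


def z_norm(x):
    """Return a z-normalized copy of x (assumes x is not constant)."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sigma for v in x]


def brute_force_top_discord(T, m, excl_zone_denom=4):
    """Return (index, nn_distance) of the top-1 discord of length m."""
    subseqs = [z_norm(T[i : i + m]) for i in range(len(T) - m + 1)]
    excl_zone = math.ceil(m / excl_zone_denom)

    best_idx, best_dist = -1, -math.inf
    for i, a in enumerate(subseqs):
        # Nearest neighbor of subsequence i among non-trivial matches only
        nn_dist = min(
            (math.dist(a, b) for j, b in enumerate(subseqs) if abs(i - j) > excl_zone),
            default=math.inf,
        )
        if nn_dist != math.inf and nn_dist > best_dist:
            best_idx, best_dist = i, nn_dist
    return best_idx, best_dist


# A sine wave with one distorted stretch (t = 60..67); the discord should overlap it
T = [math.sin(2 * math.pi * t / 16) for t in range(128)]
for t in range(60, 68):
    T[t] += 1.5 * (-1) ** t
idx, dist = brute_force_top_discord(T, m=16)
```

The quadratic number of distance computations here is the cost that MERLIN tries to avoid on long time series.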

review-notebook-app · 2021-12-28T03:42:27Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

NimaSarajpoor · 2021-12-28T03:44:29Z

@seanlaw
Please let me know whether the current notebook is better than the previous ones; I just want to know how you feel about it. If you have any general suggestions for improving this notebook further, please feel free to let me know. Thanks!

codecov-commenter · 2021-12-28T04:00:12Z

Codecov Report

Merging #505 (169a07e) into main (60bb08d) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #505   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11300    11300           
=======================================
  Hits        11288    11288           
  Misses         12       12

seanlaw · 2021-12-28T15:22:04Z

@ninimama Will do. Thanks!

docs/Tutorial_DiscordMERLIN.ipynb

NimaSarajpoor · 2021-12-30T22:36:55Z

@seanlaw
I think we have discussed/resolved all the issues that need to be addressed in the next commit / future. Please let me know what you think.

seanlaw · 2021-12-30T22:54:05Z

@seanlaw I think we have discussed/resolved all the issues that need to be addressed in the next commit / future. Please let me know what you think.

@ninimama Alright, looks good to me!

NimaSarajpoor · 2022-01-02T19:50:34Z

What I have done:

Revised the notebook according to the previous review
Implemented the second phase of DRAG (Table 2 of the paper). I named this function _prune_candidates

I also tried to keep the implementations of _prune_candidates and _select_candidates close to each other, since they do similar things (_select_candidates checks the right neighbors and _prune_candidates checks the left ones). This should make the review easier.
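
As a rough picture of how the two DRAG phases fit together, here is a dependency-free sketch of the classic select-then-prune shape. It is not the notebook's _select_candidates / _prune_candidates: the function names are hypothetical, the distance is plain Euclidean on toy vectors rather than z-normalized, and the exclusion zone is passed in directly.

```python
import math


def select_candidates_sketch(subseqs, r, excl_zone):
    """Phase 1: one forward pass that builds a superset of the
    subsequences whose nearest neighbor is at least r away."""
    cands = []
    for i, s in enumerate(subseqs):
        is_cand = True
        for c in list(cands):
            if abs(i - c) > excl_zone and math.dist(s, subseqs[c]) < r:
                cands.remove(c)  # c has a neighbor closer than r
                is_cand = False  # and so does i
        if is_cand:
            cands.append(i)
    return cands


def prune_candidates_sketch(subseqs, r, excl_zone, cands):
    """Phase 2: drop every candidate that has any non-trivial neighbor closer than r."""
    survivors = []
    for c in cands:
        has_close = any(
            abs(i - c) > excl_zone and math.dist(s, subseqs[c]) < r
            for i, s in enumerate(subseqs)
        )
        if not has_close:
            survivors.append(c)
    return survivors


# Toy "subsequences": indices 0, 1 and 3 are close to each other
subseqs = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [0.0, 0.05], [9.0, 0.0]]
cands = select_candidates_sketch(subseqs, r=1.0, excl_zone=0)
survivors = prune_candidates_sketch(subseqs, r=1.0, excl_zone=0, cands=cands)
```

In this toy run, index 3 survives the first pass only because its close neighbors had already left the candidate set, which is why a second pass is needed.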

NOTE (1):
I am providing an alternative implementation below. I sorted cand_indices (the output of the previous phase) in descending order (i.e., I just flipped it), which makes it easier to update in each iteration. At the end, I flip it back so that the output (the indices of the final candidates) is in ascending order (see below for details).

import numpy as np
from stumpy import config


def alternative_prune_candidates(subseqs, min_dist, cand_index):
    # subseqs is assumed to hold z-normalized subsequences, one per row
    n, m = subseqs.shape  # n: number of subsequences, m: length of each subsequence

    cand_index = np.flip(cand_index)  # candidate indices in descending order
    is_discord = np.ones(len(cand_index), dtype=bool)

    excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))

    # dot-product threshold that corresponds to a z-normalized distance of min_dist
    r = m * (1.0 - (min_dist ** 2.0) / (2.0 * m))

    for i in range(n):
        if not np.any(is_discord):
            return None  # all candidates are pruned; there is nothing to return

        # cand_index is in descending order, so this selection is a prefix of
        # cand_index and positions in R line up with positions in is_discord
        non_trivial_cand = cand_index[cand_index > i + excl_zone]

        if not len(non_trivial_cand):
            break  # no candidate lies to the right of subsequence i

        R = np.matmul(subseqs[non_trivial_cand], subseqs[i])
        is_discord[np.flatnonzero(R > r)] = False

    return np.flip(cand_index[is_discord])
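
The threshold r in the function above rests on an identity for z-normalized subsequences of length m: the squared Euclidean distance satisfies d^2 = 2m(1 - (a·b)/m), so d < min_dist exactly when the dot product exceeds m(1 - min_dist^2/(2m)). A small dependency-free check of that identity, with made-up data and hypothetical helper names:

```python
import math


def z_norm(x):
    """Return a z-normalized copy of x (population standard deviation)."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sigma for v in x]


def dot_threshold(min_dist, m):
    """Dot-product value that corresponds to a z-normalized distance of min_dist."""
    return m * (1.0 - (min_dist ** 2.0) / (2.0 * m))


a = z_norm([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
b = z_norm([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
m = len(a)

dot = sum(x * y for x, y in zip(a, b))
dist = math.dist(a, b)

# d^2 = 2 * m * (1 - dot / m) holds because sum(a_i^2) = sum(b_i^2) = m
identity_gap = abs(dist ** 2 - 2.0 * m * (1.0 - dot / m))

# Comparing dot products against the threshold agrees with comparing distances
min_dist = 2.0
same_decision = (dot > dot_threshold(min_dist, m)) == (dist < min_dist)
```

This is why a single matrix-vector product can replace many explicit distance computations, as long as the subsequences are z-normalized beforehand.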

NOTE (2):
I also tried a non-normalized min_dist in the first phase, and there seems to be no problem. Further investigation is needed (such as trying it in the other phases and monitoring its running time). However, I want to keep things organized, so, if that is okay with you, we can cover it after implementing the original MERLIN.

seanlaw · 2022-01-02T20:58:04Z

I tried to make sure the implementation of _prune_candidates and _select_candidates to be close to each other as they basically do similar things (_select_candidates is to check out right neighbors and _prune_candidates considers the left ones). It can make it easier for review.

Great and thank you. I will take a look.

Note 1: I am providing an alternative implementation below. I sorted the cand_indices (output of previous phase) in descending order (i.e. I just flipped it!) It makes it easier to update it in each iteration. Then, I flipped it back at the end just to make sure the output (indices of final candidates) is in ascending order (please see below for more details).

Maybe it's just me but I find this implementation more confusing. The version in the notebook is much more readable and consistent with the first phase.

Note 2: I also tried non-normalized min_dist in the first phase. It seems there should be no problem. Further investigation is needed (such as trying it in other phases and monitoring its running time). However, I wanted to keep things organized. So, if that is okay with you, we can cover it after implementing the original MERLIN.

Yes, let's think about it later. I think that this notebook will end up being split up into "Tutorial_Merlin.ipynb" (i.e., how to use the merlin function but without getting into the technical phases) and "Merlin.ipynb" (i.e., how everything in Merlin is derived including the non-normalized min_dist). Again, we can worry about it later once we are satisfied and convinced that everything works correctly 😄

NimaSarajpoor · 2022-01-02T21:07:51Z

Maybe it's just me but I find this implementation more confusing. The version in the notebook is much more readable and consistent with the first phase

You are correct. In fact, that was one of the reasons I didn't include it in the notebook: I remember the importance of readability. Also, I think the two versions have similar running times.

Again, we can worry about it later once we are satisfied and convinced that everything works correctly

Exactly. So, for now, if we discuss the implementation of the non-normalized case, I will keep that in mind and might add a short note about it in the notebook, but I will avoid going into details. Btw, I have already provided the math for both the normalized and non-normalized cases in the notebook. We can keep it as is for now. I can move the calculations related to the non-normalized part if you think that is better.

docs/Tutorial_DiscordMERLIN.ipynb

seanlaw · 2022-01-02T21:50:21Z

@ninimama Overall, I think things are looking great (good work)! Phase 2 was pretty straightforward to understand given how simple we've kept the code. I've made some suggestions for you to consider

docs/Tutorial_DiscordMERLIN.ipynb

NimaSarajpoor · 2022-01-02T23:21:10Z

I think this round of review was straightforward. Please let me know if you would like to discuss anything further. Otherwise, I can go ahead and apply the changes and implement the third phase (i.e., finding the NN of each final candidate).

seanlaw · 2022-01-03T02:31:19Z

I can go ahead to apply changes and implement the third phase

@ninimama Sounds good. Please feel free to move forward

NimaSarajpoor · 2022-04-03T23:44:17Z

I think the notebook is ready.

Important changes:

function _find_discords: return numpy ndarray instead of three lists
function _discords
top-level function discords: return top-k discords of length m rather than range of m
change r_init to r, and change r to r_updated
Add bonus section to guide reader on how to take advantage of input r in discovering discords for a range of m
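
One way to picture the role of r when sweeping a range of m: a DRAG-style search only succeeds when r is no larger than the true discord distance, so a driver has to shrink r and retry after a failure, and it can seed the next length from the distance it just found. The sketch below shows that control flow only; `find_discord`, the 0.99 seeding factor, and the shrink factor are placeholders chosen for illustration, not the notebook's API or MERLIN's exact schedule.

```python
def sweep_lengths(find_discord, lengths, r_start, shrink=0.5, max_tries=50):
    """Call find_discord(m, r) for each m, shrinking r until a discord is found.

    find_discord is a hypothetical callable that returns (index, distance)
    when some subsequence has nearest-neighbor distance >= r, else None.
    """
    results = {}
    r = r_start
    for m in lengths:
        found = None
        for _ in range(max_tries):
            found = find_discord(m, r)
            if found is not None:
                break
            r *= shrink  # r was too large: relax it and retry
        if found is None:
            raise RuntimeError(f"no discord found for m={m}")
        results[m] = found
        r = 0.99 * found[1]  # seed the next length slightly below this distance
    return results


# Toy stand-in: pretend the true discord distance for length m is m / 4
def fake_find_discord(m, r):
    true_dist = m / 4
    return (0, true_dist) if r <= true_dist else None


out = sweep_lengths(fake_find_discord, lengths=[8, 9, 10], r_start=16.0)
```

In the toy run, only the first length needs several retries; the later lengths succeed on the first call because r was seeded from the previous result.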

Btw, I ran the whole notebook and it is all good. (In fact, I ran the whole notebook every time I made a change, and I have now noticed that my pushed commits also show changes to the cell ids. Sorry about that! I should have run the whole thing only once.)

seanlaw · 2022-04-05T03:10:04Z

Thank you. I will take a look

seanlaw · 2022-04-05T13:24:55Z

@NimaSarajpoor Have you seen this paper by Keogh? It appears to be an extension of VALMOD to handle discords as well as motifs. It may be relevant to MERLIN. Btw, I have not read either paper yet.

docs/Tutorial_DiscordMERLIN.ipynb

NimaSarajpoor · 2022-04-05T15:52:17Z

Have you seen this paper by Keogh? It appears to be an extension of VALMOD to handle discords as well as motifs.

No, this is new to me! I will go through them if I get some time...

I noticed:

They cited the paper Disk_Aware_Discord_Discovery, which is the foundation of the MERLIN paper (and, I think, the core of the MERLIN method). They refer to this method as DAD and compare their proposed method with it.
Their statement sounds reasonable: "DAD has different execution times. We observe that the computational time of DAD depends on the subsequence length, since it computes Euclidean distances in their entirety (only applying early abandoning based on the best so far distance). How effective this early abandoning mechanism is, depends on the characteristics of the data. On the other hand, our algorithm computes all distances for the first subsequence length in constant time, and then prunes entire distance computations for the larger lengths"
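
The early abandoning mentioned in the quote can be sketched as follows: accumulate the squared Euclidean distance term by term and stop as soon as the partial sum exceeds the best-so-far threshold, since the final distance can only be larger. This is a generic illustration with hypothetical names, not code from either paper.

```python
import math


def early_abandon_dist(a, b, best_so_far):
    """Return the Euclidean distance between a and b, or math.inf as soon as
    the distance is known to exceed best_so_far."""
    limit = best_so_far ** 2
    acc = 0.0
    for x, y in zip(a, b):
        acc += (x - y) ** 2
        if acc > limit:
            return math.inf  # abandoned: the full distance must exceed best_so_far
    return math.sqrt(acc)


a = [0.0, 0.0, 0.0, 0.0]
b = [3.0, 4.0, 0.0, 0.0]
full = early_abandon_dist(a, b, best_so_far=10.0)      # runs to completion
abandoned = early_abandon_dist(a, b, best_so_far=4.0)  # stops after two terms
```

How much work this saves depends on how early the partial sum crosses the threshold, which is the data dependence the quote points out.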

So, the paper seems interesting... And I just noticed that you opened an issue, so all good!

NimaSarajpoor · 2022-04-06T05:49:51Z

I think we have discussed the comments. I can go ahead and apply the changes.

seanlaw · 2022-04-06T16:55:52Z

I think we discuss the comments. I can go ahead and apply changes.

Sounds good

NimaSarajpoor · 2022-04-07T05:34:37Z

@seanlaw
I checked the notebook and it seems the comments have been addressed. Please feel free to review.

seanlaw · 2022-04-07T12:48:50Z

Will do!

docs/Tutorial_DiscordMERLIN.ipynb

seanlaw · 2022-04-08T15:30:09Z

@NimaSarajpoor Only found some very minor things. I have not spent time comparing the stumpy.stump vs merlin discords outputs though as I think that will come later. Given VALMOD, what do you think we should do next given the overlap? In other words, how should we decide whether to go with the VALMOD route or the MERLIN path for discords? I don't think that the user should need to "choose" if possible. Of course, "MAD" handles both discords AND motifs.

Really, what I'm asking is how should the user "know" when to use what? If the user only cares about discords (and doesn't care about motifs), should they EVER use VALMOD/MAD??

NimaSarajpoor · 2022-04-08T18:50:02Z

Given VALMOD, what do you think we should do next given the overlap? In other words, how should we decide whether to go with the VALMOD route or the MERLIN path for discords?

Really, what I'm asking is how should the user "know" when to use what? If the user only cares about discords (and doesn't care about motifs), should they EVER use VALMOD/MAD??

I am studying VALMOD now, and it might be too soon for me to say, with high certainty, whether we should go with MERLIN or VALMOD.

Based on my current understanding of VALMOD, I think it should perform well in finding motifs/discords, but probably not for large data. I did a quick scan and noticed that the results are mostly for data sets with at most 1M data points. They considered 2M data points in some cases to show scalability. So, the VALMOD method will probably be useful for many users who deal with small/medium-size data sets. (Are 1M-2M data points considered medium-size data?)

Below are the advantage and disadvantage of the VALMOD method as I see them:

Advantage: In contrast to MERLIN, or a simple matrix profile that only considers the 1-NN of each subsequence, it can discover top-k discords while considering not just the 1-NN but also the i-th nearest neighbors (see Fig. 2 in the extended version VALMOD_2020 and notice the point "top-1 2nd Discord").
Disadvantage: For a given subsequence-length range [min_m, max_m], VALMOD needs to obtain not just the full matrix profile for min_m but the "top-p nearest neighbors" of every subsequence of length min_m in the time series T (see Algorithm3). In other words, it extracts the top-p entries from the distance profile of each subsequence. The parameter p must be set by the user. For instance, the authors used p=5 for 0.1M data points and, in another case, p=50 for 1M data points (see Table 2). As the size of T increases, the authors use a larger p.
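
To make the storage cost in the disadvantage concrete: following the description above, each subsequence keeps the p smallest entries of its distance profile rather than only the minimum. A minimal sketch of that extraction for a single distance profile, using a made-up profile and a hypothetical helper name (trivial matches inside the exclusion zone are skipped):

```python
import heapq


def top_p_neighbors(distance_profile, i, p, excl_zone):
    """Return the p nearest non-trivial neighbors of subsequence i
    as (distance, index) pairs, closest first."""
    candidates = (
        (d, j)
        for j, d in enumerate(distance_profile)
        if abs(i - j) > excl_zone
    )
    return heapq.nsmallest(p, candidates)


# Distance profile of subsequence i=2 against all subsequences (made-up values)
profile = [4.0, 1.0, 0.0, 0.5, 3.0, 2.5, 6.0, 1.5]
neighbors = top_p_neighbors(profile, i=2, p=3, excl_zone=1)
```

Keeping p such pairs for each of the subsequences of length min_m is what makes the memory footprint grow with both p and the length of T.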

The idea behind MERLIN is to find discords without computing the full matrix profile, which might be computationally heavy for medium/large data. According to the results reported for VALMOD, the method should perform well; however, I THINK that depends on the size of the data, because VALMOD needs to obtain the top-p nearest neighbors of every subsequence of length min_m from its distance profile, store them at the beginning of the process, and then use them to accelerate the computation of the matrix profile for window sizes in the range [min_m+1, max_m].

What do you think? Should we put this on hold, and work on VALMOD, and then compare them?

seanlaw · 2022-04-09T00:58:20Z

So, after addressing the latest comments, maybe we can merge this notebook so that we don't lose anything and then pause since we are at a very good place with it. Then we switch our attention to VALMOD. How does that sound?

NimaSarajpoor · 2022-04-09T02:13:54Z

Yes...I think that is a great idea... This way we can also realize whether we need both or not.

So, I will make the changes and push the commits. I will also read the notebook again to see whether I can correct any grammatical errors.

Then, if you are okay with it, I can continue to work on VALMOD. If you prefer to do it yourself or have some other plans, please feel free to let me know :)

seanlaw · 2022-04-09T02:29:22Z

I'd prefer for you to work on it if you have the time and I get to continue to learn from you 😊

NimaSarajpoor · 2022-04-09T02:34:33Z

I would love to work on it... I have learned a lot from you... and I am enjoying it 😃.

I will push the commits in a day or two...

NimaSarajpoor · 2022-04-09T07:17:16Z

What I have done:

Revised notebook according to last set of comments
Read the whole notebook and improved docstrings
Ran the whole notebook and it is all good...

I think the notebook is ready.

seanlaw · 2022-04-09T15:56:11Z

@NimaSarajpoor Thank you for this contribution! We appreciate all of your hard work

NimaSarajpoor added 3 commits December 27, 2021 18:25

add introduction part to better understand the idea behind MERLIN

4bee517

Implement first phase of DRAG (DRAG is first part of MERLIN)

7d2b794

fix some typos and small enhancement in explanation/comments

a0e8d97

seanlaw reviewed Dec 29, 2021

NimaSarajpoor added 10 commits December 31, 2021 01:49

fix exclusion zone to make it consistent with STUMPY

efffc23

change MatrixProfile to matrix profile

0a67162

use a more understandable function for select_candidates. Also, correct excl_zone

402983a

fix a typo and delete some incorrect discussion about min_dist

1ff88f0

add calculation behind the min_dist in normalized and non-normalized euclidean distance

c859bf6

do slight improvement on the explanation of calculating non-normalized euclidean distance

0e5b716

explain the 'how' of updating min_dist in both normalized and non-normalized euclidean distances

e81b122

fix small things and improve explanation to make sure the main issues are resolved

ecc253e

implement second phase of DRAG (DRAG is first part of MERLIN).

603e4e8

review and fix small errors

eef6bdd

change the name cand_indices to cand_index to make it consistent with the name used in _select_candidates

39d4a80

seanlaw reviewed Jan 2, 2022

small enhancement in the _select_candidates and _prune_candidates functions

eb7bc7c

NimaSarajpoor added 2 commits April 3, 2022 18:33

retrieve if-check and remove indexer

ba97dff

remove indexer and use append instead

ba081a2

seanlaw reviewed Apr 5, 2022

Merge branch 'main' into Discord_MERLIN

52c972e

NimaSarajpoor added 4 commits April 6, 2022 18:27

minor Changes

831f64b

- change type of parameter `r` from scalar to float
- remove functions from section BONUS section

Revise while-loop structure to increase readability

59fa1db

minor changes

16b28bd

minor changes throughout notebook

053f38f

seanlaw reviewed Apr 8, 2022

NimaSarajpoor added 3 commits April 9, 2022 00:10

minor changes

20593b8

- change name of function _find_discords to _refine_candidates
- update docstring of function _refine_candidates
- add comment next to 1e-6 about setting config

improve the whole notebook

a25e307

- change function name _find_discords to _refine_candidates throughout the notebook
- improve docstrings

minor change in a comment

169a07e

seanlaw merged commit baa3d99 into stumpy-dev:main Apr 9, 2022

NimaSarajpoor deleted the Discord_MERLIN branch April 9, 2022 17:44
