Skip to content

[WIP] Discovering Discords of arbitrary length using MERLIN #417 #505

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 213 commits into from
Apr 9, 2022

Conversation

NimaSarajpoor
Copy link
Collaborator

In this PR, we would like to implement MERLIN algorithm that discovers discords of arbitrary length. Although the MATLAB implementation of the faster version of MERLIN is available on MERLIN: SUPPORT, we, for now, implement the original version as proposed in the MERLIN.

What I have done so far:

  • Add Introduction
  • Explain the advantage / disadvantage of MatrixProfile for discovering discords
  • Explain two core ideas of MERLIN algorithm
  • Implement the first phase of DRAG (DRAG is the first part of MERLIN)

NOTE:
(1) I already implemented MERLIN algorithm and enhanced it a little bit. Both MERLIN and STUMPY: MatrixProfile gives EXACTLY the same output regarding the discords indices and their distances to their NN. The discord NN indices are the same in almost all cases.

(2) In terms of performance, MERLIN outperforms STUMPY for long time series.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
please let me know if the current notebook is better compared to the previous ones. I just want to know how you feel about it. If you have any general suggestion for further improvement of this notebook, please feel free and let me know. Thanks!

@codecov-commenter
Copy link

codecov-commenter commented Dec 28, 2021

Codecov Report

Merging #505 (169a07e) into main (60bb08d) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #505   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11300    11300           
=======================================
  Hits        11288    11288           
  Misses         12       12           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 60bb08d...169a07e. Read the comment docs.

@seanlaw
Copy link
Contributor

seanlaw commented Dec 28, 2021

@ninimama Will do. Thanks!

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Dec 30, 2021

@seanlaw
I think we have discussed/resolved all the issues that need to be addressed in the next commit / future. Please let me know what you think.

@seanlaw
Copy link
Contributor

seanlaw commented Dec 30, 2021

@seanlaw I think we have discussed/resolved all the issues that need to be addressed in the next commit / future. Please let me know what you think.

@ninimama Alright, looks good to me!

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jan 2, 2022

What I have done:

  • revise the notebook according to the previous review
  • implement second phase of DRAG (Table 2 of paper). I named this function as _prune_candidates

Also, I tried to make sure the implementation of _prune_candidates and _select_candidates to be close to each other as they basically do similar things (_select_candidates is to check out right neighbors and _prune_candidates considers the left ones). It can make it easier for review.

NOTE (1):
I am providing an alternative implementation below. I sorted the cand_indices (output of previous phase) in descending order (i.e. I just flipped it!) It makes it easier to update it in each iteration. Then, I flipped it back at the end just to make sure the output (indices of final candidates) is in ascending order (please see below for more details).

def alternative_prune_candidates(subseqs, min_dist, cand_index):
   
    n, m = subseqs.shape # n: number of subsequences, #m: length of each subsequence
    
    cand_index = np.flip(cand_index) #sorting indices in descending order
    isDiscord = np.ones(len(cand_index), dtype=bool) 
    
    excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
    
    r = m * (1.0 - ((min_dist ** 2.0) / (2.0 * m))) 

    for i in range(0, n):
        if not np.any(isDiscord):
            return None #means all candidates are pruned! There is nothing to return!
        
        non_trivial_cand = cand_index[cand_index > i + excl_zone]  
        #since indices are sorted in descending order, the slice non_trivial_cand starts from index 0 of cand_index.
        
        if not len(non_trivial_cand):
            break #means we already reached a subsequence that has no candidates on its right neighbors!
        
        R = np.matmul(subseqs[non_trivial_cand], subseqs[i]) 
        isDiscord[np.flatnonzero(R>r)] = False
        
    return np.flip(cand_index[isDiscord])

NOTE (2):
I also tried non-normalized min_dist in the first phase. It seems there should be no problem. Further investigation is needed (such as trying it in other phases and monitoring its running time). However, I wanted to keep things organized. So, if that is okay with you, we can cover it after implementing the original MERLIN.

@seanlaw
Copy link
Contributor

seanlaw commented Jan 2, 2022

I tried to make sure the implementation of _prune_candidates and _select_candidates to be close to each other as they basically do similar things (_select_candidates is to check out right neighbors and _prune_candidates considers the left ones). It can make it easier for review.

Great and thank you. I will take a look.

Note 1: I am providing an alternative implementation below. I sorted the cand_indices (output of previous phase) in descending order (i.e. I just flipped it!) It makes it easier to update it in each iteration. Then, I flipped it back at the end just to make sure the output (indices of final candidates) is in ascending order (please see below for more details).

Maybe it's just me but I find this implementation more confusing. The version in the notebook is much more readable and consistent with the first phase.

Note 2: I also tried non-normalized min_dist in the first phase. It seems there should be no problem. Further investigation is needed (such as trying it in other phases and monitoring its running time). However, I wanted to keep things organized. So, if that is okay with you, we can cover it after implementing the original MERLIN.

Yes, let's think about it later. I think that this notebook will end up being split up into "Tutorial_Merlin.ipynb" (i.e., how to use the merlin function but without getting into the technical phases) and "Merlin.ipynb" (i.e., how everything in Merlin is derived including the non-normalized min_dist). Again, we can worry about it later once we are satisfied and convinced that everything works correctly 😄

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jan 2, 2022

Maybe it's just me but I find this implementation more confusing. The version in the notebook is much more readable and consistent with the first phase

You are correct. In fact, that was one of the reason I didn't include it in the notebook because I remember the importance of readability. And, I think they have similar running time.

Again, we can worry about it later once we are satisfied and convinced that everything works correctly

Exactly. So, for now, if we discuss something about the implementation of non-normalized case, I will keep that in mind and might make a short note about it in the notebook. But, I avoid going to details. Btw, I already provided the math for both normalized and non-normalized cases in the notebook. We can keep it as is for now. I can move the calculation related to non-normalized part if you think that is better.

@seanlaw
Copy link
Contributor

seanlaw commented Jan 2, 2022

@ninimama Overall, I think things are looking great (good work)! Phase 2 was pretty straightforward to understand given how simple we've kept the code. I've made some suggestions for you to consider

@NimaSarajpoor
Copy link
Collaborator Author

I think this round of review was straightforward. please let me know if would like to discuss something further. Otherwise, I can go ahead to apply changes and implement the third phase (i.e. finding NN of each of final candidates).

@seanlaw
Copy link
Contributor

seanlaw commented Jan 3, 2022

I can go ahead to apply changes and implement the third phase

@ninimama Sounds good. Please feel free to move forward

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Apr 3, 2022

I think the notebook is ready.

Important changes:

  • function _find_discords: return numpy ndarray instead of three lists
  • function _discords
  • top-level function discords: return top-k discords of length m rather than range of m
  • change r_init to r, and change r to r_updated
  • Add bonus section to guide reader on how to take advantage of input r in discovering discords for a range of m

Btw, I ran the whole notebook and it is all good. (In fact, I ran the whole notebook every time I made a change, and I now noticed that my pushed commits show changes of id of cells as well. Sorry about that! I should have run the whole thing only once.)

@seanlaw
Copy link
Contributor

seanlaw commented Apr 5, 2022

Thank you. I will take a look

@seanlaw
Copy link
Contributor

seanlaw commented Apr 5, 2022

@NimaSarajpoor Have you seen this paper by Keogh? It appears to be an extension of VALMOD to handle discords as well as motifs. It may be relevant to MERLIN. Btw, I have not read either papers yet.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Apr 5, 2022

Have you seen this paper by Keogh? It appears to be an extension of VALMOD to handle discords as well as motifs.

No, this is new to me! I will go through them if I get some time...

I noticed:

  • They cited the paper Disk_Aware_Discord_Discovery, which is the foundation of paper MERLIN (and I think it is the core of method MERLIN.) They referred to this method as DAD and compared their proposed method with it.

  • Their statement sounds reasonable: "DAD has different execution times. We observe that the computational time of DAD depends on the subsequence length, since it computes Euclidean distances in their entirety (only applying early abandoning based on the best so far distance). How effective this early abandoning mechanism is, depends on the characteristics of the data. On the other hand, our algorithm computes all distances for the first subsequence length in constant time, and then prunes entire distance computations for the larger lengths"


So, the paper seems interesting... And, I just noticed you opened an issue. So, All good!

@NimaSarajpoor
Copy link
Collaborator Author

I think we discuss the comments. I can go ahead and apply changes.

@seanlaw
Copy link
Contributor

seanlaw commented Apr 6, 2022

I think we discuss the comments. I can go ahead and apply changes.

Sounds good

- change type of parameter `r` from scalar to float
- remove functions from section BONUS section
@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
I checked the notebook and it seems the comments have been addressed. please feel free to review.

@seanlaw
Copy link
Contributor

seanlaw commented Apr 7, 2022

Will do!

@seanlaw
Copy link
Contributor

seanlaw commented Apr 8, 2022

@NimaSarajpoor Only found some very minor things. I have not spent time comparing the stumpy.stump vs merlin discords outputs though as I think that will come later. Given VALMOD, what do you think we should do next given the overlap? In other words, how should we decide whether to go with the VALMOD route or the MERLIN path for discords? I don't think that the user should need to "choose" if possible. Of course, "MAD" handles both discords AND motifs.

Really, what I'm asking is how should the user "know" when to use what? If the user only cares about discords (and doesn't care about motifs), should they EVER use VALMOD/MAD??

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Apr 8, 2022

Given VALMOD, what do you think we should do next given the overlap? In other words, how should we decide whether to go with the VALMOD route or the MERLIN path for discords?

Really, what I'm asking is how should the user "know" when to use what? If the user only cares about discords (and doesn't care about motifs), should they EVER use VALMOD/MAD??

I am studying VALMOD now, and it might be too soon for me to say, with high certainty, whether we should go with MERLIN or VALMOD.

Based on my current understanding of VALMOD, I think it should perform well in finding motifs/discords but probably not for large-size data. I did a quick scan and I noticed the results are mostly for data sets with size 1M at the most. They considered 2M data points in some cases to show scalability. So, the VALMOD method will probably be useful for many users who want to deal with small/medium size data sets. (Is 1M-2M data points considered as medium-size data?)


I am pointing out the advantage / disadvantage of VALMOD method below:

  • Advantage: In contrast to MERLIN, or a simple matrix profile that only considers 1-NN of each subsequence, it can discover top-k discords while considering not just 1NN but also i-th nearest neighbors. (see Fig. 2 in the extended version VALMOD_2020 and notice the point "top-1 2nd Discord".)

  • Disadvantage: For a given subsequence-length range [min_m, max_m], VALMOD needs to obtain not just the full matrix profile for the min_m, but, in fact, the "top-p nearest neighbors" for each single subsequence of length min_m in time series data T (see Algorithm3) In other words, extracting top-p from the "distance profile" that corresponds to each subsequence . The parameter p should be set by user. For instance, the authors considered p=5 for 0.1M data points, and, in another case, p=50 for 1M data points (see Table 2). As the size T increases, authors considered larger p.

The idea behind MERLIN is to find discords without computing the full matrix profile as it might be computationally heavy for medium/large size data. According to the results illustrated in VALMOD, their method should perform well; however, I THINK that depends on the size of data as VALMOD needs to obtain top-p nearest neighbors of each single subsequence of length min_m from its distance profile and store them in the beginning of the process, and then use them to accelerate the computation of matrix profile for window size in range [min_m+1, max_m].


What do you think? Should we put this on hold, and work on VALMOD, and then compare them?

@seanlaw
Copy link
Contributor

seanlaw commented Apr 9, 2022

So, after addressing the latest comments, maybe we can merge this notebook so that we don't lose anything and then pause since we are at a very good place with it. Then we switch our attention to VALMOD. How does that sound?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Apr 9, 2022

Yes...I think that is a great idea... This way we can also realize whether we need both or not.

So, I will change the commits and push them. I will also read it again to see if I can correct any grammatical errors.

Then, if you are okay, I can continue to work on VALMOD. If you prefer to do it yourself or have some other plans, please feel free and let me know :)

@seanlaw
Copy link
Contributor

seanlaw commented Apr 9, 2022

I'd prefer for you to work on it if you have the time and I get to continue to learn from you 😊

@NimaSarajpoor
Copy link
Collaborator Author

I would love to work on it... I have learned a lot from you... and I am enjoying it 😃.

I will push commits soon in a day or two...

- change name of function _find_discords to _refine_candidates
- update docstring of function _refine_candidates
- add comment next to 1e-6 about setting config
- change function name _find_discords to _refine_candidates throughout
the notebook
- improve docstrings
@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Apr 9, 2022

What I have done:

  • Revised notebook according to last set of comments
  • Read the whole notebook and improve docstrings
  • Ran the whole notebook and it is all good...

I think the notebook is ready.

@seanlaw seanlaw merged commit baa3d99 into stumpy-dev:main Apr 9, 2022
@seanlaw
Copy link
Contributor

seanlaw commented Apr 9, 2022

@NimaSarajpoor Thank you for this contribution! We appreciate all of your hard work

@NimaSarajpoor NimaSarajpoor deleted the Discord_MERLIN branch April 9, 2022 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants