-
Notifications
You must be signed in to change notification settings - Fork 337
[WIP] Discovering Discords of arbitrary length using MERLIN #417 #505
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@seanlaw |
Codecov Report
@@ Coverage Diff @@
## main #505 +/- ##
=======================================
Coverage 99.89% 99.89%
=======================================
Files 80 80
Lines 11300 11300
=======================================
Hits 11288 11288
Misses 12 12 Continue to review full report at Codecov.
|
@ninimama Will do. Thanks! |
@seanlaw |
…euclidean distance
…d euclidean distance
…malized euclidean distances
What I have done:
Also, I tried to make sure the implementation of NOTE (1):
NOTE (2): |
… the name used in _select_candidates
Great and thank you. I will take a look.
Maybe it's just me but I find this implementation more confusing. The version in the notebook is much more readable and consistent with the first phase.
Yes, let's think about it later. I think that this notebook will end up being split up into "Tutorial_Merlin.ipynb" (i.e., how to use the |
You are correct. In fact, that was one of the reason I didn't include it in the notebook because I remember the importance of readability. And, I think they have similar running time.
Exactly. So, for now, if we discuss something about the implementation of non-normalized case, I will keep that in mind and might make a short note about it in the notebook. But, I avoid going to details. Btw, I already provided the math for both normalized and non-normalized cases in the notebook. We can keep it as is for now. I can move the calculation related to non-normalized part if you think that is better. |
@ninimama Overall, I think things are looking great (good work)! Phase 2 was pretty straightforward to understand given how simple we've kept the code. I've made some suggestions for you to consider |
I think this round of review was straightforward. please let me know if would like to discuss something further. Otherwise, I can go ahead to apply changes and implement the third phase (i.e. finding NN of each of final candidates). |
@ninimama Sounds good. Please feel free to move forward |
I think the notebook is ready. Important changes:
Btw, I ran the whole notebook and it is all good. (In fact, I ran the whole notebook every time I made a change, and I now noticed that my pushed commits show changes of id of cells as well. Sorry about that! I should have run the whole thing only once.) |
Thank you. I will take a look |
@NimaSarajpoor Have you seen this paper by Keogh? It appears to be an extension of VALMOD to handle discords as well as motifs. It may be relevant to MERLIN. Btw, I have not read either papers yet. |
No, this is new to me! I will go through them if I get some time... I noticed:
So, the paper seems interesting... And, I just noticed you opened an issue. So, All good! |
I think we discuss the comments. I can go ahead and apply changes. |
Sounds good |
- change type of parameter `r` from scalar to float - remove functions from section BONUS section
@seanlaw |
Will do! |
@NimaSarajpoor Only found some very minor things. I have not spent time comparing the Really, what I'm asking is how should the user "know" when to use what? If the user only cares about discords (and doesn't care about motifs), should they EVER use VALMOD/MAD?? |
I am studying VALMOD now, and it might be too soon for me to say, with high certainty, whether we should go with MERLIN or VALMOD. Based on my current understanding of VALMOD, I think it should perform well in finding motifs/discords but probably not for large-size data. I did a quick scan and I noticed the results are mostly for data sets with size 1M at the most. They considered 2M data points in some cases to show scalability. So, the VALMOD method will probably be useful for many users who want to deal with small/medium size data sets. (Is 1M-2M data points considered as medium-size data?) I am pointing out the advantage / disadvantage of VALMOD method below:
The idea behind MERLIN is to find discords without computing the full matrix profile as it might be computationally heavy for medium/large size data. According to the results illustrated in VALMOD, their method should perform well; however, I THINK that depends on the size of data as VALMOD needs to obtain What do you think? Should we put this on hold, and work on VALMOD, and then compare them? |
So, after addressing the latest comments, maybe we can merge this notebook so that we don't lose anything and then pause since we are at a very good place with it. Then we switch our attention to VALMOD. How does that sound? |
Yes...I think that is a great idea... This way we can also realize whether we need both or not. So, I will change the commits and push them. I will also read it again to see if I can correct any grammatical errors. Then, if you are okay, I can continue to work on VALMOD. If you prefer to do it yourself or have some other plans, please feel free and let me know :) |
I'd prefer for you to work on it if you have the time and I get to continue to learn from you 😊 |
I would love to work on it... I have learned a lot from you... and I am enjoying it 😃. I will push commits soon in a day or two... |
- change name of function _find_discords to _refine_candidates - update docstring of function _refine_candidates - add comment next to 1e-6 about setting config
- change function name _find_discords to _refine_candidates throughout the notebook - improve docstrings
What I have done:
I think the notebook is ready. |
@NimaSarajpoor Thank you for this contribution! We appreciate all of your hard work |
In this PR, we would like to implement MERLIN algorithm that discovers discords of arbitrary length. Although the MATLAB implementation of the faster version of MERLIN is available on MERLIN: SUPPORT, we, for now, implement the original version as proposed in the MERLIN.
What I have done so far:
NOTE:
(1) I already implemented MERLIN algorithm and enhanced it a little bit. Both
MERLIN
andSTUMPY: MatrixProfile
gives EXACTLY the same output regarding thediscords indices
and theirdistances to their NN
. Thediscord NN indices
are the same in almost all cases.(2) In terms of performance, MERLIN outperforms STUMPY for long time series.