Distance matrix calculation may be wrong #62
Comments
The substitution matrix (SM) is computed in the classical way, as done originally by Dayhoff (1978) and more recently for the BLOSUM, PAM, Gonnet, etc. SMs for amino acids. The computation could not be simpler; it was presented in our first paper in Proteins 2006, 2008 and, more recently, for the new SM in Biochimie 2011. |
On 02/05/15 11:17, Alexandre G. de Brevern wrote:
|
Sorry, @jbarnoud, I read it a little too fast. |
@jbarnoud the clustering part of PBxplore has never been used in any paper so far. No need to keep a legacy (wrong?) function. @alexdb27 could you give us a clear answer on the two points below?
|
I would like to mention that for hierarchical clustering in R, the value of the diagonal doesn't matter (either min or 0), whereas in scipy/scikit-learn it has to be set to 0. |
I am re-opening this thread because I'm trying to optimize the computation of the distance matrix. So far so good: my method is 6 to 8 times faster than the previous one (thanks to the pdist method). I compute only the condensed matrix (i.e. the upper triangle of the matrix) and, with the squareform method, I get the whole matrix. Unfortunately, I have trouble with the normalization. Currently, the diagonal is set to its minimum value and then the minimum value of the whole matrix is searched. Since the diagonal value is pointless, I think normalizing by this minimum value is not correct. The minimum value should be searched in the upper triangle. What do you think? |
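To make the condensed/square distinction above concrete, here is a minimal pure-Python sketch. It mimics the pair ordering used by scipy's pdist and the expansion done by squareform, but it is not the PBxplore code: the helper names (`condensed_scores`, `to_square`, `mismatches`) and the toy sequences are assumptions made for illustration.

```python
from itertools import combinations


def condensed_scores(items, score):
    """Score each unordered pair once, in (0,1), (0,2), ..., (1,2), ... order."""
    return [score(a, b) for a, b in combinations(items, 2)]


def to_square(condensed, n, diagonal=0.0):
    """Expand a condensed (upper-triangle) list into a full symmetric matrix."""
    square = [[diagonal] * n for _ in range(n)]
    for (i, j), value in zip(combinations(range(n), 2), condensed):
        square[i][j] = value
        square[j][i] = value
    return square


def mismatches(a, b):
    """Hypothetical pairwise measure: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))


# Toy, equal-length sequences (hypothetical data).
sequences = ["ZZdddZZ", "ZZjjjZZ", "ZZjjkZZ"]
condensed = condensed_scores(sequences, mismatches)
square = to_square(condensed, len(sequences))
```

Note that `min(condensed)` only ever sees off-diagonal pairs, which is why the choice of the value used for normalization has to be made explicitly once the diagonal is no longer computed.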
So far, the matrix is normalized using the minimum distance that must be You cannot use a value from outside the diagonal as minimum for the How long would it take to compute the non normalized distances on the On 07/07/15 11:18, Hub wrote:
|
Okay, I see your points. Indeed, the normalization should be done with the minimum value of the diagonal; that makes sense. |
Guys, I am puzzled about something. From the substitution matrix, we have: Say we have three sequences: ZZdddZZ, ZZjjjZZZ and ZZZjjkZZZ. Scores between sequences are: score(ZZjjjZZ/ZZjjkZZ) = 1721 The corresponding matrix is:
If we set the diagonal to the minimum value of the diagonal we get:
We will then have an issue when computing the distance. We should instead set the diagonal to the maximum value of the diagonal:
Don't you think? |
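A small sketch of why the choice matters, under stated assumptions: the score matrix below is hypothetical (only the value 1721 appears in the thread), and the conversion is a plain min-max rescaling, which may differ from the formula actually used in PBlib.

```python
def similarity_to_distance(scores, diagonal_value):
    """Overwrite the diagonal, then min-max rescale so a high score gives a low distance."""
    n = len(scores)
    mat = [row[:] for row in scores]
    for i in range(n):
        mat[i][i] = diagonal_value
    lo = min(min(row) for row in mat)
    hi = max(max(row) for row in mat)
    return [[1 - (v - lo) / (hi - lo) for v in row] for row in mat]


# Hypothetical similarity scores; only 1721 comes from the thread.
scores = [
    [1500, -900, -800],
    [-900, 2100, 1721],
    [-800, 1721, 2000],
]
diag = [scores[i][i] for i in range(len(scores))]

with_min = similarity_to_distance(scores, min(diag))  # diagonal set to 1500
with_max = similarity_to_distance(scores, max(diag))  # diagonal set to 2100
```

With the diagonal set to its minimum, the off-diagonal score 1721 becomes the matrix maximum, so two different sequences end up at distance 0 while each sequence is at a non-zero distance from itself. With the diagonal set to its maximum, self-distances are 0 and every other pair is strictly positive.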
Yes, it seems we need to set to the maximum, hence |
I guess the maximum of the score matrix is necessary in the diagonal. @alexdb27 could you comment on this? I am aware the formula to compute the distance is wrong in the code. I would like to correct this today. |
On 07/07/15 13:01, Hub wrote:
|
@jbarnoud +1 |
I don't see it either, but could it be worth a test case/check in the code? |
I'll add a very simple test. |
A distance matrix is a matrix of distances, so the diagonal must be zero. I propose a little game: 663 -519 -931 second step: third step: fourth step: fifth and last: Sincerely |
@alexdb27 I did not really understand the little game ;-( |
On 07/07/15 13:28, Hub wrote:
@alexdb27 @pierrpo Careful not to confuse the similarity matrix we get |
The little game is the principle for normalizing it in a simple way:
1- the data: 663 -519 -931
second step: simply build the matrix
third step: with regard to the diagonal, you make the matrix positive (the open question is whether this must be done over the whole matrix or per column; I still have some questions of my own about this)
fourth step: the diagonal is set to zero by row (or column), by taking a simple difference with the original value of the diagonal: 0 1182 1594
fifth and last: go back to the half-matrix by adding (j,i) to (i,j) |
@alexdb27 Sorry I did not understand steps 4 and 5... |
4th: we take the first line, 1594 412 0 => (a) subtracting the max gives 0 -1182 -1594. 5th step: you have a non-symmetric matrix, so you add the symmetric cell to the upper one. We have 0 1182 1594 and we want a half matrix, so (0,1) is the sum of (0,1) and (1,0): 1182+2823=4005. This is my point of view; it can be done in other ways. line 1 is 0 1182 1594 |
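For readers who, like me, had trouble following steps 4 and 5, here is one possible reading of them as a Python sketch. The first row (663, -519, -931) and the results 1182, 1594, 2823 and 4005 come from the thread; the diagonal value 2304 is inferred from 2823 on the assumption that the score matrix is symmetric, and the remaining values (100, 1800) as well as the function names are hypothetical.

```python
def row_distances(scores):
    """Step 4: per row, subtract each score from that row's diagonal value.

    Shifting a whole row by a constant first (step 3) would cancel out
    in this difference, so the shift is not applied here.
    """
    return [[row[i] - v for v in row] for i, row in enumerate(scores)]


def symmetrize(dist):
    """Step 5: make the matrix symmetric by summing cells (i,j) and (j,i)."""
    n = len(dist)
    return [[dist[i][j] + dist[j][i] for j in range(n)] for i in range(n)]


scores = [
    [663, -519, -931],   # first row, as given in the thread
    [-519, 2304, 100],   # 2304 inferred from 2823; 100 is hypothetical
    [-931, 100, 1800],   # 1800 is hypothetical
]

dist = row_distances(scores)   # non-symmetric, zero diagonal
half = symmetrize(dist)        # symmetric, zero diagonal
```

Under this reading the first row of `dist` is 0 1182 1594 and cell (0,1) of `half` is 1182+2823=4005, matching the numbers above. Whether the row maximum or the diagonal should be subtracted is left open in the thread; in this example the two coincide.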
This issue emerges from comments on the #56 pull request.
The pull request moves the code that builds the distance matrix into a PBlib.distance_matrix function. I commented about this function:
@pierrepo replied and highlighted another issue in the same function:
The same normalization is used in the code that converts a substitution matrix expressed as similarity scores into a single-digit substitution matrix expressed as distances, in PBlib.matrix_to_single_digit.
So, this issue exposes 3 problems:
PBlib.distance_matrix
PBlib.matrix_to_single_digit
Ultimately, the way we build the distance matrix may not be appropriate, as it scrambles the information given by the diagonal of the substitution matrix. I would be interested in how such distance matrices are built in the literature.
Anyway, this issue should be fixed separately from pull request #56, as it may change the output completely. The current way of building the matrix should be kept as PBlib.matrix_distance_legacy if it was used in already published papers.
cc @pierrepo @alexdb27