Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distance martix calculation may be wrong #62

Closed
3 tasks
jbarnoud opened this issue May 1, 2015 · 26 comments
Closed
3 tasks

Distance martix calculation may be wrong #62

jbarnoud opened this issue May 1, 2015 · 26 comments

Comments

@jbarnoud
Copy link
Collaborator

jbarnoud commented May 1, 2015

This issue emerges from comments on the #56 pull request.

The pull request moves the code that build the distance matrix into a PBlib.distance_matrix function. I commented about this function:

At this point of the function, we are building a similarity matrix, we only convert it to a distance matrix in the next lines. Therefore, should not the diagonal be the maximum value, so the diagonal will correspond to a distance of 0 once the matrix normalized and converted? In most cases, it should not matter as the minimum value in the diagonal is most likely greater than all values out of the diagonal. Yet, it is not impossible to have cases were it is not true. For instance, ZZdddddZZ as a similarity score of 1105 whith itself, while the similarity score between ZZeeeeeZZ and ZZeefeeZZ is 2265.

@pierrepo replied and highlighted an other issue n the same function:

agree with you @jbarnoud.
However, I believe I made an error in the normalization formula.
It should be
distance_mat = 1 - (distance_mat - mini)/(maxi - mini)
instead of
distance_mat = 1 - (distance_mat + abs(mini))/(maxi - mini)

The same normalization in used in the code that convert a similarity-score-expressed substitution matrix to a single digit distance-expressed substitution matrix in PBlib.matrix_to_single_digit.

So, this issue exposes 3 problems:

  • the diagonal is set to an inappropriate value
  • the normalization is wrong in PBlib.distance_matrix
  • the normalization is wrong in PBlib.matrix_to_single_digit

Ultimately, the way we build the distance matrix may not be appropriate as it scramble the information given by the diagonal of the substitution matrix. I would be interested by how such distance matrix are built in the literature.

Anyway, this issue should be fixed separately from the pull request #56 as it may change the output completely. The current way of building the matrix should be kept as PBlib.matrix_distance_legacy if it was used in already published papers.

cc @pierrepo @alexdb27

@jbarnoud
Copy link
Collaborator Author

jbarnoud commented May 1, 2015

@alexdb27
Copy link
Contributor

alexdb27 commented May 2, 2015

Ultimately, the way we build the substitution matrix may not be appropriate as it scramble the information given by the diagonal of the substitution matrix. I would be interested by how such distance matrix are built in the literature.

The substitution matrix (SM) is computed in a very classical way as done originally by Dayhoff (1978) or more recently for the Blosum, PAM, Gonnet, etc. SM for amino acids. The computation cannot be more simple and was presented in our first paper in Proteins 2006, 2008 or more recently the new SM in Biochimie 2011.
The fact that ZZdddddZZ aligned with itself as not the same values that ZZeeeeeZZ and ZZeefeeZZ is totally logical as the letter d (PB d) is seen quite often and e and f less.
You have the same results with the alignement of the amino acid string WWWWWWWW with itself and AAAAAA with itslef, the first one will be also largely higher. It is totally logical.

@alexdb27
Copy link
Contributor

alexdb27 commented May 2, 2015

Nonetheless, the point is very intersting cc @pierrepo @alexdb27.

@jbarnoud
Copy link
Collaborator Author

jbarnoud commented May 2, 2015

On 02/05/15 11:17, Alexandre G. de Brevern wrote:

Ultimately, the way we build the substitution matrix may not be
appropriate as it scramble the information given by the diagonal
of the substitution matrix. I would be interested by how such
distance matrix are built in the literature.

The substitution matrix (SM) is computed in a very classical way as
done originally by Dayhoff (1978) or more recently for the Blosum,
PAM, Gonnet, etc. SM for amino acids. The computation cannot be more
simple and was presented in our first paper in Proteins 2006, 2008 or
more recently the new SM in Biochimie 2011.
The fact that ZZdddddZZ aligned with itself as not the same values
that ZZeeeeeZZ and ZZeefeeZZ is totally logical as the letter d (PB d)
is seen quite often and e and f less.
You have the same results with the alignement of the amino acid string
WWWWWWWW with itself and AAAAAA with itslef, the first one will be
also largely higher. It is totally logical.


Reply to this email directly or view it on GitHub
#62 (comment).

Sorry. I meant that the distance matrix is not computed in a
appropriate way.

@alexdb27
Copy link
Contributor

alexdb27 commented May 2, 2015

Sorry, @jbarnoud , i've read it a little bit fast.
In fact if i remember well, most of the time, diagonal is not computed at all, it is set directly at ... zero ...

@pierrepo
Copy link
Owner

pierrepo commented May 2, 2015

@jbarnoud the clustering part of PBxplore has never been used in any paper so far. No need to keep a legacy (wrong?) function.

@alexdb27 could you give us a clear answer on the two points below?

  • Normalization function must be

    distance_mat = 1 - (distance_mat - mini)/(maxi - mini)
    instead of
    distance_mat = 1 - (distance_mat + abs(mini))/(maxi - mini)

  • Which value to we put in the diagonal ? max() ? min() ? 0 ?

@HubLot
Copy link
Collaborator

HubLot commented May 7, 2015

I would like to mention that for the hierarchical clustering in R, the value of the diagonal doesn't matter (either min or 0) and in scipy/scikit-learn, it has to be set at 0.

@HubLot
Copy link
Collaborator

HubLot commented Jul 7, 2015

I re-open this thread because I'm trying to optimize the computation of the distance matrix.

So far so good, my method is between 6-8 times faster to the previous one (thanks to pdist method). I compute only the condensed matrix (i.e upper triangle of the matrix) and with squareform method, I get the whole matrix.

Unfortunately, I have trouble with the normalization. Currently, the diagonal was set to its minimum value and then the minimum value of the whole matrix was searched.
With my calculation my diagonal is set to 0, so I search the min value in my condensed matrix but this min value could be the diagonal value... I cannot have the same matrix distance as before.

Since the diagonal value is pointless, I think normalize by this minimum value is not correct. The minimum value should be search in the upper triangle.

What do you think?

@jbarnoud
Copy link
Collaborator Author

jbarnoud commented Jul 7, 2015

So far, the matrix is normalized using the minimum distance that must be
on the diagonal. This, of course, assume that the distances are computed
for the diagonal. Doing so, the values on the diagonal should be close
to 0 on the diagonal, after normalization. Some cells from the diagonal
(at least 1) should be 0 after the normalization, the other ones are set
to 0 afterward.

You cannot use a value from outside the diagonal as minimum for the
normalization. By doing so, you will have nul distances for sequences
that are different from each other.

How long would it take to compute the non normalized distances on the
diagonal? If you have them, you can get the minimum to use for the
normalization. Since, all the values from the diagonal will be set to 0
anyway, you do not have to change your use of pdist, you just have to
get the non-normalized values for the diagonal on the side of what you
currently do.

On 07/07/15 11:18, Hub wrote:

I re-open this thread because I'm trying to optimize the computation
of the distance matrix.

So far so good, my method is between 6-8 times faster to the previous
one (thanks to pdist method
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.pdist.html).
I compute only the condensed matrix (i.e upper triangle of the matrix)
and with squareform method
http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.squareform.html,
I get the whole matrix.

Unfortunately, I have trouble with the normalization. Currently, the
diagonal was set to its minimum value and then the minimum value of
the whole matrix was searched.
With my calculation my diagonal is set to 0, so I search the min value
in my condensed matrix but this min value could be the diagonal
value... I cannot have the same matrix distance as before.

Since the diagonal value is pointless, I think normalize by this
minimum value is not correct. The minimum value should be search in
the upper triangle.

What do you think?


Reply to this email directly or view it on GitHub
#62 (comment).

@HubLot
Copy link
Collaborator

HubLot commented Jul 7, 2015

Okay, I see your points. Indeed, the normalization should be done through the minimum value of the diagonal, it makes sense.
I'll see how i can compute the diagonal.

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

Guys, I am puzzled about something.

From the substitution matrix, we have:
score(d/d) = 221
score(j/j) = 768
score(k/k) = 568
score(j/k) = 196
score(j/d) = -173
score(d/k) = -585

Say we have three sequences: ZZdddZZ, ZZjjjZZZ and ZZZjjkZZZ.

Scores between sequences are:
score(ZZdddZZ/ZZdddZZ) = 663
score(ZZjjjZZ/ZZjjjZZ) = 2304
score(ZZjjkZZ/ZZjjkZZ) = 2104

score(ZZjjjZZ/ZZjjkZZ) = 1721
score(ZZjjjZZ/ZZdddZZ) = -519
score(ZZjjkZZ/ZZdddZZ) = -931

The corresponding matrix is:

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ      663     -519     -931
ZZjjjZZ              2304     1721
ZZjjkZZ                       2104

If we set the diagonal to the minimum value of the diagonal we get:

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ      663     -519     -931
ZZjjjZZ              663      1721
ZZjjkZZ                        663

We then will have an issue while computing the distance.

We should instead set the diagonal to the maximum value of the diagonal :

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ     2304     -519     -931
ZZjjjZZ              2304     1721
ZZjjkZZ                       2304

Don't you think?

@HubLot
Copy link
Collaborator

HubLot commented Jul 7, 2015

Yes, it seems we need to set to the maximum, hence distance = 1 - (score -mini)/(maxi - mini) would equal 0.
Plus in the code, the formula is still: distance = 1 - (score + abs(min))/(maxi - min)
But, what happens if the maximum is not on the diagonal? The distance on the diagonal will not be set to 0...

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

I guess the maximum of the score matrix is necessary in the diagonal. @alexdb27 could you comment on this?

I am aware the formula to compute the distance is wrong in the code. I would like to correct this today.

@jbarnoud
Copy link
Collaborator Author

jbarnoud commented Jul 7, 2015

On 07/07/15 13:01, Hub wrote:

Yes, it seems we need to set to the maximum, hence |distance = 1 -
(score -mini)/(maxi - mini)| would equal 0.
Plus in the code, the formula is still: |distance = 1 - (score +
abs(min))/(maxi - min)|
But, what happens if the maximum is not on the diagonal? The distance
on the diagonal will not be set to 0...


Reply to this email directly or view it on GitHub
#62 (comment).

Before the normalization, we compute a similarity matrix. I cannot see a
case where the maximum value is not on the diagonal. Do you ?

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

@jbarnoud +1

@HubLot
Copy link
Collaborator

HubLot commented Jul 7, 2015

I don't see either but could it worth a test case/check in the code?

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

I'll add a very simple test.
Maybe @alexdb27 can confirme to us that the maximum is always on the diagonal?

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

A distance matrix is a matrix with distance. So, the diagonal must be at zero.

I propose a little game:

663 -519 -931
----- 2304 1721
----- ------- 2104

second step:
663 -519 -931
-519 2304 1721
-931 1721 2104

third step:
1594 412 0
412 3235 2652
0 2652 3035

fourth step:
0 1182 1594
2823 0 583
3035 383 0

fifth and last:
0 4005 4629
---- 0 966
---- ---- 0

Sincerely

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

PS: you write too fast @pierrepo @jbarnoud @HubLot

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

@alexdb27 I did not really understand the little game ;-(
I agree the diagonal of a distance matrix must be 0 but we are not sure how to compute the distance of other parts of the matrix.

@jbarnoud
Copy link
Collaborator Author

jbarnoud commented Jul 7, 2015

On 07/07/15 13:28, Hub wrote:

I don't see either but could it worth a test case/check in the code?


Reply to this email directly or view it on GitHub
#62 (comment).

@HubLot I have a test version were I check with assertions that the
diagonal is 0 (because I do not set it explicitly), that there is no
negative values, and that that the maximum value is 1.

@alexdb27 @pierrpo Careful not to confuse the similarity matrix we get
from the substitution matrix (where the maximum value should be on the
diagonal) with the distance matrix that we get after normalization
(where the diagonal must be 0).

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

Thhe little game is the principle to normalize it in a simple way:

1- the data :

663 -519 -931
----- 2304 1721
----- ------- 2104

second step, simply do the matrix
663 -519 -931
-519 2304 1721
-931 1721 2104

third step, in regards to the diagonal, you "positive" the matrix (it is a question , must it be the matrix or the column. I've some personal question to myself)
1594 412 0
412 3235 2652
0 2652 3035

fourth step, diagonal is at zero by row (or column), with a simple difference with the original value of the diagonal:

0 1182 1594
2823 0 583
3035 383 0

fifth and last : back to the half-matrix by adding (j,i) to (i,j)
0 4005 4629
---- 0 966
---- ---- 0

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

@pierrepo I hope it explain you how i see the question

@jbarnoud @HubLot it is too funny !!

@pierrepo
Copy link
Owner

pierrepo commented Jul 7, 2015

@alexdb27 Sorry I did not understand steps 4 and 5...

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

@alexdb27
Copy link
Contributor

alexdb27 commented Jul 7, 2015

4th, we take the first line:

1594 412 0 => (a) minus the max give 0 -1182 -1594
(b) then abs => 0 1182 1594

5th step, you have a non-symetric matrix so you add the symetrical cell to the one upper.

we have 0 1182 1594
2823 0 583
3035 383 0

so we want a half matrix
(0,0) (0,1) (0,2)
----- (1,1) (1,2)
----- ------ (2,2)

(0,1) is the sum of (0,1) and (1,0) 1182+2823=4005
(0,2) is the sum of (0,2) and (2,0) 1594+3035=4629
(1,2) is the sum of (1,2) and (2,1) 583+383 = 966

It is my point of view others can be done.

line 1 is 0 1182 1594

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants