Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] skip empty chats -- SOLUTION explained #112

Open
code-consensus opened this issue Sep 3, 2024 · 11 comments
Open

[FEATURE] skip empty chats -- SOLUTION explained #112

code-consensus opened this issue Sep 3, 2024 · 11 comments
Assignees
Labels
all platforms Issue related to all platforms enhancement New feature or request

Comments

@code-consensus
Copy link

code-consensus commented Sep 3, 2024

Is your feature request related to a problem? Please describe.
The chat export creates, depending on the database, thousands of empty chats, or chats with only system messages. e.g. files with phone numbers that have never exchanged messages user-to-user.
This is also the case with 14- and 15-digit system numbers that have e.g. "123456789012345@lid" as a chat name (which creates a file like 123456789012345.html)
In my particular case, for example with a database I have with about 1000 actual chats, the program will exports an additional 9000 files without any information inside those files.

One can test this out by opening these files, but the only information are messages like: "Not supported WhatsApp internal message", "The security code in this chat changed" and "xxx has changed their number to yyy"

Describe the solution you'd like
Filter out these messages by default.
Optionally, a command-line argument option could be provided if the user would instead like to output them, e.g. "--include-empty-chats".

Solution
The information to filter out the irrelevant chats is actually already readily available inside the msgstore.db database.

My problem in proposing a revision for @KnugiHK is that I am not an expert with SQL, and so I don't know how suggest how to modify the current SQL queries in the messages() or media() functions. However, I know the what needs to be done, and have tested it.

Here is how to find (and filter) the empty chats:

In msgstore.db, there is a table "chat_view". It contains a number of columns, including raw_string_jid and sort_timestamp (also "subject" for the group subject, as well as other data). The sort_timestamp is the time of the most recently received message from that contact/group represented by the raw_string_jid. It seems to me that this table is what the WhatsApp app uses to create the chat list when you launch the app.

The key to filtering out the irrelevant chats is that sort_timestamp is always NULL if you've never exchanged messages with the contact represented by the raw_string_jid. Furthermore, all the "@lid" "contacts" are also always NULL too. So the solution is to ignore processing any messages where the chat ID has a NULL sort_timestamp in the chat_view table.

N.B. the only exception to the above seems to be status@broadcast -- that ALSO has a NULL sort_timestamp, even though it may have legitimate items in that.

I have tested the above with a bit of inelegant code, whereby I created a dictionary of all the raw_string_jids from chat_view with the sort_timestamp, and then I removed chats from the "data" dictionary before create_html() function runs in main.py. The HTML function then didn't export the empty chats.

Unfortunately, I don't really understand the SQL queries in the messages() and media() functions to do the sorting in advance -- but the query would be something like "SELECT raw_string_jid, sort_timestamp FROM chat_view" to get all the timestamps and then cross-check of the raw_string_jid associated with the chat doesn't have a NULL sort_timestamp.

I hope I explained this enough, but if not, happy to clarify!

@code-consensus
Copy link
Author

Hi,

I just noticed that pull request #106 exists, which I missed because the title discussed contacts, but inside there there is code to filter out the empty chats (Refactoring empty filtering).

I would just like to comment that while that code would solve the problem, the program would still first need to run through ALL the entire message database in messages(), creating chatstore entries that are not needed from the get-go.

A less intensive solution would be to do the filtering at the messages() level considering the SQL table I mentioned in the post above (SELECT raw_string_jid, sort_timestamp FROM chat_view). If the SQL query could be altered, then no other changes are required elsewhere to the code (e.g. no changes to json output, no changes to create_html, etc.

There is already the code in the messages() function to deal a bit with filtering for phonenumber with get_chat_condition lines in the SQL query. Unfortunately, I don't know how to write SQL code to suggest the filtering taking into account the chat_view

@KnugiHK KnugiHK self-assigned this Sep 7, 2024
@KnugiHK KnugiHK added the enhancement New feature or request label Sep 7, 2024
@KnugiHK
Copy link
Owner

KnugiHK commented Sep 7, 2024

I agree that it would be better to do the filtering in SQL level.

@code-consensus
Copy link
Author

I agree that it would be better to do the filtering in SQL level.

If I improve on my SQL skills and can think of a solution, I will post it!

In any case, @KnugiHK thanks again for such an amazing project! I had almost a decade of my messages that I thought had been lost forever after some bug in WhatsApp didn't let me import my backup -- finding your tool saved me!!

@KnugiHK
Copy link
Owner

KnugiHK commented Sep 9, 2024

I agree that it would be better to do the filtering in SQL level.

If I improve on my SQL skills and can think of a solution, I will post it!

In any case, @KnugiHK thanks again for such an amazing project! I had almost a decade of my messages that I thought had been lost forever after some bug in WhatsApp didn't let me import my backup -- finding your tool saved me!!

I'm glad this project has been helpful to you, and you've made significant contributions as well!

@KnugiHK KnugiHK added the all platforms Issue related to all platforms label Sep 9, 2024
@KnugiHK
Copy link
Owner

KnugiHK commented Sep 14, 2024

Here comes the problem: I noticed that one of the chats contains only a message with an image I sent (along with its metadata, of course), but in the chat_view table, the chat has a null value for sort_timestamp.

@code-consensus
Copy link
Author

Here comes the problem: I noticed that one of the chats contains only a message with an image I sent (along with its metadata, of course), but in the chat_view table, the chat has a null value for sort_timestamp.

Hi @KnugiHK, I just made a test myself to see what happens on my phone, and the behavior works for me as expected (i.e. not NULL).

Specifically what I did was as follows:

  1. I found a number with which I had NO prior chat history, not even the system message of "messages and calls are end-to-end encrypted."
  2. I sent an image. Note that I only sent an image, i.e. no caption.
  3. The message showed as sent and delivered (two checkmarks), and I immediately shut off the internet to stop any other potential WhatsApp communications. I can also confirm that the contact did not view the message before I shut down the internet on my phone.
  4. I made a chat backup while offline and copied over the database to my computer.

Upon opening msgstore.db, I found the following:

  • "message" table -- I can confirm that it did not contain any messages with that number in the past. The final two rows of the table included two messages to that number back-to-back (milliseconds apart): the image message (message_type 1) and a system message (message_type 7)
  • chat_view: the jid's sort_timestamp was NOT null, but rather the exact same timestamp as the the image message from the "message table"

For reference, my phone is an Android (v14), and I am running WhatsApp version 2.24.17.75.

I wonder if anyone else can carry out such an experiment on a different platform?

@KnugiHK
Copy link
Owner

KnugiHK commented Sep 14, 2024

Hmm... Maybe I should provide more context about the issue I encountered. The chat in question is with a contact who has already de-registered from WhatsApp, and the image was sent via the WhatsApp Broadcast feature.

@code-consensus
Copy link
Author

Hmm... Maybe I should provide more context about the issue I encountered. The chat in question is with a contact who has already de-registered from WhatsApp, and the image was sent via the WhatsApp Broadcast feature.

Ah! Well, as I had mentioned above, I did notice that my status@broadcast also had a null sort_timestamp.
I think the filtering at the SQL level could still work well, but just also to take into account the "special" jids of @broadcast (and perhaps also @newsletter)

@code-consensus
Copy link
Author

I agree that it would be better to do the filtering in SQL level.

@KnugiHK In the end, I actually did figure out how to do this via SQL with two lines in the query, and tested it and achieves the result of filtering out the empty chats. I posted the solution in pull request 106.

With those 2 SQL lines, one doesn't need any new functions and zero changes to all the other functions (e.g. create_html, etc all can stay the same).

@code-consensus
Copy link
Author

code-consensus commented Nov 4, 2024

Hmm... Maybe I should provide more context about the issue I encountered. The chat in question is with a contact who has already de-registered from WhatsApp, and the image was sent via the WhatsApp Broadcast feature.

Hi @KnugiHK , related to this broadcast case of yours, I realize I misunderstood the situation. Also, I didn't really know much about SQL database manipulation before, but now I have quite a bit over the last months. Apologies for the confusion!

In my suggested code, I thought I had dealt with all broadcast cases using "WHERE ... key_remote_jid LIKE '%broadcast%'. But I realize that isn't going to do it, because key_remote_jid is the jid of the person of the chat, not the jid represented in message's sender_jid_row_id. And it is in sender_jid_row_id where the jid row of the broadcast is listed if the message is a broadcast message. The only thing that "LIKE '%broadcast%'" in my suggested code serves is to allow status@broadcast to be processed (which seems to always have a NULL sort_timestamp).

So, if one has received a broadcast from someone, the not-NULL sort_timestamp is sufficient to deal with such cases (because then that person's name shows up in your chat list inside the app). But what I realize is that from what I understand now in your case is that you sent the broadcast, right? (and that person is also no longer on whatsapp)

Now, after a further evaluation, I have figured out how to do process the broadcast cases, even if NULL sort_timestamp, because there is actually another feature of the message table which I hadn't noticed -- there is actually a column called broadcast. If the message is a broadcast, the broadcast column gets a 1. At least I have seen this in my own database.

edit: The other edge case is if you have only had phone calls with an individual, but no actual messages exchanged, then the sort_timestamp also would be NULL, the chat would be hidden from your list, but if you were to find and open the conversation you would have the phone calls exchanged. If that is to be shown, then adding media_wa_type=90 will allow for those chats also to be exported (included below).

With this information, it is possible to tweak what I had proposed as follows:

  1. Add in to the SELECT section of the query in messages() a line of "message.broadcast" (as well as chat.sort_timestamp, as I had suggested prior)
  2. Then, the where section would be altered as follows:
    WHERE key_remote_jid <> '-1' AND (chat.sort_timestamp IS NOT NULL OR key_remote_jid='status@broadcast' OR message.broadcast=1 OR media_wa_type=90)

Again, the benefit is much cleaner code, no need for new functions, no alteration to existing functions, etc., just needing 2 extra lines of SQL query (plus minor alteration of the WHERE line).

Hopefully this is helpful and useful for you!

@KnugiHK
Copy link
Owner

KnugiHK commented Nov 13, 2024

Hmm... Maybe I should provide more context about the issue I encountered. The chat in question is with a contact who has already de-registered from WhatsApp, and the image was sent via the WhatsApp Broadcast feature.

Hi @KnugiHK , related to this broadcast case of yours, I realize I misunderstood the situation. Also, I didn't really know much about SQL database manipulation before, but now I have quite a bit over the last months. Apologies for the confusion!

In my suggested code, I thought I had dealt with all broadcast cases using "WHERE ... key_remote_jid LIKE '%broadcast%'. But I realize that isn't going to do it, because key_remote_jid is the jid of the person of the chat, not the jid represented in message's sender_jid_row_id. And it is in sender_jid_row_id where the jid row of the broadcast is listed if the message is a broadcast message. The only thing that "LIKE '%broadcast%'" in my suggested code serves is to allow status@broadcast to be processed (which seems to always have a NULL sort_timestamp).

So, if one has received a broadcast from someone, the not-NULL sort_timestamp is sufficient to deal with such cases (because then that person's name shows up in your chat list inside the app). But what I realize is that from what I understand now in your case is that you sent the broadcast, right? (and that person is also no longer on whatsapp)

Now, after a further evaluation, I have figured out how to do process the broadcast cases, even if NULL sort_timestamp, because there is actually another feature of the message table which I hadn't noticed -- there is actually a column called broadcast. If the message is a broadcast, the broadcast column gets a 1. At least I have seen this in my own database.

edit: The other edge case is if you have only had phone calls with an individual, but no actual messages exchanged, then the sort_timestamp also would be NULL, the chat would be hidden from your list, but if you were to find and open the conversation you would have the phone calls exchanged. If that is to be shown, then adding media_wa_type=90 will allow for those chats also to be exported (included below).

With this information, it is possible to tweak what I had proposed as follows:

1. Add in to the SELECT section of the query in messages() a line of "message.broadcast" (as well as chat.sort_timestamp, as I had suggested prior)

2. Then, the where section would be altered as follows:
   `WHERE key_remote_jid <> '-1' AND (chat.sort_timestamp IS NOT NULL OR key_remote_jid='status@broadcast' OR message.broadcast=1 OR media_wa_type=90)`

Again, the benefit is much cleaner code, no need for new functions, no alteration to existing functions, etc., just needing 2 extra lines of SQL query (plus minor alteration of the WHERE line).

Hopefully this is helpful and useful for you!

Well considered! I will try to implement it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
all platforms Issue related to all platforms enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants