Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: classify emails by importance based on subjects #10277

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

st3iny
Copy link
Member

@st3iny st3iny commented Oct 17, 2024

Partly addresses #3968

Closes #8257

Part 5: The great rebasing ...

Improvements

  • Classify emails by importance using a new and improved classification pipeline
  • Persist classifier models (and transformer pipelines) in memory cache only
  • Drop the oc_mail_classifiers table

How to tests?

  1. Have some important mails in your inbox.
  2. Make sure to have the classification enabled: Account settings modal -> "Data collection consent" ✅ and "Mark as important" ✅
  3. Train a model: occ mail:account:train -vvv <account-id> (extract account id with occ mail:account:export <user-id>)
  4. Receive some mails and observe that some are marked as important and some aren't.

@st3iny

This comment was marked as resolved.

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from ebbed72 to 21d45eb Compare October 22, 2024 11:53
@st3iny st3iny added 3. to review feature:priority inbox Features and bugs related to the "priority inbox" feature and removed 2. developing labels Oct 22, 2024
@st3iny st3iny marked this pull request as ready for review October 22, 2024 17:47
Copy link
Member

@ChristophWurst ChristophWurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First impression: looks good 😎

lib/Command/PreprocessAccount.php Outdated Show resolved Hide resolved
lib/Db/StatisticsDao.php Outdated Show resolved Hide resolved
lib/Db/StatisticsDao.php Outdated Show resolved Hide resolved
@ChristophWurst
Copy link
Member

ChristophWurst commented Nov 13, 2024

I'll observe my account for a while. So far the stats are (suspiciously) excellent:

[debug] found 36 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 300 messages of which 42 are important
WCF vocab size: 500
[debug] data set split into 240 (i: 33) training and 60 (i: 9) validation sets with 504 dimensions
[debug] classification report: {"recall":1,"precision":1,"f1Score":1}
[debug] classifier validated: recall(important)=1, precision(important)=1 f1(important)=1

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from a7ea9c0 to f4fc5bc Compare November 15, 2024 06:39
@ChristophWurst
Copy link
Member

image

I'm still not sure how I can judge the quality of the new classification. Is it overfitting?

@kesselb
Copy link
Contributor

kesselb commented Nov 15, 2024

[debug] found 7 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 300 messages of which 241 are important
[debug] data set split into 240 (i: 195) training and 60 (i: 46) validation sets with 504 dimensions
[debug] classification report: {"recall":1,"precision":0.8518518518518519,"f1Score":0.92}
[debug] classifier validated: recall(important)=1, precision(important)=0.85185185185185 f1(important)=0.92
[debug] Classifier for account 1 persisted

I will keep an eye on the classification ;)

However, I haven't cleaned my inbox in a while and fear the countless undeleted "update xyz announcements" or "calendar notifications" might influence the model 🙈

@kesselb
Copy link
Contributor

kesselb commented Nov 15, 2024

  1. Would adding an option to exclude or include specific mailboxes from the training improve the results? For instance, my Archive folder holds about 160 important messages, while my "Test Data" folder has around 5,000 emails that I keep for testing purposes. These test emails aren’t relevant at all, but I move them there instead of deleting them to maintain a large mailbox for tests.

  2. While reviewing the code, I noticed that starring a message in Thunderbird sets the flag_flagged attribute. Should we consider this attribute when identifying important messages?

$importantMessages = array_filter($messages, static function (Message $message) {
return ($message->getFlagImportant() === true);
});

} catch (DoesNotExistException $e) {
public function loadLatest(Account $account): ?ClassifierPipeline {
$cached = $this->getCached((string)$account->getId());
if ($cached == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ($cached == null) {
if ($cached === null) {

'exception' => $e,
]);
}

if ($estimator === null) {
if ($pipeline === null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this condition was only met for new accounts where no training has happened yet. Now servers with no memory cache will always run into the rule-based classifier. I don't think this is desirable, because results will be bad. The rules based classifier is only meant to cold start the classification.
How about we skip classification all together if there is no distributed cache available?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, instead of skipping, we generate a pipeline on the fly? It's slow but gives good results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3. to review enhancement feature:priority inbox Features and bugs related to the "priority inbox" feature
Projects
Status: 🏗️ In progress
Development

Successfully merging this pull request may close these issues.

3 participants