
WIP: Multilabel classification #440

Merged: 35 commits into main on May 11, 2024
Conversation

@x-tabdeveloping (Collaborator)

Working on #434.
I still have to add a good test task; if anyone has one, don't hesitate to comment.

@x-tabdeveloping added the enhancement label on Apr 19, 2024
@isaac-chung (Collaborator)

@x-tabdeveloping (Collaborator, Author)

I'll look into it, thanks @isaac-chung!

@KennethEnevoldsen (Contributor) left a comment

Looks really good @x-tabdeveloping. A few points of discussion, but testing it on a task seems like the best next step.

Two review threads on mteb/abstasks/AbsTaskMultilabelClassification.py (outdated, resolved)
@x-tabdeveloping (Collaborator, Author)

I'm currently in the process of adding EURLEX.

@isaac-chung mentioned this pull request on Apr 24, 2024
…step outside the evaluator and encoding every possible training sentence before running the evaluation.
@x-tabdeveloping (Collaborator, Author)

Currently this PR assumes that all labels in the classification are independent of each other.
This is a consequence of using sklearn's MultiOutputClassifier, which fits a separate, independent classifier for each label.

Some options we could consider that would fix this:

  1. ClassifierChain, which would be an optimal choice for hierarchical tasks or tasks where an obvious ordering of the labels exists. We would have to be careful to order the labels properly, though, which might be a pain and a half to do, and I'm not sure whether that should be the specific task's or the AbsTask's responsibility.
  2. Using a neural network like MLPClassifier with multiple outputs. This would be a good option because it needs no ordering and does not assume label independence, but it's way slower than just using kNN, and we would also lose a great deal of conceptual transparency (see the sketch below).
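
For concreteness, here is a minimal sketch of what the two options above could look like in sklearn. This is not the PR's code; the toy embeddings and label matrix are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                 # stand-in for sentence embeddings
Y = (rng.random((200, 4)) < 0.3).astype(int)   # 4 binary labels per example

# Option 1: ClassifierChain feeds each per-label classifier the predictions
# of the previous ones, so label dependencies are modeled, but the label
# order matters (here it is simply randomized).
chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X, Y)

# Option 2: one MLP with a sigmoid output per label; no ordering is needed
# and labels are not assumed independent, but it is much slower than kNN.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X, Y)   # MLPClassifier accepts a multilabel indicator matrix directly
```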

What do you guys think, @KennethEnevoldsen @imenelydiaker @isaac-chung?

@x-tabdeveloping (Collaborator, Author)

I'm currently running MultiEURLEX on my machine; this might take a fair bit :D

@KennethEnevoldsen (Contributor)

KennethEnevoldsen commented Apr 24, 2024

My immediate assumption is just to go for simplicity and then we can always expand to other cases in the future.

@x-tabdeveloping (Collaborator, Author)

Regarding the points: do we count MultilabelClassification as a new task for each language contained in EURLEX, or should I only add bonus points for those languages that had no classification task prior to this?

@x-tabdeveloping (Collaborator, Author)

I have been running the task basically all day on UCloud with the two models; it takes a ridiculous amount of time.

@x-tabdeveloping (Collaborator, Author)

Running on UCloud again, should be able to submit results within a day.

@KennethEnevoldsen (Contributor) left a comment

@x-tabdeveloping feel free to merge it in once it is done running!

@x-tabdeveloping (Collaborator, Author)

  1. It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set instead of both the validation set and the test set.
  2. Performance is still poor; I have no idea what to do about that and am at a bit of a loss as to what is happening.

@x-tabdeveloping (Collaborator, Author)

I made the neural network smaller and introduced stratified subsampling for the test set so that it runs faster; I will try to do a rerun.
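
Multilabel stratification isn't supported by sklearn's train_test_split, so subsampling needs a custom step. The function below is a hypothetical sketch of the idea, not the implementation in this PR: keep at least one positive example per label, then fill the rest of the budget at random:

```python
import numpy as np

def stratified_multilabel_subsample(Y, n_samples, seed=42):
    """Pick `n_samples` row indices from a binary label matrix Y,
    ensuring every label that occurs at all stays represented."""
    rng = np.random.default_rng(seed)
    chosen = set()
    # First pass: cover every label with at least one positive example.
    for label in range(Y.shape[1]):
        positives = np.flatnonzero(Y[:, label])
        if len(positives) > 0:
            chosen.add(int(rng.choice(positives)))
    # Second pass: fill the remaining budget uniformly at random.
    remaining = np.array([i for i in range(Y.shape[0]) if i not in chosen])
    n_extra = min(max(0, n_samples - len(chosen)), len(remaining))
    if n_extra > 0:
        chosen.update(int(i) for i in rng.choice(remaining, size=n_extra, replace=False))
    return np.array(sorted(chosen))

# Example: shrink a 10k-example test split to 2k rows.
Y = (np.random.default_rng(0).random((10_000, 100)) < 0.05).astype(int)
idx = stratified_multilabel_subsample(Y, n_samples=2_000)
```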

@isaac-chung (Collaborator)

For what it's worth, it might help to debug with a small dataset.

@KennethEnevoldsen (Contributor)

Yeah, using a smaller dataset for testing seems like the right approach.

> It runs very slowly; I couldn't complete the runs. Maybe we should subsample and limit evaluation to the test set instead of both the validation set and the test set.

Hmm, any idea which part is slow? Is it simply running the trained model on the test set? (In which case reducing the test set might be an option.)

> Performance is still poor; I have no idea what to do about that and am at a bit of a loss as to what is happening.

Doing a baseline using a logistic regression on each label is probably a good idea.
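
A per-label logistic regression baseline could be wired up like this; a sketch with synthetic data standing in for the real embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(500, 32)), rng.normal(size=(100, 32))
Y_train = (rng.random((500, 4)) < 0.3).astype(int)

# One independent logistic regression per label, mirroring the
# MultiOutputClassifier setup already used for kNN in this PR.
baseline = MultiOutputClassifier(LogisticRegression(max_iter=1000))
baseline.fit(X_train, Y_train)
Y_pred = baseline.predict(X_test)   # (100, 4) binary predictions
```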

@x-tabdeveloping (Collaborator, Author)

Something's not right with these scores; I'll do a deep dive.

@x-tabdeveloping (Collaborator, Author)

I ran EURLEX in English with all-MiniLM-L6 and multiple classifiers (MLPClassifier, kNN, DummyClassifier).
It would seem that the task is simply incredibly hard, and that accuracy is not a good metric to reflect performance; maybe we should make LRAP the main score.
Also note that kNN outperforms the MLP by quite a bit. I think this is mainly because the training set is very small and the MLP is quite parameter-rich.

My suggestion is that we roll back to kNN and make LRAP the main score. What do you think, @KennethEnevoldsen?

{
  "en": {
    "dummy": {
      "accuracy": 0.0,
      "f1": 0.0,
      "lrap": 0.17113333333332317
    },
    "knn": {
      "accuracy": 0.0396,
      "f1": 0.29540945816583636,
      "lrap": 0.4267690714285629
    },
    "mlp": {
      "accuracy": 0.0082,
      "f1": 0.08189335124049107,
      "lrap": 0.2942032142856986
    }
  },
  "evaluation_time": 270.71
}
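
For reference, the lrap value above presumably corresponds to sklearn's label ranking average precision, which scores continuous per-label scores rather than hard predictions; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

Y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 1, 0]])
# Continuous scores, e.g. predicted probabilities, one column per label.
Y_score = np.array([[0.75, 0.50, 0.10],
                    [0.10, 0.20, 0.90],
                    [0.80, 0.90, 0.20]])
print(label_ranking_average_precision_score(Y_true, Y_score))  # 1.0 here
```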

@x-tabdeveloping (Collaborator, Author)

Also, including the DummyClassifier scores gives us a relatively good idea of the chance level in this multilabel case.
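
As a hedged illustration (synthetic data, not the PR's code), chance-level LRAP can be read off a DummyClassifier that predicts each label's empirical frequency:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import label_ranking_average_precision_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
Y = (rng.random((200, 4)) < 0.3).astype(int)

dummy = DummyClassifier(strategy="prior")   # ignores X, predicts label priors
dummy.fit(X, Y)
# For multilabel Y, predict_proba returns one (n_samples, 2) array per label;
# stack the positive-class columns into an (n_samples, n_labels) score matrix.
scores = np.stack([p[:, 1] for p in dummy.predict_proba(X)], axis=1)
print(label_ranking_average_precision_score(Y, scores))
```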

@KennethEnevoldsen (Contributor)

I would not include it in the task, but it might be interesting to just have a "random" model as a baseline.

A couple of thoughts:

  • It might be worth increasing the training set size for the MLP.
    • It might be fine with just kNN; alternatively, we can do kNN + MLP and take the best (similar to clf).
  • It might be worth getting performance scores for subcategories (though in this case it is 100+, right?).
  • I would also like an experiment using the base e5, just to see that larger models actually perform better.

@x-tabdeveloping (Collaborator, Author)

E5 definitely performs better on the task than paraphrase-multilingual. I'm not sure about the subcategories; that might be a bit too much for some tasks, though we could include it if need be.
In my experiments kNN uniformly performs better, even with larger training set sizes. I suppose the MLP would surpass kNN if the training set grew even larger, but we're already fighting performance issues with the benchmark, and I think the less we have to embed the better.

@x-tabdeveloping (Collaborator, Author)

Also, specific tasks are free to use whatever they want; if you think an MLP is a better fit, you can specify it in the task.
What are your thoughts on the PR right now, @KennethEnevoldsen? Should we merge, or is there something that still needs to be addressed?

@KennethEnevoldsen (Contributor)

I believe it is fine to merge.

@x-tabdeveloping enabled auto-merge (squash) on May 11, 2024, 12:06
@x-tabdeveloping merged commit 2aa0c67 into main on May 11, 2024
7 checks passed
@x-tabdeveloping deleted the multilabel-classification branch on May 11, 2024, 12:10