[Question] A “raw” version of the tiny_dataset.zip #43

Closed
l3kn opened this issue Jan 28, 2024 · 3 comments

Comments

l3kn commented Jan 28, 2024

In #28 a link to a pre-processed small dataset was shared.

While testing different ways of converting review logs of different spacing algorithms
to FSRS, my evaluation on ~7000 reviews, generated using an Emacs Lisp implementation of py-fsrs,
suggests that updating the difficulty and stability for every review with an interval greater than 1 day
is slightly better than relying on the (re)learning/review states of the py-fsrs implementation.

To make sure I didn't make any mistake in my evaluation code and test on larger datasets,
I'd like to retry this experiment using the code and datasets of this benchmark
but I can't do so with the “tiny_dataset.zip” because the delta_t values have been rounded to days.

Would it be possible to get access to a similar dataset either in an unprocessed format
or with floating-point delta_t values?
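To make the request concrete, here is a minimal sketch of the two delta_t representations computed from raw review timestamps. The function names and timestamps are made up for illustration, and counting calendar-date changes is only an assumption about how the preprocessing rounds to days.

```python
from datetime import datetime

SECONDS_PER_DAY = 86_400


def delta_t_float(prev: datetime, curr: datetime) -> float:
    """Elapsed time between two reviews in fractional days."""
    return (curr - prev).total_seconds() / SECONDS_PER_DAY


def delta_t_days(prev: datetime, curr: datetime) -> int:
    """Elapsed time in whole days, counted as calendar-date changes (assumed)."""
    return (curr.date() - prev.date()).days


# Hypothetical pair of reviews: late evening, then the next morning.
prev = datetime(2024, 1, 27, 22, 0)
curr = datetime(2024, 1, 28, 10, 0)

print(delta_t_float(prev, curr))  # 0.5
print(delta_t_days(prev, curr))   # 1
```

With only the second value stored, the 12-hour gap can no longer be told apart from a gap of a full day or more.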

This seems to be related to a difference in how the benchmark and the optimizer
implement the FSRS algorithm (using the first review of each day, as I understand it) and how it's
implemented in e.g. py-fsrs (using states to decide when to update the parameters).
I'm not sure how to compare the two approaches other than using review logs from FSRS
and testing if the recall prediction would have been more accurate if we had included
reviews that occurred in the (re)learning state but after a sufficiently large interval or on a different day.
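The two selection rules can be sketched roughly as follows. This is not the code of the benchmark, the optimizer, or py-fsrs; the `Review` type, both function names, and the sample log are hypothetical, and the "first review of each day" rule is only my understanding as stated above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Review(NamedTuple):
    card_id: int
    time: datetime
    rating: int  # assumed scale: 1 = again ... 4 = easy


def first_review_per_day(reviews):
    """Keep only the first review of each card on each calendar day."""
    seen = set()
    kept = []
    for r in sorted(reviews, key=lambda r: r.time):
        key = (r.card_id, r.time.date())
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def reviews_after_min_interval(reviews, min_interval=timedelta(days=1)):
    """Keep a card's first review, plus every review that came more than
    min_interval after that card's previous review (kept or not)."""
    last_time = {}
    kept = []
    for r in sorted(reviews, key=lambda r: r.time):
        prev = last_time.get(r.card_id)
        if prev is None or r.time - prev > min_interval:
            kept.append(r)
        last_time[r.card_id] = r.time
    return kept


# Hypothetical log for a single card.
log = [
    Review(1, datetime(2024, 1, 1, 9, 0), 3),
    Review(1, datetime(2024, 1, 1, 9, 10), 3),  # same day, 10 minutes later
    Review(1, datetime(2024, 1, 2, 8, 0), 3),   # next day, under 24 hours later
    Review(1, datetime(2024, 1, 4, 8, 0), 3),   # two days later
]

print(len(first_review_per_day(log)))        # 3
print(len(reviews_after_min_interval(log)))  # 2
```

The two rules disagree on the third review: it is the first review of a new day, but less than a day has passed since the previous one.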

Expertium (Contributor) commented Jan 28, 2024

open-spaced-repetition/fsrs4anki#437

Keeping delta_t as floats:

  1. Wouldn't improve scheduling in practice since Anki doesn't schedule cards (in the "review" phase) at a specific hour/minute of the day.
  2. Wouldn't matter for long intervals. Rounding 1.5 to 2 introduces a large rounding error, while rounding 365.5 to 366 introduces a very small one.
  3. Doesn't improve accuracy anyway, according to LMSherlock.
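The second point can be checked with a few lines of arithmetic. This is just a sketch of the relative error; the helper name is made up, and rounding half up is an assumption about the rounding rule.

```python
import math


def relative_rounding_error(delta_t: float) -> float:
    """Relative error from rounding delta_t (in days) half up to a whole day."""
    rounded = math.floor(delta_t + 0.5)
    return abs(rounded - delta_t) / delta_t


for dt in (1.5, 365.5):
    print(f"{dt} days: {relative_rounding_error(dt):.2%}")
# 1.5 days: 33.33%
# 365.5 days: 0.14%
```

The absolute error is half a day in both cases, but relative to the interval it shrinks from about a third to about a tenth of a percent.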

l3kn (Author) commented Jan 28, 2024

Ah, thanks for the link! I didn't see that there was already a discussion on the topic.

I understand that the potential gains would be very small and there's a chance I'm overthinking this.
The main problem I want to solve is migrating a large collection of flashcards I created
in a spaced repetition system that's not Anki and where repetitions are scheduled using float intervals.
There, I wonder whether slightly better parameters computed earlier in the history of a card
would yield more accurate results over the lifetime of that card.

L-M-Sherlock (Member)
