[Question] A “raw” version of the tiny_dataset.zip #43

l3kn · 2024-01-28T17:05:48Z

In #28 a link to a pre-processed small dataset was shared.

While testing different ways of converting review logs of different spacing algorithms
to FSRS, my evaluation on ~7000 reviews generated using an EmacsLisp implementation of py-fsrs
suggests that updating the difficulty and stability for reviews with an interval greater than 1 day
is slightly better than using the (re)learning/review states of the py-fsrs implementation.

To make sure I didn't make any mistake in my evaluation code and test on larger datasets,
I'd like to retry this experiment using the code and datasets of this benchmark
but I can't do so with the “tiny_dataset.zip” because the delta_t have been rounded to days.

Would it be possible to get access to a similar dataset either in an unprocessed format
or with floating-point delta_t values?

This seems to be related to a difference in how the benchmark and the optimizer
implement the FSRS algorithm (using the first review of each day, as I understand it) and how it's
implemented in e.g. py-fsrs (using states to decide when to update the parameters).
I'm not sure how to compare the two approaches other than using review logs from FSRS
and testing if the recall prediction would have been more accurate if we had included
reviews that occurred in the (re)learning state but after a sufficiently large interval or on a different day.

The text was updated successfully, but these errors were encountered:

Expertium · 2024-01-28T18:46:47Z

open-spaced-repetition/fsrs4anki#437

Keeping delta_t as floats:

Wouldn't improve scheduling in practice since Anki doesn't schedule cards (in the "review" phase) at a specific hour/minute of the day.
Wouldn't matter for long intervals. Rounding 1.5 to 2 introduces a large rounding error, rounding 365.5 to 366 introduces a very small rounding error.
Doesn't improve accuracy anyway, according to LMSherlock.

l3kn · 2024-01-28T21:54:30Z

Ah, thank for the link! I didn't see that there was already a discussion on the topic.

I understand that the potential gains would be very small and there's a chance I'm overthinking this.
The main problem I want to solve is migrating a large collection of flashcards I created
in a spaced repetition system that's not Anki and where repetitions are scheduled using float intervals.
There I wonder if slightly better parameters computed earlier during the history of a card
would yield more accurate results during the lifetime of a card.

L-M-Sherlock · 2024-01-29T08:40:07Z

https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset/tree/main/float-delta-t

This dataset contain float delta_t.

L-M-Sherlock closed this as completed Feb 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] A “raw” version of the tiny_dataset.zip #43

[Question] A “raw” version of the tiny_dataset.zip #43

l3kn commented Jan 28, 2024

Expertium commented Jan 28, 2024 •

edited

Loading

l3kn commented Jan 28, 2024

L-M-Sherlock commented Jan 29, 2024

[Question] A “raw” version of the tiny_dataset.zip #43

[Question] A “raw” version of the tiny_dataset.zip #43

Comments

l3kn commented Jan 28, 2024

Expertium commented Jan 28, 2024 • edited Loading

l3kn commented Jan 28, 2024

L-M-Sherlock commented Jan 29, 2024

Expertium commented Jan 28, 2024 •

edited

Loading