Google Summer of Code 2020

gillux edited this page Feb 3, 2020 · 8 revisions

This page lists project ideas as well as general information for students who would like to take part in Google Summer of Code 2020 and be mentored by Tatoeba.

Mentors

Mentor    Tatoeba profile    GitHub profile
Andreas   rumpelstilzchen    AndiPersti
gillux    gillux             jiru
Trang     Trang              trang
Yorwba    Yorwba             Yorwba

Every project will be mentored by all the mentors. For each idea, we name as "main mentor" the mentor who is likely to be most involved in the project, but you should expect all mentors to be involved.

Ideas

Remember that the ideas listed on this page are only ideas. They are here to give you inspiration on what projects you could do with us but you are in no way limited to these ideas.

Translations wanted

Main mentor: gillux

Skills desired: PHP, SQL, Javascript, CSS

When members contribute a sentence, they sometimes wish somebody could translate it into a language they are learning. At the moment, all they can do is wait for somebody to translate it, which may take a long time or never happen.

Similarly, when members want to contribute translations, they don’t know whether that is actually going to help anybody. All they can do is hope that what they do helps. If we had a way to say "I would like somebody to translate my sentences from language X into language Y", or "Can anybody provide a translation into X for this specific sentence?", it would connect learners and translators in a more helpful way and build more bonds among members.

Sentences wanted

Main mentor: Trang

Skills desired: PHP, SQL, Javascript, CSS

Imagine that you are learning a language, and you are reading some article in this foreign language. You come across a new word, and would like to have more example sentences that illustrate the usage of this word. You could go to Tatoeba and search for this word. But what if you don't find any sentence?

To address this, we made it possible for users to create vocabulary lists. When they add a vocabulary item for which no sentence exists, this item is listed on a page for "Sentences wanted" (login required). From this page, contributors can browse vocabulary items with fewer than 10 sentences, and create sentences for these vocabulary items.

This feature still needs a lot of improvement. For instance:

  • There is no way to filter out or remove "spam" vocabulary items.
  • There is no system to give priority to vocabulary items that are in higher demand.
  • Sentences are linked to a vocabulary item only when they contain an exact match of it.
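
As a rough illustration of the second point, here is a minimal Python sketch of demand-based ordering. It is only one possible approach, not how Tatoeba currently works: the `VocabularyItem` class and its `requester_count` field are hypothetical names invented for this example, and the threshold of 10 is taken from the description above.

```python
from dataclasses import dataclass

# Items with fewer sentences than this are listed as "wanted"
# (value taken from the description of the current page).
SENTENCE_THRESHOLD = 10


@dataclass
class VocabularyItem:
    text: str
    lang: str
    sentence_count: int   # sentences that currently contain the item
    requester_count: int  # hypothetical: users who added the item to their list


def wanted_items(items):
    """Return the items that still need sentences, most requested first."""
    wanted = [i for i in items if i.sentence_count < SENTENCE_THRESHOLD]
    # Highest demand first; ties are broken by how few sentences exist.
    return sorted(wanted, key=lambda i: (-i.requester_count, i.sentence_count))
```

With such an ordering, an item requested by many members and having no sentence at all would appear at the top of the page, while items that already have 10 or more sentences would not be listed.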

Management of permissions

Main mentor: Trang

Skills desired: PHP, SQL, Javascript, CSS

The permissions of a user are based mostly on the user's status: depending on whether you are a contributor, advanced contributor, corpus maintainer or admin, you will have access to more or fewer features. For instance, advanced contributors can add tags to a sentence, while regular contributors cannot. Corpus maintainers can delete others’ sentences, while other contributors cannot.

The goal of this project is to design and implement a more refined permission system, with an interface to manage these permissions.

Here are examples of things that we cannot do at the moment, and that could be part of the project:

  • Prevent a user from adding new sentences, while still allowing them to translate sentences.
  • Restrict the languages in which a user can contribute.
  • Prevent a user from posting on the Wall, while still allowing them to comment on sentences.
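
To make the three examples above more concrete, here is a minimal Python sketch of one possible model: status-based defaults combined with per-user restrictions. Everything in it is an assumption made for illustration (the `Action` names, the `STATUS_DEFAULTS` table, the `User` fields and the `can` function); it is not the existing Tatoeba code, and the defaults only mirror the two rules mentioned above (tagging and deleting others' sentences).

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional, Set


class Action(Enum):
    ADD_SENTENCE = auto()
    TRANSLATE_SENTENCE = auto()
    COMMENT_ON_SENTENCE = auto()
    POST_ON_WALL = auto()
    TAG_SENTENCE = auto()
    DELETE_OTHERS_SENTENCE = auto()


# Assumed baseline: what every contributor can do by default.
BASE = {
    Action.ADD_SENTENCE,
    Action.TRANSLATE_SENTENCE,
    Action.COMMENT_ON_SENTENCE,
    Action.POST_ON_WALL,
}

# Status-based defaults (hypothetical table; admin is assumed to have everything).
STATUS_DEFAULTS = {
    "contributor": BASE,
    "advanced_contributor": BASE | {Action.TAG_SENTENCE},
    "corpus_maintainer": BASE | {Action.TAG_SENTENCE, Action.DELETE_OTHERS_SENTENCE},
    "admin": set(Action),
}


@dataclass
class User:
    name: str
    status: str
    denied: Set[Action] = field(default_factory=set)  # per-user restrictions
    allowed_langs: Optional[Set[str]] = None          # None means "any language"


def can(user: User, action: Action, lang: Optional[str] = None) -> bool:
    """Check the status defaults first, then the per-user restrictions."""
    if action not in STATUS_DEFAULTS[user.status]:
        return False
    if action in user.denied:
        return False
    if lang is not None and user.allowed_langs is not None:
        return lang in user.allowed_langs
    return True
```

For example, `User("alice", "contributor", denied={Action.ADD_SENTENCE}, allowed_langs={"fra"})` could still translate into French but could neither add new sentences nor contribute in German. The management interface would then amount to editing `denied` and `allowed_langs` for a given user.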

Audio

Main mentor: gillux

Skills desired: PHP, SQL, Javascript, CSS

Tatoeba provides audio for some sentences. The audio is recorded by volunteers, but because audio was initially not at the core of the project, the process of contributing audio is a bit complicated.

Audio was still a great addition, and Tatoeba has received more and more audio contributions over the years. However, the audio content lacks many features.

For instance:

  • It is not possible to attach several audio recordings to a single sentence (to illustrate different accents of the same language for instance).
  • Contributors cannot record audio directly through the web page (see this proof of concept).

The goal of this project would be to implement the features needed for better management of the audio content in Tatoeba.

Better export

Main mentor: gillux

Skills desired: PHP, SQL, Javascript, CSS

Tatoeba shares its data as CSV files that can be downloaded from the Downloads page of the website. The files are regenerated weekly. Third parties can reuse this data in their projects, but doing so is not easy because this approach has many limitations:

  • Third parties must download the whole corpus. There is no way to download a part of it, for instance only sentences in a given set of languages.
  • We don’t provide diffs between versions. Even if only a relatively small part of the corpus has changed, third parties must download the whole corpus for each new version.
  • The format of the data is documented, yet subject to change at any time. There is no way to notify third parties about this.
  • Third parties must wait a week to get new data.
  • Third parties must do some preliminary work to restructure the data the way they need it.
  • Probably other things.
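
To give an idea of the preliminary work this currently pushes onto third parties, here is a minimal Python sketch that extracts a subset of languages from a sentences file and computes the difference between two versions of it. It assumes a tab-separated file with three columns (sentence id, language code, text); check the documentation on the Downloads page for the actual format. The function names and the sample data are made up for this example.

```python
import csv
import io


def read_sentences(stream):
    """Parse a tab-separated export into {id: (lang, text)}."""
    # QUOTE_NONE keeps quotation marks that are part of the sentence text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader}


def filter_by_language(sentences, langs):
    """Keep only the sentences written in one of the given languages."""
    return {sid: s for sid, s in sentences.items() if s[0] in langs}


def diff(old, new):
    """Return (added, removed, changed) between two versions of the export."""
    added = {sid: new[sid] for sid in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {sid: new[sid]
               for sid in new.keys() & old.keys()
               if new[sid] != old[sid]}
    return added, removed, changed


# In-memory stand-ins for two weekly exports (made-up data).
last_week = read_sentences(io.StringIO("1\teng\tHello.\n2\tfra\tBonjour.\n"))
this_week = read_sentences(io.StringIO("1\teng\tHello!\n3\tdeu\tHallo.\n"))

english_only = filter_by_language(this_week, {"eng"})
added, removed, changed = diff(last_week, this_week)
```

Note that both versions have to be downloaded and loaded in full before anything can be compared. Per-language files or ready-made diffs published alongside the exports would remove the need for this kind of code.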

We would love to see more projects reusing our data, but all of this is a barrier to entry for many of them. So what can we do to make our export files easier to use?

App using Tatoeba's data

Main mentor: gillux

Skills desired: any

As mentioned in the "Better export" idea above, Tatoeba shares its data and we are always happy to see projects reusing our data. Do you have a nice idea of an app that you could build from it? This can be a GSoC project as well.

Just one thing: make sure you check this list of projects that use our corpus. Maybe someone else already had the same idea before you. So try to find the gaps. Make something innovative!

Note that this project idea is closely tied to the "Better export" idea, except that it tackles the problem from a more concrete angle. Since you will be reusing our data, you will experience real situations that show how we can improve the way we share our data. You will be in a better position to find out, or help us find out, what we could do to make it easier for you (and other people like you) to get started with a project.

Quality

Main mentor: Trang

Skills desired: PHP, SQL, Javascript, CSS

As a collaborative project that is open for anyone to join, one of the challenges that Tatoeba faces constantly is to provide data of good quality. Not all Tatoeba contributors are highly skilled in the language(s) they contribute in, and therefore contributions are not always good: they may contain spelling mistakes or grammatical mistakes, they may not sound natural, the translations may be inaccurate or just plain wrong.

Although Tatoeba has some mechanisms to manage quality, these mechanisms are not optimal. Users still need to make extra efforts to figure out when they can really rely on a sentence or translation.

What can we improve in our current system to provide sentences and translations of higher quality? How can we assess the quality of a sentence or of a translation, so that language learners or third-party tools can easily filter out sentences of bad or uncertain quality?

Management of languages

Main mentor: Trang

Skills desired: PHP, SQL, Javascript, CSS

Currently, Tatoeba supports around 350 languages. Our goal, however, is to support every language, and to reach it we still have thousands of languages to add.

We have a process for adding new languages, but this process could be improved considerably and fully integrated into the website.

  • Tatoeba is a linguistic project and yet, there is not really a space dedicated to languages on the website itself.
  • If a user notices that the language they speak or learn is not yet supported, they should be able to find out easily how to request it. Currently, some users may not even know that it is possible to request new languages.
  • Things could be automated so that the addition or modification of a language does not require the intervention of a developer anymore.

UX improvements

Main mentor: gillux

Skills desired: UX, UI, design

Tatoeba has room for improvement in terms of user experience. We have started to conduct a few UX tests that revealed major flaws in our website. The goal of this project is to identify what kind of people use Tatoeba, how they use it, what problems they face, and to make Tatoeba easier, more useful and more attractive to them. Note that programming is not a requirement for this project, but it would help if you are familiar with CSS and HTML.

Contact

If you have questions or need support, you can reach out to us through the channels listed below.

Email

Our email is [email protected]. This is a good starting point to get in touch with us if you are completely new to Tatoeba.

Public forums

  • The Wall. This is the main place where our members discuss things, ask questions, and exchange ideas.
  • The Google group. An alternative to the Wall, but it's not used much. You can still post there if it feels more convenient; we are always reading.

Public chat

Just like the Google group, our chatrooms are not very active, but you are still free to drop by.