- Martin Užák
- 12 April 2020
I find the meaning of words and their origins fascinating. etymolog is a tool to store, retrieve and work with etymological facts. For example, you can look up the Sanskrit word tr and its derivations (./details.py sa tr):
=> sa:tr = en:stars, to cross over
+ sa:rAtrI {rA+tr} = en:night, sa:naktA [that which gives (rA) the stars (tr)]
-> pt:trazer
-> sa:str = en:star, strewn, scattered, spread
-> sa:tara = en:crossing, sa:Kali
-> en:Tartary
-> en:astral
-> en:star
-> en:transit
Or you can produce a dictionary for a language that shows the derivations of its words (./dict.py sk):
* báť sa
-> sa:bhaya
* beda
-> sa:bheda
* biely
-> slavic:bel -> sa:bhalu
* brat
-> sa:bhrAtr
...
Etymological data is stored in .pill files. They are processed line by line. Each line contains one statement, which can consist of several expressions. A statement must not span multiple lines.
Whitespace at the beginning and end of a line is stripped. Whitespace between words is normalized to a single space.
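The whitespace rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the code from lexer.py; the function name normalize_line is made up for this example.

```python
def normalize_line(line):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    # str.split() without arguments splits on any run of whitespace and
    # drops empty strings, so joining with " " performs both steps at once.
    return " ".join(line.split())


print(normalize_line("   en:pyre   ->\t en:fire  "))
```

Running it prints the statement with single spaces between its parts.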
A word is the basic unit within a pill. It consists of any number of given alphabetical characters. Example:
ship
Each word is assigned to a language, written as an abbreviation of the language name. The default language is English (en); this can be changed in the config. The default is used when no explicit language is set for a word. So the previous example is the same as:
en:ship
You can have several words following each other and they will be fused into one:
big ship (out of wood)
creates one word big ship (out of wood) in the default language. Only the first word can have a language explicitly set:
en:big ship
Setting two or more different languages within one composed word is invalid, though:
en:big de:Schiff // invalid
Words are stored case-sensitively, exactly as you define them, but they are always looked up case-insensitively. So looking up Ship and ship will yield the same result.
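As a rough sketch of the default-language and case-insensitive lookup rules above, a word store could look like this. The names parse_word and WordIndex are hypothetical and do not come from model.py; the real object model may work differently.

```python
DEFAULT_LANG = "en"  # assumption: mirrors the documented default language


def parse_word(text, default_lang=DEFAULT_LANG):
    """Split 'lang:word' into (lang, word); fall back to the default language."""
    lang, sep, word = text.partition(":")
    if not sep:
        # No explicit language prefix, so the whole text is the word.
        return default_lang, text
    return lang, word


class WordIndex:
    """Hypothetical store: keeps the original spelling, looks up case-insensitively."""

    def __init__(self):
        self._words = {}

    def add(self, text):
        lang, word = parse_word(text)
        # The key is case-folded, the stored value keeps the original case.
        self._words[(lang, word.casefold())] = word

    def lookup(self, text):
        lang, word = parse_word(text)
        return self._words.get((lang, word.casefold()))


index = WordIndex()
index.add("Ship")
print(index.lookup("en:ship"), index.lookup("SHIP"))
```

Both lookups return the word with the spelling it was stored under.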
You can have several logically different words in the same statement. They form a group whose members are separated by a comma (,):
boat, ship, steam trawler
This creates three words. Groups are most useful in relationships.
The value of etymolog lies in the ability to define relationships among words. The unidirectional Derive relationship is especially useful, as it helps to define the origin of words.
Derivation (->) indicates that one word has given birth to another one:
en:pyre -> en:fire
You can have several relationships on one line:
sa:Pas -> sa:Pasa -> lat:pax -> pacify
Here we say that the word pacify is derived from the Latin pax, which comes from the Sanskrit word pasa, which in turn stems from pas.
Relationships are processed from left to right. When one relationship (sa:Pas -> sa:Pasa) is followed by another, the right-most part of the one on the left (sa:Pasa) becomes the left side of the new one (sa:Pasa -> lat:pax).
You can have any number of relationships on one line. You can also combine any relationships you like.
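The left-to-right chaining can be illustrated with a small sketch that splits a statement on the three relationship signs (->, = and ~) and pairs each word with its right-hand neighbour. The function name split_relationships is invented for this example, and the sketch deliberately ignores groups, comments and unions.

```python
import re

# The three relationship signs described in this document.
_OPERATORS = re.compile(r"\s*(->|=|~)\s*")


def split_relationships(statement):
    """Return (left, operator, right) triples for a chained statement."""
    # With a capturing group, re.split keeps the operators in the result:
    # [word, op, word, op, word, ...]
    parts = _OPERATORS.split(statement.strip())
    words, operators = parts[0::2], parts[1::2]
    # The right side of each relationship is the left side of the next one.
    return list(zip(words, operators, words[1:]))


for triple in split_relationships("sa:Pas -> sa:Pasa -> lat:pax -> pacify"):
    print(triple)
```

A statement with a single word and no relationship sign yields an empty list.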
Equality defines what would be the translation of a word from one language to another:
en:ship = de:Schiff
This is where groups make sense. If a group is used on the right side of a relationship, the relationship will be created for all of its members:
sa:Pas -> sa:Pasa = rope, cord, tie net, chain, trap, noose, snare
This first defines the derivation of sa:Pas into sa:Pasa. Then it creates seven equals relationships for sa:Pasa with translations into the default language.
If a group is used on the left side of a relationship, only the right-most part of it will be used. E.g.:
sa:dvipa -> hindi:Doab, ir:Dobar, cornish:Dofer, celtic:Dubron -> Dubrovnik
Will translate into:
sa:dvipa -> hindi:Doab
sa:dvipa -> ir:Dobar
sa:dvipa -> cornish:Dofer
sa:dvipa -> celtic:Dubron
celtic:Dubron -> en:Dubrovnik
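The group rules for derivations can be sketched as follows. This is a simplified illustration that handles only -> chains; expand_derivations and qualify are hypothetical names, and treating any word without a colon as belonging to the default language is an assumption of this example.

```python
def qualify(word, default_lang="en"):
    """Prefix the default language when the word has no explicit one."""
    return word if ":" in word else f"{default_lang}:{word}"


def expand_derivations(statement, default_lang="en"):
    """Expand a '->' chain with groups into individual (source, target) pairs."""
    segments = [
        [qualify(word.strip(), default_lang) for word in segment.split(",")]
        for segment in statement.split("->")
    ]
    pairs = []
    for left, right in zip(segments, segments[1:]):
        # Left side: only the right-most member of the group is used.
        source = left[-1]
        # Right side: one relationship per group member.
        pairs.extend((source, target) for target in right)
    return pairs


statement = "sa:dvipa -> hindi:Doab, ir:Dobar, cornish:Dofer, celtic:Dubron -> Dubrovnik"
for source, target in expand_derivations(statement):
    print(source, "->", target)
```

For the statement above this yields the same five derivations as listed.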
If there is a link between words, yet it is neither a direct translation (Equals relationship) nor a derivation (Derive), use the generic relation ~:
en:pyre ~ sk:pýriť sa, de:Feuer
This expresses that the two words on the right are somehow related to pyre, without saying why or how.
If you have an idea about the "how" or "why" of a relationship, or about the meaning of a word, you can use a comment. Put the comment immediately after a word or after the sign of a relationship:
boat [means of transport on water]
en:pyre ~[person turning red in the face] sk:pýriť sa
If you want to have multiple comments, use multiple statements or put the comments one after another:
boat [similar to a ship]
boat [means of transport on water]
a [b] [c] // is valid as well
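A minimal sketch of pulling the [ ... ] comments out of a piece of text is shown below. The helper extract_comments is invented for this example and only collects the comments; it does not record which word or relationship sign each comment was attached to, which the real parser presumably does.

```python
import re

# A comment is everything between one pair of square brackets.
_COMMENT = re.compile(r"\[([^\]]*)\]")


def extract_comments(text):
    """Return (text without comments, list of comment strings)."""
    comments = [comment.strip() for comment in _COMMENT.findall(text)]
    # Replace each comment with a space, then normalize the whitespace.
    remainder = " ".join(_COMMENT.sub(" ", text).split())
    return remainder, comments


print(extract_comments("boat [means of transport on water]"))
print(extract_comments("a [b] [c]"))
```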
To express that one word is a composite of other words, use a union ({ first_word + ... + last_word }):
sa:vi+sa {vi+sa}
This will result in creating a word sa:visa and creating a corresponding union object for it.
You can have any number of components, e.g.
sa:svetAsvatara {sveta+asva+tara}
The components are by default in the language of the union. But you can also explicitly set the language for any of them:
slavic:Dažbog {sa:dadati+bog}
And finally you can combine comments with union statements:
slavic:Dažbog {sa:dadati+bog} [solar deity]
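The way components inherit the language of the union can be sketched like this. parse_union is a hypothetical helper, not the parser's actual API, and it looks only at the word and the { ... } part, ignoring any trailing comment.

```python
import re

# A word, optional whitespace, then the components inside braces.
_UNION = re.compile(r"^(?P<word>[^{]+?)\s*\{(?P<parts>[^}]*)\}")


def parse_union(statement, default_lang="en"):
    """Return ((lang, word), [(lang, component), ...]) or None if no union."""
    match = _UNION.match(statement.strip())
    if match is None:
        return None
    lang, sep, word = match.group("word").partition(":")
    if not sep:
        lang, word = default_lang, match.group("word")
    components = []
    for part in match.group("parts").split("+"):
        c_lang, c_sep, c_word = part.strip().partition(":")
        # Components without an explicit language take the union's language.
        components.append((c_lang, c_word) if c_sep else (lang, c_lang))
    return (lang, word), components


print(parse_union("slavic:Dažbog {sa:dadati+bog} [solar deity]"))
```

Here dadati keeps its explicit sa language, while bog falls back to slavic.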
Comments that are for your eyes only and will be ignored by the parser start with // and end at the end of the line:
sa:Rudra -> cs:rudý // red. Rudra is the angry Shiva
Words in parentheses will be ignored, so they can be used as comments:
pie:k = curvilinear (motion)
Will be the same as:
pie:k = curvilinear
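Both kinds of ignored text can be removed with a short sketch like the one below. The function name strip_ignored is made up for this example. It assumes that // never appears inside a statement and that parentheses are not nested.

```python
import re


def strip_ignored(line):
    """Drop a trailing // comment and any parenthesized text from a line."""
    # Everything from the first // to the end of the line is ignored.
    statement = line.split("//", 1)[0]
    # Parenthesized words are ignored as well (no nesting assumed).
    statement = re.sub(r"\([^)]*\)", " ", statement)
    # Clean up the whitespace left behind by the removals.
    return " ".join(statement.split())


print(strip_ignored("pie:k = curvilinear (motion)"))
print(strip_ignored("sa:Rudra -> cs:rudý // a private note"))
```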
Finally, there is a special category of comments that are directives for the parser. They have the syntax // DIRECTIVE and there must not be any statement before them, i.e. they occupy the whole line. E.g.
// SRC http://some.link
Will set the parser's SRC directive to http://some.link. Here is a list of supported directives:
- SRC sets the source attribute for all following Relationships and Words.
- NOSRC unsets the source attribute for all following entities. This can also be achieved by an empty SRC directive.
- LANG sets the default language.
- IGNORE: if a line starts with this, processing of the file will be skipped from this line on.
These directives are valid from the point where they are encountered until the end of the file.
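One way to picture the directives is as updates to a small piece of parser state that stays in effect for the rest of the file. The sketch below is an assumption about how this could be modelled, not a description of parser.py: ParserState and apply_directive are hypothetical names, the directive names are matched case-sensitively, and IGNORE is assumed to be written as // IGNORE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParserState:
    """Hypothetical state that directives modify for all following lines."""
    source: Optional[str] = None
    lang: str = "en"
    ignoring: bool = False


def apply_directive(line, state):
    """Update state if the line is a directive; return True if it was one."""
    stripped = line.strip()
    if not stripped.startswith("//"):
        return False
    name, _, argument = stripped[2:].strip().partition(" ")
    argument = argument.strip()
    if name == "SRC":
        # An empty SRC directive behaves like NOSRC.
        state.source = argument or None
    elif name == "NOSRC":
        state.source = None
    elif name == "LANG":
        if argument:
            state.lang = argument
    elif name == "IGNORE":
        state.ignoring = True
    else:
        # An ordinary // comment, not a directive.
        return False
    return True


state = ParserState()
apply_directive("// SRC http://some.link", state)
print(state)
```

A loop over the lines of a file would call apply_directive first, stop once state.ignoring is set, and attach state.source to whatever it parses from the remaining lines.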
Finally, you can tag a word. A line can contain any number of tags, all of which apply to the word preceding them:
Thames #river #UK
sa:danu -> danube #river
All tags are case-insensitive and stored in lower case.
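Tag handling can be sketched in the same spirit. extract_tags is an invented helper; it assumes a tag is a # followed by letters, digits or underscores, which this document does not spell out.

```python
import re

# Assumption: a tag is '#' followed by word characters.
_TAG = re.compile(r"#(\w+)")


def extract_tags(text):
    """Return (text without tags, list of lower-cased tags)."""
    # Tags are case-insensitive and stored in lower case.
    tags = [tag.lower() for tag in _TAG.findall(text)]
    remainder = " ".join(_TAG.sub(" ", text).split())
    return remainder, tags


print(extract_tags("Thames #river #UK"))
print(extract_tags("sa:danu -> danube #river"))
```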
Above was the description of the grammar. It is implemented in lexer.py and parser.py, both of which can be used with a file as argument for testing.
Use parser.py::load_db() to load the DB. The parser will instantiate the objects and create relationships in memory according to model.py. Once this is done, you can work on the object model, either with your own tools or using the attached ones:
- dump.py produces some stats about your DB.
- details.py shows the details of a word in a language.
- dict.py, as seen above, lists all the words for a given language along with their derivations.