This repository has been archived by the owner on Feb 14, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
tag-clusterer. It clusters tags. Generates .tsx.
License
jimregan/tag-clusterer
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
tag-clusterer. It clusters tags. Actually, it doesn't. It's just a pair of filters to delexicalise tagged text, which is then clustered by a word clustering tool (currently, mkcls only), and then generates a tagset specification (.tsx) file for use with apertium-tagger. The input must be tagged -- not analysed -- text in the format used for supervised training of the tagger. Feed its output to mkcls (play around with the number of classes it generates), then feed that into mkcls-to-tsx.pl semi-lexicalise.pl can take a set of tags to treat as stopwords, in the apertium-transfer-tools .atx format, and semi-lexicalise the input. In this case, the generated .tsx file will have closed classes, and may be usable without extra intervention. .atx looks like this: <?xml version="1.0" encoding="iso-8859-1"?> <transfer-at source="Portuguese" target="Spanish"> <source> <lexicalized-words> <lexicalized-word tags="cnjsub"/> <lexicalized-word tags="det.*"/> <lexicalized-word tags="pr"/> <lexicalized-word tags="prn.tn.*"/> <lexicalized-word tags="prn.enc.*"/> <lexicalized-word tags="prn.pro.*"/> <lexicalized-word tags="rel.*"/> <lexicalized-word tags="vbser.*"/> <lexicalized-word tags="vbhaver.*"/> <lexicalized-word tags="vbmod.*"/> <lexicalized-word tags="vblex.*" lemma="há"/> </lexicalized-words> </source> <target> <lexicalized-words> <lexicalized-word tags="cnjsub"/> <lexicalized-word tags="det.*"/> <lexicalized-word tags="pr"/> <lexicalized-word tags="prn.tn.*"/> <lexicalized-word tags="prn.enc.*"/> <lexicalized-word tags="prn.pro.*"/> <lexicalized-word tags="rel.*"/> <lexicalized-word tags="vbser.*"/> <lexicalized-word tags="vbhaver.*"/> <lexicalized-word tags="vbmod.*"/> <lexicalized-word tags="vblex.*" lemma="hacer"/> </lexicalized-words> </target> </transfer-at> semi-lexicalise.pl only uses the <source> part, so you only need that.
About
tag-clusterer. It clusters tags. Generates .tsx.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published