- Only add relevant NLP software
- LaMachine is only intended for linux and, with some limitations, unix-like systems (BSD, macOS). Other systems such as Windows are only supported as a host system for a VM.
- All software must be in public version control (we recommend github), be public, and be fully open source with an explicitly stated licence.
- LaMachine distinguishes between latest development versions and stable
releases. If your software is mature enough to be released, please do so
using the releases mechanism on github. Version numbers should use semantic versioning and start with a v (like
v0.1.2; the
major.minor.revision
format is recommended) - The latest state of the
master
branch of your repository will be considered the development version. - If there is a suitable software repository for your software (such as the Python Package Index for Python, CPAN for Perl, CRAN for R, Maven for Java); use it to publish stable releases. LaMachine can in turn obtain it from these repositories.
- Your software should have some form of documentation, at least a decent README
- Any Python software should support Python 3.4 or above. Python 2.7 is not supported.
- The software should be maintained and should work on modern linux distributions. More precisely, it should always work on the operating systems that we have designated with gold support status, and preferably also on those designated with silver support status. Support for the bronze category is much appreciated but not required. Note that these categories get updates as time progresses, see the general README.
- LaMachine is not intended for legacy or archiving purposes. (that is not to say that the resulting images themselves can not be used for archiving).
Why should you want to participate in LaMachine to distribute your software?
- LaMachine comes in many flavours and is not just a containerisation solution (e.g. Docker). We use Ansible to offer a good and clear level of abstraction, allowing it to work with multiple technologies (Vagrant, Docker, local installation, global installation). It also removes the need to mess with Dockerfile or Vagrantfile yourself, as this is something LaMachine handles for you.
- LaMachine already offers a lot of NLP software; this may include software your tool depends on or software people like to use independently in combination with your tool.
- LaMachine supports multiple modern Linux distributions, offering flexibility rather than forcing a single one.
- LaMachine aims to provide a Virtual Research Environment
- From a user perspective, LaMachine is easy to install (one command bootstrap).
- LaMachine comes with an extensive test framework.
- LaMachine is extensively documented.
- LaMachine does not replace or reinvent existing technologies, but builds on them: Linux Distributions, Ansible, Vagrant, Docker, pip, virtualenv
- This means LaMachine remains an optional solution to make things easier, but build on established installation methods that remain usable outside LaMachine.
- LaMachine handles software metadata
Contributors are expected to be familiar with git and github:
- Fork the LaMachine github repository
- Add a role for your software in
roles/
(read the rest of this documentation to learn how) - When all done, create a pull request on Github
LaMachine consists primarily of installation and configuration recipes for Ansible. These recipes are called playbooks by Ansible, or more specifically they are called roles when used in a modular fashion as we do. Roles are defined in a simple YAML syntax in combination with a powerful templating language.
A role or playbook defines tasks, each task is an installation or configuration
step. A task has a name and references a certain Ansible module based on the
type of work to do, there is for example a copy
module to copy files onto
the destination system, a file
module to create files/directories and set
proper owners ship, an include_role
module that calls another role with
specific parameters (this is heavily used in LaMachine), and hundreds more.
A task may also define a condition
for its execution through a when
statement.
LaMachine provides a full framework with various predefined roles and preset variables. The framework allows for
installation in various forms (docker, VM, local, global). A single bootstrap.sh
script is used build any desired
form; it does the necessary preprocessing and finally always invokes Ansible. We will look into that first.
The LaMachine bootstrap consists of a single bash script that is capable of downloading almost all necessary dependencies and kicking of the installation process in whatever form the user choses. The script can be downloaded and piped directly to the shell, allowing a single command to be sufficient to get everything downloaded and started.
By default, the script interactively queries the user for his installation preferences and creates at least the following files:
- Installation Manifest - (
install.yml
) - This is the main ansible playbook that contains the packages to install or roles to perform. - Host Configuration File - (
host_vars/$HOSTNAME.yml
) - This contains the configuration variables for your LaMachine installation. - If a Virtual Machine is chosen, a
Vagrantfile
will be generated. - If a Docker container is to be constructed, then the provided
Dockerfile
will be used. - A
hosts.ini
file may be generated, this is the Ansible inventory.
The bootstrap initially creates a directory in lamachine-controller/
(which in turn is created in the directory
where you run the bootstrap). This controller directory will hold a git clone of the LaMachine
repository with all the aforemention generated files. The installation manifest and host configuration file will initially be staged
in the current working directory and presented in a text editor a text editor so end-users can adapt these files to
their liking (both are heavily commented with instructions). Afterwards they are copied into the LaMachine controller.
In most cases, the LaMachine controller performs a temporary role during the bootstrap process, the controller will be
copied into the LaMachine environment, allowing the environment to be updatable; we call this the internal controller.
In some advanced situations however, you may prefer an external controller. This means the lamachine-controller
directory is kept and used to updated a LaMachine installation from the outside. This is used for managing remote
systems directly using Ansible, but may also be useful in development settings.
Note that the bootstrap procedure can also be run in a non-interactive fashion and parameters can be passed on the command line,
see bootstrap.sh --help
for details.
First we take a look at the variables available to LaMachine and defined in the host configuration file, afterwards we look at the installation manifest and learn how to add software to LaMachine.
The variables are generally set by the end-user in the LaMachine host configuration file when building or updating LaMachine. These determine the type of environment to build, as LaMachine offers quite some flexibility through its different flavours, versions, and is meant to work on multiple Linux distributions.
We distinguish the following variables, all of which you can read and use in your Ansible roles for LaMachine:
- Generic:
version
-- The version of LaMachine, is eitherstable
,development
, orcustom
.locality
- Set to eitherlocal
orglobal
and determines whether LaMachine is installed in a local user environment (seelocalenv_type
) or globally on the system. For Virtual Machines and Docker Containers, this value will be set to global.localenv_type
- The type of local environment (only makes sense iflocality == "local"
), can only be set tovirtualenv
for now, meaning we use a Python VirtualEnv.
- Permissions:
root
- A boolean indicating whether LaMachine has root permission on the system (necessarily true iflocality == "global"
)unix_user
- The unix user that owns and runs LaMachine.unix_group
- The unix group that owns and runs LaMachine and has write access.controller
- Set to eitherinternal
orexternal
. Indicates whether the LaMachine installation manages itself (i.e. upgrades are done from within the LaMachine environment), or whether it is externally managed. Externally managed installation are useful in development environments or for provisioning of remote machines.
- Paths:
lm_prefix
- The path where LaMachine installs its software. This equals to either:local_prefix
- The local installation path (i.e., the path of the virtual enviroment)global_prefix
- The global installation path (by default/usr/local
)
homedir
- The path to the home directory for the user that owns and runs LaMachinesource_path
- The path to where sources for packages will be downloadedlm_path
- The path to the LaMachine source repository (where are the ansible roles, templates and configurations are). This equals to either:lamachine_path
- The path upon first installation or remote location where the installation is managed (if controller ==external
)source_path
/LaMachine - The path inside LaMachine (if controller == 'internal')
data_path
- The path where the end-user can store data (this is typically shared with the host system, if applicable)
- Network:
hostname
- The hostname of the systemwebserver
- (boolean) Include a webserver or notwebservertype
- The type of webserver, defaults tonginx
. LaMAchine does not install any other webserver, so if you change this nginx won't be installed but you have to set up any alternative webserver (e.g. Apache) yourself.
http_port
- port the webserver will listen onweb_user
- The unix user that runs the webserver and webservicesweb_group
- The unix group that runs the webserver and webservices;unix_user
should always be a member of this groupservices
- This is a list of services to provide, by default it is set to[ all ]
, meaning all services provided by the software categories you install will be enabled, you can removeall
and provide only specific services.
- Other:
private
- (boolean) Send basic analytics back to usminimal
- (boolean) A minimal installation is requested (might break some stuff)
It is important to understand the directory layout used by LaMachine and to adhere to it when adding software yourself.
The variable lm_prefix
is most important here, as it holds the base directory under which all software is installed.
For a global installation (in the Docker and Virtual Machine flavours for instance), this by default corresponds to
/usr/local
(don't let the word local confuse you here, we are still talking about a global installation). In a
local installation lm_prefix
corresponds to the directory containing the virtual environment, which can be anywhere the user desires.
We try to follow the Filesystem Hierarchy Standard as much as possible:
{{lm_prefix}}/bin
- Holds executable binaries other executable scripts; this will be added to your environment's$PATH
upon activation{{lm_prefix}}/bin/activate.d
- Extra shell scripts for activating the environment for specific software, will be sourced automatically by the main LaMachine activation script
{{lm_prefix}}/lib
- Holds shared libraries; this will be added to your environments library path.{{lm_prefix}}/lib/python3.*/site-packages
- Contains installed Python modules.
{{lm_prefix}}/include
- Holds development header files{{lm_prefix}}/src
- Holds sources of the software, symlinks to{{source_path}}
{{lm_prefix}}/opt
- Holds optional application software, for which each application is stored in a single directory under this path. This is common for software that is not typically distributed in a more unix-like fashion. (e.g. Alpino, PICCL, Nextflow), but we also use it to symlink to certain software.{{lm_prefix}}/share
- Shared data files{{lm_prefix}}/var
- Holds variable files.{{lm_prefix}}/var/log/nginx
- Holds log files by the webserver.{{lm_prefix}}/var/log/uwsgi
- Holds log files for the individual uwsgi-powered (e.g. CLAM, Django) webservices/webapplications.
{{lm_prefix}}/etc
- Holds configuration files{{lm_prefix}}/etc/nginx/nginx.conf
- Generic webserver configuration (in global installations this will symlink to the system-wide/etc/nginx
){{lm_prefix}}/etc/nginx/conf.d/*.conf
- Configurations per webservice/webapplication.{{lm_prefix}}/etc/uwsgi-emperor/vassals/
- Holds individual uwsgi configuration files for uwsgi-powered webservices/webapplications{{lm_prefix}}/etc/*.clam.yml
- External configuration files for specific CLAM webservices
We supply roles that are built for reuse (they all start with
lamachine-*
) and are meant to to install software from one or more external
repositories:
lamachine-python-install
- Install Python software- Expects a
package
variable that is a dictionary/map with the following fields:github_user
- The user/group that holds the github repository (used by the development version of LaMachine)github_repo
- The name of the github repository (used by the development version of LaMachine)pip
- The name of the package in the Python Package Index (used by the stable version of LaMachine)
- Expects a
lamachine-package-install
- Install Distribution packages- Expects a
package
variable that is a dictionary/map with one or more of the following fields:debian
- The package name for APT on debian/ubuntu/mint systems.redhat
- The package name for YUM on fedora/redhat/rhel/centos systems.arch
- The package name for Arch Linuxhomebrew
- The package name for Homebrew on Mac OS X
- Expects a
lamachine-git-autoconf
- Install C++ software hosted in git and which makes use of the GNU autotools (autoconf/automake), i.e. software that follows the classic./configure && make && make install
paradigm.- Expects a
package
variable that is a dictionary/map with the following fields:user
- The user/group that holds the github repository (used by the development version of LaMachine)repo
- The name of the github repository (used by the development version of LaMachine)- You can also pass any of the variables used by
lamachine-register
, as this will be called automatically.
- Expects a
Some lower-level roles:
lamachine-git
- Clone a particular git repositorylamachine-run
- Run a particular command in the LaMachine environment.lamachine-register
- Registers software metadata manually- Holds a
metadata
dictionary/map with the relevant metadata according to the Codemeta model (Read the metadata section below). - Set
codemeta
to point to a specificcodemeta.json
file to load, any metadata supplied inmetadata
will be appended. - When using
lamachine-python-install
, metadata registration is entirely automatic, as PyPI contains all relevant information already. So you never needlamachine-register
, you can again set ametadata
map and/orcodemeta
field. - When using
lamachine-git-autoconf
,lamachine-register
is automatically called and a few variables (name,identifier,version) can be pre-filled. Others you will need to provide explicitly if wanted (preferably in the upstream project in acodemeta.json
file). - When using
lamachine-git
,lamachine-register
is only called if you setdo_registration: true
. Certain variables (name, version) can be pre-filled. Others you will need to provide explicitly if wanted.
- Holds a
The use of specific LaMachine roles is always preferred over the use of comparable generic ansible modules as the
LaMachine roles take care of a lot of specific things for you so it works in all environments. So use
lamachine-python-install
rather than Ansible's pip
module, and lamachine-git
rather than ansible's git
or github
module.
To add your own software, you add a role yourself which includes one (or more) of the above, with specific parameters, to do the actual work. Your role, in turn, is referenced by the end-user who has final control over the installation playbook. Rather than just installing a single piece of software, a role in LaMachine usually installs multiple inter-connected software components and all their dependencies.
This may sound a bit cryptic still, so let's go through some examples:
Learning by example is usually the most efficient. We provide somes examples below, but also want to encourage you to
just browse around in the roles/
directory and see how some of the existing software packages are installed:
- For this example, we use a Python software package named babelente
- The source code is on github as https://github.com/proycon/babelente
- The software is released on the Python Package Index as babelente, meaning a simple
pip install babelente
is enough to install it and all dependencies. - Fork the LaMachine github repository
- Git clone your fork
- Create a file roles/babelente/tasks/main.yml (create the necessary directories) with the following contents:
- name: Install Foobar
include_role:
name: lamachine-python-install
vars:
package:
github_user: proycon
github_repo: babelente
pip: babelente
That's it, the lamachine-python-install
role works in such a way that the stable version of LaMachine will use
pip
with PyPI, whilst the development version of LaMachine will draw from github directly and run python3 setup.py install
. Note that this automatically covers any Python dependencies the package has declared.
If a software package already commonly included in our supported Linux distributions, then we can pull straight from the
distribution's repository. To this end, we use the lamachine-package-install
role. The following example
demonstrates an installation of the Tesseract OCR system, supported on different distributions:
- name: Install Tesseract
include_role:
name: lamachine-package-install
with_items:
- { debian: tesseract-ocr, redhat: tesseract, arch: tesseract, homebrew: tesseract }
- { debian: tesseract-ocr-eng, redhat: tesseract-langpack-eng, arch: tesseract-data-eng }
loop_control:
loop_var: package
Note that with_items
and loop_control
is a standard Ansible looping construct. The role is called two times, and the
variable package
is assigned the value provided in with_items
.
It is not always feasible to include all distributions and this it not obligatory, if a platform isn't mentioned then nothing will be installed. This however may break the process, so you if you decide not to support a certain platform, we encourage you to set up a task that produces an error if the platform is unsupported. This can be done as follows:
- name: Check for unsupported OS
fail:
msg: "This software is not supported on Mac OS X or CentOS"
when: ansible_distribution|lower == "macosx" or ansible_distribution|lower == "centos"
LaMachine attempts to register software metadata of all software it explicitly installs, the lamachine-register
role
that is invoked by various installation procedures takes care of this. All software metadata is collected in a
single lamachine-registry.json
file.
LaMachine uses the Codemeta standard for software metadata,
ensuring a unified metadata format. Metadata from certain sources such as the Python Package Index or CRAN can be
automatically converted quite well. Our philosophy is that software metadata should be in a simple format and either live right alongside the source code in a
version controlled repository (e.g. on github, bitbucket, etc), or be obtained from a software repository such as as the
Python Package Index, CRAN, CPAN, Maven Central and automatically converted to a unified format. This automatic
conversion is performed by CodeMetaPy or CodeMetar), included in LaMachine.
You can, and are in fact encouraged to, place a codemeta.json
file in the root of your source code repository, following the proper Codemeta specifications. If you do so, LaMachine
will automatically use this. Alternatively, you can augment metadata from within the ansible roles themselves if updating the upstream project
is not an option.
Within LaMachine, the lamachine-list
allows users to access the available metadata from the command line (it will be serialised as YAML).
If you enabled and started the webserver in LaMachine, then you have access to a rich portal page giving an overview of all installed software and providing access to any software with a web-based interface. This portal is powered by Labirinto.
If you are doing development on LaMachine, you will be working from a cloned LaMachine repository on a particular git
branch. We recommend to fork it on github and then clone your fork. Make sure you navigate to the root of the git
repository, and then simply invoke ./bootstrap.sh
to build a LaMachine build. LaMachine will simply reuse your git
repository for the controller environment rather than download and create a new one, allowing you to test your
additions.
Once you are ready, issue a pull request on https://github.com/proycon/LaMachine/ to merge your changed into the
develop
branch.
Various predefined tests are also available for multiple linux distributions, through vagrant and docker. The test
script is builds/test.py
and the build are defined in builds/build.py
. Note that the former script references a particular
LaMachine git repository and branch, so may need to be adapted to fit your situation.