Installation
============
**TL;DR:**
To get a development environment running, follow these steps:

#. Clone the repo and start the containers:

   .. code-block:: bash

      git clone [email protected]:os2datascanner/os2datascanner.git
      cd os2datascanner
      docker-compose up -d

   You can now reach the following services on their respective ports:

   - Administration module: http://localhost:8020
   - Web interface for message queues: http://localhost:8030
   - Report module: http://localhost:8040

   (see `Services`_ for further information)
#. Create logins for the Django modules:

   Logins for the Django modules (Administration and Report) must be created when
   the development environment is first started (and any time the data volume has
   been wiped).

   Having started the environment as described above, simply run

   .. code-block:: bash

      docker-compose exec admin_application python manage.py createsuperuser
      docker-compose exec report_application python manage.py createsuperuser

   You can pass username and email as arguments to the command by adding
   ``--username <your username>`` and/or ``--email <your email>`` at the
   end of the commands above; otherwise you will be prompted for them along
   with a password.

   As of `Django 3.0 <https://docs.djangoproject.com/en/3.2/ref/django-admin/#django-admin-createsuperuser>`_,
   users can also be created non-interactively ("script-like") with

   .. code-block:: bash

      docker-compose exec -e DJANGO_SUPERUSER_PASSWORD=test admin_application python manage.py createsuperuser --noinput --username test --email <your email>
      docker-compose exec -e DJANGO_SUPERUSER_PASSWORD=test report_application python manage.py createsuperuser --noinput --username test --email <your email>

   Credentials for the message queue web interface can be found in:

   - ``dev-environment/rabbitmq.env``
#. Start a scan:

   #. Log into the administration module with the newly created superuser at
      http://localhost:8020
   #. Go to ``Administration`` and add an ``Organization``.
   #. Return to the main page, go to ``Regler`` (Rules) and add one.
   #. Go to ``Scannerjob`` and add a webscan using the organization and rule
      just created for a website - e.g. ``https://www.magenta.dk``

      **NB!** Please note that OS2datascanner has been built to scan an
      organization's *own* data sources, and to do so as efficiently as
      possible. Thus, OS2datascanner does not check for or adhere to e.g.
      ``robots.txt`` files, and may as a consequence overload a system or
      trigger automated safety measures; **always ensure that the site
      administrator is okay with scanning the site!**
   #. Start the scan by clicking the play button and confirming your choice.
#. Follow the engine activity in RabbitMQ (optional):

   #. Log into the web interface for RabbitMQ - using the credentials
      mentioned above - at
      http://localhost:8030
   #. Queue activity is available on the ``Queues`` tab.
#. See the results:

   #. Log into the report module with the newly created superuser at
      http://localhost:8040
   #. Go to the Django admin site at
      http://localhost:8040/admin
   #. Create a new ``Remediator`` pointing to the superuser just created.
   #. Return to the main page and check the results - refresh page for updates.
Docker
------
The repository contains a ``Dockerfile`` for each of the OS2datascanner
modules:
- **Administration**: ``docker/admin/Dockerfile``
- **Engine**: ``docker/engine/Dockerfile``
- **Report**: ``docker/report/Dockerfile``
Using these is the recommended way to
install OS2datascanner as a developer.
.. TODO: adjust section when the set-up has matured.
   `as a developer` -> `both as a developer and in production.
   All releases are pushed to Docker Hub at <link to our registry> under the
   ``latest`` tag.
To run OS2datascanner in Docker, you need a running Docker daemon. See
`the official Docker documentation <https://docs.docker.com/install/>`_ for
installation instructions.
The containers for the Admin and Report modules require a connection to a
postgres database server. It is configured with the ``DATABASE_*`` settings.
The database server must have a user and a database for each of these modules.
They can be created with the help of the scripts in the
``docker/postgres-initdb.d/`` folder:
- ``docker/postgres-initdb.d/20-create-admin-db-and-user.sh``
- ``docker/postgres-initdb.d/40-create-report-db-and-user.sh``
The folder can easily be mounted into ``/docker-entrypoint-initdb.d/`` in
`the official postgres docker image <https://hub.docker.com/_/postgres>`_, and
further contains a script to ensure that all relevant environment variables
have been passed to the container:
- ``docker/postgres-initdb.d/10-test-for-valid-env-variables.sh``
To run a fully functional OS2datascanner system, you will need to start a number
of services. The recommended way to set up an appropriate development environment
is to use `Docker-compose`_.
.. TODO: fill in section on starting each service when the set-up has matured.

..
   Static files
   ^^^^^^^^^^^^

..
   Logs
   ^^^^

User permissions
^^^^^^^^^^^^^^^^
Each ``Dockerfile`` creates a dedicated user, and any services started are run
as the user created by the related ``Dockerfile``. All files generated by such
a service will be owned by the respective user. For each user, the ``UID`` and
``GID`` are identical:
- **Administration**: 73020
- **Engine**: 73030
- **Report**: 73040
If you want to use another ``UID/GID``, you can specify it with the
``--user=uid:gid``
`override flag <https://docs.docker.com/engine/reference/run/#user>`_ for the
``docker run`` command or
`in docker-compose <https://docs.docker.com/compose/compose-file/#domainname-hostname-ipc-mac_address-privileged-read_only-shm_size-stdin_open-tty-user-working_dir>`_.
If you change the ``UID/GID``, the ``/log`` and ``/static`` volumes may not
have the right permissions. If you override the user, it is recommended to use
only `bind mounts <https://docs.docker.com/storage/bind-mounts/>`_ and to make
that user the owner of the directory you bind.
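
Before binding a host directory, it can help to confirm that its owner matches
the ``UID/GID`` the container will run as. The following is a minimal sketch in
Python and not part of the repository: ``MODULE_IDS`` mirrors the values listed
above, while ``user_flag`` and ``is_owned_by`` are hypothetical helper names.

```python
import os

# UID/GID values as listed above; UID and GID are identical for each module.
MODULE_IDS = {"admin": 73020, "engine": 73030, "report": 73040}


def user_flag(uid: int, gid: int) -> str:
    """Build the --user override flag in the form accepted by `docker run`."""
    return f"--user={uid}:{gid}"


def is_owned_by(path: str, uid: int, gid: int) -> bool:
    """Return True if `path` is owned by exactly this UID and GID."""
    info = os.stat(path)
    return info.st_uid == uid and info.st_gid == gid
```

If ``is_owned_by`` returns ``False`` for the directory you intend to bind,
change its owner (or pick a different ``UID/GID``) before starting the
container.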
Missing permissions in development environment
**********************************************
During development, we mount our local editable files into the docker
containers, which means they are owned by the local user, and **not** the user
running inside the container. Thus, any process running inside the container,
such as a management command, will not be allowed to create or update files in
the mounted locations.
In order to fix this, we need to allow "others" to write to the relevant
locations. This can be done with ``chmod -R o+w <path>``
(``o`` is for "other users", ``+w`` is to add write-permissions and ``-R`` is
used to add the permissions recursively down through the file structure from
the location ``<path>`` points to).
The above is necessary whenever a process needs write permissions, but should
always be done for the following locations:
* ``code/src/os2datascanner/projects/<module>/locale/``
* ``code/src/os2datascanner/projects/<module>/<module>app/migrations/``
``<module>`` being either ``admin`` or ``report``.
**NB!** Git only records the executable bit of a file's permissions, which
means that granting other users write permissions on your local setup will not
compromise production security.
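
To make the effect of ``chmod -R o+w <path>`` concrete, here is a rough Python
equivalent. It is a sketch for illustration only (``add_other_write`` is a
hypothetical name, and skipping symlinks is a choice made here); in practice
the ``chmod`` command above is all you need.

```python
import os
import stat


def add_other_write(root: str) -> None:
    """Add the write bit for "others" to `root` and everything below it."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        if os.path.islink(path):
            continue  # leave symlinks alone rather than changing their targets
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | stat.S_IWOTH)  # o+w: keep existing bits, add write
```

All existing permission bits are kept; only the write bit for "others" is
added, so a file with mode ``644`` ends up as ``646``.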

..
   Test
   ^^^^

Docker-compose
--------------
You can use ``docker-compose`` to start the OS2datascanner system and its runtime
dependencies (PostgreSQL and RabbitMQ).
A ``docker-compose.yml`` for development is included in the repository. It
specifies the settings to start and connect all required services.
Services
^^^^^^^^
The main services for OS2datascanner are:
- ``admin_frontend``:
    Only needed in development.

    Watches the frontend files and provides support for rebuilding the frontend
    easily during the development process.

- ``admin_application``:
    Reachable on: http://localhost:8020

    Runs the django application that provides the administration interface for
    defining and managing organisations, rules, scans etc.

- ``engine_explorer``:
    Runs the **explorer** stage of the engine.

- ``engine_processor``:
    Runs the **processor** stage of the engine.

- ``engine_matcher``:
    Runs the **matcher** stage of the engine.

- ``engine_tagger``:
    Runs the **tagger** stage of the engine.

- ``engine_exporter``:
    Runs the **exporter** stage of the engine.

- ``report_frontend``:
    Only needed in development.

    Watches the frontend files and provides support for rebuilding the frontend
    easily during the development process.

- ``report_application``:
    Reachable on: http://localhost:8040

    Runs the django application that provides the interface for accessing and
    handling reported matches.

- ``report_collector``:
    Runs the **collector** service that saves match results to the database of
    the report module.

These depend on some auxiliary services:

- ``db``:
    Runs a postgres database server based on
    `the official postgres docker image`_.

- ``queue``:
    Runs a RabbitMQ message queue server based on
    `the official RabbitMQ docker image`_, including a plugin providing a web
    interface for monitoring (and managing) queues and users.

    The web interface can be reached on: http://localhost:8030

.. _`the official postgres docker image`: https://hub.docker.com/_/postgres
.. _`the official RabbitMQ docker image`: https://hub.docker.com/_/rabbitmq/
Postgres initialisation
^^^^^^^^^^^^^^^^^^^^^^^
The postgres database is initialized using the scripts included in the
``docker/postgres-initdb.d/`` folder, which check that the configuration is
valid and add **postgres users** for the modules that need them.
They do not populate the database with users for the Django modules or any
other data.
Gunicorn worker configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The two Django apps and the API use ``Gunicorn`` to serve web requests. By default, Gunicorn
starts ``CPU_COUNT*2+1`` workers. To override this default, set the ``GUNICORN_WORKERS``
environment variable, e.g. ``GUNICORN_WORKERS=2``.
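
The rule can be written out as a few lines of Python. How the project actually
wires this into Gunicorn is not shown in this README, so treat the following as
a sketch: ``worker_count`` is a hypothetical helper, and the final assignment
assumes a Python Gunicorn configuration file, where the setting is named
``workers``.

```python
import multiprocessing
import os


def worker_count(environ=os.environ) -> int:
    """Return GUNICORN_WORKERS if it is set, else CPU_COUNT * 2 + 1."""
    override = environ.get("GUNICORN_WORKERS")
    if override:
        return int(override)
    return multiprocessing.cpu_count() * 2 + 1


# In a Gunicorn configuration file, the setting itself is named `workers`.
workers = worker_count()
```

On a machine with 4 CPUs this yields 9 workers unless ``GUNICORN_WORKERS`` is
set, which is why a lower value can be useful on a development laptop.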
Django application users
^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned above, the system is not initialised with any default users for
the Django applications. Instead, these will need to be created by running

.. code-block:: bash

   docker-compose {exec|run} {admin|report}_application python manage.py createsuperuser [--username <your username>] [--email <your email>]

where ``exec`` is used when the development environment is already running, and
``run`` when it is not.
If you find yourself having to wipe the database often, you may find it helpful
to write a small script to aid with this, e.g.:

.. code-block:: bash

   # Go to correct directory
   cd <path to repository root>
   # <command> is either exec or run, as described above
   # create admin user:
   echo "Creating superuser for admin module..."
   docker-compose <command> admin_application python manage.py createsuperuser --username <your username> --email <your email>
   # create report user:
   echo "Creating superuser for report module..."
   docker-compose <command> report_application python manage.py createsuperuser --username <your username> --email <your email>

**NB!** Make sure your script is **not** added to the repo: add the file (or a
separate folder it lives in) to the global list for git to ignore (usually
``~/.config/git/ignore``; you may have to create the ``git`` folder and the
``ignore`` file yourself).
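
The same idea can be sketched in Python, using the non-interactive form of
``createsuperuser`` (``--noinput`` together with ``DJANGO_SUPERUSER_PASSWORD``).
This is an illustration only: the helper names are hypothetical, the script has
to be started from the repository root, and passing a password on the command
line is only reasonable for a throwaway development setup.

```python
import subprocess

MODULES = ("admin", "report")


def createsuperuser_command(module, username, email, password, running=True):
    """Build the docker-compose command that creates a superuser without prompts."""
    return [
        "docker-compose",
        "exec" if running else "run",  # exec: environment is up; run: it is not
        "-e", f"DJANGO_SUPERUSER_PASSWORD={password}",
        f"{module}_application",
        "python", "manage.py", "createsuperuser", "--noinput",
        "--username", username,
        "--email", email,
    ]


def create_superusers(username, email, password, running=True):
    """Create the same superuser in both Django modules."""
    for module in MODULES:
        command = createsuperuser_command(module, username, email, password, running)
        subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Calling ``create_superusers("test", "<your email>", "test")`` would then run
the two commands in turn and stop at the first one that fails.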
Tests
=====
Each module has its own test-suite. These are run automatically as part of the
CI pipeline, which also produces a code coverage report for each test-suite.
During development, the tests can be run using the relevant Docker image for
each module. As some of the tests are integration tests that require auxiliary
services - such as access to a database and/or message queue - we recommend
using the development docker-compose set-up to run the tests, as this takes care
of the required settings and bindings.
To run the test-suites using docker-compose:

.. code-block:: bash

   docker-compose run admin_application python -m django test os2datascanner.projects.admin.tests
   docker-compose run engine_explorer python -m unittest discover -s /code/src/os2datascanner/engine2/tests
   docker-compose run report_application python -m django test os2datascanner.projects.report.tests

Please note that the engine tests can be run using any of the five pipeline
services as the basis, but a specific one is provided above for easy reference.

.. TODO: Add section on running the test suite when scripts for the
   proper permissions in postgres has been added. Possibly add a script for
   running the tests, compiling the report, and exposing/binding it to the
   host.

Shell access
============
To access a shell on any container based on the OS2datascanner module images,
run

.. code-block:: bash

   docker-compose {exec|run} <container name> bash

Debugging
=========
Stacktrace
^^^^^^^^^^
A stack trace is printed to ``stderr`` if a pipeline component receives
``SIGUSR1``. The scan continues without interruption.
The components must be started using ``run_stage``.

Running the engine locally:

.. code-block:: bash

   python -m os2datascanner.engine2.pipeline.run_stage worker
   # in a second terminal: find the <pid> of the process and signal it
   ps aux | grep os2datascanner
   kill -USR1 <pid>

Running the engine in Docker, relying on the container's processes also being
visible from the host:

.. code-block:: bash

   docker top os2datascanner_engine_worker_1 # get the <pid> of the python process
   kill -USR1 <pid>
   docker logs os2datascanner_engine_worker_1

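
How ``run_stage`` implements this is not shown in this README. As a general
illustration of the mechanism, a Python process can register a ``SIGUSR1``
handler that prints the current stack and then returns, so the interrupted work
simply carries on. The function names below are hypothetical.

```python
import signal
import sys
import traceback


def dump_stack(signum, frame):
    """Print the interrupted code's stack trace to stderr, then return."""
    print(f"Received {signal.Signals(signum).name}; current stack:", file=sys.stderr)
    traceback.print_stack(frame, file=sys.stderr)


def install_stack_dump_handler():
    """Register dump_stack for SIGUSR1 (available on Unix-like systems only)."""
    signal.signal(signal.SIGUSR1, dump_stack)
```

Because the handler only prints and returns, sending the signal repeatedly is
harmless, which makes it a cheap way to see where a long-running stage is
spending its time.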
Documentation
=============
The documentation can be found at the `OS2datascanner pages on Read the Docs`_.

.. _`OS2datascanner pages on Read the Docs`: https://os2datascanner.readthedocs.io/en/latest

.. TODO: add section on how to build locally and how to access the artifact
   generated by the pipeline, when this has been setup in the new GitLab CI

Code standards
==============
The coding standards below should be followed by all new and edited code for
the project. Linting checks are applied, but currently allowed to fail;
introducing a hard requirement would mean having to fill the version control
history with commits only related to style, which is considered undesirable.
.. TODO: add section on shellcheck and Hadolint, when the new CI pipeline is up
Licensing
=========
The OS2datascanner was programmed by Magenta ApS (https://magenta.dk)
for OS2 - Offentligt digitaliseringsfællesskab, https://os2.eu.
Copyright (c) 2014-2020, OS2 - Offentligt digitaliseringsfællesskab.
The OS2datascanner is free software; you may use, study, modify and
distribute it under the terms of version 2.0 of the Mozilla Public
License. See the LICENSE file for details. If a copy of the MPL was not
distributed with this file, You can obtain one at
http://mozilla.org/MPL/2.0/.
All source code in this and the underlying directories is subject to
the terms of the Mozilla Public License, v. 2.0.