Like freedb, MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
The musicbrainz-elasticsearch project is a Java batch that indexes release groups from the MusicBrainz database into an Elasticsearch index.
Among release groups, only "real" albums are indexed: the Single, EP and Broadcast primary types are skipped, and Album release groups with a Compilation, Live, Remix or Soundtrack secondary type are skipped as well.
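The selection rule above can be sketched as a small predicate. This is only an illustration: the batch applies the rule in its SQL request, and the PrimaryType, SecondaryType and AlbumFilter names below are hypothetical, modelling only the types named in this README.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical types for illustration only; not the project's own enumerations.
enum PrimaryType { ALBUM, SINGLE, EP, BROADCAST }

enum SecondaryType { COMPILATION, LIVE, REMIX, SOUNDTRACK }

final class AlbumFilter {

    // Secondary types that disqualify an Album release group.
    private static final Set<SecondaryType> EXCLUDED =
            EnumSet.of(SecondaryType.COMPILATION, SecondaryType.LIVE,
                       SecondaryType.REMIX, SecondaryType.SOUNDTRACK);

    private AlbumFilter() { }

    /** Returns true when a release group counts as a "real" album. */
    static boolean isIndexed(PrimaryType primary, Set<SecondaryType> secondary) {
        if (primary != PrimaryType.ALBUM) {
            return false; // Single, EP and Broadcast are skipped
        }
        for (SecondaryType type : secondary) {
            if (EXCLUDED.contains(type)) {
                return false; // e.g. a live album or a compilation
            }
        }
        return true;
    }
}
```

With this sketch, an Album with no secondary type is kept, while an Album tagged Live or any EP is rejected.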
- SQL request selecting music albums from the MusicBrainz PostgreSQL database
- Elasticsearch index settings and mapping of the musicalbum index in JSON format
- Tasklet deleting previous index
- Tasklets creating settings and mappings for the musicalbum index
- Parallel Elasticsearch indexing using multiple threads in a single process
- A Java main class to launch the batch (from the command line, an IDE or Maven)
- End-to-end unit tests with the U2 discography
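To give an idea of what multi-threaded indexing in a single process means, here is a minimal sketch. It is not the project's code: the batch relies on Spring Batch, whereas this example uses only java.util.concurrent. The ParallelIndexSketch class and the ChunkIndexer interface are hypothetical, and ChunkIndexer stands in for whatever actually sends documents to Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelIndexSketch {

    /** Hypothetical stand-in for the code that sends one chunk of documents to Elasticsearch. */
    interface ChunkIndexer {
        int index(List<String> documents);
    }

    private ParallelIndexSketch() { }

    /** Splits the documents into chunks, indexes the chunks on a thread pool and returns the total count. */
    static int indexAll(List<String> documents, int chunkSize, int threads, final ChunkIndexer indexer) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < documents.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, documents.size());
                // Copy the chunk so each task owns its own list.
                final List<String> chunk = new ArrayList<>(documents.subList(start, end));
                Callable<Integer> task = () -> indexer.index(chunk);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // waits for the chunk and surfaces its failure, if any
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed to index", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The chunk size and thread count are free parameters of the sketch; good values depend on the machine and on the Elasticsearch cluster.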
This project depends on several other open source projects:
- Spring Batch : the most popular Java batch framework.
- Spring Data Elasticsearch : Spring factories for Elasticsearch used to inject ES client into batch tasklets.
- MusicBrainz data : useful Gender, ReleaseGroupPrimaryType and ArtistType enumerations. Its JPA data bindings are not used by the batch.
- DbSetup : the database unit test framework you must try.
A MusicBrainz database and an Elasticsearch cluster are the two prerequisites for running the batch. You can either set them up yourself or use Docker.
Use Docker Compose to set up both a PostgreSQL database and an Elasticsearch cluster and import the MusicBrainz database.
If you are on MacOS or Windows, you have to install Boot2docker in order to use Docker and Docker Compose. You will have to increase the DiskSize to 100 GB.
Command lines to start PostgreSQL and Elasticsearch:
git clone https://github.com/arey/musicbrainz-elasticsearch.git
cd musicbrainz-elasticsearch/docker
docker-compose up -d
docker-compose run postgresql /create-database.sh
- If you are using Boot2docker, get the IP address of the Docker host:
boot2docker ip
Then edit the es-musicbrainz-batch.properties file and replace localhost with that IP in the es.host and db.musicbrainz.url properties.
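As an illustration of that substitution, the sketch below replaces localhost in the two properties named above. Only the es.host and db.musicbrainz.url keys come from this README; the HostOverride class is hypothetical, and editing the file by hand works just as well.

```java
import java.util.Properties;

final class HostOverride {

    private HostOverride() { }

    /** Returns a copy of the properties with "localhost" replaced by the given IP in both connection settings. */
    static Properties withHost(Properties original, String ip) {
        Properties updated = new Properties();
        updated.putAll(original); // leave the caller's properties untouched
        for (String key : new String[] {"es.host", "db.musicbrainz.url"}) {
            String value = updated.getProperty(key);
            if (value != null) {
                updated.setProperty(key, value.replace("localhost", ip));
            }
        }
        return updated;
    }
}
```

For example, with the IP printed by boot2docker ip, a db.musicbrainz.url of jdbc:postgresql://localhost:5432/musicbrainz keeps its port and database name and only the host part changes.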
The docker-compose run command creates the database, downloads the latest dumps and then populates the database. Depending on your bandwidth, downloading mbdump.tar.bz2 can take more than an hour.
To index MusicBrainz data, the batch requires a connection to the MusicBrainz PostgreSQL relational database.
MusicBrainz.org does not provide public access to its database, so you have to install your own.
There are two different methods to get a local database up and running. You can either:
- Download a pre-configured virtual image of the MusicBrainz Server, or
- Download the latest data dumps and follow the relevant section of the INSTALL.md
For my part, before using Docker, I chose to download the MusicBrainz Server virtual machine. It is available as an Open Virtualization Archive (OVA); I deployed it into Oracle VirtualBox, but you may prefer VMware.
Once you have finished the MusicBrainz Server setup guide, follow the two final steps below to make the PostgreSQL database accessible from your host:
Configuring port forwarding with NAT
Port forwarding enables VirtualBox to listen to certain ports on the host and resend all packets which arrive there to the guest, on the same or a different port. You may use the same port on host and guest. Configure two rules (the second is optional):
- PostgreSQL database : TCP - host : 5432 / guest : 5432
- MusicBrainz web server : TCP - host : 5000 / guest : 5000
Configuring PostgreSQL
To enable remote access to the PostgreSQL database server, you may follow these instructions. Log into the VM (credentials: vm / musicbrainz) and edit the two configuration files pg_hba.conf and postgresql.conf.
Once these steps are done, you can connect to the database with any JDBC client (e.g. SQuirreL):
- URL: jdbc:postgresql://localhost:5432/musicbrainz
- Credentials: musicbrainz / musicbrainz
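The same connection can be checked from Java with plain JDBC. The sketch below is a minimal example, assuming the PostgreSQL JDBC driver is on the classpath; the MusicBrainzConnectionCheck class is hypothetical and only uses the URL and credentials listed above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class MusicBrainzConnectionCheck {

    private MusicBrainzConnectionCheck() { }

    /** Builds the JDBC URL of a PostgreSQL database. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    /** Opens a connection and runs a trivial query; returns false on any SQL error. */
    static boolean canConnect(String url, String user, String password) {
        try (Connection connection = DriverManager.getConnection(url, user, password);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery("SELECT 1")) {
            return rows.next();
        } catch (SQLException e) {
            // Covers a missing driver, a closed port and rejected credentials alike.
            return false;
        }
    }
}
```

Calling canConnect(jdbcUrl("localhost", 5432, "musicbrainz"), "musicbrainz", "musicbrainz") should return true once port forwarding and remote access are configured.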
Before launching the batch, you have to download Elasticsearch v1.7.1 and unarchive it. You may want to change the default Elasticsearch cluster name in the config/elasticsearch.yml configuration file; if you do, set the same name in the es-musicbrainz-batch.properties configuration file.
git clone https://github.com/arey/musicbrainz-elasticsearch.git
mvn install
mvn exec:java
(execute the IndexBatchMain main class)
On a MacBook Pro, the batch takes less than 3 minutes to build the Elasticsearch index.
MusicBrainz database searching with Elasticsearch : http://musicsearch.javaetmoi.com/
For command line testing, you can execute the following two curl scripts: musicbrainz_autocomplete_u2.sh and musicbrainz_fulltext_u2_war.sh
- GitHub is a social coding platform: if you want to write code, we encourage contributions through pull requests from forks of this repository. If you contribute code this way, please also reference a GitHub ticket covering the specific issue you are addressing.
Download the code with git: git clone git://github.com/arey/musicbrainz-elasticsearch.git
Compile the code with Maven:
mvn clean install
If you're using an IDE that supports Maven-based projects (IntelliJ IDEA, NetBeans or m2eclipse), you can import the project directly from its POM. Otherwise, generate IDE metadata with the related IDE Maven plugin:
mvn eclipse:clean eclipse:eclipse
French articles on the javaetmoi.com blog:
Version | Release date | Features |
---|---|---|
1.1-SNAPSHOT | 23/08/2015 | Elasticsearch 1.7.1 update; Docker Compose files; Spring Data Elasticsearch use |
1.0 | 26/10/2013 | Initial version developed for a workshop about Elasticsearch (0.90.5) |