2.3 Configuring Your Project
- Include the MapReduce library in your project
- Create a module for running MapReduce jobs
- Define your own task queue
- Assign a Google Cloud Storage bucket
There are three ways to include the MapReduce library in your app. The library is available via the Maven Central repositories, so you can link to it with Maven or Ant/Ivy dependency declarations. You can also download the library source code, compile it, and copy the jars directly into your project.
The MapReduce library is available in the Maven Central repositories. Include the following dependency in your project's pom.xml file:
<dependency>
  <groupId>com.google.appengine.tools</groupId>
  <artifactId>appengine-mapreduce</artifactId>
  <version>RELEASE</version>
</dependency>
Add the following dependency to your project's ivy.xml file:
<dependency org="com.google.appengine.tools" name="appengine-mapreduce" rev="latest.integration" />
You can lock the library to a version range, for example by specifying rev="[1.0,1.1)". If you do, the build no longer picks up later releases automatically, so consider the compatibility issues whenever you change the range.
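As a sketch of the pinned form, the declaration below restricts resolution to the range mentioned above. The version numbers are the placeholders from that example, not a recommendation of a particular release.

```xml
<!-- ivy.xml: resolve any version >= 1.0 and < 1.1.
     The range values are illustrative placeholders. -->
<dependency org="com.google.appengine.tools" name="appengine-mapreduce" rev="[1.0,1.1)" />
```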
Normally, you'll link to the MapReduce library with Maven or Ant/Ivy. To build it from the downloaded source code with Apache Ant instead, run:
cd java
ant dist
This creates the java/dist directory, which contains all the jars in the MapReduce library. Copy these jars into your application's WEB-INF/lib directory.
Alternatively, to build the library using Apache Maven, run:
cd java
mvn package
MapReduce jobs can run for a long time. We strongly recommend that you run MapReduce in a separate module, preferably one that does not handle user requests. Remember to pass that module's name to MapReduceSettings.setModule() when you create your MapReduceJob.
To create a module, make a WAR directory and include it in your project's EAR directory. The process is described in detail in the App Engine modules documentation.
The WAR directory contains a web.xml file that declares a module's servlets. All MapReduce jobs use two servlets from the MapReduce library. Copy the code below into your module's web.xml file.
<!-- Servlet that handles MapReduce requests -->
<servlet>
  <servlet-name>mapreduce</servlet-name>
  <servlet-class>com.google.appengine.tools.mapreduce.MapReduceServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>mapreduce</servlet-name>
  <url-pattern>/mapreduce/*</url-pattern>
</servlet-mapping>
<!-- Servlet from the Pipeline library, also used by every MapReduce job -->
<servlet>
  <servlet-name>pipeline</servlet-name>
  <servlet-class>com.google.appengine.tools.pipeline.impl.servlets.PipelineServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>pipeline</servlet-name>
  <url-pattern>/_ah/pipeline/*</url-pattern>
</servlet-mapping>
You should also consider adding a security constraint, so only admins can initiate MapReduce jobs:
<security-constraint>
  <web-resource-collection>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>admin</role-name>
  </auth-constraint>
</security-constraint>
The WAR directory also contains an appengine-web.xml file. In this file, you can specify the module's instance type, which lets you separate the cost and performance of MapReduce from the rest of your application. For very large jobs, the shuffle stage can become memory-bound. With more memory, the shuffle stage uses fewer and larger temporary files for sorting. Using the instance types B4, B4_1G, B8, F4, or F4_1G can significantly improve the performance of the shuffle stage.
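A minimal sketch of such an appengine-web.xml follows. The application ID (my-app), module name (mapreduce), version, and scaling values are assumptions chosen for illustration; only the instance class comes from the list above. The B instance classes are used with basic or manual scaling, so this sketch picks basic scaling.

```xml
<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <!-- Hypothetical application ID; replace with your own -->
  <application>my-app</application>
  <!-- Module name; this is the string you would pass to MapReduceSettings.setModule() -->
  <module>mapreduce</module>
  <version>1</version>
  <threadsafe>true</threadsafe>
  <!-- A larger instance class gives the shuffle stage more memory -->
  <instance-class>B4</instance-class>
  <!-- Illustrative scaling limits; tune them for your workload -->
  <basic-scaling>
    <max-instances>5</max-instances>
    <idle-timeout>10m</idle-timeout>
  </basic-scaling>
</appengine-web-app>
```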
You must include the name of the MapReduce module in the EAR's application.xml file. Assuming the WAR directory is named "mapreduce", include this code:
<module>
  <web>
    <web-uri>mapreduce</web-uri>
    <context-root>mapreduce</context-root>
  </web>
</module>
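For context, here is a sketch of how that fragment might sit inside a complete application.xml. It assumes a second WAR directory named "default" that serves user requests, and the description and display name are placeholders; none of these names come from the library itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<application
    xmlns="http://java.sun.com/xml/ns/javaee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
                        http://java.sun.com/xml/ns/javaee/application_5.xsd"
    version="5">
  <description>Example app with a MapReduce module</description>
  <display-name>example-app</display-name>

  <!-- Hypothetical module that handles user requests -->
  <module>
    <web>
      <web-uri>default</web-uri>
      <context-root>default</context-root>
    </web>
  </module>

  <!-- Module dedicated to MapReduce jobs; matches the WAR directory name -->
  <module>
    <web>
      <web-uri>mapreduce</web-uri>
      <context-root>mapreduce</context-root>
    </web>
  </module>
</application>
```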
A MapReduce job should use a dedicated task queue rather than sharing your application's default task queue. When you define the <queue> elements in the WEB-INF/queue.xml file, follow these guidelines:
- The <name> of the queue should match the string you pass to MapReduceSettings.setWorkerQueueName().
- Use <rate>, <bucket-size>, and <max-concurrent-requests> as desired to control the throughput.
- MapReduce only supports the "push" <mode>, which is the default.
- Don't specify a <target>; use MapReduceSettings.setModule() instead.
- Don't use any of the <retry-parameters> options; use MapReduceSettings.setMaxSliceRetries() and MapReduceSettings.setMaxShardRetries() instead.
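The guidelines above can be sketched as a queue.xml like the one below. The queue name (mapreduce-workers) and the throughput numbers are hypothetical; pick values that suit your job, and pass the same name to MapReduceSettings.setWorkerQueueName().

```xml
<?xml version="1.0" encoding="UTF-8"?>
<queue-entries>
  <queue>
    <!-- Hypothetical name; must match the string given to setWorkerQueueName() -->
    <name>mapreduce-workers</name>
    <!-- Illustrative throughput limits -->
    <rate>10/s</rate>
    <bucket-size>20</bucket-size>
    <max-concurrent-requests>40</max-concurrent-requests>
    <!-- No <mode>, <target>, or <retry-parameters>: push is the default mode,
         and routing and retries are configured through MapReduceSettings -->
  </queue>
</queue-entries>
```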
Every MapReduce job uses a Google Cloud Storage bucket, which can be specified in the MapReduceSettings. You may use either the default GCS bucket or one that you create. If you don't specify a bucket, the default bucket is used.
New App Engine apps are automatically assigned a default bucket with a free quota. The name of the bucket has the form <app-id>.appspot.com, where <app-id> is your app's ID. If you are adding MapReduce to an existing app, you can check whether a default bucket already exists (and create one if it doesn't) by going to the admin console:
- Select the Application Settings page from the Administration section in the left-side menu.
- If a default bucket exists, it will appear in your settings under the title "Google Cloud Storage Bucket."
- If no default bucket is shown, scroll down the page to the section titled "Cloud Integration." In that section, push the Create button to create a Google Cloud project that will contain a default bucket.
Before you deploy your app to the cloud, you should consider whether or not to enable billing. With billing enabled, you can still use the default bucket, or you can create a special bucket for your app. In either case, you will be billed for storage use above the free quota, up to the amount you specify in your budget. If billing is not enabled, you can only use the default bucket, and in this case, if your bucket use exceeds the free quota, your MapReduce job will fail.