
MongoDB Queries in a Nutshell


Introduction

This wiki page will attempt to get you the 'low hanging fruit' with MongoDB, and after reading it you will by no means be a Mongo master. However, with the following commands you can do 80% of what a typical BH engineer will have to do with their data, and get past a decent amount of the learning curve on the way to getting great results from MongoDB. If you want further information, please see MongoDB's online documentation: MongoDB Reference Manual. Googling "mongodb aggregate XXXX" will also be useful when you want to search how to do XXXX, where XXXX is something like "greater than" or "less than" or "division", etc.

As you will see, MongoDB's query language is fairly easy to learn. Whilst it isn't extremely human readable, it is certainly more straightforward than things like nested loops. All of MongoDB's language is based on JSON syntax, so it is worth looking a bit at that syntax before jumping into Mongo queries. Simply speaking, a JSON expression will usually look like this:

{"Key": Value} or {"Mongo Query Name": Stuff to Do}

One important thing to understand is some terminology related to the hierarchy of MongoDB. Don't be afraid of the word hierarchy; you already deal with hierarchies in everything you do:

So a database houses collections, which house documents, and documents have various properties that they all share. It's the same thing as how an Excel file houses tabs, which hold rows of data, where each row has various properties associated to it in each column.

Ok, now let's do some examples. You'll need to do the following:

  1. Open an instance of MongoShell (Details Here)
  2. Open Robo3T (Details Here)

Important Queries

You can pick and choose any query to use, but the one that will be pretty much necessary for the average engineering task is aggregate:

aggregate

The aggregate function executes a list of queries one after the other (this list is known as an aggregation pipeline). So let's break that down with what we already know:

Queries = JSON objects
Queries = Programmatic Sieves
Programmatic Sieves = Filters for us to get to the #DataWeWant

Therefore, by the transitive property, the aggregate function takes in a list of JSONs and spits out the subset of data from the database that we care about, a.k.a. the #DataWeWant.

Aggregate(database,List of JSONs) = #DataWeWant from database

When you use the Grasshopper Mongo tools, this is what is going on behind the scenes. Recalling that all queries are JSON objects, we should expect to see that the aggregate command takes in a set of JSON objects. To create this list, we will separate each query by a comma:

Aggregate(database,JSON1,JSON2,JSON3,...,JSON9999999) = #DataWeWant from database

Ok, so how do we actually execute this on our Mongo database? Well, you could use the MongoShell, or you could use the GrasshopperToolkit tools. But one of the easiest mediums in which to debug your code is Robo3T. This is because Robo3T has a skosh of IntelliSense that can help you understand where you have syntax errors in your aggregate command. Recall that if you execute an aggregate command in Robo3T and it works, you can feed the exact same list of queries into Grasshopper using the Mongo Components and it will yield the same result. Therefore, Robo3T is your best friend when you are just getting started with Mongo.

Let's set up our aggregate function. First, we open a collection:

As you can see, when I open my collection, the default command Robo3T throws in is the find() command. If we want to play with aggregate commands, we need to remove this:

db.getCollection('Model A Forces Database').find({}) --------> db.getCollection('Model A Forces Database')

And in its place add the aggregate command:

db.getCollection('Model A Forces Database').aggregate([JSON1,JSON2])

As you can see, the syntax has a set of [] that enclose the JSON queries. In Robo3T, it will be most convenient to set up the aggregate command by adding some line breaks to make the list more human readable:
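Laid out that way, the command might look something like this (the two example stages are just previews of queries covered below, with the loadcase number and field name borrowed from later examples):

db.getCollection('Model A Forces Database').aggregate([
    { $match: { Loadcase: 511 } },    // JSON1 - covered under $match below
    { $project: { F_X: 1 } }          // JSON2 - covered under $project below
])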

In what follows, you will learn some common queries that will replace the JSON#'s in the picture above.

THING THAT WILL ANNOY YOU ALERT: To execute any command in Robo3T you must use the F5 key, not the ENTER key. This goes against everything you've ever known... so be prepared for a bit of cussing when, instead of Robo3T executing your code, it just drops you to the next line of the command line instead.

So here are some of the best queries to throw inside an aggregate function:

$addFields

If you wish to AUGMENT your data, i.e. add parameters to it, this is your query.

In the example above, you can see that we've been able to append a new property to all our documents in the collection. But most likely, you'll want the argument of $addFields to be another function which will add data to each document on a case-by-case basis:
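A minimal sketch of that kind of case-by-case $addFields might look like this (the field name isInBarList is an assumption, borrowed from the optimization examples at the end of this page):

{ $addFields: { isInBarList: { $in: ["$Bar_Number", [5353, 5454]] } } }    // isInBarList is an assumed name; true if this document's Bar_Number is 5353 or 5454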

So as you can see here, we're using a new function inside of the $addFields function called $in. The $in function can help us test whether a certain property is part of a provided list (here we're asking "For this specific document, does the property Bar_Number equal either 5353 or 5454?"). As you can see, this allows us to write specific criteria for each document, and in this case, we can use that information as the input to the next query in the aggregate (as we'll show in $match).

$match

If you wish to GRAB A SUBSET OF YOUR DATABASE based on certain criteria, i.e. filter all data items based on conditional criteria, this is your query.

Here's a simple example of this:
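In shell syntax, a simple $match like this might look roughly as follows (the loadcase number 511 is taken from the count discussion below; the screenshot may filter on something else):

db.getCollection('Model A Forces Database').aggregate([
    { $match: { Loadcase: 511 } }    // keep only documents reported for loadcase 511
])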

But you really cannot see what actually happened here, as our database has millions of documents and we can only see a few at a time. What would be good is to see how many database documents match this criteria, which is something that, if we're familiar with the project, we could probably calculate by hand. We can find this out by adding a little command at the end of the aggregate pipeline ( { $group: { _id: null, count: { $sum: 1 } } } ----- don't worry, you'll get this soon):
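In other words, the pipeline with the counting stage tacked on the end might look something like this (again assuming the loadcase 511 filter from above):

db.getCollection('Model A Forces Database').aggregate([
    { $match: { Loadcase: 511 } },                   // filter down to the documents we care about
    { $group: { _id: null, count: { $sum: 1 } } }    // collapse them into a single document holding the count
])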

50,658 documents out of the 28,873,446 total documents in the database (and in under 12 seconds to boot). Recall, that's enough data to max out 28 Excel tabs. I can back-track this number from first principles - I have a database full of information for 10,000ish bars. Each bar has 5 points where data is reported. Each point has 612 loadcases.

(10,000ish bars) x (5 points/bar) x (612 loadcases/point) = 30,000,000ish documents in the database

So it's no surprise my match returns roughly 50,000 documents because:

(10,000ish bars) x (5 points/bar) x (1 loadcase/point) = 50,000ish documents that match the criteria of having loadcase 511

We can always add to this as well:
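For instance, a second criterion can simply be added inside the same $match (the Bar_Number value here is an assumed example; the screenshot may add a different condition):

{ $match: { Loadcase: 511, Bar_Number: 5353 } }    // both conditions must hold; Bar_Number 5353 is an assumed example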

Obviously, the more layers we add, the less we get back (and it takes a bit longer too, but still, pretty fast).

We can even use the output of our $addFields as part of this:
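A sketch of that combination, assuming the isInBarList field from the $addFields example above:

db.getCollection('Model A Forces Database').aggregate([
    { $addFields: { isInBarList: { $in: ["$Bar_Number", [5353, 5454]] } } },    // tag each document (field name assumed)
    { $match: { isInBarList: true } }                                           // keep only the tagged ones
])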

$project

If you wish to GET ONLY CERTAIN ASPECTS OF YOUR DATA, but for all of your database, i.e. I have a million items that all have properties X, Y, and Z, but I only care about the X and Y values, this is your query.

In general, the syntax of this command will look like this:

{$project: {xxx:1,yyy:1}}

If the database documents held a property called zzz, this aspect of the document would be dropped for any subsequent aggregate commands. For instance, in our forces database, look what happens if we apply this command:

{$project: {F_X: 1}}

Original DataStructure:

DataStructure After $project:

If you only care about part of your database for an aggregation pipeline, using $project can not only help you keep heavy computation at bay by letting Mongo carry along only the data that you care about, but also (and more importantly), it can make the calculations and operations you take care of in Mongo more human readable and verifiable:
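A sketch of the kind of $project meant here, assuming each document also carries an Area property (the field name Area is an assumption based on the description below):

{ $project: { Bar_Number: 1, Loadcase: 1, F_X: 1, Area: 1, Stress: { $divide: ["$F_X", "$Area"] } } }    // keep the inputs ('Area' assumed to exist) and compute the result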

Here, I want to find Stress = F_X/Area. As you can see, by using $project, I can make my results show not only the result 'Stress', but also the numbers that made that result. This is a fundamental habit we must promote in BH - clear, visual back-tracing of results built into all of our workflows.

$group

If you wish to TRANSFORM THE STRUCTURE OF YOUR DATA, i.e. instead of having 30,000,000 documents with one value per document from 10,000 bars, having 10,000 documents with each document having 3,000 pieces of data per item, this is your query.

So as can be seen here, the anatomy of a $group function has two main parts - what you want your new data structure to be governed by, and what data you want to be in that data structure. In this example, we do the former by setting the _id of each document to its Bar_Number. We do the latter by utilizing the $push operator to push the variable called F_X. Have you noticed yet that when you call a document's property as a variable it needs to be in quotes and preceded by the dollar sign?
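A minimal sketch of the $group described above:

db.getCollection('Model A Forces Database').aggregate([
    { $group: { _id: "$Bar_Number", F_X: { $push: "$F_X" } } }    // one document per bar, holding a list of all its F_X values
])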

Common Robo3T Mongo Errors

If you see this, don't despair - this is an easy one. There is a set time limit that Robo3T has to execute a query. If it takes too long, it gets cut off. All you have to do is override the default time limit by clicking "Change Shell Timeout" and you'll be good.

Again, this is another easy one. Mongo's aggregation stages can only use a limited amount of RAM by default. You can override this by adding {allowDiskUse:true,cursor:{}} to the end of your aggregate command, as such:
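For example, reusing the $group stage from above, the options object goes in as a second argument after the pipeline:

db.getCollection('Model A Forces Database').aggregate(
    [ { $group: { _id: "$Bar_Number", F_X: { $push: "$F_X" } } } ],
    { allowDiskUse: true, cursor: {} }    // lets Mongo spill to disk instead of hitting the RAM limit
)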

A Note On Optimization

Note that, much like in all other aspects of computational design, it's best to get the basics working first and then worry about making it better from there. Once you've gotten used to Mongo queries, you may want to start focusing on the speed at which your queries return useful data to you. Note that Mongo does try to help you optimize your aggregation pipeline; however, there's nothing that will substitute some good old-fashioned logic and testing in Robo3T.

For instance, in the example we just went through, what if I wanted to extract any document that had a Bar_Number of 4539 or 5459 and a Loadcase of 2 or 4? There's plenty of ways to skin a cat, but some take longer than others:

Method 1) Completes in 70 seconds:

db.getCollection('Model A FULL').aggregate([
    {$addFields: {Filter: {$and: [{$in: ["$Bar_Number", [4539,5459]]}, {$in: ["$Loadcase", [2,4]]}]}}},
    {$match: {Filter: true}}
])

Method 2) Completes in 105 seconds:

db.getCollection('Model A FULL').aggregate([
    {$project: {Bar_Number: 1, Loadcase: 1, Force_Position: 1, F_X: 1, F_Y: 1, F_Z: 1, M_X: 1, M_Y: 1, M_Z: 1, __Key__: 1,
                Filter: {$and: [{$in: ["$Bar_Number", [4539,5459]]}, {$in: ["$Loadcase", [2,4]]}]}}},
    {$match: {Filter: true}}
])

Method 3) Completes in 130 seconds:

db.getCollection('Model A FULL').aggregate([
    {$project: {Bar_Number: 1, Loadcase: 1, Force_Position: 1, F_X: 1, F_Y: 1, F_Z: 1, M_X: 1, M_Y: 1, M_Z: 1, __Key__: 1,
                isInBarList: {$in: ["$Bar_Number", [4539,5459]]},
                isInLoadCaseList: {$in: ["$Loadcase", [2,4]]}}},
    {$match: {isInBarList: true}},
    {$match: {isInLoadCaseList: true}}
])

As you can see, by using different methods to query the database, we can change our query response time by almost a factor of 2. This should come as no surprise; if you had to explain all of these in English, here's what it'd look like:

Method 1 is by far the most direct method: 'do you have both of these conditions?' and if so 'ok then come over here'.

Method 2 is a bit more indirect: 'push forward all the properties of the document, along with this new one, which is based on these 2 conditions' and if that new property is true then 'ok come over here'.

Method 3 is even more indirect: 'push forward all the properties of the document, along with two new ones, which are based on these 2 conditions', and if the first new property is true then 'ok come over here', and then if you've come over here and your second new property is true then 'ok come over here again'.

Even the English descriptions get complicated! In conclusion, Mongo queries are just like all other computer languages when it comes to efficiency - the more direct you are, the faster it will go!
