MongoDB Queries in a Nutshell
This wiki page aims to get you the 'low hanging fruit' with MongoDB. With the following commands, you can do 80% of what a typical BH engineer will need to do with their data. If you want further information, please see MongoDB's online documentation: the MongoDB Reference Manual.
As you will see, MongoDB's query language is fairly easy to learn. Whilst it isn't extremely human readable, it is certainly more straightforward than things like nested loops. All of MongoDB's language is based on JSON syntax, so it is worth looking a bit at that syntax before jumping into Mongo queries. Simply speaking, a piece of JSON will usually look like this:
{"Key": Value} or {"Mongo Query Name": Stuff to Do}
Ok, now let's do some examples. You'll need to do the following:
- Open an instance of MongoShell (Details Here)
- Open Robo3T (Details Here)
You can pick and choose any query to use, but one that will be pretty much necessary for the average engineering task will be aggregate:
The aggregate function executes a list of queries one after the other. So let's break that down with what we already know:
Queries = JSON objects
Queries = Programmatic Sieves
Programmatic Sieves = Filters for us to get to the #DataWeWant
Therefore, by the transitive property, the aggregate function takes in a list of JSONs and spits out the subset of data from the database that we care about, a.k.a. the #DataWeWant.
Aggregate(database,List of JSONs) = #DataWeWant from database
When you use the Grasshopper Mongo tools, this is what is going on behind the scenes. Recalling that all queries are JSON objects, we should expect the aggregate command to take in a set of JSON objects. To create this list, we separate each query with a comma:
Aggregate(database,JSON1,JSON2,JSON3,...,JSON9999999) = #DataWeWant from database
Ok, so how do we actually execute this on our Mongo database? Well, you could use the MongoShell. Or you could use the GrasshopperToolkit tools. But one of the easiest mediums for debugging your code is Robo3T. This is because Robo3T has a skosh of IntelliSense that can help you understand where you have syntax errors in your aggregate command. Recall that if you execute an aggregate command in Robo3T and it works, you can feed the exact same list of queries into Grasshopper using the Mongo Components and it will yield the same result. Therefore, Robo3T is your best friend when you are just getting started with Mongo.
Let's set up our aggregate function. First, we open a collection:
As you can see, when I open my collection, the default command Robo3T throws in is the find() command. If we want to play with aggregate commands, we need to remove this:
db.getCollection('Model A Forces Database').find({}) --------> db.getCollection('Model A Forces Database')
And in its place add the aggregate command:
db.getCollection('Model A Forces Database').aggregate([JSON1,JSON2])
As you can see, the syntax has a set of [] that enclose the JSON queries. In Robo3T, it will be most convenient to set up the aggregate command by adding some line returns to make the list more human readable:
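Roughly, the set-up looks like the sketch below. The three stages are just illustrative placeholders for the queries covered in the following sections, and the field names (Loadcase, Source, F_X) are assumptions based on examples later in this page:

```
db.getCollection('Model A Forces Database').aggregate([
    { $match:     { Loadcase: 511 } },          // JSON1 - grab a subset of the documents
    { $addFields: { Source: "Model A" } },      // JSON2 - append a new property
    { $project:   { F_X: 1, Loadcase: 1 } }     // JSON3 - keep only the fields we care about
])
```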
In what follows, you will learn some common queries that will replace the placeholder JSONs in the sketch above.
THING THAT WILL ANNOY YOU ALERT: To execute any command in Robo3T you must use the F5 key, not the ENTER key. This goes against everything you've ever known.....so be prepared for a bit of cussing when, instead of Robo3T executing your code, it just moves to the next line of the command line instead.
So here are some of the best queries to throw inside an aggregate function:
$addFields: If you wish to AUGMENT your data, i.e. add parameters to it, this is your query.
In the example above, you can see that we've been able to append a new property to all our documents in the collection. But most likely, you'll want the argument of $addFields to be another function which will add data to each document on a case-by-case basis:
So as you can see here, we're using a new function inside of the $addFields function called $in. The $in function can help us test if a certain property is part of a provided list (here we're asking "For this specific document, does the property Bar_Number equal either 5353 or 5454?"). As you can see, this allows us to write specific criteria for each document, and in this case, we can use that information as the input to the next query in the aggregate (as we'll show in $match).
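A minimal sketch of that stage, assuming the new property is called Is_Bar_Of_Interest (the actual name used in the original example may differ):

```
db.getCollection('Model A Forces Database').aggregate([
    { $addFields: {
        // $in returns true when the value of Bar_Number appears in the supplied list
        Is_Bar_Of_Interest: { $in: ["$Bar_Number", [5353, 5454]] }
    } }
])
```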
$match: If you wish to GRAB A SUBSET OF YOUR DATABASE based on certain criteria, i.e. filter all data items based on a conditional criterion, this is your query.
Here's a simple example of this:
But you cannot really see what actually happened here, as our database has millions of documents and we can only see a few at a time. What would be good is to see how many database documents match this criteria, which, if we're familiar with the project, is something we could probably calculate by hand. We can find this out by adding a little command at the end of the aggregate pipeline ( { $group: { _id: null, count: { $sum: 1 } } } ----- don't worry, you'll get this soon):
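A minimal sketch of the counting pipeline described above, assuming the loadcase property is simply called Loadcase:

```
db.getCollection('Model A Forces Database').aggregate([
    { $match: { Loadcase: 511 } },                   // keep only the documents for loadcase 511
    { $group: { _id: null, count: { $sum: 1 } } }    // collapse them into a single count
])
```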
50,658 documents out of the 28,873,446 total documents in the database (and in under 12 seconds to boot). Recall, that's enough data to max out 28 Excel tabs. I can back-track this number from first principles - I have a database full of information for 10,000ish bars. Each bar has 5 points where data is reported. Each point has 612 loadcases.
(10,000ish bars) x (5 points/bar) x (612 loadcases/point) = 30,000,000ish results in database
so it's no surprise my match returns roughly 50,000 documents because:
(10,000ish bars) x (5 points/bar) x (1 loadcase/point) = 50,000ish documents that match the criteria of having loadcase 511.
We can always add to this as well:
Obviously, the more layers we add, the less we get back (and it takes a bit longer too, but still, pretty fast).
We can even use the output of our $addFields as part of this:
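A minimal sketch combining the two ideas, i.e. build a flag with $addFields and then filter on it with $match (field names are the same assumptions as above):

```
db.getCollection('Model A Forces Database').aggregate([
    { $addFields: { Is_Bar_Of_Interest: { $in: ["$Bar_Number", [5353, 5454]] } } },  // flag the bars we care about
    { $match: { Is_Bar_Of_Interest: true, Loadcase: 511 } },                          // keep only flagged bars for loadcase 511
    { $group: { _id: null, count: { $sum: 1 } } }                                     // count what survives
])
```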
$project: If you wish to GET ONLY CERTAIN ASPECTS OF YOUR DATA but for all of your database, i.e. I have a million items that all have properties X, Y, and Z, but I only care about the X and Y values, this is your query.
In general, the syntax of this command will look like this:
{$project: {xxx:1,yyy:1}}
If the database documents held a property called zzz, this aspect of the document would be dropped for any subsequent aggregate commands. For instance, in our forces database, look what happens if we apply this command:
{$project: {F_X: 1}}
Original DataStructure vs. DataStructure after $project:
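Roughly, the document shape changes like the sketch below (all field names and values other than F_X are illustrative assumptions):

```
// Original document (illustrative values):
//   { _id: ObjectId("..."), Bar_Number: 5353, Point: 1, Loadcase: 511, F_X: 12.4, F_Y: 3.1, F_Z: 0.2 }
//
// After { $project: { F_X: 1 } } - note that _id is kept by default:
//   { _id: ObjectId("..."), F_X: 12.4 }
db.getCollection('Model A Forces Database').aggregate([
    { $project: { F_X: 1 } }
])
```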
If you only care about part of your database for an aggregation pipeline, using $project can not only help you keep heavy computation at bay by letting Mongo carry along only the data that you care about, but also (and more importantly), it can make the calculations and operations you perform in Mongo more human readable and verifiable:
Here, I want to find Stress = F_X/Area. As you can see, by using $project, I can make my results show not only the result 'Stress', but also the numbers that made that result. This is a fundamental habit we must promote in BH - clear, visual back-tracing of results built into all of our workflows.
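A minimal sketch of that pipeline, assuming each document carries F_X and Area fields (Area is an assumed field name, it is not shown elsewhere on this page):

```
db.getCollection('Model A Forces Database').aggregate([
    { $project: {
        F_X: 1,                                   // keep the force that feeds the result
        Area: 1,                                  // keep the area that feeds the result
        Stress: { $divide: ["$F_X", "$Area"] }    // Stress = F_X / Area
    } }
])
```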
If you wish to TRANSFORM THE STRUCTURE OF YOUR DATA, i.e. instead of having 30,000,000 documents with one value per document from 10,000 bars, having 10,000 documents with each document having 3,000 pieces of data per item, this is your query.
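The stage that usually does this kind of restructuring is $group, typically paired with $push to collect values into an array. A minimal sketch, using the same assumed field names as above, that turns one document per result into one document per bar:

```
db.getCollection('Model A Forces Database').aggregate([
    { $group: {
        _id: "$Bar_Number",                                            // one output document per bar
        Results: { $push: { Loadcase: "$Loadcase", F_X: "$F_X" } }     // collect every result for that bar into an array
    } }
])
```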
If you see this error, don't despair - this is an easy one. There is a set time limit that Robo3T has to execute a query; if it takes too long, the query gets cut off. All you have to do is override the default time limit by clicking "Change Shell Timeout" and you'll be good.
Again, this is another easy one. Mongo can only access a small amount of your RAM by default. You can override this by adding {allowDiskUse:true,cursor:{}} to the end of your aggregate command, as such:
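A minimal sketch of where that option goes, namely as a second argument after the pipeline array (the pipeline itself is just an illustrative example):

```
db.getCollection('Model A Forces Database').aggregate([
    { $match: { Loadcase: 511 } },
    { $group: { _id: "$Bar_Number", count: { $sum: 1 } } }
], { allowDiskUse: true, cursor: {} })
```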