Skip to content

10. The select:where: queries

Oleksandr Zaytsev edited this page Jan 10, 2018 · 1 revision

SELECT is the most commonly used SQL statement that allows you to subset your data by applying filters to it using WHERE clause. The query language of DataFrame is designed to resemble SQL, so if you have some experience with relational databases, you should "feel like home".

The examples in this section will be using Iris dataset

df := DataFrame loadIris.

There are two things you need to specify in order to subset your data with select:where: message:

  1. What features (columns) do you want to get
  2. What conditions should the observations (rows) satisfy in order to be selected

First argument of the select:where: message should be an array of column names. They will not affect the selection of rows, but the resulting data frame will contain only these columns. Second argument should be a block with boolean conditions that will be applied to each row of data frame. Only those rows that make a block return true will be selected. In your conditions you will be referencing the features of your observations. For example, in Iris dataset you might want to select those flowers that belong to #setosa species and have the width of sepal equal to 3. To make queries more readable, DataFrame provides a querying language that allows you to specify the columns which you are using in your conditions as arguments of the where-block, and use these arguments in your conditions. So, for example, a block [ :species | species = #setosa ] passed to select:where: message will be translated to [ :row | (row atKey: #species) = #setosa ] and applied to every row of data frame. This means that all the arguments of the block you pass must correspond to the column names of your data frame.

Here is a query that selects petal_width and petal_length columns, and all the rows that satisfy the condition described in the previous paragraph

df select: #(petal_width petal_length)
   where: [ :species :sepal_width |
      species = #setosa and: sepal_width = 3 ].

If you rather want to select all the columns of a data frame, use the selectAllWhere: message. It works in a same way as SELECT * WHERE in SQL

df selectAllWhere: [ :species :sepal_width |
   species = #setosa and: sepal_width = 3 ].

This query will return a data frame will all 5 columns of Iris dataset and 6 rows that satisfy the given condition.

    |  sepal_length  sepal_width  petal_length  petal_width  species  
----+---------------------------------------------------------------
 2  |           4.9            3           1.4          0.2  setosa   
13  |           4.8            3           1.4          0.1  setosa   
14  |           4.3            3           1.1          0.1  setosa   
26  |             5            3           1.6          0.2  setosa   
39  |           4.4            3           1.3          0.2  setosa   
46  |           4.8            3           1.4          0.3  setosa

The previous query will return you only the petal_width and petal_length columns of this data frame. Try it yourself!

Tutorial

  1. Installation
  2. Creating DataSeries
Clone this wiki locally