Name | Description |
---|---
RegisterTempTable | Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SqlContext that was used to create this DataFrame. |
Count | Number of rows in the DataFrame |
Show | Displays rows of the DataFrame in tabular form |
ShowSchema | Prints the schema information of the DataFrame |
Collect | Returns all rows in this DataFrame |
ToRDD | Converts the DataFrame to an RDD of Row |
ToJSON | Returns the content of the DataFrame as an RDD of JSON strings |
Explain | Prints the plans (logical and physical) to the console for debugging purposes |
Select | Selects a set of columns specified by column name or Column. df.Select("colA", df["colB"]) df.Select("*", df["colB"] + 10) |
Select | Selects a set of columns. This is a variant of `select` that can only select existing columns using column names (i.e. cannot construct expressions). df.Select("colA", "colB") |
SelectExpr | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expressions. df.SelectExpr("colA", "colB as newName", "abs(colC)") |
Where | Filters rows using the given condition |
Filter | Filters rows using the given condition |
GroupBy | Groups the DataFrame using the specified columns, so we can run aggregation on them. |
Rollup | Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. |
Cube | Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. |
Agg | Aggregates on the DataFrame for the given column-aggregate function mapping |
Join | Join with another DataFrame - Cartesian join |
Join | Join with another DataFrame - Inner equi-join using the given column name |
Join | Join with another DataFrame - Inner equi-join using the given column name |
Join | Join with another DataFrame, using the specified JoinType |
Intersect | Intersect with another DataFrame. This is equivalent to `INTERSECT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, intersect(self, other) |
UnionAll | Union with another DataFrame WITHOUT removing duplicated rows. This is equivalent to `UNION ALL` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, unionAll(self, other) |
Subtract | Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to `EXCEPT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, subtract(self, other) |
Drop | Returns a new DataFrame with a column dropped. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, drop(self, col) |
DropNa | Returns a new DataFrame omitting rows with null values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropna(self, how='any', thresh=None, subset=None) |
Na | Returns a DataFrameNaFunctions for working with missing data. |
FillNa | Replaces null values; alias for `na.fill()` |
DropDuplicates | Returns a new DataFrame with duplicate rows removed, considering only the subset of columns. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropDuplicates(self, subset=None) |
Replace``1 | Returns a new DataFrame replacing a value with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with other values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
RandomSplit | Randomly splits this DataFrame with the provided weights. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, randomSplit(self, weights, seed=None) |
Columns | Returns all column names as a list. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, columns(self) |
DTypes | Returns all column names and their data types. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dtypes(self) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
SortWithinPartitions | Returns a new DataFrame with each partition sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
SortWithinPartition | Returns a new DataFrame with each partition sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
Alias | Returns a new DataFrame with an alias set. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, alias(self, alias) |
WithColumn | Returns a new DataFrame by adding a column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumn(self, colName, col) |
WithColumnRenamed | Returns a new DataFrame by renaming an existing column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumnRenamed(self, existing, new) |
Corr | Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, corr(self, col1, col2, method=None) |
Cov | Calculates the sample covariance of two columns as a double value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, cov(self, col1, col2) |
FreqItems | Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473 (proposed by Karp, Schenker, and Papadimitriou). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, freqItems(self, cols, support=None) Note: This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. |
Crosstab | Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, crosstab(self, col1, col2) |
Describe | Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns. |
Limit | Returns a new DataFrame by taking the first `n` rows. The difference between this function and `head` is that `head` returns an array while `limit` returns a new DataFrame. |
Head | Returns the first `n` rows. |
First | Returns the first row. |
Take | Returns the first `n` rows in the DataFrame. |
Distinct | Returns a new DataFrame that contains only the unique rows from this DataFrame. |
Coalesce | Returns a new DataFrame that has exactly `numPartitions` partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency; e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. |
Persist | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Unpersist | Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
Cache | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Repartition | Returns a new DataFrame that has exactly `numPartitions` partitions. |
Repartition | Returns a new DataFrame partitioned by the given partitioning columns into `numPartitions` partitions. The resulting DataFrame is hash partitioned. `numPartitions` is optional; if not specified, the current number of partitions is kept. |
Repartition | Returns a new DataFrame partitioned by the given partitioning columns into `numPartitions` partitions. The resulting DataFrame is hash partitioned. `numPartitions` is optional; if not specified, the current number of partitions is kept. |
Sample | Returns a new DataFrame by sampling a fraction of rows. |
FlatMap``1 | Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results. |
Map``1 | Returns a new RDD by applying a function to all rows of this DataFrame. |
MapPartitions``1 | Returns a new RDD by applying a function to each partition of this DataFrame. |
ForeachPartition | Applies a function f to each partition of this DataFrame. |
Foreach | Applies a function f to all rows. |
Write | Interface for saving the content of the DataFrame out into external storage. |
SaveAsParquetFile | Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the `parquetFile` function in SQLContext. |
InsertInto | Adds the rows from this DataFrame to the specified table, optionally overwriting the existing data. |
SaveAsTable | Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options. Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an `insertInto`. Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive until SPARK-7550 is resolved. |
Save | Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options. |
| Returns a new DataFrame that drops rows containing any null values. |
| Returns a new DataFrame that drops rows containing null values. If `how` is "any", then drop rows containing any null values. If `how` is "all", then drop rows only if every column is null for that row. |
| Returns a new DataFrame that drops rows containing null values in the specified columns. If `how` is "any", then drop rows containing any null values in the specified columns. If `how` is "all", then drop rows only if every specified column is null for that row. |
| Returns a new DataFrame that drops rows containing any null values in the specified columns. |
| Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values. |
| Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values in the specified columns. |
| Returns a new DataFrame that replaces null values in numeric columns with `value`. |
| Returns a new DataFrame that replaces null values in string columns with `value`. |
| Returns a new DataFrame that replaces null values in specified numeric columns. If a specified column is not a numeric column, it is ignored. |
| Returns a new DataFrame that replaces null values in specified string columns. If a specified column is not a string column, it is ignored. |
| Returns a new DataFrame that replaces null values. The keys of the map are column names and the values are the replacement values, which must be of one of the following types: `Integer`, `Long`, `Float`, `Double`, `String`. For example, the following replaces null values in column "A" with string "unknown", and null values in column "B" with numeric value 1.0. import com.google.common.collect.ImmutableMap; df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0)); |
| Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height". df.replace("height", ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name". df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed")); // Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns. df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed")); |
| Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height" and "weight". df.replace(new String[] {"height", "weight"}, ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "firstname" and "lastname". df.replace(new String[] {"firstname", "lastname"}, ImmutableMap.of("UNKNOWN", "unnamed")); |
| Specifies the input data source format. |
| Specifies the input schema. Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading. |
| Adds an input option for the underlying data source. |
| Adds input options for the underlying data source. |
| Loads input in as a DataFrame, for data sources that require a path (e.g. data backed by a local or distributed file system). |
| Loads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores). |
| Constructs a DataFrame representing the database table named `table`, accessible via JDBC URL `url` and connection properties. |
| Constructs a DataFrame representing the database table named `table`, accessible via JDBC URL `url`. Partitions of the table will be retrieved in parallel based on the parameters passed to this function. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
| Constructs a DataFrame representing the database table named `table`, accessible via JDBC URL `url`, using connection properties. The `predicates` parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
| Loads a JSON file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. |
| Loads a Parquet file, returning the result as a DataFrame. This function returns an empty DataFrame if no paths are passed in. |
| Loads an Avro file and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. |
| Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
| Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
| Specifies the underlying output data source. Built-in options include "parquet", "json", etc. |
| Adds an output option for the underlying data source. |
| Adds output options for the underlying data source. |
| Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. This is only applicable for Parquet at the moment. |
| Saves the content of the DataFrame at the specified path. |
| Saves the content of the DataFrame as the specified table. |
| Inserts the content of the DataFrame into the specified table. It requires that the schema of the DataFrame is the same as the schema of the table. Because it inserts data into an existing table, format or options will be ignored. |
| Saves the content of the DataFrame as the specified table. If the table already exists, the behavior of this function depends on the save mode, specified by the `mode` function (defaults to throwing an exception). When `mode` is `Overwrite`, the schema of the DataFrame does not need to be the same as that of the existing table. When `mode` is `Append`, the schema of the DataFrame needs to be the same as that of the existing table, and format or options will be ignored. |
| Saves the content of the DataFrame to an external database table via JDBC. If the table already exists in the external database, the behavior of this function depends on the save mode, specified by the `mode` function (defaults to throwing an exception). Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
| Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("json").Save(path) |
| Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to: Format("parquet").Save(path) |
| Saves the content of the DataFrame in AVRO format at the specified path. This is equivalent to: Format("com.databricks.spark.avro").Save(path) |
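The table above is reference-only, so the following sketches show how a few of the listed methods fit together in C#. They are illustrative fragments, not verbatim signatures: the namespace, the `Explore` helper, the `sqlContext` and `people` names, the column names, and the exact overloads are assumptions based on the descriptions above.

```csharp
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Sql;   // namespace assumed for SqlContext/DataFrame

static class DataFrameTour
{
    // 'people' is assumed to have columns "name", "age" and "dept".
    public static void Explore(SqlContext sqlContext, DataFrame people)
    {
        DataFrame adults = people.Filter("age > 21");      // Filter/Where: keep rows matching a condition
        DataFrame names  = adults.Select("name", "dept");  // Select: existing columns by name
        DataFrame oldest = people.Agg(new Dictionary<string, string> { { "age", "max" } }); // Agg: column -> aggregate function
        long rowCount    = people.Count();                 // Count: number of rows
        names.Show();                                       // Show: tabular preview

        people.RegisterTempTable("people");                 // lifetime tied to the owning SqlContext
        DataFrame byDept = sqlContext.Sql("SELECT dept, COUNT(*) AS n FROM people GROUP BY dept"); // Sql(...) entry point assumed
        byDept.Show();
    }
}
```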
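A sketch of the join and set-style operations (Join, UnionAll, Intersect, Subtract, DropDuplicates), continuing the method body above; `orders`, `archive`, and the column `"personId"` are placeholders.

```csharp
// Continues the sketch above; 'people', 'orders' and 'archive' are assumed DataFrames.
DataFrame joined   = people.Join(orders, "personId");  // inner equi-join using the given column name
DataFrame combined = orders.UnionAll(archive);         // UNION ALL: duplicate rows are kept
DataFrame common   = orders.Intersect(archive);        // INTERSECT
DataFrame onlyNew  = orders.Subtract(archive);         // EXCEPT: rows in 'orders' but not in 'archive'
DataFrame deduped  = combined.DropDuplicates();        // drop duplicate rows
```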
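A sketch of the missing-data helpers (DropNa, FillNa, Replace). The exact overloads, for example FillNa accepting a plain replacement value, are assumptions drawn from the pyspark signatures referenced in the table.

```csharp
// Continues the sketch above; overloads are assumptions.
DataFrame noNulls   = people.DropNa();                  // drop rows containing null values
DataFrame zeroFill  = people.FillNa(0);                 // alias for the Na fill: replace nulls with 0
DataFrame relabeled = people.Replace("n/a", "unknown"); // Replace``1: substitute one value for another
```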
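Finally, a sketch of the read/write path. The reader and writer rows in the table lost their method names, so the names used here (Read, Format, Option, Load, Write, Mode, PartitionBy, Save) mirror Spark's DataFrameReader/DataFrameWriter and should be treated as assumptions; the paths are placeholders.

```csharp
// Continues the sketch above; reader/writer method names are assumptions.
DataFrame events = sqlContext.Read()
    .Format("json")                          // input data source format
    .Option("samplingRatio", "1.0")          // input option passed to the underlying source
    .Load("hdfs:///data/events.json");       // path-based load

events.Write()
    .Format("parquet")                       // output data source
    .Mode(SaveMode.Append)                   // behavior when the target already exists
    .PartitionBy("date")                     // Hive-style layout; Parquet only, per the table
    .Save("hdfs:///data/events_parquet");    // save at the given path
```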