###Microsoft.Spark.CSharp.Core.Accumulator ####Summary A shared variable that can be accumulated, i.e., has a commutative and associative "add"
operation. Worker tasks on a Spark cluster can add values to an Accumulator with the +=
operator, but only the driver program is allowed to access its value, using Value.
Updates from the workers get propagated automatically to the driver program.
While SparkContext supports accumulators for primitive data types like int and
float, users can also define accumulators for custom types by providing a custom
AccumulatorParam object. Refer to the doctest of this module for an example.
See python implementation in accumulators.py, worker.py, PythonRDD.scala
####Methods
Name | Description |
---|---|
Add | Adds a term to this accumulator's value |
op_Addition | The += operator; adds a term to this accumulator's value |
ToString | Creates and returns a string representation of the current accumulator |
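
A minimal usage sketch (assuming an already-created SparkContext `sc`; the accumulator type and values are illustrative):

```csharp
// Sketch: the driver creates the accumulator, workers add to it with +=,
// and only the driver reads Value after the action completes.
var acc = sc.Accumulator<int>(0);
var rdd = sc.Parallelize(new[] { 1, 2, 3, 4 }, 2);
rdd.Foreach(x => acc += x);        // runs on the workers
Console.WriteLine(acc.Value);      // driver-side read; prints 10
```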
###Microsoft.Spark.CSharp.Core.Accumulator`1 ####Summary
A generic version of Accumulator where the element type is specified by the driver program.
The type of element in the accumulator.
####Methods
Name | Description |
---|---|
Add | Adds a term to this accumulator's value |
op_Addition | The += operator; adds a term to this accumulator's value |
ToString | Creates and returns a string representation of the current accumulator |
###Microsoft.Spark.CSharp.Core.AccumulatorParam`1 ####Summary
An AccumulatorParam that uses the + operators to add values. Designed for simple types
such as integers, floats, and lists. Requires the zero value for the underlying type
as a parameter.
####Methods
Name | Description |
---|---|
Zero | Provide a "zero value" for the type |
AddInPlace | Add two values of the accumulator's data type, returning a new value; |
###Microsoft.Spark.CSharp.Core.AccumulatorServer ####Summary
A simple TCP server that intercepts shutdown() in order to interrupt
our continuous polling on the handler.
###Microsoft.Spark.CSharp.Core.Broadcast ####Summary
A broadcast variable created with SparkContext.Broadcast().
Access its value through Value.
var b = sc.Broadcast(new int[] {1, 2, 3, 4, 5})
b.Value
[1, 2, 3, 4, 5]
sc.Parallelize(new int[] {0, 0}).FlatMap(x => b.Value).Collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
b.Unpersist()
See python implementation in broadcast.py, worker.py, PythonRDD.scala
####Methods
Name | Description |
---|---|
Unpersist | Delete cached copies of this broadcast on the executors. |
###Microsoft.Spark.CSharp.Core.Broadcast`1 ####Summary
A generic version of Broadcast where the element type can be specified.
The type of element in Broadcast
####Methods
Name | Description |
---|---|
Unpersist | Delete cached copies of this broadcast on the executors. |
###Microsoft.Spark.CSharp.Core.CSharpWorkerFunc ####Summary
Function that will be executed in CSharpWorker
####Methods
Name | Description |
---|---|
Chain | Used to chain functions |
###Microsoft.Spark.CSharp.Core.Option`1 ####Summary
Container for an optional value of type T. If the value is present, Option.IsDefined is TRUE and GetValue() returns the value.
If the value is absent, Option.IsDefined is FALSE and calling GetValue() throws an exception.
####Methods
Name | Description |
---|---|
GetValue | Returns the value of the option if Option.IsDefined is TRUE; otherwise, throws an exception. |
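
A hedged sketch of the intended usage pattern (Option values are produced by APIs such as LeftOuterJoin further down this page; the helper name is illustrative):

```csharp
// Sketch: check IsDefined before calling GetValue so the absent case does not throw.
static int ValueOrDefault(Option<int> maybe, int fallback)
{
    return maybe.IsDefined ? maybe.GetValue() : fallback;
}
```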
###Microsoft.Spark.CSharp.Core.Partitioner ####Summary
An object that defines how the elements in a key-value pair RDD are partitioned by key.
Maps each key to a partition ID, from 0 to "numPartitions - 1".
####Methods
Name | Description |
---|---|
Equals | Determines whether the specified object is equal to the current object. |
GetHashCode | Serves as the default hash function. |
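
As a rough illustration of the key-to-partition mapping described above (this is not the Partitioner implementation itself), a hash-based scheme might look like:

```csharp
// Illustrative only: map a key to a partition ID in [0, numPartitions - 1].
static int PartitionFor(object key, int numPartitions)
{
    if (key == null) return 0;                                   // send nulls to partition 0
    return (key.GetHashCode() & int.MaxValue) % numPartitions;   // clear sign bit, then mod
}
```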
###Microsoft.Spark.CSharp.Core.RDDCollector ####Summary
Used for collect operation on RDD
###Microsoft.Spark.CSharp.Core.DoubleRDDFunctions ####Summary
Extra functions available on RDDs of Doubles through an implicit conversion.
####Methods
Name | Description |
---|---|
Sum | Add up the elements in this RDD. sc.Parallelize(new double[] {1.0, 2.0, 3.0}).Sum() 6.0 |
Stats | Return a object that captures the mean, variance and count of the RDD's elements in one operation. |
Histogram | Compute a histogram using the provided buckets. The buckets are all open to the right except for the last, which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], i.e. 1<=x<10, 10<=x<20, 20<=x<=50. On the input of 1 and 50 we would have a histogram of 1,0,1. If your buckets are evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) insertion to O(1) per element (where n = # buckets). Buckets must be sorted, must not contain any duplicates, and must have at least two elements. If `buckets` is a number, it generates buckets that are evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1. If the RDD contains infinity or NaN, an exception is thrown. If the elements in the RDD do not vary (max == min), a single bucket is always returned. Returns a tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60]) # evenly spaced buckets ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2]) |
Mean | Compute the mean of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Mean() 2.0 |
Variance | Compute the variance of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Variance() 0.666... |
Stdev | Compute the standard deviation of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Stdev() 0.816... |
SampleStdev | Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleStdev() 1.0 |
SampleVariance | Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleVariance() 1.0 |
###Microsoft.Spark.CSharp.Core.IRDDCollector ####Summary
Interface for collect operation on RDD
###Microsoft.Spark.CSharp.Core.OrderedRDDFunctions ####Summary
Extra functions available on RDDs of (key, value) pairs where the key is sortable through
a function to sort the key.
####Methods
Name | Description |
---|---|
SortByKey``2 | Sorts this RDD, which is assumed to consist of Tuple pairs. |
SortByKey``3 | Sorts this RDD, which is assumed to consist of Tuples. If Item1 is of type string, the sort is case sensitive. |
repartitionAndSortWithinPartitions``2 | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery. |
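
A small sketch of sorting a pair RDD by key (assuming an existing SparkContext `sc` and that the parameterless SortByKey overload sorts in ascending order):

```csharp
// Sketch: sort Tuple<string, int> pairs by their keys.
var pairs = sc.Parallelize(new[]
{
    new Tuple<string, int>("b", 2),
    new Tuple<string, int>("a", 1),
    new Tuple<string, int>("c", 3)
}, 2);
var sorted = pairs.SortByKey().Collect();   // [("a", 1), ("b", 2), ("c", 3)]
```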
###Microsoft.Spark.CSharp.Core.PairRDDFunctions ####Summary
Operations only available on RDDs of Tuples (key-value pairs).
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
####Methods
Name | Description |
---|---|
CollectAsMap``2 | Return the key-value pairs in this RDD to the master as a dictionary. var m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).CollectAsMap() m[1] 2 m[3] 4 |
Keys``2 | Return an RDD with the keys of each tuple. >>> m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).Keys().Collect() [1, 3] |
Values``2 | Return an RDD with the values of each tuple. >>> m = sc.Parallelize(new[] { new Tuple<int, int>(1, 2), new Tuple<int, int>(3, 4) }, 1).Values().Collect() [2, 4] |
ReduceByKey``2 | Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .ReduceByKey((x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
ReduceByKeyLocally``2 | Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .ReduceByKeyLocally((x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
CountByKey``2 | Count the number of elements for each key, and return the result to the master as a dictionary. sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CountByKey() [('a', 2), ('b', 1)] |
Join``3 | Return an RDD containing all pairs of elements with matching keys in this RDD and the other RDD. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in the other RDD. Performs a hash join across the cluster. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2), new Tuple<string, int>("a", 3) }, 1); var m = l.Join(r, 2).Collect(); [('a', (1, 2)), ('a', (1, 3))] |
LeftOuterJoin``3 | Perform a left outer join of this RDD and the other RDD. For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, Option)) for w in the other RDD, where Option.IsDefined is TRUE, or the pair (k, (v, Option)) if no elements in the other RDD have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2) }, 1); var m = l.LeftOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, Option))] * Option.IsDefined = FALSE |
RightOuterJoin``3 | Perform a right outer join of this RDD and the other RDD. For each element (k, w) in the other RDD, the resulting RDD will either contain all pairs (k, (Option, w)) for v in this RDD, where Option.IsDefined is TRUE, or the pair (k, (Option, w)) if no elements in this RDD have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 2) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var m = l.RightOuterJoin(r).Collect(); [('a', (2, 1)), ('b', (Option, 4))] * Option.IsDefined = FALSE |
FullOuterJoin``3 | Perform a full outer join of this RDD and the other RDD. For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, w)) for w in the other RDD, or the pair (k, (v, None)) if no elements in the other RDD have key k. Similarly, for each element (k, w) in the other RDD, the resulting RDD will either contain all pairs (k, (v, w)) for v in this RDD, or the pair (k, (None, w)) if no elements in this RDD have key k. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new Tuple<string, int>("a", 2), new Tuple<string, int>("c", 8) }, 1); var m = l.FullOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))] |
PartitionBy``2 | Return a copy of the RDD partitioned using the specified partitioner. sc.Parallelize(new[] { 1, 2, 3, 4, 2, 4, 1 }, 1).Map(x => new Tuple<int, int>(x, x)).PartitionBy(3).Glom().Collect() |
CombineByKey``3 | # TODO: add control over map-side aggregation Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. Note that V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users provide three functions: - createCombiner, which turns a V into a C (e.g., creates a one-element list) - mergeValue, to merge a V into a C (e.g., adds it to the end of a list) - mergeCombiners, to combine two C's into a single one. In addition, users can control the partitioning of the output RDD. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', '11'), ('b', '1')] |
AggregateByKey``3 | Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .AggregateByKey(() => 0, (x, y) => x + y, (x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
FoldByKey``2 | Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication). sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .FoldByKey(() => 0, (x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
GroupByKey``2 | Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using ReduceByKey or AggregateByKey will provide much better performance. sc.Parallelize( new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 1), new Tuple<string, int>("a", 1) }, 2) .GroupByKey().MapValues(l => string.Join(" ", l)).Collect() [('a', '1 1'), ('b', '1')] |
MapValues``3 | Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning. sc.Parallelize( new[] { new Tuple<string, string[]>("a", new[]{"apple", "banana", "lemon"}), new Tuple<string, string[]>("b", new[]{"grapes"}) }, 2) .MapValues(x => x.Length).Collect() [('a', 3), ('b', 1)] |
FlatMapValues``3 | Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning. x = sc.Parallelize( new[] { new Tuple<string, string[]>("a", new[]{"x", "y", "z"}), new Tuple<string, string[]>("b", new[]{"p", "r"}) }, 2) .FlatMapValues(x => x).Collect() [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')] |
MapPartitionsWithIndex``5 | Explicitly converts Tuple<K, V> to Tuple<K, dynamic> since they are incompatible types, unlike V to dynamic |
GroupWith``3 | For each key k in this RDD or the other RDD, return a resulting RDD that contains a tuple with the list of values for that key in this RDD as well as in the other RDD. var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); x.GroupWith(y).Collect(); [('a', ([1], [2])), ('b', ([4], []))] |
GroupWith``4 | var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 5), new Tuple<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); x.GroupWith(y, z).Collect(); |
GroupWith``5 | var x = sc.Parallelize(new[] { new Tuple<string, int>("a", 5), new Tuple<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int>("a", 1), new Tuple<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new Tuple<string, int>("a", 2) }, 1); var w = sc.Parallelize(new[] { new Tuple<string, int>("b", 42) }, 1); var m = x.GroupWith(y, z, w).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3) + " : " + string.Join(" ", l.Item4)).Collect(); |
SubtractByKey``3 | Return each (key, value) pair in this RDD that has no pair with matching key in . var x = sc.Parallelize(new[] { new Tuple<string, int?>("a", 1), new Tuple<string, int?>("b", 4), new Tuple<string, int?>("b", 5), new Tuple<string, int?>("a", 2) }, 2); var y = sc.Parallelize(new[] { new Tuple<string, int?>("a", 3), new Tuple<string, int?>("c", null) }, 2); x.SubtractByKey(y).Collect(); [('b', 4), ('b', 5)] |
Lookup``2 | Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. >>> var rdd = sc.Parallelize(Enumerable.Range(0, 1000).Zip(Enumerable.Range(0, 1000), (x, y) => new Tuple<int, int>(x, y)), 10) >>> rdd.Lookup(42) [42] |
SaveAsNewAPIHadoopDataset``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are converted for output using either user specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter. |
SaveAsNewAPIHadoopFile``2 | |
SaveAsHadoopDataset``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Keys/values are converted for output using either user specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter. |
SaveAsHadoopFile``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Key and value types will be inferred if not specified. Keys and values are converted for output using either user specified converters or org.apache.spark.api.python.JavaToWritableConverter. The is applied on top of the base Hadoop conf associated with the SparkContext of this RDD to create a merged Hadoop MapReduce job configuration for saving the data. |
SaveAsSequenceFile``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the org.apache.hadoop.io.Writable types that we convert from the RDD's key and value types. The mechanism is as follows: 1. Pyrolite is used to convert pickled Python RDD into RDD of Java objects. 2. Keys and values of this Java RDD are converted to Writables and written out. |
NullIfEmpty``1 | Converts a collection to a list where the element type is Option(T) type. If the collection is empty, just returns the empty list. |
###Microsoft.Spark.CSharp.Core.PipelinedRDD`1 ####Summary
Wraps C#-based transformations so that they can be executed within a single stage. It pipelines the C# transformations
and avoids unnecessary serialization/deserialization of data between the JVM and CLR while executing them.
####Methods
Name | Description |
---|---|
MapPartitionsWithIndex``1 | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
###Microsoft.Spark.CSharp.Core.PriorityQueue`1 ####Summary
A bounded priority queue implemented with a max binary heap.
Construction steps:
1. Build a max heap of the first k elements.
2. For each element after the kth element, compare it with the root of the max heap:
a. If the element is less than the root, replace the root with this element and heapify.
b. Otherwise, ignore it.
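
The steps above amount to a standard bounded top-k pattern. A stand-alone sketch (not the Mobius class itself), using .NET's built-in PriorityQueue with negated priorities so the root is always the current maximum:

```csharp
// Illustrative only: keep the k smallest values, mirroring steps 1-2 above.
static IEnumerable<int> KSmallest(IEnumerable<int> items, int k)
{
    var heap = new PriorityQueue<int, int>();   // .NET 6+; priority = -value gives max-heap behavior
    foreach (var x in items)
    {
        if (heap.Count < k)
        {
            heap.Enqueue(x, -x);                // step 1: fill the heap with the first k elements
        }
        else if (x < heap.Peek())               // step 2: compare with the root (current max)
        {
            heap.Dequeue();                     // 2a: replace the root and re-heapify
            heap.Enqueue(x, -x);
        }                                       // 2b: otherwise ignore the element
    }
    while (heap.Count > 0) yield return heap.Dequeue();
}
```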
####Methods
Name | Description |
---|---|
Offer | Inserts the specified element into this priority queue. |
###Microsoft.Spark.CSharp.Core.Profiler ####Summary
A class that represents a profiler
###Microsoft.Spark.CSharp.Core.RDD`1 ####Summary
Represents a Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Type of the RDD
####Methods
Name | Description |
---|---|
Cache | Persist this RDD with the default storage level. |
Persist | Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, the default storage level is used. sc.Parallelize(new string[] {"b", "a", "c"}).Persist().isCached true |
Unpersist | Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. |
Checkpoint | Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SetCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it to a file will require recomputation. |
GetNumPartitions | Returns the number of partitions of this RDD. |
Map``1 | Return a new RDD by applying a function to each element of this RDD. sc.Parallelize(new string[]{"b", "a", "c"}, 1).Map(x => new Tuple<string, int>(x, 1)).Collect() [('a', 1), ('b', 1), ('c', 1)] |
FlatMap``1 | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. sc.Parallelize(new int[] {2, 3, 4}, 1).FlatMap(x => Enumerable.Range(1, x - 1)).Collect() [1, 1, 1, 2, 2, 3] |
MapPartitions``1 | Return a new RDD by applying a function to each partition of this RDD. sc.Parallelize(new int[] {1, 2, 3, 4}, 2).MapPartitions(iter => new[] { iter.Sum(x => (decimal?)x) }).Collect() [3, 7] |
MapPartitionsWithIndex``1 | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. sc.Parallelize(new int[]{1, 2, 3, 4}, 4).MapPartitionsWithIndex<double>((pid, iter) => (double)pid).Sum() 6 |
Filter | Return a new RDD containing only the elements that satisfy a predicate. sc.Parallelize(new int[]{1, 2, 3, 4, 5}, 1).Filter(x => x % 2 == 0).Collect() [2, 4] |
Distinct | Return a new RDD containing the distinct elements in this RDD. >>> sc.Parallelize(new int[] {1, 1, 2, 3}, 1).Distinct().Collect() [1, 2, 3] |
Sample | Return a sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 100), 4) 6 <= rdd.Sample(false, 0.1, 81).Count() <= 14 true |
RandomSplit | Randomly splits this RDD with the provided weights. var rdd = sc.Parallelize(Enumerable.Range(0, 500), 1) var rdds = rdd.RandomSplit(new double[] {2, 3}, 17) 150 < rdds[0].Count() < 250 250 < rdds[1].Count() < 350 |
TakeSample | Return a fixed-size sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 10), 2) rdd.TakeSample(true, 20, 1).Length 20 rdd.TakeSample(false, 5, 2).Length 5 rdd.TakeSample(false, 15, 3).Length 10 |
ComputeFractionForSampleSize | Returns a sampling rate that guarantees a sample of size >= sampleSizeLowerBound 99.99% of the time. How the sampling rate is determined: Let p = num / total, where num is the sample size and total is the total number of data points in the RDD. We're trying to compute q > p such that - when sampling with replacement, we're drawing each data point with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined). - when sampling without replacement, we're drawing each data point with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1-delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement. |
Union | Return the union of this RDD and another one. var rdd = sc.Parallelize(new int[] { 1, 1, 2, 3 }, 1) rdd.Union(rdd).Collect() [1, 1, 2, 3, 1, 1, 2, 3] |
Intersection | Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally. var rdd1 = sc.Parallelize(new int[] { 1, 10, 2, 3, 4, 5 }, 1) var rdd2 = sc.Parallelize(new int[] { 1, 6, 2, 3, 7, 8 }, 1) rdd1.Intersection(rdd2).Collect() [1, 2, 3] |
Glom | Return an RDD created by coalescing all elements within each partition into a list. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 2) rdd.Glom().Collect() [[1, 2], [3, 4]] |
Cartesian``1 | Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other. rdd = sc.Parallelize(new int[] { 1, 2 }, 1) rdd.Cartesian(rdd).Collect() [(1, 1), (1, 2), (2, 1), (2, 2)] |
GroupBy``1 | Return an RDD of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated. Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] or [[PairRDDFunctions.reduceByKey]] will provide much better performance. >>> var rdd = sc.Parallelize(new int[] { 1, 1, 2, 3, 5, 8 }, 1) >>> var result = rdd.GroupBy(x => x % 2).Collect() [(0, [2, 8]), (1, [1, 1, 3, 5])] |
Pipe | Return an RDD created by piping elements to a forked external process. >>> sc.Parallelize(new char[] { '1', '2', '3', '4' }, 1).Pipe("cat").Collect() [u'1', u'2', u'3', u'4'] |
Foreach | Applies a function to all elements of this RDD. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Foreach(x => Console.Write(x)) |
ForeachPartition | Applies a function to each partition of this RDD. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).ForeachPartition(iter => { foreach (var x in iter) Console.Write(x + " "); }) |
Collect | Return a list that contains all of the elements in this RDD. |
Reduce | Reduces the elements of this RDD using the specified commutative and associative binary operator. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Reduce((x, y) => x + y) 15 |
TreeReduce | Reduces the elements of this RDD in a multi-level tree pattern. var rdd = sc.Parallelize(new int[] { -5, -4, -3, -2, -1, 1, 2, 3, 4 }, 10) rdd.TreeReduce((x, y) => x + y) -5 rdd.TreeReduce((x, y) => x + y, 2) -5 rdd.TreeReduce((x, y) => x + y, 5) -5 rdd.TreeReduce((x, y) => x + y, 10) -5 |
Fold | Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Fold(0, (x, y) => x + y) 15 |
Aggregate``1 | Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two U's. sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).Aggregate(0, (x, y) => x + y, (x, y) => x + y) 10 |
TreeAggregate``1 | Aggregates the elements of this RDD in a multi-level tree pattern. sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).TreeAggregate(0, (x, y) => x + y, (x, y) => x + y) 10 |
Count | Return the number of elements in this RDD. |
CountByValue | Return the count of each unique value in this RDD as a dictionary of (value, count) pairs. sc.Parallelize(new int[] { 1, 2, 1, 2, 2 }, 2).CountByValue() [(1, 2), (2, 3)] |
Take | Take the first num elements of the RDD. It works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. Translated from the Scala implementation in RDD#take(). sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Cache().Take(2) [2, 3] sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Take(10) [2, 3, 4, 5, 6] sc.Parallelize(Enumerable.Range(0, 100), 100).Filter(x => x > 90).Take(3) [91, 92, 93] |
First | Return the first element in this RDD. >>> sc.Parallelize(new int[] { 2, 3, 4 }, 2).First() 2 |
IsEmpty | Returns true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least 1 partition. sc.Parallelize(new int[0], 1).IsEmpty() true sc.Parallelize(new int[] {1}).IsEmpty() false |
Subtract | Return each value in this RDD that is not contained in the other RDD. var x = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1) var y = sc.Parallelize(new int[] { 3 }, 1) x.Subtract(y).Collect() [1, 2, 4] |
KeyBy``1 | Creates tuples of the elements in this RDD by applying the specified function. sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).KeyBy(x => x * x).Collect() (1, 1), (4, 2), (9, 3), (16, 4) |
Repartition | Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `Coalesce`, which can avoid performing a shuffle. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4, 5, 6, 7 }, 4) rdd.Glom().Collect().Length 4 rdd.Repartition(2).Glom().Collect().Length 2 |
Coalesce | Return a new RDD that is reduced into `numPartitions` partitions. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Glom().Collect().Length 3 >>> sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Coalesce(1).Glom().Collect().Length 1 |
Zip``1 | Zips this RDD with another one, returning key-value pairs with the first element in each RDD, the second element in each RDD, and so on. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other). var x = sc.Parallelize(Enumerable.Range(0, 5), 1) var y = sc.Parallelize(Enumerable.Range(1000, 5), 1) x.Zip(y).Collect() [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)] |
ZipWithIndex | Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when this RDD contains more than one partition. sc.Parallelize(new string[] { "a", "b", "c", "d" }, 3).ZipWithIndex().Collect() [('a', 0), ('b', 1), ('c', 2), ('d', 3)] |
ZipWithUniqueId | Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from ZipWithIndex. sc.Parallelize(new string[] { "a", "b", "c", "d", "e" }, 3).ZipWithUniqueId().Collect() [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)] |
SetName | Assign a name to this RDD. >>> rdd1 = sc.parallelize([1, 2]) >>> rdd1.setName('RDD1').name() u'RDD1' |
ToDebugString | A description of this RDD and its recursive dependencies for debugging. |
GetStorageLevel | Get the RDD's current storage level. >>> rdd1 = sc.parallelize([1,2]) >>> rdd1.getStorageLevel() StorageLevel(False, False, False, False, 1) >>> print(rdd1.getStorageLevel()) Serialized 1x Replicated |
ToLocalIterator | Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD. sc.Parallelize(Enumerable.Range(0, 10), 1).ToLocalIterator() [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
RandomSampleWithRange | Internal method exposed for Random Splits in DataFrames. Samples an RDD given a probability range. |
###Microsoft.Spark.CSharp.Core.StringRDDFunctions ####Summary
Some useful utility functions for RDD{string}
####Methods
Name | Description |
---|---|
SaveAsTextFile | Save this RDD as a text file, using string representations of elements. |
###Microsoft.Spark.CSharp.Core.ComparableRDDFunctions ####Summary
Some useful utility functions for RDDs containing IComparable values.
####Methods
Name | Description |
---|---|
Max``1 | Find the maximum item in this RDD. sc.Parallelize(new double[] { 1.0, 5.0, 43.0, 10.0 }, 2).Max() 43.0 |
Min``1 | Find the minimum item in this RDD. sc.Parallelize(new double[] { 2.0, 5.0, 43.0, 10.0 }, 2).Min() 2.0 |
TakeOrdered``1 | Get the N elements from an RDD ordered in ascending order or as specified by the optional key function. sc.Parallelize(new int[] { 10, 1, 2, 9, 3, 4, 5, 6, 7 }, 2).TakeOrdered(6) [1, 2, 3, 4, 5, 6] |
Top``1 | Get the top N elements from a RDD. Note: It returns the list sorted in descending order. sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Top(3) [6, 5, 4] |
###Microsoft.Spark.CSharp.Core.SparkConf ####Summary
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
by the user. Spark does not support modifying the configuration at runtime.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
####Methods
Name | Description |
---|---|
SetMaster | The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster. |
SetAppName | Set a name for your application. Shown in the Spark web UI. |
SetSparkHome | Set the location where Spark is installed on worker nodes. |
Set | Set the value of a string config |
GetInt | Get an int parameter value, falling back to a default if not set |
Get | Get a string parameter value, falling back to a default if not set |
GetAll | Get all parameters as a list of pairs |
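
For example, a configuration might be built as in the sketch below (assuming the setters return the SparkConf so calls can be chained, as in the Scala API; the values are illustrative):

```csharp
// Sketch: build a SparkConf to pass to a SparkContext.
var conf = new SparkConf()
    .SetMaster("local[4]")                    // run locally with 4 cores
    .SetAppName("MobiusConfExample")          // shown in the Spark web UI
    .Set("spark.executor.memory", "1g");      // arbitrary string config
Console.WriteLine(conf.Get("spark.executor.memory", "512m"));   // "1g"
```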
###Microsoft.Spark.CSharp.Core.SparkContext ####Summary
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create RDDs, accumulators
and broadcast variables on that cluster.
####Methods
Name | Description |
---|---|
GetActiveSparkContext | Get existing SparkContext |
GetConf | Return a copy of this JavaSparkContext's configuration. The configuration ''cannot'' be changed at runtime. |
GetOrCreate | This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext. Note: This function cannot be used to create multiple SparkContext instances even if multiple contexts are allowed. |
TextFile | Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. |
Parallelize``1 | Distribute a local collection to form an RDD. sc.Parallelize(new int[] {0, 2, 3, 4, 6}, 5).Glom().Collect() [[0], [2], [3], [4], [6]] |
EmptyRDD | Create an RDD that has no partitions or elements. |
WholeTextFiles | Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do {{{ RDD<Tuple<string, string>> rdd = sparkContext.WholeTextFiles("hdfs://a-hdfs-path") }}} then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} Small files are preferred; large files are also allowable, but may cause bad performance. minPartitions is a suggested value for the minimal number of splits for the input data. |
BinaryFiles | Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do `RDD<Tuple<string, byte[]>> rdd = sparkContext.dataStreamFiles("hdfs://a-hdfs-path")`, then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} @note Small files are preferred; very large files may cause bad performance. @param minPartitions A suggested value for the minimal number of splits for the input data. |
SequenceFile | Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows: 1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes 2. Serialization is attempted via Pyrolite pickling 3. If this fails, the fallback is to call 'toString' on each key and value 4. PickleSerializer is used to deserialize pickled objects on the Python side |
NewAPIHadoopFile | Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java |
NewAPIHadoopRDD | Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile. |
HadoopFile | Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java. |
HadoopRDD | Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile. |
Union``1 | Build the union of a list of RDDs. This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer: >>> path = os.path.join(tempdir, "union-text.txt") >>> with open(path, "w") as testFile: ... _ = testFile.write("Hello") >>> textFile = sc.textFile(path) >>> textFile.collect() [u'Hello'] >>> parallelized = sc.parallelize(["World!"]) >>> sorted(sc.union([textFile, parallelized]).collect()) [u'Hello', 'World!'] |
Broadcast``1 | Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once. |
Accumulator``1 | Create an Accumulator with the given initial value, using a given helper object to define how to add values of the data type if provided. Default AccumulatorParams are used for integers and floating-point numbers if you do not provide one. For other types, a custom AccumulatorParam can be used. |
Stop | Shut down the SparkContext. |
AddFile | Add a file to be downloaded with this Spark job on every node. The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use `SparkFiles.get(fileName)` to find its download location. |
SetCheckpointDir | Set the directory under which RDDs are going to be checkpointed. The directory must be a HDFS path if running on a cluster. |
SetJobGroup | Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. Often, a unit of execution in an application consists of multiple Spark actions or jobs. Application programmers can use this method to group all those jobs together and give a group description. Once set, the Spark web UI will associate such jobs with this group. The application can also use [[org.apache.spark.api.java.JavaSparkContext.cancelJobGroup]] to cancel all running jobs in this group. For example, {{{ // In the main thread: sc.setJobGroup("some_job_to_cancel", "some job description"); rdd.map(...).count(); // In a separate thread: sc.cancelJobGroup("some_job_to_cancel"); }}} If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead. |
SetLocalProperty | Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. |
GetLocalProperty | Get a local property set in this thread, or null if it is missing. See [[org.apache.spark.api.java.JavaSparkContext.setLocalProperty]]. |
SetLogLevel | Control our logLevel. This overrides any user-defined log settings. @param logLevel The desired log level as a string. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN |
RunJob``1 | Run a job on a given set of partitions of an RDD. |
CancelJobGroup | Cancel active jobs for the specified group. See for more information. |
CancelAllJobs | Cancel all jobs that have been scheduled or are running. |
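
A minimal end-to-end sketch (assuming a SparkContext can be constructed from a SparkConf; names and settings are illustrative):

```csharp
// Sketch: create a context, build an RDD, run an action, then shut down.
var sc = new SparkContext(new SparkConf().SetAppName("ContextExample").SetMaster("local[2]"));
var rdd = sc.Parallelize(Enumerable.Range(1, 100), 4);
var evens = rdd.Filter(x => x % 2 == 0).Count();   // 50
Console.WriteLine(evens);
sc.Stop();
```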
###Microsoft.Spark.CSharp.Core.StatCounter ####Summary
A class for tracking the statistics of a set of numbers (count, mean and variance) in a numerically
robust way. Includes support for merging two StatCounters. Based on Welford and Chan's algorithms
for running variance.
####Methods
Name | Description |
---|---|
Merge | Add a value into this StatCounter, updating the internal statistics. |
Merge | Add multiple values into this StatCounter, updating the internal statistics. |
Merge | Merge another StatCounter into this one, adding up the internal statistics. |
copy | Clone this StatCounter |
ToString | Returns a string that represents this StatCounter. |
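
For reference, a sketch of the pairwise merge rule behind such a counter (Chan et al.; an illustration, not the StatCounter source), where each side tracks a count n, a mean, and M2, the sum of squared deviations from the mean:

```csharp
// Sketch: merge two running (n, mean, M2) summaries in a numerically robust way.
static (long n, double mean, double m2) Merge(
    (long n, double mean, double m2) a,
    (long n, double mean, double m2) b)
{
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    long n = a.n + b.n;
    double delta = b.mean - a.mean;
    double mean = a.mean + delta * b.n / n;
    double m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / (double)n;
    return (n, mean, m2);   // variance = m2 / n, sample variance = m2 / (n - 1)
}
```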
###Microsoft.Spark.CSharp.Core.StatusTracker ####Summary
Low-level status reporting APIs for monitoring job and stage progress.
####Methods
Name | Description |
---|---|
GetJobIdsForGroup | Return a list of all known jobs in a particular job group. If `jobGroup` is null, then returns all known jobs that are not associated with a job group. The returned list may contain running, failed, and completed jobs, and may vary across invocations of this method. This method does not guarantee the order of the elements in its result. |
GetActiveStageIds | Returns an array containing the ids of all active stages. |
GetActiveJobsIds | Returns an array containing the ids of all active jobs. |
GetJobInfo | Returns a SparkJobInfo object, or null if the job info could not be found or was garbage collected. |
GetStageInfo | Returns a SparkStageInfo object, or null if the stage info could not be found or was garbage collected. |
###Microsoft.Spark.CSharp.Core.JobExecutionStatus ####Summary
Status associated with a Spark job
###Microsoft.Spark.CSharp.Core.SparkJobInfo ####Summary
SparkJobInfo represents information about a Spark job
###Microsoft.Spark.CSharp.Core.SparkStageInfo ####Summary
SparkStageInfo represents information about a Spark stage
###Microsoft.Spark.CSharp.Core.StorageLevelType ####Summary
Defines the type of storage levels
###Microsoft.Spark.CSharp.Core.StorageLevel ####Summary
Flags for controlling the storage of an RDD. Each StorageLevel records whether to use
memory, whether to drop the RDD to disk if it falls out of memory, whether to keep the
data in memory in a serialized format, and whether to replicate the RDD partitions
on multiple nodes.
####Methods
Name | Description |
---|---|
ToString | Returns a readable string that represents the type |
###Microsoft.Spark.CSharp.Network.ByteBuf ####Summary
ByteBuf delimits a section of a ByteBufChunk.
It is the smallest unit to be allocated.
####Methods
Name | Description |
---|---|
Clear | Sets the readerIndex and writerIndex of this buffer to 0. |
IsReadable | Returns true if and only if this ByteBuf contains at least the specified number of readable elements. |
IsWritable | Returns true if and only if the buffer has enough Capacity to accommodate size additional bytes. |
ReadByte | Gets a byte at the current readerIndex and increases the readerIndex by 1 in this buffer. |
ReadBytes | Reads a block of bytes from the ByteBuf and writes the data to a buffer. |
Release | Release the ByteBuf back to the ByteBufPool |
WriteBytes | Writes a block of bytes to the ByteBuf using data read from a buffer. |
GetInputRioBuf | Returns a RioBuf object for input (receive) |
GetOutputRioBuf | Returns a RioBuf object for output (send). |
NewErrorStatusByteBuf | Creates an empty ByteBuf with error status. |
###Microsoft.Spark.CSharp.Network.ByteBufChunk ####Summary
ByteBufChunk represents a memory block that can be allocated from the
.NET heap (managed code) or the process heap (unsafe code)
####Methods
Name | Description |
---|---|
Finalize | Finalizer. |
Allocate | Allocates a ByteBuf from this ByteChunk. |
Dispose | Release all resources |
Free | Releases the ByteBuf back to this ByteChunk |
ToString | Returns a readable string for the ByteBufChunk |
NewChunk | Static method to create a new ByteBufChunk with given segment and chunk size. If isUnsafe is true, it allocates memory from the process's heap. |
FreeToProcessHeap | Wraps HeapFree to process heap. |
Dispose | Implementation of the Dispose pattern. |
###Microsoft.Spark.CSharp.Network.ByteBufChunk.Segment ####Summary
Segment struct delimits a section of a byte chunk.
###Microsoft.Spark.CSharp.Network.ByteBufChunkList ####Summary
ByteBufChunkList represents a simple linked list used to store ByteBufChunk objects
based on their usage.
####Methods
Name | Description |
---|---|
Add | Add the ByteBufChunk to this ByteBufChunkList linked-list based on ByteBufChunk's usage. So it will be moved to the right ByteBufChunkList that has the correct minUsage/maxUsage. |
Allocate | Allocates a ByteBuf from this ByteBufChunkList if it is not empty. |
Free | Releases the segment back to its ByteBufChunk. |
AddInternal | Adds the ByteBufChunk to this ByteBufChunkList |
MoveInternal | Moves the ByteBufChunk down the ByteBufChunkList linked-list so it will end up in the right ByteBufChunkList that has the correct minUsage/maxUsage in respect to ByteBufChunk.Usage. |
Remove | Remove the ByteBufChunk from this ByteBufChunkList |
ToString | Returns a readable string for this ByteBufChunkList |
###Microsoft.Spark.CSharp.Network.ByteBufPool ####Summary
ByteBufPool is used to manage the ByteBuf pool that allocates and frees pooled memory buffers.
It borrows some ideas from Netty's buffer memory management.
####Methods
Name | Description |
---|---|
Allocate | Allocates a ByteBuf from this ByteBufPool to use. |
Free | Deallocates a ByteBuf back to this ByteBufPool. |
ToString | Gets a readable string for this ByteBufPool |
GetUsages | Returns the chunk numbers in each queue. |
###Microsoft.Spark.CSharp.Network.RioNative ####Summary
RioNative class imports and initializes RIOSock.dll for use with RIO socket APIs.
It also provides a simple thread pool that retrieves results from the IO completion port.
####Methods
Name | Description |
---|---|
Finalize | Finalizer |
Dispose | Release all resources. |
SetUseThreadPool | Sets whether to use a thread pool to query RIO socket results; it must be called before calling EnsureRioLoaded() |
EnsureRioLoaded | Ensures that the native dll of RIO socket is loaded and initialized. |
UnloadRio | Explicitly unload the native dll of RIO socket, and release resources. |
Init | Initializes RIOSock native library. |
###Microsoft.Spark.CSharp.Network.RioResult ####Summary
The RioResult structure contains data used to indicate request completion results for RIO sockets
###Microsoft.Spark.CSharp.Network.SocketStream ####Summary
Provides the underlying stream of data for network access.
Just like a NetworkStream.
####Methods
Name | Description |
---|---|
Flush | Flushes data in send cache to the stream. |
Seek | Seeks a specific position in the stream. This method is not supported by the SocketDataStream class. |
SetLength | Sets the length of the stream. This method is not supported by the SocketDataStream class. |
ReadByte | Reads a byte from the stream and advances the position within the stream by one byte, or returns -1 if at the end of the stream. |
Read | Reads data from the stream. |
Write | Writes data to the stream. |
###Microsoft.Spark.CSharp.Network.SockDataToken ####Summary
SockDataToken class is used to associate with the SocketAsyncEventArgs object.
Primarily, it is a way to pass state to the event handler.
####Methods
Name | Description |
---|---|
Reset | Reset this token |
DetachData | Detach the data ownership. |
###Microsoft.Spark.CSharp.Network.SocketFactory ####Summary
SocketFactory is used to create ISocketWrapper instances based on the configuration and OS version.
The instance can be a RioSocket object only if the configuration is set to RioSocket and
the application is running on a Windows OS that supports Registered I/O sockets.
####Methods
Name | Description |
---|---|
CreateSocket | Creates an ISocketWrapper instance based on the configuration and OS version. |
IsRioSockSupported | Indicates whether current OS supports RIO socket. |
###Microsoft.Spark.CSharp.Sql.Builder ####Summary
The entry point to programming Spark with the Dataset and DataFrame API.
####Methods
Name | Description |
---|---|
Master | Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster. |
AppName | Sets a name for the application, which will be shown in the Spark web UI. If no application name is set, a randomly generated name will be used. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a list of config options based on the given SparkConf |
EnableHiveSupport | Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. |
GetOrCreate | Gets an existing [[SparkSession]] or, if there is no existing one, creates a new one based on the options set in this builder. |
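
A typical builder chain might look like the sketch below (assuming the static SparkSession.Builder() entry point mirrors the Scala and Python APIs):

```csharp
// Sketch: configure and obtain a SparkSession through the Builder.
var spark = SparkSession.Builder()
    .Master("local[2]")
    .AppName("BuilderExample")
    .Config("spark.sql.shuffle.partitions", "4")
    .EnableHiveSupport()                 // optional: Hive metastore, serdes and UDFs
    .GetOrCreate();
```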
###Microsoft.Spark.CSharp.Sql.Catalog.Catalog ####Summary
Catalog interface for Spark.
####Methods
Name | Description |
---|---|
ListDatabases | Returns a list of databases available across all sessions. |
ListTables | Returns a list of tables in the current database or the given database. This includes all temporary tables. |
ListColumns | Returns a list of columns for the given table in the current database or the given temporary table. |
ListFunctions | Returns a list of functions registered in the specified database. This includes all temporary functions. |
SetCurrentDatabase | Sets the current default database in this session. |
DropTempView | Drops the temporary view with the given view name in the catalog. If the view has been cached before, then it will also be uncached. |
IsCached | Returns true if the table is currently cached in-memory. |
CacheTable | Caches the specified table in-memory. |
UnCacheTable | Removes the specified table from the in-memory cache. |
RefreshTable | Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks.When those change outside of Spark SQL, users should call this function to invalidate the cache. If this table is cached as an InMemoryRelation, drop the original cached version and make the new version cached lazily. |
ClearCache | Removes all cached tables from the in-memory cache. |
CreateExternalTable | Creates an external table from the given path and returns the corresponding DataFrame. It will use the default data source configured by spark.sql.sources.default. |
CreateExternalTable | Creates an external table from the given path on a data source and returns DataFrame |
CreateExternalTable | Creates an external table from the given path based on a data source and a set of options. Then, returns the corresponding DataFrame. |
CreateExternalTable | Create an external table from the given path based on a data source, a schema, and a set of options. Then returns the corresponding DataFrame. |
###Microsoft.Spark.CSharp.Sql.Catalog.Database ####Summary
A database in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Table ####Summary
A table in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Column ####Summary
A column in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Function ####Summary
A user-defined function in Spark
###Microsoft.Spark.CSharp.Sql.Column ####Summary
A column that will be computed based on the data in a DataFrame.
####Methods
Name | Description |
---|---|
op_LogicalNot | The logical negation operator that negates its operand. |
op_UnaryNegation | Negation of itself. |
op_Addition | Sum of this expression and another expression. |
op_Subtraction | Subtraction of this expression and another expression. |
op_Multiply | Multiplication of this expression and another expression. |
op_Division | Division of this expression by another expression. |
op_Modulus | Modulo (a.k.a. remainder) expression. |
op_Equality | The equality operator returns true if the values of its operands are equal, false otherwise. |
op_Inequality | The inequality operator returns false if its operands are equal, true otherwise. |
op_LessThan | The "less than" relational operator that returns true if the first operand is less than the second, false otherwise. |
op_LessThanOrEqual | The "less than or equal" relational operator that returns true if the first operand is less than or equal to the second, false otherwise. |
op_GreaterThanOrEqual | The "greater than or equal" relational operator that returns true if the first operand is greater than or equal to the second, false otherwise. |
op_GreaterThan | The "greater than" relational operator that returns true if the first operand is greater than the second, false otherwise. |
op_BitwiseOr | Compute bitwise OR of this expression with another expression. |
op_BitwiseAnd | Compute bitwise AND of this expression with another expression. |
op_ExclusiveOr | Compute bitwise XOR of this expression with another expression. |
GetHashCode | Required when operator == or operator != is defined |
Equals | Required when operator == or operator != is defined |
Like | SQL like expression. |
RLike | SQL RLIKE expression (LIKE with Regex). |
StartsWith | String starts with another string literal. |
EndsWith | String ends with another string literal. |
Asc | Returns a sort expression based on the ascending order. |
Desc | Returns a sort expression based on the descending order. |
Alias | Returns this column aliased with a new name. |
Alias | Returns this column aliased with new names |
Cast | Casts the column to a different data type, using the canonical string representation of the type. The supported types are: `string`, `boolean`, `byte`, `short`, `int`, `long`, `float`, `double`, `decimal`, `date`, `timestamp`. E.g. // Casts colA to integer. df.select(df("colA").cast("int")) |
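A sketch of building column expressions with the operators and methods above. `df` is assumed to be an existing DataFrame with "name" and "age" columns, and the operand types accepted by the operators and by Filter are assumptions.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: column expressions, assuming `df` is an existing DataFrame
// with "name" (string) and "age" (int) columns.
Column age  = df["age"];
Column name = df["name"];

var adults   = df.Filter(age >= 21);                              // relational operator
var initialA = df.Filter(name.Like("A%"));                        // SQL LIKE expression
var widened  = df.Select("*", (age + 10).Alias("ageIn10Years"));  // arithmetic plus alias
var casted   = df.Select("*", age.Cast("double").Alias("ageAsDouble"));
```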
###Microsoft.Spark.CSharp.Sql.DataFrame ####Summary
A distributed collection of data organized into named columns.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
####Methods
Name | Description |
---|---|
RegisterTempTable | Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SqlContext that was used to create this DataFrame. |
Count | Number of rows in the DataFrame |
Show | Displays rows of the DataFrame in tabular form |
ShowSchema | Prints the schema information of the DataFrame |
Collect | Returns all of Rows in this DataFrame |
ToRDD | Converts the DataFrame to RDD of Row |
ToJSON | Returns the content of the DataFrame as RDD of JSON strings |
Explain | Prints the plans (logical and physical) to the console for debugging purposes |
Select | Selects a set of columns specified by column name or Column. df.Select("colA", df["colB"]) df.Select("*", df["colB"] + 10) |
Select | Selects a set of columns. This is a variant of `select` that can only select existing columns using column names (i.e. cannot construct expressions). df.Select("colA", "colB") |
SelectExpr | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expressions. df.SelectExpr("colA", "colB as newName", "abs(colC)") |
Where | Filters rows using the given condition |
Filter | Filters rows using the given condition |
GroupBy | Groups the DataFrame using the specified columns, so we can run aggregation on them. |
Rollup | Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. |
Cube | Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. |
Agg | Aggregates on the DataFrame for the given column-aggregate function mapping |
Join | Join with another DataFrame - Cartesian join |
Join | Join with another DataFrame - Inner equi-join using given column name |
Join | Join with another DataFrame - Inner equi-join using given column name |
Join | Join with another DataFrame, using the specified JoinType |
Intersect | Intersect with another DataFrame. This is equivalent to `INTERSECT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, intersect(self, other) |
UnionAll | Union with another DataFrame WITHOUT removing duplicated rows. This is equivalent to `UNION ALL` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, unionAll(self, other) |
Subtract | Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to `EXCEPT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, subtract(self, other) |
Drop | Returns a new DataFrame with a column dropped. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, drop(self, col) |
DropNa | Returns a new DataFrame omitting rows with null values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropna(self, how='any', thresh=None, subset=None) |
Na | Returns a DataFrameNaFunctions for working with missing data. |
FillNa | Replaces null values; alias for `na.fill()`. |
DropDuplicates | Returns a new DataFrame with duplicate rows removed, considering only the subset of columns. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropDuplicates(self, subset=None) |
Replace``1 | Returns a new DataFrame replacing a value with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with other values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
RandomSplit | Randomly splits this DataFrame with the provided weights. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, randomSplit(self, weights, seed=None) |
Columns | Returns all column names as a list. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, columns(self) |
DTypes | Returns all column names and their data types. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dtypes(self) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
SortWithinPartitions | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
SortWithinPartition | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
Alias | Returns a new DataFrame with an alias set. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, alias(self, alias) |
WithColumn | Returns a new DataFrame by adding a column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumn(self, colName, col) |
WithColumnRenamed | Returns a new DataFrame by renaming an existing column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumnRenamed(self, existing, new) |
Corr | Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, corr(self, col1, col2, method=None) |
Cov | Calculate the sample covariance of two columns as a double value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, cov(self, col1, col2) |
FreqItems | Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in "http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou". Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, freqItems(self, cols, support=None) Note: This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. |
Crosstab | Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, crosstab(self, col1, col2) |
Describe | Computes statistics for numeric columns. This includes count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns. |
Limit | Returns a new DataFrame by taking the first `n` rows. The difference between this function and `head` is that `head` returns an array while `limit` returns a new DataFrame. |
Head | Returns the first `n` rows. |
First | Returns the first row. |
Take | Returns the first `n` rows in the DataFrame. |
Distinct | Returns a new DataFrame that contains only the unique rows from this DataFrame. |
Coalesce | Returns a new DataFrame that has exactly `numPartitions` partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency; e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, and each of the 100 new partitions will claim 10 of the current partitions. |
Persist | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Unpersist | Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
Cache | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Repartition | Returns a new DataFrame that has exactly `numPartitions` partitions. |
Repartition | Returns a new [[DataFrame]] partitioned by the given partitioning columns into `numPartitions` partitions. The resulting DataFrame is hash partitioned. `numPartitions` is optional; if not specified, the current number of partitions is kept. |
Repartition | Returns a new [[DataFrame]] partitioned by the given partitioning columns into `numPartitions` partitions. The resulting DataFrame is hash partitioned. `numPartitions` is optional; if not specified, the current number of partitions is kept. |
Sample | Returns a new DataFrame by sampling a fraction of rows. |
FlatMap``1 | Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results. |
Map``1 | Returns a new RDD by applying a function to all rows of this DataFrame. |
MapPartitions``1 | Returns a new RDD by applying a function to each partition of this DataFrame. |
ForeachPartition | Applies a function f to each partition of this DataFrame. |
Foreach | Applies a function f to all rows. |
Write | Interface for saving the content of the DataFrame out into external storage. |
SaveAsParquetFile | Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the `parquetFile` function in SQLContext. |
InsertInto | Adds the rows from this RDD to the specified table, optionally overwriting the existing data. |
SaveAsTable | Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options. Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an `insertInto`. Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive until SPARK-7550 is resolved. |
Save | Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options. |
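The following sketch strings several of the DataFrame operations above together. It assumes `spark` is an existing SparkSession, "people.json" is a placeholder path containing one JSON object per line, and that Where accepts a SQL condition string.

```csharp
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Sql;

// Sketch: a typical DataFrame flow (read, inspect, filter, aggregate).
DataFrame people = spark.Read().Json("people.json");   // placeholder input path

people.ShowSchema();                                    // print the inferred schema
people.Show();                                          // tabular preview of rows

var adults = people
    .Where("age >= 21")                                 // SQL-style condition string
    .Select("name", people["age"]);                     // column name mixed with a Column

// Aggregate with a column-name -> aggregate-function mapping.
var avgAgeByCity = people
    .GroupBy("city")
    .Agg(new Dictionary<string, string> { { "age", "avg" } });

avgAgeByCity.Show();
var rowCount = people.Count();                          // number of rows in the DataFrame
```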
###Microsoft.Spark.CSharp.Sql.JoinType ####Summary
The type of join operation for DataFrame
###Microsoft.Spark.CSharp.Sql.GroupedData ####Summary
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy.
####Methods
Name | Description |
---|---|
Agg | Compute aggregates by specifying a dictionary from column name to aggregate methods. The available aggregate methods are avg, max, min, sum, count. |
Count | Count the number of rows for each group. |
Mean | Compute the average value for each numeric column for each group. This is an alias for avg. When specified columns are given, only compute the average values for them. |
Max | Compute the max value for each numeric column for each group. When specified columns are given, only compute the max values for them. |
Min | Compute the min value for each numeric column for each group. |
Avg | Compute the mean value for each numeric column for each group. When specified columns are given, only compute the mean values for them. |
Sum | Compute the sum for each numeric column for each group. When specified columns are given, only compute the sum for them. |
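A short sketch of the aggregation methods above, assuming `df` has "dept" and "salary" columns and that Agg takes the column-to-aggregate-function dictionary described in the table.

```csharp
using System.Collections.Generic;
using Microsoft.Spark.CSharp.Sql;

// Sketch: aggregations over grouped data; "dept" and "salary" are placeholder columns.
GroupedData byDept = df.GroupBy("dept");

var rowsPerDept = byDept.Count();                    // number of rows in each group
var avgSalary   = byDept.Avg("salary");              // mean of the named column per group
var maxSalary   = byDept.Agg(new Dictionary<string, string> { { "salary", "max" } });
```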
###Microsoft.Spark.CSharp.Sql.DataFrameNaFunctions ####Summary
Functionality for working with missing data in DataFrames.
####Methods
Name | Description |
---|---|
Drop | Returns a new DataFrame that drops rows containing any null values. |
Drop | Returns a new DataFrame that drops rows containing null values. If `how` is "any", then drop rows containing any null values. If `how` is "all", then drop rows only if every column is null for that row. |
Drop | Returns a new [[DataFrame]] that drops rows containing null values in the specified columns. If `how` is "any", then drop rows containing any null values in the specified columns. If `how` is "all", then drop rows only if every specified column is null for that row. |
Drop | Returns a new DataFrame that drops rows containing any null values in the specified columns. |
Drop | Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values. |
Drop | Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values in the specified columns. |
Fill | Returns a new DataFrame that replaces null values in numeric columns with `value`. |
Fill | Returns a new DataFrame that replaces null values in string columns with `value`. |
Fill | Returns a new DataFrame that replaces null values in specified numeric columns. If a specified column is not a numeric column, it is ignored. |
Fill | Returns a new DataFrame that replaces null values in specified string columns. If a specified column is not a string column, it is ignored. |
Fill | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. The value must be of the following type: `Integer`, `Long`, `Float`, `Double`, `String`. For example, the following replaces null values in column "A" with string "unknown", and null values in column "B" with numeric value 1.0. import com.google.common.collect.ImmutableMap; df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0)); |
Replace``1 | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height". df.replace("height", ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name". df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed")); // Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns. df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed")); |
Replace``1 | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height" and "weight". df.replace(new String[] {"height", "weight"}, ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "firstname" and "lastname". df.replace(new String[] {"firstname", "lastname"}, ImmutableMap.of("UNKNOWN", "unnamed")); |
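A sketch of the missing-data helpers above. It assumes `df` has a numeric "age" column and a string "name" column, that Na() on the DataFrame returns this type, and that the Fill overloads accept an optional array of column names (the parameter shape is an assumption).

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: dropping and filling nulls; column names are placeholders.
DataFrameNaFunctions na = df.Na();

var noNulls     = na.Drop();                              // drop rows with any null value
var agesFilled  = na.Fill(0.0, new[] { "age" });          // fill nulls in a numeric column
var namesFilled = na.Fill("unknown", new[] { "name" });   // fill nulls in a string column
```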
###Microsoft.Spark.CSharp.Sql.DataFrameReader ####Summary
Interface used to load a DataFrame from external storage systems (e.g. file systems,
key-value stores, etc). Use SQLContext.read() to access this.
####Methods
Name | Description |
---|---|
Format | Specifies the input data source format. |
Schema | Specifies the input schema. Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading. |
Option | Adds an input option for the underlying data source. |
Options | Adds input options for the underlying data source. |
Load | Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by a local or distributed file system). |
Load | Loads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores). |
Jdbc | Construct a [[DataFrame]] representing the database table accessible via JDBC URL `url`, named `table`, using connection properties. |
Jdbc | Construct a DataFrame representing the database table accessible via JDBC URL `url`, named `table`. Partitions of the table will be retrieved in parallel based on the parameters passed to this function. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Jdbc | Construct a DataFrame representing the database table accessible via JDBC URL `url`, named `table`, using connection properties. The `predicates` parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Json | Loads a JSON file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. |
Parquet | Loads a Parquet file, returning the result as a [[DataFrame]]. This function returns an empty DataFrame if no paths are passed in. |
Avro | Loads an AVRO file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. |
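A sketch of loading data with the reader methods above. `sqlContext` is assumed to be an existing SQLContext (or SparkSession) exposing Read(); the paths and the option key are placeholders.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: loading data through DataFrameReader; paths are placeholders.
var fromJson    = sqlContext.Read().Json("logs.json");         // schema inferred from the data
var fromParquet = sqlContext.Read().Parquet("events.parquet");

var generic = sqlContext.Read()
    .Format("json")                                            // pick the data source explicitly
    .Option("samplingRatio", "1.0")                            // option keys are source-specific
    .Load("logs.json");
```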
###Microsoft.Spark.CSharp.Sql.DataFrameWriter ####Summary
Interface used to write a DataFrame to external storage systems (e.g. file systems,
key-value stores, etc). Use DataFrame.Write to access this.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
####Methods
Name | Description |
---|---|
Mode | Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
Mode | Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
Format | Specifies the underlying output data source. Built-in options include "parquet", "json", etc. |
Option | Adds an output option for the underlying data source. |
Options | Adds output options for the underlying data source. |
PartitionBy | Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. This is only applicable for Parquet at the moment. |
Save | Saves the content of the DataFrame at the specified path. |
Save | Saves the content of the DataFrame as the specified table. |
InsertInto | Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table. Because it inserts data to an existing table, format or options will be ignored. |
SaveAsTable | Saves the content of the DataFrame as the specified table. In the case the table already exists, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). When `mode` is `Overwrite`, the schema of the DataFrame does not need to be the same as that of the existing table. When `mode` is `Append`, the schema of the DataFrame needs to be the same as that of the existing table, and format or options will be ignored. |
Jdbc | Saves the content of the DataFrame to an external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Json | Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("json").Save(path) |
Parquet | Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to: Format("parquet").Save(path) |
Avro | Saves the content of the DataFrame in AVRO format at the specified path. This is equivalent to: Format("com.databricks.spark.avro").Save(path) |
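A sketch of writing a DataFrame out with the methods above; `df` is an existing DataFrame, Write() is assumed to return its DataFrameWriter, and the output paths and partition column are placeholders.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: saving a DataFrame; output paths and the partition column are placeholders.
df.Write()
  .Mode(SaveMode.Overwrite)        // overwrite any existing data at the target
  .Format("parquet")
  .PartitionBy("year")             // Hive-style partition layout (Parquet only, per the table above)
  .Save("output/events");

// Format-specific shorthand.
df.Write().Mode(SaveMode.Append).Json("output/events_json");
```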
###Microsoft.Spark.CSharp.Sql.Dataset ####Summary
Dataset is a strongly typed collection of domain-specific objects that can be transformed
in parallel using functional or relational operations. Each Dataset also has an untyped view
called a DataFrame, which is a Dataset of Row.
####Methods
Name | Description |
---|---|
ToDF | Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic [[Row]] objects that allow fields to be accessed by ordinal or name. |
PrintSchema | Prints the schema to the console in a nice tree format. |
Explain | Prints the plans (logical and physical) to the console for debugging purposes. |
Explain | Prints the physical plan to the console for debugging purposes. |
DTypes | Returns all column names and their data types as an array. |
Columns | Returns all column names as an array. |
Show | Displays the top 20 rows of the Dataset in tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. |
ShowSchema | Prints the schema. |
###Microsoft.Spark.CSharp.Sql.Dataset`1 ####Summary
A Dataset of a specific element type.
The type of element in the Dataset.
###Microsoft.Spark.CSharp.Sql.HiveContext ####Summary
HiveContext is deprecated. Use SparkSession.Builder().EnableHiveSupport() instead.
HiveContext is a variant of Spark SQL that integrates with data stored in Hive.
Configuration for Hive is read from hive-site.xml on the classpath.
It supports running both SQL and HiveQL commands.
####Methods
Name | Description |
---|---|
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
RefreshTable | Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache. |
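Since HiveContext is deprecated, the following sketch shows the suggested replacement: a SparkSession with Hive support enabled. The query and the "src" table name are placeholders.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: Hive-enabled session replacing HiveContext; "src" is a placeholder Hive table.
var spark = SparkSession.Builder()
    .AppName("HiveEnabledApp")
    .EnableHiveSupport()           // persistent metastore, Hive serdes and Hive UDFs
    .GetOrCreate();

var result = spark.Sql("SELECT key, value FROM src WHERE key < 10");
result.Show();
```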
###Microsoft.Spark.CSharp.Sql.PythonSerDe ####Summary
Used for SerDe of Python objects
####Methods
Name | Description |
---|---|
GetUnpickledObjects | Unpickles objects from byte[] |
###Microsoft.Spark.CSharp.Sql.RowConstructor ####Summary
Used by Unpickler to unpickle pickled objects. It is also used to construct a Row (C# representation of pickled objects).
####Methods
Name | Description |
---|---|
ToString | Returns a string that represents the current object. |
construct | Used by Unpickler - do not use to construct Row. Use GetRow() method |
GetRow | Used to construct a Row |
###Microsoft.Spark.CSharp.Sql.Row ####Summary
Represents one row of output from a relational operator.
####Methods
Name | Description |
---|---|
Size | Number of elements in the Row. |
GetSchema | Schema for the row. |
Get | Returns the value at position i. |
Get | Returns the value of a given columnName. |
GetAs``1 | Returns the value at position i, the return value will be cast to type T. |
GetAs``1 | Returns the value of a given columnName, the return value will be cast to type T. |
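A sketch of reading values back out of rows, assuming `df` is an existing DataFrame with "name" and "age" columns and that Collect() returns an enumerable of Row.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: accessing row values by position, by name and with a typed cast.
foreach (Row row in df.Collect())
{
    var schema = row.GetSchema();            // StructType describing this row
    var width  = row.Size();                 // number of elements in the row

    object byPos  = row.Get(0);              // value at position 0
    object byName = row.Get("name");         // value of the "name" column
    int age       = row.GetAs<int>("age");   // value cast to the requested type
}
```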
###Microsoft.Spark.CSharp.Sql.Functions ####Summary
DataFrame Built-in functions
####Methods
Name | Description |
---|---|
Lit | Creates a Column of any literal value. |
Col | Returns a Column based on the given column name. |
Column | Returns a Column based on the given column name. |
Asc | Returns a sort expression based on ascending order of the column. |
Desc | Returns a sort expression based on the descending order of the column. |
Upper | Converts a string column to upper case. |
Lower | Converts a string column to lower case. |
Sqrt | Computes the square root of the specified float column. |
Abs | Computes the absolute value. |
Max | Returns the maximum value of the expression in a group. |
Min | Returns the minimum value of the expression in a group. |
First | Returns the first value in a group. |
Last | Returns the last value in a group. |
Count | Returns the number of items in a group. |
Sum | Returns the sum of all values in the expression. |
Avg | Returns the average of the values in a group. |
Mean | Returns the average of the values in a group. |
SumDistinct | Returns the sum of distinct values in the expression. |
Array | Creates a new array column. The input columns must all have the same data type. |
Coalesce | Returns the first column that is not null, or null if all inputs are null. |
CountDistinct | Returns the number of distinct items in a group. |
Struct | Creates a new struct column. |
ApproxCountDistinct | Returns the approximate number of distinct items in a group |
Explode | Creates a new row for each element in the given array or map column. |
Rand | Generate a random column with i.i.d. samples from U[0.0, 1.0]. |
Randn | Generate a column with i.i.d. samples from the standard normal distribution. |
Ntile | Returns the ntile group id (from 1 to n inclusive) in an ordered window partition. This is equivalent to the NTILE function in SQL. |
Acos | Computes the cosine inverse of the given column; the returned angle is in the range 0 through pi. |
Asin | Computes the sine inverse of the given column; the returned angle is in the range -pi/2 through pi/2. |
Atan | Computes the tangent inverse of the given column. |
Cbrt | Computes the cube-root of the given column. |
Ceil | Computes the ceiling of the given column. |
Cos | Computes the cosine of the given column. |
Cosh | Computes the hyperbolic cosine of the given column. |
Exp | Computes the exponential of the given column. |
Expm1 | Computes the exponential of the given column minus one. |
Floor | Computes the floor of the given column. |
Log | Computes the natural logarithm of the given column. |
Log10 | Computes the logarithm of the given column in base 10. |
Log1p | Computes the natural logarithm of the given column plus one. |
Rint | Returns the double value that is closest in value to the argument and is equal to a mathematical integer. |
Signum | Computes the signum of the given column. |
Sin | Computes the sine of the given column. |
Sinh | Computes the hyperbolic sine of the given column. |
Tan | Computes the tangent of the given column. |
Tanh | Computes the hyperbolic tangent of the given column. |
ToDegrees | Converts an angle measured in radians to an approximately equivalent angle measured in degrees. |
ToRadians | Converts an angle measured in degrees to an approximately equivalent angle measured in radians. |
BitwiseNOT | Computes bitwise NOT. |
Atan2 | Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta). |
Hypot | Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. |
Hypot | Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. |
Hypot | Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
ApproxCountDistinct | Returns the approximate number of distinct items in a group. |
When | Evaluates a list of conditions and returns one of multiple possible result expressions. |
Lag | Returns the value that is `offset` rows before the current row, and null if there are fewer than `offset` rows before the current row. |
Lead | Returns the value that is `offset` rows after the current row, and null if there are fewer than `offset` rows after the current row. |
RowNumber | Returns a sequential number starting at 1 within a window partition. |
DenseRank | Returns the rank of rows within a window partition, without any gaps. |
Rank | Returns the rank of rows within a window partition. |
CumeDist | Returns the cumulative distribution of values within a window partition |
PercentRank | Returns the relative rank (i.e. percentile) of rows within a window partition. |
MonotonicallyIncreasingId | A column expression that generates monotonically increasing 64-bit integers. |
SparkPartitionId | Partition ID of the Spark task. Note that this is non-deterministic because it depends on data partitioning and task scheduling. |
Rand | Generate a random column with i.i.d. samples from U[0.0, 1.0]. |
Randn | Generate a column with i.i.d. samples from the standard normal distribution. |
Udf``1 | Defines a user-defined function (UDF) of 0 arguments. The data types are automatically inferred based on the function's signature. |
Udf``2 | Defines a user-defined function (UDF) of 1 argument. The data types are automatically inferred based on the function's signature. |
Udf``3 | Defines a user-defined function (UDF) of 2 arguments. The data types are automatically inferred based on the function's signature. |
Udf``4 | Defines a user-defined function (UDF) of 3 arguments. The data types are automatically inferred based on the function's signature. |
Udf``5 | Defines a user-defined function (UDF) of 4 arguments. The data types are automatically inferred based on the function's signature. |
Udf``6 | Defines a user-defined function (UDF) of 5 arguments. The data types are automatically inferred based on the function's signature. |
Udf``7 | Defines a user-defined function (UDF) of 6 arguments. The data types are automatically inferred based on the function's signature. |
Udf``8 | Defines a user-defined function (UDF) of 7 arguments. The data types are automatically inferred based on the function's signature. |
Udf``9 | Defines a user-defined function (UDF) of 8 arguments. The data types are automatically inferred based on the function's signature. |
Udf``10 | Defines a user-defined function (UDF) of 9 arguments. The data types are automatically inferred based on the function's signature. |
Udf``11 | Defines a user-defined function (UDF) of 10 arguments. The data types are automatically inferred based on the function's signature. |
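A sketch using a few of the built-in functions above to build column expressions. It assumes `df` has "name" and "salary" columns, that the functions are exposed as static members of Functions, and that Sort accepts the resulting sort expression.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: building column expressions with built-in functions; column names are placeholders.
var projected = df.Select(
    Functions.Upper(Functions.Col("name")).Alias("NAME"),        // upper-case a string column
    Functions.Sqrt(Functions.Col("salary")).Alias("sqrtSalary"), // numeric function
    Functions.Lit(1).Alias("one"));                              // literal column

var ordered = df.Sort(Functions.Desc("salary"));                 // descending sort expression
```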
###Microsoft.Spark.CSharp.Sql.SaveMode ####Summary
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
####Methods
Name | Description |
---|---|
###Microsoft.Spark.CSharp.Sql.SaveModeExtensions ####Summary
Extension methods for SaveMode. For SaveMode.ErrorIfExists, the corresponding literal string in Spark is "error" or "default".
####Methods
Name | Description |
---|---|
GetStringValue | Gets the string for the value of SaveMode |
###Microsoft.Spark.CSharp.Sql.SparkSession ####Summary
The entry point to programming Spark with the Dataset and DataFrame API.
####Methods
Name | Description |
---|---|
Builder | Builder for SparkSession |
NewSession | Start a new session with isolated SQL configurations and temporary tables; registered functions are isolated, but the underlying [[SparkContext]] and cached data are shared. Note: Other than the [[SparkContext]], all shared state is initialized lazily. This method will force the initialization of the shared state to ensure that parent and child sessions are set up with the same shared state. If the underlying catalog implementation is Hive, this will initialize the metastore, which may take some time. |
Stop | Stop the underlying SparkContext. |
Read | Returns a DataFrameReader that can be used to read non-streaming data in as a DataFrame |
CreateDataFrame | Creates a DataFrame from an RDD containing an array of objects, using the given schema. |
Table | Returns the specified table as a DataFrame. |
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
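A sketch of basic SparkSession usage; the file path and table name are placeholders, and the SQL statement is deliberately self-contained so it does not depend on any registered table.

```csharp
using Microsoft.Spark.CSharp.Sql;

// Sketch: basic SparkSession usage; "people.json" and "some_table" are placeholders.
var spark = SparkSession.Builder().AppName("SessionExample").GetOrCreate();

var people = spark.Read().Json("people.json");   // DataFrameReader for non-streaming data
var oneRow = spark.Sql("SELECT 1 AS id");        // any Spark SQL statement
var table  = spark.Table("some_table");          // look up an existing table as a DataFrame

var isolated = spark.NewSession();               // separate SQL conf, shared SparkContext
spark.Stop();                                    // stops the underlying SparkContext
```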
###Microsoft.Spark.CSharp.Sql.SqlContext ####Summary
The entry point for working with structured data (rows and columns) in Spark.
Allows the creation of [[DataFrame]] objects as well as the execution of SQL queries.
####Methods
Name | Description |
---|---|
GetOrCreate | Get the existing SQLContext or create a new one with given SparkContext. |
NewSession | Returns a new SQLContext as new session, that has separate SQLConf, registered temporary tables and UDFs, but shared SparkContext and table cache. |
GetConf | Returns the value of Spark SQL configuration property for the given key. If the key is not set, returns defaultValue. |
SetConf | Sets the given Spark SQL configuration property. |
Read | Returns a DataFrameReader that can be used to read data in as a DataFrame. |
ReadDataFrame | Loads a DataFrame from the source path using the given schema and options. |
CreateDataFrame | Creates a DataFrame from an RDD containing an array of objects, using the given schema. |
RegisterDataFrameAsTable | Registers the given DataFrame as a temporary table in the catalog. Temporary tables exist only during the lifetime of this instance of SqlContext. |
DropTempTable | Remove the temp table from catalog. |
Table | Returns the specified table as a DataFrame. |
Tables | Returns a DataFrame containing names of tables in the given database. If no database is specified, the current database will be used. The returned DataFrame has two columns: 'tableName' and 'isTemporary' (a column with bool type indicating whether a table is a temporary one or not). |
TableNames | Returns a list of names of tables in the database |
CacheTable | Caches the specified table in-memory. |
UncacheTable | Removes the specified table from the in-memory cache. |
ClearCache | Removes all cached tables from the in-memory cache. |
IsCached | Returns true if the table is currently cached in-memory. |
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
JsonFile | Loads a JSON file (one object per line), returning the result as a DataFrame. It goes through the entire dataset once to determine the schema. |
JsonFile | Loads a JSON file (one object per line) and applies the given schema. |
TextFile | Loads a text file with columns separated by the specified delimiter, using the given schema. |
TextFile | Loads a text file (one object per line), returning the result as a DataFrame |
RegisterFunction``1 | Register UDF with no input argument, e.g: SqlContext.RegisterFunction<bool>("MyFilter", () => true); sqlContext.Sql("SELECT * FROM MyTable where MyFilter()"); |
RegisterFunction``2 | Register UDF with 1 input argument, e.g: SqlContext.RegisterFunction<bool, string>("MyFilter", (arg1) => arg1 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1)"); |
RegisterFunction``3 | Register UDF with 2 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string>("MyFilter", (arg1, arg2) => arg1 != null && arg2 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2)"); |
RegisterFunction``4 | Register UDF with 3 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, string>("MyFilter", (arg1, arg2, arg3) => arg1 != null && arg2 != null && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, columnName3)"); |
RegisterFunction``5 | Register UDF with 4 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg4) => arg1 != null && arg2 != null && ... && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName4)"); |
RegisterFunction``6 | Register UDF with 5 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg5) => arg1 != null && arg2 != null && ... && arg5 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName5)"); |
RegisterFunction``7 | Register UDF with 6 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg6) => arg1 != null && arg2 != null && ... && arg6 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName6)"); |
RegisterFunction``8 | Register UDF with 7 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg7) => arg1 != null && arg2 != null && ... && arg7 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName7)"); |
RegisterFunction``9 | Register UDF with 8 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg8) => arg1 != null && arg2 != null && ... && arg8 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName8)"); |
RegisterFunction``10 | Register UDF with 9 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg9) => arg1 != null && arg2 != null && ... && arg9 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName9)"); |
RegisterFunction``11 | Register UDF with 10 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg10) => arg1 != null && arg2 != null && ... && arg10 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName10)"); |
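A sketch that follows the RegisterFunction examples in the rows above; `sparkContext` is assumed to be an existing SparkContext, and "MyTable"/"columnName1" are the same placeholder names used in the table.

```csharp
using Microsoft.Spark.CSharp.Core;
using Microsoft.Spark.CSharp.Sql;

// Sketch: registering a UDF and calling it from SQL, following the examples above.
var sqlContext = SqlContext.GetOrCreate(sparkContext);   // existing SparkContext assumed

sqlContext.RegisterFunction<bool, string>("MyFilter", arg1 => arg1 != null);
var filtered = sqlContext.Sql("SELECT * FROM MyTable WHERE MyFilter(columnName1)");
filtered.Show();
```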
###Microsoft.Spark.CSharp.Sql.DataType ####Summary
The base type of all Spark SQL data types.
####Methods
Name | Description |
---|---|
ParseDataTypeFromJson | Parses a Json string to construct a DataType. |
ParseDataTypeFromJson | Parse a JToken object to construct a DataType. |
###Microsoft.Spark.CSharp.Sql.AtomicType ####Summary
An internal type used to represent a simple type.
###Microsoft.Spark.CSharp.Sql.ComplexType ####Summary
An internal type used to represent a complex type (such as arrays, structs, and maps).
####Methods
Name | Description |
---|---|
FromJson | Abstract method that constructs a complex type from a Json object |
FromJson | Constructs a complex type from a Json string |
###Microsoft.Spark.CSharp.Sql.NullType ####Summary
The data type representing NULL values.
###Microsoft.Spark.CSharp.Sql.StringType ####Summary
The data type representing String values.
###Microsoft.Spark.CSharp.Sql.BinaryType ####Summary
The data type representing binary values.
###Microsoft.Spark.CSharp.Sql.BooleanType ####Summary
The data type representing Boolean values.
###Microsoft.Spark.CSharp.Sql.DateType ####Summary
The data type representing Date values.
###Microsoft.Spark.CSharp.Sql.TimestampType ####Summary
The data type representing Timestamp values.
###Microsoft.Spark.CSharp.Sql.DoubleType ####Summary
The data type representing Double values.
###Microsoft.Spark.CSharp.Sql.FloatType ####Summary
The data type representing Float values.
###Microsoft.Spark.CSharp.Sql.ByteType ####Summary
The data type representing Byte values.
###Microsoft.Spark.CSharp.Sql.IntegerType ####Summary
The data type representing Int values.
###Microsoft.Spark.CSharp.Sql.LongType ####Summary
The data type representing Long values.
###Microsoft.Spark.CSharp.Sql.ShortType ####Summary
The data type representing Short values.
###Microsoft.Spark.CSharp.Sql.DecimalType ####Summary
The data type representing Decimal values.
####Methods
Name | Description |
---|---|
FromJson | Constructs a DecimalType from a Json object |
###Microsoft.Spark.CSharp.Sql.ArrayType ####Summary
The data type for collections of multiple values.
####Methods
Name | Description |
---|---|
FromJson | Constructs an ArrayType from a Json object |
###Microsoft.Spark.CSharp.Sql.MapType ####Summary
The data type for Maps. Not implemented yet.
####Methods
Name | Description |
---|---|
FromJson | Constructs a MapType from a Json object. Not implemented yet. |
###Microsoft.Spark.CSharp.Sql.StructField ####Summary
A field inside a StructType.
####Methods
Name | Description |
---|---|
FromJson | Constructs a StructField from a Json object |
###Microsoft.Spark.CSharp.Sql.StructType ####Summary
Struct type, consisting of a list of StructField objects.
This is the data type representing a Row.
####Methods
Name | Description |
---|---|
FromJson | Constructs a StructType from a Json object |
###Microsoft.Spark.CSharp.Sql.UdfRegistration ####Summary
Used for registering User Defined Functions. SparkSession.Udf is used to access an instance of this type.
####Methods
Name | Description |
---|---|
RegisterFunction``1 | Register UDF with no input argument, e.g: SqlContext.RegisterFunction<bool>("MyFilter", () => true); sqlContext.Sql("SELECT * FROM MyTable where MyFilter()"); |
RegisterFunction``2 | Register UDF with 1 input argument, e.g: SqlContext.RegisterFunction<bool, string>("MyFilter", (arg1) => arg1 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1)"); |
RegisterFunction``3 | Register UDF with 2 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string>("MyFilter", (arg1, arg2) => arg1 != null && arg2 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2)"); |
RegisterFunction``4 | Register UDF with 3 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, string>("MyFilter", (arg1, arg2, arg3) => arg1 != null && arg2 != null && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, columnName3)"); |
RegisterFunction``5 | Register UDF with 4 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg4) => arg1 != null && arg2 != null && ... && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName4)"); |
RegisterFunction``6 | Register UDF with 5 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg5) => arg1 != null && arg2 != null && ... && arg5 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName5)"); |
RegisterFunction``7 | Register UDF with 6 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg6) => arg1 != null && arg2 != null && ... && arg6 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName6)"); |
RegisterFunction``8 | Register UDF with 7 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg7) => arg1 != null && arg2 != null && ... && arg7 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName7)"); |
RegisterFunction``9 | Register UDF with 8 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg8) => arg1 != null && arg2 != null && ... && arg8 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName8)"); |
RegisterFunction``10 | Register UDF with 9 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg9) => arg1 != null && arg2 != null && ... && arg9 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName9)"); |
RegisterFunction``11 | Register UDF with 10 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg10) => arg1 != null && arg2 != null && ... && arg10 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName10)"); |
###Microsoft.Spark.CSharp.Streaming.ConstantInputDStream`1 ####Summary
An input stream that always returns the same RDD on each timestep. Useful for testing.
####Methods
Name | Description |
---|---|
###Microsoft.Spark.CSharp.Streaming.DStream`1 ####Summary
A Discretized Stream (DStream), the basic abstraction in Spark Streaming,
is a continuous sequence of RDDs (of the same type) representing a
continuous stream of data (see the Spark core documentation
for more details on RDDs).
DStreams can either be created from live data (such as data from TCP
sockets, Kafka, Flume, etc.) using a StreamingContext, or they can be
generated by transforming existing DStreams using operations such as
`Map`, `Window` and `ReduceByKeyAndWindow`. While a Spark Streaming
program is running, each DStream periodically generates an RDD, either
from live data or by transforming the RDD generated by a parent DStream.
Internally, each DStream is characterized by a few basic properties:
- A list of other DStreams that the DStream depends on
- A time interval at which the DStream generates an RDD
- A function that is used to generate an RDD after each time interval
####Methods
Name | Description |
---|---|
Count | Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream. |
Filter | Return a new DStream containing only the elements that satisfy predicate. |
FlatMap``1 | Return a new DStream by applying a function to all elements of this DStream, and then flattening the results |
Map``1 | Return a new DStream by applying a function to each element of DStream. |
MapPartitions``1 | Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDD of this DStream. |
MapPartitionsWithIndex``1 | Return a new DStream in which each RDD is generated by applying mapPartitionsWithIndex() to each RDD of this DStream. |
Reduce | Return a new DStream in which each RDD has a single element generated by reducing each RDD of this DStream. |
ForeachRDD | Apply a function to each RDD in this DStream. |
ForeachRDD | Apply a function to each RDD in this DStream. |
Print | Print the first num elements of each RDD generated in this DStream. @param num: the number of elements to print from the beginning of each RDD. |
Glom | Return a new DStream in which each RDD is generated by applying glom() to each RDD of this DStream. |
Cache | Persist the RDDs of this DStream with the default storage level. |
Persist | Persist the RDDs of this DStream with the given storage level. |
Checkpoint | Enable periodic checkpointing of RDDs of this DStream |
CountByValue | Return a new DStream in which each RDD contains the counts of each distinct value in each RDD of this DStream. |
SaveAsTextFiles | Save each RDD in this DStream as a text file, using the string representation of its elements. |
Transform``1 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`) |
Transform``1 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`) |
TransformWith``2 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`) |
TransformWith``2 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`) |
Repartition | Return a new DStream with an increased or decreased level of parallelism. |
Union | Return a new DStream by unifying data of another DStream with this DStream. @param other: Another DStream having the same interval (i.e., slideDuration) as this DStream. |
Slice | Return all the RDDs between 'fromTime' to 'toTime' (both included) |
Window | Return a new DStream in which each RDD contains all the elements seen in a sliding window of time over this DStream. @param windowDuration: width of the window; must be a multiple of this DStream's batching interval @param slideDuration: sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream's batching interval |
ReduceByWindow | Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream. If `invReduceFunc` is not None, the reduction is done incrementally using the old window's reduced value: 1. reduce the new values that entered the window (e.g., adding new counts) 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts) This is more efficient than when `invReduceFunc` is None. |
CountByWindow | Return a new DStream in which each RDD has a single element generated by counting the number of elements in a window over this DStream. windowDuration and slideDuration are as defined in the window() operation. This is equivalent to window(windowDuration, slideDuration).count(), but will be more efficient if window is large. |
CountByValueAndWindow | Return a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream. |
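
As a rough illustration of the operations above, here is a minimal sketch. It assumes an existing StreamingContext `ssc`; the host/port values and the use of seconds as the duration unit for the window parameters are assumptions, not guarantees of the exact overloads.

```csharp
// A minimal sketch, assuming an existing StreamingContext `ssc`; the host/port
// and the seconds-based window durations are assumptions.
var lines = ssc.SocketTextStream("localhost", 9999);

// Transform the stream using the operations listed above.
var words = lines.FlatMap(l => l.Split(' '));
var longWords = words.Filter(w => w.Length > 3);

// Count elements over a sliding window: a 30-unit window sliding every 10 units,
// both multiples of the batch interval as required above.
var counts = longWords.CountByWindow(30, 10);

// Print the first elements of each generated RDD, then start the computation.
counts.Print();
ssc.Start();
ssc.AwaitTermination();
```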
###Microsoft.Spark.CSharp.Streaming.EventHubsUtils ####Summary
Utility for creating streams from Microsoft Azure EventHubs. A hedged usage sketch follows the methods table.
####Methods
Name | Description |
---|---|
CreateUnionStream | Create a unioned EventHubs stream that receives data from Microsoft Azure EventHubs. The unioned stream receives messages from all partitions of the EventHubs. |
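
A heavily hedged sketch of CreateUnionStream follows; the setting keys and the use of a Dictionary<string, string> for the EventHubs parameters are assumptions based on common EventHubs receiver conventions, not confirmed by this page.

```csharp
// A hedged sketch: the setting keys and the Dictionary<string, string> parameter
// type are assumptions; `ssc` is an existing StreamingContext.
var eventhubsParams = new Dictionary<string, string>
{
    { "eventhubs.namespace", "<namespace>" },
    { "eventhubs.name", "<eventhub-name>" },
    { "eventhubs.policyname", "<policy-name>" },
    { "eventhubs.policykey", "<policy-key>" },
    { "eventhubs.partition.count", "4" }
};

// Receives messages from all partitions of the EventHub as one unioned stream.
var unionStream = EventHubsUtils.CreateUnionStream(ssc, eventhubsParams);
```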
###Microsoft.Spark.CSharp.Streaming.KafkaUtils ####Summary
Utilities for creating Kafka input streams. A hedged usage sketch follows the methods table.
####Methods
Name | Description |
---|---|
CreateStream | Create an input stream that pulls messages from a Kafka Broker. |
CreateStream | Create an input stream that pulls messages from a Kafka Broker. |
CreateDirectStream | Create an input stream that directly pulls messages from a Kafka broker at specific offsets. This is not a receiver-based Kafka input stream; it directly pulls messages from Kafka in each batch duration and processes them without storing them. This does not use Zookeeper to store offsets; the consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext; the information on consumed offsets can be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
CreateDirectStream``1 | Create an input stream that directly pulls messages from a Kafka broker at specific offsets. This is not a receiver-based Kafka input stream; it directly pulls messages from Kafka in each batch duration and processes them without storing them. This does not use Zookeeper to store offsets; the consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext; the information on consumed offsets can be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
GetOffsetRange | Create offset ranges from Kafka messages when CSharpReader is enabled |
GetNumPartitionsFromConfig | Topics should contain only one topic if choosing to repartition to a configured numPartitions. TODO: move to Scala and merge into DynamicPartitionKafkaRDD.getPartitions to remove the above limitation |
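
A hedged sketch of the direct (receiver-less) approach described above; the collection types used for the topics, Kafka parameters, and starting offsets are assumptions rather than exact signatures, and the checkpoint directory is hypothetical.

```csharp
// A hedged sketch: parameter collection types are assumptions; only the behavior
// described above (direct stream, offsets tracked by the stream) is relied on.
var topics = new List<string> { "my-topic" };
var kafkaParams = new Dictionary<string, string>
{
    { "metadata.broker.list", "broker1:9092,broker2:9092" }
};
// Empty offsets: start from the broker/consumer-configured defaults.
var fromOffsets = new Dictionary<string, long>();

// Consumed offsets are tracked by the stream itself, so enable checkpointing on
// the StreamingContext to recover them after a driver failure.
ssc.Checkpoint("/tmp/checkpoint");   // hypothetical checkpoint directory
var directStream = KafkaUtils.CreateDirectStream(ssc, topics, kafkaParams, fromOffsets);
```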
###Microsoft.Spark.CSharp.Streaming.OffsetRange ####Summary
Kafka offset range
####Methods
Name | Description |
---|---|
ToString | OffsetRange string format |
###Microsoft.Spark.CSharp.Streaming.MapWithStateDStream`4 ####Summary
DStream representing the stream of data generated by the `mapWithState` operation on a pair DStream.
It also gives access to the stream of state snapshots, that is, the state data of all keys after a batch has updated them.
Type of the key
Type of the value
Type of the state data
Type of the mapped data
####Methods
Name | Description |
---|---|
StateSnapshots | Return a pair DStream where each RDD is the snapshot of the state of all the keys. |
###Microsoft.Spark.CSharp.Streaming.KeyedState`1 ####Summary
Class to hold a state instance and the timestamp when the state is updated or created.
There is no need to explicitly make this class cloneable, since serialization and deserialization in the Worker already act as a clone mechanism.
Type of the state data
###Microsoft.Spark.CSharp.Streaming.MapWithStateRDDRecord`3 ####Summary
Record storing the keyed state of a MapWithStateRDD.
Each record contains a stateMap and a sequence of records returned by the mapping function of MapWithState.
Note: there is no need to explicitly make this class cloneable, since serialization and deserialization in the Worker already act as a clone mechanism.
Type of the key
Type of the state data
Type of the mapped data
###Microsoft.Spark.CSharp.Streaming.StateSpec`4 ####Summary
Represents all the specifications of the `mapWithState` DStream transformation.
Type of the key
Type of the value
Type of the state data
Type of the mapped data
####Methods
Name | Description |
---|---|
NumPartitions | Set the number of partitions by which the state RDDs generated by `mapWithState` will be partitioned. Hash partitioning will be used. |
Timeout | Set the duration after which the state of an idle key will be removed. A key and its state are considered idle if the key has not received any data for at least the given duration. The mapping function will be called one final time on the idle states that are going to be removed, with [[org.apache.spark.streaming.State State.isTimingOut()]] set to `true` in that call. |
InitialState | Set the RDD containing the initial states that will be used by mapWithState |
###Microsoft.Spark.CSharp.Streaming.State`1 ####Summary
Class for getting and updating per-key state in the mapping function used by the `mapWithState` operation. A hedged usage sketch follows the methods table.
Type of the state
####Methods
Name | Description |
---|---|
Exists | Returns whether the state already exists |
Get | Gets the state if it exists; otherwise throws an ArgumentException. |
Update | Updates the state with a new value. |
Remove | Removes the state if it exists. |
IsTimingOut | Returns whether the state is timing out and going to be removed by the system after the current batch. |
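
A hedged sketch of a mapping function that keeps a running count per key using the State members above. Only the State<T> members (Exists, Get, Update, IsTimingOut) are taken from this table; the `(key, value, state) => mappedValue` shape of the function and how it is wired into a StateSpec/`MapWithState` call are assumptions.

```csharp
// A hedged sketch: the delegate shape and the Tuple result type are assumptions;
// the State<int> members used below are the ones documented above.
Func<string, int, State<int>, Tuple<string, int>> mappingFunc = (key, value, state) =>
{
    if (state.IsTimingOut())
    {
        // The key has been idle past the configured Timeout; its state is about
        // to be removed by the system, so emit the last known count.
        return new Tuple<string, int>(key, state.Get());
    }

    var runningCount = (state.Exists() ? state.Get() : 0) + value;
    state.Update(runningCount);   // persist the new state for this key
    return new Tuple<string, int>(key, runningCount);
};
```

The resulting MapWithStateDStream additionally exposes StateSnapshots(), documented above, to observe the state of all keys after each batch.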
###Microsoft.Spark.CSharp.Streaming.PairDStreamFunctions ####Summary
Operations only available to DStreams of (key, value) Tuples. A usage sketch follows the methods table.
####Methods
Name | Description |
---|---|
ReduceByKey``2 | Return a new DStream by applying ReduceByKey to each RDD. |
CombineByKey``3 | Return a new DStream by applying combineByKey to each RDD. |
PartitionBy``2 | Return a new DStream in which each RDD is partitioned by numPartitions. |
MapValues``3 | Return a new DStream by applying a map function to the value of each key-value pair in this DStream without changing the key. |
FlatMapValues``3 | Return a new DStream by applying a flatmap function to the value of each key-value pair in this DStream without changing the key. |
GroupByKey``2 | Return a new DStream by applying groupByKey on each RDD. |
GroupWith``3 | Return a new DStream by applying 'cogroup' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
Join``3 | Return a new DStream by applying 'join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
LeftOuterJoin``3 | Return a new DStream by applying 'left outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
RightOuterJoin``3 | Return a new DStream by applying 'right outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
FullOuterJoin``3 | Return a new DStream by applying 'full outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
GroupByKeyAndWindow``2 | Return a new DStream by applying `GroupByKey` over a sliding window. Similar to `DStream.GroupByKey()`, but applies it over a sliding window. |
ReduceByKeyAndWindow``2 | Return a new DStream by applying incremental `reduceByKey` over a sliding window. The reduced value over a new window is calculated using the old window's reduced value: 1. reduce the new values that entered the window (e.g., adding new counts) 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts) If `invFunc` is None, all the RDDs in the window are reduced, which can be slower than providing `invFunc`. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
MapWithState``4 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
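
A minimal sketch of the pair operations above. It assumes `words` is an existing DStream<string>, that pair elements are Tuple<K, V> as the summary suggests, and that the window/slide durations are given in seconds; these are assumptions, not exact signatures.

```csharp
// A minimal sketch; see the assumptions noted above.
var pairs = words.Map(w => new Tuple<string, int>(w, 1));

// Per-batch counts.
var counts = pairs.ReduceByKey((a, b) => a + b);

// Incremental windowed counts: reduce values entering the window and "inverse
// reduce" values leaving it, which is cheaper for large windows.
var windowedCounts = pairs.ReduceByKeyAndWindow(
    (a, b) => a + b,   // reduce new values entering the window
    (a, b) => a - b,   // subtract old values leaving the window
    30,                // window duration (assumed seconds)
    10);               // slide duration (assumed seconds)
```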
###Microsoft.Spark.CSharp.Streaming.CSharpInputDStreamUtils ####Summary
Utilities for C# input streams.
####Methods
Name | Description |
---|---|
CreateStream``1 | Create an input stream whose data injection is controlled by user C# code |
CreateStream``1 | Create an input stream whose data injection is controlled by user C# code |
###Microsoft.Spark.CSharp.Streaming.StreamingContext ####Summary
Main entry point for Spark Streaming functionality. It provides methods used to create
[[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be created
either by providing a Spark master URL and an appName, or from an org.apache.spark.SparkConf
configuration (see the core Spark documentation), or from an existing org.apache.spark.SparkContext.
The associated SparkContext can be accessed using `context.sparkContext`. After
creating and transforming DStreams, the streaming computation can be started and stopped
using `context.start()` and `context.stop()`, respectively.
`context.awaitTermination()` allows the current thread to wait for the termination
of the context by `stop()` or by an exception. A hedged usage sketch follows the methods table.
####Methods
Name | Description |
---|---|
GetOrCreate | Either recreate a StreamingContext from checkpoint data or create a new StreamingContext. If checkpoint data exists in the provided `checkpointPath`, then StreamingContext will be recreated from the checkpoint data. If the data does not exist, then the provided setupFunc will be used to create a JavaStreamingContext. |
Start | Start the execution of the streams. |
Stop | Stop the execution of the streams. |
Remember | Set each DStream in this context to remember the RDDs it generated in the last given duration. DStreams remember RDDs only for a limited duration of time and release them for garbage collection. This method allows the developer to specify how long to remember the RDDs (if the developer wishes to query old data outside the DStream computation). |
Checkpoint | Set the context to periodically checkpoint the DStream operations for driver fault-tolerance. |
SocketTextStream | Create an input stream from a TCP source hostname:port. Data is received using a TCP socket and the received bytes are interpreted as UTF8-encoded, ``\\n``-delimited lines. |
TextFileStream | Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as text files. Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored. |
AwaitTermination | Wait for the execution to stop. |
AwaitTerminationOrTimeout | Wait for the execution to stop, or for the given timeout to expire. |
Transform``1 | Create a new DStream in which each RDD is generated by applying a function on RDDs of the DStreams. The order of the JavaRDDs in the transform function parameter will be the same as the order of corresponding DStreams in the list. |
Union``1 | Create a unified DStream from multiple DStreams of the same type and same slide duration. |
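
A hedged end-to-end sketch of driver setup with the methods above. The StreamingContext constructor shape shown here (a SparkContext plus a batch interval, assumed to be in milliseconds) and the checkpoint path are assumptions; GetOrCreate, Checkpoint, SocketTextStream, Start and AwaitTermination are the documented members.

```csharp
// A hedged sketch: the constructor arguments (SparkContext + batch interval in
// milliseconds) and the checkpoint path are assumptions.
var ssc = StreamingContext.GetOrCreate(
    "/tmp/streaming-checkpoint",               // hypothetical checkpoint path
    () =>
    {
        var context = new StreamingContext(sparkContext, 2000);
        context.Checkpoint("/tmp/streaming-checkpoint");

        var lines = context.SocketTextStream("localhost", 9999);
        lines.Print();

        return context;
    });

ssc.Start();
ssc.AwaitTermination();
```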
###Microsoft.Spark.CSharp.Streaming.TransformedDStream`1 ####Summary
TransformedDStream is a DStream generated by a C# function that transforms
each RDD of a parent DStream into another RDD.
Multiple consecutive transformations of a DStream can be combined into
one transformation.