
Pull apache spark #9

Merged
merged 4,053 commits into from May 1, 2017
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Mar 28, 2017

  1. [SPARK-19088][SQL] Optimize sequence type deserialization codegen

    ## What changes were proposed in this pull request?
    
    Optimization of arbitrary Scala sequence deserialization introduced by #16240.
    
    The previous implementation constructed an array which was then converted by `to`. This required two passes in most cases.
    
    This implementation attempts to remedy that by using `Builder`s provided by the `newBuilder` method on every Scala collection's companion object to build the resulting collection directly.
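
    To make the difference concrete, here is a minimal hand-written Scala sketch (for the 2.11/2.12 collections API; this is not the generated code, and the variable names are illustrative):

    ```scala
    import scala.collection.mutable.WrappedArray

    val input = Array(1, 2, 3)

    // Old path: wrap the array first, then convert in a second pass via `to` (CanBuildFrom).
    val viaTwoPasses: List[Int] = WrappedArray.make[Int](input).to[List]

    // New path: obtain a Builder from the companion object and append elements in a single pass.
    val builder = List.newBuilder[Int]
    builder.sizeHint(input.length)
    input.foreach(builder += _)
    val viaBuilder: List[Int] = builder.result()
    ```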
    
    Example codegen for simple `List` (obtained using `Seq(List(1)).toDS().map(identity).queryExecution.debug.codegen`):
    
    Before:
    
    ```
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIterator(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 006 */   private Object[] references;
    /* 007 */   private scala.collection.Iterator[] inputs;
    /* 008 */   private scala.collection.Iterator inputadapter_input;
    /* 009 */   private boolean deserializetoobject_resultIsNull;
    /* 010 */   private java.lang.Object[] deserializetoobject_argValue;
    /* 011 */   private boolean MapObjects_loopIsNull1;
    /* 012 */   private int MapObjects_loopValue0;
    /* 013 */   private boolean deserializetoobject_resultIsNull1;
    /* 014 */   private scala.collection.generic.CanBuildFrom deserializetoobject_argValue1;
    /* 015 */   private UnsafeRow deserializetoobject_result;
    /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder;
    /* 017 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter;
    /* 018 */   private scala.collection.immutable.List mapelements_argValue;
    /* 019 */   private UnsafeRow mapelements_result;
    /* 020 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder;
    /* 021 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter;
    /* 022 */   private scala.collection.immutable.List serializefromobject_argValue;
    /* 023 */   private UnsafeRow serializefromobject_result;
    /* 024 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
    /* 025 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
    /* 026 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter;
    /* 027 */
    /* 028 */   public GeneratedIterator(Object[] references) {
    /* 029 */     this.references = references;
    /* 030 */   }
    /* 031 */
    /* 032 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 033 */     partitionIndex = index;
    /* 034 */     this.inputs = inputs;
    /* 035 */     inputadapter_input = inputs[0];
    /* 036 */
    /* 037 */     deserializetoobject_result = new UnsafeRow(1);
    /* 038 */     this.deserializetoobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result, 32);
    /* 039 */     this.deserializetoobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder, 1);
    /* 040 */
    /* 041 */     mapelements_result = new UnsafeRow(1);
    /* 042 */     this.mapelements_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result, 32);
    /* 043 */     this.mapelements_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder, 1);
    /* 044 */
    /* 045 */     serializefromobject_result = new UnsafeRow(1);
    /* 046 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32);
    /* 047 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
    /* 048 */     this.serializefromobject_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
    /* 049 */
    /* 050 */   }
    /* 051 */
    /* 052 */   protected void processNext() throws java.io.IOException {
    /* 053 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 054 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 055 */       ArrayData inputadapter_value = inputadapter_row.getArray(0);
    /* 056 */
    /* 057 */       deserializetoobject_resultIsNull = false;
    /* 058 */
    /* 059 */       if (!deserializetoobject_resultIsNull) {
    /* 060 */         ArrayData deserializetoobject_value3 = null;
    /* 061 */
    /* 062 */         if (!false) {
    /* 063 */           Integer[] deserializetoobject_convertedArray = null;
    /* 064 */           int deserializetoobject_dataLength = inputadapter_value.numElements();
    /* 065 */           deserializetoobject_convertedArray = new Integer[deserializetoobject_dataLength];
    /* 066 */
    /* 067 */           int deserializetoobject_loopIndex = 0;
    /* 068 */           while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
    /* 069 */             MapObjects_loopValue0 = (int) (inputadapter_value.getInt(deserializetoobject_loopIndex));
    /* 070 */             MapObjects_loopIsNull1 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
    /* 071 */
    /* 072 */             if (MapObjects_loopIsNull1) {
    /* 073 */               throw new RuntimeException(((java.lang.String) references[0]));
    /* 074 */             }
    /* 075 */             if (false) {
    /* 076 */               deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
    /* 077 */             } else {
    /* 078 */               deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue0;
    /* 079 */             }
    /* 080 */
    /* 081 */             deserializetoobject_loopIndex += 1;
    /* 082 */           }
    /* 083 */
    /* 084 */           deserializetoobject_value3 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray);
    /* 085 */         }
    /* 086 */         boolean deserializetoobject_isNull2 = true;
    /* 087 */         java.lang.Object[] deserializetoobject_value2 = null;
    /* 088 */         if (!false) {
    /* 089 */           deserializetoobject_isNull2 = false;
    /* 090 */           if (!deserializetoobject_isNull2) {
    /* 091 */             Object deserializetoobject_funcResult = null;
    /* 092 */             deserializetoobject_funcResult = deserializetoobject_value3.array();
    /* 093 */             if (deserializetoobject_funcResult == null) {
    /* 094 */               deserializetoobject_isNull2 = true;
    /* 095 */             } else {
    /* 096 */               deserializetoobject_value2 = (java.lang.Object[]) deserializetoobject_funcResult;
    /* 097 */             }
    /* 098 */
    /* 099 */           }
    /* 100 */           deserializetoobject_isNull2 = deserializetoobject_value2 == null;
    /* 101 */         }
    /* 102 */         deserializetoobject_resultIsNull = deserializetoobject_isNull2;
    /* 103 */         deserializetoobject_argValue = deserializetoobject_value2;
    /* 104 */       }
    /* 105 */
    /* 106 */       boolean deserializetoobject_isNull1 = deserializetoobject_resultIsNull;
    /* 107 */       final scala.collection.Seq deserializetoobject_value1 = deserializetoobject_resultIsNull ? null : scala.collection.mutable.WrappedArray.make(deserializetoobject_argValue);
    /* 108 */       deserializetoobject_isNull1 = deserializetoobject_value1 == null;
    /* 109 */       boolean deserializetoobject_isNull = true;
    /* 110 */       scala.collection.immutable.List deserializetoobject_value = null;
    /* 111 */       if (!deserializetoobject_isNull1) {
    /* 112 */         deserializetoobject_resultIsNull1 = false;
    /* 113 */
    /* 114 */         if (!deserializetoobject_resultIsNull1) {
    /* 115 */           boolean deserializetoobject_isNull6 = false;
    /* 116 */           final scala.collection.generic.CanBuildFrom deserializetoobject_value6 = false ? null : scala.collection.immutable.List.canBuildFrom();
    /* 117 */           deserializetoobject_isNull6 = deserializetoobject_value6 == null;
    /* 118 */           deserializetoobject_resultIsNull1 = deserializetoobject_isNull6;
    /* 119 */           deserializetoobject_argValue1 = deserializetoobject_value6;
    /* 120 */         }
    /* 121 */
    /* 122 */         deserializetoobject_isNull = deserializetoobject_resultIsNull1;
    /* 123 */         if (!deserializetoobject_isNull) {
    /* 124 */           Object deserializetoobject_funcResult1 = null;
    /* 125 */           deserializetoobject_funcResult1 = deserializetoobject_value1.to(deserializetoobject_argValue1);
    /* 126 */           if (deserializetoobject_funcResult1 == null) {
    /* 127 */             deserializetoobject_isNull = true;
    /* 128 */           } else {
    /* 129 */             deserializetoobject_value = (scala.collection.immutable.List) deserializetoobject_funcResult1;
    /* 130 */           }
    /* 131 */
    /* 132 */         }
    /* 133 */         deserializetoobject_isNull = deserializetoobject_value == null;
    /* 134 */       }
    /* 135 */
    /* 136 */       boolean mapelements_isNull = true;
    /* 137 */       scala.collection.immutable.List mapelements_value = null;
    /* 138 */       if (!false) {
    /* 139 */         mapelements_argValue = deserializetoobject_value;
    /* 140 */
    /* 141 */         mapelements_isNull = false;
    /* 142 */         if (!mapelements_isNull) {
    /* 143 */           Object mapelements_funcResult = null;
    /* 144 */           mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
    /* 145 */           if (mapelements_funcResult == null) {
    /* 146 */             mapelements_isNull = true;
    /* 147 */           } else {
    /* 148 */             mapelements_value = (scala.collection.immutable.List) mapelements_funcResult;
    /* 149 */           }
    /* 150 */
    /* 151 */         }
    /* 152 */         mapelements_isNull = mapelements_value == null;
    /* 153 */       }
    /* 154 */
    /* 155 */       if (mapelements_isNull) {
    /* 156 */         throw new RuntimeException(((java.lang.String) references[2]));
    /* 157 */       }
    /* 158 */       serializefromobject_argValue = mapelements_value;
    /* 159 */
    /* 160 */       final ArrayData serializefromobject_value = false ? null : new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_argValue);
    /* 161 */       serializefromobject_holder.reset();
    /* 162 */
    /* 163 */       // Remember the current cursor so that we can calculate how many bytes are
    /* 164 */       // written later.
    /* 165 */       final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 166 */
    /* 167 */       if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 168 */         final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 169 */         // grow the global buffer before writing data.
    /* 170 */         serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 171 */         ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 172 */         serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 173 */
    /* 174 */       } else {
    /* 175 */         final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 176 */         serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4);
    /* 177 */
    /* 178 */         for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 179 */           if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 180 */             serializefromobject_arrayWriter.setNullInt(serializefromobject_index);
    /* 181 */           } else {
    /* 182 */             final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
    /* 183 */             serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 184 */           }
    /* 185 */         }
    /* 186 */       }
    /* 187 */
    /* 188 */       serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 189 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 190 */       append(serializefromobject_result);
    /* 191 */       if (shouldStop()) return;
    /* 192 */     }
    /* 193 */   }
    /* 194 */ }
    ```
    
    After:
    
    ```
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIterator(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 006 */   private Object[] references;
    /* 007 */   private scala.collection.Iterator[] inputs;
    /* 008 */   private scala.collection.Iterator inputadapter_input;
    /* 009 */   private boolean CollectObjects_loopIsNull1;
    /* 010 */   private int CollectObjects_loopValue0;
    /* 011 */   private UnsafeRow deserializetoobject_result;
    /* 012 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder;
    /* 013 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter;
    /* 014 */   private scala.collection.immutable.List mapelements_argValue;
    /* 015 */   private UnsafeRow mapelements_result;
    /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder;
    /* 017 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter;
    /* 018 */   private scala.collection.immutable.List serializefromobject_argValue;
    /* 019 */   private UnsafeRow serializefromobject_result;
    /* 020 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
    /* 021 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
    /* 022 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter;
    /* 023 */
    /* 024 */   public GeneratedIterator(Object[] references) {
    /* 025 */     this.references = references;
    /* 026 */   }
    /* 027 */
    /* 028 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 029 */     partitionIndex = index;
    /* 030 */     this.inputs = inputs;
    /* 031 */     inputadapter_input = inputs[0];
    /* 032 */
    /* 033 */     deserializetoobject_result = new UnsafeRow(1);
    /* 034 */     this.deserializetoobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result, 32);
    /* 035 */     this.deserializetoobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder, 1);
    /* 036 */
    /* 037 */     mapelements_result = new UnsafeRow(1);
    /* 038 */     this.mapelements_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result, 32);
    /* 039 */     this.mapelements_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder, 1);
    /* 040 */
    /* 041 */     serializefromobject_result = new UnsafeRow(1);
    /* 042 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32);
    /* 043 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
    /* 044 */     this.serializefromobject_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
    /* 045 */
    /* 046 */   }
    /* 047 */
    /* 048 */   protected void processNext() throws java.io.IOException {
    /* 049 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 050 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 051 */       ArrayData inputadapter_value = inputadapter_row.getArray(0);
    /* 052 */
    /* 053 */       scala.collection.immutable.List deserializetoobject_value = null;
    /* 054 */
    /* 055 */       if (!false) {
    /* 056 */         int deserializetoobject_dataLength = inputadapter_value.numElements();
    /* 057 */         scala.collection.mutable.Builder CollectObjects_builderValue2 = scala.collection.immutable.List$.MODULE$.newBuilder();
    /* 058 */         CollectObjects_builderValue2.sizeHint(deserializetoobject_dataLength);
    /* 059 */
    /* 060 */         int deserializetoobject_loopIndex = 0;
    /* 061 */         while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
    /* 062 */           CollectObjects_loopValue0 = (int) (inputadapter_value.getInt(deserializetoobject_loopIndex));
    /* 063 */           CollectObjects_loopIsNull1 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
    /* 064 */
    /* 065 */           if (CollectObjects_loopIsNull1) {
    /* 066 */             throw new RuntimeException(((java.lang.String) references[0]));
    /* 067 */           }
    /* 068 */           if (false) {
    /* 069 */             CollectObjects_builderValue2.$plus$eq(null);
    /* 070 */           } else {
    /* 071 */             CollectObjects_builderValue2.$plus$eq(CollectObjects_loopValue0);
    /* 072 */           }
    /* 073 */
    /* 074 */           deserializetoobject_loopIndex += 1;
    /* 075 */         }
    /* 076 */
    /* 077 */         deserializetoobject_value = (scala.collection.immutable.List) CollectObjects_builderValue2.result();
    /* 078 */       }
    /* 079 */
    /* 080 */       boolean mapelements_isNull = true;
    /* 081 */       scala.collection.immutable.List mapelements_value = null;
    /* 082 */       if (!false) {
    /* 083 */         mapelements_argValue = deserializetoobject_value;
    /* 084 */
    /* 085 */         mapelements_isNull = false;
    /* 086 */         if (!mapelements_isNull) {
    /* 087 */           Object mapelements_funcResult = null;
    /* 088 */           mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
    /* 089 */           if (mapelements_funcResult == null) {
    /* 090 */             mapelements_isNull = true;
    /* 091 */           } else {
    /* 092 */             mapelements_value = (scala.collection.immutable.List) mapelements_funcResult;
    /* 093 */           }
    /* 094 */
    /* 095 */         }
    /* 096 */         mapelements_isNull = mapelements_value == null;
    /* 097 */       }
    /* 098 */
    /* 099 */       if (mapelements_isNull) {
    /* 100 */         throw new RuntimeException(((java.lang.String) references[2]));
    /* 101 */       }
    /* 102 */       serializefromobject_argValue = mapelements_value;
    /* 103 */
    /* 104 */       final ArrayData serializefromobject_value = false ? null : new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_argValue);
    /* 105 */       serializefromobject_holder.reset();
    /* 106 */
    /* 107 */       // Remember the current cursor so that we can calculate how many bytes are
    /* 108 */       // written later.
    /* 109 */       final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 110 */
    /* 111 */       if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 112 */         final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 113 */         // grow the global buffer before writing data.
    /* 114 */         serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 115 */         ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 116 */         serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 117 */
    /* 118 */       } else {
    /* 119 */         final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 120 */         serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 4);
    /* 121 */
    /* 122 */         for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 123 */           if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 124 */             serializefromobject_arrayWriter.setNullInt(serializefromobject_index);
    /* 125 */           } else {
    /* 126 */             final int serializefromobject_element = serializefromobject_value.getInt(serializefromobject_index);
    /* 127 */             serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 128 */           }
    /* 129 */         }
    /* 130 */       }
    /* 131 */
    /* 132 */       serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 133 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 134 */       append(serializefromobject_result);
    /* 135 */       if (shouldStop()) return;
    /* 136 */     }
    /* 137 */   }
    /* 138 */ }
    ```
    
    Benchmark results before:
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_112-b15 on Linux 4.8.13-1-ARCH
    AMD A10-4600M APU with Radeon(tm) HD Graphics
    collect:                                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Seq                                            269 /  370          0.0      269125.8       1.0X
    List                                           154 /  176          0.0      154453.5       1.7X
    mutable.Queue                                  210 /  233          0.0      209691.6       1.3X
    ```
    
    Benchmark results after:
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_112-b15 on Linux 4.8.13-1-ARCH
    AMD A10-4600M APU with Radeon(tm) HD Graphics
    collect:                                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Seq                                            255 /  316          0.0      254697.3       1.0X
    List                                           152 /  177          0.0      152410.0       1.7X
    mutable.Queue                                  213 /  235          0.0      213470.0       1.2X
    ```
    
    ## How was this patch tested?
    
    ```bash
    ./build/mvn -DskipTests clean package && ./dev/run-tests
    ```
    
    Additionally in Spark Shell:
    
    ```scala
    case class QueueClass(q: scala.collection.immutable.Queue[Int])
    
    spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect
    ```
    
    Author: Michal Senkyr <[email protected]>
    
    Closes #16541 from michalsenkyr/dataset-seq-builder.
    michalsenkyr authored and cloud-fan committed Mar 28, 2017
    Commit: 6c70a38
  2. [SPARK-20119][TEST-MAVEN] Fix the test case failure in DataSourceScanExecRedactionSuite
    
    ### What changes were proposed in this pull request?
    Changed the pattern to match the first n characters in the location field so that the string truncation does not affect it.
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <[email protected]>
    
    Closes #17448 from gatorsmile/fixTestCAse.
    gatorsmile authored and hvanhovell committed Mar 28, 2017
    Commit: a9abff2
  3. [SPARK-20094][SQL] Preventing push down of IN subquery to Join operator

    ## What changes were proposed in this pull request?
    
    TPCDS q45 fails because:
    `ReorderJoin` collects all predicates and tries to put them into the join condition when creating an ordered join. If a predicate with an IN subquery (`ListQuery`) ends up in a join condition instead of a filter condition, `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the subquery to an `ExistenceJoin`, and thus results in an error.
    
    We should prevent push down of IN subquery to Join operator.
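
    For illustration only (this is not the actual TPCDS q45 text; the table and column names are hypothetical), a query of roughly this shape hits the problem once the IN-subquery predicate is folded into a join condition:

    ```scala
    // Illustrative shape only -- not TPCDS q45.
    spark.sql("""
      SELECT a.id, b.value
      FROM   a JOIN b ON a.id = b.id
      WHERE  a.key IN (SELECT key FROM c)
    """)
    ```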
    
    ## How was this patch tested?
    
    Add a new test case in `FilterPushdownSuite`.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17428 from wzhfy/noSubqueryInJoinCond.
    wzhfy authored and hvanhovell committed Mar 28, 2017
    Commit: 91559d2
  4. [SPARK-20124][SQL] Join reorder should keep the same order of final project attributes
    
    ## What changes were proposed in this pull request?
    
    The join reorder algorithm should keep exactly the same order of output attributes in the top project.
    For example, if the user selects a, b, c, then after reordering we should still output a, b, c in that order, not b, a, c or any other permutation.
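
    A small spark-shell sketch of the expected behaviour (the DataFrames and column names are made up):

    ```scala
    // Three tiny DataFrames joined on "id" (assumes the spark-shell implicits are in scope).
    val t1 = Seq((1, "a1")).toDF("id", "a")
    val t2 = Seq((1, "b1")).toDF("id", "b")
    val t3 = Seq((1, "c1")).toDF("id", "c")

    // Whatever join order the optimizer picks, the user-specified projection order must survive.
    val df = t1.join(t2, "id").join(t3, "id").select("a", "b", "c")
    assert(df.columns.sameElements(Array("a", "b", "c")))
    ```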
    
    ## How was this patch tested?
    
    A new test case is added in `JoinReorderSuite`.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17453 from wzhfy/keepOrderInProject.
    wzhfy authored and cloud-fan committed Mar 28, 2017
    Commit: 4fcc214
  5. [SPARK-20126][SQL] Remove HiveSessionState

    ## What changes were proposed in this pull request?
    Commit ea36116 moved most of the logic from the SessionState classes into an accompanying builder. This makes the existence of the `HiveSessionState` redundant. This PR removes the `HiveSessionState`.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17457 from hvanhovell/SPARK-20126.
    hvanhovell authored and cloud-fan committed Mar 28, 2017
    Commit: f82461f
  6. [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
    
    ## What changes were proposed in this pull request?
    
    In the current Spark on YARN code, we obtain tokens from the provided services but do not add them to the current user's credentials. As a result, all subsequent operations against these services still require a TGT rather than delegation tokens. This is unnecessary since we already have the tokens, and it also leads to failures in user impersonation scenarios, because the TGT is granted to the real user, not the proxy user.

    This change puts all the tokens into the current UGI, so that subsequent operations against these services honor the tokens rather than the TGT, which also addresses the proxy-user issue mentioned above.
    
    ## How was this patch tested?
    
    Verified locally in a secure cluster.
    
    vanzin tgravescs mridulm  dongjoon-hyun please help to review, thanks a lot.
    
    Author: jerryshao <[email protected]>
    
    Closes #17335 from jerryshao/SPARK-19995.
    jerryshao authored and Marcelo Vanzin committed Mar 28, 2017
    Commit: 17eddb3
  7. [SPARK-20125][SQL] Dataset of type option of map does not work

    ## What changes were proposed in this pull request?
    
    When we build the deserializer expression for a map type, we use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap` and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.
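
    A minimal spark-shell reproduction of the reported shape (the case class is hypothetical):

    ```scala
    // A Dataset whose element type wraps a Map in an Option.
    case class MapHolder(m: Option[Map[String, Int]])

    // Assumes the spark-shell implicits; previously this failed when building the deserializer.
    Seq(MapHolder(Some(Map("a" -> 1))), MapHolder(None)).toDS().collect()
    ```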
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17454 from cloud-fan/map.
    cloud-fan authored and liancheng committed Mar 28, 2017
    Commit: d4fac41
  8. [SPARK-19868] Conflicting TaskSetManagers lead to Spark being stopped

    ## What changes were proposed in this pull request?
    
    We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.
    
    Author: liujianhui <liujianhui@didichuxing>
    
    Closes #17208 from liujianhuiouc/spark-19868.
    liujianhui authored and kayousterhout committed Mar 28, 2017
    Commit: 92e385e
  9. [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
    
    Fix bug: DecisionTreeModel can't recognize impurity "Gini" when loading.
    
    TODO:
    + [x] add unit test
    + [x] fix the bug
    
    Author: 颜发才(Yan Facai) <[email protected]>
    
    Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
    facaiy authored and jkbradley committed Mar 28, 2017
    Commit: 7d432af

Commits on Mar 29, 2017

  1. [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest

    ## What changes were proposed in this pull request?
    
    A pyspark wrapper for spark.ml.stat.ChiSquareTest
    
    ## How was this patch tested?
    
    unit tests
    doctests
    
    Author: Bago Amirbekian <[email protected]>
    
    Closes #17421 from MrBago/chiSquareTestWrapper.
    MrBago authored and jkbradley committed Mar 29, 2017
    Commit: a5c8770
  2. [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
    
    ## What changes were proposed in this pull request?
    It is not super intuitive how to update SQLMetric on the driver side. This patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, and adds documentation to make it more obvious.
    
    ## How was this patch tested?
    Updated a test case to use this method.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17464 from rxin/SPARK-20134.
    rxin committed Mar 29, 2017
    Commit: 9712bd3
  3. [SPARK-19556][CORE] Do not encrypt block manager data in memory.

    This change modifies the way block data is encrypted to make the more
    common cases faster, while penalizing an edge case. As a side effect
    of the change, all data that goes through the block manager is now
    encrypted only when needed, including the previous path (broadcast
    variables) where that did not happen.
    
    The way the change works is by not encrypting data that is stored in
    memory; so if a serialized block is in memory, it will only be encrypted
    once it is evicted to disk.
    
    The penalty comes when transferring that encrypted data from disk. If the
    data ends up in memory again, it is as efficient as before; but if the
    evicted block needs to be transferred directly to a remote executor, then
    there's now a performance penalty, since the code now uses a custom
    FileRegion implementation to decrypt the data before transferring.
    
    This also means that block data transferred between executors now is
    not encrypted (and thus relies on the network library encryption support
    for secrecy). Shuffle blocks are still transferred in encrypted form,
    since they're handled in a slightly different way by the code. This also
    keeps compatibility with existing external shuffle services, which transfer
    encrypted shuffle blocks, and avoids having to make the external service
    aware of encryption at all.
    
    The serialization and deserialization APIs in the SerializerManager now
    do not do encryption automatically; callers need to explicitly wrap their
    streams with an appropriate crypto stream before using those.
    
    As a result of these changes, some of the workarounds added in SPARK-19520
    are removed here.
    
    Testing: a new trait ("EncryptionFunSuite") was added that provides an easy
    way to run a test twice, with encryption on and off; broadcast, block manager
    and caching tests were modified to use this new trait so that the existing
    tests exercise both encrypted and non-encrypted paths. I also ran some
    applications with encryption turned on to verify that they still work,
    including streaming tests that failed without the fix for SPARK-19520.
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #17295 from vanzin/SPARK-19556.
    Marcelo Vanzin authored and cloud-fan committed Mar 29, 2017
    Commit: b56ad2b
  4. [SPARK-20059][YARN] Use the correct classloader for HBaseCredentialProvider
    
    ## What changes were proposed in this pull request?
    
    Currently we use the system classloader to find HBase jars; if they are specified via `--jars`, the lookup fails with a ClassNotFound issue. This changes the code to use the child classloader instead.

    This also puts the added jars and the main jar onto the classpath of the submitted application in yarn cluster mode; otherwise HBase jars specified with `--jars` are never honored in cluster mode, and fetching tokens on the client side always fails.
    
    ## How was this patch tested?
    
    Unit test and local verification.
    
    Author: jerryshao <[email protected]>
    
    Closes #17388 from jerryshao/SPARK-20059.
    jerryshao authored and Marcelo Vanzin committed Mar 29, 2017
    Commit: c622a87
  5. [SPARK-19955][PYSPARK] Jenkins Python Conda based test.

    ## What changes were proposed in this pull request?
    
    Allow Jenkins Python tests to use the installed conda to test Python 2.7 support & test pip installability.
    
    ## How was this patch tested?
    
    Updated shell scripts, ran tests locally with installed conda, ran tests in Jenkins.
    
    Author: Holden Karau <[email protected]>
    
    Closes #17355 from holdenk/SPARK-19955-support-python-tests-with-conda.
    holdenk committed Mar 29, 2017
    Commit: d6ddfdf
  6. [SPARK-20048][SQL] Cloning SessionState does not clone query execution listeners
    
    ## What changes were proposed in this pull request?
    
    Bugfix following [SPARK-19540](#16826).
    Cloning SessionState does not clone query execution listeners, so a cloned session is unable to listen to events on its queries.
    
    ## How was this patch tested?
    
    - Unit test
    
    Author: Kunal Khamar <[email protected]>
    
    Closes #17379 from kunalkhamar/clone-bugfix.
    kunalkhamar authored and hvanhovell committed Mar 29, 2017
    Commit: 142f6d1
  7. [SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json
    
    ## What changes were proposed in this pull request?
    This PR adds `StructType.fromDDL` to convert a DDL-format string into a `StructType` for defining schemas in `functions.from_json`.
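
    A short sketch of the new helper (the column names are arbitrary):

    ```scala
    import org.apache.spark.sql.types.StructType

    // Build a schema from a DDL-formatted string instead of assembling StructFields by hand.
    val schema: StructType = StructType.fromDDL("a INT, b STRING")
    ```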
    
    ## How was this patch tested?
    Added tests in `JsonFunctionsSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17406 from maropu/SPARK-20009.
    maropu authored and gatorsmile committed Mar 29, 2017
    Commit: c400848
  8. [SPARK-17075][SQL][FOLLOWUP] Add Estimation of Constant Literal

    ### What changes were proposed in this pull request?
    `FalseLiteral` and `TrueLiteral` should have been eliminated by optimizer rule `BooleanSimplification`, but null literals might be added by optimizer rule `NullPropagation`. For safety, our filter estimation should handle all the eligible literal cases.
    
    Our optimizer rule BooleanSimplification is unable to remove the null literal in many cases. For example, `a < 0 or null`. Thus, we need to handle null literal in filter estimation.
    
    `Not` can be pushed down below `And` and `Or`, so we may see two consecutive `Not`s, which need to be collapsed into one. Because of the limited expression support in filter estimation, we only need to handle the case `Not(null)` to avoid incorrect errors from boolean operations on null. For details, see the matrix below.
    
    ```
    not NULL = NULL
    NULL or false = NULL
    NULL or true = true
    NULL or NULL = NULL
    NULL and false = false
    NULL and true = NULL
    NULL and NULL = NULL
    ```
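
    The same semantics can be spot-checked in the Spark shell (a quick illustration, not part of the patch):

    ```scala
    // The four columns should evaluate to NULL, true, NULL and false respectively.
    spark.sql("SELECT null OR false, null OR true, null AND true, null AND false").show()
    ```
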
    ### How was this patch tested?
    Added the test cases.
    
    Author: Xiao Li <[email protected]>
    
    Closes #17446 from gatorsmile/constantFilterEstimation.
    gatorsmile committed Mar 29, 2017
    Commit: 5c8ef37
  9. [SPARK-20120][SQL] spark-sql support silent mode

    ## What changes were proposed in this pull request?
    
    It is similar to Hive silent mode: only the query result is shown. See [Hive LanguageManual+Cli](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) and [the implementation of Hive silent mode](https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L948-L950).
    
    This PR set the Logger level to `WARN` to get similar result.
    
    ## How was this patch tested?
    
    manual tests
    
    ![manual test spark sql silent mode](https://cloud.githubusercontent.com/assets/5399861/24390165/989b7780-13b9-11e7-8496-6e68f55757e3.gif)
    
    Author: Yuming Wang <[email protected]>
    
    Closes #17449 from wangyum/SPARK-20120.
    wangyum authored and gatorsmile committed Mar 29, 2017
    Commit: fe1d6b0

Commits on Mar 30, 2017

  1. [SPARK-19088][SQL] Fix 2.10 build.

    ## What changes were proposed in this pull request?
    
    Commit 6c70a38 broke the build for Scala 2.10. The commit uses some reflection calls that are not available in Scala 2.10. This PR fixes them.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <[email protected]>
    
    Closes #17473 from ueshin/issues/SPARK-19088.
    ueshin committed Mar 30, 2017
    Commit: dd2e7d5
  2. [SPARK-20146][SQL] fix comment missing issue for thrift server

    ## What changes were proposed in this pull request?
    
    The column comment was missing while constructing the Hive TableSchema. This fix will preserve the original comment.
    
    ## How was this patch tested?
    
    I have added a new test case to test the column with/without comment.
    
    Author: bomeng <[email protected]>
    
    Closes #17470 from bomeng/SPARK-20146.
    bomeng authored and rxin committed Mar 30, 2017
    Commit: 22f07fe
  3. [SPARK-20136][SQL] Add num files and metadata operation timing to scan operator metrics
    
    ## What changes were proposed in this pull request?
    This patch adds explicit metadata operation timing and number of files in data source metrics. Those would be useful to include for performance profiling.
    
    Screenshot of a UI with this change (num files and metadata time are new metrics):
    
    <img width="321" alt="screen shot 2017-03-29 at 12 29 28 am" src="https://cloud.githubusercontent.com/assets/323388/24443272/d4ea58c0-1416-11e7-8940-ecb69375554a.png">
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17465 from rxin/SPARK-20136.
    rxin committed Mar 30, 2017
    Commit: 6097788
  4. [SPARK-20148][SQL] Extend the file commit API to allow subscribing to task commit messages
    
    ## What changes were proposed in this pull request?
    
    The internal FileCommitProtocol interface returns all task commit messages in bulk to the implementation when a job finishes. However, it is sometimes useful to access those messages before the job completes, so that the driver gets incremental progress updates before the job finishes.
    
    This adds an `onTaskCommit` listener to the internal API.
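
    A hedged sketch of how an implementation might use the new hook (the class name and logging are illustrative, and the base-class constructor shown is an assumption):

    ```scala
    import org.apache.spark.internal.io.{FileCommitProtocol, HadoopMapReduceCommitProtocol}

    // Observe task commit messages incrementally on the driver as tasks finish.
    class ObservingCommitProtocol(jobId: String, path: String)
        extends HadoopMapReduceCommitProtocol(jobId, path) {

      override def onTaskCommit(taskCommit: FileCommitProtocol.TaskCommitMessage): Unit = {
        println(s"driver observed task commit: $taskCommit")
      }
    }
    ```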
    
    ## How was this patch tested?
    
    Unit tests.
    
    cc rxin
    
    Author: Eric Liang <[email protected]>
    
    Closes #17475 from ericl/file-commit-api-ext.
    ericl authored and rxin committed Mar 30, 2017
    Commit: 7963605
  5. [MINOR][SPARKR] Add run command comment in examples

    ## What changes were proposed in this pull request?
    
    There are two examples in r folder missing the run commands.
    
    In this PR, I just add the missing comment, which is consistent with other examples.
    
    ## How was this patch tested?
    
    Manual test.
    
    Author: [email protected] <[email protected]>
    
    Closes #17474 from wangmiao1981/stat.
    wangmiao1981 authored and Felix Cheung committed Mar 30, 2017
    Commit: 471de5d
  6. [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
    
    ## What changes were proposed in this pull request?
    
    Add `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` option to `configuration.md`.
    Setting `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` can speed up [HadoopMapReduceCommitProtocol.commitJob](https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121) for many output files; a configuration sketch follows the list below.

    Cloudera's Hadoop 2.6.0-cdh5.4.0 or higher (see: https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433 and https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0) and Apache Hadoop 2.7.0 or higher support this improvement.
    
    More see:
    
    1. [MAPREDUCE-4815](https://issues.apache.org/jira/browse/MAPREDUCE-4815): Speed up FileOutputCommitter#commitJob for many output files.
    2. [MAPREDUCE-6406](https://issues.apache.org/jira/browse/MAPREDUCE-6406): Update the default version for the property mapreduce.fileoutputcommitter.algorithm.version to 2.
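
    A sketch of one way to set the option programmatically (it can equally go into spark-defaults.conf or be passed with --conf on spark-submit):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Use the v2 output committer algorithm; see MAPREDUCE-4815 / MAPREDUCE-6406 above.
    val spark = SparkSession.builder()
      .appName("commit-v2-example")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()
    ```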
    
    ## How was this patch tested?
    
    Manual test and exist tests.
    
    Author: Yuming Wang <[email protected]>
    
    Closes #17442 from wangyum/SPARK-20107.
    wangyum authored and srowen committed Mar 30, 2017
    Commit: edc87d7
  7. [SPARK-15354][CORE] Topology aware block replication strategies

    ## What changes were proposed in this pull request?
    
    Implementations of strategies for resilient block replication for different resource managers that replicate the 3-replica strategy used by HDFS, where the first replica is on an executor, the second replica within the same rack as the executor and a third replica on a different rack.
    The implementation involves providing two pluggable classes, one running in the driver that provides topology information for every host at cluster start and the second prioritizing a list of peer BlockManagerIds.
    
    The prioritization itself can be thought of an optimization problem to find a minimal set of peers that satisfy certain objectives and replicating to these peers first. The objectives can be used to express richer constraints over and above HDFS like 3-replica strategy.
    ## How was this patch tested?
    
    This patch was tested with unit tests for storage, along with new unit tests to verify prioritization behaviour.
    
    Author: Shubham Chopra <[email protected]>
    
    Closes #13932 from shubhamchopra/PrioritizerStrategy.
    shubhamchopra authored and cloud-fan committed Mar 30, 2017
    Commit: b454d44
  8. [DOCS] Docs-only improvements
    
    ## What changes were proposed in this pull request?
    
    Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
    
    ## How was this patch tested?
    
    Local build
    
    Author: Jacek Laskowski <[email protected]>
    
    Closes #17417 from jaceklaskowski/window-expression-scaladoc.
    jaceklaskowski authored and srowen committed Mar 30, 2017
    Commit: 0197262
  9. [SPARK-19999] Workaround JDK-8165231 to identify PPC64 architectures as supporting unaligned access
    
     java.nio.Bits.unaligned() does not return true for the ppc64le arch.
    see https://bugs.openjdk.java.net/browse/JDK-8165231
    ## What changes were proposed in this pull request?
    Check the architecture explicitly (a sketch follows below).
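
    A hedged sketch of the idea (`java.nio.Bits` is package-private, so the probe goes through reflection; the architecture whitelist is an assumption based on the description above):

    ```scala
    // Treat ppc64/ppc64le as supporting unaligned access even though the JDK probe
    // reports false there (JDK-8165231); otherwise fall back to the reflective probe.
    val arch = System.getProperty("os.arch", "")
    val unaligned: Boolean =
      if (arch == "ppc64" || arch == "ppc64le") {
        true
      } else {
        val m = Class.forName("java.nio.Bits").getDeclaredMethod("unaligned")
        m.setAccessible(true)
        m.invoke(null).asInstanceOf[Boolean]
      }
    ```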
    
    ## How was this patch tested?
    
    unit test
    
    Author: samelamin <[email protected]>
    Author: samelamin <[email protected]>
    
    Closes #17472 from samelamin/SPARK-19999.
    samelamin authored and srowen committed Mar 30, 2017
    Commit: 258bff2
  10. [SPARK-20096][SPARK SUBMIT][MINOR] Expose the right queue name instead of null when set by --conf or a configuration file
    
    ## What changes were proposed in this pull request?
    
    When submitting apps with -v or --verbose, we should print the right queue name, but if a queue name is set with `spark.yarn.queue` via --conf or in spark-defaults.conf, we just get `null` for the queue in the parsed arguments.
    ```
    bin/spark-shell -v --conf spark.yarn.queue=thequeue
    Using properties file: /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf
    ....
    Adding default property: spark.yarn.queue=default
    Parsed arguments:
      master                  yarn
      deployMode              client
      ...
      queue                   null
      ....
      verbose                 true
    Spark properties used, including those specified through
     --conf and those from the properties file /home/hadoop/spark-2.1.0-bin-apache-hdp2.7.3/conf/spark-defaults.conf:
      spark.yarn.queue -> thequeue
      ....
    ```
    ## How was this patch tested?
    
    Unit tests and local verification.
    
    Author: Kent Yao <[email protected]>
    
    Closes #17430 from yaooqinn/SPARK-20096.
    yaooqinn authored and srowen committed Mar 30, 2017
    Commit: e9d268f
  11. [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation
    
    Fixed a few typos.
    
    There is one more I'm not sure of:
    
    ```
            Append mode uses watermark to drop old aggregation state. But the output of a
            windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
            the modes semantics, rows can be added to the Result Table only once after they are
    ```
    
    Not sure how to change `is delayed the late threshold`.
    
    Author: Seigneurin, Alexis (CONT) <[email protected]>
    
    Closes #17443 from aseigneurin/typos.
    Seigneurin, Alexis (CONT) authored and srowen committed Mar 30, 2017
    Commit: 669a11b
  12. [SPARK-20127][CORE] Fix a few warnings reported by IntelliJ IDEA
    
    ## What changes were proposed in this pull request?
    A few changes related to IntelliJ IDEA inspections.
    
    ## How was this patch tested?
    Changes were tested by existing unit tests
    
    Author: Denis Bolshakov <[email protected]>
    
    Closes #17458 from dbolshak/SPARK-20127.
    Denis Bolshakov authored and srowen committed Mar 30, 2017
    Commit: 5e00a5d
  13. [SPARK-20121][SQL] simplify NullPropagation with NullIntolerant

    ## What changes were proposed in this pull request?
    
    Instead of iterating all expressions that can return null for null inputs, we can just check `NullIntolerant`.
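
    A hedged sketch of the simplification (illustrative only, not the actual optimizer rule):

    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, NullIntolerant}

    // If an expression cannot tolerate null inputs and one of its children is a null
    // literal, the whole expression folds to null of the same data type.
    def foldNullIntolerant(e: Expression): Expression = e match {
      case ni: Expression with NullIntolerant
          if ni.children.exists { case Literal(null, _) => true; case _ => false } =>
        Literal.create(null, ni.dataType)
      case other => other
    }
    ```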
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17450 from cloud-fan/null.
    cloud-fan authored and gatorsmile committed Mar 30, 2017
    Commit: c734fc5

Commits on Mar 31, 2017

  1. [SPARK-20151][SQL] Account for partition pruning in scan metadataTime metrics
    
    ## What changes were proposed in this pull request?
    After SPARK-20136, we report metadata timing metrics in scan operator. However, that timing metric doesn't include one of the most important part of metadata, which is partition pruning. This patch adds that time measurement to the scan metrics.
    
    ## How was this patch tested?
    N/A - I tried adding a test in SQLMetricsSuite but it was extremely convoluted to the point that I'm not sure if this is worth it.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17476 from rxin/SPARK-20151.
    rxin committed Mar 31, 2017
    Commit: a8a765b
  2. [SPARK-20164][SQL] AnalysisException not tolerant of null query plan.

    ## What changes were proposed in this pull request?
    
    The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
    `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
    The fix is to add a `null` check in `getMessage`.
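
    A hedged sketch of the shape of the fix (simplified; not the actual `AnalysisException` code):

    ```scala
    // Fall back to the bare message when no plan is available instead of dereferencing null.
    def buildMessage(message: String, plan: Option[String]): String = plan match {
      case Some(p) => s"$message;\n$p"
      case None    => message
    }
    ```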
    
    ## How was this patch tested?
    
    - Unit test
    
    Author: Kunal Khamar <[email protected]>
    
    Closes #17486 from kunalkhamar/spark-20164.
    kunalkhamar authored and gatorsmile committed Mar 31, 2017
    Commit: 254877c
  3. [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from history files.
    
    ## What changes were proposed in this pull request?
    
    Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
    
    ## How was this patch tested?
    
    Current History UI tests cover use of the history file.
    
    Author: Ryan Blue <[email protected]>
    
    Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
    rdblue authored and Marcelo Vanzin committed Mar 31, 2017
    Commit: c4c03ee
  4. [SPARK-20160][SQL] Move ParquetConversions and OrcConversions Out Of HiveSessionCatalog
    
    ### What changes were proposed in this pull request?
    `ParquetConversions` and `OrcConversions` should be treated as regular `Analyzer` rules; it is not reasonable for them to be part of `HiveSessionCatalog`. This PR also combines the two rules `ParquetConversions` and `OrcConversions` into a new rule `RelationConversions`.
    
    After moving these two rules out of HiveSessionCatalog, the next step is to clean up, rename and move `HiveMetastoreCatalog` because it is not related to the hive package any more.
    
    ### How was this patch tested?
    The existing test cases
    
    Author: Xiao Li <[email protected]>
    
    Closes #17484 from gatorsmile/cleanup.
    gatorsmile authored and cloud-fan committed Mar 31, 2017
    Commit: b2349e6
  5. [SPARK-20165][SS] Resolve state encoder's deserializer in driver in FlatMapGroupsWithStateExec
    
    ## What changes were proposed in this pull request?
    
    - Encoder's deserializer must be resolved at the driver where the class is defined. Otherwise there are corner cases using nested classes where resolving at the executor can fail.
    
    - Fixed flaky test related to processing time timeout. The flakiness is caused because the test thread (that adds data to memory source) has a race condition with the streaming query thread. When testing the manual clock, the goal is to add data and increment clock together atomically, such that a trigger sees new data AND updated clock simultaneously (both or none). This fix adds additional synchronization in when adding data; it makes sure that the streaming query thread is waiting on the manual clock to be incremented (so no batch is currently running) before adding data.
    
- Added `testQuietly` to some tests that generate a lot of error logs.
    
    ## How was this patch tested?
    Multiple runs on existing unit tests
    
    Author: Tathagata Das <[email protected]>
    
    Closes #17488 from tdas/SPARK-20165.
    tdas committed Mar 31, 2017
Commit: 567a50a

Commits on Apr 1, 2017

1. [SPARK-20177] Document about compression way has some little detail changes.
    
    ## What changes were proposed in this pull request?
    
Small documentation updates for compression-related settings:
1. `spark.eventLog.compress`: add 'Compression will use spark.io.compression.codec.'
2. `spark.broadcast.compress`: add 'Compression will use spark.io.compression.codec.'
3. `spark.rdd.compress`: add 'Compression will use spark.io.compression.codec.'
4. `spark.io.compression.codec`: mention that it also applies to the event log.

For example, from the current documents it is not clear which compression codec is used for the event log.
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 郭小龙 10207633 <[email protected]>
    
    Closes #17498 from guoxiaolongzte/SPARK-20177.
    郭小龙 10207633 authored and srowen committed Apr 1, 2017
Commit: cf5963c
2. [SPARK-19148][SQL][FOLLOW-UP] do not expose the external table concept in Catalog
    
    ### What changes were proposed in this pull request?
After we renamed `Catalog.createExternalTable` to `createTable` in PR #16528, we also need to deprecate the corresponding functions in `SQLContext`.
    
    ### How was this patch tested?
    N/A
    
    Author: Xiao Li <[email protected]>
    
    Closes #17502 from gatorsmile/deprecateCreateExternalTable.
    gatorsmile authored and cloud-fan committed Apr 1, 2017
Commit: 89d6822
  3. [SPARK-20186][SQL] BroadcastHint should use child's stats

    ## What changes were proposed in this pull request?
    
    `BroadcastHint` should use child's statistics and set `isBroadcastable` to true.
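
A hedged sketch of the idea with illustrative types (not Spark's actual plan classes): the hint node reuses its child's statistics and only flips the broadcast flag.

```scala
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None,
                      isBroadcastable: Boolean = false)

trait PlanNode { def stats: Statistics }

case class Relation(sizeInBytes: BigInt) extends PlanNode {
  override def stats: Statistics = Statistics(sizeInBytes)
}

case class BroadcastHintNode(child: PlanNode) extends PlanNode {
  // Keep the child's size and row-count estimates instead of falling back to defaults.
  override def stats: Statistics = child.stats.copy(isBroadcastable = true)
}
```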
    
    ## How was this patch tested?
    
    Added a new stats estimation test for `BroadcastHint`.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17504 from wzhfy/broadcastHintEstimation.
    wzhfy authored and cloud-fan committed Apr 1, 2017
Commit: 2287f3d

Commits on Apr 2, 2017

1. [SPARK-20143][SQL] DataType.fromJson should throw an exception with better message
    
    ## What changes were proposed in this pull request?
    
    Currently, `DataType.fromJson` throws `scala.MatchError` or `java.util.NoSuchElementException` in some cases when the JSON input is invalid as below:
    
    ```scala
    DataType.fromJson(""""abcd"""")
    ```
    
    ```
    java.util.NoSuchElementException: key not found: abcd
      at ...
    ```
    
    ```scala
    DataType.fromJson("""{"abcd":"a"}""")
    ```
    
    ```
    scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
      at ...
    ```
    
    ```scala
    DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
    ```
    
    ```
    scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
      at ...
    ```
    
    After this PR,
    
    ```scala
    DataType.fromJson(""""abcd"""")
    ```
    
    ```
    java.lang.IllegalArgumentException: Failed to convert the JSON string 'abcd' to a data type.
      at ...
    ```
    
    ```scala
    DataType.fromJson("""{"abcd":"a"}""")
    ```
    
    ```
    java.lang.IllegalArgumentException: Failed to convert the JSON string '{"abcd":"a"}' to a data type.
      at ...
    ```
    
    ```scala
    DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
      at ...
    ```
    
    ```
    java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
    ```
    
    ## How was this patch tested?
    
    Unit test added in `DataTypeSuite`.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17468 from HyukjinKwon/fromjson_exception.
    HyukjinKwon authored and gatorsmile committed Apr 2, 2017
Commit: d40cbb8
2. [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK…
    
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-20123
    
    ## What changes were proposed in this pull request?
    
If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn" fails.
    
    ## How was this patch tested?
    
    manual tests
    
    Author: zuotingbing <[email protected]>
    
    Closes #17452 from zuotingbing/spark-bulid.
    zuotingbing authored and srowen committed Apr 2, 2017
Commit: 76de2d1
3. [SPARK-20173][SQL][HIVE-THRIFTSERVER] Throw NullPointerException when HiveThriftServer2 is shutdown
    
    ## What changes were proposed in this pull request?
    
If the shutdown hook is called before the variable `uiTab` is set, it throws a NullPointerException.
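
A minimal, hedged sketch of the guard with illustrative names (not the actual HiveThriftServer2 code): if shutdown happens before the UI tab is created, the hook simply does nothing instead of dereferencing an unset field.

```scala
object ThriftServerLifecycle {
  @volatile private var uiTab: Option[AnyRef] = None

  sys.addShutdownHook {
    uiTab.foreach(tab => println(s"Detaching UI tab: $tab"))  // safe when uiTab was never set
  }

  def attachTab(tab: AnyRef): Unit = { uiTab = Some(tab) }
}
```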
    
    ## How was this patch tested?
    
    manual tests
    
    Author: zuotingbing <[email protected]>
    
    Closes #17496 from zuotingbing/SPARK-HiveThriftServer2.
    zuotingbing authored and srowen committed Apr 2, 2017
Commit: 657cb95
  4. [SPARK-20159][SPARKR][SQL] Support all catalog API in R

    ## What changes were proposed in this pull request?
    
    Add a set of catalog API in R
    
    ```
    "currentDatabase",
    "listColumns",
    "listDatabases",
    "listFunctions",
    "listTables",
    "recoverPartitions",
    "refreshByPath",
    "refreshTable",
    "setCurrentDatabase",
    ```
    https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff
    
    ## How was this patch tested?
    
    manual tests, unit tests
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17483 from felixcheung/rcatalog.
    felixcheung authored and Felix Cheung committed Apr 2, 2017
Commit: 93dbfe7

Commits on Apr 3, 2017

  1. [SPARK-19985][ML] Fixed copy method for some ML Models

    ## What changes were proposed in this pull request?
    Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator.  This change fixes these by creating a new instance of the model and explicitly setting values and parent.
    
    ## How was this patch tested?
    Added `MLTestingUtils.checkCopy` to the offending models to tests to verify the copy is made and parent is set.
    
    Author: Bryan Cutler <[email protected]>
    
    Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
    BryanCutler authored and Nick Pentreath committed Apr 3, 2017
Commit: 2a903a1
2. [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options
    
    ## What changes were proposed in this pull request?
    
This PR proposes to use the `XXX` format instead of `ZZ`. `ZZ` seems to be `FastDateFormat`-specific.

`ZZ` supports "ISO 8601 extended format time zones" but it appears to be a `FastDateFormat`-specific option.
I misunderstood it to be a format compatible with `SimpleDateFormat` when this change was introduced.
    Please see [SimpleDateFormat documentation]( https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone) and [FastDateFormat documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).
    
    It seems we better replace `ZZ` to `XXX` because they look using the same strategy - [FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930), [FastDateParser.java#L932-L951 ](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951) and [FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
    
I also checked the code and manually debugged it to be sure. Both cases seem to use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.
    
_Note that this is rather a documentation fix and not a behaviour change, because `ZZ` seems to be an invalid date format in `SimpleDateFormat` (as documented in `DataFrameReader` etc.), and both `ZZ` and `XXX` appear to work identically with `FastDateFormat`._
    
    Current documentation is as below:
    
    ```
       * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
       * indicates a timestamp format. Custom date formats follow the formats at
       * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
    ```
    
    ## How was this patch tested?
    
    Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
    
    **Parse**
    
    ```scala
    scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
    res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
    
    scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
    res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
    
    scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
    java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
      at java.text.DateFormat.parse(DateFormat.java:366)
      ... 48 elided
    scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
    java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
      at java.text.DateFormat.parse(DateFormat.java:366)
      ... 48 elided
    ```
    
    ```scala
    scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
    res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
    
    scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
    res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
    
    scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
    res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
    
    scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
    res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
    ```
    
    **Format**
    
    ```scala
    scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
    res6: String = 2017-03-21T20:00:00.000+09:00
    ```
    
    ```scala
    scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
    fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]
    
    scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
    res1: String = 2017-03-21T20:00:00.000+09:00
    
    scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
    fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]
    
    scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
    res2: String = 2017-03-21T20:00:00.000+09:00
    ```
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17489 from HyukjinKwon/SPARK-20166.
    HyukjinKwon authored and srowen committed Apr 3, 2017
Commit: cff11fd
3. [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown
    
## What changes were proposed in this pull request?
    
It seems several non-breaking spaces were inserted into several `.md` files, and they appear to break markdown rendering.
    
    These are different. For example, this can be checked via `python` as below:
    
    ```python
    >>> " "
    '\xc2\xa0'
    >>> " "
    ' '
    ```
    
    _Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._
    
I checked the output below in Safari and Chrome on macOS, and in Internet Explorer on Windows 10.
    
    **Before**
    
    ![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
    ![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
    
    **After**
    
    ![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
    ![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)
    
    ## How was this patch tested?
    
    Manually checking.
    
    These instances were found via
    
    ```
    grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
    ```
    
    in Mac OS.
    
It seems there are several more instances, as below:
    
    ```
    ./docs/sql-programming-guide.md:        │   ├── ...
    ./docs/sql-programming-guide.md:        │   │
    ./docs/sql-programming-guide.md:        │   ├── country=US
    ./docs/sql-programming-guide.md:        │   │   └── data.parquet
    ./docs/sql-programming-guide.md:        │   ├── country=CN
    ./docs/sql-programming-guide.md:        │   │   └── data.parquet
    ./docs/sql-programming-guide.md:        │   └── ...
    ./docs/sql-programming-guide.md:            ├── ...
    ./docs/sql-programming-guide.md:            │
    ./docs/sql-programming-guide.md:            ├── country=US
    ./docs/sql-programming-guide.md:            │   └── data.parquet
    ./docs/sql-programming-guide.md:            ├── country=CN
    ./docs/sql-programming-guide.md:            │   └── data.parquet
    ./docs/sql-programming-guide.md:            └── ...
    ./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
    ./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
    ./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
    ./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
    ```
    
These seem to be generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any rendering problem within code blocks, and I did not fix them, to avoid the overhead of manually replacing them whenever the output is regenerated via the `tree` command in the future.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17517 from HyukjinKwon/non-breaking-space.
    HyukjinKwon authored and srowen committed Apr 3, 2017
Commit: 364b0db
4. [SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]'
    
    
    ## What changes were proposed in this pull request?
    
Array[Int] has been registered in KryoSerializer. The following file has been changed: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala
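
A hedged illustration of what the registration buys, using Kryo directly rather than Spark's KryoSerializer (it needs the kryo dependency on the classpath): once Array[Int] is registered, it serializes even when registration is required, mirroring spark.kryo.registrationRequired=true.

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.ByteArrayOutputStream

object KryoArrayIntExample extends App {
  val kryo = new Kryo()
  kryo.setRegistrationRequired(true)      // analogous to spark.kryo.registrationRequired=true
  kryo.register(classOf[Array[Int]])      // the kind of registration this commit adds inside Spark

  val bytes = new ByteArrayOutputStream()
  val output = new Output(bytes)
  kryo.writeObject(output, Array(1, 2, 3))
  output.close()

  val input = new Input(bytes.toByteArray)
  println(kryo.readObject(input, classOf[Array[Int]]).mkString(","))  // 1,2,3
}
```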
    
    ## How was this patch tested?
    
    First, the issue was reproduced by new unit test.
    Then, the issue was fixed to pass the failed test.
    
    Author: Denis Bolshakov <[email protected]>
    
    Closes #17482 from dbolshak/SPARK-9002.
    Denis Bolshakov authored and srowen committed Apr 3, 2017
Commit: fb5869f
  5. [SPARK-19969][ML] Imputer doc and example

    ## What changes were proposed in this pull request?
    
    Add docs and examples for spark.ml.feature.Imputer. Currently scala and Java examples are included. Python example will be added after #17316
    
    ## How was this patch tested?
    
    local doc generation and example execution
    
    Author: Yuhao Yang <[email protected]>
    
    Closes #17324 from hhbyyh/imputerdoc.
    YY-OnCall authored and Nick Pentreath committed Apr 3, 2017
Commit: 4d28e84
6. [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs
    
    ## What changes were proposed in this pull request?
    
Currently, when we infer the types for valid JSON strings that are not objects or arrays, we produce empty schemas regardless of parse mode, as below:
    
    ```scala
    scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
    root
    ```
    
    ```scala
    scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
    root
    ```
    
    This PR proposes to handle parse modes in type inference.
    
    After this PR,
    
    ```scala
    
    scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
    root
     |-- a: long (nullable = true)
    ```
    
    ```
    scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
    java.lang.RuntimeException: Failed to infer a common schema. Struct types are expected but string was found.
    ```
    
This PR is based on NathanHowell@e233fd0; NathanHowell and I discussed this in https://issues.apache.org/jira/browse/SPARK-19641
    
    ## How was this patch tested?
    
    Unit tests in `JsonSuite` for both `DROPMALFORMED` and `FAILFAST` modes.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17492 from HyukjinKwon/SPARK-19641.
    HyukjinKwon authored and cloud-fan committed Apr 3, 2017
Commit: 4fa1a43
  7. [SPARK-20194] Add support for partition pruning to in-memory catalog

    ## What changes were proposed in this pull request?
    This patch implements `listPartitionsByFilter()` for `InMemoryCatalog` and thus resolves an outstanding TODO causing the `PruneFileSourcePartitions` optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default).
    
    The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's `getPartitionsByFilter()` out from `HiveExternalCatalog` into `ExternalCatalogUtils` and calls this new function from `InMemoryCatalog` on the whole list of partitions.
    
    Now that this method is implemented we can always pass the `CatalogTable` to the `DataSource` in `FindDataSourceTable`, so that the latter is resolved to a relation with a `CatalogFileIndex`, which is what the `PruneFileSourcePartitions` rule matches for.
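
A simplified, hedged illustration of client-side partition pruning (illustrative types, not the actual ExternalCatalogUtils code): keep only the partitions whose partition-spec values satisfy the pushed-down predicate.

```scala
object PartitionPruningSketch extends App {
  case class CatalogPartition(spec: Map[String, String])

  def prunePartitions(
      partitions: Seq[CatalogPartition],
      predicate: Map[String, String] => Boolean): Seq[CatalogPartition] =
    partitions.filter(p => predicate(p.spec))

  val parts = Seq(CatalogPartition(Map("country" -> "US")), CatalogPartition(Map("country" -> "CN")))
  // Prune to partitions where country=US.
  println(prunePartitions(parts, spec => spec.get("country").contains("US")))
}
```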
    
    ## How was this patch tested?
    Ran existing tests and added new test for `listPartitionsByFilter` in `ExternalCatalogSuite`, which is subclassed by both `InMemoryCatalogSuite` and `HiveExternalCatalogSuite`.
    
    Author: Adrian Ionescu <[email protected]>
    
    Closes #17510 from adrian-ionescu/InMemoryCatalog.
    adrian-ionescu authored and gatorsmile committed Apr 3, 2017
Commit: 703c42c

Commits on Apr 4, 2017

  1. [SPARK-20145] Fix range case insensitive bug in SQL

    ## What changes were proposed in this pull request?
    Range in SQL should be case insensitive
    
    ## How was this patch tested?
    unit test
    
    Author: samelamin <[email protected]>
    Author: samelamin <[email protected]>
    
    Closes #17487 from samelamin/SPARK-20145.
    samelamin authored and rxin committed Apr 4, 2017
Commit: 58c9e6e
  2. [SPARK-19408][SQL] filter estimation on two columns of same table

    ## What changes were proposed in this pull request?
    
In SQL queries, we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.

This PR estimates filter selectivity on two columns of the same table. For example, multiple TPC-H queries have the predicate "WHERE l_commitdate < l_receiptdate".
    
    ## How was this patch tested?
    
We added 6 new test cases to test various logical predicates involving two columns of the same table.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Ron Hu <[email protected]>
    Author: U-CHINA\r00754707 <[email protected]>
    
    Closes #17415 from ron8hu/filterTwoColumns.
    ron8hu authored and gatorsmile committed Apr 4, 2017
Commit: e7877fd
  3. [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS

    ## What changes were proposed in this pull request?
    
    **Description** from JIRA
    
    The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS. But unfortunately parquet-mr hasn't supported it yet.
    For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a 0 microsecond part to read values.
    For the write path, currently we are writing timestamps as INT96, similar to Impala and Hive. One alternative is that, we can have a separate SQL option to let users be able to write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this way the microsecond part will be truncated.
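
A hedged sketch of the precision handling described above: Spark SQL stores timestamps with microsecond precision, so a read TIMESTAMP_MILLIS value only needs a zero microsecond part appended, while writing back as TIMESTAMP_MILLIS truncates the sub-millisecond digits.

```scala
object TimestampMillisSketch {
  def millisToMicros(millis: Long): Long = millis * 1000L   // read path: pad the microsecond part with 0

  def microsToMillis(micros: Long): Long = micros / 1000L   // write path: microseconds are truncated
}
```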
    ## How was this patch tested?
    
    Added new tests in ParquetQuerySuite and ParquetIOSuite
    
    Author: Dilip Biswal <[email protected]>
    
    Closes #15332 from dilipbiswal/parquet-time-millis.
    dilipbiswal authored and ueshin committed Apr 4, 2017
Commit: 3bfb639
4. [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog Interface
    
    ### What changes were proposed in this pull request?
    
    This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logics into the Catalog interface. The output formats are improved. We also add the missing attributes. It impacts the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`.
    
    In addition, by following what we did in Dataset API `printSchema`, we can use `treeString` to show the schema in the more readable way.
    
    Below is the current way:
    ```
    Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
    ```
    After the change, it should look like
    ```
    Schema: root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = true)
     |-- c: string (nullable = true)
     |-- d: string (nullable = true)
    ```
    
    ### How was this patch tested?
    `describe.sql` and `show-tables.sql`
    
    Author: Xiao Li <[email protected]>
    
    Closes #17394 from gatorsmile/descFollowUp.
    gatorsmile committed Apr 4, 2017
Commit: 51d3c85
  5. [SPARK-19825][R][ML] spark.ml R API for FPGrowth

    ## What changes were proposed in this pull request?
    
    Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
    
- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.
    
    ## How was this patch tested?
    
    Feature specific unit tests.
    
    Author: zero323 <[email protected]>
    
    Closes #17170 from zero323/SPARK-19825.
    zero323 authored and Felix Cheung committed Apr 4, 2017
Commit: b34f766
6. [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be [running|succeeded|failed|unknown]
    
    ## What changes were proposed in this pull request?
    
The status parameter for '/applications/[app-id]/jobs' in the REST API should be '[running|succeeded|failed|unknown]'.
Currently the status is '[complete|succeeded|failed]',
but for '/applications/[app-id]/jobs?status=complete' the server returns 'HTTP ERROR 404'.
Added '?status=running' and '?status=unknown'.
code:

    public enum JobExecutionStatus {
      RUNNING,
      SUCCEEDED,
      FAILED,
      UNKNOWN;
    }
    
    ## How was this patch tested?
    
     manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17507 from guoxiaolongzte/SPARK-20190.
    guoxiaolongzte authored and srowen committed Apr 4, 2017
Commit: c95fbea
7. [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs
    
    ### What changes were proposed in this pull request?
Observed by felixcheung: in the `SparkSession.Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept a qualified name (i.e., `databaseName.tableName` or `databaseName.functionName`). However, the following five APIs do not accept it.
    - def listColumns(tableName: String): Dataset[Column]
    - def getTable(tableName: String): Table
    - def getFunction(functionName: String): Function
    - def tableExists(tableName: String): Boolean
    - def functionExists(functionName: String): Boolean
    
To make them consistent with the other Catalog APIs, this PR makes the changes, updates the function/API comments, and adds the `params` documentation to clarify the inputs we allow.
    
    ### How was this patch tested?
Added the test cases.
    
    Author: Xiao Li <[email protected]>
    
    Closes #17518 from gatorsmile/tableIdentifier.
    gatorsmile authored and cloud-fan committed Apr 4, 2017
Commit: 26e7bca
8. [SPARK-18278][SCHEDULER] Documentation to point to Kubernetes cluster scheduler
    
    ## What changes were proposed in this pull request?
    
    Adding documentation to point to Kubernetes cluster scheduler being developed out-of-repo in https://github.com/apache-spark-on-k8s/spark
    cc rxin srowen tnachen ash211 mccheah erikerlandson
    
    ## How was this patch tested?
    
    Docs only change
    
    Author: Anirudh Ramanathan <[email protected]>
    Author: foxish <[email protected]>
    
    Closes #17522 from foxish/upstream-doc.
    foxish authored and rxin committed Apr 4, 2017
Commit: 11238d4
9. [SPARK-20191][YARN] Create wrapper for RackResolver so tests can override it.
    
    Current test code tries to override the RackResolver used by setting
    configuration params, but because YARN libs statically initialize the
    resolver the first time it's used, that means that those configs don't
    really take effect during Spark tests.
    
    This change adds a wrapper class that easily allows tests to override the
    behavior of the resolver for the Spark code that uses it.
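
A hedged sketch of the wrapper idea with illustrative names (not the actual Spark/YARN classes): rack lookups go through an instance method, so a test can subclass it instead of fighting a statically initialized resolver.

```scala
class SparkRackResolverSketch {
  def resolve(hostName: String): String = {
    // The real implementation would delegate to YARN's RackResolver here.
    "/default-rack"
  }
}

// In a test, deterministic behavior can be substituted:
class FixedRackResolver(rack: String) extends SparkRackResolverSketch {
  override def resolve(hostName: String): String = rack
}
```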
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #17508 from vanzin/SPARK-20191.
    Marcelo Vanzin committed Apr 4, 2017
Commit: 0736980
  10. [MINOR][R] Reorder Collate fields in DESCRIPTION file

    ## What changes were proposed in this pull request?
    
It seems the CRAN check scripts correct `R/pkg/DESCRIPTION` and follow the order of the `Collate` fields.

This PR proposes to fix `catalog.R`'s order so that running this script does not produce a small diff in this file every time.
    
    ## How was this patch tested?
    
    Manually via `./R/check-cran.sh`.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17528 from HyukjinKwon/minor-reorder-description.
    HyukjinKwon authored and Felix Cheung committed Apr 4, 2017
Commit: 0e2ee82
  11. [SPARK-20204][SQL] remove SimpleCatalystConf and CatalystConf type alias

    ## What changes were proposed in this pull request?
    
    This is a follow-up of #17285 .
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17521 from cloud-fan/conf.
    cloud-fan authored and rxin committed Apr 4, 2017
Commit: 402bf2a
12. [SPARK-19716][SQL] support by-name resolution for struct type elements in array
    
    ## What changes were proposed in this pull request?
    
    Previously when we construct deserializer expression for array type, we will first cast the corresponding field to expected array type and then apply `MapObjects`.
    
    However, by doing that, we lose the opportunity to do by-name resolution for struct type inside array type. In this PR, I introduce a `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. Then we don't need to add `Cast` for array type when constructing the deserializer expression, as the element type is determined later at analyzer.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17398 from cloud-fan/dataset.
    cloud-fan authored and liancheng committed Apr 4, 2017
Commit: 295747e

Commits on Apr 5, 2017

1. [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights
    
    ## What changes were proposed in this pull request?
    
    This is a small piece from #16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.
    
    Note: The primary author when this is committed should be sethah since this is taken from his code.
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #17501 from jkbradley/SPARK-20183.
    Seth Hendrickson authored and jkbradley committed Apr 5, 2017
Commit: a59759e
2. [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform
    
    ## What changes were proposed in this pull request?
    
    jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some testing and found the issue. ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
Currently associationRules in FPGrowthModel is a lazy val, and `setMinConfidence` in FPGrowthModel has no impact once associationRules has been computed.

I tried to cache the associationRules to avoid re-computation if `minConfidence` is not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern.
    
    ## How was this patch tested?
    
New unit test; I also strengthened the unit test for model save/load to ensure the cache mechanism works.
    
    Author: Yuhao Yang <[email protected]>
    
    Closes #17336 from hhbyyh/fpmodelminconf.
    YY-OnCall authored and jkbradley committed Apr 5, 2017
    Configuration menu
    Copy the full SHA
    b28bbff View commit details
    Browse the repository at this point in the history
  3. [SPARKR][DOC] update doc for fpgrowth

    ## What changes were proposed in this pull request?
    
    minor update
    
    zero323
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17526 from felixcheung/rfpgrowthfollowup.
    felixcheung authored and Felix Cheung committed Apr 5, 2017
Commit: c1b8b66
4. Commit: b6e7103
5. [SPARK-20209][SS] Execute next trigger immediately if previous batch took longer than trigger interval
    
    ## What changes were proposed in this pull request?
    
For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, then the query waits for 9 minutes before starting the next batch. This does not make sense. The processing-time-based trigger policy should be to process batches as fast as possible, but no faster than one per trigger interval. If batches are already taking longer than the trigger interval, there is no point in waiting an extra trigger interval.
    
    In this PR, I modified the ProcessingTimeExecutor to do so. Another minor change I did was to extract our StreamManualClock into a separate class so that it can be used outside subclasses of StreamTest. For example, ProcessingTimeExecutorSuite does not need to create any context for testing, just needs the StreamManualClock.
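
A hedged sketch of the scheduling rule described above (illustrative, not the actual ProcessingTimeExecutor code): if the last batch overran the interval, start the next one immediately; otherwise wait until the next interval boundary.

```scala
object TriggerScheduling {
  def nextTriggerTimeMs(batchStartMs: Long, batchEndMs: Long, intervalMs: Long): Long = {
    val elapsed = batchEndMs - batchStartMs
    if (elapsed >= intervalMs) batchEndMs   // overran the interval: no extra waiting
    else batchStartMs + intervalMs          // finished early: wait out the remainder of the interval
  }
}
```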
    
    ## How was this patch tested?
    Added new unit tests to comprehensively test this behavior.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #17525 from tdas/SPARK-20209.
    tdas committed Apr 5, 2017
Commit: dad499f
  6. [SPARK-20042][WEB UI] Fix log page buttons for reverse proxy mode

With spark.ui.reverseProxy=true, full-path URLs like /log will point to
the master web endpoint, which is serving the worker UI as a reverse proxy.
To access a REST endpoint in the worker in reverse proxy mode, the
leading /proxy/"target"/ part of the base URI must be retained.
    
    Added logic to log-view.js to handle this, similar to executorspage.js
    
    Patch was tested manually
    
    Author: Oliver Köth <[email protected]>
    
    Closes #17370 from okoethibm/master.
    okoethibm authored and srowen committed Apr 5, 2017
Commit: 6f09dc7
7. [SPARK-19807][WEB UI] Add reason for cancellation when a stage is killed using web UI
    
    ## What changes were proposed in this pull request?
    
    When a user kills a stage using web UI (in Stages page), StagesTab.handleKillRequest requests SparkContext to cancel the stage without giving a reason. SparkContext has cancelStage(stageId: Int, reason: String) that Spark could use to pass the information for monitoring/debugging purposes.
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: shaolinliu <[email protected]>
    Author: lvdongr <[email protected]>
    
    Closes #17258 from shaolinliu/SPARK-19807.
    shaolinliu authored and srowen committed Apr 5, 2017
Commit: 71c3c48
  8. [SPARK-20223][SQL] Fix typo in tpcds q77.sql

    ## What changes were proposed in this pull request?
    
    Fix typo in tpcds q77.sql
    
    ## How was this patch tested?
    
    N/A
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17538 from wzhfy/typoQ77.
    wzhfy authored and gatorsmile committed Apr 5, 2017
Commit: a2d8d76
  9. [SPARK-19454][PYTHON][SQL] DataFrame.replace improvements

    ## What changes were proposed in this pull request?
    
    - Allows skipping `value` argument if `to_replace` is a `dict`:
    	```python
    	df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
    	df.replace({"Alice": "Bob"}).show()
	```
    - Adds validation step to ensure homogeneous values / replacements.
    - Simplifies internal control flow.
    - Improves unit tests coverage.
    
    ## How was this patch tested?
    
    Existing unit tests, additional unit tests, manual testing.
    
    Author: zero323 <[email protected]>
    
    Closes #16793 from zero323/SPARK-19454.
    zero323 authored and holdenk committed Apr 5, 2017
Commit: e277399
10. [SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState
    
    ## What changes were proposed in this pull request?
    
    - Fixed bug in Java API not passing timeout conf to scala API
    - Updated markdown docs
    - Updated scala docs
    - Added scala and Java example
    
    ## How was this patch tested?
    Manually ran examples.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #17539 from tdas/SPARK-20224.
    tdas committed Apr 5, 2017
Commit: 9543fc0

Commits on Apr 6, 2017

1. [SPARK-20204][SQL][FOLLOWUP] SQLConf should react to change in default timezone settings
    
    ## What changes were proposed in this pull request?
    Make sure SESSION_LOCAL_TIMEZONE reflects the change in JVM's default timezone setting. Currently several timezone related tests fail as the change to default timezone is not picked up by SQLConf.
    
    ## How was this patch tested?
Added a unit test in ConfigEntrySuite
    
    Author: Dilip Biswal <[email protected]>
    
    Closes #17537 from dilipbiswal/timezone_debug.
    dilipbiswal authored and cloud-fan committed Apr 6, 2017
Commit: 9d68c67
  2. [SPARK-20214][ML] Make sure converted csc matrix has sorted indices

    ## What changes were proposed in this pull request?
    
`_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices, so a failure happens when you do something like this:
    
        from scipy.sparse import lil_matrix
        lil = lil_matrix((4, 1))
        lil[1, 0] = 1
        lil[3, 0] = 2
        _convert_to_vector(lil.todok())
    
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
          return SparseVector(l.shape[0], csc.indices, csc.data)
        File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
          % (self.indices[i], self.indices[i + 1]))
        TypeError: Indices 3 and 1 are not strictly increasing
    
    A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
    
        >>> from scipy.sparse import lil_matrix
        >>> lil = lil_matrix((4, 1))
        >>> lil[1, 0] = 1
        >>> lil[3, 0] = 2
        >>> dok = lil.todok()
        >>> csc = dok.tocsc()
        >>> csc.has_sorted_indices
        0
        >>> csc.indices
        array([3, 1], dtype=int32)
    
    I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #17532 from viirya/make-sure-sorted-indices.
    viirya authored and jkbradley committed Apr 6, 2017
Commit: 1220605
3. [SPARK-20231][SQL] Refactor star schema code for the subsequent star join detection in CBO
    
    ## What changes were proposed in this pull request?
    
    This commit moves star schema code from ```join.scala``` to ```StarSchemaDetection.scala```. It also applies some minor fixes in ```StarJoinReorderSuite.scala```.
    
    ## How was this patch tested?
    Run existing ```StarJoinReorderSuite.scala```.
    
    Author: Ioana Delaney <[email protected]>
    
    Closes #17544 from ioana-delaney/starSchemaCBOv2.
    ioana-delaney authored and gatorsmile committed Apr 6, 2017
Commit: 4000f12
4. [SPARK-20217][CORE] Executor should not fail stage if killed task throws non-interrupted exception
    
    ## What changes were proposed in this pull request?
    
    If tasks throw non-interrupted exceptions on kill (e.g. java.nio.channels.ClosedByInterruptException), their death is reported back as TaskFailed instead of TaskKilled. This causes stage failure in some cases.
    
    This is reproducible as follows. Run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail since we threw a RuntimeException instead of InterruptedException.
    
    ```
    spark.range(100).repartition(100).foreach { i =>
      try {
        Thread.sleep(10000000)
      } catch {
        case t: InterruptedException =>
          throw new RuntimeException(t)
      }
    }
    ```
Based on the code in TaskSetManager, I think this also affects kills of speculative tasks. However, since the number of speculated tasks is small, and usually you need to fail a task a few times before the stage is cancelled, it is unlikely this would be noticed in production unless both speculation was enabled and the number of allowed task failures was set to 1.
    
    We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown.
    
    ## How was this patch tested?
    
    Unit test. The test fails before the change in Executor.scala
    
    cc JoshRosen
    
    Author: Eric Liang <[email protected]>
    
    Closes #17531 from ericl/fix-task-interrupt.
    ericl authored and yhuai committed Apr 6, 2017
Commit: 5142e5d
  5. [SPARK-19953][ML] Random Forest Models use parent UID when being fit

    ## What changes were proposed in this pull request?
    
The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator's parent UID when being fit. This change fixes that so the models can properly be identified with their parents.
    
## How was this patch tested?
Existing tests.
    
    Added check to verify that model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm.
    
    Author: Bryan Cutler <[email protected]>
    
    Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
    BryanCutler authored and Nick Pentreath committed Apr 6, 2017
Commit: e156b5d
  6. [SPARK-20085][MESOS] Configurable mesos labels for executors

    ## What changes were proposed in this pull request?
    
    Add spark.mesos.task.labels configuration option to add mesos key:value labels to the executor.
    
     "k1:v1,k2:v2" as the format, colons separating key-value and commas to list out more than one.
    
    Discussion of labels with mgummelt at #17404
    
    ## How was this patch tested?
    
Added unit tests to verify that labels are added correctly and that incorrect labels are ignored, and added a test for the executor name.
    
    Tested with: `./build/sbt -Pmesos mesos/test`
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Kalvin Chau <[email protected]>
    
    Closes #17413 from kalvinnchau/mesos-labels.
    Kalvin Chau authored and srowen committed Apr 6, 2017
Commit: c8fc1f3
7. [SPARK-20064][PYSPARK] Bump the PySpark version number to 2.2

    ## What changes were proposed in this pull request?
PySpark version in version.py was lagging behind.
Versioning is in line with PEP 440: https://www.python.org/dev/peps/pep-0440/
    
    ## How was this patch tested?
    Simply rebuild the project with existing tests
    
    Author: setjet <[email protected]>
    Author: Ruben Janssen <[email protected]>
    
    Closes #17523 from setjet/SPARK-20064.
    setjet authored and srowen committed Apr 6, 2017
Commit: d009fb3
8. [SPARK-20196][PYTHON][SQL] update doc for catalog functions for all languages, add pyspark refreshByPath API
    
    ## What changes were proposed in this pull request?
    
    Update doc to remove external for createTable, add refreshByPath in python
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17512 from felixcheung/catalogdoc.
    felixcheung authored and Felix Cheung committed Apr 6, 2017
Commit: bccc330
9. [SPARK-20195][SPARKR][SQL] add createTable catalog API and deprecate createExternalTable
    
    ## What changes were proposed in this pull request?
    
    Following up on #17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17511 from felixcheung/rceatetable.
    felixcheung authored and Felix Cheung committed Apr 6, 2017
Commit: 5a693b4
10. [SPARK-17019][CORE] Expose on-heap and off-heap memory usage in various places
    
    ## What changes were proposed in this pull request?
    
With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data into off-heap memory, but the usage of on-heap and off-heap memory is not currently exposed, which makes it inconvenient for users to monitor and profile. This proposes to expose off-heap as well as on-heap memory usage in various places:
    1. Spark UI's executor page will display both on-heap and off-heap memory usage.
    2. REST request returns both on-heap and off-heap memory.
    3. Also this can be gotten from MetricsSystem.
    4. Last this usage can be obtained programmatically from SparkListener.
    
    Attach the UI changes:
    
    ![screen shot 2016-08-12 at 11 20 44 am](https://cloud.githubusercontent.com/assets/850797/17612032/6c2f4480-607f-11e6-82e8-a27fb8cbb4ae.png)
    
Backward compatibility is also considered for the event log and REST API. An old event log can still be replayed, with off-heap usage displayed as 0. For the REST API, only new fields are added, so JSON backward compatibility is kept.
    ## How was this patch tested?
    
    Unit test added and manual verification.
    
    Author: jerryshao <[email protected]>
    
    Closes #14617 from jerryshao/SPARK-17019.
    jerryshao authored and squito committed Apr 6, 2017
Commit: a449162
  11. [MINOR][DOCS] Fix typo in Hive Examples

    ## What changes were proposed in this pull request?
    
    Fix typo in hive examples from "DaraFrames" to "DataFrames"
    
    ## How was this patch tested?
    
    N/A
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Dustin Koupal <[email protected]>
    
    Closes #17554 from cooper6581/typo-daraframes.
    Dustin Koupal authored and rxin committed Apr 6, 2017
Commit: 8129d59

Commits on Apr 7, 2017

  1. [SPARK-19495][SQL] Make SQLConf slightly more extensible - addendum

    ## What changes were proposed in this pull request?
    This is a tiny addendum to SPARK-19495 to remove the private visibility for copy, which is the only package private method in the entire file.
    
    ## How was this patch tested?
    N/A - no semantic change.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17555 from rxin/SPARK-19495-2.
    rxin authored and gatorsmile committed Apr 7, 2017
Commit: 626b4ca
  2. [SPARK-20245][SQL][MINOR] pass output to LogicalRelation directly

    ## What changes were proposed in this pull request?
    
Currently `LogicalRelation` has an `expectedOutputAttributes` parameter, which makes it hard to reason about what the actual output is. Like other leaf nodes, `LogicalRelation` should also take `output` as a parameter, to simplify the logic.
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17552 from cloud-fan/minor.
    cloud-fan committed Apr 7, 2017
Commit: ad3cc13
  3. [SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation

    ## What changes were proposed in this pull request?
    
    The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it.
    
    ## How was this patch tested?
    
    Python unit test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #17494 from viirya/correlation-python-api.
    viirya authored and Nick Pentreath committed Apr 7, 2017
Commit: 1a52a62
4. [SPARK-20218][DOC][APP-ID] applications//stages' in REST API,add description.
    
    ## What changes were proposed in this pull request?
    
1. '/applications/[app-id]/stages' in the REST API: the status parameter should get the description '?status=[active|complete|pending|failed] list only stages in the state.'

Currently this description is missing, so users of this API do not know that they can filter the stage list by status.

2. '/applications/[app-id]/stages/[stage-id]' in the REST API: remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.',
because only one stage is determined based on the stage-id.
    
code:

    @GET
    def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
      val listener = ui.jobProgressListener
      val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
      val adjStatuses = {
        if (statuses.isEmpty()) {
          Arrays.asList(StageStatus.values(): _*)
        } else {
          statuses
        }
      };
    
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 郭小龙 10207633 <[email protected]>
    
    Closes #17534 from guoxiaolongzte/SPARK-20218.
    郭小龙 10207633 authored and srowen committed Apr 7, 2017
Commit: 9e0893b
5. [SPARK-20026][DOC][SPARKR] Add Tweedie example for SparkR in programming guide
    
    ## What changes were proposed in this pull request?
    Add Tweedie example for SparkR in programming guide.
    The doc was already updated in #17103.
    
    Author: actuaryzhang <[email protected]>
    
    Closes #17553 from actuaryzhang/programGuide.
    actuaryzhang authored and Felix Cheung committed Apr 7, 2017
Commit: 870b9d9
  6. [SPARK-20197][SPARKR] CRAN check fail with package installation

    ## What changes were proposed in this pull request?
    
    Test failed because SPARK_HOME is not set before Spark is installed.
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17516 from felixcheung/rdircheckincran.
    felixcheung authored and Felix Cheung committed Apr 7, 2017
Commit: 8feb799
7. [SPARK-20258][DOC][SPARKR] Fix SparkR logistic regression example in programming guide (did not converge)
    
    ## What changes were proposed in this pull request?
    
The SparkR logistic regression example in the programming guide did not converge (for IRWLS). All estimates are essentially zero:
    
    ```
    training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm")
    df_list2 <- randomSplit(training2, c(7,3), 2)
    binomialDF <- df_list2[[1]]
    binomialTestDF <- df_list2[[2]]
    binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
    
    17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.
    
    > summary(binomialGLM)
    
    Coefficients:
                     Estimate
    (Intercept)    9.0255e+00
    features_0     0.0000e+00
    features_1     0.0000e+00
    features_2     0.0000e+00
    features_3     0.0000e+00
    features_4     0.0000e+00
    features_5     0.0000e+00
    features_6     0.0000e+00
    features_7     0.0000e+00
    ```
    
    Author: actuaryzhang <[email protected]>
    
    Closes #17571 from actuaryzhang/programGuide2.
    actuaryzhang authored and Felix Cheung committed Apr 7, 2017
    1ad73f0
  8. [SPARK-20255] Move listLeafFiles() to InMemoryFileIndex

    ## What changes were proposed in this pull request?
    
    Trying to get a grip on the `FileIndex` hierarchy, I was confused by the following inconsistency:
    
    On the one hand, `PartitioningAwareFileIndex` defines `leafFiles` and `leafDirToChildrenFiles` as abstract, but on the other it fully implements `listLeafFiles` which does all the listing of files. However, the latter is only used by `InMemoryFileIndex`.
    
    I'm hereby proposing to move this method (and all its dependencies) to the implementation class that actually uses it, and thus unclutter the `PartitioningAwareFileIndex` interface.
    
    ## How was this patch tested?
    
    `./build/sbt sql/test`
    
    Author: Adrian Ionescu <[email protected]>
    
    Closes #17570 from adrian-ionescu/list-leaf-files.
    adrian-ionescu authored and rxin committed Apr 7, 2017
    589f3ed

Commits on Apr 8, 2017

  1. [SPARK-20246][SQL] should not push predicate down through aggregate w…

    …ith non-deterministic expressions
    
    ## What changes were proposed in this pull request?
    
    Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
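    
    For illustration, a minimal sketch (hypothetical data and column names, assuming `import spark.implicits._`) of why the filter must stay above such an aggregate:
    
    ```scala
    import org.apache.spark.sql.functions._
    import spark.implicits._
    
    // rand(0)'s value depends on how many rows it has already seen in a partition,
    // so the aggregate's result depends on exactly which rows feed it.
    val t = Seq((1, 10.0), (1, 20.0), (2, 30.0), (2, 40.0)).toDF("a", "b")
    val aggregated = t.groupBy($"a").agg(sum($"b" + rand(0)) as "s")
    
    // This filter only references the grouping key, but pushing it below the groupBy
    // would change the rows consumed by rand(0) and therefore change `s` itself.
    val result = aggregated.filter($"a" > 1)
    ```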
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17562 from cloud-fan/filter.
    cloud-fan authored and gatorsmile committed Apr 8, 2017
    7577e9c
  2. [SPARK-20262][SQL] AssertNotNull should throw NullPointerException

    ## What changes were proposed in this pull request?
    AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17573 from rxin/SPARK-20262.
    rxin authored and gatorsmile committed Apr 8, 2017
    e1afc4d

Commits on Apr 9, 2017

  1. [MINOR] Issue: Change "slice" vs "partition" in exception messages (a…

    …nd code?)
    
    ## What changes were proposed in this pull request?
    
    I came across the term "slice" when running some Spark Scala code. A quick search indicated that "slices" and "partitions" refer to the same thing; indeed see:
    
    - [This issue](https://issues.apache.org/jira/browse/SPARK-1701)
    - [This pull request](#2305)
    - [This StackOverflow answer](http://stackoverflow.com/questions/23436640/what-is-the-difference-between-an-rdd-partition-and-a-slice) and [this one](http://stackoverflow.com/questions/24269495/what-are-the-differences-between-slices-and-partitions-of-rdds)
    
    Thus this pull request fixes the occurrence of "slice" I came across. Nonetheless, [it would appear](https://github.com/apache/spark/search?utf8=%E2%9C%93&q=slice&type=) there are still many references to "slice"/"slices" - thus I thought I'd raise this pull request to address the issue (sorry if this is the wrong place, I'm not too familiar with raising Apache issues).
    
    ## How was this patch tested?
    
    (Not tested locally - only a minor exception message change.)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: asmith26 <[email protected]>
    
    Closes #17565 from asmith26/master.
    asmith26 authored and srowen committed Apr 9, 2017
    34fc48f
  2. [SPARK-19991][CORE][YARN] FileSegmentManagedBuffer performance improv…

    …ement
    
    ## What changes were proposed in this pull request?
    
    Avoid `NoSuchElementException` every time `ConfigProvider.get(val, default)` falls back to default. This apparently causes non-trivial overhead in at least one path, and can easily be avoided.
    
    See #17329
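    
    For illustration, a hedged sketch of the general pattern (a hypothetical `ConfigLike` trait, not Spark's actual `ConfigProvider` API):
    
    ```scala
    trait ConfigLike {
      protected def lookup(name: String): Option[String]
    
      // Costly fallback: an exception is constructed (with its stack trace) on every miss.
      def getViaException(name: String, default: String): String =
        try {
          lookup(name).getOrElse(throw new NoSuchElementException(name))
        } catch {
          case _: NoSuchElementException => default
        }
    
      // Cheap fallback: the default path allocates no exception at all.
      def get(name: String, default: String): String =
        lookup(name).getOrElse(default)
    }
    ```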
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <[email protected]>
    
    Closes #17567 from srowen/SPARK-19991.
    srowen committed Apr 9, 2017
    1f0de3c
  3. [SPARK-20260][MLLIB] String interpolation required for error message

    ## What changes were proposed in this pull request?
    This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
    
    ```
    Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
    ```
    (note the literal `$current` instead of the interpolated value)
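    
    A minimal illustration of the difference the `s` prefix makes:
    
    ```scala
    val current = 3
    val previous = 5
    println("found current=$current, previous=$previous")  // prints the placeholders verbatim
    println(s"found current=$current, previous=$previous") // prints: found current=3, previous=5
    ```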
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Vijay Ramesh <[email protected]>
    
    Closes #17572 from vijaykramesh/master.
    Vijay Ramesh authored and srowen committed Apr 9, 2017
    261eaf5

Commits on Apr 10, 2017

  1. [SPARK-20253][SQL] Remove unnecessary nullchecks of a return value fr…

    …om Spark runtime routines in generated Java code
    
    ## What changes were proposed in this pull request?
    
    This PR eliminates unnecessary null checks of return values from known Spark runtime routines. We know whether a given Spark runtime routine returns ``null`` or not (e.g. ``ArrayData.toDoubleArray()`` never returns ``null``). Thus, we can eliminate the null check on the return value of such a routine.
    
    When we run the following example program, we currently get the Java code shown under "Without this PR". In that code, since we know ``ArrayData.toDoubleArray()`` never returns ``null``, we can eliminate the null checks at lines 90-92 and 97.
    
    ```scala
    val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
    ds.count
    ds.map(e => e).show
    ```
    
    Without this PR
    ```java
    /* 050 */   protected void processNext() throws java.io.IOException {
    /* 051 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 052 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 053 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 054 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
    /* 055 */
    /* 056 */       ArrayData deserializetoobject_value1 = null;
    /* 057 */
    /* 058 */       if (!inputadapter_isNull) {
    /* 059 */         int deserializetoobject_dataLength = inputadapter_value.numElements();
    /* 060 */
    /* 061 */         Double[] deserializetoobject_convertedArray = null;
    /* 062 */         deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
    /* 063 */
    /* 064 */         int deserializetoobject_loopIndex = 0;
    /* 065 */         while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
    /* 066 */           MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
    /* 067 */           MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
    /* 068 */
    /* 069 */           if (MapObjects_loopIsNull2) {
    /* 070 */             throw new RuntimeException(((java.lang.String) references[0]));
    /* 071 */           }
    /* 072 */           if (false) {
    /* 073 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
    /* 074 */           } else {
    /* 075 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
    /* 076 */           }
    /* 077 */
    /* 078 */           deserializetoobject_loopIndex += 1;
    /* 079 */         }
    /* 080 */
    /* 081 */         deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
    /* 082 */       }
    /* 083 */       boolean deserializetoobject_isNull = true;
    /* 084 */       double[] deserializetoobject_value = null;
    /* 085 */       if (!inputadapter_isNull) {
    /* 086 */         deserializetoobject_isNull = false;
    /* 087 */         if (!deserializetoobject_isNull) {
    /* 088 */           Object deserializetoobject_funcResult = null;
    /* 089 */           deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
    /* 090 */           if (deserializetoobject_funcResult == null) {
    /* 091 */             deserializetoobject_isNull = true;
    /* 092 */           } else {
    /* 093 */             deserializetoobject_value = (double[]) deserializetoobject_funcResult;
    /* 094 */           }
    /* 095 */
    /* 096 */         }
    /* 097 */         deserializetoobject_isNull = deserializetoobject_value == null;
    /* 098 */       }
    /* 099 */
    /* 100 */       boolean mapelements_isNull = true;
    /* 101 */       double[] mapelements_value = null;
    /* 102 */       if (!false) {
    /* 103 */         mapelements_resultIsNull = false;
    /* 104 */
    /* 105 */         if (!mapelements_resultIsNull) {
    /* 106 */           mapelements_resultIsNull = deserializetoobject_isNull;
    /* 107 */           mapelements_argValue = deserializetoobject_value;
    /* 108 */         }
    /* 109 */
    /* 110 */         mapelements_isNull = mapelements_resultIsNull;
    /* 111 */         if (!mapelements_isNull) {
    /* 112 */           Object mapelements_funcResult = null;
    /* 113 */           mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
    /* 114 */           if (mapelements_funcResult == null) {
    /* 115 */             mapelements_isNull = true;
    /* 116 */           } else {
    /* 117 */             mapelements_value = (double[]) mapelements_funcResult;
    /* 118 */           }
    /* 119 */
    /* 120 */         }
    /* 121 */         mapelements_isNull = mapelements_value == null;
    /* 122 */       }
    /* 123 */
    /* 124 */       serializefromobject_resultIsNull = false;
    /* 125 */
    /* 126 */       if (!serializefromobject_resultIsNull) {
    /* 127 */         serializefromobject_resultIsNull = mapelements_isNull;
    /* 128 */         serializefromobject_argValue = mapelements_value;
    /* 129 */       }
    /* 130 */
    /* 131 */       boolean serializefromobject_isNull = serializefromobject_resultIsNull;
    /* 132 */       final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
    /* 133 */       serializefromobject_isNull = serializefromobject_value == null;
    /* 134 */       serializefromobject_holder.reset();
    /* 135 */
    /* 136 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 137 */
    /* 138 */       if (serializefromobject_isNull) {
    /* 139 */         serializefromobject_rowWriter.setNullAt(0);
    /* 140 */       } else {
    /* 141 */         // Remember the current cursor so that we can calculate how many bytes are
    /* 142 */         // written later.
    /* 143 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 144 */
    /* 145 */         if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 146 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 147 */           // grow the global buffer before writing data.
    /* 148 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 149 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 150 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 151 */
    /* 152 */         } else {
    /* 153 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 154 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
    /* 155 */
    /* 156 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 157 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 158 */               serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
    /* 159 */             } else {
    /* 160 */               final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
    /* 161 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 162 */             }
    /* 163 */           }
    /* 164 */         }
    /* 165 */
    /* 166 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 167 */       }
    /* 168 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 169 */       append(serializefromobject_result);
    /* 170 */       if (shouldStop()) return;
    /* 171 */     }
    /* 172 */   }
    ```
    
    With this PR (removed most of lines 90-97 in the above code)
    ```java
    /* 050 */   protected void processNext() throws java.io.IOException {
    /* 051 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 052 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 053 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 054 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
    /* 055 */
    /* 056 */       ArrayData deserializetoobject_value1 = null;
    /* 057 */
    /* 058 */       if (!inputadapter_isNull) {
    /* 059 */         int deserializetoobject_dataLength = inputadapter_value.numElements();
    /* 060 */
    /* 061 */         Double[] deserializetoobject_convertedArray = null;
    /* 062 */         deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
    /* 063 */
    /* 064 */         int deserializetoobject_loopIndex = 0;
    /* 065 */         while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
    /* 066 */           MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
    /* 067 */           MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
    /* 068 */
    /* 069 */           if (MapObjects_loopIsNull2) {
    /* 070 */             throw new RuntimeException(((java.lang.String) references[0]));
    /* 071 */           }
    /* 072 */           if (false) {
    /* 073 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
    /* 074 */           } else {
    /* 075 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
    /* 076 */           }
    /* 077 */
    /* 078 */           deserializetoobject_loopIndex += 1;
    /* 079 */         }
    /* 080 */
    /* 081 */         deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
    /* 082 */       }
    /* 083 */       boolean deserializetoobject_isNull = true;
    /* 084 */       double[] deserializetoobject_value = null;
    /* 085 */       if (!inputadapter_isNull) {
    /* 086 */         deserializetoobject_isNull = false;
    /* 087 */         if (!deserializetoobject_isNull) {
    /* 088 */           Object deserializetoobject_funcResult = null;
    /* 089 */           deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
    /* 090 */           deserializetoobject_value = (double[]) deserializetoobject_funcResult;
    /* 091 */
    /* 092 */         }
    /* 093 */
    /* 094 */       }
    /* 095 */
    /* 096 */       boolean mapelements_isNull = true;
    /* 097 */       double[] mapelements_value = null;
    /* 098 */       if (!false) {
    /* 099 */         mapelements_resultIsNull = false;
    /* 100 */
    /* 101 */         if (!mapelements_resultIsNull) {
    /* 102 */           mapelements_resultIsNull = deserializetoobject_isNull;
    /* 103 */           mapelements_argValue = deserializetoobject_value;
    /* 104 */         }
    /* 105 */
    /* 106 */         mapelements_isNull = mapelements_resultIsNull;
    /* 107 */         if (!mapelements_isNull) {
    /* 108 */           Object mapelements_funcResult = null;
    /* 109 */           mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
    /* 110 */           if (mapelements_funcResult == null) {
    /* 111 */             mapelements_isNull = true;
    /* 112 */           } else {
    /* 113 */             mapelements_value = (double[]) mapelements_funcResult;
    /* 114 */           }
    /* 115 */
    /* 116 */         }
    /* 117 */         mapelements_isNull = mapelements_value == null;
    /* 118 */       }
    /* 119 */
    /* 120 */       serializefromobject_resultIsNull = false;
    /* 121 */
    /* 122 */       if (!serializefromobject_resultIsNull) {
    /* 123 */         serializefromobject_resultIsNull = mapelements_isNull;
    /* 124 */         serializefromobject_argValue = mapelements_value;
    /* 125 */       }
    /* 126 */
    /* 127 */       boolean serializefromobject_isNull = serializefromobject_resultIsNull;
    /* 128 */       final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
    /* 129 */       serializefromobject_isNull = serializefromobject_value == null;
    /* 130 */       serializefromobject_holder.reset();
    /* 131 */
    /* 132 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 133 */
    /* 134 */       if (serializefromobject_isNull) {
    /* 135 */         serializefromobject_rowWriter.setNullAt(0);
    /* 136 */       } else {
    /* 137 */         // Remember the current cursor so that we can calculate how many bytes are
    /* 138 */         // written later.
    /* 139 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 140 */
    /* 141 */         if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 142 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 143 */           // grow the global buffer before writing data.
    /* 144 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 145 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 146 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 147 */
    /* 148 */         } else {
    /* 149 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 150 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
    /* 151 */
    /* 152 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 153 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 154 */               serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
    /* 155 */             } else {
    /* 156 */               final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
    /* 157 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 158 */             }
    /* 159 */           }
    /* 160 */         }
    /* 161 */
    /* 162 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 163 */       }
    /* 164 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 165 */       append(serializefromobject_result);
    /* 166 */       if (shouldStop()) return;
    /* 167 */     }
    /* 168 */   }
    ```
    
    ## How was this patch tested?
    
    Add test suites to ``DatasetPrimitiveSuite``
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #17569 from kiszk/SPARK-20253.
    kiszk authored and cloud-fan committed Apr 10, 2017
    7a63f5e
  2. [SPARK-20264][SQL] asm should be non-test dependency in sql/core

    ## What changes were proposed in this pull request?
    The sql/core module currently declares asm as a test-scope dependency. Transitively it should actually be a normal dependency, since the core module actually defines it. This occasionally confuses IntelliJ.
    
    ## How was this patch tested?
    N/A - This is a build change.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17574 from rxin/SPARK-20264.
    rxin authored and gatorsmile committed Apr 10, 2017
    7bfa05e
  3. [SPARK-20270][SQL] na.fill should not change the values in long or in…

    …teger when the default value is in double
    
    ## What changes were proposed in this pull request?
    
    This bug was partially addressed in SPARK-18555 (#15994), but the root cause was not completely solved. The bug is pretty critical for us since it silently changes a Long member id in our application whenever the id is too big to be represented losslessly as a Double.
    
    Here is an example of how this happens. With
    ```
          Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
            (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
    ```
    the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    
    Note that even when the value is not null, Spark casts the Long to Double first and then, if it is not null, casts it back to Long, which loses precision.
    
    The expected behavior is that non-null values are left unchanged, but Spark changes them, which is wrong.
    
    With the PR, the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    which behaves correctly without changing the original Long values, and also avoids the extra cost of unnecessary casting.
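    
    For illustration, a small sketch of why the round trip through Double is lossy for ids of this magnitude (plain Scala, independent of Spark):
    
    ```scala
    // Doubles have a 53-bit mantissa, so longs near 9.1e18 are only representable
    // to the nearest multiple of 1024; an odd id can never survive the round trip.
    val id = 9123146099426677101L
    val roundTripped = id.toDouble.toLong
    assert(roundTripped != id)
    ```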
    
    ## How was this patch tested?
    
    unit test added.
    
    +cc srowen rxin cloud-fan gatorsmile
    
    Thanks.
    
    Author: DB Tsai <[email protected]>
    
    Closes #17577 from dbtsai/fixnafill.
    DB Tsai authored and dbtsai committed Apr 10, 2017
    1a0bc41
  4. [SPARK-20229][SQL] add semanticHash to QueryPlan

    ## What changes were proposed in this pull request?
    
    Like `Expression`, `QueryPlan` should also have a `semanticHash` method, so that we can put plans into a hash map and look them up fast. This PR refactors `QueryPlan` to follow `Expression` and puts all the normalization logic in `QueryPlan.canonicalized`, so that it's very natural to implement `semanticHash`.
    
    Follow-up: improve `CacheManager` to leverage this `semanticHash` and speed up plan lookup, instead of iterating over all cached plans.
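    
    A hedged sketch of the canonicalize-then-hash idea on a toy plan type (not Spark's actual `QueryPlan` API):
    
    ```scala
    sealed trait ToyPlan {
      /** Normalized copy: cosmetic details such as generated ids are reset. */
      def canonicalized: ToyPlan
      final def sameResult(other: ToyPlan): Boolean = canonicalized == other.canonicalized
      final def semanticHash: Int = canonicalized.hashCode()
    }
    
    case class Scan(table: String, outputId: Long) extends ToyPlan {
      def canonicalized: ToyPlan = copy(outputId = 0L)  // ids differ between plan instances
    }
    
    // A map keyed by the canonical form finds equivalent plans in O(1)
    // instead of scanning every cached entry with sameResult.
    val cache = scala.collection.mutable.HashMap.empty[ToyPlan, String]
    cache(Scan("t", outputId = 7).canonicalized) = "cached result"
    assert(cache.contains(Scan("t", outputId = 42).canonicalized))
    ```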
    
    ## How was this patch tested?
    
    Existing tests. Note that we don't need to test the `semanticHash` method directly: once the existing tests prove `sameResult` is correct, we are good.
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17541 from cloud-fan/plan-semantic.
    cloud-fan committed Apr 10, 2017
    3d7f201
  5. [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams thread race

    ## What changes were proposed in this pull request?
    
    Synchronize access to openStreams map.
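    
    A hedged sketch of the shape of the fix (hypothetical names, not the real `DebugFilesystem` code): registration and the assertion-time check take the same lock, so a stream opened on another thread is never observed half-registered.
    
    ```scala
    import scala.collection.mutable
    
    object DebugStreams {
      private val openStreams = mutable.Map.empty[java.io.Closeable, Throwable]
    
      def register(stream: java.io.Closeable): Unit = openStreams.synchronized {
        openStreams(stream) = new Throwable("stream opened here")
      }
    
      def unregister(stream: java.io.Closeable): Unit = openStreams.synchronized {
        openStreams.remove(stream)
      }
    
      def assertNoOpenStreams(): Unit = openStreams.synchronized {
        assert(openStreams.isEmpty, s"${openStreams.size} possibly leaked streams")
      }
    }
    ```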
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Bogdan Raducanu <[email protected]>
    
    Closes #17592 from bogdanrdc/SPARK-20243.
    bogdanrdc authored and hvanhovell committed Apr 10, 2017
    4f7d49b
  6. [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL

    ## What changes were proposed in this pull request?
    
    This PR proposes to add the `IGNORE NULLS` keyword for `first`/`last` in Spark's parser, similar to http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions057.htm. This simply maps the keywords to the existing `ignoreNullsExpr`.
    
    **Before**
    
    ```scala
    scala> sql("select first('a' IGNORE NULLS)").show()
    ```
    
    ```
    org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input 'NULLS' expecting {')', ','}(line 1, pos 24)
    
    == SQL ==
    select first('a' IGNORE NULLS)
    ------------------------^^^
    
      at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:210)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:112)
      at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:66)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:622)
      ... 48 elided
    ```
    
    **After**
    
    ```scala
    scala> sql("select first('a' IGNORE NULLS)").show()
    ```
    
    ```
    +--------------+
    |first(a, true)|
    +--------------+
    |             a|
    +--------------+
    ```
    
    ## How was this patch tested?
    
    Unit tests in `ExpressionParserSuite`.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17566 from HyukjinKwon/SPARK-19518.
    HyukjinKwon authored and hvanhovell committed Apr 10, 2017
    5acaf8c
  7. [SPARK-20273][SQL] Disallow Non-deterministic Filter push-down into J…

    …oin Conditions
    
    ## What changes were proposed in this pull request?
    ```
    sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
    ```
    We will get the following error:
    ```
    Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
    	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
    	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
    	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
    ```
    Filters could be pushed down to the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, the Analyzer [blocks users from adding non-deterministic conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (for details, see PR #7535).
    
    We should not push down non-deterministic conditions; otherwise, we need to explicitly initialize the non-deterministic expressions. This PR is to simply block it.
    
    ## How was this patch tested?
    Added a test case
    
    Author: Xiao Li <[email protected]>
    
    Closes #17585 from gatorsmile/joinRandCondition.
    gatorsmile committed Apr 10, 2017
    fd711ea
  8. [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "T…

    …urkish locale bug" causes Spark problems
    
    ## What changes were proposed in this pull request?
    
    Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
    
    The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
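    
    The classic illustration of the problem (plain Scala):
    
    ```scala
    import java.util.Locale
    
    val turkish = new Locale("tr", "TR")
    "TITLE".toLowerCase(turkish)      // "tıtle" -- dotless ı breaks ASCII-based comparisons
    "TITLE".toLowerCase(Locale.ROOT)  // "title" -- locale-independent, what internal code wants
    "i".toUpperCase(turkish)          // "İ"     -- dotted capital I
    ```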
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <[email protected]>
    
    Closes #17527 from srowen/SPARK-20156.
    srowen committed Apr 10, 2017
    a26e3ed
  9. [SPARK-20280][CORE] FileStatusCache Weigher integer overflow

    ## What changes were proposed in this pull request?
    
    Weigher.weigh needs to return an Int, but it is possible for an Array[FileStatus] to have size > Int.MaxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor.
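    
    A hedged sketch of the idea (the limits and the per-status size estimate below are made up for illustration, not Spark's actual values):
    
    ```scala
    import com.google.common.cache.{CacheBuilder, Weigher}
    import org.apache.hadoop.fs.{FileStatus, Path}
    
    val weightScale = 32  // both the per-entry weight and maximumWeight are divided by this
    
    val weigher = new Weigher[Path, Array[FileStatus]] {
      override def weigh(key: Path, value: Array[FileStatus]): Int = {
        val estimatedBytes = value.length.toLong * 1000L  // rough per-FileStatus estimate
        math.min(estimatedBytes / weightScale, Int.MaxValue.toLong).toInt
      }
    }
    
    val maxSizeInBytes = 250L * 1024 * 1024
    val cache = CacheBuilder.newBuilder()
      .weigher(weigher)
      .maximumWeight(maxSizeInBytes / weightScale)  // scaled by the same factor
      .build[Path, Array[FileStatus]]()
    ```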
    
    ## How was this patch tested?
    New test in FileIndexSuite
    
    Author: Bogdan Raducanu <[email protected]>
    
    Closes #17591 from bogdanrdc/SPARK-20280.
    bogdanrdc authored and hvanhovell committed Apr 10, 2017
    f6dd8e0
  10. [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 3…

    …0 seconds
    
    ## What changes were proposed in this pull request?
    
    Saw the following failure locally:
    
    ```
    Traceback (most recent call last):
      File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
        self._test_func(input, func, expected, sort=True, input2=input2)
      File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
        self.assertEqual(expected, result)
    AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
    
    First list contains 3 additional elements.
    First extra element 0:
    [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
    
    + []
    - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
    -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
    -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
    ```
    
    It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
    
    It's because when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17597 from zsxwing/SPARK-20285.
    zsxwing committed Apr 10, 2017
    f9a50ba
  11. [SPARK-20282][SS][TESTS] Write the commit log first to fix a race con…

    …tion in tests
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the following failure:
    ```
    sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException:
    Assert on query failed:
    
    == Progress ==
       AssertOnQuery(<condition>, )
       StopStream
       AddData to MemoryStream[value#30891]: 1,2
       StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock35cdc93a,Map())
       CheckAnswer: [6],[3]
       StopStream
    => AssertOnQuery(<condition>, )
       AssertOnQuery(<condition>, )
       StartStream(OneTimeTrigger,org.apache.spark.util.SystemClockcdb247d,Map())
       CheckAnswer: [6],[3]
       StopStream
       AddData to MemoryStream[value#30891]: 3
       StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock55394e4d,Map())
       CheckLastBatch: [2]
       StopStream
       AddData to MemoryStream[value#30891]: 0
       StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock749aa997,Map())
       ExpectFailure[org.apache.spark.SparkException, isFatalError: false]
       AssertOnQuery(<condition>, )
       AssertOnQuery(<condition>, incorrect start offset or end offset on exception)
    
    == Stream ==
    Output Mode: Append
    Stream state: not started
    Thread state: dead
    
    == Sink ==
    0: [6] [3]
    
    == Plan ==
    
    	at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
    	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
    	at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
    	at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
    	at org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347)
    	at org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318)
    	at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483)
    	at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    	at org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357)
    	at org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161)
    	at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
    	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268)
    	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
    	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41)
    	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41)
    	at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
    	at org.scalatest.Suite$class.run(Suite.scala:1424)
    	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
    	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
    	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$run(StreamingQuerySuite.scala:41)
    	at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
    	at org.apache.spark.sql.streaming.StreamingQuerySuite.run(StreamingQuerySuite.scala:41)
    	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
    	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    The failure is because `CheckAnswer` will run once `committedOffsets` is updated. Then writing the commit log may be interrupted by the following `StopStream`.
    
    This PR simply changes the order to write the commit log first.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17594 from zsxwing/SPARK-20282.
    zsxwing committed Apr 10, 2017
    a35b9d9
  12. [SPARK-20283][SQL] Add preOptimizationBatches

    ## What changes were proposed in this pull request?
    We currently have postHocOptimizationBatches, but not preOptimizationBatches. This patch adds preOptimizationBatches so the optimizer debugging extensions are symmetric.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17595 from rxin/SPARK-20283.
    rxin committed Apr 10, 2017
    379b0b0

Commits on Apr 11, 2017

  1. [SPARK-17564][TESTS] Fix flaky RequestTimeoutIntegrationSuite.further…

    …RequestsDelay
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the following failure:
    ```
    sbt.ForkMain$ForkError: java.lang.AssertionError: null
    	at org.junit.Assert.fail(Assert.java:86)
    	at org.junit.Assert.assertTrue(Assert.java:41)
    	at org.junit.Assert.assertTrue(Assert.java:52)
    	at org.apache.spark.network.RequestTimeoutIntegrationSuite.furtherRequestsDelay(RequestTimeoutIntegrationSuite.java:230)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:497)
    	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    	at org.junit.runners.Suite.runChild(Suite.java:128)
    	at org.junit.runners.Suite.runChild(Suite.java:27)
    	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    	at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
    	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    It happens several times per month on [Jenkins](http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.network.RequestTimeoutIntegrationSuite&test_name=furtherRequestsDelay). The failure is because `callback1` may not be called before `assertTrue(callback1.failure instanceof IOException);`. It's pretty easy to reproduce this error by adding a sleep before this line: https://github.com/apache/spark/blob/379b0b0bbdbba2278ce3bcf471bd75f6ffd9cf0d/common/network-common/src/test/java/org/apache/spark/network/RequestTimeoutIntegrationSuite.java#L267
    
    The fix is straightforward: just use the latch to wait until `callback1` is called.
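    
    A hedged sketch of the pattern (a hypothetical stand-in for the test's callback, not the actual suite code):
    
    ```scala
    import java.util.concurrent.{CountDownLatch, TimeUnit}
    
    class RecordingCallback {
      @volatile var failure: Throwable = _
      private val done = new CountDownLatch(1)
    
      def onFailure(t: Throwable): Unit = { failure = t; done.countDown() }
      def awaitDone(timeoutSeconds: Long): Boolean = done.await(timeoutSeconds, TimeUnit.SECONDS)
    }
    
    // In the test body (pseudo-usage):
    //   assert(callback1.awaitDone(60))  // wait for the callback instead of racing it
    //   assert(callback1.failure.isInstanceOf[java.io.IOException])
    ```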
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17599 from zsxwing/SPARK-17564.
    zsxwing authored and rxin committed Apr 11, 2017
    734dfbf
  2. [SPARK-20097][ML] Fix visibility discrepancy with numInstances and de…

    …greesOfFreedom in LR and GLR
    
    ## What changes were proposed in this pull request?
    
    - made `numInstances` public in GLR
    - made `degreesOfFreedom` public in LR
    
    ## How was this patch tested?
    
    reran the concerned test suites
    
    Author: Benjamin Fradet <[email protected]>
    
    Closes #17431 from BenFradet/SPARK-20097.
    BenFradet authored and Nick Pentreath committed Apr 11, 2017
    0d2b796
  3. Document Master URL format in high availability set up

    ## What changes were proposed in this pull request?
    
    Add documentation for specifying the master URL in multi-host,port format for a standalone cluster with ZooKeeper-based high availability.
    Referring documentation [Standby Masters with ZooKeeper](http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper)
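    
    For example (hypothetical host names), an application using ZooKeeper-based HA lists every master in one `spark://` URL so it can fail over to the standby:
    
    ```scala
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf()
      .setAppName("ha-example")
      .setMaster("spark://master1.example.com:7077,master2.example.com:7077")
    ```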
    
    ## How was this patch tested?
    
    Documenting the functionality already present.
    
    Author: MirrorZ <[email protected]>
    
    Closes #17584 from MirrorZ/master.
    MirrorZ authored and srowen committed Apr 11, 2017
    d11ef3d
  4. [SPARK-20274][SQL] support compatible array element type in encoder

    ## What changes were proposed in this pull request?
    
    This is a regression caused by SPARK-19716.
    
    Before SPARK-19716, we would cast an array field to the expected array type. After SPARK-19716, that cast was removed, but we forgot to push the cast down to the element level.
    
    ## How was this patch tested?
    
    new regression tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17587 from cloud-fan/array.
    cloud-fan committed Apr 11, 2017
    c870698
  5. [SPARK-20175][SQL] Exists should not be evaluated in Join operator

    ## What changes were proposed in this pull request?
    
    Similar to `ListQuery`, `Exists` should not be evaluated in `Join` operator too.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #17491 from viirya/dont-push-exists-to-join.
    viirya authored and cloud-fan committed Apr 11, 2017
    cd91f96
  6. [SPARK-20289][SQL] Use StaticInvoke to box primitive types

    ## What changes were proposed in this pull request?
    Dataset typed API currently uses NewInstance to box primitive types (i.e. calling the constructor). Instead, it'd be slightly more idiomatic in Java to use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
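    
    The boxing difference in plain terms (illustrative Scala, using the Java boxed types directly):
    
    ```scala
    val viaConstructor = new java.lang.Integer(42)  // always allocates a new object
    val viaValueOf = java.lang.Integer.valueOf(42)  // may return a cached instance
    
    assert(java.lang.Integer.valueOf(42) eq java.lang.Integer.valueOf(42))  // small values are cached
    assert(!(new java.lang.Integer(42) eq new java.lang.Integer(42)))       // the constructor never is
    ```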
    
    ## How was this patch tested?
    The change should be covered by existing tests for Dataset encoders.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17604 from rxin/SPARK-20289.
    rxin committed Apr 11, 2017
    123b4fb
  7. [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3

    ## What changes were proposed in this pull request?
    
    Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3).  Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.
    
    ## How was this patch tested?
    
    - Doctests for helper function
    
    ## Legal
    
    This is my original work and I license the work to the project under the project’s open source license.
    
    Author: David Gingrich <[email protected]>
    
    Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
    David Gingrich authored and holdenk committed Apr 11, 2017
    6297697

Commits on Apr 12, 2017

  1. [MINOR][DOCS] Update supported versions for Hive Metastore

    ## What changes were proposed in this pull request?
    
    Since SPARK-18112 and SPARK-13446, Apache Spark has supported reading Hive metastore 2.0 ~ 2.1.1. This updates the docs accordingly.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #17612 from dongjoon-hyun/metastore.
    dongjoon-hyun authored and gatorsmile committed Apr 12, 2017
    cde9e32
  2. [SPARK-20291][SQL] NaNvl(FloatType, NullType) should not be cast to N…

    …aNvl(DoubleType, DoubleType)
    
    ## What changes were proposed in this pull request?
    
    `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
    
    This causes a mismatch in the output type when the input type is float.
    
    Adding an extra rule in TypeCoercion resolves this issue.
    
    ## How was this patch tested?
    
    Unit tests.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: DB Tsai <[email protected]>
    
    Closes #17606 from dbtsai/fixNaNvl.
    DB Tsai authored and cloud-fan committed Apr 12, 2017
    8ad63ee
  3. [SPARK-19993][SQL] Caching logical plans containing subquery expressi…

    …ons does not work.
    
    ## What changes were proposed in this pull request?
    The sameResult() method does not work when the logical plan contains subquery expressions.
    
    **Before the fix**
    ```SQL
    scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
    ds: org.apache.spark.sql.DataFrame = [c1: int]
    
    scala> ds.cache
    res13: ds.type = [c1: int]
    
    scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
    == Analyzed Logical Plan ==
    c1: int
    Project [c1#86]
    +- Filter c1#86 IN (list#78 [c1#86])
       :  +- Project [c1#87]
       :     +- Filter (outer(c1#86) = c1#87)
       :        +- SubqueryAlias s2
       :           +- Relation[c1#87] parquet
       +- SubqueryAlias s1
          +- Relation[c1#86] parquet
    
    == Optimized Logical Plan ==
    Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
    :- Relation[c1#86] parquet
    +- Relation[c1#87] parquet
    ```
    **Plan after fix**
    ```SQL
    == Analyzed Logical Plan ==
    c1: int
    Project [c1#22]
    +- Filter c1#22 IN (list#14 [c1#22])
       :  +- Project [c1#23]
       :     +- Filter (outer(c1#22) = c1#23)
       :        +- SubqueryAlias s2
       :           +- Relation[c1#23] parquet
       +- SubqueryAlias s1
          +- Relation[c1#22] parquet
    
    == Optimized Logical Plan ==
    InMemoryRelation [c1#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
       +- *BroadcastHashJoin [c1#1, c1#1], [c1#2, c1#2], LeftSemi, BuildRight
          :- *FileScan parquet default.s1[c1#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
          +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, true] as bigint), 32) | (cast(input[0, int, true] as bigint) & 4294967295))))
             +- *FileScan parquet default.s2[c1#2] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
    ```
    ## How was this patch tested?
    New tests are added to CachedTableSuite.
    
    Author: Dilip Biswal <[email protected]>
    
    Closes #17330 from dilipbiswal/subquery_cache_final.
    dilipbiswal authored and cloud-fan committed Apr 12, 2017
    b14bfc3
  4. [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

    ## What changes were proposed in this pull request?
    
    1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
    2. Omitted colon in Output Model section.
    
    ## How was this patch tested?
    
    None.
    
    Author: Lee Dongjin <[email protected]>
    
    Closes #17564 from dongjinleekr/feature/fix-programming-guide.
    dongjinleekr authored and srowen committed Apr 12, 2017
    b938438
  5. [MINOR][DOCS] JSON APIs related documentation fixes

    ## What changes were proposed in this pull request?
    
    This PR proposes corrections related to JSON APIs as below:
    
    - Rendering links in Python documentation
    - Replacing `RDD` with `Dataset` in the programming guide
    - Adding missing description about JSON Lines consistently in `DataFrameReader.json` in Python API
    - De-duplicating little bit of `DataFrameReader.json` in Scala/Java API
    
    ## How was this patch tested?
    
    Manually built the documentation via `jekyll build`. Corresponding snapshots will be left on the code changes.
    
    Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in #17477. So, this PR does not fix those.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17602 from HyukjinKwon/minor-json-documentation.
    HyukjinKwon authored and srowen committed Apr 12, 2017
     bca4259
  6. [SPARK-20298][SPARKR][MINOR] fixed spelling mistake "charactor"

    ## What changes were proposed in this pull request?
    
    Fixed spelling of "charactor"
    
    ## How was this patch tested?
    
    Spelling change only
    
    Author: Brendan Dwyer <[email protected]>
    
    Closes #17611 from bdwyer2/SPARK-20298.
    bdwyer2 authored and srowen committed Apr 12, 2017
     044f7ec
  7. [SPARK-20302][SQL] Short circuit cast when from and to types are stru…

    …cturally the same
    
    ## What changes were proposed in this pull request?
    When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
    
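     For illustration only (a hedged sketch, not the code added by this patch), the structural comparison could be expressed roughly like this, with field names ignored:
     
     ```scala
     import org.apache.spark.sql.types._
     
     // Sketch: two types are "structurally the same" when they differ only in
     // field names, so a cast between them could be skipped.
     def sameStructure(from: DataType, to: DataType): Boolean = (from, to) match {
       case (f: StructType, t: StructType) =>
         f.length == t.length &&
           f.fields.zip(t.fields).forall { case (a, b) =>
             a.nullable == b.nullable && sameStructure(a.dataType, b.dataType)
           }
       case (f: ArrayType, t: ArrayType) =>
         f.containsNull == t.containsNull && sameStructure(f.elementType, t.elementType)
       case (f: MapType, t: MapType) =>
         f.valueContainsNull == t.valueContainsNull &&
           sameStructure(f.keyType, t.keyType) && sameStructure(f.valueType, t.valueType)
       case _ => from == to
     }
     ```
     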
    ## How was this patch tested?
    Added unit tests for the newly introduced functions.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17614 from rxin/SPARK-20302.
    rxin committed Apr 12, 2017
     ffc57b0
  8. [SPARK-20296][TRIVIAL][DOCS] Count distinct error message for streaming

    ## What changes were proposed in this pull request?
    Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
    
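     As a hedged illustration (assuming an active `SparkSession` named `spark` and the built-in `rate` source), this is the kind of query the message covers; it fails when the streaming query is started:
     
     ```scala
     import org.apache.spark.sql.functions.countDistinct
     
     // Sketch only: a distinct aggregation on a streaming Dataset is rejected when
     // the query starts, whether or not the stream is otherwise aggregated.
     val streaming = spark.readStream.format("rate").load()
     streaming
       .agg(countDistinct("value"))
       .writeStream
       .outputMode("complete")
       .format("console")
       .start()   // AnalysisException: distinct aggregation is not supported
     ```
     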
    Author: jtoka <[email protected]>
    
    Closes #17609 from jtoka/master.
    jtoka authored and srowen committed Apr 12, 2017
     2e1fd46
  9. [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins

    ## What changes were proposed in this pull request?
    
    This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
    
    There are several problems with it:
    
     - It adds a little extra time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
    
    - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
    
      (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
    
     To make this automated build pass, it also suggests fixing existing Javadoc breaks and the ones introduced by test code, as described above.
    
     These fixes are similar to instances previously fixed. Please refer to #15999 and #16013.
    
     Note that this only fixes **errors**, not **warnings**. Please see my observation in #17389 (comment) about spurious errors caused by warnings.
    
    ## How was this patch tested?
    
    Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
    
    This was tested via manually adding `time.time()` as below:
    
    ```diff
         profiles_and_goals = build_profiles + sbt_goals
    
         print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
               " ".join(profiles_and_goals))
    
    +    import time
    +    st = time.time()
         exec_sbt(profiles_and_goals)
    +    print("Elapsed :[%s]" % str(time.time() - st))
    ```
    
    produces
    
    ```
    ...
    ========================================================================
    Building Unidoc API Documentation
    ========================================================================
    ...
    [info] Main Java API documentation successful.
    ...
    Elapsed :[94.8746569157]
     ...
     ```
     
    Author: hyukjinkwon <[email protected]>
    
    Closes #17477 from HyukjinKwon/SPARK-18692.
    HyukjinKwon authored and srowen committed Apr 12, 2017
     ceaf77a
  10. [SPARK-20303][SQL] Rename createTempFunction to registerFunction

    ### What changes were proposed in this pull request?
     Session catalog API `createTempFunction` is being used by Hive built-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR renames it to `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logic everywhere.
    
    In the next PRs, the remaining Function-related APIs also need cleanups.
    
    ### How was this patch tested?
    Existing test cases.
    
    Author: Xiao Li <[email protected]>
    
    Closes #17615 from gatorsmile/cleanupCreateTempFunction.
    gatorsmile committed Apr 12, 2017
     504e62e
  11. [SPARK-20304][SQL] AssertNotNull should not include path in string re…

    …presentation
    
    ## What changes were proposed in this pull request?
    AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
    
    ## How was this patch tested?
    Manually tested.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17616 from rxin/SPARK-20304.
    rxin authored and gatorsmile committed Apr 12, 2017
     5408553
  12. [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell

    ## What changes were proposed in this pull request?
    
     SPARK-15236 did this for the Scala shell; this ticket is for the PySpark shell. This is not only for PySpark itself, but can also benefit downstream projects like Livy, which uses shell.py for its interactive sessions. For now, Livy has no control over whether Hive is enabled or not.
    
    ## How was this patch tested?
    
     I didn't find a way to add a test for it, so I tested it manually.
     Run `bin/pyspark --master local --conf spark.sql.catalogImplementation=in-memory` and verify Hive is not enabled.
    
    Author: Jeff Zhang <[email protected]>
    
    Closes #16906 from zjffdu/SPARK-19570.
    zjffdu authored and holdenk committed Apr 12, 2017
     99a9473
  13. [SPARK-20301][FLAKY-TEST] Fix Hadoop Shell.runCommand flakiness in St…

    …ructured Streaming tests
    
    ## What changes were proposed in this pull request?
    
    Some Structured Streaming tests show flakiness such as:
    ```
    [info] - prune results by current_date, complete mode - 696 *** FAILED *** (10 seconds, 937 milliseconds)
    [info]   Timed out while stopping and waiting for microbatchthread to terminate.: The code passed to failAfter did not complete within 10 seconds.
    ```
    
    This happens when we wait for the stream to stop, but it doesn't. The reason it doesn't stop is that we interrupt the microBatchThread, but Hadoop's `Shell.runCommand` swallows the interrupt exception, and the exception is not propagated upstream to the microBatchThread. Then this thread continues to run, only to start blocking on the `streamManualClock`.
    
    ## How was this patch tested?
    
    Thousand retries locally and [Jenkins](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75720/testReport) of the flaky tests
    
    Author: Burak Yavuz <[email protected]>
    
    Closes #17613 from brkyvz/flaky-stream-agg.
    brkyvz authored and tdas committed Apr 12, 2017
     924c424

Commits on Apr 13, 2017

  1. [SPARK-15354][FLAKY-TEST] TopologyAwareBlockReplicationPolicyBehavior…

    ….Peers in 2 racks
    
    ## What changes were proposed in this pull request?
    
    `TopologyAwareBlockReplicationPolicyBehavior.Peers in 2 racks` is failing occasionally: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.TopologyAwareBlockReplicationPolicyBehavior&test_name=Peers+in+2+racks.
    
     This is because, when we generate 10 block manager ids to test, they may all belong to the same rack, as the rack is randomly picked. This PR fixes the problem by forcing each rack to be picked at least once.
    
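     The idea can be sketched with a hypothetical helper (not the actual test code): shuffle the racks so each appears once, then fill the remainder randomly.
     
     ```scala
     import scala.util.Random
     
     // Sketch: assign a rack to each of `numIds` block managers so that every
     // rack is picked at least once (assumes numIds >= racks.length).
     def assignRacks(numIds: Int, racks: Seq[String]): Seq[String] = {
       require(numIds >= racks.length)
       val atLeastOnce = Random.shuffle(racks)
       val rest = Seq.fill(numIds - racks.length)(racks(Random.nextInt(racks.length)))
       Random.shuffle(atLeastOnce ++ rest)
     }
     ```
     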
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17624 from cloud-fan/test.
    cloud-fan committed Apr 13, 2017
     a7b430b
  2. [SPARK-20131][CORE] Don't use this lock in StandaloneSchedulerBacke…

    …nd.stop
    
    ## What changes were proposed in this pull request?
    
     `o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` is flaky because there is a potential deadlock in StandaloneSchedulerBackend which causes an `await` timeout. Here is the related stack trace:
    ```
    "Thread-31" #211 daemon prio=5 os_prio=31 tid=0x00007fedd4808000 nid=0x16403 waiting on condition [0x00007000239b7000]
       java.lang.Thread.State: TIMED_WAITING (parking)
    	at sun.misc.Unsafe.park(Native Method)
    	- parking to wait for  <0x000000079b49ca10> (a scala.concurrent.impl.Promise$CompletionLatch)
    	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
    	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:402)
    	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:213)
    	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
    	- locked <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:517)
    	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1657)
    	at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
    	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1302)
    	at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
    	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:708)
    	at org.apache.spark.streaming.StreamingContextSuite$$anonfun$43$$anonfun$apply$mcV$sp$66$$anon$3.run(StreamingContextSuite.scala:827)
    
    "dispatcher-event-loop-3" #18 daemon prio=5 os_prio=31 tid=0x00007fedd603a000 nid=0x6203 waiting for monitor entry [0x0000700003be4000]
       java.lang.Thread.State: BLOCKED (on object monitor)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:253)
    	- waiting to lock <0x00000007066fca38> (a org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend)
    	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:124)
    	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
     This PR removes `synchronized` and changes `stopping` to an `AtomicBoolean` so that `stop` is idempotent, fixing the deadlock.
    
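     The shape of the fix can be sketched as follows (illustrative only, not the actual StandaloneSchedulerBackend code):
     
     ```scala
     import java.util.concurrent.atomic.AtomicBoolean
     
     // Sketch: compareAndSet makes stop() idempotent without holding a monitor
     // while blocking on the RPC endpoint, so the dispatcher thread is not blocked.
     class Backend {
       private val stopping = new AtomicBoolean(false)
     
       def stop(): Unit = {
         if (stopping.compareAndSet(false, true)) {
           // release resources / ask the driver endpoint to stop, exactly once,
           // with no lock held during the blocking call
         }
       }
     }
     ```
     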
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17610 from zsxwing/SPARK-20131.
    zsxwing committed Apr 13, 2017
     c5f1cc3
  3. [SPARK-20189][DSTREAM] Fix spark kinesis testcases to remove deprecat…

    …ed createStream and use Builders
    
    ## What changes were proposed in this pull request?
    
     The spark-kinesis test cases use KinesisUtils.createStream, which is now deprecated. Modify the test cases to use the recommended KinesisInputDStream.builder instead.
     This change also enables the test cases to pick up session tokens automatically.
    
    ## How was this patch tested?
    
     All the existing test cases work as expected with the changes.
    
    https://issues.apache.org/jira/browse/SPARK-20189
    
    Author: Yash Sharma <[email protected]>
    
    Closes #17506 from yssharma/ysharma/cleanup_kinesis_testcases.
    yashs360 authored and srowen committed Apr 13, 2017
     ec68d8f
  4. [SPARK-20265][MLLIB] Improve PrefixSpan pre-processing efficiency

    ## What changes were proposed in this pull request?
    
     Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
     The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/
    
    ## How was this patch tested?
    
     Using MLlib's existing PrefixSpan tests and tests of my own on the 8 datasets shown in the graph. All
     results obtained were strictly the same as with the original implementation (without this change).
     dev/run-tests was also run; no errors were found.
    
    Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>
    
    Author: Syrux <[email protected]>
    
    Closes #17575 from Syrux/SPARK-20265.
    Syrux authored and srowen committed Apr 13, 2017
     095d1cb
  5. [SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable

    ## What changes were proposed in this pull request?
    
     This PR allows using `SerializationStream` and `DeserializationStream` in try-with-resources.
    
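     For example, with a small loan-pattern helper (a sketch, not Spark API), a stream obtained from `SerializerInstance.serializeStream` can be closed reliably even when the body throws:
     
     ```scala
     import java.io.Closeable
     
     // Sketch of a loan pattern: the resource is closed even if the body fails.
     def withResource[R <: Closeable, T](resource: R)(body: R => T): T =
       try body(resource) finally resource.close()
     
     // e.g. withResource(serializerInstance.serializeStream(out)) { s =>
     //   s.writeObject(value)
     // }
     ```
     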
    ## How was this patch tested?
    
    `core` unit tests.
    
    Author: Sergei Lebedev <[email protected]>
    
    Closes #17598 from superbobry/compression-stream-closeable.
    Sergei Lebedev authored and srowen committed Apr 13, 2017
     a4293c2
  6. [SPARK-20233][SQL] Apply star-join filter heuristics to dynamic progr…

    …amming join enumeration
    
    ## What changes were proposed in this pull request?
    
    Implements star-join filter to reduce the search space for dynamic programming join enumeration. Consider the following join graph:
    
    ```
    T1       D1 - T2 - T3
      \     /
        F1
         |
        D2
    
    star-join: {F1, D1, D2}
    non-star: {T1, T2, T3}
    ```
    The following join combinations will be generated:
    ```
    level 0: (F1), (D1), (D2), (T1), (T2), (T3)
    level 1: {F1, D1}, {F1, D2}, {T2, T3}
    level 2: {F1, D1, D2}
    level 3: {F1, D1, D2, T1}, {F1, D1, D2, T2}
    level 4: {F1, D1, D2, T1, T2}, {F1, D1, D2, T2, T3 }
    level 6: {F1, D1, D2, T1, T2, T3}
    ```
    
    ## How was this patch tested?
    
     New test suite ```StarJoinCostBasedReorderSuite.scala```.
    
    Author: Ioana Delaney <[email protected]>
    
    Closes #17546 from ioana-delaney/starSchemaCBOv3.
    ioana-delaney authored and cloud-fan committed Apr 13, 2017
     fbe4216
  7. [SPARK-20232][PYTHON] Improve combineByKey docs

    ## What changes were proposed in this pull request?
    
    Improve combineByKey documentation:
    
    * Add note on memory allocation
     * Change example code to use different mergeValue and mergeCombiners (see the sketch after this list)
    
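     A representative sketch of the intended kind of example (per-key average, assuming an existing SparkContext `sc`); this is illustrative, not the exact doctest added:
     
     ```scala
     // createCombiner, mergeValue and mergeCombiners are deliberately different here.
     val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))
     
     val avgByKey = scores
       .combineByKey(
         (v: Double) => (v, 1),                                        // createCombiner
         (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),  // mergeValue
         (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners
       .mapValues { case (sum, count) => sum / count }
     ```
     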
    ## How was this patch tested?
    
    Doctest.
    
    ## Legal
    
    This is my original work and I license the work to the project under the project’s open source license.
    
    Author: David Gingrich <[email protected]>
    
    Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.
    David Gingrich authored and holdenk committed Apr 13, 2017
     8ddf0d2
  8. [SPARK-20038][SQL] FileFormatWriter.ExecuteWriteTask.releaseResources…

    …() implementations to be re-entrant
    
    ## What changes were proposed in this pull request?
    
     Have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first attempt to close `currentWriter` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.
    
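     The pattern, sketched outside of the actual FileFormatWriter code:
     
     ```scala
     import java.io.Closeable
     
     // Sketch: nulling the writer in `finally` makes a second releaseResources()
     // call (e.g. during task abort after a failed close) a harmless no-op.
     class WriteTask(private var currentWriter: Closeable) {
       def releaseResources(): Unit = {
         if (currentWriter != null) {
           try {
             currentWriter.close()
           } finally {
             currentWriter = null
           }
         }
       }
     }
     ```
     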
    ## How was this patch tested?
    
     Tricky. I've been fixing the underlying cause since I saw the problem ([HADOOP-14204](https://issues.apache.org/jira/browse/HADOOP-14204)), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, as my code is no longer broken.
     
     Code review, however, should be straightforward.
    
    Author: Steve Loughran <[email protected]>
    
    Closes #17364 from steveloughran/stevel/SPARK-20038-close.
    steveloughran authored and squito committed Apr 13, 2017
     7536e28

Commits on Apr 14, 2017

  1. [SPARK-20318][SQL] Use Catalyst type for min/max in ColumnStat for ea…

    …se of estimation
    
    ## What changes were proposed in this pull request?
    
     Currently, when estimating predicates like col > literal or col = literal, we update min or max in column stats based on the literal value. However, the literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.
    
    To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.
    
     This PR also fixes a bug for the boolean type in the `IN` condition.
    
    ## How was this patch tested?
    
    The changes for ColumnStat are covered by existing tests.
     For the bug fix, a new test for the boolean type in the IN condition is added.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17630 from wzhfy/refactorColumnStat.
    wzhfy authored and cloud-fan committed Apr 14, 2017
     fb036c4

Commits on Apr 15, 2017

  1. [SPARK-20316][SQL] Val and Var should strictly follow the Scala syntax

    ## What changes were proposed in this pull request?
    
    val and var should strictly follow the Scala syntax
    
    ## How was this patch tested?
    
     Manual tests and existing test cases.
    
    Author: ouyangxiaochen <[email protected]>
    
    Closes #17628 from ouyangxiaochen/spark-413.
    ouyangxiaochen authored and srowen committed Apr 15, 2017
     98b41ec

Commits on Apr 16, 2017

  1. [SPARK-19716][SQL][FOLLOW-UP] UnresolvedMapObjects should always be s…

    …erializable
    
    ## What changes were proposed in this pull request?
    
    In #17398 we introduced `UnresolvedMapObjects` as a placeholder of `MapObjects`. Unfortunately `UnresolvedMapObjects` is not serializable as its `function` may reference Scala `Type` which is not serializable.
    
     Ideally this is fine, as we will never serialize and send unresolved expressions to executors. However, users may accidentally do this, e.g. mistakenly reference an encoder instance when implementing `Aggregator`. We should fix it so that it is just a performance issue (more network traffic) and does not fail the query.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17639 from cloud-fan/minor.
    cloud-fan committed Apr 16, 2017
     35e5ae4
  2. [SPARK-20335][SQL] Children expressions of Hive UDF impacts the deter…

    …minism of Hive UDF
    
    ### What changes were proposed in this pull request?
    ```JAVA
      /**
       * Certain optimizations should not be applied if UDF is not deterministic.
       * Deterministic UDF returns same result each time it is invoked with a
       * particular input. This determinism just needs to hold within the context of
       * a query.
       *
       * return true if the UDF is deterministic
       */
      boolean deterministic() default true;
    ```
    
     Based on the definition of [UDFType](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFType.java#L42-L50), when a Hive UDF's children are non-deterministic, the Hive UDF is also non-deterministic.
    
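     Conceptually (an illustrative sketch, not the actual Hive UDF expression code), determinism has to be the conjunction of the annotation and the children:
     
     ```scala
     // Sketch: a UDF invocation is deterministic only if the UDF itself is marked
     // deterministic AND all of its child expressions are deterministic.
     trait Expr { def deterministic: Boolean }
     
     case class UdfCall(udfMarkedDeterministic: Boolean, children: Seq[Expr]) extends Expr {
       override def deterministic: Boolean =
         udfMarkedDeterministic && children.forall(_.deterministic)
     }
     ```
     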
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <[email protected]>
    
    Closes #17635 from gatorsmile/udfDeterministic.
    gatorsmile authored and cloud-fan committed Apr 16, 2017
     e090f3c
  3. [SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameter…

    …s into docker when running on mesos with docker containerizer
    
    ## What changes were proposed in this pull request?
    
    Allow passing in arbitrary parameters into docker when launching spark executors on mesos with docker containerizer tnachen
    
    ## How was this patch tested?
    
     Manually built and tested with passed-in parameters.
    
    Author: Ji Yan <[email protected]>
    
    Closes #17109 from yanji84/ji/allow_set_docker_user.
    Ji Yan authored and srowen committed Apr 16, 2017
     a888fed
  4. [SPARK-20343][BUILD] Add avro dependency in core POM to resolve build…

    … failure in SBT Hadoop 2.6 master on Jenkins
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to add
    
    ```
          <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
          </dependency>
    ```
    
    in core POM to see if it resolves the build failure as below:
    
    ```
    [error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
    [error]     writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
    [error]
    ```
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
    
    ## How was this patch tested?
    
     I tried many ways but I was unable to reproduce this locally. Sean also tried the way I did, but he was also unable to reproduce it.
    
     Please refer to the comments in #17477 (comment)
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17642 from HyukjinKwon/SPARK-20343.
    HyukjinKwon authored and srowen committed Apr 16, 2017
     ad935f5
  5. [SPARK-20278][R] Disable 'multiple_dots_linter' lint rule that is aga…

    …inst project's code style
    
    ## What changes were proposed in this pull request?
    
     Currently, multi-dot separated variable names in R are not allowed. For example,
    
    ```diff
     setMethod("from_json", signature(x = "Column", schema = "structType"),
    -          function(x, schema, asJsonArray = FALSE, ...) {
    +          function(x, schema, as.json.array = FALSE, ...) {
                 if (asJsonArray) {
                   jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                          "createArrayType",
    ```
    
    produces an error as below:
    
    ```
    R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
              function(x, schema, as.json.array = FALSE, ...) {
                                  ^~~~~~~~~~~~~
    ```
    
    This seems against https://google.github.io/styleguide/Rguide.xml#identifiers which says
    
    > The preferred form for variable names is all lower case letters and words separated with dots
    
     This looks to be because lintr by default (https://github.com/jimhester/lintr) follows http://r-pkgs.had.co.nz/style.html, as written in the README.md. A few cases seem not to follow Google's guide, as "a few tweaks".
    
    Per [SPARK-6813](https://issues.apache.org/jira/browse/SPARK-6813), we follow Google's R Style Guide with few exceptions https://google.github.io/styleguide/Rguide.xml. This is also merged into Spark's website - apache/spark-website#43
    
     Also, it looks like we have no limit on function names. This rule also appears to affect the names of functions, as written in the README.md.
    
    > `multiple_dots_linter`: check that function and variable names are separated by _ rather than ..
    
    ## How was this patch tested?
    
    Manually tested `./dev/lint-r`with the manual change below in `R/functions.R`:
    
    ```diff
     setMethod("from_json", signature(x = "Column", schema = "structType"),
    -          function(x, schema, asJsonArray = FALSE, ...) {
    +          function(x, schema, as.json.array = FALSE, ...) {
                 if (asJsonArray) {
                   jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
                                          "createArrayType",
    ```
    
    **Before**
    
    ```R
    R/functions.R:2462:31: style: Words within variable and function names should be separated by '_' rather than '.'.
              function(x, schema, as.json.array = FALSE, ...) {
                                  ^~~~~~~~~~~~~
    ```
    
    **After**
    
    ```
    lintr checks passed.
    ```
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17590 from HyukjinKwon/disable-dot-in-name.
    HyukjinKwon authored and Felix Cheung committed Apr 16, 2017
     86d251c

Commits on Apr 17, 2017

  1. [SPARK-19828][R][FOLLOWUP] Rename asJsonArray to as.json.array in fro…

    …m_json function in R
    
    ## What changes were proposed in this pull request?
    
     This was suggested to be `as.json.array` in the first place in the PR for SPARK-19828, but we could not do this as the lint check emits an error for multiple dots in variable names.
    
     After SPARK-20278, we are now able to use `multiple.dots.in.names`. `asJsonArray` in the `from_json` function can still be changed, as 2.2 is not released yet.
    
    So, this PR proposes to rename `asJsonArray` to `as.json.array`.
    
    ## How was this patch tested?
    
    Jenkins tests, local tests with `./R/run-tests.sh` and manual `./dev/lint-r`. Existing tests should cover this.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17653 from HyukjinKwon/SPARK-19828-followup.
    HyukjinKwon authored and Felix Cheung committed Apr 17, 2017
     24f09b3
  2. [SPARK-20349][SQL] ListFunctions returns duplicate functions after us…

    …ing persistent functions
    
    ### What changes were proposed in this pull request?
    The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
    
     It would be better if the `SessionCatalog` API de-duplicated the records, instead of leaving it to each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR tries to parse it using our parser interface and then de-duplicate the names.
    
    ### How was this patch tested?
    Added test cases.
    
    Author: Xiao Li <[email protected]>
    
    Closes #17646 from gatorsmile/showFunctions.
    gatorsmile committed Apr 17, 2017
     01ff035
  3. [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.

    ## What changes were proposed in this pull request?
    
    This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
    A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead.
    
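     To illustrate the intended translation (a sketch only, not this patch's implementation), a character produced by an escape should be quoted exactly once:
     
     ```scala
     import java.util.regex.Pattern
     
     // Sketch: translate a LIKE pattern to a Java regex without double-escaping;
     // the character following the escape character is emitted as a single literal.
     def likeToRegex(pattern: String, escape: Char = '\\'): String = {
       val out = new StringBuilder
       val chars = pattern.iterator
       while (chars.hasNext) {
         chars.next() match {
           case c if c == escape && chars.hasNext =>
             out.append(Pattern.quote(chars.next().toString)) // escaped char, kept literal
           case '%'   => out.append(".*")
           case '_'   => out.append(".")
           case other => out.append(Pattern.quote(other.toString))
         }
       }
       out.toString
     }
     ```
     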
    ---
    
    Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features.
    
    | RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
    | --- | --- | --- | --- |
    | [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
    | [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
    | [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
    | [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
     | [PostgreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
    | [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
    | Current Spark | _, % | \ | yes |
    
    [1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
    
    The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
    1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
     PostgreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
       According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
       _Proposed new behaviour in Spark: throw AnalysisException_
    2. [x] Empty input, e.g. `'' like ''`
       Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
    3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
       Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
       According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
       _Proposed new behaviour in Spark: throw AnalysisException_
    
    The current specification is also described in the operator's source code in this patch.
    ## How was this patch tested?
    
    Extra case in regex unit tests.
    
    Author: Jakob Odersky <[email protected]>
    
    This patch had conflicts when merged, resolved by
    Committer: Reynold Xin <[email protected]>
    
    Closes #15398 from jodersky/SPARK-17647.
    jodersky authored and rxin committed Apr 17, 2017
     e5fee3e

Commits on Apr 18, 2017

  1. Typo fix: distitrbuted -> distributed

    ## What changes were proposed in this pull request?
    
    Typo fix: distitrbuted -> distributed
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Andrew Ash <[email protected]>
    
    Closes #17664 from ash211/patch-1.
    ash211 authored and rxin committed Apr 18, 2017
     0075562
  2. [TEST][MINOR] Replace repartitionBy with distribute in CollapseRepart…

    …itionSuite
    
    ## What changes were proposed in this pull request?
    
    Replace non-existent `repartitionBy` with `distribute` in `CollapseRepartitionSuite`.
    
    ## How was this patch tested?
    
    local build and `catalyst/testOnly *CollapseRepartitionSuite`
    
    Author: Jacek Laskowski <[email protected]>
    
    Closes #17657 from jaceklaskowski/CollapseRepartitionSuite.
    jaceklaskowski authored and rxin committed Apr 18, 2017
     33ea908
  3. [SPARK-17647][SQL][FOLLOWUP][MINOR] fix typo

    ## What changes were proposed in this pull request?
    
    fix typo
    
    ## How was this patch tested?
    
    manual
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17663 from felixcheung/likedoctypo.
    felixcheung authored and Felix Cheung committed Apr 18, 2017
     b0a1e93
  4. [SPARK-20344][SCHEDULER] Duplicate call in FairSchedulableBuilder.add…

    …TaskSetManager
    
    ## What changes were proposed in this pull request?
    
    Eliminate the duplicate call to `Pool.getSchedulableByName()` in `FairSchedulableBuilder.addTaskSetManager`
    
    ## How was this patch tested?
    
    ./dev/run-tests
    
    Author: Robert Stupp <[email protected]>
    
    Closes #17647 from snazy/20344-dup-call-master.
    snazy authored and srowen committed Apr 18, 2017
     07fd94e
  5. [SPARK-20343][BUILD] Force Avro 1.7.7 in sbt build to resolve build f…

    …ailure in SBT Hadoop 2.6 master on Jenkins
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to force Avro's version to 1.7.7 in core to resolve the build failure as below:
    
    ```
    [error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
    [error]     writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
    [error]
    ```
    
    https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
    
    Note that this is a hack and should be removed in the future.
    
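     For reference, the general shape of an sbt version pin looks like the following sketch (the actual change lives in Spark's build definition and may be expressed differently):
     
     ```scala
     // build.sbt-style sketch: force Avro to 1.7.7 regardless of what transitive
     // dependencies request.
     dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"
     ```
     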
    ## How was this patch tested?
    
     I only tested that this actually overrides the dependency.
    
     I tried many ways but I was unable to reproduce this locally. Sean also tried the way I did, but he was also unable to reproduce it.
    
     Please refer to the comments in #17477 (comment)
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17651 from HyukjinKwon/SPARK-20343-sbt.
    HyukjinKwon authored and srowen committed Apr 18, 2017
     d4f10cb
  6. [SPARK-20366][SQL] Fix recursive join reordering: inside joins are no…

    …t reordered
    
    ## What changes were proposed in this pull request?
    
    If a plan has multi-level successive joins, e.g.:
    ```
             Join
             /   \
         Union   t5
          /   \
        Join  t4
        /   \
      Join  t3
      /  \
     t1   t2
    ```
    Currently we fail to reorder the inside joins, i.e. t1, t2, t3.
    
     In join reordering, we use `OrderedJoin` to indicate a join has been ordered, such that when transforming down the plan, these joins don't need to be reordered again.
    
    But there's a problem in the definition of `OrderedJoin`:
     The real join node is a constructor parameter, not a child. This breaks the transform procedure, because `mapChildren` applies the transform function through a node's children, so a parameter that should have been a child is skipped.
    
    In this patch, we change `OrderedJoin` to a class having the same structure as a join node.
    
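     The distinction can be sketched with a toy tree (illustrative only, not the Catalyst classes themselves):
     
     ```scala
     // Sketch: a transform that walks `children` never sees a plan kept as a plain
     // constructor parameter, but it does see one exposed as a child.
     sealed trait Node { def children: Seq[Node] }
     case class Leaf(name: String) extends Node { val children: Seq[Node] = Nil }
     
     // Before: the wrapped join is only a field, so child-based traversal skips it.
     case class OrderedJoinAsParam(join: Node) extends Node {
       val children: Seq[Node] = Nil
     }
     
     // After: the same pieces are real children, so traversal (and reordering of
     // nested joins) can recurse into them.
     case class OrderedJoinAsNode(left: Node, right: Node) extends Node {
       val children: Seq[Node] = Seq(left, right)
     }
     ```
     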
    ## How was this patch tested?
    
    Add a corresponding test case.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17668 from wzhfy/recursiveReorder.
    wzhfy authored and cloud-fan committed Apr 18, 2017
     321b4f0
  7. [SPARK-20354][CORE][REST-API] When I request access to the 'http: //i…

    …p:port/api/v1/applications' link, return 'sparkUser' is empty in REST API.
    
    ## What changes were proposed in this pull request?
    
     When I request the 'http: //ip:port/api/v1/applications' link, I get the JSON. I need the specific value of the 'sparkUser' field, because my Spark big data management platform needs to filter on this field to know which user submitted each application, to facilitate administration and queries. But currently that field in the returned JSON is empty, so this cannot be achieved; that is, I cannot tell through this REST API who submitted a specific application.
    
    **current return json:**
    [ {
      "id" : "app-20170417152053-0000",
      "name" : "KafkaWordCount",
      "attempts" : [ {
        "startTime" : "2017-04-17T07:20:51.395GMT",
        "endTime" : "1969-12-31T23:59:59.999GMT",
        "lastUpdated" : "2017-04-17T07:20:51.395GMT",
        "duration" : 0,
        **"sparkUser" : "",**
        "completed" : false,
        "endTimeEpoch" : -1,
        "startTimeEpoch" : 1492413651395,
        "lastUpdatedEpoch" : 1492413651395
      } ]
    } ]
    
    **When I fix this question, return json:**
    [ {
      "id" : "app-20170417154201-0000",
      "name" : "KafkaWordCount",
      "attempts" : [ {
        "startTime" : "2017-04-17T07:41:57.335GMT",
        "endTime" : "1969-12-31T23:59:59.999GMT",
        "lastUpdated" : "2017-04-17T07:41:57.335GMT",
        "duration" : 0,
        **"sparkUser" : "mr",**
        "completed" : false,
        "startTimeEpoch" : 1492414917335,
        "endTimeEpoch" : -1,
        "lastUpdatedEpoch" : 1492414917335
      } ]
    } ]
    
    ## How was this patch tested?
    
    manual tests
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17656 from guoxiaolongzte/SPARK-20354.
    郭小龙 10207633 authored and Marcelo Vanzin committed Apr 18, 2017
     1f81dda
  8. [SPARK-20360][PYTHON] reprs for interpreters

    ## What changes were proposed in this pull request?
    
    Establishes a very minimal `_repr_html_` for PySpark's `SparkContext`.
    
    ## How was this patch tested?
    
    nteract:
    
    ![screen shot 2017-04-17 at 3 41 29 pm](https://cloud.githubusercontent.com/assets/836375/25107701/d57090ba-2385-11e7-8147-74bc2c50a41b.png)
    
    Jupyter:
    
    ![screen shot 2017-04-17 at 3 53 19 pm](https://cloud.githubusercontent.com/assets/836375/25107725/05bf1fe8-2386-11e7-93e1-07a20c917dde.png)
    
    Hydrogen:
    
    ![screen shot 2017-04-17 at 3 49 55 pm](https://cloud.githubusercontent.com/assets/836375/25107664/a75e1ddc-2385-11e7-8477-258661833007.png)
    
    Author: Kyle Kelley <[email protected]>
    
    Closes #17662 from rgbkrk/repr.
    rgbkrk authored and holdenk committed Apr 18, 2017
     f654b39
  9. [SPARK-20377][SS] Fix JavaStructuredSessionization example

    ## What changes were proposed in this pull request?
    
     Extra accessors in the Java bean class cause incorrect encoder generation, which corrupted the state when using timeouts.
    
    ## How was this patch tested?
    manually ran the example
    
    Author: Tathagata Das <[email protected]>
    
    Closes #17676 from tdas/SPARK-20377.
    tdas committed Apr 18, 2017
     74aa0df

Commits on Apr 19, 2017

  1. [SPARK-20254][SQL] Remove unnecessary data conversion for Dataset wit…

    …h primitive array
    
    ## What changes were proposed in this pull request?
    
     This PR eliminates unnecessary data conversion, introduced by SPARK-19716, for a Dataset with a primitive array in the generated Java code.
     When we run the following example program, we currently get the Java code shown under "Without this PR". In that code, lines 56-82 are unnecessary, since the primitive array in ArrayData can be converted into a Java primitive array by using the ``toDoubleArray()`` method. ``GenericArrayData`` is not required.
    
     ```scala
    val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
    ds.count
    ds.map(e => e).show
    ```
    
    Without this PR
    ```
    == Parsed Logical Plan ==
    'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
    +- 'MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
       +- 'DeserializeToObject unresolveddeserializer(unresolvedmapobjects(<function1>, getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D
          +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
             +- ExternalRDD [obj#1]
    
    == Analyzed Logical Plan ==
    value: array<double>
    SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
    +- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
       +- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
          +- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
             +- ExternalRDD [obj#1]
    
    == Optimized Logical Plan ==
    SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
    +- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
       +- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
          +- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
                   +- Scan ExternalRDDScan[obj#1]
    
    == Physical Plan ==
    *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
    +- *MapElements <function1>, obj#24: [D
       +- *DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
          +- InMemoryTableScan [value#2]
                +- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                      +- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
                         +- Scan ExternalRDDScan[obj#1]
    ```
    
    ```java
    /* 050 */   protected void processNext() throws java.io.IOException {
    /* 051 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 052 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 053 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 054 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
    /* 055 */
    /* 056 */       ArrayData deserializetoobject_value1 = null;
    /* 057 */
    /* 058 */       if (!inputadapter_isNull) {
    /* 059 */         int deserializetoobject_dataLength = inputadapter_value.numElements();
    /* 060 */
    /* 061 */         Double[] deserializetoobject_convertedArray = null;
    /* 062 */         deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
    /* 063 */
    /* 064 */         int deserializetoobject_loopIndex = 0;
    /* 065 */         while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
    /* 066 */           MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
    /* 067 */           MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
    /* 068 */
    /* 069 */           if (MapObjects_loopIsNull2) {
    /* 070 */             throw new RuntimeException(((java.lang.String) references[0]));
    /* 071 */           }
    /* 072 */           if (false) {
    /* 073 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
    /* 074 */           } else {
    /* 075 */             deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
    /* 076 */           }
    /* 077 */
    /* 078 */           deserializetoobject_loopIndex += 1;
    /* 079 */         }
    /* 080 */
    /* 081 */         deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
    /* 082 */       }
    /* 083 */       boolean deserializetoobject_isNull = true;
    /* 084 */       double[] deserializetoobject_value = null;
    /* 085 */       if (!inputadapter_isNull) {
    /* 086 */         deserializetoobject_isNull = false;
    /* 087 */         if (!deserializetoobject_isNull) {
    /* 088 */           Object deserializetoobject_funcResult = null;
    /* 089 */           deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
    /* 090 */           if (deserializetoobject_funcResult == null) {
    /* 091 */             deserializetoobject_isNull = true;
    /* 092 */           } else {
    /* 093 */             deserializetoobject_value = (double[]) deserializetoobject_funcResult;
    /* 094 */           }
    /* 095 */
    /* 096 */         }
    /* 097 */         deserializetoobject_isNull = deserializetoobject_value == null;
    /* 098 */       }
    /* 099 */
    /* 100 */       boolean mapelements_isNull = true;
    /* 101 */       double[] mapelements_value = null;
    /* 102 */       if (!false) {
    /* 103 */         mapelements_resultIsNull = false;
    /* 104 */
    /* 105 */         if (!mapelements_resultIsNull) {
    /* 106 */           mapelements_resultIsNull = deserializetoobject_isNull;
    /* 107 */           mapelements_argValue = deserializetoobject_value;
    /* 108 */         }
    /* 109 */
    /* 110 */         mapelements_isNull = mapelements_resultIsNull;
    /* 111 */         if (!mapelements_isNull) {
    /* 112 */           Object mapelements_funcResult = null;
    /* 113 */           mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
    /* 114 */           if (mapelements_funcResult == null) {
    /* 115 */             mapelements_isNull = true;
    /* 116 */           } else {
    /* 117 */             mapelements_value = (double[]) mapelements_funcResult;
    /* 118 */           }
    /* 119 */
    /* 120 */         }
    /* 121 */         mapelements_isNull = mapelements_value == null;
    /* 122 */       }
    /* 123 */
    /* 124 */       serializefromobject_resultIsNull = false;
    /* 125 */
    /* 126 */       if (!serializefromobject_resultIsNull) {
    /* 127 */         serializefromobject_resultIsNull = mapelements_isNull;
    /* 128 */         serializefromobject_argValue = mapelements_value;
    /* 129 */       }
    /* 130 */
    /* 131 */       boolean serializefromobject_isNull = serializefromobject_resultIsNull;
    /* 132 */       final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
    /* 133 */       serializefromobject_isNull = serializefromobject_value == null;
    /* 134 */       serializefromobject_holder.reset();
    /* 135 */
    /* 136 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 137 */
    /* 138 */       if (serializefromobject_isNull) {
    /* 139 */         serializefromobject_rowWriter.setNullAt(0);
    /* 140 */       } else {
    /* 141 */         // Remember the current cursor so that we can calculate how many bytes are
    /* 142 */         // written later.
    /* 143 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 144 */
    /* 145 */         if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 146 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 147 */           // grow the global buffer before writing data.
    /* 148 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 149 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 150 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 151 */
    /* 152 */         } else {
    /* 153 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 154 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
    /* 155 */
    /* 156 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 157 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 158 */               serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
    /* 159 */             } else {
    /* 160 */               final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
    /* 161 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 162 */             }
    /* 163 */           }
    /* 164 */         }
    /* 165 */
    /* 166 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 167 */       }
    /* 168 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 169 */       append(serializefromobject_result);
    /* 170 */       if (shouldStop()) return;
    /* 171 */     }
    /* 172 */   }
    ```
    
    With this PR (eliminated lines 56-62 in the above code)
    ```java
    /* 047 */   protected void processNext() throws java.io.IOException {
    /* 048 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 049 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 050 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 051 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
    /* 052 */
    /* 053 */       boolean deserializetoobject_isNull = true;
    /* 054 */       double[] deserializetoobject_value = null;
    /* 055 */       if (!inputadapter_isNull) {
    /* 056 */         deserializetoobject_isNull = false;
    /* 057 */         if (!deserializetoobject_isNull) {
    /* 058 */           Object deserializetoobject_funcResult = null;
    /* 059 */           deserializetoobject_funcResult = inputadapter_value.toDoubleArray();
    /* 060 */           if (deserializetoobject_funcResult == null) {
    /* 061 */             deserializetoobject_isNull = true;
    /* 062 */           } else {
    /* 063 */             deserializetoobject_value = (double[]) deserializetoobject_funcResult;
    /* 064 */           }
    /* 065 */
    /* 066 */         }
    /* 067 */         deserializetoobject_isNull = deserializetoobject_value == null;
    /* 068 */       }
    /* 069 */
    /* 070 */       boolean mapelements_isNull = true;
    /* 071 */       double[] mapelements_value = null;
    /* 072 */       if (!false) {
    /* 073 */         mapelements_resultIsNull = false;
    /* 074 */
    /* 075 */         if (!mapelements_resultIsNull) {
    /* 076 */           mapelements_resultIsNull = deserializetoobject_isNull;
    /* 077 */           mapelements_argValue = deserializetoobject_value;
    /* 078 */         }
    /* 079 */
    /* 080 */         mapelements_isNull = mapelements_resultIsNull;
    /* 081 */         if (!mapelements_isNull) {
    /* 082 */           Object mapelements_funcResult = null;
    /* 083 */           mapelements_funcResult = ((scala.Function1) references[0]).apply(mapelements_argValue);
    /* 084 */           if (mapelements_funcResult == null) {
    /* 085 */             mapelements_isNull = true;
    /* 086 */           } else {
    /* 087 */             mapelements_value = (double[]) mapelements_funcResult;
    /* 088 */           }
    /* 089 */
    /* 090 */         }
    /* 091 */         mapelements_isNull = mapelements_value == null;
    /* 092 */       }
    /* 093 */
    /* 094 */       serializefromobject_resultIsNull = false;
    /* 095 */
    /* 096 */       if (!serializefromobject_resultIsNull) {
    /* 097 */         serializefromobject_resultIsNull = mapelements_isNull;
    /* 098 */         serializefromobject_argValue = mapelements_value;
    /* 099 */       }
    /* 100 */
    /* 101 */       boolean serializefromobject_isNull = serializefromobject_resultIsNull;
    /* 102 */       final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
    /* 103 */       serializefromobject_isNull = serializefromobject_value == null;
    /* 104 */       serializefromobject_holder.reset();
    /* 105 */
    /* 106 */       serializefromobject_rowWriter.zeroOutNullBytes();
    /* 107 */
    /* 108 */       if (serializefromobject_isNull) {
    /* 109 */         serializefromobject_rowWriter.setNullAt(0);
    /* 110 */       } else {
    /* 111 */         // Remember the current cursor so that we can calculate how many bytes are
    /* 112 */         // written later.
    /* 113 */         final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
    /* 114 */
    /* 115 */         if (serializefromobject_value instanceof UnsafeArrayData) {
    /* 116 */           final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
    /* 117 */           // grow the global buffer before writing data.
    /* 118 */           serializefromobject_holder.grow(serializefromobject_sizeInBytes);
    /* 119 */           ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
    /* 120 */           serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
    /* 121 */
    /* 122 */         } else {
    /* 123 */           final int serializefromobject_numElements = serializefromobject_value.numElements();
    /* 124 */           serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
    /* 125 */
    /* 126 */           for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
    /* 127 */             if (serializefromobject_value.isNullAt(serializefromobject_index)) {
    /* 128 */               serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
    /* 129 */             } else {
    /* 130 */               final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
    /* 131 */               serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
    /* 132 */             }
    /* 133 */           }
    /* 134 */         }
    /* 135 */
    /* 136 */         serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
    /* 137 */       }
    /* 138 */       serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
    /* 139 */       append(serializefromobject_result);
    /* 140 */       if (shouldStop()) return;
    /* 141 */     }
    /* 142 */   }
    ```
    
    ## How was this patch tested?
    
    Add test suites into `DatasetPrimitiveSuite`
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #17568 from kiszk/SPARK-20254.
    kiszk authored and cloud-fan committed Apr 19, 2017
    Commit: e468a96
  2. [SPARK-20208][R][DOCS] Document R fpGrowth support

    ## What changes were proposed in this pull request?
    
    Document  fpGrowth in:
    
    - vignettes
    - programming guide
    - code example
    
    ## How was this patch tested?
    
    Manual tests.
    
    Author: zero323 <[email protected]>
    
    Closes #17557 from zero323/SPARK-20208.
    zero323 authored and Felix Cheung committed Apr 19, 2017
    Commit: 702d85a
  3. [SPARK-20359][SQL] Avoid unnecessary execution in EliminateOuterJoin …

    …optimization that can lead to NPE
    
    Avoid unnecessary execution that can lead to an NPE in EliminateOuterJoin, and add a test in DataFrameSuite to confirm the NPE is no longer thrown.
    
    ## What changes were proposed in this pull request?
    Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
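    A minimal sketch of the idea in plain Scala (the names are illustrative, not the actual EliminateOuterJoin code): marking the check `lazy` defers its evaluation until it is actually needed, so the failing path is never touched when the guarding condition is false.
    ```scala
    // Illustrative only: `lazy val` defers evaluation until first use, so a check
    // that would fail on a null input is skipped when the short-circuit in front
    // of it already evaluates to false.
    def canFilterOutNulls(predicate: String): Boolean = {
      lazy val hasNonNullPredicate = predicate.trim.nonEmpty // would NPE if forced while predicate is null
      predicate != null && hasNonNullPredicate
    }
    ```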
    
    ## How was this patch tested?
    
    Added a test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in the catalyst project would be better, but I am unsure how to do this.
    
    
    Author: Koert Kuipers <[email protected]>
    
    Closes #17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
    koertkuipers authored and cloud-fan committed Apr 19, 2017
    Commit: 608bf30
  4. [SPARK-20356][SQL] Pruned InMemoryTableScanExec should have correct o…

    …utput partitioning and ordering
    
    ## What changes were proposed in this pull request?
    
    The output of `InMemoryTableScanExec` can be pruned and mismatch with `InMemoryRelation` and its child plan's output. This causes wrong output partitioning and ordering.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #17679 from viirya/SPARK-20356.
    viirya authored and cloud-fan committed Apr 19, 2017
    Commit: 773754b
  5. [SPARK-20343][BUILD] Avoid Unidoc build only if Hadoop 2.6 is explici…

    …tly set in SBT build
    
    ## What changes were proposed in this pull request?
    
    This PR proposes two things as below:
    
    - Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build
    
      Due to a difference in dependency resolution between SBT & Unidoc, for an unknown reason, the documentation build fails on a specific machine & environment in Jenkins but could not be reproduced elsewhere.

      So, this PR just checks the environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set for the Hadoop 2.6 SBT builds against branches on Jenkins, and then disables the Unidoc build. **Note that the PR builder will still build it with Hadoop 2.6 & SBT.**
    
      ```
      ========================================================================
      Building Unidoc API Documentation
      ========================================================================
      [info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc
      Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
      ...
      ```
    
      I checked the environment variables from the logs (first bit) as below:
    
      - **spark-master-test-sbt-hadoop-2.6** (this one is being failed) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull
    
      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      SPARK_BRANCH=master
      AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6   <- I use this variable
      AMPLAB_JENKINS="true"
      ```
      - spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull
    
      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      SPARK_BRANCH=master
      AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7
      AMPLAB_JENKINS="true"
      ```
    
      - spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull
    
      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      HADOOP_PROFILE=hadoop-2.6
      HADOOP_VERSION=
      SPARK_BRANCH=master
      AMPLAB_JENKINS="true"
      ```
    
      - spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull
    
      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      HADOOP_PROFILE=hadoop-2.7
      HADOOP_VERSION=
      SPARK_BRANCH=master
      AMPLAB_JENKINS="true"
      ```
    
      - PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull
    
      ```
      JENKINS_MASTER_HOSTNAME=amp-jenkins-master
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      ```
    
      Assuming from other logs in branch-2.1
    
        - SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull
    
          ```
          JAVA_HOME=/usr/java/jdk1.8.0_60
          JAVA_7_HOME=/usr/java/jdk1.7.0_79
          SPARK_BRANCH=branch-2.1
          AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6
          AMPLAB_JENKINS="true"
          ```
    
        - Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull
    
          ```
          JAVA_HOME=/usr/java/jdk1.8.0_60
          JAVA_7_HOME=/usr/java/jdk1.7.0_79
          HADOOP_PROFILE=hadoop-2.6
          HADOOP_VERSION=
          SPARK_BRANCH=branch-2.1
          AMPLAB_JENKINS="true"
          ```
    
      We have been using the same convention for those variables. They are actually used in the `run-tests.py` script - see https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520
    
    - Revert the previous try
    
      After #17651, it seems the build still fails on SBT Hadoop 2.6 master.
    
      I am unable to reproduce this - #17477 (comment) - and neither was the reviewer. So, this got merged, as it currently looks like the only way to verify it is to merge it (since no one seems able to reproduce it).
    
    ## How was this patch tested?
    
    I only checked `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` is working fine as expected as below:
    
    ```python
    >>> import collections
    >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"})
    >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
    False
    >>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"})
    >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
    True
    >>> os = collections.namedtuple('os', 'environ')(environ={})
    >>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
    True
    ```
    
    I tried many ways but was unable to reproduce this in my local environment. Sean also tried the same approach but was unable to reproduce it either.

    Please refer to the comments in #17477 (comment).
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17669 from HyukjinKwon/revert-SPARK-20343.
    HyukjinKwon authored and srowen committed Apr 19, 2017
    Commit: 3537876
  6. [SPARK-20036][DOC] Note incompatible dependencies on org.apache.kafka…

    … artifacts
    
    ## What changes were proposed in this pull request?
    
    Note that you shouldn't manually add dependencies on org.apache.kafka artifacts
    
    ## How was this patch tested?
    
    Doc only change, did jekyll build and looked at the page.
    
    Author: cody koeninger <[email protected]>
    
    Closes #17675 from koeninger/SPARK-20036.
    koeninger authored and srowen committed Apr 19, 2017
    Commit: 71a8e9d
  7. [SPARK-20397][SPARKR][SS] Fix flaky test: test_streaming.R.Terminated…

    … by error
    
    ## What changes were proposed in this pull request?
    
    Checking a source parameter is asynchronous. When the query is created, it is not guaranteed that the source has been created yet. This PR just increases the timeout of awaitTermination to ensure the parsing error is thrown.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17687 from zsxwing/SPARK-20397.
    zsxwing committed Apr 19, 2017
    Commit: 4fea784

Commits on Apr 20, 2017

  1. [SPARK-20350] Add optimization rules to apply Complementation Laws.

    ## What changes were proposed in this pull request?
    
    Apply Complementation Laws during boolean expression simplification.
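    For illustration (hypothetical expressions, not the actual Catalyst rule), the Complementation Laws collapse `a AND NOT a` to `false` and `a OR NOT a` to `true` for a deterministic, non-nullable predicate `a`:
    ```scala
    // Plain-Scala illustration of the laws applied during boolean simplification.
    def complementation(a: Boolean): (Boolean, Boolean) =
      (a && !a, a || !a)  // always (false, true), regardless of a

    assert(complementation(true) == (false, true))
    assert(complementation(false) == (false, true))
    ```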
    
    ## How was this patch tested?
    
    Tested using unit tests, integration tests, and manual tests.
    
    Author: ptkool <[email protected]>
    Author: Michael Styles <[email protected]>
    
    Closes #17650 from ptkool/apply_complementation_laws.
    ptkool authored and cloud-fan committed Apr 20, 2017
    Commit: 63824b2
  2. [MINOR][SS] Fix a missing space in UnsupportedOperationChecker error …

    …message
    
    ## What changes were proposed in this pull request?
    
    Also went through the same file to ensure other string concatenation are correct.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17691 from zsxwing/fix-error-message.
    zsxwing committed Apr 20, 2017
    Commit: 39e303a
  3. [SPARK-20398][SQL] range() operator should include cancellation reaso…

    …n when killed
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-19820 adds a reason field for why tasks were killed. However, for backwards compatibility it left the old TaskKilledException constructor which defaults to "unknown reason".
    The range() operator should use the constructor that fills in the reason rather than dropping it on task kill.
    
    ## How was this patch tested?
    
    Existing tests, and I tested this manually.
    
    Author: Eric Liang <[email protected]>
    
    Closes #17692 from ericl/fix-kill-reason-in-range.
    ericl authored and rxin committed Apr 20, 2017
    Commit: dd6d55d
  4. Fixed typos in docs

    ## What changes were proposed in this pull request?
    
    Typos at a couple of place in the docs.
    
    ## How was this patch tested?
    
    build including docs
    
    
    Author: ymahajan <[email protected]>
    
    Closes #17690 from ymahajan/master.
    ymahajan authored and rxin committed Apr 20, 2017
    Commit: bdc6056
  5. [SPARK-20375][R] R wrappers for array and map

    ## What changes were proposed in this pull request?
    
    Adds wrappers for `o.a.s.sql.functions.array` and `o.a.s.sql.functions.map`
    
    ## How was this patch tested?
    
    Unit tests, `check-cran.sh`
    
    Author: zero323 <[email protected]>
    
    Closes #17674 from zero323/SPARK-20375.
    zero323 authored and Felix Cheung committed Apr 20, 2017
    Commit: 46c5749
  6. [SPARK-20156][SQL][FOLLOW-UP] Java String toLowerCase "Turkish locale…

    … bug" in Database and Table DDLs
    
    ### What changes were proposed in this pull request?
    Database and table names conform to the Hive standard ("[a-zA-Z_0-9]+"), i.e. a name may only contain letters, numbers, and underscores.

    When calling `toLowerCase` on the names, we should pass `Locale.ROOT` to the `toLowerCase` call to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
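    A small illustration of the problem, assuming a Turkish locale and a hypothetical identifier:
    ```scala
    import java.util.Locale

    // Under a Turkish locale, 'I' lowercases to the dotless 'ı', so a locale-sensitive
    // toLowerCase can silently change identifiers; Locale.ROOT keeps the result stable.
    val name = "TABLE_ID"
    println(name.toLowerCase(new Locale("tr")))  // table_ıd  (unexpected)
    println(name.toLowerCase(Locale.ROOT))       // table_id  (expected)
    ```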
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <[email protected]>
    
    Closes #17655 from gatorsmile/locale.
    gatorsmile authored and srowen committed Apr 20, 2017
    Commit: 55bea56
  7. [SPARK-20405][SQL] Dataset.withNewExecutionId should be private

    ## What changes were proposed in this pull request?
    Dataset.withNewExecutionId is only used in Dataset itself and should be private.
    
    ## How was this patch tested?
    N/A - this is a simple visibility change.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17699 from rxin/SPARK-20405.
    rxin authored and hvanhovell committed Apr 20, 2017
    Commit: c6f62c5
  8. [SPARK-20409][SQL] fail early if aggregate function in GROUP BY

    ## What changes were proposed in this pull request?
    
    It's illegal to have an aggregate function in GROUP BY, and we should fail in the analysis phase if this happens.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17704 from cloud-fan/minor.
    cloud-fan authored and hvanhovell committed Apr 20, 2017
    Commit: b91873d
  9. [SPARK-20407][TESTS] ParquetQuerySuite 'Enabling/disabling ignoreCorr…

    …uptFiles' flaky test
    
    ## What changes were proposed in this pull request?
    
    SharedSQLContext.afterEach now calls DebugFilesystem.assertNoOpenStreams inside eventually.
    SQLTestUtils withTempDir calls waitForTasksToFinish before deleting the directory.
    
    ## How was this patch tested?
    Added new test in ParquetQuerySuite based on the flaky test
    
    Author: Bogdan Raducanu <[email protected]>
    
    Closes #17701 from bogdanrdc/SPARK-20407.
    bogdanrdc authored and hvanhovell committed Apr 20, 2017
    Commit: c5a31d1
  10. [SPARK-20358][CORE] Executors failing stage on interrupted exception …

    …thrown by cancelled tasks
    
    ## What changes were proposed in this pull request?
    
    This was a regression introduced by my earlier PR here: #17531
    
    It turns out NonFatal() does not in fact catch InterruptedException.
    
    ## How was this patch tested?
    
    Extended cancellation unit test coverage. The first test fails before this patch.
    
    cc JoshRosen mridulm
    
    Author: Eric Liang <[email protected]>
    
    Closes #17659 from ericl/spark-20358.
    ericl authored and yhuai committed Apr 20, 2017
    Commit: b2ebadf
  11. [SPARK-20334][SQL] Return a better error message when correlated pred…

    …icates contain aggregate expression that has mixture of outer and local references.
    
    ## What changes were proposed in this pull request?
    Addresses a follow-up in [comment](#16954 (comment)).
    Currently, subqueries with correlated predicates containing an aggregate expression that has a mixture of outer references and local references generate a codegen error like the following:
    
    ```SQL
    SELECT t1a
    FROM   t1
    GROUP  BY 1
    HAVING EXISTS (SELECT 1
                   FROM  t2
                   WHERE t2a < min(t1a + t2a));
    ```
    Exception snippet.
    ```
    Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
    	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
    	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
    	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
    	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
    	at scala.Option.getOrElse(Option.scala:121)
    	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
    
    ```
    After this PR, a better error message is issued.
    ```
    org.apache.spark.sql.AnalysisException
    Error in query: Found an aggregate expression in a correlated
    predicate that has both outer and local references, which is not supported yet.
    Aggregate expression: min((t1.`t1a` + t2.`t2a`)),
    Outer references: t1.`t1a`,
    Local references: t2.`t2a`.;
    ```
    ## How was this patch tested?
    Added tests in SQLQueryTestSuite.
    
    Author: Dilip Biswal <[email protected]>
    
    Closes #17636 from dilipbiswal/subquery_followup1.
    dilipbiswal authored and hvanhovell committed Apr 20, 2017
    Commit: d95e4d9
  12. [SPARK-20410][SQL] Make sparkConf a def in SharedSQLContext

    ## What changes were proposed in this pull request?
    It is kind of annoying that `SharedSQLContext.sparkConf` is a val when overriding test cases, because you cannot call `super` on it. This PR makes it a function.
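    A minimal sketch of why a `def` helps here (illustrative types, not the actual SharedSQLContext code): an overriding `def` can call `super` and extend the base configuration, which an overridden `val` cannot do cleanly.
    ```scala
    // The base trait exposes the conf as a def, so subclasses can build on it.
    trait BaseSuite {
      def sparkConf: Map[String, String] = Map("spark.app.name" -> "base-suite")
    }

    class MySuite extends BaseSuite {
      // Extends rather than replaces the parent configuration.
      override def sparkConf: Map[String, String] =
        super.sparkConf + ("spark.sql.shuffle.partitions" -> "4")
    }
    ```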
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17705 from hvanhovell/SPARK-20410.
    hvanhovell committed Apr 20, 2017
    Commit: 0332063
  13. [SPARK-20172][CORE] Add file permission check when listing files in F…

    …sHistoryProvider
    
    ## What changes were proposed in this pull request?
    
    In the current Spark HistoryServer we expected to get an `AccessControlException` while listing all the files, but unfortunately that did not work because we don't actually check the access permission, and no other call throws such an exception. What is worse, this check is deferred until the files are read, which is unnecessary and quite verbose, since the exception is printed out every 10 seconds while checking the files.

    So with this fix, we actually check the read permission while listing the files, which avoids unnecessary file reads later on and suppresses the verbose log.
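    A rough sketch of a listing-time permission check, assuming the standard Hadoop `FileSystem.access` API (the actual FsHistoryProvider change may differ):
    ```scala
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.fs.permission.FsAction
    import org.apache.hadoop.security.AccessControlException

    // Filter out unreadable event logs while listing, instead of failing later
    // (and logging repeatedly) when the files are actually read.
    def readableLogs(fs: FileSystem, logs: Seq[Path]): Seq[Path] =
      logs.filter { path =>
        try { fs.access(path, FsAction.READ); true }
        catch { case _: AccessControlException => false }
      }
    ```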
    
    ## How was this patch tested?
    
    Add unit test to verify.
    
    Author: jerryshao <[email protected]>
    
    Closes #17495 from jerryshao/SPARK-20172.
    jerryshao authored and Marcelo Vanzin committed Apr 20, 2017
    Commit: 592f5c8

Commits on Apr 21, 2017

  1. [SPARK-20367] Properly unescape column names of partitioning columns …

    …parsed from paths.
    
    ## What changes were proposed in this pull request?
    
    When inferring the partitioning schema from paths, the column name in parsePartitionColumn should be unescaped with unescapePathName, just as is done in e.g. parsePathFragmentAsSeq.
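    A hypothetical illustration of the escaping issue, using `URLDecoder` as a stand-in for Spark's internal `unescapePathName`:
    ```scala
    import java.net.URLDecoder

    // A partition directory fragment escapes special characters in the column name,
    // so the name parsed back from the path must be unescaped before use.
    val fragment = "a%20b=1"                              // column "a b", value 1
    val Array(rawName, value) = fragment.split("=", 2)
    val columnName = URLDecoder.decode(rawName, "UTF-8")  // "a b"
    println(s"$columnName -> $value")
    ```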
    
    ## How was this patch tested?
    
    Added a test to FileIndexSuite.
    
    Author: Juliusz Sompolski <[email protected]>
    
    Closes #17703 from juliuszsompolski/SPARK-20367.
    juliuszsompolski authored and cloud-fan committed Apr 21, 2017
    Commit: 0368eb9
  2. [SPARK-20329][SQL] Make timezone aware expression without timezone un…

    …resolved
    
    ## What changes were proposed in this pull request?
    A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` rule assumed that these expressions were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression.
    
    This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set.
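    A minimal sketch of the new resolution rule (an illustrative trait, not the actual `TimeZoneAwareExpression`):
    ```scala
    // An expression that depends on a time zone only counts as resolved once the
    // time zone has been set; until then the analyzer keeps trying to resolve it.
    trait TimeZoneAwareSketch {
      def timeZoneId: Option[String]
      def childrenResolved: Boolean
      def resolved: Boolean = childrenResolved && timeZoneId.isDefined
    }
    ```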
    
    ## How was this patch tested?
    Added a regression test to the `SQLQueryTestSuite.having` file.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17641 from hvanhovell/SPARK-20329.
    hvanhovell authored and cloud-fan committed Apr 21, 2017
    Commit: 760c8d0
  3. [SPARK-20281][SQL] Print the identical Range parameters of SparkConte…

    …xt APIs and SQL in explain
    
    ## What changes were proposed in this pull request?
    This PR modifies the code to print identical `Range` parameters for the SparkContext APIs and SQL in the `explain` output. In the current master, although both internally use `defaultParallelism` for `splits` by default, they print different strings in the explain output;
    
    ```
    scala> spark.range(4).explain
    == Physical Plan ==
    *Range (0, 4, step=1, splits=Some(8))
    
    scala> sql("select * from range(4)").explain
    == Physical Plan ==
    *Range (0, 4, step=1, splits=None)
    ```
    
    ## How was this patch tested?
    Added tests in `SQLQuerySuite` and modified some results in the existing tests.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17670 from maropu/SPARK-20281.
    maropu authored and gatorsmile committed Apr 21, 2017
    Commit: 48d760d
  4. [SPARK-20420][SQL] Add events to the external catalog

    ## What changes were proposed in this pull request?
    It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change.
    
    The following events are fired per object:
    
    - Database
      - CreateDatabasePreEvent: event fired before the database is created.
      - CreateDatabaseEvent: event fired after the database has been created.
      - DropDatabasePreEvent: event fired before the database is dropped.
      - DropDatabaseEvent: event fired after the database has been dropped.
    - Table
      - CreateTablePreEvent: event fired before the table is created.
      - CreateTableEvent: event fired after the table has been created.
      - RenameTablePreEvent: event fired before the table is renamed.
      - RenameTableEvent: event fired after the table has been renamed.
      - DropTablePreEvent: event fired before the table is dropped.
      - DropTableEvent: event fired after the table has been dropped.
    - Function
      - CreateFunctionPreEvent: event fired before the function is created.
      - CreateFunctionEvent: event fired after the function has been created.
      - RenameFunctionPreEvent: event fired before the function is renamed.
      - RenameFunctionEvent: event fired after the function has been renamed.
      - DropFunctionPreEvent: event fired before the function is dropped.
      - DropFunctionEvent: event fired after the function has been dropped.
    
    The events currently only contain the names of the modified objects. We can add more events, and more details, at a later point.

    A user can monitor changes to the external catalog by adding a listener to the Spark listener bus and checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is to add a listener directly to the `ExternalCatalog`.
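    A rough sketch of the listener-bus approach, assuming an active `SparkContext` named `sc` and the event trait's package as shown (both are assumptions, not taken from this PR's text):
    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    import org.apache.spark.sql.catalyst.catalog.ExternalCatalogEvent  // assumed package for the new event trait

    // React to catalog changes delivered on the listener bus via onOtherEvent.
    class CatalogChangeListener extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case e: ExternalCatalogEvent => println(s"Catalog changed: $e")
        case _ =>  // ignore unrelated events
      }
    }

    // Registration: sc.addSparkListener(new CatalogChangeListener)
    ```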
    
    ## How was this patch tested?
    Added the `ExternalCatalogEventSuite`.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17710 from hvanhovell/SPARK-20420.
    hvanhovell authored and rxin committed Apr 21, 2017
    Commit: e2b3d23
  5. Small rewording about history server use case

    Hello
    PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only to YARN or Mesos users.
    
    Author: Hervé <[email protected]>
    
    Closes #17709 from dud225/patch-1.
    dud225 authored and srowen committed Apr 21, 2017
    Commit: 3476799
  6. [SPARK-20412] Throw ParseException from visitNonOptionalPartitionSpec…

    … instead of returning null values.
    
    ## What changes were proposed in this pull request?
    
    If a partitionSpec is supposed to not contain optional values, a ParseException should be thrown, and not nulls returned.
    The nulls can later cause NullPointerExceptions in places not expecting them.
    
    ## How was this patch tested?
    
    A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to throw a NullPointerException.
    Now it throws a ParseException.
    
    Author: Juliusz Sompolski <[email protected]>
    
    Closes #17707 from juliuszsompolski/SPARK-20412.
    juliuszsompolski authored and cloud-fan committed Apr 21, 2017
    Commit: c9e6035
  7. [SPARK-20341][SQL] Support BigInt's value that does not fit in long v…

    …alue range
    
    ## What changes were proposed in this pull request?
    
    This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into the long value range (e.g. `Long.MAX_VALUE+1`). When we run the sample program below with the current Spark, the exception below is thrown.
    
    This PR keeps the value using `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
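    A minimal sketch of that fallback in plain Scala (the helper is illustrative, not the actual `Decimal` code):
    ```scala
    // The PR detects overflow by catching the ArithmeticException thrown by
    // BigInteger.longValueExact; this sketch uses isValidLong to the same effect.
    def toDecimalValue(v: scala.math.BigInt): Either[Long, BigDecimal] =
      if (v.isValidLong) Left(v.toLong) else Right(BigDecimal(v))

    toDecimalValue(BigInt(42))                      // Left(42)
    toDecimalValue(BigInt("10000000000000000002"))  // Right(...) - beyond the Long range
    ```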
    
    Sample program:
    ```
    case class BigIntWrapper(value: scala.math.BigInt)
    spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002")) :: Nil).show
    ```
    Exception:
    ```
    Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
    staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
    java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
    staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
    	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
    	at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
    	at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
    	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    	at scala.collection.immutable.List.map(List.scala:285)
    	at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
    	at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
    	at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
    	at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    ...
    Caused by: java.lang.ArithmeticException: BigInteger out of long range
    	at java.math.BigInteger.longValueExact(BigInteger.java:4531)
    	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
    	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
    	at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
    	... 59 more
    ```
    
    ## How was this patch tested?
    
    Add new test suite into `DecimalSuite`
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #17684 from kiszk/SPARK-20341.
    kiszk authored and cloud-fan committed Apr 21, 2017
    Commit: a750a59
  8. [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0

    ## What changes were proposed in this pull request?
    
    When reg == 0, MLOR has multiple solutions and we need to center the coefficients to get identical results.
    BUT the current implementation centers the `coefficientMatrix` by the global mean of all coefficients.

    In fact, the `coefficientMatrix` should be centered on each feature index itself, because, according to the MLOR probability distribution function, it can easily be proven that:
    if `{ w0, w1, ..., w(K-1) }` make up the `coefficientMatrix`,
    then `{ w0 + c, w1 + c, ..., w(K-1) + c }` is an equivalent solution,
    where `c` is an arbitrary vector of dimension `numFeatures`.
    Reference: https://core.ac.uk/download/pdf/6287975.pdf

    So we need to center the `coefficientMatrix` on each feature dimension separately.
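    A small sketch of per-feature centering on a plain 2-D array (illustrative only, not the actual MLOR code):
    ```scala
    // Subtract each feature column's mean across classes, so the coefficients for
    // every feature sum to zero while the predicted probabilities stay unchanged.
    def centerPerFeature(coeffs: Array[Array[Double]]): Array[Array[Double]] = {
      val numClasses  = coeffs.length
      val numFeatures = coeffs.head.length
      val columnMeans = Array.tabulate(numFeatures)(j => coeffs.map(_(j)).sum / numClasses)
      coeffs.map(row => Array.tabulate(numFeatures)(j => row(j) - columnMeans(j)))
    }
    ```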
    
    **We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**
    
    ## How was this patch tested?
    
    Tests added.
    
    Author: WeichenXu <[email protected]>
    
    Closes #17706 from WeichenXu123/mlor_center.
    WeichenXu123 authored and dbtsai committed Apr 21, 2017
    Commit: eb00378
  9. [SPARK-20371][R] Add wrappers for collect_list and collect_set

    ## What changes were proposed in this pull request?
    
    Adds wrappers for `collect_list` and `collect_set`.
    
    ## How was this patch tested?
    
    Unit tests, `check-cran.sh`
    
    Author: zero323 <[email protected]>
    
    Closes #17672 from zero323/SPARK-20371.
    zero323 authored and Felix Cheung committed Apr 21, 2017
    Commit: fd648bf
  10. [SPARK-20401][DOC] In the spark official configuration document, the …

    …'spark.driver.supervise' configuration parameter specification and default values are necessary.
    
    ## What changes were proposed in this pull request?
    Submit the Spark job using the REST interface, e.g.:
    curl -X POST http://10.43.183.120:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
        "action": "CreateSubmissionRequest",
        "appArgs": [
            "myAppArgument"
        ],
        "appResource": "/home/mr/gxl/test.jar",
        "clientSparkVersion": "2.2.0",
        "environmentVariables": {
            "SPARK_ENV_LOADED": "1"
        },
        "mainClass": "cn.zte.HdfsTest",
        "sparkProperties": {
            "spark.jars": "/home/mr/gxl/test.jar",
            **"spark.driver.supervise": "true",**
            "spark.app.name": "HdfsTest",
            "spark.eventLog.enabled": "false",
            "spark.submit.deployMode": "cluster",
            "spark.master": "spark://10.43.183.120:6066"
        }
    }'
    
    **I want to make sure that the driver is automatically restarted if it fails with a non-zero exit code,
    but I cannot find the 'spark.driver.supervise' configuration parameter specification and default value in the official Spark documentation.**
    ## How was this patch tested?
    
    manual tests
    
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17696 from guoxiaolongzte/SPARK-20401.
    郭小龙 10207633 authored and srowen committed Apr 21, 2017
    Commit: ad29040

Commits on Apr 22, 2017

  1. [SPARK-20386][SPARK CORE] modify the log info if the block exists on …

    …the slave already
    
    ## What changes were proposed in this pull request?
    Modify the reported added-memory size to memSize - originalMemSize if the block already exists on the slave,
    since if the block exists, the memory actually added is memSize - originalMemSize; if originalMemSize is bigger than memSize, then the log should report removed memory, with a removed size of originalMemSize - memSize.
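    A minimal sketch of the intended log message (the method name is illustrative, not the actual BlockManager code):
    ```scala
    // Report the delta between the new size and the size already stored for the
    // block, rather than the absolute new size.
    def memoryChangeMessage(originalMemSize: Long, memSize: Long): String = {
      val delta = memSize - originalMemSize
      if (delta >= 0) s"Added memory: $delta bytes"
      else s"Removed memory: ${-delta} bytes"
    }
    ```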
    
    ## How was this patch tested?
    Multiple runs on existing unit tests
    
    
    Author: eatoncys <[email protected]>
    
    Closes #17683 from eatoncys/SPARK-20386.
    eatoncys authored and srowen committed Apr 22, 2017
    Commit: 05a4514
  2. [SPARK-20430][SQL] Initialise RangeExec parameters in a driver side

    ## What changes were proposed in this pull request?
    This PR initialises the `RangeExec` parameters on the driver side.
    In the current master, the query below throws a `NullPointerException`;
    ```
    sql("SET spark.sql.codegen.wholeStage=false")
    sql("SELECT * FROM range(1)").show
    
    17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54)
            at org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343)
            at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506)
            at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505)
            at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
            at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
            at org.apache.spark.scheduler.Task.run(Task.scala:108)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ```
    
    ## How was this patch tested?
    Added a test in `DataFrameRangeSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17717 from maropu/SPARK-20430.
    maropu authored and gatorsmile committed Apr 22, 2017
    Commit: b3c572a

Commits on Apr 23, 2017

  1. [SPARK-20132][DOCS] Add documentation for column string functions

    ## What changes were proposed in this pull request?
    Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op`.
    
    There may be a better place to put the docstrings. I put them immediately above the Column class.
    
    ## How was this patch tested?
    
    I ran `make html` on my local computer to remake the documentation, and verified that the html pages were displaying the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work, I think due to issues on my computer.

    These docstrings are my original work and are freely licensed.
    
    davies has done the most recent work reorganizing `_bin_op`
    
    Author: Michael Patterson <[email protected]>
    
    Closes #17469 from map222/patterson-documentation.
    map222 authored and holdenk committed Apr 23, 2017
    Commit: 8765bc1
  2. [SPARK-20385][WEB-UI] Submitted Time' field, the date format needs to…

    … be formatted, in running Drivers table or Completed Drivers table in master web ui.
    
    ## What changes were proposed in this pull request?
    The 'Submitted Time' field's date format **needs to be formatted** in the Running Drivers and Completed Drivers tables in the master web UI.
    Before this fix, e.g.:

    Completed Drivers

    | Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
    |---|---|---|---|---|---|---|
    | driver-20170419145755-0005 | **Wed Apr 19 14:57:55 CST 2017** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |

    please see the attachment: https://issues.apache.org/jira/secure/attachment/12863977/before_fix.png

    After this fix, e.g.:

    Completed Drivers

    | Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
    |---|---|---|---|---|---|---|
    | driver-20170419145755-0006 | **2017/04/19 16:01:25** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |

    please see the attachment: https://issues.apache.org/jira/secure/attachment/12863976/after_fix.png

    The 'Submitted Time' field's date format **has already been formatted** in the Running Applications and Completed Applications tables in the master web UI, **and it is correct.** E.g.:

    Running Applications

    | Application ID | Name | Cores | Memory per Executor | Submitted Time | User | State | Duration |
    |---|---|---|---|---|---|---|---|
    | app-20170419160910-0000 (kill) | SparkSQL::10.43.183.120 | 1 | 5.0 GB | **2017/04/19 16:09:10** | root | RUNNING | 53 s |

    **The formatted time is easier to read and is consistent with the applications tables, so I think it's worth fixing.**
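    For illustration, formatting the raw date into the same pattern the applications tables already use (a sketch, not the exact web UI code):
    ```scala
    import java.text.SimpleDateFormat
    import java.util.Date

    // "Wed Apr 19 14:57:55 CST 2017" comes from Date.toString; formatting it
    // explicitly yields the consistent "yyyy/MM/dd HH:mm:ss" style.
    val submittedTime = new Date()
    val formatter = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
    println(formatter.format(submittedTime))  // e.g. 2017/04/19 16:01:25
    ```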
    
    ## How was this patch tested?
    
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17682 from guoxiaolongzte/SPARK-20385.
    郭小龙 10207633 authored and srowen committed Apr 23, 2017
    Commit: 2eaf4f3

Commits on Apr 24, 2017

  1. [BUILD] Close stale PRs

    ## What changes were proposed in this pull request?
    This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale: their JIRA tickets have either already been closed or do not exist (and they do not appear to be minor issues).
    
    // Open PRs whose JIRA tickets have already been closed
    Closes #11785
    Closes #13027
    Closes #13614
    Closes #13761
    Closes #15197
    Closes #14006
    Closes #12576
    Closes #15447
    Closes #13259
    Closes #15616
    Closes #14473
    Closes #16638
    Closes #16146
    Closes #17269
    Closes #17313
    Closes #17418
    Closes #17485
    Closes #17551
    Closes #17463
    Closes #17625
    
    // Open PRs whose JIRA tickets do not exist and which are not minor issues
    Closes #10739
    Closes #15193
    Closes #15344
    Closes #14804
    Closes #16993
    Closes #17040
    Closes #15180
    Closes #17238
    
    ## How was this patch tested?
    N/A
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17734 from maropu/resolved_pr.
    maropu authored and srowen committed Apr 24, 2017
    Commit: e9f9715
  2. [SPARK-20439][SQL] Fix Catalog API listTables and getTable when faile…

    …d to fetch table metadata
    
    ### What changes were proposed in this pull request?
    
    `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve the table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table (without the description and tableType).
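    A minimal sketch of that fallback behavior (the helper names are hypothetical, not the actual CatalogImpl code):
    ```scala
    import scala.util.Try

    // If fetching the full metadata fails, still return what we know about the
    // table instead of propagating the error to listTables/getTable callers.
    case class TableSummary(name: String, description: Option[String], tableType: Option[String])

    def summarize(name: String, fetchMetadata: String => (String, String)): TableSummary =
      Try(fetchMetadata(name))
        .map { case (desc, tpe) => TableSummary(name, Some(desc), Some(tpe)) }
        .getOrElse(TableSummary(name, None, None))
    ```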
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <[email protected]>
    
    Closes #17730 from gatorsmile/listTables.
    gatorsmile authored and cloud-fan committed Apr 24, 2017
    Commit: 776a2c0
  3. [SPARK-18901][ML] Require in LR LogisticAggregator is redundant

    ## What changes were proposed in this pull request?
    
    In MultivariateOnlineSummarizer,

    `add` and `merge` have checks for weights and feature sizes. The corresponding checks in LR are redundant and are removed in this PR.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: [email protected] <[email protected]>
    
    Closes #17478 from wangmiao1981/logit.
    wangmiao1981 authored and yanboliang committed Apr 24, 2017
    Commit: 90264ac
  4. [SPARK-20438][R] SparkR wrappers for split and repeat

    ## What changes were proposed in this pull request?
    
    Add wrappers for `o.a.s.sql.functions`:
    
    - `split` as `split_string`
    - `repeat` as `repeat_string`
    
    ## How was this patch tested?
    
    Existing tests, additional unit tests, `check-cran.sh`
    
    Author: zero323 <[email protected]>
    
    Closes #17729 from zero323/SPARK-20438.
    zero323 authored and Felix Cheung committed Apr 24, 2017
    Commit: 8a272dd

Commits on Apr 25, 2017

  1. [SPARK-20239][CORE] Improve HistoryServer's ACL mechanism

    ## What changes were proposed in this pull request?
    
    Current SHS (Spark History Server) two different ACLs:
    
    * The ACL of the base URL, controlled by "spark.acls.enabled" or "spark.ui.acls.enabled". With this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list all the applications; otherwise none can be listed. This also affects the REST APIs that list the summary of all apps or of a single app.
    * The per-application ACL, controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin users and the user/group who ran an app can access that app's details.
    
    With these two ACLs, we may encounter several unexpected behaviors:

    1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
    2. If the base URL's ACL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
    3. Changes to the Live UI's ACL also affect the History UI's ACL, since they share the same conf file.

    The unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage everything.

    So, to improve the SHS's ACL mechanism, this PR proposes to:
    
    1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for history server.
    2. Check permission for event-log download REST API.
    
    With this PR:
    
    1. Admin users can see/download the list of all applications, as well as application details.
    2. Normal users can see the list of all applications, but can only download and check the details of the applications accessible to them.
    
    ## How was this patch tested?
    
    New UTs are added, also verified in real cluster.
    
    CC tgravescs vanzin, please help review; this PR changes semantics you implemented previously. Thanks a lot.
    
    Author: jerryshao <[email protected]>
    
    Closes #17582 from jerryshao/SPARK-20239.
    jerryshao authored and Marcelo Vanzin committed Apr 25, 2017
    Commit: 5280d93
  2. [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT

    This patch bumps the master branch version to `2.3.0-SNAPSHOT`.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #17753 from JoshRosen/SPARK-20453.
    JoshRosen authored and rxin committed Apr 25, 2017
    Commit: f44c8a8
  3. [SPARK-20451] Filter out nested mapType datatypes from sort order in …

    …randomSplit
    
    ## What changes were proposed in this pull request?
    
    In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping
    splits.
    
    To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapType`s cannot be sorted, this patch explicitly prunes them out of the sort order. Additionally, if the resulting sort order is empty, this patch materializes the dataset to guarantee determinism.
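    A small sketch of the pruning, checking only top-level columns for brevity (the actual change also handles nested fields):
    ```scala
    import org.apache.spark.sql.types.{MapType, StructType}

    // Keep only columns whose type can participate in the deterministic sort order;
    // MapType columns are dropped because maps have no defined ordering.
    def sortableColumns(schema: StructType): Seq[String] =
      schema.fields.toSeq.collect {
        case field if !field.dataType.isInstanceOf[MapType] => field.name
      }
    ```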
    
    ## How was this patch tested?
    
    Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test DataFrames with MapTypes and nested MapTypes.
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #17751 from sameeragarwal/randomsplit2.
    sameeragarwal authored and cloud-fan committed Apr 25, 2017
    Commit: 31345fd
  4. [SPARK-20455][DOCS] Fix Broken Docker IT Docs

    ## What changes were proposed in this pull request?
    
    Just added the Maven `test`goal.
    
    ## How was this patch tested?
    
    No test needed, just a trivial documentation fix.
    
    Author: Armin Braun <[email protected]>
    
    Closes #17756 from original-brownbear/SPARK-20455.
    original-brownbear authored and srowen committed Apr 25, 2017
    Commit: c8f1219
  5. [SPARK-20404][CORE] Using Option(name) instead of Some(name)

    Using Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following
    ```
    sparkContext.accumulator(0, null)
    ```
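    For illustration, the difference between the two constructors when the name may be null (a plain Scala sketch):
    ```scala
    // Option(x) collapses a null argument to None, while Some(x) wraps the null
    // and lets it leak into later calls where it can fail at runtime.
    val fromOption = Option(null: String)  // None
    val fromSome   = Some(null: String)    // Some(null)

    println(fromOption.map(_.toUpperCase)) // None - safe
    // fromSome.map(_.toUpperCase)         // would throw NullPointerException
    ```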
    
    Author: Sergey Zhemzhitsky <[email protected]>
    
    Closes #17740 from szhem/SPARK-20404-null-acc-names.
    szhem authored and srowen committed Apr 25, 2017
    Commit: 0bc7a90
  6. [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redun…

    …dant
    
    ## What changes were proposed in this pull request?
    
    This is a follow-up PR of #17478.
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: wangmiao1981 <[email protected]>
    
    Closes #17754 from wangmiao1981/followup.
    wangmiao1981 authored and yanboliang committed Apr 25, 2017
    Commit: 387565c
  7. [SPARK-20449][ML] Upgrade breeze version to 0.13.1

    ## What changes were proposed in this pull request?
    Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
    
    ## How was this patch tested?
    Existing unit tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17746 from yanboliang/spark-20449.
    yanboliang authored and dbtsai committed Apr 25, 2017
    Commit: 67eef47
  8. [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel

    ## What changes were proposed in this pull request?
    
    Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
    
    This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
    It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
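
    A hedged usage sketch (hypothetical paths, assuming a spark-shell `sc`): set a checkpoint directory so a long Pregel-based computation can truncate its lineage periodically.

    ```scala
    import org.apache.spark.graphx.GraphLoader

    // Hypothetical locations; any reliable storage works for the checkpoint dir.
    sc.setCheckpointDir("hdfs:///tmp/graphx-checkpoints")
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///tmp/edges.txt")
    // connectedComponents is Pregel-based; with the checkpoint dir set, its
    // lineage no longer grows unboundedly across iterations.
    val components = graph.connectedComponents().vertices
    ```
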
    ## How was this patch tested?
    
    unit tests, manual tests
    
    Author: ding <[email protected]>
    Author: dding3 <[email protected]>
    Author: Michael Allman <[email protected]>
    
    Closes #15125 from dding3/cp2_pregel.
    ding authored and Felix Cheung committed Apr 25, 2017
    Commit: 0a7f5f2

Commits on Apr 26, 2017

  1. [SPARK-18127] Add hooks and extension points to Spark

    ## What changes were proposed in this pull request?
    
    This patch adds support for customizing the spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies or even a customized parser.
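
    A hedged sketch (the rule below is a do-nothing placeholder) of wiring a custom optimizer rule through the new extension points:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder rule that leaves the plan untouched, just to show the wiring.
    case class NoopRule(spark: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions(ext => ext.injectOptimizerRule(session => NoopRule(session)))
      .getOrCreate()
    ```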
    
    ## How was this patch tested?
    
    Unit Tests in SparkSessionExtensionSuite
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #17724 from sameeragarwal/session-extensions.
    sameeragarwal authored and gatorsmile committed Apr 26, 2017
    Commit: caf3920
  2. [SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL fu…

    …nctions
    
    ## What changes were proposed in this pull request?
    
    change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`
    
        public JsonParser createParser(String content)
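
    For reference, a minimal sketch of that Jackson call (not code from the patch):

    ```scala
    import com.fasterxml.jackson.core.JsonFactory

    // Parse directly from a String rather than from UTF-8 bytes, so malformed
    // input surfaces consistently as a Jackson parse error.
    val parser = new JsonFactory().createParser("""{"a": 1}""")
    ```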
    
    ## How was this patch tested?
    
    existing unit tests
    
    Author: Eric Wasserman <[email protected]>
    
    Closes #17693 from ewasserman/SPARK-20314.
    Eric Wasserman authored and cloud-fan committed Apr 26, 2017
    Commit: 57e1da3
  3. [SPARK-20437][R] R wrappers for rollup and cube

    ## What changes were proposed in this pull request?
    
    - Add `rollup` and `cube` methods and corresponding generics.
    - Add short description to the vignette.
    
    ## How was this patch tested?
    
    - Existing unit tests.
    - Additional unit tests covering new features.
    - `check-cran.sh`.
    
    Author: zero323 <[email protected]>
    
    Closes #17728 from zero323/SPARK-20437.
    zero323 authored and Felix Cheung committed Apr 26, 2017
    Commit: df58a95
  4. [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools

    ## What changes were proposed in this pull request?
    
    Simple documentation change to remove explicit vendor references.
    
    ## How was this patch tested?
    
    NA
    
    Author: anabranch <[email protected]>
    
    Closes #17695 from anabranch/remove-vendor.
    anabranch authored and srowen committed Apr 26, 2017
    Commit: 7a36525
  5. [SPARK-19812] YARN shuffle service fails to relocate recovery DB acro…

    …ss NFS directories
    
    ## What changes were proposed in this pull request?
    
    Change from using java Files.move to using Hadoop filesystem operations to move the directories. Java's Files.move does not work when moving directories across NFS mounts, and its documentation also says that if the directory has entries you should do a recursive move. We are already using the Hadoop filesystem here, so just use the local filesystem from there, as it handles this properly.
    
    Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.
    
    ## How was this patch tested?
    
    Ran YarnShuffleServiceSuite unit tests. Unfortunately couldn't easily add one here since it involves NFS.
    Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks.
    
    Author: Tom Graves <[email protected]>
    
    Closes #17748 from tgravescs/SPARK-19812.
    tgravescs authored and Tom Graves committed Apr 26, 2017
    Commit: 7fecf51
  6. [MINOR][ML] Fix some PySpark & SparkR flaky tests

    ## What changes were proposed in this pull request?
    Some PySpark & SparkR tests run with a tiny dataset and a tiny ```maxIter```, which means they have not converged. Checking intermediate results during iteration doesn't make sense, since those intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue at #17746 when we upgraded breeze to 0.13.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17757 from yanboliang/flaky-test.
    yanboliang committed Apr 26, 2017
    Commit: dbb06c6
  7. [SPARK-20391][CORE] Rename memory related fields in ExecutorSummay

    ## What changes were proposed in this pull request?
    
    This is a follow-up of #14617 to make the names of memory-related fields more meaningful.
    
    For backward compatibility, I didn't change the `maxMemory` and `memoryUsed` fields.
    
    ## How was this patch tested?
    
    Existing UT and local verification.
    
    CC squito and tgravescs .
    
    Author: jerryshao <[email protected]>
    
    Closes #17700 from jerryshao/SPARK-20391.
    jerryshao authored and squito committed Apr 26, 2017
    Commit: 66dd5b8
  8. [SPARK-20473] Enabling missing types in ColumnVector.Array

    ## What changes were proposed in this pull request?
    ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should also be added to ColumnVector.Array.
    
    ## How was this patch tested?
    Tested using existing unit tests.
    
    Author: Michal Szafranski <[email protected]>
    
    Closes #17772 from michal-databricks/spark-20473.
    michal-databricks authored and rxin committed Apr 26, 2017
    Commit: 99c6cf9
  9. [SPARK-20474] Fixing OnHeapColumnVector reallocation

    ## What changes were proposed in this pull request?
    OnHeapColumnVector reallocation copies data to the new storage only up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
    
    ## How was this patch tested?
    Tested using existing unit tests.
    
    Author: Michal Szafranski <[email protected]>
    
    Closes #17773 from michal-databricks/spark-20474.
    michal-databricks authored and rxin committed Apr 26, 2017
    Commit: a277ae8
  10. [SPARK-12868][SQL] Allow adding jars from hdfs

    ## What changes were proposed in this pull request?
    Spark 2.2 is going to be cut; it'll be great if SPARK-12868 can be resolved before that. There have been several PRs for this, like [PR#16324](#16324), but all of them have been inactive for a long time or have been closed.
    
    This PR adds a SparkUrlStreamHandlerFactory, which relies on the URL 'protocol' to choose the appropriate
    URLStreamHandlerFactory (such as FsUrlStreamHandlerFactory) to create the URLStreamHandler.
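
    A hedged sketch of the usage this unblocks (jar path and class name are hypothetical):

    ```scala
    // Previously this failed when the jar was later opened as a java.net.URL
    // with an hdfs:// protocol ("failed unknown protocol: hdfs").
    spark.sql("ADD JAR hdfs:///user/me/udfs/my-udfs.jar")
    spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpper'")
    spark.sql("SELECT my_upper('hello')").show()
    ```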
    
    ## How was this patch tested?
    1. Add a new unit test.
    2. Check manually.
    Before: throws an exception with "failed unknown protocol: hdfs"
    <img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">
    
    After:
    <img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">
    
    Author: Weiqing Yang <[email protected]>
    
    Closes #17342 from weiqingy/SPARK-18910.
    weiqingy authored and Marcelo Vanzin committed Apr 26, 2017
    Commit: 2ba1eba

Commits on Apr 27, 2017

  1. [SPARK-20435][CORE] More thorough redaction of sensitive information

    This change does a more thorough redaction of sensitive information from logs and the UI,
    and adds unit tests that ensure no regressions leak sensitive information to the logs.
    
    The motivation for this change was appearance of password like so in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations:
    `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."
    `
    Previously, the redaction logic only checked whether the key matched the secret regex pattern; if so, it would redact the value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched too. This PR expands the check to cover values as well.
    
    ## How was this patch tested?
    
    New unit tests were added that ensure no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value won't be redacted. However, the non-sensitive value had the literal "secret" in it, which caused it to be redacted. Simply updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed it.
    
    Author: Mark Grover <[email protected]>
    
    Closes #17725 from markgrover/spark-20435.
    markgrover authored and Marcelo Vanzin committed Apr 27, 2017
    Commit: 66636ef
  2. [SPARK-20425][SQL] Support a vertical display mode for Dataset.show

    ## What changes were proposed in this pull request?
    This PR adds a new display mode for `Dataset.show` to print output rows vertically (one line per column value). In the current master, when printing a Dataset with many columns, the readability is low, as shown below:
    
    ```
    scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
    scala> df.show(3, 0)
    +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
    |c0                |c1                |c2                |c3                 |c4                |c5                |c6                 |c7                |c8                |c9                |c10               |c11                |c12               |c13               |c14               |c15                |c16                |c17                |c18               |c19               |c20                |c21               |c22                |c23               |c24                |c25                |c26                |c27                 |c28                |c29               |c30                |c31                 |c32               |c33               |c34                |c35                |c36                |c37               |c38               |c39                |c40               |c41               |c42                |c43                |c44                |c45               |c46                 |c47                 |c48                |c49                |c50                |c51                |c52                |c53                |c54                 |c55                |c56                |c57                |c58                |c59               |c60               |c61                |c62                |c63               |c64                |c65               |c66               |c67              |c68                |c69                |c70               |c71                |c72               |c73                |c74                |c75                |c76               |c77                |c78               |c79                |c80                |c81                |c82                |c83                |c84                |c85                |c86                |c87               |c88                |c89                |c90               |c91               |c92               |c93                |c94               |c95                |c96               |c97                |c98                |c99                |
    +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
    |0.6306087152476858|0.9174349686288383|0.5511324165035159|0.3320844128641819 |0.7738486877101489|0.2154915886962553|0.4754997600674299 |0.922780639280355 |0.7136894772661909|0.2277580838165979|0.5926874459847249|0.40311408392226633|0.467830264333843 |0.8330466896984213|0.1893258482389527|0.6320849515511165 |0.7530911056912044 |0.06700254871955424|0.370528597355559 |0.2755437445193154|0.23704391110980128|0.8067400174905822|0.13597793616251852|0.1708888820162453|0.01672725007605702|0.983118121881555  |0.25040195628629924|0.060537253723083384|0.20000530582637488|0.3400572407133511|0.9375689433322597 |0.057039316954370256|0.8053269714347623|0.5247817572228813|0.28419308820527944|0.9798908885194533 |0.31805988175678146|0.7034448027077574|0.5400575751346084|0.25336322371116216|0.9361634546853429|0.6118681368289798|0.6295081549153907 |0.13417468943957422|0.41617137072255794|0.7267230869252035|0.023792726137561115|0.5776157058356362  |0.04884204913195467|0.26728716103441275|0.646680370807925  |0.9782712690657244 |0.16434031314818154|0.20985522381321275|0.24739842475440077 |0.26335189682977334|0.19604841662422068|0.10742950487300651|0.20283136488091502|0.3100312319723688|0.886959006630645 |0.25157102269776244|0.34428775168410786|0.3500506818575777|0.3781142441912052 |0.8560316444386715|0.4737104888956839|0.735903101602148|0.02236617130529006|0.8769074095835873 |0.2001426662503153|0.5534032319238532 |0.7289496620397098|0.41955191309992157|0.9337700133660436 |0.34059094378451005|0.6419144759403556|0.08167496930341167|0.9947099478497635|0.48010888605366586|0.22314796858167918|0.17786598882331306|0.7351521162297135 |0.5422057170020095 |0.9521927872726792 |0.7459825486368227 |0.40907708791990627|0.8903819313311575|0.7251413746923618 |0.2977174938745204 |0.9515209660203555|0.9375968604766713|0.5087851740042524|0.4255237544908751 |0.8023768698664653|0.48003189618006703|0.1775841829745185|0.09050775629268382|0.6743909291138167 |0.2498415755876865 |
    |0.6866473844170801|0.4774360641212433|0.631696201340726 |0.33979113021468343|0.5663049010847052|0.7280190472258865|0.41370958502324806|0.9977433873622218|0.7671957338989901|0.2788708556233931|0.3355106391656496|0.88478952319287   |0.0333974166999893|0.6061744715862606|0.9617779139652359|0.22484954822341863|0.12770906021550898|0.5577789629508672 |0.2877649024640704|0.5566577406549361|0.9334933255278052 |0.9166720585157266|0.9689249324600591 |0.6367502457478598|0.7993572745928459 |0.23213222324218108|0.11928284054154137|0.6173493362456599  |0.0505122058694798 |0.9050228629552983|0.17112767911121707|0.47395598348370005 |0.5820498657823081|0.6241124650645072|0.18587258258036776|0.14987593554122225|0.3079446253653946 |0.9414228822867968|0.8362276265462365|0.9155655305576353 |0.5121559807153562|0.8963362656525707|0.22765970274318037|0.8177039187132797 |0.8190326635933787 |0.5256005177032199|0.8167598457269669  |0.030936807130934496|0.6733006585281015 |0.4208049626816347 |0.24603085738518538|0.22719198954208153|0.1622280557565281 |0.22217325159218038|0.014684419513742553|0.08987111517447499|0.2157764759142622 |0.8223414104088321 |0.4868624404491777 |0.4016191733088167|0.6169281906889263|0.15603611040433385|0.18289285085714913|0.9538408988218972|0.15037154865295121|0.5364516961987454|0.8077254873163031|0.712600478545675|0.7277477241003857 |0.19822912960348305|0.8305051199208777|0.18631911396566114|0.8909532487898342|0.3470409226992506 |0.35306974180587636|0.9107058868891469 |0.3321327206004986|0.48952332459050607|0.3630403307479373|0.5400046826340376 |0.5387377194310529 |0.42860539421837585|0.23214101630985995|0.21438968839794847|0.15370603160082352|0.04355605642700022|0.6096006707067466 |0.6933354157094292|0.06302172470859002|0.03174631856164001|0.664243581650643 |0.7833239547446621|0.696884598352864 |0.34626385933237736|0.9263495598791336|0.404818892816584  |0.2085585394755507|0.6150004897990109 |0.05391193524302473|0.28188484028329097|
    +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
    only showing top 2 rows
    ```
    
    `psql`, the CLI for PostgreSQL, supports a vertical display mode for this case:
    http://stackoverflow.com/questions/9604723/alternate-output-format-for-psql
    
    ```
    -RECORD 0-------------------
     c0  | 0.6306087152476858
     c1  | 0.9174349686288383
     c2  | 0.5511324165035159
    ...
     c98 | 0.05391193524302473
     c99 | 0.28188484028329097
    -RECORD 1-------------------
     c0  | 0.6866473844170801
     c1  | 0.4774360641212433
     c2  | 0.631696201340726
    ...
     c98 | 0.05391193524302473
     c99 | 0.28188484028329097
    only showing top 2 rows
    ```
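
    A short usage sketch, assuming the `Dataset.show(numRows, truncate, vertical)` overload added here:

    ```scala
    // Print the first 3 rows vertically, without truncating values.
    val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
    df.show(3, 0, vertical = true)
    ```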
    
    ## How was this patch tested?
    Added tests in `DataFrameSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17733 from maropu/SPARK-20425.
    maropu authored and gatorsmile committed Apr 27, 2017
    Commit: b4724db
  3. [DOCS][MINOR] Add missing since to SparkR repeat_string note.

    ## What changes were proposed in this pull request?
    
    Replace
    
        note repeat_string 2.3.0
    
    with
    
        note repeat_string since 2.3.0
    
    ## How was this patch tested?
    
    `create-docs.sh`
    
    Author: zero323 <[email protected]>
    
    Closes #17779 from zero323/REPEAT-NOTE.
    zero323 authored and Felix Cheung committed Apr 27, 2017
    Commit: b58cf77
  4. [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide

    ## What changes were proposed in this pull request?
    
    Add `spark.fpGrowth` to SparkR programming guide.
    
    ## How was this patch tested?
    
    Manual tests.
    
    Author: zero323 <[email protected]>
    
    Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.
    zero323 authored and Felix Cheung committed Apr 27, 2017
    Commit: ba76662
  5. [SPARK-20483] Mesos Coarse mode may starve other Mesos frameworks

    ## What changes were proposed in this pull request?
    
    Set maxCores to be a multiple of the smallest executor that can be launched. This ensures that we correctly detect the condition where no more executors will be launched when spark.cores.max is not a multiple of spark.executor.cores.
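
    For illustration, a hedged sketch of the arithmetic involved (values hypothetical):

    ```scala
    import org.apache.spark.SparkConf

    // With cores.max = 10 and executor.cores = 4, only floor(10 / 4) = 2 executors
    // (8 cores) can ever be launched; the scheduler must treat 8, not 10, as the
    // effective cap instead of holding offers while waiting for the last 2 cores.
    val conf = new SparkConf()
      .set("spark.cores.max", "10")
      .set("spark.executor.cores", "4")
    ```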
    
    ## How was this patch tested?
    
    This was manually tested with other sample frameworks measuring their incoming offers to determine if starvation would occur.
    
    dbtsai mgummelt
    
    Author: Davis Shepherd <[email protected]>
    
    Closes #17786 from dgshep/fix_mesos_max_cores.
    dgshep authored and dbtsai committed Apr 27, 2017
    Commit: 7633933
  6. [SPARK-20421][CORE] Mark internal listeners as deprecated.

    These listeners weren't really meant for external consumption, but they're
    public and marked with DeveloperApi. Adding the deprecated tag warns people
    that they may soon go away (as they will as part of the work for SPARK-18085).
    
    Note that not all types made public by #648
    are being deprecated. Some remaining types are still exposed through the
    SparkListener API.
    
    Also note the text for StorageStatus is a tiny bit different, since I'm not
    so sure I'll be able to remove it. But the effect for the users should be the
    same (they should stop trying to use it).
    
    Author: Marcelo Vanzin <[email protected]>
    
    Closes #17766 from vanzin/SPARK-20421.
    Marcelo Vanzin committed Apr 27, 2017
    Commit: 561e9cc
  7. [SPARK-20426] Lazy initialization of FileSegmentManagedBuffer for shu…

    …ffle service.
    
    ## What changes were proposed in this pull request?
    When an application contains a large number of shuffle blocks, the NodeManager requires lots of memory to keep the metadata (`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough, the NodeManager can run out of memory. This PR proposes lazy initialization of `FileSegmentManagedBuffer` in the shuffle service.
    
    ## How was this patch tested?
    
    Manually tested.
    
    Author: jinxing <[email protected]>
    
    Closes #17744 from jinxing64/SPARK-20426.
    jinxing authored and Tom Graves committed Apr 27, 2017
    Commit: 85c6ce6
  8. [SPARK-20482][SQL] Resolving Casts is too strict on having time zone set

    ## What changes were proposed in this pull request?
    
    Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved.
    With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information.
    
    Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default `timeZoneId = Option("GMT")`.
    
    ## How was this patch tested?
    
    Ran the de-relaxed `CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too.
    
    Author: Kris Mok <[email protected]>
    
    Closes #17777 from rednaxelafx/fix-catalyst-cast-timezone.
    rednaxelafx authored and gatorsmile committed Apr 27, 2017
    Commit: 26ac2ce
  9. [SPARK-20487][SQL] HiveTableScan node is quite verbose in explained…

    … plan
    
    ## What changes were proposed in this pull request?
    
    Changed `TreeNode.argString` to handle `CatalogTable` separately (otherwise it would call the default `toString` on the `CatalogTable`)
    
    ## How was this patch tested?
    
    - Expanded scope of existing unit test to ensure that verbose information is not present
    - Manual testing
    
    Before
    
    ```
    scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter ('name = foo)
       +- 'UnresolvedRelation `my_table`
    
    == Analyzed Logical Plan ==
    user_id: bigint, name: string, ds: string
    Project [user_id#13L, name#14, ds#15]
    +- Filter (name#14 = foo)
       +- SubqueryAlias my_table
          +- CatalogRelation CatalogTable(
    Database: default
    Table: my_table
    Owner: tejasp
    Created: Fri Apr 14 17:05:50 PDT 2017
    Last Access: Wed Dec 31 16:00:00 PST 1969
    Type: MANAGED
    Provider: hive
    Properties: [serialization.format=1]
    Statistics: 9223372036854775807 bytes
    Location: file:/tmp/warehouse/my_table
    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat: org.apache.hadoop.mapred.TextInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Partition Provider: Catalog
    Partition Columns: [`ds`]
    Schema: root
    -- user_id: long (nullable = true)
    -- name: string (nullable = true)
    -- ds: string (nullable = true)
    ), [user_id#13L, name#14], [ds#15]
    
    == Optimized Logical Plan ==
    Filter (isnotnull(name#14) && (name#14 = foo))
    +- CatalogRelation CatalogTable(
    Database: default
    Table: my_table
    Owner: tejasp
    Created: Fri Apr 14 17:05:50 PDT 2017
    Last Access: Wed Dec 31 16:00:00 PST 1969
    Type: MANAGED
    Provider: hive
    Properties: [serialization.format=1]
    Statistics: 9223372036854775807 bytes
    Location: file:/tmp/warehouse/my_table
    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat: org.apache.hadoop.mapred.TextInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Partition Provider: Catalog
    Partition Columns: [`ds`]
    Schema: root
    -- user_id: long (nullable = true)
    -- name: string (nullable = true)
    -- ds: string (nullable = true)
    ), [user_id#13L, name#14], [ds#15]
    
    == Physical Plan ==
    *Filter (isnotnull(name#14) && (name#14 = foo))
    +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
    Database: default
    Table: my_table
    Owner: tejasp
    Created: Fri Apr 14 17:05:50 PDT 2017
    Last Access: Wed Dec 31 16:00:00 PST 1969
    Type: MANAGED
    Provider: hive
    Properties: [serialization.format=1]
    Statistics: 9223372036854775807 bytes
    Location: file:/tmp/warehouse/my_table
    Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat: org.apache.hadoop.mapred.TextInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Partition Provider: Catalog
    Partition Columns: [`ds`]
    Schema: root
    -- user_id: long (nullable = true)
    -- name: string (nullable = true)
    -- ds: string (nullable = true)
    ), [user_id#13L, name#14], [ds#15]
    ```
    
    After
    
    ```
    scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter ('name = foo)
       +- 'UnresolvedRelation `my_table`
    
    == Analyzed Logical Plan ==
    user_id: bigint, name: string, ds: string
    Project [user_id#13L, name#14, ds#15]
    +- Filter (name#14 = foo)
       +- SubqueryAlias my_table
          +- CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
    
    == Optimized Logical Plan ==
    Filter (isnotnull(name#14) && (name#14 = foo))
    +- CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
    
    == Physical Plan ==
    *Filter (isnotnull(name#14) && (name#14 = foo))
    +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
    ```
    
    Author: Tejas Patil <[email protected]>
    
    Closes #17780 from tejasapatil/SPARK-20487_verbose_plan.
    tejasapatil authored and gatorsmile committed Apr 27, 2017
    Commit: a4aa466
  10. [SPARK-20483][MINOR] Test for Mesos Coarse mode may starve other Meso…

    …s frameworks
    
    ## What changes were proposed in this pull request?
    
    Add test case for scenarios where executor.cores is set as a
    (non)divisor of spark.cores.max
    This tests the change in
    #17786
    
    ## How was this patch tested?
    
    Ran the existing test suite with the new tests
    
    dbtsai
    
    Author: Davis Shepherd <[email protected]>
    
    Closes #17788 from dgshep/add_mesos_test.
    dgshep authored and dbtsai committed Apr 27, 2017
    Commit: 039e32c
  11. [SPARK-20047][ML] Constrained Logistic Regression

    ## What changes were proposed in this pull request?
    MLlib ```LogisticRegression``` should support bound constrained optimization (only for L2 regularization). Users can add bound constraints to coefficients to make the solver produce a solution in the specified range.
    
    Under the hood, we call Breeze [```L-BFGS-B```](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGSB.scala) as the solver for bound constrained optimization. But in the current breeze implementation there are some bugs in L-BFGS-B, which scalanlp/breeze#633 fixed. We need to upgrade the breeze dependency later; for now we use the workaround L-BFGS-B in this PR temporarily for review.
    
    ## How was this patch tested?
    Unit tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17715 from yanboliang/spark-20047.
    yanboliang authored and dbtsai committed Apr 27, 2017
    Commit: 606432a
  12. [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fi…

    …x the potential hang in CachedKafkaConsumer
    
    ## What changes were proposed in this pull request?
    
    This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is just a best effort to avoid hanging forever. If the user uses `CachedKafkaConsumer` in another thread (e.g., creates a new thread or Future), the potential hang may still happen.
    
    ## How was this patch tested?
    
    The new added test.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17761 from zsxwing/int.
    zsxwing authored and tdas committed Apr 27, 2017
    Commit: 01c999e
  13. [SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationExcept…

    …ion for batch Kafka DataFrame
    
    ## What changes were proposed in this pull request?
    
    If a batch Kafka query is cancelled but one of its tasks cannot be cancelled, rerunning the same DataFrame may cause a ConcurrentModificationException because it may launch two tasks sharing the same group id.
    
    This PR always creates a new consumer when `reuseKafkaConsumer = false` to avoid the ConcurrentModificationException. It also contains other minor fixes.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #17752 from zsxwing/kafka-fix.
    zsxwing authored and tdas committed Apr 27, 2017
    Commit: 823baca

Commits on Apr 28, 2017

  1. [SPARK-12837][CORE] Do not send the name of internal accumulator to e…

    …xecutor side
    
    ## What changes were proposed in this pull request?
    
    When sending accumulator updates back to the driver, the network overhead is pretty big as there are a lot of accumulators, e.g. `TaskMetrics` will send about 20 accumulators every time, and there may be a lot of `SQLMetric`s if the query plan is complicated.
    
    Therefore, it's critical to reduce the size of serialized accumulators. A simple way is to not send the names of internal accumulators to the executor side, as it's unnecessary. When an executor sends accumulator updates back to the driver, we can look up the accumulator name in `AccumulatorContext` easily. Note that we still need to send the names of normal accumulators, as user code run on the executor side may rely on accumulator names.
    
    In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization.
    
    Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of serialized accumulator has been cut down by about 40%.
    
    ## How was this patch tested?
    
    existing tests.
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17596 from cloud-fan/oom.
    cloud-fan authored and hvanhovell committed Apr 28, 2017
    Commit: b90bf52
  2. [SPARKR][DOC] Document LinearSVC in R programming guide

    ## What changes were proposed in this pull request?
    
    add link to svmLinear in the SparkR programming document.
    
    ## How was this patch tested?
    
    Build doc manually and click the link to the document. It looks good.
    
    Author: wangmiao1981 <[email protected]>
    
    Closes #17797 from wangmiao1981/doc.
    wangmiao1981 authored and Felix Cheung committed Apr 28, 2017
    Commit: 7fe8249
  3. [SPARK-20476][SQL] Block users to create a table that use commas in t…

    …he column names
    
    ### What changes were proposed in this pull request?
    ```SQL
    hive> create table t1(`a,` string);
    OK
    Time taken: 1.399 seconds
    
    hive> create table t2(`a,` string, b string);
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!)
    
    hive> create table t2(`a,` string, b string) stored as parquet;
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: ParquetHiveSerde initialization failed. Number of column name and column type differs. columnNames = [a, , b], columnTypes = [string, string]
    ```
    It has a bug in Hive metastore.
    
    When users do not provide alias name in the SELECT query, we call `toPrettySQL` to generate the alias name. For example, the string `get_json_object(jstring, '$.f1')` will be the alias name for the function call in the statement
    ```SQL
    SELECT key, get_json_object(jstring, '$.f1') FROM tempView
    ```
    The above is not an issue for SELECT query statements. However, for CTAS, we hit the issue due to a bug in the Hive metastore: it does not like column names containing commas and returns a confusing error message, like:
    ```
    17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
    org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
    ```
    
    Thus, this PR blocks users from creating a table in the Hive metastore when the table has a column whose name contains a comma.
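
    A hedged sketch of the CTAS pattern described above (table and view names hypothetical):

    ```scala
    // The auto-generated alias for the unaliased expression below,
    // "get_json_object(jstring, '$.f1')", contains a comma, so the CTAS into the
    // Hive metastore is now rejected with a clear error instead of the confusing
    // SerDe message.
    spark.sql(
      """CREATE TABLE ctas_test STORED AS PARQUET AS
        |SELECT key, get_json_object(jstring, '$.f1') FROM tempView""".stripMargin)
    ```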
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <[email protected]>
    
    Closes #17781 from gatorsmile/blockIllegalColumnNames.
    gatorsmile authored and cloud-fan committed Apr 28, 2017
    Commit: e3c8160
  4. [SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY

    ## What changes were proposed in this pull request?
    This pr added a new rule in `Analyzer` to resolve aliases in `GROUP BY`.
    The current master throws an exception if `GROUP BY` clauses have aliases in `SELECT`;
    ```
    scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1")
    org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51;
    'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L]
    +- SubqueryAlias t
       +- Project [id#80L AS a#83L]
          +- Range (0, 10, step=1, splits=Some(8))
    
      at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    ```
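
    With the new analyzer rule, a minimal sketch of the query shape that now resolves:

    ```scala
    // The alias a1 defined in the SELECT list can now be referenced in GROUP BY.
    spark.range(10).selectExpr("id AS a").createOrReplaceTempView("t")
    spark.sql("SELECT a AS a1, count(1) FROM t GROUP BY a1").show()
    ```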
    
    ## How was this patch tested?
    Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17191 from maropu/SPARK-14471.
    maropu authored and cloud-fan committed Apr 28, 2017
    Commit: 59e3a56
  5. [SPARK-20465][CORE] Throws a proper exception when any temp directory…

    … could not be got
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to throw an exception with a better message, rather than `ArrayIndexOutOfBoundsException`, when temp directories could not be created.
    
    Running the commands below:
    
    ```bash
    ./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
    ```
    
    produces ...
    
    **Before**
    
    ```
    Exception in thread "main" java.lang.ExceptionInInitializerError
            ...
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
            ...
    ```
    
    **After**
    
    ```
    Exception in thread "main" java.lang.ExceptionInInitializerError
            ...
    Caused by: java.io.IOException: Failed to get a temp directory under [/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO].
            ...
    ```
    
    ## How was this patch tested?
    
    Unit tests in `LocalDirsSuite.scala`.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17768 from HyukjinKwon/throws-temp-dir-exception.
    HyukjinKwon authored and srowen committed Apr 28, 2017
    Commit: 8c911ad
  6. [SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans

    ## What changes were proposed in this pull request?
    
    We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka.
    
    ## How was this patch tested?
    
    New unit test.
    
    Author: Bill Chambers <[email protected]>
    
    Closes #17804 from anabranch/SPARK-20496-2.
    Bill Chambers authored and brkyvz committed Apr 28, 2017
    Commit: 733b81b
  7. [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721

    Upgrade Jetty so it can work with Hadoop 3 (the alpha 2 release, in particular).
    Without this change, because of incompatibility between Jetty versions,
    Spark fails to compile when built against Hadoop 3.
    
    ## How was this patch tested?
    Unit tests being run.
    
    Author: Mark Grover <[email protected]>
    
    Closes #17790 from markgrover/spark-20514.
    markgrover authored and Marcelo Vanzin committed Apr 28, 2017
    Commit: 5d71f3d
  8. [SPARK-20471] Remove AggregateBenchmark testsuite warning: Two level …

    …hashmap is disabled but vectorized hashmap is enabled
    
    ## What changes were proposed in this pull request?
    
    Remove the AggregateBenchmark test suite warning,
    such as '14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.'
    
    ## How was this patch tested?
    Unit tests: AggregateBenchmark.
    Modified the 'ignore' functions to 'test' functions.
    
    Author: caoxuewen <[email protected]>
    
    Closes #17771 from heary-cao/AggregateBenchmark.
    heary-cao authored and gatorsmile committed Apr 28, 2017
    Commit: ebff519
  9. [SPARK-19525][CORE] Add RDD checkpoint compression support

    ## What changes were proposed in this pull request?
    
    This PR adds RDD checkpoint compression support and a new config `spark.checkpoint.compress` to enable/disable it. Credit goes to aramesh117.
    
    Closes #17024
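
    A minimal sketch of enabling it (app name and paths hypothetical):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-compress-demo")
      .config("spark.checkpoint.compress", "true")   // the new flag
      .getOrCreate()
    val sc = spark.sparkContext

    sc.setCheckpointDir("/tmp/checkpoints")            // hypothetical location
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    rdd.checkpoint()   // checkpoint files are now written compressed
    rdd.count()        // materializes the RDD and the checkpoint
    ```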
    
    ## How was this patch tested?
    
    The new unit test.
    
    Author: Shixiong Zhu <[email protected]>
    Author: Aaditya Ramesh <[email protected]>
    
    Closes #17789 from zsxwing/pr17024.
    Aaditya Ramesh authored and zsxwing committed Apr 28, 2017
    Commit: 77bcd77

Commits on Apr 29, 2017

  1. [SPARK-20487][SQL] Display serde for HiveTableScan node in explai…

    …ned plan
    
    ## What changes were proposed in this pull request?
    
    This was a suggestion by rxin at #17780 (comment)
    
    ## How was this patch tested?
    
    - modified existing unit test
    - manual testing:
    
    ```
    scala> hc.sql(" SELECT * FROM tejasp_bucketed_partitioned_1  where name = ''  ").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter ('name = )
       +- 'UnresolvedRelation `tejasp_bucketed_partitioned_1`
    
    == Analyzed Logical Plan ==
    user_id: bigint, name: string, ds: string
    Project [user_id#24L, name#25, ds#26]
    +- Filter (name#25 = )
       +- SubqueryAlias tejasp_bucketed_partitioned_1
          +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
    
    == Optimized Logical Plan ==
    Filter (isnotnull(name#25) && (name#25 = ))
    +- CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
    
    == Physical Plan ==
    *Filter (isnotnull(name#25) && (name#25 = ))
    +- HiveTableScan [user_id#24L, name#25, ds#26], CatalogRelation `default`.`tejasp_bucketed_partitioned_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [user_id#24L, name#25], [ds#26]
    ```
    
    Author: Tejas Patil <[email protected]>
    
    Closes #17806 from tejasapatil/add_serde.
    tejasapatil authored and gatorsmile committed Apr 29, 2017
    Commit: 814a61a
  2. [SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programm…

    …ing guide
    
    ## What changes were proposed in this pull request?
    
    Add a hyperlink in the SparkR programming guide.
    
    ## How was this patch tested?
    
    Build doc and manually check the doc link.
    
    Author: wangmiao1981 <[email protected]>
    
    Closes #17805 from wangmiao1981/doc.
    wangmiao1981 authored and Felix Cheung committed Apr 29, 2017
    Commit: b28c3bc
  3. [SPARK-19791][ML] Add doc and example for fpgrowth

    ## What changes were proposed in this pull request?
    
    Add a new section for fpm
    Add Example for FPGrowth in scala and Java
    
    updated: Rewrite transform to be more compact.
    
    ## How was this patch tested?
    
    local doc generation.
    
    Author: Yuhao Yang <[email protected]>
    
    Closes #17130 from hhbyyh/fpmdoc.
    YY-OnCall authored and Felix Cheung committed Apr 29, 2017
    Commit: add9d1b
  4. [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and val…

    …ue should be lazy
    
    ## What changes were proposed in this pull request?
    
    MultilayerPerceptronClassifierWrapper model should be private.
    LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy.
    
    ## How was this patch tested?
    
    Unit tests.
    
    Author: wangmiao1981 <[email protected]>
    
    Closes #17808 from wangmiao1981/lazy.
    wangmiao1981 authored and Felix Cheung committed Apr 29, 2017
    Commit: ee694cd
  5. [SPARK-20493][R] De-duplicate parse logics for DDL-like type strings …

    …in R
    
    ## What changes were proposed in this pull request?
    
    It seems we are using `SQLUtils.getSQLDataType` for the type string in structField. It looks like we can replace this with `CatalystSqlParser.parseDataType`.
    
    They handle similar DDL-like type definitions, as below:
    
    ```scala
    scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
    ```
    ```
    +---+
    | _1|
    +---+
    |[a]|
    +---+
    ```
    
    ```scala
    scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
    ```
    ```
    +---+
    | _1|
    +---+
    |[a]|
    +---+
    ```
    
    Such type strings look identical to R's, as below:
    
    ```R
    > write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
    > collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
      struct
    1      a
    ```
    
    R's is stricter because we check the types via regular expressions on the R side ahead of time.
    
    The actual logics look a bit different, but as we check the type string ahead on the R side, it looks like replacing it would not introduce (I think) any behaviour changes. To make sure of this, tests dedicated to it were added in SPARK-20105. (It looks like `structField` is the only place that calls this method.)
    
    ## How was this patch tested?
    
    Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17785 from HyukjinKwon/SPARK-20493.
    HyukjinKwon authored and Felix Cheung committed Apr 29, 2017
    Commit: 70f1bcd
  6. [SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in C…

    …olumn API in PySpark
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`. `contains`, `asc` and `desc` in `Column` API.
    
    Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
    
    Lastly, this PR suggests to use `spark` rather than `sc` in doc tests in `Column` for Python documentation.
    
    ## How was this patch tested?
    
    Doc tests were added and manually tested with the commands below:
    
    `./python/run-tests.py --module pyspark-sql`
    `./python/run-tests.py --module pyspark-sql --python-executable python3`
    `./dev/lint-python`
    
    Output was checked via `make html` under `./python/docs`. The snapshots will be left on the codes with comments.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17737 from HyukjinKwon/SPARK-20442.
    HyukjinKwon authored and holdenk committed Apr 29, 2017
    Commit: d228cd0

Commits on Apr 30, 2017

  1. [SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appData…

    …Ttl' should be 604800 in spark-standalone.md
    
    ## What changes were proposed in this pull request?
    
    Currently, our project needs the worker directory cleanup cycle set to three days.
    Following http://spark.apache.org/docs/latest/spark-standalone.html, I configured the 'spark.worker.cleanup.appDataTtl' parameter to 3 * 24 * 3600.
    When I start the Spark service, the startup fails, and the worker log displays the following error:
    
    2017-04-28 15:02:03,306 INFO Utils: Successfully started service 'sparkWorker' on port 48728.
    Exception in thread "main" java.lang.NumberFormatException: For input string: "3 * 24 * 3600"
    	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    	at java.lang.Long.parseLong(Long.java:430)
    	at java.lang.Long.parseLong(Long.java:483)
    	at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
    	at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
    	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
    	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
    	at scala.Option.map(Option.scala:146)
    	at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
    	at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:100)
    	at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:730)
    	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:709)
    	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    
    **Because 7 * 24 * 3600 is given as a string and force-converted to a Long, the program fails.**
    
    **So I think the documented default of this configuration should be a concrete long value, 604800, rather than 7 * 24 * 3600, because the expression misleads users into similar configurations and results in Spark failing to start.**
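
    As an illustration (not part of this patch), the property must be a plain number of seconds wherever it is set (spark-defaults.conf, SPARK_WORKER_OPTS, or a SparkConf); a minimal sketch:

    ```scala
    import org.apache.spark.SparkConf

    // Hypothetical sketch: the TTL has to be a literal long in seconds, e.g. 604800
    // (7 days) or 259200 (3 days); an expression such as "3 * 24 * 3600" cannot be
    // parsed by SparkConf.getLong and fails with a NumberFormatException.
    val conf = new SparkConf()
      .set("spark.worker.cleanup.enabled", "true")
      .set("spark.worker.cleanup.appDataTtl", "604800")
    ```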
    
    ## How was this patch tested?
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17798 from guoxiaolongzte/SPARK-20521.
    郭小龙 10207633 authored and srowen committed Apr 30, 2017
    4d99b95
  2. [SPARK-20492][SQL] Do not print empty parentheses for invalid primiti…

    …ve types in parser
    
    ## What changes were proposed in this pull request?
    
    Currently, when the type string is invalid, the error message prints empty parentheses after it. This PR proposes a small improvement to the error message by removing them in the parser, as below:
    
    ```scala
    spark.range(1).select($"col".cast("aa"))
    ```
    
    **Before**
    
    ```
    org.apache.spark.sql.catalyst.parser.ParseException:
    DataType aa() is not supported.(line 1, pos 0)
    
    == SQL ==
    aa
    ^^^
    ```
    
    **After**
    
    ```
    org.apache.spark.sql.catalyst.parser.ParseException:
    DataType aa is not supported.(line 1, pos 0)
    
    == SQL ==
    aa
    ^^^
    ```
    
    ## How was this patch tested?
    
    Unit tests in `DataTypeParserSuite`.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17784 from HyukjinKwon/SPARK-20492.
    HyukjinKwon authored and hvanhovell committed Apr 30, 2017
    1ee494d
  3. [SPARK-20535][SPARKR] R wrappers for explode_outer and posexplode_outer

    ## What changes were proposed in this pull request?
    
    Add R wrappers for
    
    - `o.a.s.sql.functions.explode_outer`
    - `o.a.s.sql.functions.posexplode_outer`
    
    ## How was this patch tested?
    
    Additional unit tests, manual testing.
    
    Author: zero323 <[email protected]>
    
    Closes #17809 from zero323/SPARK-20535.
    zero323 authored and Felix Cheung committed Apr 30, 2017
    ae3df4e

Commits on May 1, 2017

  1. [MINOR][DOCS][PYTHON] Adding missing boolean type for replacement val…

    …ue in fillna
    
    ## What changes were proposed in this pull request?
    
    Currently the PySpark DataFrame.fillna API supports boolean values when a dict is passed, but this is missing from the documentation.
    
    ## How was this patch tested?
    >>> spark.createDataFrame([Row(a=True),Row(a=None)]).fillna({"a" : True}).show()
    +----+
    |   a|
    +----+
    |true|
    |true|
    +----+
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.
    
    Author: Srinivasa Reddy Vundela <[email protected]>
    
    Closes #17688 from vundela/fillna_doc_fix.
    Srinivasa Reddy Vundela authored and Felix Cheung committed May 1, 2017
    6613046
  2. [SPARK-20490][SPARKR] Add R wrappers for eqNullSafe and ! / not

    ## What changes were proposed in this pull request?
    
    - Add null-safe equality operator `%<=>%` (same as `o.a.s.sql.Column.eqNullSafe`, `o.a.s.sql.Column.<=>`)
    - Add boolean negation operator `!` and function `not`.
    
    ## How was this patch tested?
    
    Existing unit tests, additional unit tests, `check-cran.sh`.
    
    Author: zero323 <[email protected]>
    
    Closes #17783 from zero323/SPARK-20490.
    zero323 authored and Felix Cheung committed May 1, 2017
    80e9cf1
  3. [SPARK-20541][SPARKR][SS] support awaitTermination without timeout

    ## What changes were proposed in this pull request?
    
    Add a variant of `awaitTermination` without the timeout parameter; this is needed to submit a job that runs until it is stopped.
    Needed for 2.2.
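
    For reference, a minimal sketch of the two call forms on the Scala side that the R API now mirrors (assumes `df` is a streaming DataFrame):

    ```scala
    val query = df.writeStream.format("console").start()

    // With a timeout: returns after 10 seconds or when the query stops,
    // whichever comes first.
    query.awaitTermination(10000)

    // Without a timeout: blocks until the query is stopped or fails.
    query.awaitTermination()
    ```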
    
    ## How was this patch tested?
    
    manually, unit test
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17815 from felixcheung/rssawaitinfinite.
    felixcheung authored and Felix Cheung committed May 1, 2017
    a355b66
  4. [SPARK-20290][MINOR][PYTHON][SQL] Add PySpark wrapper for eqNullSafe

    ## What changes were proposed in this pull request?
    
    Adds Python bindings for `Column.eqNullSafe`
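
    A brief sketch of the null-safe equality semantics being exposed, shown through the Scala `Column` API (assumes `spark.implicits._` is in scope):

    ```scala
    val df = Seq[(Option[Int], Option[Int])]((Some(1), Some(1)), (Some(1), None), (None, None))
      .toDF("a", "b")

    // Regular equality yields NULL when either side is NULL; eqNullSafe (<=>)
    // treats two NULLs as equal and always returns true or false.
    df.select($"a" === $"b", $"a" <=> $"b").show()
    ```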
    
    ## How was this patch tested?
    
    Manual tests, existing unit tests, doc build.
    
    Author: zero323 <[email protected]>
    
    Closes #17605 from zero323/SPARK-20290.
    zero323 authored and gatorsmile committed May 1, 2017
    f0169a1
  5. [SPARK-20534][SQL] Make outer generate exec return empty rows

    ## What changes were proposed in this pull request?
    Generate exec does not produce `null` values if the generator for the input row is empty and the generate operates in outer mode without join. This is caused by the fact that the `join=false` code path is different from the `join=true` code path, and that the `join=false` code path did not deal with outer properly. This PR addresses this issue.
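
    A minimal sketch of the fixed behaviour (assumes a `SparkSession` named `spark`):

    ```scala
    import spark.implicits._
    import org.apache.spark.sql.functions.explode_outer

    val df = Seq(Seq(1, 2), Seq.empty[Int]).toDF("xs")

    // Before the fix, the row holding the empty array was dropped when the
    // generator ran in outer mode without join; now it yields a row with NULL.
    df.select(explode_outer($"xs")).show()
    ```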
    
    ## How was this patch tested?
    Updated `outer*` tests in `GeneratorFunctionSuite`.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17810 from hvanhovell/SPARK-20534.
    hvanhovell authored and gatorsmile committed May 1, 2017
    6b44c4d
  6. [SPARK-20517][UI] Fix broken history UI download link

    The download link in the history server UI is generated from the following template:
    
    ```
     <td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td>
    ```
    
    Here the `num` field represents the number of attempts, which does not match the REST API. In the REST API, if the attempt id does not exist the URL should be `api/v1/applications/<id>/logs`; otherwise it should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` leads to a "no such app" error.
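
    A hypothetical sketch (names invented for illustration) of the intended URL construction:

    ```scala
    // Use the attempt id when the application has one; fall back to the
    // attempt-less form otherwise. Substituting the number of attempts here is
    // what produced the "no such app" error.
    def logsUrl(uiRoot: String, appId: String, attemptId: Option[String]): String =
      attemptId match {
        case Some(attempt) => s"$uiRoot/api/v1/applications/$appId/$attempt/logs"
        case None          => s"$uiRoot/api/v1/applications/$appId/logs"
      }
    ```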
    
    Manual verification.
    
    CC ajbozarth, can you please review this change, since you added this feature before? Thanks!
    
    Author: jerryshao <[email protected]>
    
    Closes #17795 from jerryshao/SPARK-20517.
    jerryshao authored and Marcelo Vanzin committed May 1, 2017
    ab30590
  7. [SPARK-20464][SS] Add a job group and description for streaming queri…

    …es and fix cancellation of running jobs using the job group
    
    ## What changes were proposed in this pull request?
    
    Job group: adding a job group is required to properly cancel running jobs related to a query.
    Description: the new description makes it easier to group the batches of a query by sorting by name in the Spark Jobs UI.
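
    A minimal sketch (not the patch itself) of how a job group enables cancellation; `sparkContext` is the query's SparkContext and the group id shown is hypothetical:

    ```scala
    // Tag every job submitted for this query run with a group id and a readable
    // description; the description is what appears in the Spark Jobs UI.
    sparkContext.setJobGroup("streaming-query-run-id", "queryName: myQuery, batch: 42",
      interruptOnCancel = true)

    // ... jobs triggered while processing the batch run under this group ...

    // Stopping the query can now cancel every running job in the group at once.
    sparkContext.cancelJobGroup("streaming-query-run-id")
    ```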
    
    ## How was this patch tested?
    
    - Unit tests
    - UI screenshot
    
      - Order by job id:
    ![screen shot 2017-04-27 at 5 10 09 pm](https://cloud.githubusercontent.com/assets/7865120/25509468/15452274-2b6e-11e7-87ba-d929816688cf.png)
    
      - Order by description:
    ![screen shot 2017-04-27 at 5 10 22 pm](https://cloud.githubusercontent.com/assets/7865120/25509474/1c298512-2b6e-11e7-99b8-fef1ef7665c1.png)
    
      - Order by job id (no query name):
    ![screen shot 2017-04-27 at 5 21 33 pm](https://cloud.githubusercontent.com/assets/7865120/25509482/28c96dc8-2b6e-11e7-8df0-9d3cdbb05e36.png)
    
      - Order by description (no query name):
    ![screen shot 2017-04-27 at 5 21 44 pm](https://cloud.githubusercontent.com/assets/7865120/25509489/37674742-2b6e-11e7-9357-b5c38ec16ac4.png)
    
    Author: Kunal Khamar <[email protected]>
    
    Closes #17765 from kunalkhamar/sc-6696.
    kunalkhamar authored and zsxwing committed May 1, 2017
    6fc6cf8
  8. [SPARK-20540][CORE] Fix unstable executor requests.

    There are two problems fixed in this commit. First, the
    ExecutorAllocationManager sets a timeout to avoid requesting executors
    too often. However, the timeout is always updated based on its value and
    a timeout, not the current time. If the call is delayed by locking for
    more than the ongoing scheduler timeout, the manager will request more
    executors on every run. This seems to be the main cause of SPARK-20540.
    
    The second problem is that the total number of requested executors is
    not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates
    the value based on the current status of 3 variables: the number of
    known executors, the number of executors that have been killed, and the
    number of pending executors. But, the number of pending executors is
    never less than 0, even though there may be more known than requested.
    When executors are killed and not replaced, this can cause the request
    sent to YARN to be incorrect because there were too many executors due
    to the scheduler's state being slightly out of date. This is fixed by tracking
    the currently requested size explicitly.
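
    An illustrative sketch of the accounting problem (names are invented, not the actual fields):

    ```scala
    // Hypothetical reconstruction: the old target was derived from three counters.
    // Because the pending count never drops below zero, the derived total cannot
    // express "fewer executors than currently known", so after kills the request
    // sent to YARN could be too high while the scheduler state was out of date.
    def derivedTarget(knownExecutors: Int, pendingExecutors: Int, killedExecutors: Int): Int =
      knownExecutors + math.max(pendingExecutors, 0) - killedExecutors

    // The fix tracks the requested total explicitly instead of re-deriving it.
    ```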
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Ryan Blue <[email protected]>
    
    Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation.
    rdblue authored and Marcelo Vanzin committed May 1, 2017
    2b2dd08