SPARK-1166: clean vpc_id if the group was just now created #59

Closed
wants to merge 1 commit

Conversation

CodingCat
Contributor

Reported in https://spark-project.atlassian.net/browse/SPARK-1166

In some rare situations (when the newly created master_group and slave_group have a valid vpc_id), the user receives the following error when running the spark-ec2 script:


Setting up security groups...
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 813, in <module>
    main()
  File "./spark_ec2.py", line 806, in main
    real_main()
  File "./spark_ec2.py", line 689, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 244, in launch_cluster
    slave_group.authorize(src_group=master_group)
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
  File "/Users/nanzhu/code/spark/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null' for protocol. VPC security group rules must specify protocols explicitly.</Message></Error></Errors><RequestID>fc56f0ba-915a-45b6-8555-05d4dd0f14ee</RequestID></Response>

The related code in boto is shown below; with a valid vpc_id, boto expects the protocol type to be passed explicitly:

# From boto's SecurityGroup.authorize (boto/ec2/securitygroup.py).
# When vpc_id is set, boto passes group IDs instead of group names,
# and the EC2 VPC API then requires ip_protocol to be set explicitly.
group_name = None
if not self.vpc_id:
    group_name = self.name
group_id = None
if self.vpc_id:
    group_id = self.id
src_group_name = None
src_group_owner_id = None
src_group_group_id = None
if src_group:
    cidr_ip = None
    src_group_owner_id = src_group.owner_id
    if not self.vpc_id:
        src_group_name = src_group.name
    else:
        if hasattr(src_group, 'group_id'):
            src_group_group_id = src_group.group_id
        else:
            src_group_group_id = src_group.id
status = self.connection.authorize_security_group(group_name,
                                                  src_group_name,
                                                  src_group_owner_id,
                                                  ip_protocol,
                                                  from_port,
                                                  to_port,
                                                  cidr_ip,
                                                  group_id,
                                                  src_group_group_id)

So if we have just created a new cluster, we should clear the vpc_id for the user.
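
A minimal sketch of the proposed change (hedged: not the exact patch; `get_or_make_group` and the empty-rules check for a brand-new group follow spark_ec2.py's existing conventions):

```python
# Sketch of the fix in launch_cluster(). spark_ec2.py already treats an
# empty rule list as "group was just now created"; clearing vpc_id there
# keeps boto on the classic (non-VPC) authorize path, which does not
# demand an explicit protocol.
master_group = get_or_make_group(conn, cluster_name + "-master")
slave_group = get_or_make_group(conn, cluster_name + "-slaves")
if master_group.rules == []:  # group was just now created
    master_group.vpc_id = None
    master_group.authorize(src_group=master_group)
    master_group.authorize(src_group=slave_group)
if slave_group.rules == []:
    slave_group.vpc_id = None
    slave_group.authorize(src_group=master_group)
    slave_group.authorize(src_group=slave_group)
```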

@CodingCat closed this Mar 3, 2014
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Bump up logging level to warning for failed tasks.

(cherry picked from commit 3249e0e)
Signed-off-by: Reynold Xin <[email protected]>
JasonMWhite pushed a commit to JasonMWhite/spark that referenced this pull request Dec 2, 2015
robert3005 added a commit to robert3005/spark that referenced this pull request Jan 12, 2017
marcosdotps pushed a commit to marcosdotps/spark that referenced this pull request Sep 14, 2017
gcz2022 pushed a commit to gcz2022/spark that referenced this pull request Jul 30, 2018
luzhonghao pushed a commit to luzhonghao/spark that referenced this pull request Dec 11, 2018
hejian991 pushed a commit to growingio/spark that referenced this pull request Jun 24, 2019
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
hn5092 pushed a commit to hn5092/spark that referenced this pull request Nov 4, 2019
cloud-fan pushed a commit that referenced this pull request May 29, 2020
### What changes were proposed in this pull request?

1. Make more expressions extend `NullIntolerant`.
2. Add a checker (in `ExpressionInfoSuite`) to identify whether an expression is `NullIntolerant`.

### Why are the changes needed?

Avoids skew joins when the join column has many null values, which can improve query performance. For example:
```sql
CREATE TABLE t1(c1 string, c2 string) USING parquet;
CREATE TABLE t2(c1 string, c2 string) USING parquet;
EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON upper(t1.c1) = upper(t2.c1);
```

Before and after this PR:
```sql
== Physical Plan ==
*(2) Project [c1#5, c2#6]
+- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#41]
   :  +- *(1) ColumnarToRow
   :     +- FileScan parquet default.t1[c1#5,c2#6]
   +- *(2) ColumnarToRow
      +- FileScan parquet default.t2[c1#7]

== Physical Plan ==
*(2) Project [c1#5, c2#6]
+- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildRight
   :- *(2) Project [c1#5, c2#6]
   :  +- *(2) Filter isnotnull(c1#5)
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.t1[c1#5,c2#6]
   +- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#59]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.t2[c1#7]

```
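
To illustrate the `NullIntolerant` idea with a toy model (hypothetical classes, not Spark's Catalyst API): if every expression between the join key and a column is null-intolerant, an `isnotnull` filter on that column can be inferred, which is what adds the extra `Filter isnotnull(...)` nodes in the second plan.

```python
class Expr:
    def __init__(self, *children):
        self.children = children

class NullIntolerant:
    """Marker: the expression evaluates to null whenever any child is null."""

class Col(Expr):
    def __init__(self, name):
        super().__init__()
        self.name = name

class Upper(Expr, NullIntolerant):
    pass

def infer_not_null(expr):
    """Columns on which isnotnull() can safely be inferred."""
    if isinstance(expr, Col):
        return [expr.name]
    if isinstance(expr, NullIntolerant):
        return [c for child in expr.children for c in infer_not_null(child)]
    return []  # stop at null-tolerant expressions such as coalesce

assert infer_not_null(Upper(Col("c1"))) == ["c1"]
```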

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #28626 from wangyum/SPARK-28481.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
XinDongSh pushed a commit to XinDongSh/spark that referenced this pull request Jan 18, 2021
dongjoon-hyun pushed a commit that referenced this pull request Aug 3, 2021
…query

### What changes were proposed in this pull request?

Remove redundant aliases after `RewritePredicateSubquery`. For example:
```scala
sql("CREATE TABLE t1 USING parquet AS SELECT id AS a, id AS b, id AS c FROM range(10)")
sql("CREATE TABLE t2 USING parquet AS SELECT id AS x, id AS y FROM range(8)")
sql(
  """
    |SELECT *
    |FROM  t1
    |WHERE  a IN (SELECT x
    |  FROM  (SELECT x AS x,
    |           Rank() OVER (partition BY x ORDER BY Sum(y) DESC) AS ranking
    |    FROM   t2
    |    GROUP  BY x) tmp1
    |  WHERE  ranking <= 5)
    |""".stripMargin).explain
```
Before this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#7L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#68]
      +- Project [x#7L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#62]
                     +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                        +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                           +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                              +- FileScan parquet default.t2[x#15L,y#16L]
```

After this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#15L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#67]
      +- Project [x#15L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                     +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                        +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                           +- FileScan parquet default.t2[x#15L,y#16L]
```
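
A toy version of the cleanup (hypothetical helper, not Spark's actual rule): dropping a self-alias such as `x AS x` lets the subquery expose the original attribute, so the join key and the aggregate output share one identity and the extra exchange in the first plan disappears.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:
    name: str

@dataclass(frozen=True)
class Alias:
    child: Attr
    name: str

def remove_redundant_aliases(project_list):
    """Replace a self-alias like `x AS x` with the attribute itself, so
    downstream operators keep the original identity instead of a new one."""
    return [p.child if isinstance(p, Alias) and p.name == p.child.name else p
            for p in project_list]

assert remove_redundant_aliases([Alias(Attr("x"), "x"), Attr("y")]) == [Attr("x"), Attr("y")]
```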

### Why are the changes needed?

Reduces shuffles to improve query performance. This change benefits TPC-DS q70.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33509 from wangyum/SPARK-36280.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
wangyum added a commit that referenced this pull request May 26, 2023
(cherry picked from commit 91148f4)
wangyum pushed a commit that referenced this pull request May 26, 2023
…59)

* [CARMEL-3580] Backport [CARMEL-1603]ViewPoint - RunTime Query Glance

* [CARMEL-3580] Backport [CARMEL-1603] add username
yaooqinn pushed a commit that referenced this pull request Dec 20, 2023
…HAVING

### What changes were proposed in this pull request?

This PR enhances the analyzer to handle the following pattern properly:

```
Sort
 - Filter
   - Aggregate
```

### Why are the changes needed?

```
spark-sql (default)> CREATE TABLE t1 (flag BOOLEAN, dt STRING);

spark-sql (default)>   SELECT LENGTH(dt),
                   >          COUNT(t1.flag)
                   >     FROM t1
                   > GROUP BY LENGTH(dt)
                   >   HAVING COUNT(t1.flag) > 1
                   > ORDER BY LENGTH(dt);
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `dt` cannot be resolved. Did you mean one of the following? [`length(dt)`, `count(flag)`].; line 6 pos 16;
'Sort ['LENGTH('dt) ASC NULLS FIRST], true
+- Filter (count(flag)#60L > cast(1 as bigint))
   +- Aggregate [length(dt#9)], [length(dt#9) AS length(dt)#59, count(flag#8) AS count(flag)#60L]
      +- SubqueryAlias spark_catalog.default.t1
         +- Relation spark_catalog.default.t1[flag#8,dt#9] parquet
```

The above demonstrates the failure case: the query fails during the analysis phase when both `HAVING` and `ORDER BY` clauses are present, but succeeds if only one is present.
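
As a toy shape check (hypothetical plan classes, not Spark's analyzer code), the pattern the analyzer now recognizes looks like this; once matched, `ORDER BY` expressions are resolved against the `Aggregate` beneath the HAVING `Filter`:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Aggregate:
    grouping: list
    aggregates: list

@dataclass
class Filter:          # the HAVING clause
    condition: str
    child: Any

@dataclass
class Sort:            # the ORDER BY clause
    order: list
    child: Any

def is_sort_over_having(plan) -> bool:
    """True for the Sort -> Filter -> Aggregate shape from this PR."""
    return (isinstance(plan, Sort)
            and isinstance(plan.child, Filter)
            and isinstance(plan.child.child, Aggregate))
```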

### Does this PR introduce _any_ user-facing change?

Yes, maybe we can call it a bugfix.

### How was this patch tested?

New UTs are added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44352 from pan3793/SPARK-28386.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
panbingkun pushed a commit that referenced this pull request Nov 22, 2024
…ead pool

### What changes were proposed in this pull request?

This PR aims to use a meaningful class name prefix for the REST Submission API thread pool instead of the default value of Jetty QueuedThreadPool, `"qtp"+super.hashCode()`.

https://github.com/dekellum/jetty/blob/3dc0120d573816de7d6a83e2d6a97035288bdd4a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L64

### Why are the changes needed?

This is helpful during JVM investigation.
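
The same debuggability trick, shown in miniature with Python's standard library (an analogy only; the actual change names Jetty's `QueuedThreadPool` for the REST server):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A named pool makes stack dumps readable: workers show up with a
# meaningful prefix instead of an anonymous default.
pool = ThreadPoolExecutor(max_workers=4,
                          thread_name_prefix="StandaloneRestServer")
pool.submit(lambda: print(threading.current_thread().name)).result()
# prints e.g. StandaloneRestServer_0
pool.shutdown()
```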

**BEFORE (4.0.0-preview2)**

```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28217 | grep qtp
"qtp1925630411-52" #52 daemon prio=5 os_prio=31 cpu=0.07ms elapsed=19.06s tid=0x0000000134906c10 nid=0xde03 runnable  [0x0000000314592000]
"qtp1925630411-53" #53 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134ac6810 nid=0xc603 runnable  [0x000000031479e000]
"qtp1925630411-54" #54 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x000000013491ae10 nid=0xdc03 runnable  [0x00000003149aa000]
"qtp1925630411-55" #55 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=19.06s tid=0x0000000134ac9810 nid=0xc803 runnable  [0x0000000314bb6000]
"qtp1925630411-56" #56 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134ac9e10 nid=0xda03 runnable  [0x0000000314dc2000]
"qtp1925630411-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134aca410 nid=0xca03 runnable  [0x0000000314fce000]
"qtp1925630411-58" #58 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134acaa10 nid=0xcb03 runnable  [0x00000003151da000]
"qtp1925630411-59" #59 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x0000000134acb010 nid=0xcc03 runnable  [0x00000003153e6000]
"qtp1925630411-60-acceptor-0108e9815-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.11ms elapsed=19.06s tid=0x00000001317ffa10 nid=0xcd03 runnable  [0x00000003155f2000]
"qtp1925630411-61-acceptor-11d90f2aa-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.10ms elapsed=19.06s tid=0x00000001314ed610 nid=0xcf03 waiting on condition  [0x00000003157fe000]
```

**AFTER**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28317 | grep StandaloneRestServer
"StandaloneRestServer-52" #52 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284a8e10 nid=0xdb03 runnable  [0x000000032cfce000]
"StandaloneRestServer-53" #53 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284acc10 nid=0xda03 runnable  [0x000000032d1da000]
"StandaloneRestServer-54" #54 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284ae610 nid=0xd803 runnable  [0x000000032d3e6000]
"StandaloneRestServer-55" #55 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284aec10 nid=0xd703 runnable  [0x000000032d5f2000]
"StandaloneRestServer-56" #56 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284af210 nid=0xc803 runnable  [0x000000032d7fe000]
"StandaloneRestServer-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284af810 nid=0xc903 runnable  [0x000000032da0a000]
"StandaloneRestServer-58" #58 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284afe10 nid=0xcb03 runnable  [0x000000032dc16000]
"StandaloneRestServer-59" #59 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284b0410 nid=0xcc03 runnable  [0x000000032de22000]
"StandaloneRestServer-60-acceptor-04aefbaa8-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.13ms elapsed=60.05s tid=0x000000015cda1a10 nid=0xcd03 runnable  [0x000000032e02e000]
"StandaloneRestServer-61-acceptor-148976251-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.12ms elapsed=60.05s tid=0x000000015cd1c810 nid=0xce03 waiting on condition  [0x000000032e23a000]
```

### Does this PR introduce _any_ user-facing change?

No; the thread names are only visible during debugging.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48924 from dongjoon-hyun/SPARK-50385.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: panbingkun <[email protected]>