A Collection of Assorted Spark Problems [Continuously Updated]


    1. Initial job has not accepted any resources

    16/08/13 17:05:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
    16/08/13 17:05:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

    This message tells us that the initial job could not acquire any resources. Spark looks for only two kinds of resources: cores and memory. So when this message appears, one of those two resources must be insufficient. Open the Spark UI and take a look:

    From the UI you can see that all the cores are already taken, i.e. some other job (possibly a running spark-shell) is holding those resources, which is why the warning above appears.
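
    If the cluster really is tight on resources, another option is to cap what the job asks for so it can still be scheduled. The sketch below is a minimal illustration, assuming standalone mode; the app name and numbers are placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: explicitly cap the cores and memory this job requests,
    // so it can still be scheduled while the cluster is partly busy.
    val conf = new SparkConf()
      .setAppName("resource-capped-job")
      .set("spark.cores.max", "2")          // total cores across the cluster (standalone mode)
      .set("spark.executor.memory", "512m") // memory per executor

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()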

    Reference: http://www.datastax.com/dev/blog/common-spark-troubleshooting


    2. Exception in thread "main" java.lang.ClassNotFoundException

    Exception in thread "main" java.lang.ClassNotFoundException: Main at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:174) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:56) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)

    You will often see this exception when running spark-submit, and it can have many causes. For example, the class may genuinely be missing from your JAR. Note how this differs from other "class not found" errors: here the class that cannot be found is the main class, not a referenced class. When a referenced class is missing, the likely culprit is the classpath or a missing dependency library; here the entry point itself cannot be located. At first this puzzled me: within the same JAR, some main classes were found and others were not. It turned out that once I declared a package for a main class it could no longer be found, but after moving it to the root of the source tree it was found again. So if your main class cannot be found, try moving it to the root of the source tree. At least that is what solved it in my case; everyone's situation is different, so good luck!

    Solution: put the main class at the root of the source tree, i.e. directly under src.
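
    For reference, the usual alternative to moving the class is to keep the package and pass the fully qualified name to spark-submit's --class option. A sketch, using a made-up package com.example purely for illustration:

    // Minimal sketch, assuming a hypothetical package com.example.
    // If the main class lives in a package, spark-submit must be given the
    // fully qualified name, e.g.:
    //   spark-submit --class com.example.Main your-app.jar
    // Passing just "Main" would reproduce the ClassNotFoundException above.
    package com.example

    import org.apache.spark.{SparkConf, SparkContext}

    object Main {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("classpath-check"))
        println(sc.parallelize(1 to 10).count())
        sc.stop()
      }
    }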

    3. When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment

    hadoop@master:~$ ./shell/spark-submit.sh
    16/09/03 10:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/09/03 10:35:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted /jar/edu-cloud-assembly-1.0.jar
    16/09/03 10:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:251)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:228)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

    Solution: edit the $SPARK_HOME/conf/spark-env.sh file

    hadoop@master:~$ vi spark-1.6.0-bin-hadoop2.4/conf/spark-env.sh

    and add the following line:

    HADOOP_CONF_DIR=/home/hadoop/hadoop-2.4.0/etc/hadoop/

    Then push the updated file to every node in the cluster.

    4. awaitResult Exception

    Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult

    Cause: the query was waiting on a broadcast (typically a broadcast join) that did not finish within the broadcast timeout, so the driver's awaitResult call threw.

    Solution: increase the timeout, which defaults to 300s, for example:

    spark.conf.set("spark.sql.broadcastTimeout", 1200)

    5. Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true

    18/01/09 20:25:33 INFO FileSourceStrategy: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
    Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;
        at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doPrepare(BroadcastNestedLoopJoinExec.scala:345)
        at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:199)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:134)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
        at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:323)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2197)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2197)
        at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2197)
        at org.apache.spark.sql.Dataset.collect(Dataset.scala:2173)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    18/01/09 20:25:34 INFO SparkContext: Invoking stop() from shutdown hook

    Solution:

    set spark.sql.crossJoin.enabled = true;
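
    The same flag can be set from code instead of SQL. A sketch assuming a Spark 2.x SparkSession; note that enabling it only removes the safety check, the cross join itself is still expensive:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cross-join-example")
      // Allow the planner to produce cartesian products instead of failing.
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()

    // On Spark 2.1+, an explicit Dataset.crossJoin should also work without the flag:
    // val result = left.crossJoin(right)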