    Deploying a Hadoop & Spark Cluster on OpenStack (Mitaka) with Sahara


    Contents
    1. Preparation
        1.1 Create the project and user for the cluster
        1.2 Prepare the Cloudera plugin image
        1.3 Create the cluster network and flavor
    2. Deploy a CDH 5.5 cluster with Sahara
        2.1 Register the image with Sahara
        2.2 Create node group templates
        2.3 Create the cluster template
        2.4 Create the cluster
    3. Elastic Data Processing (EDP)
    4. Problems encountered
        4.1 Sahara cluster cannot be deleted
        4.2 RAM quota exceeded
        4.3 floating_ip quota exceeded
        4.4 Sahara cluster creation fails with status Error
        4.5 Security group rule quota exceeded
        4.6 Floating IPs of the instances cannot be pinged
        4.7 Cluster creation fails: Heat stack resource creation timed out
        4.8 Controller node runs out of disk space
        4.9 Sahara cluster fails to start the Hadoop services
        4.10 CDH/Spark services do not start automatically after a cluster host reboots
    References

    1. Preparation

    1.1 Create the project and user for the cluster

    Create the project lab and the user lab, and grant the user the ordinary user role:

    source ~/.openstack/.admin-openrc
    openstack project create --domain default --description "Lab Project" lab
    openstack user create --domain default --password-prompt lab
    openstack role add --project lab --user lab user

    Note: the project lab and user lab can also be created graphically by logging in to the Dashboard as admin.

    Set the environment variables for the OpenStack user lab.  ① Create the environment file for the lab user: vi ~/.openstack/.lab-openrc

    # Add environment variables for lab
    export OS_PROJECT_DOMAIN_NAME=default
    export OS_USER_DOMAIN_NAME=default
    export OS_PROJECT_NAME=lab
    export OS_USERNAME=lab
    # To avoid security problems, remove the OS_PASSWORD variable
    # and use the --password parameter with OpenStack client commands instead
    export OS_PASSWORD=lab@a112
    export OS_AUTH_URL=http://controller:5000/v3
    export OS_AUTH_TYPE=password
    export OS_IDENTITY_API_VERSION=3
    export OS_IMAGE_API_VERSION=2

    ② Apply the environment variables:

    source ~/.openstack/.lab-openrc

    1.2 Prepare the Cloudera plugin image

    List the Sahara plugins that are available:

    source ~/.openstack/.lab-openrc
    openstack dataprocessing plugin list

    Note: this article uses cdh 5.5.0.

    Download the image for that plugin version:

    wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2

    Create the image:

    source ~/.openstack/.lab-openrc
    openstack image create "sahara-mitaka-cloudera-5.5.0-ubuntu" --file sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2 --disk-format qcow2 --container-format bare

    List the images:

    openstack image list

    1.3 Create the cluster network and flavor

    Create the network, subnet, and router used by the cluster:

    source ~/.openstack/.lab-openrc
    neutron net-create selfservice-sahara-cluster
    neutron subnet-create --name selfservice-sahara-cluster --dns-nameserver 8.8.4.4 --gateway 172.16.100.1 selfservice-sahara-cluster 172.16.100.0/24
    neutron router-create router
    neutron router-interface-add router selfservice-sahara-cluster
    neutron router-gateway-set router provider
    openstack network list
    openstack subnet list
    neutron router-list

    Create the flavor used by the cluster nodes:

    source ~/.openstack/.admin-openrc
    openstack flavor create --vcpus 4 --ram 8192 --disk 20 sahara-flavor
    openstack flavor list

    2. Deploy a CDH 5.5 cluster with Sahara

    2.1 Register the image with Sahara

    Get the ID of the image sahara-mitaka-cloudera-5.5.0-ubuntu:

    source ~/.openstack/.lab-openrc
    export IMAGE_ID=$(openstack image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $2 }')

    Determine the username and tags for the image sahara-mitaka-cloudera-5.5.0-ubuntu (see http://docs.openstack.org/developer/sahara/userdoc/cdh_plugin.html):  ① username: ubuntu  ② tags: cdh, 5.5.0

    Register the image:

    openstack dataprocessing image register $IMAGE_ID --username ubuntu

    Add the tags:

    openstack dataprocessing image tags add $IMAGE_ID --tags cdh 5.5.0
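    The registration and tags can be verified by listing the images known to Sahara (a quick check using the same command that appears later in this article):

    openstack dataprocessing image list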

    2.2 Create node group templates

    Gather the basic information  ① Flavor ID: 8d824f5a-a829-42ad-9878-f38318cc9821

    openstack flavor list | awk '/ sahara-flavor / { print $2 }'

    ② Floating IP pool ID: 20b2a466-cd25-4b9a-9194-2b8005a8b547

    openstack ip floating pool list
    openstack network list | awk '/ provider / { print $2 }'

    Create the cdh-550-default-namenode node group template  ① Create a file namenode.json with the following content:

    { "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_NAMENODE", "YARN_RESOURCEMANAGER", "HIVE_SERVER2", "HIVE_METASTORE", "CLOUDERA_MANAGER" ], "name": "cdh-550-default-namenode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}

    ② Create the node group template:

    openstack dataprocessing node group template create --json namenode.json

    Note: the template can also be created directly from the command line, for example:

    openstack dataprocessing node group template create --name cdh-550-default-namenode --plugin <plugin_name> --plugin-version <plugin_version> --processes HDFS_NAMENODE YARN_RESOURCEMANAGER HIVE_SERVER2 HIVE_METASTORE CLOUDERA_MANAGER --flavor <flavor-id> --floating-ip-pool <pool-id> --auto-security-group

    Create the cdh-550-default-secondary-namenode node group template  ① Create a file secondary-namenode.json with the following content:

    { "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_SECONDARYNAMENODE", "OOZIE_SERVER", "YARN_JOBHISTORY", "SPARK_YARN_HISTORY_SERVER" ], "name": "cdh-550-default-secondary-namenode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}

    ② Create the node group template:

    openstack dataprocessing node group template create --json secondary-namenode.json

    Create the cdh-550-default-datanode node group template  ① Create a file datanode.json with the following content:

    { "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_DATANODE", "YARN_NODEMANAGER" ], "name": "cdh-550-default-datanode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}

    ② Create the node group template:

    openstack dataprocessing node group template create --json datanode.json

    2.3 Create the cluster template

    Get the node group template IDs  ① List the node group templates:

    openstack dataprocessing node group template list

    ② Note the ID of each node group template:

    Node group template                  ID
    cdh-550-default-namenode             f8eb08e6-80d5-4409-af7e-13009e694603
    cdh-550-default-secondary-namenode   a4ebb4d5-67b4-41f2-969a-2ac6db4f892f
    cdh-550-default-datanode             c80540fe-98b7-4dc8-9e94-0cd93c23c0f7
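    For scripting, the IDs can also be captured into shell variables with awk, following the same pattern used for the image ID above (a sketch; the column index assumes the template name is the first column of the listing, as in this deployment's other list commands):

    export NAMENODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-namenode / { print $4 }')
    export SECONDARY_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-secondary-namenode / { print $4 }')
    export DATANODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-datanode / { print $4 }')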

    Create the cluster template cdh-550-default-cluster  ① Create a file cluster.json with the following content:

    { "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_groups": [ { "name": "datanode", "count": 8, "node_group_template_id": "c80540fe-98b7-4dc8-9e94-0cd93c23c0f7" }, { "name": "secondary-namenode", "count": 1, "node_group_template_id": "a4ebb4d5-67b4-41f2-969a-2ac6db4f892f" }, { "name": "namenode", "count": 1, "node_group_template_id": "f8eb08e6-80d5-4409-af7e-13009e694603" } ], "name": "cdh-550-default-cluster", "cluster_configs": {}, "is_protected": true}

    ② Create the cluster template:

    openstack dataprocessing cluster template create --json cluster.json

    List the cluster templates:

    openstack dataprocessing cluster template list

    2.4 Create the cluster

    Gather the information needed to create the cluster  ① Create a keypair:

    source ~/.openstack/.lab-openrc
    openstack keypair create labkey --public-key ~/.ssh/id_rsa.pub
    openstack keypair list

    ② Get the ID of the cluster template cdh-550-default-cluster:

    openstack dataprocessing cluster template list | awk '/ cdh-550-default-cluster / { print $4 }'

    ③ Get the ID of the image registered with Sahara that the cluster will use:

    openstack dataprocessing image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $4 }'

    ④ Get the ID of the cluster network selfservice-sahara-cluster:

    openstack network list | awk '/ selfservice-sahara-cluster / { print $2 }'

    Create the cluster configuration file cluster_create.json with the following content:

    { "plugin_name": "cdh", "hadoop_version": "5.5.0", "name": "cluster-1", "cluster_template_id" : "b55ef1b7-b5df-4642-9543-71a9fe972ac0", "user_keypair_id": "labkey", "default_image_id": "1b0a2a22-26d5-4a0f-b186-f19dbacbb971", "neutron_management_network": "548e06a1-f86c-4dd7-bdcd-dfa1c3bdc24f", "is_protected": true}

    Create the cluster:

    openstack dataprocessing cluster create --json cluster_create.json
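    Provisioning takes a while. The cluster status (Spawning, Configuring, Starting, Active, ...) can be followed with the same client (a quick check; the cluster name matches cluster_create.json above):

    openstack dataprocessing cluster list
    openstack dataprocessing cluster show cluster-1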

    3. Elastic Data Processing (EDP)

    4. Problems encountered

    4.1 Sahara cluster cannot be deleted

    Problem  A delete was issued while the cluster was still being created; the cluster status then stays at Deleting forever and the cluster cannot be removed.

    Cause  Not identified.

    Solution  Connect to the sahara database and manually delete the records for this cluster from the clusters and node_groups tables:

    mysql -usahara -p
    use sahara;
    show tables;
    delete from node_groups;
    delete from clusters;

    Note: here the clusters and node_groups tables only contained the rows of the cluster that had just been created, so all rows were deleted. It is safer to add a WHERE clause so that only the rows belonging to that cluster ID are removed, as sketched below.
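    A minimal sketch of the conditional delete (the cluster ID is a placeholder; the node_groups.cluster_id and clusters.id columns are assumed from the Sahara schema, and node_groups is deleted first because it references clusters):

    delete from node_groups where cluster_id = '<cluster-id>';
    delete from clusters where id = '<cluster-id>';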

    4.2 RAM quota exceeded

    Problem  Cluster creation fails with a RAM quota error:

    Quota exceeded for RAM: Requested 81920, but available 51200
    Error ID: c196131b-047d-4ed8-9dbd-4cc074cb8147

    Cause  The total RAM requested for the cluster exceeds the project's RAM quota.

    Solution  Log in as admin and raise the RAM quota of the lab project:

    source .openstack/.admin-openrc
    openstack quota show lab
    openstack quota set --ram 81920 lab

    4.3 floating_ip quota exceeded

    Problem

    Quota exceeded for floating ip: Requested 10, but available 0
    Error ID: d5d04298-ba8b-466c-80cc-aa12ca989d8f

    Cause  The error claims the lab project's floating IP quota is exhausted, yet checking the project's quota showed plenty of floating IPs available. Deleting the cluster and retrying produced the same error. The official documentation (http://docs.openstack.org/developer/sahara/userdoc/configuration.guide.html#floating-ip-management) pointed to the real problem: /etc/sahara/sahara.conf was misconfigured, with use_floating_ips set to False.

    Solution  ① Edit the configuration file with sudo vi /etc/sahara/sahara.conf and change use_floating_ips=False to use_floating_ips=true (the resulting setting is shown after step ③)  ② Apply the change to the database:

    su root
    sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head

    ③ Restart the Sahara services:

    sudo service sahara-api restart
    sudo service sahara-engine restart
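    For reference, after the edit the relevant option in /etc/sahara/sahara.conf should look like this (a minimal sketch; the option is assumed to live in the [DEFAULT] section on a stock Mitaka install):

    [DEFAULT]
    use_floating_ips = true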

    4.4 Sahara cluster creation fails with status Error

    Problem  After creating the cluster, its status shows Error; the Sahara log contains:

    2016-07-22 11:01:39.339 7763 ERROR sahara.service.trusts [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Unable to create trust (reason: Expecting to find id or name in user - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Request-ID: req-3981c44d-4c09-4254-aacb-d67ee74746f8))
    2016-07-22 11:01:39.476 7763 ERROR sahara.service.ops [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Error during operating on cluster (reason: Failed to create trust

    Cause  Analysis showed that keystone_authtoken authentication was failing because the Sahara service user had no administrator credentials configured.

    Solution  ① Edit /etc/sahara/sahara.conf (sudo vi /etc/sahara/sahara.conf) and add the following to the Keystone authentication section (replace SAHARA_PASS with the actual password):

    [keystone_authtoken]
    identity_uri = http://controller:35357
    admin_tenant_name = service
    admin_user = sahara
    admin_password = SAHARA_PASS

    ② Apply the change to the database:

    su root
    sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head

    ③ Restart the Sahara services:

    sudo service sahara-api restart
    sudo service sahara-engine restart

    4.5 Security group rule quota exceeded

    Problem  The cluster status goes to Spawning and then Error; /var/log/sahara/sahara-engine.log shows:

    2016-07-22 21:18:27.317 110479 WARNING sahara.service.heat.heat_engine [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: OverQuotaClient: resources.cdh-550-default-namenode.resources[0].resources.cluster-1-cdh-550-default-namenode-3e806c9b: Quota exceeded for resources: ['security_group_rule'].
    Neutron server returns request_ids: ['req-4d91968f-451f-426c-aa67-d8827f1ad426']
    Error ID: 03aa7921-898e-4444-9c7f-c2321a5f8bdb)
    2016-07-22 21:18:27.911 110479 INFO sahara.utils.cluster [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster status has been changed. New status=Error

    Cause  The lab project's security group rule quota has been exhausted.

    Solution  Log in as admin and change the secgroups and secgroup-rules quotas of the lab project (a negative quota means unlimited):

    source .openstack/.admin-openrc
    openstack quota show lab
    openstack quota set --secgroups -1 lab
    openstack quota set --secgroup-rules -1 lab

    4.6 Floating IPs of the instances cannot be pinged

    Problem  The instances can reach the external network, but their floating IPs cannot be pinged.

    Cause  The default security group has no rules allowing ICMP or SSH; adding the corresponding rules fixes it.

    Solution

    source ~/.openstack/.lab-openrc
    openstack security group rule create --proto icmp default
    openstack security group rule create --proto tcp --dst-port 22 default

    4.7 Cluster creation fails: Heat stack resource creation timed out

    References:
    OS::Heat::WaitCondition doesnt work after upgrade to Liberty
    wait condition in HOT heat template
    OpenStack Orchestration In Depth, Part II: Single Instance Deployments
    https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-waitcondition-timed-out/
    https://ask.openstack.org/en/question/42657/how-to-debug-scripts-at-heats-softwareconfig-resource

    Problem  During cluster creation the status stays at Spawning; /var/log/sahara/sahara-engine.log shows:

    2016-07-22 22:51:00.470 119076 WARNING sahara.service.heat.heat_engine [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: WaitConditionTimeout: resources.cdh-550-default-secondary-namenode.resources[0].resources.cdh-550-default-secondary-namenode-wc-waiter: 0 of 1 received
    Error ID: 8a34fc47-4e84-4818-8728-78e543c97efb)
    2016-07-22 22:51:01.069 119076 INFO sahara.utils.cluster [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster status has been changed. New status=Error

    /var/log/heat/heat-engine.log shows:

    2016-07-28 10:21:12.748 7561 INFO heat.engine.resources.openstack.heat.wait_condition [-] HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159] Timed out (0 of 1 received)
    2016-07-28 10:21:12.762 7563 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "testcb185f1e-cdh-550-default-namenode-bym45epemale-0-az3fffva54ro" [58bb1eac-744b-4130-b486-98f726975dc0] running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
    2016-07-28 10:21:12.763 7563 DEBUG heat.engine.scheduler [-] Task create running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
    2016-07-28 10:21:12.749 7561 INFO heat.engine.resource [-] CREATE: HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159]
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource Traceback (most recent call last):
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 704, in _action_recorder
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in _do_action
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 314, in wrapper
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource step = next(subtask)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 749, in action_handler_task
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource while not check(handler_data):
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 130, in check_create_complete
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource return self._wait(*data)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 108, in _wait
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource raise exc
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource WaitConditionTimeout: 0 of 1 received
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource

    Cause
    ① The log suggests that the stack creation failed: the wait for the stack resources timed out, which eventually put the cluster into the Error state.
    ② During cluster creation, once the instances were ACTIVE, the ID of the stack that stayed in CREATE_IN_PROGRESS was taken from the log and its resources were listed on the controller node with openstack stack resource list <stack_id>. Only the wc-waiter resource had not been created; it remained in CREATE_IN_PROGRESS.
    ③ The AWS article on wait-condition timeouts suggested that the instances never notified Heat that they had finished booting.
    ④ Logging in to an instance over SSH and inspecting /var/lib/cloud/data/cfn-userdata showed an HTTP request that reports creation success to heat-api, but the heat-api address uses the hostname controller instead of an IP address. The controller's IP address could be pinged from the instance, but the hostname controller could not be resolved. After replacing the hostname with the IP address, re-running the HTTP request manually returned OK, and after doing this on every cluster node the cluster was created successfully.
    ⑤ Conclusion: the heat-api endpoint was created with a hostname instead of an IP address. The instances cannot resolve that hostname, so they cannot reach heat-api and never report that they have finished booting, and the Heat stack creation blocks until it times out.

    Solution
    ① Method 1: during cluster creation, once the instances show ACTIVE, log in to every cluster node and manually add a record such as 192.168.1.11 controller to /etc/hosts (see the sketch after this list). This is tedious and only a temporary workaround.
    ② Method 2: create the heat-api endpoint with an IP address instead of a hostname. (Untested, but it should solve the problem.)
    ③ Method 3: configure dnsmasq so that the instances can resolve the hostname. (Several attempts at this configuration did not succeed; it still needs investigation.)
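    A minimal sketch of method 1, run on each cluster node (the controller IP is the one given above; adjust it to your deployment):

    echo "192.168.1.11 controller" | sudo tee -a /etc/hosts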

    4.8 Controller node runs out of disk space

    Problem  While creating a Sahara cluster, the instances are created successfully but the cluster status stays at Starting; the back-end logs report that the disk is full.

    Cause  df -h showed disk usage close to 100%. Running du -h --max-depth=1 in the suspect directories showed that /var/lib and /var/log use the most space, with the system log /var/log/syslog alone taking 20 GB, so the log files are the culprit (see the commands below).
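    The diagnostic commands referenced above, as they might be run on the controller (the sort is added here for convenience):

    df -h
    cd /var
    sudo du -h --max-depth=1 | sort -h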

    Solution  ① Log files can be truncated with cat /dev/null > /var/log/syslog; do the same for any other log files that use a lot of space. In a test environment, the following script can be used to empty all of them:

    #!/bin/bash
    # Truncate every *.log file under /var/log (test environments only)
    for i in $(find /var/log -name "*.log")
    do
        echo $i
        cat /dev/null > $i
    done
    exit 0

    Note: for an oversized MongoDB log, use log rotation instead (see https://docs.mongodb.com/manual/tutorial/rotate-log-files/):

    ps -ef | grep mongod              # find the mongod process ID
    kill -SIGUSR1 <mongod process id>

    Or log in to MongoDB and run:

    use admin
    db.runCommand( { logRotate : 1 } )

    ② Under /var/lib, the directories using the most space are /var/lib/mongodb and /var/lib/glance. /var/lib/glance holds the OpenStack images, which is expected; /var/lib/mongodb contains many files named ceilometer.*, which are the Ceilometer metering data stored in MongoDB. In a test environment they are not needed and can simply be deleted.  Reference: reclaiming disk space after dropping MongoDB collections.

    mongo --host controller
    show dbs
    use ceilometer
    db                       // show the current database
    show collections
    db.meter.count()
    db.meter.remove({})      // removing a large collection takes a while
    db.repairDatabase()      // reclaim disk space

    4.9 Sahara cluster fails to start the Hadoop services

    Problem  When Sahara creates the cluster, the cluster services fail to start.  ① sahara-engine.log shows:

    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error during operating on cluster (reason: Failed to Provision Hadoop Cluster: Failed to start service.
    Error ID: df99b010-46e0-41df-89ac-95490a52fc90)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Traceback (most recent call last):
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 192, in wrapper
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] f(cluster_id, *args, **kwds)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 301, in _provision_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] plugin.start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/plugin.py", line 51, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] cluster.hadoop_version).start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/abstractversionhandler.py", line 109, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.deploy.start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/v5_5_0/deploy.py", line 165, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] CU.first_run(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 139, in handler
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] add_fail_event(instance, e)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.force_reraise()
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] six.reraise(self.type_, self.value, self.tb)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 136, in handler
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] value = func(*args, **kwargs)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/cloudera_utils.py", line 42, in wrapper
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] raise ex.HadoopProvisionError(c.resultMessage)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] HadoopProvisionError: Failed to Provision Hadoop Cluster: Failed to start service.
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error ID: df99b010-46e0-41df-89ac-95490a52fc90
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]
    2016-08-12 21:28:08.185 7531 INFO sahara.utils.cluster [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Cluster status has been changed. New status=Error

    ② The logs under /var/log/hdfs/ on the instances show:

    java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
        at org.apache.hadoop.ipc.Client.call(Client.java:905)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
        at $Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy0.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

    Cause  See:  http://stackoverflow.com/questions/5293446/hdfs-error-could-only-be-replicated-to-0-nodes-instead-of-1  http://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1  Not yet resolved; the preliminary conclusion is that the network between the DataNodes and the NameNode is unreliable, leaving no DataNode available.

    Solution  Unresolved at the time of writing; see the diagnostic sketch below.
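    Not a fix from the original article, but one way to confirm the suspected DataNode problem is to check how many DataNodes the NameNode actually sees (a sketch, assuming SSH access to the NameNode instance and the standard hdfs CLI installed by CDH):

    # On the NameNode instance: lists live/dead DataNodes and their capacity
    sudo -u hdfs hdfs dfsadmin -report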

    4.10 CDH/Spark services do not start automatically after a cluster host reboots

    Problem  After a Sahara cluster host is rebooted, the CDH or Spark services do not start automatically, so the services are unavailable.

    Cause  For the Spark plugin this is documented behaviour: Spark is not deployed as a standard Ubuntu service, so it is not restarted when the virtual machines reboot.

    Spark is not deployed as a standard Ubuntu service and if the virtual machines are rebooted, Spark will not be restarted.

    Reference:  http://docs.openstack.org/developer/sahara/userdoc/spark_plugin.html

    Solution  No permanent fix is given; a manual restart after reboot is sketched below.
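    A hedged sketch of restarting the services by hand after a reboot (the service names are the standard Cloudera Manager ones for the CDH plugin; the Spark path is an assumption about where the Spark plugin installs Spark):

    # CDH plugin: bring Cloudera Manager back, then restart the cluster services from its UI/API
    sudo service cloudera-scm-server restart    # on the Cloudera Manager node
    sudo service cloudera-scm-agent restart     # on every cluster node

    # Spark plugin: start the standalone master/workers manually (path assumed)
    sudo /opt/spark/sbin/start-all.sh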

    References

    Sahara (Data Processing) UI User Guide
    Deploying a Hadoop cluster with Sahara
    Quickly deploying a Cloudera Hadoop cluster with OpenStack Sahara
    Installing and using a Spark cluster
    OpenStack Cinder multi-backend
    Sahara Quickstart Guide
    Sahara cluster statuses at a glance
    Sahara Cluster Statuses Overview
    Data Processing service command-line client
    Sahara cluster creation/deletion stuck
    How-to: Get Started with CDH on OpenStack with Sahara