Create the project lab and the user lab, and grant the normal user role
source ~/.openstack/.admin-openrc
openstack project create --domain default --description "Lab Project" lab
openstack user create --domain default --password-prompt lab
openstack role add --project lab --user lab user
Note: you can also log in to the Dashboard as admin and create the project lab and the user lab graphically.
Set up the environment variables for the OpenStack lab user
① Create the environment variable file for the lab user:
vi ~/.openstack/.lab-openrc
# Add environment variables for the lab user
export OS_PROJECT_DOMAIN_NAME=default
export OS_USER_DOMAIN_NAME=default
export OS_PROJECT_NAME=lab
export OS_USERNAME=lab
# To avoid security problems, remove the OS_PASSWORD variable
# and use the --password parameter with OpenStack client commands instead
export OS_PASSWORD=lab@a112
export OS_AUTH_URL=http://controller:5000/v3
export OS_AUTH_TYPE=password
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2
② Load the environment variables:
source ~/.openstack/.lab-openrc

List the available sahara plugins
source ~/.openstack/.lab-openrc
openstack dataprocessing plugin list
Note: this article uses the cdh plugin, version 5.5.0.
Download the image for that version
wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2

Create the image
source ~/.openstack/.lab-openrc
openstack image create "sahara-mitaka-cloudera-5.5.0-ubuntu" --file sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2 --disk-format qcow2 --container-format bare

List the images
openstack image list

Create the network, subnet, and router used to deploy the cluster
source ~/.openstack/.lab-openrc
neutron net-create selfservice-sahara-cluster
neutron subnet-create --name selfservice-sahara-cluster --dns-nameserver 8.8.4.4 --gateway 172.16.100.1 selfservice-sahara-cluster 172.16.100.0/24
neutron router-create router
neutron router-interface-add router selfservice-sahara-cluster
neutron router-gateway-set router provider
openstack network list
openstack subnet list
neutron router-list

Create the flavor used to deploy the cluster
source ~/.openstack/.admin-openrc
openstack flavor create --vcpus 4 --ram 8192 --disk 20 sahara-flavor
openstack flavor list

Get the ID of the image sahara-mitaka-cloudera-5.5.0-ubuntu
source ~/.openstack/.lab-openrc
export IMAGE_ID=$(openstack image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $2 }')

Get the username and tags for the image sahara-mitaka-cloudera-5.5.0-ubuntu
See: http://docs.openstack.org/developer/sahara/userdoc/cdh_plugin.html
① Username: ubuntu
② Tags: cdh, 5.5.0
Register the image
openstack dataprocessing image register $IMAGE_ID --username ubuntu

Add the tags
openstack dataprocessing image tags add $IMAGE_ID --tags cdh 5.5.0
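To verify that the registration and tags were applied, the registered image can be inspected. A quick check; the dataprocessing image show subcommand is assumed to be available in this saharaclient release:

# Show the registered username and tags of the image (subcommand availability assumed)
openstack dataprocessing image show $IMAGE_ID
# Alternatively, list everything registered with sahara
openstack dataprocessing image list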
Gather basic information
① Flavor ID: 8d824f5a-a829-42ad-9878-f38318cc9821
openstack flavor list | awk '/ sahara-flavor / { print $2 }'
② Floating IP pool ID: 20b2a466-cd25-4b9a-9194-2b8005a8b547
openstack ip floating pool list
openstack network list | awk '/ provider / { print $2 }'

Create the cdh-550-default-namenode node group template
① Create a file namenode.json with the following content:
{ "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_NAMENODE", "YARN_RESOURCEMANAGER", "HIVE_SERVER2", "HIVE_METASTORE", "CLOUDERA_MANAGER" ], "name": "cdh-550-default-namenode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}② 创建节点组模板:
openstack dataprocessing node group template create --json namenode.json
Note: the template can also be created directly on the command line, for example:
openstack dataprocessing node group template create --name vanilla-default-worker --plugin <plugin_name> --plugin-version <plugin_version> --processes HDFS_NAMENODE YARN_RESOURCEMANAGER HIVE_SERVER2 HIVE_METASTORE CLOUDERA_MANAGER --flavor <flavor-id> --floating-ip-pool <pool-id> --auto-security-group

Create the cdh-550-default-secondary-namenode node group template
① Create a file secondary-namenode.json with the following content:
{ "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_SECONDARYNAMENODE", "OOZIE_SERVER", "YARN_JOBHISTORY", "SPARK_YARN_HISTORY_SERVER" ], "name": "cdh-550-default-secondary-namenode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}② 创建节点组模板:
openstack dataprocessing node group template create --json secondary-namenode.json

Create the cdh-550-default-datanode node group template
① Create a file datanode.json with the following content:
{ "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_processes": [ "HDFS_DATANODE", "YARN_NODEMANAGER" ], "name": "cdh-550-default-datanode", "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547", "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821", "auto_security_group": true, "is_protected": true}② 创建节点组模板:
openstack dataprocessing node group template create --json datanode.json

Get the node group template IDs
① List the node group templates:
openstack dataprocessing node group template list
② Note the ID of each node group template:
Node Template                       ID
cdh-550-default-namenode            f8eb08e6-80d5-4409-af7e-13009e694603
cdh-550-default-secondary-namenode  a4ebb4d5-67b4-41f2-969a-2ac6db4f892f
cdh-550-default-datanode            c80540fe-98b7-4dc8-9e94-0cd93c23c0f7
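These IDs can also be captured into shell variables for the next step — a sketch, assuming the ID sits in the same awk field ($4) that the cluster template and image list commands use later in this article:

# Capture each node group template ID (awk field position assumed)
export NAMENODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-namenode / { print $4 }')
export SECONDARY_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-secondary-namenode / { print $4 }')
export DATANODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-datanode / { print $4 }')
echo $NAMENODE_NGT_ID $SECONDARY_NGT_ID $DATANODE_NGT_ID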
{ "plugin_name": "cdh", "hadoop_version": "5.5.0", "node_groups": [ { "name": "datanode", "count": 8, "node_group_template_id": "c80540fe-98b7-4dc8-9e94-0cd93c23c0f7" }, { "name": "secondary-namenode", "count": 1, "node_group_template_id": "a4ebb4d5-67b4-41f2-969a-2ac6db4f892f" }, { "name": "namenode", "count": 1, "node_group_template_id": "f8eb08e6-80d5-4409-af7e-13009e694603" } ], "name": "cdh-550-default-cluster", "cluster_configs": {}, "is_protected": true}② 创建集群模板
openstack dataprocessing cluster template create --json cluster.json

List the cluster templates
openstack dataprocessing cluster template list

Gather the information needed to create the cluster
① Create a keypair:
source ~/.openstack/.lab-openrc
openstack keypair create labkey --public-key ~/.ssh/id_rsa.pub
openstack keypair list
② Get the ID of the cluster template cdh-550-default-cluster:
openstack dataprocessing cluster template list | awk '/ cdh-550-default-cluster / { print $4 }'
③ Get the ID of the image registered with sahara as the cluster default image:
openstack dataprocessing image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $4 }'
④ Get the ID of the cluster network selfservice-sahara-cluster:
openstack network list | awk '/ selfservice-sahara-cluster / { print $2 }'

Create the cluster configuration file cluster_create.json
Its content is as follows:
{ "plugin_name": "cdh", "hadoop_version": "5.5.0", "name": "cluster-1", "cluster_template_id" : "b55ef1b7-b5df-4642-9543-71a9fe972ac0", "user_keypair_id": "labkey", "default_image_id": "1b0a2a22-26d5-4a0f-b186-f19dbacbb971", "neutron_management_network": "548e06a1-f86c-4dd7-bdcd-dfa1c3bdc24f", "is_protected": true}创建集群
Create the cluster
openstack dataprocessing cluster create --json cluster_create.json

Problem
A delete was issued while the cluster was still being created, before creation had finished; the cluster status then stays at Deleting and the cluster cannot be deleted.
Cause
Not yet identified.
Solution
Query the sahara database and manually delete the records for that cluster from the clusters and node_groups tables.
mysql -u sahara -p
use sahara;
show tables;
delete from node_groups;
delete from clusters;
Note: here the clusters and node_groups tables only contain the data for the cluster that was just created, so all rows are simply removed. It is safer to add a WHERE clause so that only the rows belonging to that cluster ID are deleted (a sketch follows).
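A sketch of the safer, conditioned delete; the node_groups.cluster_id and clusters.id column names are assumptions about the sahara schema, so check them first:

# Verify the column names before deleting (cluster_id / id below are assumptions)
mysql -u sahara -p sahara -e "show columns from node_groups; show columns from clusters;"
# Remove only the rows that belong to the stuck cluster, child table first
mysql -u sahara -p sahara -e "delete from node_groups where cluster_id = '<stuck-cluster-id>';"
mysql -u sahara -p sahara -e "delete from clusters where id = '<stuck-cluster-id>';"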
Problem
Cluster creation fails with a RAM quota error:
Quota exceeded for RAM: Requested 81920, but available 51200
Error ID: c196131b-047d-4ed8-9dbd-4cc074cb8147
Cause
The total RAM requested for the cluster exceeds the project's RAM quota: 10 instances (8 datanodes + 1 namenode + 1 secondary namenode) × 8192 MB = 81920 MB, which is more than the 51200 MB quota.
Solution
Log in as admin and raise the RAM quota of the project lab:
source .openstack/.admin-openrc
openstack quota show lab
openstack quota set --ram 81920 lab

Problem
Quota exceeded for floating ip: Requested 10, but available 0
Error ID: d5d04298-ba8b-466c-80cc-aa12ca989d8f
Cause
The error says the floating IP quota of project lab is exhausted, yet checking the quota shows plenty of floating IPs available. Deleting the cluster and retrying gives the same quota error. The official documentation
http://docs.openstack.org/developer/sahara/userdoc/configuration.guide.html#floating-ip-management
reveals that the configuration file /etc/sahara/sahara.conf was wrong: use_floating_ips had been set to False.
Solution
① Edit the configuration file with sudo vi /etc/sahara/sahara.conf and change use_floating_ips=False to use_floating_ips=true (a sketch of the resulting fragment follows after step ③).
② Write the change to the database:
su root
sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head
③ Restart the sahara services:
sudo service sahara-api restart
sudo service sahara-engine restart
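For reference, after the edit in step ① the relevant fragment of /etc/sahara/sahara.conf looks roughly like this (a sketch; the option is assumed to live in the [DEFAULT] section, per the configuration guide linked above):

# /etc/sahara/sahara.conf (fragment)
[DEFAULT]
# Assign floating IPs to cluster instances and use them for management
use_floating_ips = true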
Problem
After creating a cluster, the cluster status shows Error.

2016-07-22 11:01:39.339 7763 ERROR sahara.service.trusts [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Unable to create trust (reason: Expecting to find id or name in user - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Request-ID: req-3981c44d-4c09-4254-aacb-d67ee74746f8))
2016-07-22 11:01:39.476 7763 ERROR sahara.service.ops [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Error during operating on cluster (reason: Failed to create trust
Cause
Analysis shows that keystone_authtoken authentication failed: the sahara service account did not have admin rights.
Solution
① Edit the file with sudo vi /etc/sahara/sahara.conf and add the following to the keystone authentication section.
Note: replace SAHARA_PASS with the actual password.
[keystone_authtoken]
identity_uri = http://controller:35357
admin_tenant_name = service
admin_user = sahara
admin_password = SAHARA_PASS
② Write the change to the database:
su root
sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head
③ Restart the sahara services:
sudo service sahara-api restart
sudo service sahara-engine restart

Problem
During creation the cluster status shows Spawning and then turns to Error; the log file /var/log/sahara/sahara-engine.log reports:
2016-07-22 21:18:27.317 110479 WARNING sahara.service.heat.heat_engine [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: OverQuotaClient: resources.cdh-550-default-namenode.resources[0].resources.cluster-1-cdh-550-default-namenode-3e806c9b: Quota exceeded for resources: ['security_group_rule'].
Neutron server returns request_ids: ['req-4d91968f-451f-426c-aa67-d8827f1ad426']
Error ID: 03aa7921-898e-4444-9c7f-c2321a5f8bdb)
2016-07-22 21:18:27.911 110479 INFO sahara.utils.cluster [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster status has been changed. New status=Error
Cause
The security group rule quota of project lab is not large enough.
Solution
Log in as admin and adjust the secgroups and secgroup-rules quotas of project lab.
Note: a negative quota means unlimited.
source .openstack/.admin-openrc
openstack quota show lab
openstack quota set --secgroups -1 lab
openstack quota set --secgroup-rules -1 lab

Problem
The virtual machine can ping the external network, but its floating IP cannot be pinged.
Cause
The rules of the default security group do not allow ICMP or SSH; adding the corresponding security group rules fixes it.
Solution
source ~/.openstack/.lab-openrc
openstack security group rule create --proto icmp default
openstack security group rule create --proto tcp --dst-port 22 default

References (on Heat wait conditions, relevant to the next problem):
OS::Heat::WaitCondition doesn't work after upgrade to Liberty
wait condition in HOT heat template
OpenStack Orchestration In Depth, Part II: Single Instance Deployments
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-waitcondition-timed-out/
https://ask.openstack.org/en/question/42657/how-to-debug-scripts-at-heats-softwareconfig-resource
Problem
During cluster creation the cluster status stays at Spawning; the log file /var/log/sahara/sahara-engine.log shows:
2016-07-22 22:51:00.470 119076 WARNING sahara.service.heat.heat_engine [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: WaitConditionTimeout: resources.cdh-550-default-secondary-namenode.resources[0].resources.cdh-550-default-secondary-namenode-wc-waiter: 0 of 1 received
Error ID: 8a34fc47-4e84-4818-8728-78e543c97efb)
2016-07-22 22:51:01.069 119076 INFO sahara.utils.cluster [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster status has been changed. New status=Error
The log file /var/log/heat/heat-engine.log shows:
2016-07-28 10:21:12.748 7561 INFO heat.engine.resources.openstack.heat.wait_condition [-] HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159] Timed out (0 of 1 received)
2016-07-28 10:21:12.762 7563 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "testcb185f1e-cdh-550-default-namenode-bym45epemale-0-az3fffva54ro" [58bb1eac-744b-4130-b486-98f726975dc0] running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
2016-07-28 10:21:12.763 7563 DEBUG heat.engine.scheduler [-] Task create running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
2016-07-28 10:21:12.749 7561 INFO heat.engine.resource [-] CREATE: HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159]
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource Traceback (most recent call last):
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 704, in _action_recorder
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     yield
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in _do_action
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 314, in wrapper
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     step = next(subtask)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 749, in action_handler_task
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     while not check(handler_data):
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 130, in check_create_complete
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     return self._wait(*data)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource   File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 108, in _wait
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource     raise exc
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource WaitConditionTimeout: 0 of 1 received
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource
Cause
① From the logs, the initial analysis is that the Heat stack failed to create: the wait condition timed out when the stack was built, which finally puts the cluster into the Error state.
② Further investigation: during cluster creation, once the instances are ACTIVE, extract from the logs the ID of the stack that stays in CREATE_IN_PROGRESS and run openstack stack resource list stack_id on the controller node to list the stack's resources. Only the wc-waiter resource never finishes; it remains in CREATE_IN_PROGRESS, as shown in the figure below.
③ Based on the AWS hints referenced above, the guess is that the instances finish booting but never notify heat that creation is complete.
④ SSH into an instance and inspect /var/lib/cloud/data/cfn-userdata; it contains an HTTP request that reports success to heat-api, but the heat-api address uses the hostname controller instead of an IP address. Ping tests show the IP address of host controller is reachable while the name controller cannot be resolved. After replacing the hostname with the IP address and issuing the HTTP request manually, the reply is OK, as shown in the figure below. After logging in to every node of the cluster and issuing the request by hand, the cluster creation completes successfully.
⑤ Testing and analysis therefore point to the heat-api endpoint having been created with a hostname instead of an IP address: the instances cannot resolve the hostname, cannot reach heat-api, and never send the "instance created" signal, so the heat stack creation blocks until it times out and fails.
With the workaround applied, cluster creation succeeds, as shown in the figure below.
Solution
① Method 1: once the instances show ACTIVE during cluster creation, log in to every node of the cluster and manually add the entry 192.168.1.11 controller to each node's /etc/hosts (a sketch follows below). This works but is tedious and is only a temporary workaround.
② Method 2: when creating the heat-api endpoint, use the IP address instead of the hostname. (Untested, but it should resolve the problem.)
③ Method 3: configure dnsmasq so that the instances can resolve the hostname. (Several configuration attempts have not succeeded yet; whether this is feasible needs further study.)
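A minimal sketch of method 1, assuming the nodes' floating IPs are listed one per line in a file nodes.txt (hypothetical) and that the image login user is ubuntu, as registered earlier:

# Append the controller host entry to every cluster node (nodes.txt is a hypothetical IP list)
for ip in $(cat nodes.txt); do
    ssh -i ~/.ssh/id_rsa ubuntu@$ip "echo '192.168.1.11 controller' | sudo tee -a /etc/hosts"
done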
Problem
While creating a sahara cluster, the instances are created successfully but the cluster status stays at Starting; the back-end logs report that disk space is insufficient.
Cause
df -h shows disk usage close to 100%. Running du -h --max-depth=1 in the suspect directories shows that /var/lib and /var/log use the most space; the system log /var/log/syslog alone takes 20 GB, so log files are the culprit.
Solution
① For the log files, run cat /dev/null > /var/log/syslog and likewise empty the other logs that take up a lot of space. In a test environment the following script can be used to empty the logs:
#!/bin/bash
for i in `find /var/log -name "*.log"`
do
    echo $i
    cat /dev/null > $i
done
exit 0
Note: if the mongodb log is too large, rotate it instead, as follows.
See: https://docs.mongodb.com/manual/tutorial/rotate-log-files/
ps -ef | grep mongod    # find the mongodb process ID
kill -SIGUSR1 <mongod process id>
Or log in to mongodb on the command line and run:
use admin
db.runCommand( { logRotate : 1 } )
② Looking at /var/lib, the directories using the most space are /var/lib/mongodb and /var/lib/glance. /var/lib/glance holds the OpenStack image files, which is legitimate usage; /var/lib/mongodb contains many files named ceilometer.*, which hold the ceilometer metering data stored in mongodb. In a test environment where they are not needed, they can simply be deleted.
See: mongodb does not release disk space after a collection is dropped
mongo --host controller
show dbs
use ceilometer
db                      // show the current database
show collections
db.meter.count()
db.meter.remove({})     // with a large data set, the remove takes a while
db.repairDatabase()     // reclaim the disk space

Problem
When sahara creates a cluster, the cluster services fail to start.
① The log sahara-engine.log shows:
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error during operating on cluster (reason: Failed to Provision Hadoop Cluster: Failed to start service.
Error ID: df99b010-46e0-41df-89ac-95490a52fc90)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Traceback (most recent call last):
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 192, in wrapper
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     f(cluster_id, *args, **kwds)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 301, in _provision_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     plugin.start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/plugin.py", line 51, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     cluster.hadoop_version).start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/abstractversionhandler.py", line 109, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     self.deploy.start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/v5_5_0/deploy.py", line 165, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     CU.first_run(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 139, in handler
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     add_fail_event(instance, e)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     self.force_reraise()
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     six.reraise(self.type_, self.value, self.tb)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 136, in handler
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     value = func(*args, **kwargs)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]   File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/cloudera_utils.py", line 42, in wrapper
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]     raise ex.HadoopProvisionError(c.resultMessage)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] HadoopProvisionError: Failed to Provision Hadoop Cluster: Failed to start service.
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error ID: df99b010-46e0-41df-89ac-95490a52fc90
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]
2016-08-12 21:28:08.185 7531 INFO sahara.utils.cluster [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Cluster status has been changed. New status=Error
② The logs under /var/log/hdfs/ on the instances show the error:
java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
Cause
See:
http://stackoverflow.com/questions/5293446/hdfs-error-could-only-be-replicated-to-0-nodes-instead-of-1
http://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1
Not resolved yet; the preliminary conclusion is that the network between the datanodes and the namenode is unreliable, leaving the datanodes unavailable.
Solution
None confirmed yet.
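As a first diagnostic step, it can help to check from the namenode instance whether any DataNode has actually registered, and whether the datanodes can reach the NameNode RPC port. A sketch using standard Hadoop/CDH commands; 8020 is assumed as the usual CDH NameNode RPC port:

# On the namenode instance: list live/dead datanodes and reported capacity
sudo -u hdfs hdfs dfsadmin -report
# From a datanode instance: basic connectivity test to the NameNode RPC port
nc -zv <namenode-ip> 8020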
Problem
After the hosts of a sahara cluster are rebooted, the corresponding CDH or Spark services do not start automatically, so the services are unavailable.
Cause
For the Spark plugin this is documented: Spark is not deployed as a standard Ubuntu service, so it does not restart when the virtual machines reboot.
Spark is not deployed as a standard Ubuntu service and if the virtual machines are rebooted, Spark will not be restarted.
Reference: http://docs.openstack.org/developer/sahara/userdoc/spark_plugin.html
Solution
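One possible workaround, not from the original setup: start the standalone Spark daemons again after every boot, for example through a cron @reboot entry. The path /opt/spark is an assumption; check where the image actually installs Spark before using it.

# /etc/cron.d/spark-restart (hypothetical file)
# Assumes Spark is installed under /opt/spark; adjust to the image's real path.
@reboot root /opt/spark/sbin/start-all.sh >> /var/log/spark-reboot.log 2>&1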