Hadoop's Built-in Test Classes: Introduction and Usage

xiaoxiao · 2025-04-03

I. Hadoop Benchmarks

Hadoop ships with several benchmark programs, packaged into a few jar files. The tests in this article use the Cloudera (CDH) distribution:

[hsu@server01 ~]$ ls /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop* | egrep "examples|test"
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar

(1) Hadoop Test

When hadoop-test.jar is invoked without arguments, it lists all of the available test programs:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  TestDFSIO: Distributed i/o benchmark.
  dfsthroughput: measure hdfs throughput
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  minicluster: Single process HDFS and MR cluster.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  testarrayfile: A test for flat files of binary key/value pairs.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testrpc: A test for rpc.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testsetfile: A test for flat files of binary key/value pairs.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill

These programs exercise Hadoop from several angles; TestDFSIO, mrbench, and nnbench are the three most widely used.

(2) TestDFSIO write

TestDFSIO measures HDFS I/O performance. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output collects the per-file statistics, the reduce accumulates them and produces a summary. Its usage:

TestDFSIO Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

The following example writes ten 1000 MB files into HDFS:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
15/01/13 15:14:17 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:14:17 INFO fs.TestDFSIO: nrFiles = 10
15/01/13 15:14:17 INFO fs.TestDFSIO: nrBytes (MB) = 1000.0
15/01/13 15:14:17 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:14:17 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:14:18 INFO fs.TestDFSIO: creating control file: 1048576000 bytes, 10 files
15/01/13 15:14:19 INFO fs.TestDFSIO: created control files for: 10 files
15/01/13 15:15:23 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
15/01/13 15:15:23 INFO fs.TestDFSIO:            Date & time: Tue Jan 13 15:15:23 CST 2015
15/01/13 15:15:23 INFO fs.TestDFSIO:        Number of files: 10
15/01/13 15:15:23 INFO fs.TestDFSIO: Total MBytes processed: 10000.0
15/01/13 15:15:23 INFO fs.TestDFSIO:      Throughput mb/sec: 29.67623230554649
15/01/13 15:15:23 INFO fs.TestDFSIO: Average IO rate mb/sec: 29.899526596069336
15/01/13 15:15:23 INFO fs.TestDFSIO:  IO rate std deviation: 2.6268824639446526
15/01/13 15:15:23 INFO fs.TestDFSIO:     Test exec time sec: 64.203

(3) TestDFSIO read

The following example reads the same ten 1000 MB files back from HDFS (only the setup portion of the log was captured):

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
15/01/13 15:42:35 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:42:35 INFO fs.TestDFSIO: nrFiles = 10
15/01/13 15:42:35 INFO fs.TestDFSIO: nrBytes (MB) = 1000.0
15/01/13 15:42:35 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:42:35 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:42:36 INFO fs.TestDFSIO: creating control file: 1048576000 bytes, 10 files
15/01/13 15:42:37 INFO fs.TestDFSIO: created control files for: 10 files

(4) Cleaning up the test data

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -clean
15/01/13 15:46:51 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:46:51 INFO fs.TestDFSIO: nrFiles = 1
15/01/13 15:46:51 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
15/01/13 15:46:51 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:46:51 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:46:52 INFO fs.TestDFSIO: Cleaning up test files
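Besides the console summary, TestDFSIO also records its results in a local file, TestDFSIO_results.log, in the directory the job was submitted from (this file shows up in the summary section at the end of this article). A minimal sketch for pulling the headline numbers back out of that log, assuming the default result file name was not overridden with -resFile:

# extract the headline metrics from the accumulated TestDFSIO result log
$ grep -E "Throughput|Average IO rate|exec time" TestDFSIO_results.log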
(5) nnbench

nnbench stress-tests the NameNode: it issues a large number of HDFS-related requests to put the NameNode under heavy load, and can simulate creating, reading, renaming, and deleting files on HDFS. The following example uses 12 mappers and 6 reducers to create 1000 files:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
NameNode Benchmark 0.4
15/01/13 15:53:33 INFO hdfs.NNBench: Test Inputs:
15/01/13 15:53:33 INFO hdfs.NNBench:           Test Operation: create_write
15/01/13 15:53:33 INFO hdfs.NNBench:               Start time: 2015-01-13 15:55:33,585
15/01/13 15:53:33 INFO hdfs.NNBench:           Number of maps: 12
15/01/13 15:53:33 INFO hdfs.NNBench:        Number of reduces: 6
15/01/13 15:53:33 INFO hdfs.NNBench:               Block Size: 1
15/01/13 15:53:33 INFO hdfs.NNBench:           Bytes to write: 0
15/01/13 15:53:33 INFO hdfs.NNBench:       Bytes per checksum: 1
15/01/13 15:53:33 INFO hdfs.NNBench:          Number of files: 1000
15/01/13 15:53:33 INFO hdfs.NNBench:       Replication factor: 3
15/01/13 15:53:33 INFO hdfs.NNBench:                 Base dir: /benchmarks/NNBench-server01
15/01/13 15:53:33 INFO hdfs.NNBench:     Read file after open: true
15/01/13 15:53:34 INFO hdfs.NNBench: Deleting data directory
15/01/13 15:53:34 INFO hdfs.NNBench: Creating 12 control files
15/01/13 15:56:06 INFO hdfs.NNBench: -------------- NNBench -------------- :
15/01/13 15:56:06 INFO hdfs.NNBench:                               Version: NameNode Benchmark 0.4
15/01/13 15:56:06 INFO hdfs.NNBench:                           Date & time: 2015-01-13 15:56:06,539
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench:                        Test Operation: create_write
15/01/13 15:56:06 INFO hdfs.NNBench:                            Start time: 2015-01-13 15:55:33,585
15/01/13 15:56:06 INFO hdfs.NNBench:                           Maps to run: 12
15/01/13 15:56:06 INFO hdfs.NNBench:                        Reduces to run: 6
15/01/13 15:56:06 INFO hdfs.NNBench:                    Block Size (bytes): 1
15/01/13 15:56:06 INFO hdfs.NNBench:                        Bytes to write: 0
15/01/13 15:56:06 INFO hdfs.NNBench:                    Bytes per checksum: 1
15/01/13 15:56:06 INFO hdfs.NNBench:                       Number of files: 1000
15/01/13 15:56:06 INFO hdfs.NNBench:                    Replication factor: 3
15/01/13 15:56:06 INFO hdfs.NNBench:            Successful file operations: 0
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench:        # maps that missed the barrier: 0
15/01/13 15:56:06 INFO hdfs.NNBench:                          # exceptions: 0
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench:               TPS: Create/Write/Close: 0
15/01/13 15:56:06 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 0.0
15/01/13 15:56:06 INFO hdfs.NNBench:            Avg Lat (ms): Create/Write: NaN
15/01/13 15:56:06 INFO hdfs.NNBench:                   Avg Lat (ms): Close: NaN
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench:                 RAW DATA: AL Total #1: 0
15/01/13 15:56:06 INFO hdfs.NNBench:                 RAW DATA: AL Total #2: 0
15/01/13 15:56:06 INFO hdfs.NNBench:              RAW DATA: TPS Total (ms): 0
15/01/13 15:56:06 INFO hdfs.NNBench:       RAW DATA: Longest Map Time (ms): 0.0
15/01/13 15:56:06 INFO hdfs.NNBench:                   RAW DATA: Late maps: 0
15/01/13 15:56:06 INFO hdfs.NNBench:             RAW DATA: # of exceptions: 0

Note that this particular run reports "Successful file operations: 0", which is why the TPS and average-latency figures come out as 0 and NaN; on a healthy run these counters should be non-zero.
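create_write is only one of nnbench's operations; open_read, rename, and delete are also supported and are typically run against the files left behind by a create_write pass. A sketch of a read-oriented run reusing the flags from the example above (exact supported flags may vary by version):

# stress the NameNode with open/read requests against the files created above
$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench \
    -operation open_read -maps 12 -reduces 6 -numberOfFiles 1000 \
    -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`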
(6) mrbench

mrbench runs one small job repeatedly, to check whether small jobs run repeatably and efficiently on the cluster. Its usage:

MRBenchmark.1.7
Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>] [-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>] [-numRuns <number of times to run the job, default is 1>] [-maps <number of maps for each run, default is 2>] [-reduces <number of reduces for each run, default is 1>] [-inputLines <number of input lines to generate, default is 1>] [-inputType <type of input to generate, one of ascending (default), descending, random>] [-verbose]

The following example runs a small job 50 times:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 50
MRBenchmark.0.0.2
15/01/13 16:17:19 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
15/01/13 16:17:20 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_331064064.txt
15/01/13 16:17:20 INFO mapred.MRBench: Running job 0: input=hdfs://server01:8020/benchmarks/MRBench/mr_input output=hdfs://server01:8020/benchmarks/MRBench/mr_output/output_556018847

DataLines    Maps    Reduces    AvgTime (milliseconds)
1            2       1          26748

The result shows an average job completion time of 26748 ms, i.e. roughly 26.7 seconds.

(7) Hadoop Examples

Besides the tests above, Hadoop also ships example programs such as WordCount and TeraSort, packaged in hadoop-examples*.jar:

[hsu@server01 ~]$ ls /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples*
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar

Running the jar with no arguments lists all of the example programs:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
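Any of these examples doubles as a quick smoke test for the cluster. A minimal sketch using the pi estimator, with deliberately tiny parameters (10 maps, 100 samples per map) chosen only to confirm that jobs schedule and complete, not to get an accurate value of Pi:

# run the quasi-Monte Carlo Pi estimator: pi <number of maps> <samples per map>
$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 100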
(8) TeraSort

A complete TeraSort benchmark consists of three steps:

1. Generate random input data with TeraGen.
2. Run TeraSort on the input data.
3. Verify the sorted output with TeraValidate.

The input data does not have to be regenerated for every test; once it has been generated, step 1 can be skipped on subsequent runs.

TeraGen usage:

$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>

The following command runs TeraGen to produce 10 GB of input data (100,000,000 rows of 100 bytes) in the directory /examples/terasort-input:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 100000000 /examples/terasort-input
15/01/13 16:57:34 INFO client.RMProxy: Connecting to ResourceManager at server01/135.33.5.53:8032
15/01/13 16:57:35 INFO terasort.TeraSort: Generating 100000000 using 2
15/01/13 16:57:35 INFO mapreduce.JobSubmitter: number of splits:2
15/01/13 16:59:07 INFO mapreduce.Job: Job job_1420542591388_0105 completed successfully
15/01/13 16:59:08 INFO mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=211922
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=170
                HDFS: Number of bytes written=10000000000
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=2
                Other local map tasks=2
                Total time spent by all maps in occupied slots (ms)=150416
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=150416
                Total vcore-seconds taken by all map tasks=150416
                Total megabyte-seconds taken by all map tasks=154025984
        Map-Reduce Framework
                Map input records=100000000
                Map output records=100000000
                Input split bytes=170
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=1230
                CPU time spent (ms)=175090
                Physical memory (bytes) snapshot=504807424
                Virtual memory (bytes) snapshot=3230924800
                Total committed heap usage (bytes)=1363148800
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=214760662691937609
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=10000000000

Each row produced by TeraGen has the following layout:

<10 bytes key><10 bytes rowid><78 bytes filler>\r\n

where:

1. key is a sequence of random characters, each with an ASCII code in the range [32, 126];
2. rowid is a right-justified integer;
3. filler consists of 7 groups of 10 characters each (the last group has 8), with characters cycling from 'A' to 'Z'.
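To eyeball a couple of the generated rows (each exactly 100 bytes), the first 200 bytes of one part file can be dumped. A sketch, assuming the output directory from the teragen run above; the part file name (part-00000 here) is an assumption and varies across Hadoop versions:

# dump the first two 100-byte rows of the TeraGen output
$ hadoop fs -cat /examples/terasort-input/part-00000 | head -c 200 | od -c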
The following command runs TeraSort on that data and writes the sorted result to /examples/terasort-output:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort /examples/terasort-input /examples/terasort-output
15/01/13 17:08:08 INFO terasort.TeraSort: starting
15/01/13 17:08:10 INFO input.FileInputFormat: Total input paths to process : 2
Spent 187ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
Computing input splits took 192ms
Sampling 10 splits of 76
Making 144 from 100000 sampled records
Computing parititions took 596ms
Spent 791ms computing partitions.
15/01/13 17:09:13 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=4461968618
                FILE: Number of bytes written=8889668662
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=10000010260
                HDFS: Number of bytes written=10000000000
                HDFS: Number of read operations=660
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=288
        Job Counters
                Launched map tasks=76
                Launched reduce tasks=144
                Data-local map tasks=75
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=933160
                Total time spent by all reduces in occupied slots (ms)=1227475
                Total time spent by all map tasks (ms)=933160
                Total time spent by all reduce tasks (ms)=1227475
                Total vcore-seconds taken by all map tasks=933160
                Total vcore-seconds taken by all reduce tasks=1227475
                Total megabyte-seconds taken by all map tasks=955555840
                Total megabyte-seconds taken by all reduce tasks=1256934400
        Map-Reduce Framework
                Map input records=100000000
                Map output records=100000000
                Map output bytes=10200000000
                Map output materialized bytes=4403942936
                Input split bytes=10260
                Combine input records=0
                Combine output records=0
                Reduce input groups=100000000
                Reduce shuffle bytes=4403942936
                Reduce input records=100000000
                Reduce output records=100000000
                Spilled Records=200000000
                Shuffled Maps =10944
                Failed Shuffles=0
                Merged Map outputs=10944
                GC time elapsed (ms)=45169
                CPU time spent (ms)=2021010
                Physical memory (bytes) snapshot=95792517120
                Virtual memory (bytes) snapshot=357225058304
                Total committed heap usage (bytes)=174283816960
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=10000000000
        File Output Format Counters
                Bytes Written=10000000000
15/01/13 17:09:13 INFO terasort.TeraSort: done

Going by the log timestamps, sorting the full 10 GB took roughly 65 seconds (17:08:08 to 17:09:13).
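Before validating, a cheap sanity check is that the sorted output occupies exactly as many bytes as the input (10,000,000,000 here, matching the Bytes Written counters above). A sketch:

# compare the total size of the input and output directories
$ hadoop fs -du -s /examples/terasort-input /examples/terasort-output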
(9) TeraValidate

The following command runs TeraValidate to check that the TeraSort output really is sorted; if it detects a problem, the out-of-order keys are written to the output directory /examples/terasort-validate:

[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teravalidate /examples/terasort-output /examples/terasort-validate
15/01/13 17:17:37 INFO client.RMProxy: Connecting to ResourceManager at server01/135.33.5.53:8032
15/01/13 17:17:38 INFO input.FileInputFormat: Total input paths to process : 144
Spent 93ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
15/01/13 17:17:38 INFO mapreduce.JobSubmitter: number of splits:144
15/01/13 17:17:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1420542591388_0107
15/01/13 17:17:38 INFO impl.YarnClientImpl: Submitted application application_1420542591388_0107
15/01/13 17:18:12 INFO mapreduce.Job: Job job_1420542591388_0107 completed successfully
15/01/13 17:18:12 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=6963
                FILE: Number of bytes written=15445453
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=10000019584
                HDFS: Number of bytes written=25
                HDFS: Number of read operations=435
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=144
                Launched reduce tasks=1
                Data-local map tasks=142
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=685624
                Total time spent by all reduces in occupied slots (ms)=3384
                Total time spent by all map tasks (ms)=685624
                Total time spent by all reduce tasks (ms)=3384
                Total vcore-seconds taken by all map tasks=685624
                Total vcore-seconds taken by all reduce tasks=3384
                Total megabyte-seconds taken by all map tasks=702078976
                Total megabyte-seconds taken by all reduce tasks=3465216
        Map-Reduce Framework
                Map input records=100000000
                Map output records=432
                Map output bytes=11664
                Map output materialized bytes=13830
                Input split bytes=19584
                Combine input records=0
                Combine output records=0
                Reduce input groups=289
                Reduce shuffle bytes=13830
                Reduce input records=432
                Reduce output records=1
                Spilled Records=864
                Shuffled Maps =144
                Failed Shuffles=0
                Merged Map outputs=144
                GC time elapsed (ms)=4014
                CPU time spent (ms)=334280
                Physical memory (bytes) snapshot=85470654464
                Virtual memory (bytes) snapshot=234019295232
                Total committed heap usage (bytes)=114868879360
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=10000000000
        File Output Format Counters
                Bytes Written=25

[hsu@server01 ~]$ hadoop fs -cat /examples/terasort-validate/*
checksum        2fafbaf537afd49

Conclusion: the validation passed (the only output is the checksum line; no out-of-order keys were reported).

(10) Summary

Two result files are generated in the directory from which the jobs were submitted:

[hsu@server01 ~]$ LANG=en
[hsu@server01 ~]$ ll
total 16
-rw-r--r-- 1 root root 1142 Jan 13 15:56 NNBench_results.log
-rw-r--r-- 1 root root  903 Jan 13 15:43 TestDFSIO_results.log

Sorting roughly 176,838,144 rows of data took almost exactly one minute. A sample of the data:

0000000: 00 00 00 a7 0d 2a a8 02 da da 00 11 30 30 30 30  .....*......0000
0000010: 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30  0000000000000000
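The sample above is xxd-style output; something along the following lines would reproduce it against the sorted data (the part file name is an assumption, and which file the original sample came from is not stated):

# hex-dump the first 32 bytes of one output file
$ hadoop fs -cat /examples/terasort-output/part-00000 | head -c 32 | xxd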

The same cluster was also used for a quick Hive vs. Impala comparison (on a 39.8 GB dataset):

Action      Data size (GB)   Hive time (s)   Impala time (s)   Hive result   Impala result
Count(*)    39.8             386.804         192.75            Passed        Memory warning threshold hit
join (x2)   39.8 * 2         413.651         525.48            Passed        Memory warning threshold hit
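The original does not show the queries behind these timings; a minimal sketch of how such a comparison can be timed from the shell, with a hypothetical table name t standing in for the real table:

# wall-clock the same aggregation in both engines (t is a placeholder table)
$ time hive -e "SELECT COUNT(*) FROM t"
$ time impala-shell -q "SELECT COUNT(*) FROM t"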

Conclusions:

1. For large data volumes Impala holds no real advantage, and impalad nodes may even crash under load; Impala is very memory-hungry, and so is the Parquet format.
2. Hive incurs heavy IO during execution, but jobs that Impala cannot finish can often still be run to completion under Hive.
3. Impala's SQL coverage, and its support for some of Hive's analytic functions and special data formats, still has to wait for newer releases.
