Run a Job

Description

This page shows how to run a job on an EMR (Elastic MapReduce) cluster on AWS.

Notes

This command has two modes (Hadoop and regular):

  1. Hadoop mode: HPC-cloud converts the command and its input into a Hadoop streaming job, e.g. running a BLAST search on an EMR cluster
  2. Regular mode: the HPC-cloud client sends the command as-is along with the input files and calls the program; in this case the tool itself must handle Hadoop, e.g. Bowtie
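To make the difference concrete, here is a minimal sketch of how the two modes could translate the same command. The streaming-jar path, option names, and helper functions are illustrative assumptions, not the actual HPC-cloud implementation:

```python
# Hypothetical sketch of the two job modes. The streaming-jar path and
# wrapper names are assumptions for illustration only.

def build_regular(command):
    # Regular mode: the command is passed through unchanged; the tool
    # itself (e.g. bowtie) is responsible for any Hadoop interaction.
    return command

def build_hadoop_streaming(command, reducer="NONE",
                           output_dir="/home/hadoop/output/1"):
    # Hadoop mode: the original command becomes the mapper of a
    # Hadoop streaming job.
    return ("hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar "
            f"-mapper '{command}' -reducer {reducer} -output {output_dir}")

cmd = "./blastn -query query.formatted -db est_human"
print(build_regular(cmd))
print(build_hadoop_streaming(cmd))
```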

Warning

Multi-FASTA files have to be formatted before passing them as inputs. Please check Input Formatter.

Command-line Interface

>>> EHPC-EMR --run -d=DOMAIN --command=COMMAND OPTIONS
All parameters should be in the format parameter=value
--mode, -m                   mode of the job, either [Hadoop, regular], default regular
Note: in Hadoop mode, HPC-cloud converts the command and input into a Hadoop streaming job;
      in regular mode, HPC-cloud passes the job through as a classic job and the tool itself must handle Hadoop
--command                    original command to run
--input-files, -i            list of input files separated by commas
--output-files, -o           list of expected output files separated by commas
--domain, -d                 Domain of the main node
--cache-files, -cf           list of cache files to distribute to mappers and reducers, for Hadoop mode only
--cache-archives, -ca        list of cache archives to distribute to mappers and reducers, for Hadoop mode only
--files                      list of files to pack with the job
--reducer                    path of the reducer to execute, e.g. 'cat', default NONE
--output-dir                 path of the output dir for the mappers and reducers, default /home/hadoop/output/ID
--conf                       set of Hadoop configuration parameters separated by commas, for Hadoop mode only
--owner                      the owner of the job
        if owner is system, the command will execute on the command line and the client will wait until the job is done
        if owner is hadoop, the job will be submitted as a Hadoop job
        otherwise, the job will be submitted as a PBS Torque job
--no-fetch-output-files     Don't fetch output files
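Since every parameter must be passed in parameter=value form, a wrapper script can assemble the argument list mechanically. The following sketch is a hypothetical helper, not part of the EHPC-EMR distribution:

```python
# Hypothetical helper that builds an EHPC-EMR invocation in the
# required parameter=value form.

def build_args(domain, command, mode="regular", **options):
    args = ["./EHPC-EMR", "--run", f"-d={domain}", f"-m={mode}",
            f"--command={command}"]
    for name, value in options.items():
        # Python keyword names use underscores; the CLI uses dashes,
        # e.g. input_files=... becomes --input-files=...
        args.append(f"--{name.replace('_', '-')}={value}")
    return args

print(" ".join(build_args("ec2......com", "'hostname'",
                          input_files="s3://eg.nubios.us/emr/in.txt")))
```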

Example 1

Copy data from S3 to HDFS on domain ec2......com

./EHPC-EMR --run -d=ec2......com -m=regular --command='/home/hadoop/bin/hadoop distcp s3://eg.nubios.us/est_human.tar.gz hdfs:///home/hadoop/est_human.tar.gz'

Example 2

Run BLAST with the query file 'query.1000.formatted' against the est_human database copied in Example 1 on ec2......com

./EHPC-EMR --run --mode=hadoop -d=ec2......com -id=1 --command='./blastn -query /home/hadoop/query.1000.formatted -db est_human/est_human -out /home/hadoop/blast/out.blast' --input-files='s3://eg.nubios.us/emr/query.1000.formatted' --output-files='/home/hadoop/blast/out.blast' --output-dir='/home/hadoop/output/1' --reducer=NONE -ca=hdfs:///home/hadoop/dbs/est_human.tar.gz#est_human -cf=s3://eg.nubios.us/emr/blastn.64#blastn --conf=mapred.reduce.tasks=1,mapred.map.tasks=1
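In Example 2, the `#` in the -ca and -cf values follows Hadoop's distributed-cache convention: the part before `#` is the file's URI and the part after is the symlink name created in the task's working directory. A small sketch of that split (the helper function is illustrative, not part of EHPC-EMR):

```python
def split_cache_spec(spec):
    # Hadoop cache specs use "uri#linkname"; without a fragment, the
    # link name defaults to the file's base name.
    uri, sep, link = spec.partition("#")
    if not sep:
        link = uri.rsplit("/", 1)[-1]
    return uri, link

print(split_cache_spec("hdfs:///home/hadoop/dbs/est_human.tar.gz#est_human"))
# → ('hdfs:///home/hadoop/dbs/est_human.tar.gz', 'est_human')
```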
