Oregon State University
School of Electrical Engineering and Computer Science
EECS High Performance Computing Cluster

Introduction

The EECS cluster has pooled its resources with ME, COS and NERHP to provide a more robust high-performance computing environment for researchers across the College of Engineering. The combined systems lower average utilization and offer hardware options better matched to a range of different problems.

Deciding which hardware is best

Picking the right hardware is key. Some projects require large amounts of memory, others need fast processors, and some need a lot of disk space.

    Use one of the amd64 (me, cos, nerhp or amd64-low) queues if:
  • you need more than 4GB of memory
  • fast memory I/O is crucial
  • your application is designed for the AMD Opteron
  • 64-bit is important for double precision
    Use the em64t queue if:
  • your job spends more time on the CPU than waiting on memory
  • your application is designed for the Xeon processor
  • you compiled with Intel's compiler
  • 64-bit is important for double precision
    Use the i386 queue if:
  • you just need processors to solve a problem
  • you are not using MPI
  • you are running a large volume of serial jobs
  • you want to be courteous to the people who have special demands
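
The guidance above can be condensed into a rough rule of thumb. The sketch below is written for /bin/sh rather than csh, and the function name pick_queue and its two arguments are hypothetical; it reduces the lists to just two traits (memory need and whether the job is CPU-bound), so treat it as a starting point, not a replacement for the full lists.

```shell
#!/bin/sh
# pick_queue MEM_GB CPU_BOUND
# Hypothetical helper: prints a queue name from two rough job traits.
#   MEM_GB     whole number of GB of memory the job needs
#   CPU_BOUND  "yes" if the job spends more time on the CPU than on memory
pick_queue() {
    mem_gb=$1
    cpu_bound=$2
    if [ "$mem_gb" -gt 4 ]; then
        echo amd64      # more than 4GB of memory
    elif [ "$cpu_bound" = yes ]; then
        echo em64t      # CPU-bound work
    else
        echo i386       # everything else
    fi
}
```

For example, `pick_queue 8 no` prints amd64 and `pick_queue 1 no` prints i386.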

Setting up your environment

Before using the cluster it is critical to set up the proper environment. The best way to do this is to source the environment settings file that matches your version of MPICH:

  for mpich v1 (most users):
    host% source /nfs/queue/a1/sge_6.0/settings.csh

  for mpich v2:
    host% source /nfs/queue/a1/sge_6.0/settings-mpich2.csh
                  
To have this happen automatically each time you log in to one of the submit hosts, add the appropriate line above to your .cshrc (for csh).

How the cluster works

 

The combined HPC environment is configured with three front-end submit nodes and a growing suite of execution hosts. The submit hosts have the host names below. From off campus you must connect to an access server first.
  64-bit jobs:
   - submit-amd64-01.hpc.engr.oregonstate.edu (aka godel.me,submit64.eecs)

  32-bit jobs:
   - submit-i386-01.hpc.engr.oregonstate.edu (still bee00)
   - submit-i386-02.hpc.engr.oregonstate.edu (submit32.eecs)
                  

Pick the right submit host! If you are planning to run double-precision 64-bit models you must compile and submit the job from submit-amd64-01. If you are running and submitting 32-bit code, please use submit-i386-01 or submit-i386-02. The execution hosts are named based on their hardware:

 

  64-bit AMD Opteron AMD64: exec-amd64-XX.hpc.engr.oregonstate.edu
  64-bit Xeon EM64T:        exec-em64t-XX.hpc.engr.oregonstate.edu
  32-bit i386 Xeon Systems: exec-i386-XXX.hpc.engr.oregonstate.edu
                

Jobs are compiled, debugged, submitted and monitored from the submit hosts. Users should never connect directly to the exec hosts. The scheduler chooses the exec hosts automatically, and the job is distributed to them according to specially formatted submission scripts. Submission scripts should be used for all jobs. Examples are offered below.

Notifications can be defined to alert the owner when a job starts, when it finishes, or when errors occur. The output from the run is saved into the specified output files, and the job's status can be observed from that output or by asking the scheduler what the job is doing.
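
As a sketch of how such notifications might be requested, the fragment below could be added to a submission script. It assumes the scheduler is Sun Grid Engine (as the sge_6.0 path in the settings files suggests) and uses its -m and -M options; the mail address is a placeholder, not a real account.

```shell
# mail the job owner at the (b)eginning, at the (e)nd,
# and if the job is (a)borted
#$ -m bea

# address to notify (placeholder, replace with your own)
#$ -M your_username@example.edu
```

Both lines are comments to the shell, so they are safe to include in a csh or sh submission script.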

General notes on submission scripts

Submission scripts are required for all jobs. They are pre-processed by the scheduler, and every line that starts with "#$" becomes an argument to the submission application. Lines beginning with "#" alone are shell comments. Environment parameters can be set in the submission script and will be passed to the environment of the exec host if specified. The values in the example scripts are recommendations for the type of job being run. For example, ls-dyna and star-cd are better suited to the amd64 queue, whereas single non-MPI jobs may perform better on em64t. High-volume jobs should run on the volumejob queue.
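
To see which lines of a script the scheduler would treat as arguments, it can help to list only the "#$" lines. The helper below is a small sketch written for /bin/sh rather than csh; the name list_directives is hypothetical, and it only matches lines that begin with "#$" in the first column.

```shell
#!/bin/sh
# list_directives FILE
# Print every line of FILE that begins with "#$", with the "#$" prefix
# and any following spaces removed. Plain "#" comments are skipped.
list_directives() {
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}
```

Running `list_directives my_submission_script` on a script that contains `#$ -N example_job` and `#$ -cwd` prints `-N example_job` and `-cwd` on separate lines.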

Simple submission script for serial jobs (non-MPI)

 

  #!/bin/csh

  # Give the job a name
  #$ -N example_job

  # set working directory on all hosts to
  # directory where the job was started
  #$ -cwd

  # send all process STDOUT (fd 1) to this file
  #$ -o job_output.txt

  # send all process STDERR (fd 2) to this file
  #$ -e job_output.err

  # specify the hardware platform to run the job on.
  # options are: amd64, em64t, i386, volumejob (use i386 if you don't care)
  #$ -q i386

  # Commands
  ./my_serial_script.sh
                

Simple MPI submission script

 

  #!/bin/csh

  # Give the job a name
  #$ -N example_job

  # set working directory on all hosts to
  # directory where the job was started
  #$ -cwd

  # send output to job.log (STDOUT + STDERR)
  #$ -o job.log
  #$ -j y

  # specify the mpich parallel environment and request 4
  # processors from the available hosts
  #$ -pe mpich 4

  # specify the hardware platform to run the job on.
  # options are: amd64, em64t, i386, volumejob (use i386 if you don't care)
  #$ -q i386

  # command to run.  ONLY CHANGE THE NAME OF YOUR MPI APPLICATION  
  mpirun -nolocal -np $NSLOTS -machinefile $TMPDIR/machines ./my_MPI_application
                

Submission script for Star-CD

 

  #!/bin/csh

  # name job
  #$ -N starsolv

  # set current working directory
  #$ -cwd

  # send output to job.log (STDOUT + STDERR)
  #$ -o job.log
  #$ -j y

  # specify the hardware platform to run the job on.
  # options are: amd64, em64t, i386, volumejob (use i386 if you don't care)
  #$ -q amd64

  # number of slots requested from env.
  #$ -pe mpich 4

  # Commands
  star `cat $TMPDIR/machines | xargs`
                

Submission script for ls-dyna

 

  #!/bin/csh

  # set up memory needs
  setenv MEM 300000000
  setenv MEM2 `expr $MEM / $NSLOTS`

  # name job
  #$ -N ls-dyna

  # set current working directory
  #$ -cwd

  # send output to job.log (STDOUT + STDERR)
  #$ -o job.log
  #$ -j y

  # specify the hardware platform to run the job on.
  # options are: amd64, em64t, i386, volumejob (use i386 if you don't care)
  #$ -q amd64

  # number of slots requested from env.
  #$ -pe mpich 4

  # Commands
  mpirun -nolocal -np $NSLOTS -machinefile $TMPDIR/machines mpp-ls970 \
  i=./modelfile.k memory=$MEM memory2=$MEM2 ncpus=$NSLOTS
                

Understanding shared disk

Most applications assume that the job will run in a folder that is shared with the submit host, so the files necessary for the job don't need to be distributed to the other nodes. Any NFS volume, including your home directory (EECS or ENGR), can be used. /scratch/share can also be utilized. Other applications copy the required components to a remote host for the duration of the run. If this is the case, use /scratch for best I/O performance.
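
For a job that does not stage its own files, the same copy-run-copy-back pattern can be sketched by hand. The helper below is written for /bin/sh rather than csh; the name stage_and_run is hypothetical, and it assumes the scratch directory you pass is writable and that the scheduler sets JOB_ID (it falls back to the shell's process ID otherwise).

```shell
#!/bin/sh
# stage_and_run SCRATCH_ROOT INPUT_FILE COMMAND [ARGS...]
# Hypothetical helper: copy INPUT_FILE into a private directory under
# SCRATCH_ROOT, run COMMAND there, copy that directory's files back to
# the starting directory, then remove the private directory.
stage_and_run() {
    scratch_root=$1
    input=$2
    shift 2
    start_dir=$(pwd)
    # JOB_ID is set by the scheduler; use the shell PID when it is absent
    work="$scratch_root/${USER:-user}.${JOB_ID:-$$}"
    mkdir -p "$work" || return 1
    cp "$input" "$work/" || return 1
    ( cd "$work" && "$@" )
    status=$?
    cp "$work"/* "$start_dir"/
    rm -rf "$work"
    return "$status"
}
```

A call might look like `stage_and_run /scratch modelfile.k ./my_serial_script.sh`, where the input and script names are only examples.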

Understanding file locking

Serial jobs that all access the same files generally require special attention to file locking over NFS. It is advised to have each job produce its own copy of a file. Waiting on NFS file locks can be devastating to volume jobs and to system performance.
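
One way to give each job its own file is to build the file name from the job number. The sketch below is written for /bin/sh rather than csh; the name unique_output is hypothetical, and it assumes the scheduler exports JOB_ID to the job's environment (it falls back to the shell's process ID when JOB_ID is not set).

```shell
#!/bin/sh
# unique_output BASE EXT
# Print a file name of the form BASE.<job number>.EXT so that jobs
# running at the same time never write to the same file.
unique_output() {
    base=$1
    ext=$2
    echo "${base}.${JOB_ID:-$$}.${ext}"
}
```

Inside a job whose JOB_ID is 4242, `unique_output results dat` prints results.4242.dat.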

Submitting a job

In order to submit a job you must have the environment correctly defined for the version of MPICH you are using and any environment parameters required by your application. For STAR-CD, LS-DYNA and others this may include license servers or environment roots.

The qsub command is used to submit the job. If the submission script is correctly defined, submission is as simple as:

 

  host% qsub my_submission_script
                

When the job is submitted, a job number is returned. Check your STDOUT and STDERR files to see if the job is running as expected. You can also check the status of your job with the qstat command. If you need to stop your job, use the qdel command to remove the job from the queue.
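
If you want to keep the job number for later qstat or qdel calls, it can be pulled out of qsub's confirmation text. The sketch below is written for /bin/sh rather than csh; the name job_id_from_message is hypothetical, and it assumes the confirmation line has the form `Your job 1234 ("example_job") has been submitted`, which may differ between scheduler versions.

```shell
#!/bin/sh
# job_id_from_message
# Read qsub's confirmation text on standard input and print the number
# that follows the words "Your job" (or "your job") on the first
# matching line. Prints nothing if no line matches.
job_id_from_message() {
    sed -n 's/^[Yy]our job \([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

A typical use would be `qsub my_submission_script | job_id_from_message`.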

Using qstat

qstat is used to check the status of a running or queued job. qstat with no arguments lists your jobs and the state of each one. qstat -f shows the status of all nodes and queued processes. Nodes with an E state encountered a fatal error and will require administrative help. Nodes with a d state are dead: usually they are waiting on a node that failed, but sometimes they have been disabled by the administrator.
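
Two further qstat invocations are often useful. The wrappers below are a sketch written for /bin/sh rather than csh; the function names are hypothetical, and they assume the Sun Grid Engine qstat options -u (filter by user) and -j (details for one job). From csh you can type the underlying qstat commands directly.

```shell
#!/bin/sh
# my_jobs: list only the jobs that belong to the current user
my_jobs() {
    qstat -u "$USER"
}

# job_details JOB_ID: show the scheduler's detailed record for one job
job_details() {
    qstat -j "$1"
}
```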

How to delete a job

qdel is used to remove a job from the queue. The job ID, which can be seen with qstat, is supplied as an argument to qdel. If the job is in a dr state, the -f flag must be used to force the job to stop.
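
The two forms of qdel described above can be wrapped in one small helper. This is a sketch written for /bin/sh rather than csh, and the name stop_job is hypothetical; from csh, the equivalent commands are simply qdel followed by the job ID, or qdel -f followed by the job ID.

```shell
#!/bin/sh
# stop_job JOB_ID [force]
# Remove a job from the queue. Pass the word "force" as the second
# argument for a job stuck in the dr state, which adds the -f flag.
stop_job() {
    if [ "${2:-}" = force ]; then
        qdel -f "$1"
    else
        qdel "$1"
    fi
}
```

For example, `stop_job 1234` runs a normal delete and `stop_job 1234 force` runs the forced one.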

 


School of Electrical Engineering and Computer Science, 1148 Kelley Engineering Center
Oregon State University, Corvallis, OR 97331-5501
This page was last modified on Monday, April 02, 2007
Copyright © 2009