2.press “y” and then specify sgeadmin as the user id (sgeadmin)
3.leave the install dir as /BiO/gridengine (here /data/software was used instead)
4.You will now be asked about port configuration for the master, normally you would choose the default (2) which uses the /etc/services file
5.accept the sge_qmaster info
6.You will now be asked about port configuration for the execution daemon, normally you would choose the default (2) which uses the /etc/services file
7.accept the sge_execd info
8.leave the cell name as “default”
9.Enter an appropriate cluster name when requested (at the prompt "Enter new cluster name or hit <RETURN> to use default [p6444] >>" Enter was pressed; the output was "creating directory: /data/software/gridengine/default/common" and "Your $SGE_CLUSTER_NAME: p6444")
10.leave the spool dir as is (press Enter to accept the default)
11.press "n" for no windows hosts! (chose "n" here, not the default)
12.press “y” (permissions are set correctly)
13.press “y” for all hosts in one DNS domain
14.If you have Java available on your Qmaster and wish to use SGE Inspect or SDM then enable the JMX MBean server and provide the requested information - probably answer "n" at this point! (chose "n" here, otherwise it errors as well)
15.press enter to accept the directory creation notification
16.enter “classic” for classic spooling (berkeleydb may be more appropriate for large clusters)
17.press enter to accept the next notice
18.enter “20000-20100” as the GID range (increase this range if you have execution nodes capable of running more than 100 concurrent jobs)
19.accept the default spool dir or specify a different folder (for example if you wish to use a shared or local folder outside of SGE_ROOT)
20.enter an email address to which problem reports will be sent
21.press “n” to refuse to change the parameters you have just configured
Error encountered:
Command failed: ./utilbin/lx-amd64/spooldefaults
Command failed: configuration
Command failed: /tmp/configuration_2018-03-16_09:00:40.43362
Probably a permission problem. Please check file access permissions.
Check read/write permission. Check if SGE daemons are running.
After reinstalling, this error no longer appeared.
22.press enter to accept the next notice
23.press “y” to install the startup scripts
24.press enter twice to confirm the following messages
25.enter the names of your hosts who will be able to administer and submit jobs (enter alone to finish adding hosts): entered C01, then Enter; G02, then Enter; G03, then Enter (a string of garbled characters was entered here, which may cause problems later)
26.skip shadow hosts for now (press "n")
27.choose “1” for normal configuration and agree with “y”
28.press enter to accept the next message and "n" to refuse to see the previous screen again and then finally enter to exit the installer
You may verify your administrative hosts with the command:
# qconf -sh
and you may add new administrative hosts with the command:
# qconf -ah
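Once the installer exits, the new cell can be sanity-checked from the qmaster host. A sketch; the settings.sh path assumes the /data/software install dir and "default" cell used above, and the host name is an example.

```shell
# Load the SGE environment for this cell
. /data/software/gridengine/default/common/settings.sh

qconf -sh       # list administrative hosts
qconf -ah G03   # add a new administrative host
qconf -ss       # list submit hosts
qhost           # show the status of all execution hosts
```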
-sql show queue list Show a list of all currently defined cluster queues.
-sq queue_list show queues Displays one or multiple cluster queues or queue instances.
Modify:
-mq queuename modify queue configuration Retrieves the current configuration for the specified queue, executes an editor, and registers the new configuration with the sge_qmaster.
Delete:
# qconf -dq queue_name
Add:
-Aq fname add new queue Add the queue defined in fname to the cluster. The name of the queue created is specified inside the file, not derived from fname.
-aq queue_name add new queue. In this case qconf retrieves the default queue configuration (see queue_conf man page) and invokes an editor for customizing the queue configuration. Upon exit from the editor, the queue is registered with sge_qmaster. A minimal configuration requires only that the queue name and queue hostlist be set.
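Put together, a typical queue-management session with the commands above might look like this; the queue name test.q is an example.

```shell
qconf -sql          # list all cluster queues
qconf -aq test.q    # create test.q by editing the default template
qconf -sq test.q    # display its configuration
qconf -mq test.q    # modify it in an editor
qconf -dq test.q    # delete it
```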
SGE and NFS user management issues
1. SGE user management: SGE identifies the same user across nodes by user name. If a job submitted by user1 on node A is to be executed on node B, node B must also have a user named user1; when the job runs, node B automatically executes it as user1.
2. NFS user management: NFS identifies the same user by numeric user id. If a user with id 1000 on node A puts a file into an NFS-shared directory, then on every other host sharing that directory the file's owner is the user with id 1000.
This exposes a conflict: SGE matches users across nodes by name, while NFS matches them by id. Our system needs multiple nodes working together, so the following can happen: user1 on host A finishes part of a job and creates a directory on NFS, and user1 on host B needs to write files into it. If user1's id differs between A and B, NFS treats them as different users, so user1 on B has no write permission to the directory created by user1 on A, because on host B that directory belongs to whichever account has the same id as user1 on A. To avoid this, we require that accounts with the same name have the same id on every machine.
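A quick way to verify the same-UID requirement is to compare the numeric id of the account on every node. A sketch, assuming ssh access; the host names and user name are examples.

```shell
# All printed UIDs must match, otherwise NFS will treat
# user1 on these hosts as different users.
for host in C01 G02 G03; do
    echo -n "$host: "
    ssh "$host" id -u user1
done
```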
unable to send message to qmaster using port 6444 on host
[root@g02 test]# qsub -cwd uname.sge
error: commlib error: got selecterror (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "G02": got send error
Exiting.
The problem can be:
- the qmaster is not running
- the qmaster host is down
- an active firewall blocks your request
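Before suspecting the firewall, it is worth confirming that sge_qmaster is actually up and listening. A sketch of generic checks, to be run on the qmaster host:

```shell
ps -ef | grep '[s]ge_qmaster'             # is the daemon running?
netstat -tlnp | grep 6444                 # is anything listening on the qmaster port?
cat $SGE_ROOT/default/common/act_qmaster  # which host do clients think is the qmaster?
```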
# which qhost
/opt/sge/bin/lx-amd64/qhost
# vim /etc/profile (add the following lines)
export SGE_ROOT=/opt/sge
export PATH="${SGE_ROOT}/bin/lx-amd64:$PATH"
# source /etc/profile
Remember to modify the queue's hostlist.
Error shown in scheduling info
scheduling info: (-l FEP_GPGPU=16,gpus=1) cannot run in queue "G02" because it offers only hc:gpus=0.000000
scheduling info explains, based on the resources available, why the job is not running right now. The problem here is that the gpus resource has been used up. If you believe resources remain, you can change the value of this resource on the execution host and in the queue:
# qconf -me G02
# qconf -mq gpu.q
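For a scripted, non-interactive change, qconf -mattr can set a single attribute directly instead of opening an editor; the value 1 here is an example.

```shell
# Raise the gpus complex on execution host G02 without an editor session
qconf -mattr exechost complex_values gpus=1 G02
```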
When a job is run with the smp PE, this error appears: cannot run in PE "smp" because it only offers 0 slots
cannot run in queue "all.q" because it is not contained in its hard queue list (-q)
cannot run in queue "cpu.q" because it is not contained in its hard queue list (-q)
cannot run in PE "smp" because it only offers 16 slots
SGE messages and logs are usually very helpful:
$SGE_ROOT/default/spool/qmaster/messages
$SGE_ROOT/default/spool/qmaster/schedd/messages
Execd spool logs often hold job-specific error data. Remember that local spooling may be used (!):
$SGE_ROOT/default/spool/<node>/messages
SGE panic location: will log to /tmp on any node when $SGE_ROOT is not found or not writable.
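A quick way to pull recent errors out of those locations (a sketch, assuming the default cell and an example node name):

```shell
tail -n 50 $SGE_ROOT/default/spool/qmaster/messages
grep -iE 'error|critical' $SGE_ROOT/default/spool/qmaster/messages
# per-node execd messages when local spooling is in use
tail -n 50 $SGE_ROOT/default/spool/G02/messages
```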
2. Show the problem quickly: qsub -w v <other options> reports immediately why a job cannot run, for example:
qsub -w v -cwd -e error12 -q cpu.q simple.sh
To be told by email why a job failed:
qsub -m a -M user@host [rest of command]
3. Check a specific job:
qstat -j job_id
The error information in its output explains the failure.
Running bash scripts: bash scripts and Linux environment variables
#!/bin/bash
# SGE Options
#$ -S /bin/bash
#$ -N MyJob

# Create Working Directory
WDIR=/state/partition1/$USER/$JOB_NAME-$JOB_ID
mkdir -p $WDIR
if [ ! -d $WDIR ]
then
  echo $WDIR not created
  exit
fi
cd $WDIR

# Copy Data and Config Files
cp $HOME/Data/FrogProject/FrogFile .

# Put your Science related commands here
/share/apps/runsforever FrogFile

# Copy Results Back to Home Directory
RDIR=$HOME/FrogProject/Results/$JOB_NAME-$JOB_ID
mkdir -p $RDIR
cp NobelPrizeWinningResults $RDIR

# Cleanup
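Assuming the script above is saved as frogjob.sh (an example name), it would be submitted and watched like this; SGE fills in $JOB_NAME (MyJob, from the #$ -N option) and $JOB_ID at run time:

```shell
qsub frogjob.sh   # submit the job to the default queue
qstat -u $USER    # watch the job while it is pending or running
```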