# -*- mode: Fundamental; fill-column: 89 -*-
-------------------------------------------------------------------------------------------------
LosF: A Linux operating system Framework for managing HPC clusters
-------------------------------------------------------------------------------------------------

This file outlines the steps for installing LosF and associated dependencies. Certain
sections or steps are optional and are identified as such. In particular, LosF is designed
to optionally coordinate with Cobbler (http://www.cobblerd.org) or Warewulf
(http://warewulf.lbl.gov/trac) for bare-metal provisioning. If you want to use LosF to
manage the Cobbler/Warewulf provisioning configuration, you will need to identify at least
one master host on your cluster to serve as the provisioning server. If your master server
has external network access, you can download the necessary packages to set up Cobbler from
the EPEL repository (https://fedoraproject.org/wiki/EPEL). You can download Warewulf builds
from the OpenHPC project (https://github.com/openhpc/ohpc).

A quickstart method for setting up LosF (and, optionally, Cobbler) on a new cluster that
should work with CentOS distributions is outlined in the following sections.

-------------------------------------------------------------------------------------------------
1. Install a master server from a baseline OS distribution (e.g. CentOS).

   (a) Boot from installation media and install as desired.

       Note: if you are planning to use Cobbler's functionality to mirror repositories, be
       aware that the default storage location is /var/www. Consequently, consider providing
       ample space to your /var partition (e.g. 40G or more) during the base install
       configuration procedure.

   (b) Configure basic networking. At a minimum, you will likely want to configure two
       network interfaces: (1) one interface for internal cluster access (e.g. eth0 on a
       private subnet) and (2) one interface for external access to the outside world
       (e.g. eth1 into your existing WAN).

   (c) Add EPEL repo access to your master server. See http://fedoraproject.org/wiki/EPEL/FAQ
       for more information. An example for CentOS6 is below:

       $ rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

   (d) [ Optional ]: Install baseline Cobbler software and its associated dependencies
       (available from the EPEL repo above). Cobbler can be used for bare-metal provisioning,
       but LosF is used for configuration management after a base install, so other
       provisioning mechanisms can be used as desired. Assuming your master server has
       external network access, the necessary packages can be installed as follows:

       $ yum install cobbler
       $ yum install pykickstart

   (e) Install LosF OS software dependencies. LosF requires a yum download plugin and two
       specific perl modules. These are also available from the EPEL repo:

       $ yum install yum-plugin-downloadonly
       $ yum install perl-Log-Log4perl
       $ yum install perl-Config-IniFiles
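       As a quick sanity check (a minimal example; the module names below correspond to the
       two perl RPMs just installed), you can confirm that both modules are visible to the
       local perl install:

       $ perl -MLog::Log4perl -MConfig::IniFiles -e 'print "LosF perl dependencies found\n"'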
-------------------------------------------------------------------------------------------------
2. [ Optional ] Perform a basic configuration for Cobbler.

   Before you can use Cobbler, you need to define a basic working configuration. See
   http://www.cobblerd.org and other online resources for more information; a starting point
   for a basic configuration on your master cluster server is outlined as follows:

   (a) Edit the /etc/cobbler/settings file to define the IPs for your master server.
       Relevant variables to update are:

          next_server
          server
          default_password_crypted

       The default_password_crypted variable is where you define the root password that will
       be used during the provisioning process. You need to change this from the default
       password to a strong password of your own choosing. You can define the crypted hash
       via:

       $ openssl passwd -1 -salt 'random-phrase-here' 'your-password-here'

   (b) Start cobbler.

       $ /etc/init.d/cobblerd start

   (c) Check for other cobbler pre-requisites. You can run "cobbler check", which will help
       point you to any other remaining requirements (like having http enabled, allowing
       rsync through xinetd, etc). Note that you can safely ignore warnings about a missing
       debmirror, fencing tools, or boot-loaders.

   (d) Import an OS distribution (e.g. from the media used to install the initial server or
       from downloaded ISO images). Assuming CentOS ISO images are available locally on your
       master host, you can mount and import as follows:

       $ export DISTRO=CentOS6.5
       $ mkdir /mnt/centOS6.5-dvd1
       $ mount -o loop CentOS-6.5-x86_64-bin-DVD1.iso /mnt/centOS6.5-dvd1
       $ cobbler import --path=/mnt/centOS6.5-dvd1 --name=${DISTRO} --arch=x86_64

       Tip: in order to support the possible provisioning of all available CentOS packages,
       consider importing the 2nd DVD ISO image as well. This requires adding packages to the
       distro profile created from the DVD1 import and updating the repository metadata. An
       example of this process is below:

       $ mkdir /mnt/centOS6.5-dvd2
       $ mount -o loop CentOS-6.5-x86_64-bin-DVD2.iso /mnt/centOS6.5-dvd2
       $ rsync -a '/mnt/centOS6.5-dvd2/' /var/www/cobbler/ks_mirror/CentOS6.5-x86_64/ \
             --exclude-from=/etc/cobbler/rsync.exclude
       $ export COMPSXML=`ls /var/www/cobbler/ks_mirror/${DISTRO}-x86_64/repodata/*comps*.xml`
       $ createrepo -c cache -s sha --groupfile ${COMPSXML} /var/www/cobbler/ks_mirror/${DISTRO}-x86_64

       # With the distro defined, it is a good time to synchronize cobbler.

       $ cobbler sync

   (e) Import an EPEL distribution (this will mirror the repo for use locally throughout your
       cluster). An example for CentOS6 is below:

       $ cobbler repo add --mirror=http://dl.fedoraproject.org/pub/epel/6/x86_64 \
             --name=epel6 --arch=x86_64

       Tip: if your master server is behind a proxy, you can augment the repo mirroring
       environment to include an http proxy. For example:

       $ cobbler repo edit --name=epel6 --environment="http_proxy=http://proxy.yourcompany.com:1111"

       To access this repo, you will want to associate it with the provisioning OS distro you
       defined. An example for the CentOS6.5 profile above is:

       $ cobbler profile edit --name=CentOS6.5-x86_64 --repos=epel6

   (f) Mirror the newly defined EPEL repository (note: this will take some time).

       $ cobbler reposync

-------------------------------------------------------------------------------------------------
3. Install and perform baseline configuration for LosF.

   (a) Untar a release tarball (or clone the desired version from GitHub directly),
       preferably into a shared file system that will be available across the cluster.

       Suggestion: Assuming you have additional drives available in your master host, a
       reasonable place to install is into an admin directory or partition hosted by your
       chosen master host.

       Note: you can also build a self-contained losf RPM for local use from the distribution
       tarball as follows:

       $ rpmbuild -tb losf-<version>.tar.gz

   (b) Verify basic networking requirements. LosF relies on the output from standard Linux
       commands like "hostname" and "dnsdomainname" in order to delineate between different
       node types and clusters.
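       For example, a quick check on the master host might look like the following (the
       hostname and domain shown are placeholders for your own site's values):

       $ hostname
       master
       $ dnsdomainname
       cluster.example.org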
       Make sure these commands return reasonable (non-empty) values and update your local
       network config if necessary.

   (c) Define the LosF config path and initialize for the current cluster. Note that you can
       manage multiple clusters with a single LosF install, but you need to first choose a
       top-level config directory ($config_dir) and designate a unique identifier for the
       local cluster. There are two options for defining the local config directory:

          (1) set in the "config/config_dir" file within the local LosF install directory
          (2) set via the "LOSF_CONFIG_DIR" environment variable

       For production usage, it is recommended to use option (1) to set the path directly in
       the "config/config_dir" file, a simple ascii file. Alternatively, the second
       environment variable option provides a convenient way to override the default config
       path to test alternate config settings prior to a production rollout.

       To set the $config_dir via option (1), consider the following depending on whether
       this is a new install or an upgrade.

       (i)  New installs: A template file is provided that you can copy and edit to identify
            a preferred path (here, <losf-install-dir> refers to the top-level directory of
            your LosF installation):

            $ cd <losf-install-dir>/config
            $ cp config_dir.template config_dir    [then edit to suit]

       (ii) Upgrades: if you have been using a previous version of LosF and are upgrading to
            a newer release, you can simply copy your previous config/config_dir file into
            the latest install path. Alternatively, you can run the
            "misc/config_latest_install" utility, which will search for the most recently
            installed config_dir setting and update the latest install accordingly.

       Note that for a first-time install, you will likely not have a pre-existing config_dir
       for LosF; in that case, the path you choose will be created during the following
       initialization step.

   (d) Create necessary baseline configuration files for the current cluster. Rudimentary
       configuration example files are provided in the "config/config_example" directory for
       a cluster designated as "Bar". These can be used as the basis for creating the
       necessary configuration files for your cluster in $config_dir and help to illustrate
       some of the available options. Alternatively, you can use a convenience utility to
       initialize the cluster with a vanilla configuration as follows:

       $ <losf-install-dir>/initconfig

       This utility will define a single "master" node type based on the local hostname and
       domainname of the host on which it is executed. Once the basic initialization is
       complete, it should be possible to run the "update" utility. Example output after
       running initconfig is shown below, indicating that no specific configuration features
       have been requested and that the host is presently up to date.

       $ update -q
       OK: [RPMs: OS 0/0 Custom 0/0] [Files: 0/0] [Links: 0/0] [Services: 0/0] [Perms: 0/0] -> master
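   Tip: as noted in step (c) above, the "LOSF_CONFIG_DIR" environment variable can be used to
   point LosF at an alternate configuration tree, which is convenient for trying out
   configuration changes before committing them to the production config path. A minimal
   illustration (the path shown is only an example) is:

   $ export LOSF_CONFIG_DIR=/admin/losf/config-test

   Subsequent LosF commands issued in that shell will then read their configuration from the
   alternate path.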
-------------------------------------------------------------------------------------------------
4. Customize cluster configuration.

   See the top-level LosF README for a brief introduction on the primary command-line
   utilities available. In addition, running "losf" with no command-line options will provide
   documentation on the available options and syntax. Prior to running any commands,
   customize the cluster configuration:

   (a) Update the config.machines file. To begin, the first file to edit is the
       $config_dir/config.machines file to potentially customize the domainname of your new
       cluster, provide desired node type definitions (e.g. logins, compute, io, etc), and
       define the location to house RPMs registered with LosF.

       The configuration files are simple, ascii keyword-driven files that are organized by
       [sections] and can be edited directly. An example file highlighting example syntax and
       multiple cluster definitions is provided at:

          <losf-install-dir>/config/config_example/config.machines

       Alternatively, if you used the "initconfig" utility, you should already have a basic
       working file installed at $config_dir/config.machines.

   (b) Update the config.<cluster-name> file. Once the cluster and node types have been
       identified, the next file to update is the cluster-specific config file at
       $config_dir/config.<cluster-name>. This keyword-driven input file is used to define
       all of the desired configuration options pertaining to host configuration files,
       runtime services, file permissions, soft links, and (optionally) Cobbler provisioning
       options. If you used the "initconfig" utility to initialize your configuration, the
       initial file will be very minimal with no configuration options registered. See the
       config.Bar example provided under config/config_example in the LosF install for
       details on example syntax and options.

       Note: when adding new configuration files to sync across the cluster (e.g. in the
       [ConfigFiles] section), it is necessary to provide a reference template file for each
       node type for which you want the file synchronized. These reference templates should
       be maintained in:

          $config_dir/const_files/<cluster-name>/<node-type>/

       Unless overridden by specific permission options you add to the config.<cluster-name>
       file, LosF will mimic the file ownership and permissions of the synchronized files to
       match the reference template version.

   Suggestion: for traceability, it is recommended to track your local config file changes
   using your favorite version control system (e.g. git or subversion).

-------------------------------------------------------------------------------------------------
5. Managing your HPC cluster.

   With a working LosF install and configuration in place (and an OS distribution repository
   defined via external network access or from local Cobbler mirroring), you should be able
   to further customize your system and synchronize system-wide based on desired node type
   settings. The following examples highlight a few common tasks:

   (a) Update host to latest state. The "update" utility is used to bring a node to the
       latest configuration status (via the installation/removal of desired packages and
       synchronization of configuration files and services). Basic usage is below (or see
       "<losf-install-dir>/update -h"):

       usage: update [OPTIONS]

       OPTIONS:
         -h          Show help message.
         -q          Quiet logging mode; shows detected system changes only.
         -p [path]   Override configured RPM source directory to prefer provided path instead.
         -v          Print version number and exit.

   (b) Add OS packages. You can add additional OS packages using the "losf" utility. The
       utility will automatically download any associated dependencies and configure the OS
       packages for the *local* node type where you are executing the command. The following
       example highlights the process to add the "finger" package, followed by an update to
       push the package install.

       $ <losf-install-dir>/losf addpkg finger
       $ <losf-install-dir>/update -q

       More sophisticated options for adding OS groups and updating OS packages are also
       available. See "losf -h" for more details.

   (c) Add custom 3rd party packages. HPC systems frequently require custom packages that are
       not part of a standard Linux distribution mechanism (e.g. commercial compilers,
       open-source parallel file systems, MPI stacks, etc). The "losf" utility provides a way
       to manage custom RPMs and organize groups of related custom RPMs via aliases. See
       "losf -h" for more details regarding management of custom RPMs.
       The example below shows registering a locally built RPM for the MVAPICH2 open-source
       project and installing it locally via update.

       $ <losf-install-dir>/losf addrpm ./mvapich2-2.0-1.x86_64.rpm
       $ <losf-install-dir>/update -q

       More sophisticated options for managing multi-version RPMs, relocatable RPMs, and
       aliases are available. Please see "losf -h" for more details.

   (d) Sync config files only. The "update" utility is normally used to verify that all
       configuration files and OS/Custom packages are in sync. A convenience utility is
       provided for cases where you only modified a configuration file or runtime service and
       want to bring a host in sync without checking all RPM packages. To synchronize all
       defined config files, soft links, and runtime services, issue:

       $ <losf-install-dir>/sync_config_files

       To synchronize a particular file, provide the file pathname. For example:

       $ <losf-install-dir>/sync_config_files /etc/motd

       The latter example is convenient for synchronizing specific files periodically via
       cron (e.g. user credentials).

   (e) Perform system tasks in parallel. The "koomie_cf" utility runs arbitrary commands in
       parallel across multiple hosts based on user options. This utility assumes a working
       password-less ssh capability and is intended for use by system administrators. Usage
       options are shown below:

       Usage: koomie_cf [OPTIONS] command

       where "command" is a command to spawn in parallel across one or more cluster hosts
       using ssh. Results of the commands from each host are written to stdout and are
       prepended by the executing hostname. If a host is currently unavailable, it will be
       skipped. If a host fails to execute the command before the timeout window completes,
       the requested command will be terminated.

       OPTIONS:
         --help               generate help message and exit
         -r <1,2,..n>|<2-5>   operate on a subset list of racks (e.g. -r 101-105); this
                              option can also accept special rack types (e.g. -r login)
         -c <rack>-<chassis>  operate on a specific rack/chassis combination (e.g. -c 101-1)
         -f <hostfile>        operate on hosts specified in a provided hostfile
         -m <max>             maximum number of commands to run in parallel (default = 288)
         -t <timeout>         timeout period for command completion in seconds
                              (default = 5 minutes)
         -x <regex>           operate on hosts which match the supplied regex pattern
         -w <interval>        wait interval (in seconds) between subsequent command spawns
                              (default = 0)
         -v                   run LosF in verbose mode

       A common task when performing cluster updates is to use the "koomie_cf" utility in
       concert with "update". For example, assuming we have 4 login nodes available named
       login1, login2, login3, and login4, the following example will run update on all of
       them in parallel. Note that the "-q" option to update is enabled automatically when
       run remotely via koomie_cf.

       $ koomie_cf -x login[1-4] <losf-install-dir>/update
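       Once the updates complete, the same mechanism can be used for a quick verification
       pass. For example (reusing the login hostnames from above), the following reports the
       running kernel version on each login node:

       $ koomie_cf -x login[1-4] uname -r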