# -*- mode: Fundamental; fill-column: 89 -*-

-------------------------------------------------------------------------------------------------
LosF: A Linux operating system Framework for managing HPC clusters
-------------------------------------------------------------------------------------------------

This file outlines the steps for installing LosF and associated dependencies. Certain
sections or steps are optional and are identified as such. In particular, LosF
is designed to optionally coordinate with Cobbler (http://www.cobblerd.org) or Warewulf
(http://warewulf.lbl.gov/trac) for bare-metal provisioning.  If you want to use LosF to
manage the Cobbler/Warewulf provisioning configuration, you will need to identify at
least one master host on your cluster to serve as the provisioning server. If your master
server has external network access, you can download the necessary packages to set up
Cobbler from the EPEL repository (https://fedoraproject.org/wiki/EPEL). You can download
Warewulf builds from the OpenHPC project (https://github.com/openhpc/ohpc). 

A quickstart method for setting up LosF (and, optionally, Cobbler) on a new cluster,
which should work with CentOS distributions, is outlined in the following sections.

-------------------------------------------------------------------------------------------------

1. Install a master server from a baseline OS distribution (e.g. CentOS).

   (a) Boot from installation media and install as desired.
   
       Note: if you are planning to use Cobbler's functionality to mirror repositories,
       be aware that the default storage location is /var/www. Consequently, consider
       providing ample space to your /var partition (e.g. 40G or more) during the base
       install configuration procedure.

   (b) Configure basic networking. At a minimum, you will likely want to configure two
       network interfaces: (1) one interface for internal cluster access (e.g. eth0 on a
       private subnet) and (2) one interface for external access to the outside world
       (e.g. eth1 into your existing WAN).
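
       As an illustration, a static configuration for the internal interface on a
       CentOS-style system might resemble the following sketch (the device name,
       address, and netmask are placeholders; substitute values appropriate for
       your site):

        $ cat /etc/sysconfig/network-scripts/ifcfg-eth0
        DEVICE=eth0
        BOOTPROTO=static
        # placeholder internal address/netmask for the private subnet
        IPADDR=192.168.1.1
        NETMASK=255.255.255.0
        ONBOOT=yes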

   (c) Add EPEL repo access to your master server. 

       See http://fedoraproject.org/wiki/EPEL/FAQ for more information. An example for
       CentOS6 is below:
   
       $ rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
   
   (d) [ Optional ]: Install baseline cobbler software and its associated dependencies
       (available from the EPEL repo above). Cobbler can be used for bare-metal
       provisioning, but LosF is used for configuration management after a base install,
       so other provisioning mechanisms can be used as desired. Assuming your master
       server has external network access, the necessary packages can be installed as
       follows:
   
       $ yum install cobbler
       $ yum install pykickstart

   (e) Install LosF OS software dependencies. LosF requires a yum download plugin and two
       specific perl modules. These are also available from the EPEL repo:
   
       $ yum install yum-plugin-downloadonly
       $ yum install perl-Log-Log4perl
       $ yum install perl-Config-IniFiles

-------------------------------------------------------------------------------------------------

2. [ Optional ] Perform a basic configuration for Cobbler. 

   Before you can use cobbler, you need to define a basic working configuration. See
   http://www.cobblerd.org and other online resources for more information; a starting
   point for a basic configuration on your master cluster server is outlined as
   follows:

   (a) Edit /etc/cobbler/settings file to define the IPs for your master server. Relevant
       variables to update are:
   
       next_server
       server
       default_password_crypted
   
       The default_password_crypted setting defines the root password that will be
       used during the provisioning process. You need to change this from the default
       password to a strong password of your own choosing. You can define the crypted
       hash via:
   
       $ openssl passwd -1 -salt 'random-phrase-here' 'your-password-here'
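
       As an illustration, assuming an internal master address of 192.168.1.1 (a
       placeholder value), the updated entries in /etc/cobbler/settings might read as
       follows (the crypted hash is truncated here for brevity):

        next_server: 192.168.1.1
        server: 192.168.1.1
        default_password_crypted: "$1$random-ph$..."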
   
   (b) Start cobbler.
   
       $ /etc/init.d/cobblerd start

   (c) Check for other cobbler prerequisites. You can run "cobbler check", which will
       point you to any other remaining requirements (such as having httpd enabled or
       allowing rsync through xinetd). Note that you can safely ignore warnings about
       a missing debmirror, fencing tools, or boot-loaders.
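
       As an illustration, typical remediations on a stock CentOS6 master include
       enabling the web server and the xinetd-based rsync service flagged by the
       check (the commands below are representative; your "cobbler check" output
       may differ):

        $ chkconfig httpd on && service httpd start
        $ chkconfig cobblerd on
        $ chkconfig rsync on
        $ service xinetd restart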
   
   (d) Import an OS distribution (e.g. from the media used to install the initial server
       or from downloaded ISO images). Assuming CentOS ISO images are available locally
       on your master host, you can mount and import as follows:

       $ export DISTRO=CentOS6.5   
       $ mkdir /mnt/centOS6.5-dvd1
       $ mount -o loop CentOS-6.5-x86_64-bin-DVD1.iso /mnt/centOS6.5-dvd1
       $ cobbler import --path=/mnt/centOS6.5-dvd1 --name=${DISTRO} --arch=x86_64

       Tip: in order to support the possible provisioning of all available CentOS
       packages, consider importing the 2nd DVD ISO image as well. This requires adding
       packages to the distro profile created from the DVD1 import and updating the
       repository metadata. An example of this process is below: 

       $ mkdir /mnt/centOS6.5-dvd2
       $ mount -o loop CentOS-6.5-x86_64-bin-DVD2.iso /mnt/centOS6.5-dvd2
        $ rsync -a '/mnt/centOS6.5-dvd2/' /var/www/cobbler/ks_mirror/CentOS6.5-x86_64/ --exclude-from=/etc/cobbler/rsync.exclude
       $ export COMPSXML=`ls /var/www/cobbler/ks_mirror/${DISTRO}-x86_64/repodata/*comps*.xml`
       $ createrepo -c cache -s sha --groupfile ${COMPSXML} /var/www/cobbler/ks_mirror/${DISTRO}-x86_64

       # With the distro defined, it is a good time to synchronize cobbler.

       $ cobbler sync
   
   (e) Import an EPEL distribution (this will mirror the repo for use locally throughout
       your cluster). An example for CentOS6 is below:

        $ cobbler repo add --mirror=http://dl.fedoraproject.org/pub/epel/6/x86_64 \
                           --name=epel6 --arch=x86_64
   
       Tip: if your master server is behind a proxy, you can augment the repo mirroring
       environment to include an http proxy. For example:

       $ cobbler repo edit --name=epel6 --environment="http_proxy=http://proxy.yourcompany.com:1111"

       To access this repo, you will want to associate it with the provisioning OS distro
       you defined. An example for the CentOS6.5 profile above is:

       $ cobbler profile edit --name=CentOS6.5-x86_64 --repos=epel6
   
   (f) Mirror the newly defined EPEL repository (note: this will take some time).
   
       $ cobbler reposync
   
-------------------------------------------------------------------------------------------------

3. Install and perform baseline configuration for LosF.

   (a) Untar a release tarball (or clone desired version from GitHub directly), preferably
       into a shared file system that will be available across the cluster. 
   
       Suggestion: assuming you have additional drives available on your master host, a
       reasonable place to install is an admin directory or partition hosted there.

       Note: you can also build a self-contained losf RPM for local use from the
       distribution tarball as follows:

       $ rpmbuild -tb losf-<version>.tar.gz
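
       The resulting binary RPM can then be installed directly. With a default
       rpmbuild configuration, it will land under ~/rpmbuild (the exact subdirectory
       may vary by architecture and site configuration):

        $ rpm -ivh ~/rpmbuild/RPMS/*/losf-*.rpm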

   (b) Verify basic networking requirements. LosF relies on the output of standard
       Linux commands like "hostname" and "dnsdomainname" to distinguish between
       different node types and clusters. Make sure these commands return reasonable
       (non-empty) values and update your local network config if necessary.
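
       For example (the hostnames shown are placeholders):

        $ hostname
        master1.yourcluster.org
        $ dnsdomainname
        yourcluster.org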
   
   (c) Define LosF config path and initialize for the current cluster.
   
       Note that you can manage multiple clusters with a single LosF install, but you
       need to first choose a top-level config directory ($config_dir) and designate a
       unique identifier for the local cluster. There are two options for defining the
       local config directory:
   
          (1) set in "<losf-install>/config/config_dir" file in local LosF install dir
          (2) set via the "LOSF_CONFIG_DIR" environment variable
   
       For production usage, it is recommended to use option (1) to set the path directly
       in the "config/config_dir" file, a simple ASCII file. Alternatively, the second
       (environment variable) option provides a convenient way to override the default
       config path to test alternate config settings prior to a production rollout.
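
       For example, to point a test shell at an alternate configuration tree via the
       environment variable option (the path below is a placeholder):

        $ export LOSF_CONFIG_DIR=/tmp/losf-test-config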

       To set the $config_dir via option (1), consider the following depending on whether
       this is a new install or upgrade.

          (i)  New installs: A template file is provided that you can copy and edit to
               identify a preferred path:

               $ cd <losf-install>/config
               $ cp config_dir.template config_dir    [then edit to suit]

          (ii) Upgrades: if you have been using a previous version of LosF and are
               upgrading to a newer release, you can simply copy your previous
               config/config_dir file into the latest install path. Alternatively, you
               can run the "misc/config_latest_install" utility, which will search for
               the most recently installed config_dir setting and update the latest
               install accordingly.

       Note that for a first-time install, you will likely not have a pre-existing
       config_dir for LosF; in that case, the path you choose will be created during the
       following initialization step.

   (d) Create necessary baseline configuration files for the current cluster.

       Rudimentary configuration example files are provided in the
       "config/config_example" directory for a cluster designated as "Bar". These can be
       used as the basis for creating the necessary configuration files for your cluster
       in $config_dir and help to illustrate some of the available options.
       
       Alternatively, you can use a convenience utility to initialize the cluster with a
       vanilla configuration as follows:

       $ <losf-install>/initconfig <YourClusterName>

       This utility will define a single "master" node type based on the local hostname
       and domainname of the host on which it is executed. Once the basic initialization
       is complete, it should be possible to run the "update" utility. Example output
       after running initconfig is shown below, indicating that no specific configuration
       features have been requested and that the host is presently up to date.

       $ update -q
       OK: [RPMs: OS 0/0  Custom 0/0] [Files: 0/0] [Links: 0/0] [Services: 0/0] [Perms: 0/0] -> master

-------------------------------------------------------------------------------------------------

4. Customize cluster configuration.

   See the top-level LosF README for a brief introduction on the primary command-line
   utilities available. In addition, running "losf" with no command-line options will
   provide documentation on the available options and syntax. Prior to running any
   commands, customize the cluster configuration:

   (a) Update config.machines file.

       To begin, the first file to edit is the $config_dir/config.machines file to
       customize the domainname of your new cluster, provide desired node type
       definitions (e.g. logins, compute, io, etc.), and define the location to house
       RPMs registered with LosF. The configuration files are simple ASCII,
       keyword-driven files that are organized by [sections] and can be edited directly.

       An example file highlighting the syntax and multiple cluster definitions is
       provided at:

       <losf-install>/config/config_example/config.machines

       Alternatively, if you used the "initconfig" utility, you should have a basic
       working file installed already at $config_dir/config.machines.

   (b) Update config.<YourClusterName> file.

       Once the cluster and node types have been identified, the next file to update is
       the cluster-specific config file at $config_dir/config.<YourClusterName>. This
       keyword-driven input file is used to define all of the desired configuration
       options pertaining to host configuration files, runtime services, file
       permissions, soft links, and (optionally) cobbler provisioning options.

       If you used the "initconfig" utility to initialize your configuration,
       the initial file will be very minimal with no configuration options registered.
       See <losf-install>/config/config_example/config.Bar for details on example
       syntax and options.

       Note: when adding new configuration files to sync across the cluster (e.g. in the
       [ConfigFiles] section), it is necessary to provide a reference template file for
       each node type on which you want the file synchronized. These reference
       templates should be maintained in:

       $config_dir/const_files/<YourClusterName>/<Node Type>

       Unless overridden by specific permission options you add to the
       config.<YourClusterName> file, LosF will mimic the file ownership and permissions
       of the synchronized files to match the reference template version.
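
       For example, to register a reference template for /etc/motd on a hypothetical
       "login" node type (the layout below is illustrative; see the config.Bar example
       for the precise conventions):

        $ mkdir -p $config_dir/const_files/<YourClusterName>/login
        $ cp /etc/motd $config_dir/const_files/<YourClusterName>/login/motd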

   Suggestion: for traceability, it is recommended to track your local config
   file changes using your favorite version control system (e.g. git or
   subversion).
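
   For example, a minimal setup using git (the commit message is arbitrary):

   $ cd $config_dir
   $ git init
   $ git add .
   $ git commit -m "initial LosF configuration"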

-------------------------------------------------------------------------------------------------

5. Managing your HPC cluster.

   With a working LosF install and configuration in place (and OS distribution repository
   defined via external network access or from local Cobbler mirroring), you should be
   able to further customize your system and synchronize system-wide based on desired
   node type settings. The following examples highlight a few common tasks:

   (a) Update host to latest state. 

       The "update" utility is used to bring a node to the latest configuration status
       (via the installation/removal of desired packages and synchronization of
       configuration files and services). Basic usage is below (or see
       "<losf-install>/update -h"):

       usage: update [OPTIONS]
        
       OPTIONS:
          -h          Show help message.
          -q          Quiet logging mode; shows detected system changes only.
          -p [path]   Override configured RPM source directory to prefer provided path instead.
          -v          Print version number and exit.
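
       For example, to test packages staged in an alternate local directory before
       registering them in the configured RPM source location (the path below is a
       placeholder):

        $ <losf-install>/update -q -p /tmp/rpm-staging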

   (b) Add OS packages.

       You can add additional OS packages using the "losf" utility. The utility will
       automatically download any associated dependencies and configure the OS packages
       for the *local* node type where you are executing the command.

       The following example highlights the process to add the "finger" package, followed
       by running "update" to push the package install:

       $ <losf-install>/losf addpkg finger
       $ <losf-install>/update -q

       More sophisticated options for adding OS groups and updating OS packages are also
       available. See "losf -h" for more details.

   (c) Add custom third-party packages.

       HPC systems frequently require custom packages that are not part of a standard
       Linux distribution mechanism (e.g. commercial compilers, open-source parallel
       file systems, MPI stacks, etc). The "losf" utility provides a way to manage custom
       RPMs and organize groups of related custom RPMs via aliases. See "losf -h" for
       more details regarding management of custom RPMs. The example below registers a
       locally built RPM for the MVAPICH2 open-source project and installs it locally
       via update:

       $ <losf-install>/losf addrpm ./mvapich2-2.0-1.x86_64.rpm
       $ <losf-install>/update -q

       More sophisticated options for managing multi-version RPMs, relocatable RPMs, and
       aliases are available. Please see "losf -h" for more details.

   (d) Sync config files only.

       The "update" utility is normally used to verify that all configuration files and
       OS/Custom packages are in sync. A convenience utility is provided in cases where
       you only modified a configuration file or runtime service and want to bring a host
       in sync without checking all RPM packages.  

       To synchronize all defined config files, soft links, and runtime services, issue:

       $ <losf-install>/sync_config_files

       To synchronize a particular file, provide the file pathname. For example:

       $ <losf-install>/sync_config_files /etc/motd

       The latter example is convenient for synchronizing specific files periodically via
       cron (e.g. user credentials).
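
       For example, a root crontab entry along the following lines (the install path
       and interval shown are placeholders) would re-sync credential files every 10
       minutes:

        */10 * * * * /admin/losf/sync_config_files /etc/passwd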

   (e) Perform system tasks in parallel.

       The "koomie_cf" utility runs arbitrary commands in parallel across multiple hosts
       based on user options. This utility assumes a working password-less ssh capability
       and is intended for use by system administrators. Usage options are shown below:

       Usage: koomie_cf [OPTIONS] command
       
       where "command" is a command to spawn in parallel across one or more
       cluster hosts using ssh. Results of the commands from each host are
       written to stdout and are prepended by the executing hostname. If a
       host is currently unavailable, it will be skipped. If a host fails to
       execute the command before the timeout window completes, the requested
       command will be terminated.
       
       OPTIONS:
         --help                  generate help message and exit
          -r <1,2,..n>|<2-5>      operate on a subset list of racks (e.g. -r 101-105); this option
                                  can also accept special rack types (e.g. -r login)
          -c <rack>-<chassis>     operate on a specific rack/chassis combination (e.g. -c 101-1)
         -f <hostfile>           operate on hosts specified in provided hostfile
         -m <max_ssh>            maximum number of commands to run in parallel (default = 288)
         -t <timeout>            timeout period for command completion in seconds (default = 5 minutes)
         -x <regex>              operate on hosts which match supplied regex pattern
         -w <wait>               wait interval (in seconds) between subsequent command spawns (default = 0)
         -v                      run LosF in verbose mode 

       A common occurrence when performing cluster updates is to use the "koomie_cf"
       utility in concert with "update". For example, assuming we have 4 login nodes
       available named login1, login2, login3, and login4, the following example will
       run update on all of them in parallel. Note that the "-q" option to update is
       enabled automatically when run remotely via koomie_cf.

       $ koomie_cf -x login[1-4] <losf-install>/update
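
       Similarly, the "-f" option can target an explicit host list; for example, to
       check uptime on the hosts named in a local file (the filename below is a
       placeholder):

       $ koomie_cf -f ./myhosts uptime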
