https://github.com/hpcsi/losf
Revision 275b2f2dd62c65259c983f606413b7b9c19234fc authored by Karl W. Schulz on 07 March 2019, 17:10:48 UTC, committed by Karl W. Schulz on 07 March 2019, 17:10:48 UTC
1 parent 640005b
Tip revision: 275b2f2dd62c65259c983f606413b7b9c19234fc authored by Karl W. Schulz on 07 March 2019, 17:10:48 UTC
prep changelog for next release
prep changelog for next release
Tip revision: 275b2f2
INSTALL
# -*- mode: Fundamental; fill-column: 89 -*-
-------------------------------------------------------------------------------------------------
LosF: A Linux operating system Framework for managing HPC clusters
-------------------------------------------------------------------------------------------------
This file outlines the steps for installing LosF and associated dependencies. Certain
sections or steps or steps are optional and are identified as such. In particular, LosF
is designed to optionally coordinate with Cobbler (http://www.cobblerd.org) or Warewulf
(http://warewulf.lbl.gov/trac) for bare-metal provisioning. If you want to use LosF to
manage the Cobbler/Warewulf provisioning configuration, you will need to identify at
least one master host on your cluster to serve as the provisioning server. If your master
server has external network access, you can download the necessary packages to setup
Cobbler from the EPEL repository (https://fedoraproject.org/wiki/EPEL). You can download
Warewulf builds from the OpenHPC project (https://github.com/openhpc/ohpc).
A quickstart method for setting up LosF and optionally, Cobbler on a new cluster that
should work with CentOS distributions is outlined in the following sections.
-------------------------------------------------------------------------------------------------
1. Install a master server from baseline OS distribution (e.g. CentOS).
(a) Boot from installation media and install as desired.
Note: if you are planning to use Cobbler's functionality to mirror repositories,
be aware that the default storage location is /var/www. Consequently, consider
providing ample space to your /var partition (e.g. 40G or more) during the base
install configuration procedure.
(b) Configure basic networking. At a minimum, you will likely want to configure two
network interfaces: (1) one interface for internal cluster access (e.g. eth0 on a
private subnet) and (2) one interface for external access to the outside world
(e.g. eth1 into your existing WAN).
(c) Add EPEL repo access to your master server.
See http://fedoraproject.org/wiki/EPEL/FAQ for more information. An example for
CentOS6 is below:
$ rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
(c) [ Optional ]: Install baseline cobbler software and its associated dependencies
(available from the EPEL repo above). Cobbler can be used for bare-metal
provisioning, but LosF is used for configuration management after a base install,
so other provisioning mechanisms can be used as desired. Assuming your master
server has external network access, the necessary packages can be installed as
follows:
$ yum install cobbler
$ yum install pykickstart
(d) Install LosF OS software dependencies. LosF requires a yum download plugin and two
specific perl modules. These are also available from the EPEL repo:
$ yum install yum-plugin-downloadonly
$ yum install perl-Log-Log4perl
$ yum install perl-Config-IniFiles
-------------------------------------------------------------------------------------------------
2. [ Optional ] Perform a basic configuration for Cobbler.
Before you can use cobbler, you need to define a basic working configuration. See
http://www.cobblerd.org and other online resources for more information; a starting
point for a basic configuration outline on your master cluster server is outlined as
follows:
(a) Edit /etc/cobbler/settings file to define the IPs for your master server. Relevant
variables to update are:
next_server
server
default_password_crypted
The default_password_crypted is the location to define the root password that will
be used during the provisioning process. You need to change this from the default
password to a strong password of your own choosing. You can define the crypted
hash via:
$ openssl passwd -1 -salt 'random-phrase-here' 'your-password-here'
(b) Start cobbler.
$ /etc/init.d/cobblerd start
(c) Check for other cobbler pre-requisites. You can run "cobbler check" which will
help point you to any other remaining requirements (like having http enabled, allowing rsync
through xinetd, etc). Note that you can safely ignore warnings about a missing debmirror,
fencing tools, or boot-loaders.
(d) Import an OS distribution (e.g. from the media used to install the initial server
or from downloaded ISO images). Assuming CentOS ISO images are available locally
on your master host, you can mount and import as follows:
$ export DISTRO=CentOS6.5
$ mkdir /mnt/centOS6.5-dvd1
$ mount -o loop CentOS-6.5-x86_64-bin-DVD1.iso /mnt/centOS6.5-dvd1
$ cobbler import --path=/mnt/centOS6.5-dvd1 --name=${DISTRO} --arch=x86_64
Tip: in order to support the possible provisioning of all available CentOS
packages, consider importing the 2nd DVD ISO image as well. This requires adding
packages to the distro profile created from the DVD1 import and updating the
repository metadata. An example of this process is below:
$ mkdir /mnt/centOS6.5-dvd2
$ mount -o loop CentOS-6.5-x86_64-bin-DVD2.iso /mnt/centOS6.5-dvd2
$ rsync -a '/mnt/centos6.5-dvd2/' /var/www/cobbler/ks_mirror/CentOS6.5-x86_64/ --exclude-from=/etc/cobbler/rsync.exclude
$ export COMPSXML=`ls /var/www/cobbler/ks_mirror/${DISTRO}-x86_64/repodata/*comps*.xml`
$ createrepo -c cache -s sha --groupfile ${COMPSXML} /var/www/cobbler/ks_mirror/${DISTRO}-x86_64
# With the distro defined, it is a good time to synchronize cobbler.
$ cobbler sync
(e) Import an EPEL distribution (this will mirror the repo for use locally throughout
your cluster). An example for CentOS6 is below:
$ cobbler repo add --mirror=http://dl.fedoraproject.org/pub/epel/6/x86_64 \
--name=epel6 --arch=x86_64
Tip: if your master server is behind a proxy, you can augment the repo mirroring
environment to include an http proxy. For example:
$ cobbler repo edit --name=epel6 --environment="http_proxy=http://proxy.yourcompany.com:1111"
To access this repo, you will want to associate it with the provisioning OS distro
you defined. An example for the CentOS6.5 profile above is:
$ cobbler profile edit --name=CentOS6.5-x86_64 --repos=epel6
(f) Mirror the newly defined EPEL repository (note: this will take some time).
$ cobbler reposync
-------------------------------------------------------------------------------------------------
3. Install and perform baseline configuration for LosF.
(a) Untar a release tarball (or clone desired version from GitHub directly), preferably
into a shared file system that will be available across the cluster.
Suggestion: Assuming you have additional drives available in your master host, a
reasonable place to install is into an admin directory or partition hosted by your
chosen master host.
Note: you can also build a self-contained losf RPM for use locally using the distribution
tarball as follows:
$ rpmbuild -tb losf-<version>.tar.gz
(b) Verify basic networking requirements. LosF relies on the output from standard
Linux commands like "hostname" and "dnsdomainname" in order to delineate between
different node types and clusters. Make sure these commands return reasonable
(non-empty) values and update your local network config if necessary.
(c) Define LosF config path and initialize for the current cluster.
Note that you can manage multiple clusters with a single LosF install, but you
need to first choose a top-level config directory ($config_dir) and designate a
unique identifier for the local cluster. There are two options for defining the
local config directory:
(1) set in "<losf-install>/config/config_dir" file in local LosF install dir
(2) set via the "LOSF_CONFIG_DIR" environment variable
For production usage, it is recommended to use option (1) to set the path directly
in the "config/config_dir" file, a simple ascii file. Alternatively, the second
environment variable option provides a convenient way to override the default
config path to test alternate config settings prior to a production rollout.
To set the $config_dir via option (1), consider the following depending on whether
this is a new install or upgrade.
(i) New installs: A template file is provided that you can copy and edit to
identify a preferred path:
$ cd <losf-install>/config
$ cp config_dir.template config_dir [then edit to suit]
(ii) Upgrades: if you have been using a previous version of LosF and are
upgrading to a newer release, you can simply copy your previous
config/config_dir file into the latest install path. Alternatively, you
can run the "misc/config_latest_install" utility and it will search for
the most recently installed config_dir setting, and update the latest
install accordingly.
Note that for a first-time install, you will likely not have a pre-existing
config_dir for LosF; in that case, the path you choose will be created during the
following initialization step.
(d) Create necessary baseline configuration files for the current cluster.
Rudimentary configuration example files are provided in the
"config/config_example" directory for a cluster designated as "Bar". These can be
used as the basis for creating the necessary configuration files for your cluster
in $config_dir and help to illustrate some of the available options.
Alternatively, you can use a convenience utility to initialize the cluster with a
vanilla configuration as follows:
$ <losf-install>/initconfig <YourClusterName>
This utility will define a single "master" node type based on the local hostname
and domainname of the host on which it is executed. If the basic initialization is
complete, it should be possible to run the "update" utility. Example output after
running initconfig is shown below indicating that no specific configuration
features have been requested and that the host is presently up to date.
$ update -q
OK: [RPMs: OS 0/0 Custom 0/0] [Files: 0/0] [Links: 0/0] [Services: 0/0] [Perms: 0/0] -> master
-------------------------------------------------------------------------------------------------
4. Customize cluster configuration
See the top-level LosF README for a brief introduction on the primary command-line
utilities available. In addition, running "losf" with no command-line options will
provide documentation on the available options and syntax. Prior to running any
commands, customize the cluster configuration:
(a) Update config.machines file.
To begin, the first file to edit is the $config_dir/config.machines file to
potentially customize the domainname of your new cluster, provide desired node
type definitions (e.g. logins, compute, io, etc), and define the location to house
RPMs registered with LosF. The configuration files are simple, ascii keyword
driven files that are organized by [sections] and can be edited directly.
An example file highlighting example syntax and multiple cluster
definitions is provided at:
<losf-install>/config/config_example/config.machines
Alternatively, if you used the "initconfig" utility, you should have a basic
working file installed already at $config_dir/config.machines.
(b) Update config.<YourClusterName> file.
Once the cluster and node types have been identified, the next file to update is
the cluster-specific config file at $config_dir/config.<YourClusterName>. This
keyword driven input file is used to define all of the desired configuration
options pertaining to host configuration files, runtime services, file
permissions, soft links, and (optionally), cobbler provisioning options.
If you used the "initconfig" utility to initialize your configuration,
the initial file will be very minimal with no configuration options registered.
See <losf-install>/config/config.Bar for details on example syntax and options.
Note: when adding new configuration files to sync across the cluster (e.g. in the
[ConfigFiles] section), it is necessary to provide a reference template file for
the node type in which you desire to synchronize the file. These reference
templates should be maintained in:
$config_dir/const_files/<YourClusterName>/<Node Type>
Unless overridden by specific permission options you add to the
config.<YourClusterName> file, LosF will mimic the file ownership and permissions
of the synchronized files to match the reference template version.
Suggestion: for traceability, it is recommended to track your local config
file changes using your favorite version control system (e.g. git or
subversion).
-------------------------------------------------------------------------------------------------
5. Managing your HPC cluster.
With a working LosF install and configuration in place (and OS distribution repository
defined via external network access or from local Cobbler mirroring), you should be
able to further customize your system and synchronize system-wide based on desired
node type settings. The following examples highlight a few common tasks:
(a) Update host to latest state.
The "update" utility is used to bring a node to the latest configuration status
(via the installation/removal of desired packages and synchronization of
configuration files and services). Basic usage is below (or see
"<losf-install>/update -h"):
usage: update [OPTIONS]
OPTIONS:
-h Show help message.
-q Quiet logging mode; shows detected system changes only.
-p [path] Override configured RPM source directory to prefer provided path instead.
-v Print version number and exit.
(b) Add OS packages.
You can add additional OS packages using the "losf" utility. The utility will
automatically download any associated dependencies and configure the OS packages
for the *local* node type where you are executing the command.
The following example highlights the process to add the "finger" package, followed
by update to push the package install.
$ <losf-install>/losf addpkg finger
$ <losf-install>/update -q
More sophisticated options for adding OS groups and updating OS packages are also
available. See "losf -h" for more details.
(c) Add Custom 3rd packages.
HPC systems frequently require custom packages that are not part of a standard
Linux distribution mechanism (e.g. commercial compilers, open-source parallel
file systems, MPI stacks, etc). The "losf" utility provides a way to manage custom
RPMs and organize groups of related custom RPMs via aliases. See "losf -h" for
more details regarding management of custom RPMs. The example below shows a simple
example registering a locally built RPM for the MVAPICH2 open-source project and
installing it locally via update.
$ <losf-install>/losf addrpm ./mvapich2-2.0-1.x86_64.rpm
$ <losf-install>/update -q
More sophisticated options for managing multi-version RPMs, relocatable RPMs, and
aliases are available. Please see "losf -h" for more details.
(d) Sync config files only.
The "update" utility is normally used to verify that all configuration files and
OS/Custom packages are in sync. A convenience utility is provided in cases where
you only modified a configuration file or runtime service and want to bring a host
in sync without checking all RPM packages.
To synchronize all defined config files, soft links, and runtime services, issue:
$ <losf-install>/sync_config_files
To synchronize a particular file, provide the file pathname. For example:
$ <losf-install>/sync_config_files /etc/motd
The latter example is convenient for synchronizing specific files periodically via
cron (e.g. user credentials).
(e) Perform system tasks in parallel.
The "koomie_cf" utility runs arbitrary commands in parallel across multiple hosts
based on user options. This utility assumes a working password-less ssh capability
and is intended for use by system administrators. Usage options are shown below:
Usage: koomie_cf [OPTIONS] command
where "command" is a command to spawn in parallel across one or more
cluster hosts using ssh. Results of the commands from each host are
written to stdout and are prepended by the executing hostname. If a
host is currently unavailable, it will be skipped. If a host fails to
execute the command before the timeout window completes, the requested
command will be terminated.
OPTIONS:
--help generate help message and exit
-r <1,2,..n>|<2-5> operate on a subset list of racks (e.g. -r 101-105); this option
can also accept a special rack types (e.g. -r login)
-c <rack>-<chassis> operate on a specific rack/chassis combination (.e.g. -c 101-1)
-f <hostfile> operate on hosts specified in provided hostfile
-m <max_ssh> maximum number of commands to run in parallel (default = 288)
-t <timeout> timeout period for command completion in seconds (default = 5 minutes)
-x <regex> operate on hosts which match supplied regex pattern
-w <wait> wait interval (in seconds) between subsequent command spawns (default = 0)
-v run LosF in verbose mode
A common occurrence when performing cluster updates is to use the "koomie_cf"
utility in concert with "update". For example, assuming we have 4 login nodes
available named login1,login2,login3, and login4, the following example will run
update on all of them in parallel. Note that the "-q" option to update is enabled
automatically when run remotely via koomie_cf.
$ koomie_cf -x login[1-4] <losf-install>/update
Computing file changes ...