Sysadmin information

Fundamentals

The purpose of Allegro is to manage user access to data (mostly interferometric observations made with ALMA) on disk file systems (FS), which now total approximately 0.8 PB, and to facilitate user processing of said data on Allegro compute nodes (see Allegro Computers). Access to the machines and disks is open to anyone with a Sterrewacht account, but we discourage use of the facilities by anyone without Allegro connections. Similarly, although ALMA data can be stored wherever the user pleases, we encourage users to store each dataset as an Allegro project. There is software (orchestrated by a GUI called the pm_wizard) to manage such projects, and their details are kept in a database. The database is actually a number of text files which are kept in directories under

<FS>/allegro/allegro_staff/.db/

True system administration duties for the Allegro computer hardware are undertaken by the IT department of the Sterrewacht, and no member of Allegro has superuser privileges. It has nevertheless been possible to mimic some of the advantages of being superuser via the convention that a single account, ‘alma’, owns all the important files, plus some inventive use of file access permissions.

Note that most Allegro members know the password to the ‘alma’ account.

Use of the hardware has been complicated in recent times by the multiplication of file systems, from the original single lustre system to, at the time of writing, three FS, of which two are lustre mounts and the third NFS. We have also twice had disastrous disk failures which damaged a significant proportion of the files on one or other of the FS.

In order to simplify the effects on the project management (PM) software of adding, deleting or renaming FS or compute nodes, the currently active FS and nodes are listed in text files in /home/alma/etc:

/home/alma/etc/available_hosts
/home/alma/etc/lustre_roots

All the PM software is set up to read these files, which makes changes in nodes or FS easy to propagate into the software.
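
For example, a shell script could pick up the current lists roughly as follows (a sketch only; it assumes one entry per line in each file):

# Sketch: iterate over the currently recognized FS and nodes.
for fs in $(cat /home/alma/etc/lustre_roots); do
    echo "known file system: $fs"
done
for host in $(cat /home/alma/etc/available_hosts); do
    echo "known compute node: $host"
done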

Due to size constraints it is not practical to replicate the data itself over several FS, but certain things are present on all FS:

  • The basic directory tree under <FS>/allegro.
  • The PM software, organized in four git repositories, as described in Allegro management scripts.
  • The database under <FS>/allegro/allegro_staff/.db/ which records the projects and their linked users.
  • The symlinks to project directories in <FS>/allegro/home/<username>/
  • The archive of completed projects. These typically contain not the voluminous data itself but text and FITS files. Because of this, archived projects are much more compact than active ones.

Although the software on all FS is nominally identical, of course we want to access only a single version at any one time; the copies on the other FS are held simply as backup copies. Similarly, we need to direct users to their symlinks on just one FS. The current ‘active’ FS is taken by the PM software from the file /home/alma/etc/lustre_root. (Note the slight difference in spelling from the file mentioned above.)
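
In shell terms this amounts to something like the following (a sketch; it assumes the file contains just the single path of the active FS):

# Sketch: read the single 'active' FS root.
ACTIVE_FS=$(cat /home/alma/etc/lustre_root)
echo "Point users at their symlinks under $ACTIVE_FS/allegro/home/"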

Synchronization

Each time a project is created or removed/archived, or any other change is made via use of the pm_wizard, the changes are written by the software to all the FS. In this way the database and archive files are kept synchronized. The PM software itself will remain synchronized so long as Allegro members who modify it do this via the correct protocol. Departure from this protocol could easily result in unsynchronized software; in the event of another disk crash, such changes could be lost. Following the protocol is therefore important.
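
Purely as an illustration (this is not the actual PM code), propagating a changed database file to every FS amounts to something like:

# Sketch only: the source file name here is hypothetical.
for fs in $(cat /home/alma/etc/lustre_roots); do
    rsync -a /tmp/updated_project_record "$fs"/allegro/allegro_staff/.db/
done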

GIT protocol

The protocol is to make changes via the following procedure (a command-line sketch follows the list):

  1. Obtain (if you have not already done so) a clone of the relevant git repository.
  2. Edit files in this local copy. Do NOT edit files in the working directories, i.e. the directories from where the software is accessed by Allegro members or other users.
  3. When finished, do git add <new files>, git commit and finally git push to the repo.
  4. Log in to an Allegro machine as user ‘alma’ and invoke the relevant update script (see section Allegro management scripts). Do NOT just go to the location of the working copy and do git pull. The update script will pull the changes to the proper copies on all FS and thus maintain synchronization.
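
Put together, a typical session might look roughly like this (all the bracketed names are placeholders, not the real repo URLs or script names):

$ git clone <URL of the relevant repo>      # step 1: get your own clone
$ cd <repo>
  (edit files in this local copy)           # step 2
$ git add <new files>                       # step 3
$ git commit -m "describe the change"
$ git push
$ ssh alma@<an Allegro machine>             # step 4: as user 'alma'
$ <relevant update script>                  # see Allegro management scripts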

I (IMS) have usually edited as ‘alma’: I keep full clones of all four repos in /home/alma/git_repos/src/ and have edited and pushed from that directory. Editing these scripts as ‘alma’ makes it easier to test them, since many of them can only be run by ‘alma’. You may need to reset your PYTHONPATH environment variable so that you see your local modules rather than the official Allegro ones.
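
For example (the repo name is a placeholder; use wherever your local clone actually lives):

$ export PYTHONPATH=/home/alma/git_repos/src/<repo>:$PYTHONPATH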

Note that there is also a directory /home/alma/git_repos/doc/ but I have never filled it with anything. I have edited and pushed things to the document repos as myself, not as ‘alma’.

Repairs

In the case of a disk crash leading to FS damage, it can be necessary to replicate the relevant files to a new or reconstructed FS. There is a script for doing this: /home/alma/bin/copy_all_db. Note however that this has so far not been run as a whole script, but rather command by command, by cut-and-paste to the command line. It should be regarded as experimental: some tuning of the rsync commands is probably still necessary.
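
Its commands are essentially rsync calls of the following general shape (a hedged sketch, not the literal contents of copy_all_db; <good FS> and <new FS> are placeholders):

# Copy the database (and similarly the archive) from a surviving FS to the new one.
rsync -av <good FS>/allegro/allegro_staff/.db/ <new FS>/allegro/allegro_staff/.db/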

Databases and Variables

This section gives an overview of where the information needed by the various programmes is stored.

Project Database

The database for projects, which is needed by the project manager, is found in $ALLEGRO_DB
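
To check where this points and what it contains:

$ echo $ALLEGRO_DB
$ ls $ALLEGRO_DB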

ALMA Cycle

The first 6 characters of an ALMA Project Code indicate the cycle in which it was observed. A simple text file contains the prefix-to-cycle mapping:

$ cat $ALLEGRO_STAFF/etc/almaCycles/cycleCodes
2011.0 0
2012.1 1
2012.A 1
2013.1 2
2015.1 3
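
A shell one-liner can then look up the cycle for a given project code (a sketch; the project code below is made up):

$ code=2015.1.00001.S    # hypothetical project code
$ awk -v p="${code:0:6}" '$1 == p {print "Cycle " $2}' $ALLEGRO_STAFF/etc/almaCycles/cycleCodes
Cycle 3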

QA2 Delivery Style

The QA2 delivery style is needed for the QA2 packager. The current style is given by the environment variable ALLEGRO_QA2_STYLE. The rc scripts define that variable by reading the value from the file $ALLEGRO_STAFF/etc/qa2/current_cycle_style
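
The rc scripts presumably do something equivalent to the following (a sketch, not the literal rc code):

export ALLEGRO_QA2_STYLE=$(cat $ALLEGRO_STAFF/etc/qa2/current_cycle_style)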

Cron-jobs

We have some cron jobs that run regularly and help us to keep some information up to date. Some are run by root, others are run by user alma. The alma ones are all invoked by various wrapper scripts in /home/alma/bin/cronjobs/; the actual work is done by scripts in the $ALLEGRO_STAFF/bin directory, as given in the table below.

  • du <FS> (daily): ROOT dumps the output of this call so that we can track disk usage of our file systems (FS).
  • chmod g=u $ALLEGRO_USER_PROJECTS/* (daily): ROOT runs this to make sure that the group allegro has the same permissions as the user. This is needed for the removal of projects, since we are not root.
  • update_project_stats.py (daily): digests the output of the du calls above and summarises the disk usage of the Allegro projects that are currently on disk.
  • update_leap_seconds_table (monthly): CASA has to be regularly provided with info about leap seconds. This script provides that and also updates other CASA things (like antenna configurations, etc.).
  • clear_dangling_projects (weekly): sometimes the permissions to remove a project are not sufficient to get rid of all directories at the time of removal. These directories have to wait until the chmod command (see above) by root has given the “green light”. This script then reads these “dangling projects” and removes them one by one.

To edit the current cron jobs that we manage, type

$ crontab -e
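
A typical entry for one of the alma-owned wrappers might look something like this (the schedule and wrapper name are placeholders, not the real crontab contents):

# run one of the wrappers in /home/alma/bin/cronjobs/ every day at 03:00
0 3 * * * /home/alma/bin/cronjobs/<wrapper script>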

Backups

The home area is backed up regularly, as described in the Sterrewacht documentation, but there are currently NO backups of ANYTHING on the Allegro file systems (/lustre1, /lustre2 and /allegro1). This will change, but at present it is only planned to back up files under <FS>/allegro/, and only a very few of those: the git repositories (which in fact are not on /lustre), the DB files, and any text files or scripts that are in the archives. Users should thus NOT store irreplaceable files on the Allegro file systems.

Replication across file systems

A script is available to perform manual backups (actually a replication across all the Allegro file systems):

$ allegro_replicate.pl

You run it by typing (from one of the Allegro machines)

$ allegro_replicate.pl <some directory> '<list of file types>'

An example would be

$ allegro_replicate.pl /lustre2/ims/ '*.c *.py'

This looks for ALL the files which match either *.c or *.py under the directory /lustre2/ims/ and then rsyncs them to ALL the other recognized Allegro file systems (FS). At the moment the only other FS in the ‘recognized’ list is /allegro1, so after I ran the above command the script constructed all the necessary directories on /allegro1 and copied all the *.c and *.py files there from /lustre2/ims. In other words, the script duplicates a file/directory tree, containing all files of the desired types, on the other recognized FS. For example I have some *.py files under

/lustre2/ims/temptemp/allegroUtils

After running the script, all these *.py files have also been copied to

/allegro1/ims/temptemp/allegroUtils

and so forth. Thus it functions as a kind of crude backup script.

If you delete a source file from one FS and then run the script, a copy of the deleted file will still remain on the other FS. I didn’t think it was safe to allow a script to go around deleting things.

A couple of other cautions:

  • The script is not meant to back up large files (e.g. data); I’ve put some restrictions on file size into it. In principle you should only use it with text files. These will usually be small, but they also tend to be the ones which cause the most acute pain when lost.

  • It is handy to be able to use Unix ‘wild card’ characters like * or ? in the file specs you send to the script. You need to remember to put these specs in single quotes, as in the example above (even if there is only one spec in the list). If you forget to do this, the shell will expand the wild cards before the script sees them, which can give funny and possibly undesirable results.

  • NOTE that this is fairly new and not thoroughly tested software. The more powerful a script is, the more dangerous it is. I’ve put in as many safeguards as I can think of, but there are bound to be things I didn’t think of, or pitfalls it is simply not practical to guard against. An example: suppose you had two completely independent scripts with paths as follows:

    /lustre2/fbloggs/somedir/my_important_script.py

    /allegro1/fbloggs/somedir/my_important_script.py

If you ran

$ allegro_replicate.pl /lustre2/fbloggs/ '*.py'

then you would overwrite the script in /allegro1 with the one in /lustre2.