Departmental Servers in the Virtus Slough Data-center

This web-page provides information for departmental staff and researchers concerning hardware and services in the Virtus Slough Data-center.

Key details

As of April 2020, the department has a number of server racks in the London 3 Virtus data-center in Slough.  These server racks house research and departmental hardware.  The allocation of space within these racks is managed by CSG.  If you plan to host hardware in this space, please contact CSG well beforehand so that we can collaborate with you on the ordering and deployment process.

Hardware hosting compliance

Please note that any hardware hosted in Slough must meet the operational standards that one would expect of a modern, remote data-center:

Essential

  • it must be rack-mountable for installation in a standard 42U server rack cabinet.
  • it must not be more than six years old as of April 2020.
  • it must be in a server form-factor – not a desktop or work-station or some other custom configuration.

Desirable

  • it should be under warranty for the duration of its hosting in the Slough Virtus data-center.
  • it should support – and be configured for – out-of-band management. This is so that the hardware can be remotely managed in common scenarios – for example, installing an operating system or rebooting after a crash (a brief illustration appears below, after these criteria).
  • it should have hot-plug power redundancy.
  • it should have storage redundancy.
  • it should support 10GbE network connectivity through a minimum of two direct-attach SFP+ NICs.
  • if configured for 1GbE network connectivity, it should have a minimum of two such 1GbE NICs.
  • it should be running a modern operating system for which security updates are currently available.

The above criteria are intended to reduce the need for interaction in person with the hardware once it is hosted in the Slough Virtus data-center.
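
As an illustration of what out-of-band management makes possible, the commands below show how a server whose baseboard management controller (BMC) speaks IPMI could be power-cycled and monitored remotely using the standard ipmitool utility. This is only a sketch: the BMC host name and user name are placeholders, and your hardware may instead provide a vendor-specific tool or web interface.

# Query the power state of the server via its BMC (host and user are placeholders)
ipmitool -I lanplus -H bmc-example.doc.ic.ac.uk -U admin chassis power status
# Power-cycle the server remotely, for example after a crash
ipmitool -I lanplus -H bmc-example.doc.ic.ac.uk -U admin chassis power cycle
# Attach to the serial-over-LAN console to watch the machine boot
ipmitool -I lanplus -H bmc-example.doc.ic.ac.uk -U admin sol activate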

The department reserves the right to refuse to host hardware in the Slough Virtus data-center if it does not meet the above or other criteria. In particular, if hardware does not meet departmental, ICT or Virtus guidelines, it will not be hosted.

General

The Slough Virtus data-center is some thirty miles from the college South Kensington campus by car. Neither ICT nor CSG have personnel stationed at the data-center. Instead, CSG and ICT personnel periodically visit the data-center to conduct maintenance and installation tasks. CSG aim to visit the data-center once a month.

Due to the remote location of the data-center, it is important to utilise technologies which facilitate remote management and which provide redundancy, as far as possible. This is why full out-of-band management capabilities, storage redundancy (RAID) and power redundancy are so important.

The department is actively working on plans for additional rack space to meet growing research and teaching requirements. In the meantime, the existing rack space must be carefully managed. It is vitally important that CSG are contacted before research groups make plans to buy additional data-center hardware. Furthermore, life-cycle management of currently-hosted hardware is not optional: when hardware becomes obsolete, it must be removed from the server racks to make way for new hardware.

Special File Recovery for Redkite and Kestrel Departmental Home Directory Servers

Synopsis

Following the Kestrel and Redkite home directory server outage over Friday 4th – Monday 7th May, all affected users are strongly advised to check that the contents of their departmental home directories are intact.

Overview

There were file-system problems on the departmental home directory servers Kestrel and Redkite between Friday 4th and Monday 7th of May 2018 that may have resulted in some contents of your home directory being lost.

As a result, CSG are advising you to check your home directory. We have created a separate network file-storage area – /vol/recovery – to facilitate the checking process. This area contains a snapshot of the incremental back-ups (see below) for Kestrel and Redkite users around the time of the outage.

/vol/recovery can be accessed on any CSG-managed computer – for example, shell1.doc.ic.ac.uk, potoo02.doc.ic.ac.uk or a lab machine.
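
For example, to reach one of the login servers named above over SSH (substitute your college user name if it differs from ${USER} on your local machine):

# Connect to a CSG-managed login server
ssh ${USER}@shell1.doc.ic.ac.uk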

How to check your home directory

We assume that you are viewing this web-page because you know that your departmental home directory server is either Kestrel or Redkite. You can find out if that is indeed the case on-line.

To perform a check, please log-on (interactively or through SSH) to an appropriate computer and run the following sequence of commands (the lines beginning with ‘#’ are comments):

# You need to have a Kerberos ticket to access the network share
kinit ${USER}@IC.AC.UK
# Enter your college password in response to the above
ls /vol/recovery/${USER}

You should see output similar to the following:

inc-00 inc-02 inc-04 inc-06 inc-08 inc-10 inc-12
inc-01 inc-03 inc-05 inc-07 inc-09 inc-11 inc-13
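
If the listing is empty or you receive a permission error, it is worth checking that the kinit step actually obtained a ticket. klist, part of the standard Kerberos tools, lists the tickets in your current cache:

# Show current Kerberos tickets – your principal should appear as ${USER}@IC.AC.UK
klist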

Each ‘inc-XY’ directory contains incremental back-ups from (n-1) days before the outage – specifically, files detected as having changed since the previous back-up. In general, the directory ‘inc-01’ corresponds to the incremental back-up taken the day before the outage, but this may not be the case for all users. You can check whether any files are missing by running the following command on a specific ‘inc-XY’ directory:

/vol/recovery/check inc-01

The command will list all files under the relevant incremental back-up directory – inc-01 in the above example – that are not present under your DoC home directory. If you wish to restore the listed files, then please invoke the following command:

/vol/recovery/recover inc-01

This command will do everything the previous one does and, in addition, copy the listed files to your DoC home directory. Please note: it will not over-write existing files. Please also check the permissions of restored files: they are usually quite restrictive as a consequence of being restored from on-line back-ups. This is especially relevant if you restore content to /homes/${USER}/public_html.
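
If restored files under public_html need to be readable by the web server, one way to relax their permissions is shown below. This is only an example – run it only if world-readable permissions are actually appropriate for the content concerned:

# Make restored files world-readable and directories traversable under public_html
chmod -R a+rX /homes/${USER}/public_html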

Please contact doc-help@imperial.ac.uk if you require further assistance with any of the above.

CSG apologise for the inconvenience caused by asking end-users to make these checks. Please be assured that the provision of a highly-available, high-integrity departmental home directory service is one of our core responsibilities. We will learn lessons from the rather peculiar set of circumstances that led to this outage and aim to provide a more-robust service in future.

How did the outage occur?

Departmental home directories are stored on four servers: Buzzard, Kestrel, Osprey and Redkite.  These are Dell PowerEdge R720xd servers with 24 1TB 7.2K RPM SAS 6Gbps hard drives and a 400GB NVMe SSD.  The servers run Ubuntu Linux and have 10GbE network connectivity.  The storage is organised as follows:

  • 11TB software RAID-10 via mdadm with two hot spares.
  • LVM on top of the software RAID for high-level space allocation and management.
  • LVM Cache – via the NVMe SSD – to accelerate read/write operations.
  • XFS – with quota management – on the logical volume in which user home directories reside.
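
For the curious, the commands below show one way the layers of such a stack can be inspected on a Linux host. The device names, volume names and mount point are illustrative placeholders rather than the actual names used on the home directory servers:

# Software RAID status, including any rebuild or hot-spare activity
cat /proc/mdstat
mdadm --detail /dev/md0
# Logical volumes, their attributes (including caching) and underlying devices
lvs -a -o +devices
# Space usage and per-user quotas on the mounted XFS file-system
df -h /homes
xfs_quota -x -c 'report -h' /homes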

The home directory servers are paired: Buzzard (HXLY 221) is paired with Osprey (CAGB 403), while Redkite (HXLY 221) is paired with Kestrel (CAGB 403). Each server is the primary for the home directories of around one quarter of departmental users, and a near-live mirror of the home directories on each primary is maintained on its paired server.

On the Friday, the XFS file-systems on Kestrel and Redkite were almost full. LVM and XFS support on-line resizing, there was ample spare space in the associated Physical Volumes (PVs) and Volume Group (VG), and CSG have performed such resizes multiple times in the past without any end-user service impact. A key difference on this occasion was the presence of the LVM Cache. In accordance with the documentation, CSG ran the following command to remove the cache before expanding the associated Logical Volume (LV):

lvconvert --splitcache VG/cachedLV

This was followed by lvresize and xfs_growfs and finally an lvconvert command to re-attach the cache. All commands completed successfully and without any errors. Around an hour after this, the associated file-systems went off-line with XFS corruption. CSG then spent several hours over the course of three days running xfs_repair to fix this issue. Regrettably, a lot of end-user files were unceremoniously dumped into a generic ‘lost & found’ directory as a result of the xfs_repair. These files had, effectively, become disconnected from their respective end-user home directories.
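
For reference, the overall sequence was of the following form. The volume group, logical volume and cache-pool names – and the mount point and size increment – are illustrative placeholders rather than the real values used on Kestrel and Redkite:

# Detach the cache from the cached logical volume
lvconvert --splitcache vg_home/lv_homes
# Grow the logical volume using free space in the volume group (+2T is only an example)
lvresize --size +2T vg_home/lv_homes
# Grow the mounted XFS file-system to fill the enlarged volume
xfs_growfs /homes
# Re-attach the cache pool to the logical volume
lvconvert --type cache --cachepool vg_home/lv_cache vg_home/lv_homes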