Synopsis
Following the Kestrel and Redkite home directory server outage over Friday 4th – Monday 7th May, all affected users are strongly advised to check that the contents of their departmental home directories are intact.
Overview
There were file-system problems on the departmental home directory servers Kestrel and Redkite between Friday 4th and Monday 7th of May 2018 that may have resulted in some contents of your home directory being lost.
As a result, CSG are advising you to check your home directory. We have created a separate network file-storage area – /vol/recovery – to facilitate the checking process. This area contains a snapshot of the incremental back-ups (see below) for Kestrel and Redkite users around the time of the outage.
/vol/recovery can be accessed on any CSG-managed computer, for example shell1.doc.ic.ac.uk, potoo02.doc.ic.ac.uk or a lab computer.
How to check your home directory
We assume that you are viewing this web-page because you know that your departmental home directory server is either Kestrel or Redkite. You can find out if that is indeed the case on-line.
To perform a check, please log on (interactively or through SSH) to an appropriate computer and run the following sequence of commands (the lines beginning with ‘#’ are comments):
# You need to have a Kerberos ticket to access the network share
kinit ${USER}@IC.AC.UK
# Enter your college password in response to the above
ls /vol/recovery/${USER}
You will see the following output:
inc-00  inc-02  inc-04  inc-06  inc-08  inc-10  inc-12
inc-01  inc-03  inc-05  inc-07  inc-09  inc-11  inc-13
Each ‘inc-XY’ directory contains incremental back-ups (n-1) days from the outage, specifically files detected as having changed since the outage. In general, the directory ‘inc-01’ corresponds to incremental back-ups from the day before the outage, but this may not be the case for all users. You can check if there are any missing files by running the following command on a specific ‘inc-XY’ directory:
/vol/recovery/check inc-01
The command will list all files that are under the relevant incremental back-up directory – in the above example, inc-01 – that are not under your DoC home directory. If you wish to restore the listed files, then please invoke the following command:
/vol/recovery/recover inc-01
This command will do all that the previous one does and will also copy the listed files to your DoC home directory. Please note: it will not over-write existing files. Please also check the permissions of restored files: they are usually quite restrictive as a consequence of being restored from on-line back-ups. This is especially relevant if you restore content to /homes/${USER}/public_html.
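To make the behaviour described above concrete, here is a minimal shell sketch of the same idea: list files present in a back-up directory but absent from a home directory, copy only those files, and then list restored files that are not world-readable. The function names list_missing, restore_missing and list_not_world_readable are hypothetical. This is not the implementation of /vol/recovery/check or /vol/recovery/recover, which may work differently. It also assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; NOT the real
# /vol/recovery/check or /vol/recovery/recover scripts.

# list_missing SRC DST
# Print the paths (relative to SRC) of regular files that exist under SRC
# but have no counterpart under DST.
list_missing() {
    src=$1
    dst=$2
    ( cd "$src" && find . -type f ) | sed 's|^\./||' | sort |
    while IFS= read -r rel; do
        [ -e "$dst/$rel" ] || printf '%s\n' "$rel"
    done
}

# restore_missing SRC DST
# Copy every file reported by list_missing into DST, creating parent
# directories as needed. Files that already exist in DST are never
# touched, because only missing paths are copied.
restore_missing() {
    src=$1
    dst=$2
    list_missing "$src" "$dst" |
    while IFS= read -r rel; do
        mkdir -p "$dst/$(dirname "$rel")"
        cp -p "$src/$rel" "$dst/$rel"   # -p keeps the back-up's permissions
    done
}

# list_not_world_readable DIR
# Print regular files under DIR that lack the "other" read bit, which is
# what a web server typically needs for content under public_html.
list_not_world_readable() {
    find "$1" -type f ! -perm -o=r
}

# Example invocation (paths follow the text above; adjust inc-01 as needed):
#   list_missing "/vol/recovery/${USER}/inc-01" "$HOME"
#   list_not_world_readable "$HOME/public_html"
```

Whether a restored file should be made world-readable is a judgment for the file's owner; the last helper only reports candidates and changes nothing.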
Please contact doc-help@imperial.ac.uk if you require further assistance with any of the above.
CSG apologise for the inconvenience caused by asking end-users to make these checks. Please be assured that the provision of a highly-available, high-integrity departmental home directory service is one of our core responsibilities. We will learn lessons from the rather peculiar set of circumstances that led to this outage and aim to provide a more-robust service in future.
How did the outage occur?
Departmental home directories are stored on four servers: Buzzard, Kestrel, Osprey and Redkite. These are Dell PowerEdge R720xd servers with 24 1TB 7.2K RPM SAS 6Gbps hard drives and a 400GB NVMe SSD. The servers run Ubuntu Linux and have 10GbE network connectivity. The storage is organised as follows:
- 11TB software RAID-10 via mdadm with two hot spares.
- LVM on top of the software RAID for high-level space allocation and management.
- LVM Cache – via the NVMe SSD – to accelerate read/write operations.
- XFS – with quota management – on the logical volume in which user home directories reside.
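The 11TB figure is consistent with the drive count given above. The short calculation below is a sketch under two assumptions that the text does not state explicitly: that the two hot spares are taken from the 24 drives, and that the RAID-10 array keeps two copies of every block.

```shell
#!/bin/sh
# Assumptions (not stated in the original): both hot spares come out of
# the 24 drives, and RAID-10 stores two copies of every block.
drives=24      # 1TB SAS drives per server
spares=2       # hot spares, idle until a member drive fails
drive_tb=1     # capacity of each drive in TB
copies=2       # mirror copies kept by RAID-10

usable_tb=$(( (drives - spares) * drive_tb / copies ))
echo "usable capacity: ${usable_tb}TB"
```

With these numbers, 22 active drives mirrored in pairs give 11TB of usable space.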
The home directory servers are paired: Buzzard (HXLY 221) is paired with Osprey (CAGB 403) while Redkite (HXLY 221) is paired with Kestrel (CAGB 403). Each server is primary for the home directories of around one quarter of departmental users. A near-live mirror of the home directories on each primary is maintained on its paired server.
On Friday, the XFS file-systems on Kestrel and Redkite were almost full. LVM and XFS both support on-line resizing, there was ample spare space in the associated Physical Volume (PV) and Volume Group (VG), and CSG have resized these file-systems multiple times in the past without any end-user service impact. A key difference this time was the presence of the LVM Cache. In accordance with the documentation, CSG ran the following command to detach the cache before expanding the associated Logical Volume (LV):
lvconvert --splitcache VG/cachedLV
This was followed by lvresize and xfs_growfs and finally an lvconvert command to re-attach the cache. All commands completed successfully and without any errors. Around an hour after this, the associated file-systems went off-line with XFS corruption. CSG then spent several hours over the course of three days running xfs_repair to fix this issue. Regrettably, a lot of end-user files were unceremoniously dumped into a generic ‘lost & found’ directory as a result of the xfs_repair. These files had, effectively, become disconnected from their respective end-user home directories.
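For readers unfamiliar with the tools involved, the order of operations described above can be sketched as a dry run that only prints commands. Only the first command is quoted from CSG; the options shown for lvresize, xfs_growfs and the final lvconvert are assumptions about a typical invocation, and the function name print_resize_plan and all argument values are hypothetical. Since this sequence preceded the corruption, the sketch is purely illustrative and is not a recommended procedure.

```shell
#!/bin/sh
# Dry run only: prints the sequence, executes nothing.
# print_resize_plan VG LV POOL EXTRA MOUNTPOINT
#   VG, LV      volume group and cached logical volume (placeholders)
#   POOL        cache pool LV to re-attach (assumed name)
#   EXTRA       amount of space to add, e.g. 1T (assumed value)
#   MOUNTPOINT  where the XFS file-system is mounted (assumed path)
print_resize_plan() {
    vg=$1 lv=$2 pool=$3 extra=$4 mnt=$5
    echo "lvconvert --splitcache $vg/$lv"                        # detach the cache
    echo "lvresize --size +$extra $vg/$lv"                       # grow the LV
    echo "xfs_growfs $mnt"                                       # grow XFS to fill the LV
    echo "lvconvert --type cache --cachepool $vg/$pool $vg/$lv"  # re-attach the cache
}

# Example with placeholder names:
#   print_resize_plan VG cachedLV cachepoolLV 1T /mountpoint
```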