Scientific Operations



CRASH(VIII)                  2/12/75                  CRASH(VIII)



NAME
     crash - what to do when the system crashes

DESCRIPTION
     This section gives at least a few clues about how to proceed
     if the system crashes.  It can't pretend to be complete.

     How to bring it back up.   If the reason for  the  crash  is
     not  evident  (see  below for guidance on `evident') you may
     want to try to dump the system if you feel up to  debugging.
     At  the  moment a dump can be taken only on magtape.  With a
     tape mounted and ready, stop the machine, load  address  44,
     and  start.   This should write a copy of all of core on the
     tape with an EOF mark.  Caution: Any error is taken to  mean
     the  end of core has been reached.  This means that you must
     be sure the ring is in, the tape is ready, and the  tape  is
     clean  and  new.   If the dump fails, you can try again, but
     some of the registers will be lost.  See below for  what  to
     do with the tape.

     In restarting after a crash, always bring up the system sin-
     gle-user.  This is accomplished by following the  directions
     in boot procedures(VIII) as modified for your particular
     installation; a single-user system is indicated by having  a
     particular  value  in  the  switches  (173030  unless you've
     changed init) as the system starts executing.   When  it  is
     running, perform a dcheck and icheck(VIII) on all file sys-
     tems which could have been in use at the time of the  crash.
     If  any  serious file system problems are found, they should
     be repaired.  When you are satisfied with the health of your
     disks,  check  and  set  the date if necessary, then come up
     multi-user.  This is most easily  accomplished  by  changing
     the  single-user  value  in  the switches to something else,
     then logging out by typing an EOT.

     To boot UNIX at all, three files (and the directories
     leading  to them) must be intact.  First, the initialization
     program /etc/init must be present and executable.  If it  is
     not, the CPU will loop in user mode at location 6.  For init
     to work correctly, /dev/tty8 and /bin/sh  must  be  present.
     If  either  does not exist, the symptom is best described as
     thrashing.  Init will go into a  fork/exec  loop  trying  to
     create a Shell with proper standard input and output.
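
     When the damaged root file system has been mounted from a
     backup system, the three requirements above can be checked
     mechanically.  The following Python sketch is only an
     illustration: the function name and the idea of passing the
     mount point as an argument are assumptions, and only the
     three path names come from the text.

```python
import os

# The files the text names as required for booting.
REQUIRED = ["/etc/init", "/dev/tty8", "/bin/sh"]

def missing_boot_files(root):
    """List problems with the boot files under `root`, the directory
    (hypothetical) where the suspect root file system is mounted."""
    problems = []
    for name in REQUIRED:
        path = os.path.join(root, name.lstrip("/"))
        if not os.path.exists(path):
            problems.append(name + ": missing")
        elif name == "/etc/init" and not os.access(path, os.X_OK):
            # The text requires init to be executable as well as present.
            problems.append(name + ": not executable")
    return problems
```

     An empty list means only that the files exist; it says
     nothing about whether their contents are intact.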

     If you cannot get the system to boot, a runnable system must
     be obtained from a backup medium.  The root file system  may
     then  be  doctored as a mounted file system as described be-
     low.  If there are any problems with the root  file  system,
     it  is  probably  prudent  to go to a backup system to avoid
     working on a mounted file system.

     Repairing disks.   The first rule to keep in mind is that an
     addled disk should be treated gently; it shouldn't be mount-
     ed unless necessary, and if it is very valuable yet in quite
     bad shape, perhaps it should be dumped before trying surgery
     on it.  This  is  an  area  where  experience  and  informed
     courage count for much.

     The  problems  reported  by  icheck  typically fall into two
     kinds.  There can be problems with the free list: duplicates
     in  the  free list, or free blocks also in files.  These can
     be cured easily with an icheck -s.  If the  same  block  ap-
     pears  in  more  than  one  file  or  if a file contains bad
     blocks, the files should be deleted, and the free  list  re-
     constructed.   The  best way to delete such a file is to use
     clri(VIII), then remove its directory entries.  If  any  of
     the  affected  files is really precious, you can try to copy
     it to another device first.

     Dcheck may report files which have  more  directory  entries
     than links.  Such situations are potentially dangerous; clri
     discusses a special case of the problem.  All the  directory
     entries  for  the  file  should be removed.  If on the other
     hand there are more links than directory entries,  there  is
     no danger of spreading infection, but merely some disk space
     that is lost for use.  It is sufficient to copy the file (if
     it has any entries and is useful) then use clri on its inode
     and remove any directory entries that do exist.

     Finally, there may be inodes reported by dcheck that have  0
     links  and  0  entries.  These occur on the root device when
     the system is stopped with pipes open,  and  on  other  file
     systems  when  the  system  stops  with files that have been
     deleted while still open.  A clri will free the  inode,  and
     an icheck -s will recover any missing blocks.
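
     The three dcheck cases just described amount to a small
     decision table.  The Python sketch below restates it; the
     function name and the wording of the returned advice are
     illustrative assumptions, and it is no substitute for
     judgment about how valuable the file is.

```python
def dcheck_action(entries, links):
    """Suggest a repair for one inode from the two counts dcheck
    compares: directory entries found, and the inode's link count."""
    if entries == 0 and links == 0:
        # Open pipes or files deleted while still open at the crash.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # Harmless, but the disk space is lost until repaired.
        return "copy the file if useful, clri, remove remaining entries"
    return "consistent; nothing to do"
```

     For example, dcheck_action(2, 1) returns the advice for the
     dangerous case, and dcheck_action(0, 0) the advice for an
     orphaned inode.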

     Why did it crash?    UNIX  types a message on the console
     typewriter when it voluntarily crashes.  Here is the current
     list  of such messages, with enough information to provide a
     hope at least of the remedy.  The message has the form
     `panic: ...', possibly accompanied by other information.  Left
     unstated in all cases is the possibility  that  hardware  or
     software  error produced the message in some unexpected way.

     blkdev
          The getblk routine was called with a nonexistent  major
          device  as  argument.   Definitely hardware or software
          error.

     devtab
          Null device table entry for the major  device  used  as
          argument  to  getblk.   Definitely hardware or software
          error.

     iinit
          An I/O error reading the super-block for the root  file
          system during initialization.

     out of inodes
          A mounted file system has no more i-nodes when creating
          a file.  Sorry, the device isn't available; the  icheck
          should tell you.

     no fs
          A device has disappeared from the mounted-device table.
          Definitely hardware or software error.

     no imt
          Like `no fs', but produced elsewhere.

     no inodes
          The in-core inode table is full.  Try increasing NINODE
          in param.h.  Shouldn't be a panic, just a user error.

     no clock
          During   initialization,  neither  the  line  nor  pro-
          grammable clock was found to exist.

     swap error
          An unrecoverable  I/O  error  during  a  swap.   Really
          shouldn't be a panic, but it is hard to fix.

     unlink - iget
          The  directory containing a file being deleted can't be
          found.  Hardware or software.

     out of swap space
          A program needs to be swapped out, and there is no more
          swap  space.   It  has  to  be  increased.  This really
          shouldn't be a panic, but there is no easy fix.

     out of text
          A pure procedure program is  being  executed,  and  the
          table  for  such  things  is full.  This shouldn't be a
          panic.

     trap
          An unexpected trap  has  occurred  within  the  system.
          This is accompanied by three numbers: a `ka6', which is
          the contents of the segmentation register for the  area
          in  which  the  system's stack is kept; `aps', which is
          the location where the hardware stored the program sta-
          tus  word  during the trap; and a `trap type' which en-
          codes which trap occurred.  The trap types are:

          0    bus error
          1    illegal instruction
          2    BPT/trace
          3    IOT
          4    power fail
          5    EMT
          6    recursive system call (TRAP instruction)
          7    11/70 cache parity, or programmed interrupt
          10   floating point trap
          11   segmentation violation

     In some of these cases it is possible for  octal  20  to  be
     added  into the trap type; this indicates that the processor
     was in user mode when the trap occurred.  If you wish to ex-
     amine  the  stack after such a trap, either dump the system,
     or use the console switches to examine  core;  the  required
     address mapping is described below.
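
     Decoding a trap type is a matter of masking off the octal 20
     bit and looking up what remains.  The Python sketch below
     shows the arithmetic.  It assumes the number is read as
     octal, as the entries 10 and 11 in the table suggest; the
     function and table names are illustrative only.

```python
# Trap types as listed above, keyed by their octal codes.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode

def decode_trap(text):
    """Return (trap name, user_mode) for a trap type given as octal text."""
    code = int(text, 8)
    user_mode = bool(code & USER_MODE)
    name = TRAP_NAMES.get(code & ~USER_MODE, "unknown")
    return name, user_mode
```

     Thus a trap type of 31 decodes as a segmentation violation
     taken in user mode, and 0 as a bus error in the system.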

     Interpreting dumps.    All  file  system problems should be
     taken care of before attempting to look at dumps.  The  dump
     should be read into the file /usr/sys/core; cp(I) will do.
     At this point, you should execute ps -alxk and who to  print
     the  process  table and the users who were on at the time of
     the crash.  You should dump (od(I)) the first 30 bytes of
     /usr/sys/core.   Starting  at  location 4, the registers R0,
     R1, R2, R3, R4, R5, SP and  KDSA6  (KISA6  for  11/40s)  are
     stored.   If  the  dump  had to be restarted, R0 will not be
     correct.  Next, take the value of KA6 (location 22(8) in the
     dump) multiplied by 100(8) and dump 1000(8) words starting
     from there.  This is the per-process  data  associated  with
     the  process  running at the time of the crash.  Relabel the
     addresses 140000 to 141776.  R5  is  C's  frame  or  display
     pointer.   Stored at (R5) is the old R5 pointing to the pre-
     vious stack frame.  At (R5)+2 is the saved PC of the calling
     procedure.   Trace this calling chain until you obtain an R5
     value of 141756, which is where the user's R5 is stored.  If
     the chain is broken, you have to look for a plausible R5, PC
     pair and continue from there.  Each PC should be  looked  up
     in  the system's name list using db(I) and its `:' command,
     to get a reverse calling order.  In most cases  this  proce-
     dure  will  give  an idea of what is wrong.  A more complete
     discussion of system debugging is impossible here.
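
     The address arithmetic in this procedure can be sketched in
     Python.  The sketch assumes the dump has been read into
     memory as a string of bytes, that words are stored low byte
     first as on the PDP-11, and that the registers sit at
     location 4 in the order given above.  All the names are
     illustrative, and the code does not replace looking up each
     PC in the name list.

```python
import struct

U_BASE = 0o140000   # address at which the per-process data is relabelled
USER_R5 = 0o141756  # where the text says the user's R5 is stored
REG_NAMES = ["R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6"]

def read_registers(core):
    """Read the eight saved words starting at location 4 of the dump."""
    return dict(zip(REG_NAMES, struct.unpack_from("<8H", core, 4)))

def word_at(core, ka6, vaddr):
    """Fetch the word at a relabelled address (140000 and up).
    KA6 times 100(8) gives the byte offset of that area in the dump."""
    offset = ka6 * 0o100 + (vaddr - U_BASE)
    return struct.unpack_from("<H", core, offset)[0]

def walk_frames(core, limit=64):
    """Follow the R5 chain and return the saved PCs, innermost first.
    Stops early if the chain is broken or `limit` frames are seen."""
    regs = read_registers(core)
    r5, ka6 = regs["R5"], regs["KA6"]
    pcs = []
    while r5 != USER_R5 and len(pcs) < limit:
        if r5 % 2 or not (U_BASE <= r5 < USER_R5):
            break  # broken chain: look for a plausible R5, PC pair by hand
        pcs.append(word_at(core, ka6, r5 + 2))  # saved PC of the caller
        r5 = word_at(core, ka6, r5)             # previous frame's R5
    return pcs
```

     The list returned is in reverse calling order, ready to be
     looked up one PC at a time.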

SEE ALSO
     clri, icheck, dcheck, boot procedures(VIII)

BUGS