Investigation of recent printing problems

On several occasions recently, a problem have been experienced with jammed print queues on Informatics print servers. Investigation of this problem showed a high load on the print server caused by frequent calls to LDAP (on the order of one every two seconds or so) from LPD. On examining the print queue it was seen that rather than sending the print jobs to the printer, the server was instead forwarding the jobs to itself. In other words, it was behaving like a print client rather than a print server. On some occasions, lpq would print out a warning about looping printcaps when examining an affected queue though this was not always the case. This problem most commonly occurred early in the morning or at weekends. Restarting the LPRng component solved the problem.

At first it was believed that these problems were LDAP related and this was the direction which our initial investigations took. These however proved fruitless and we had to look elsewhere for a solution. An obvious potential cause was the printcap entry being used by the server. Print servers and print clients use different printcap entries for a printer; the server version sends print jobs directly to the printer whereas the client version forwards the job to the appropriate print server. If the server was using the client version of the printcap, this would lead to the kind of behaviour observed. In Informatics, printcap data is held in LDAP with the entry for a printer containing information to allow both a client and a server printcap entry to be generated. A locally written script called pcap-query is used to generate printcap entries from the LDAP data and produces server or client printcap entries depending on the host it is run on. Clients generate a printcap entry from LDAP each time they send a job to the printer but because of previous problems with LDAP, print servers generate a flat text version of the printcap data at regular intervals and look up printcap entries in that file. This is handled by the LPRng component. In addition, the component is used to regenerate the lpd.conf and lpd.perms files. Each time the problem occurred, a check of the local printcap file on the server showed that, as expected, all the printers served by this server were using the server version of the printcap entry. The cause of the problem appeared to lie elsewhere.

In fact, we believe that the cause of the problem was indeed printcap related but that this was masked by a facet of LPD's behaviour that we were unaware of; that LPD by default caches printcap entries. Once cached, the printcap is only reread if the daemon is restarted or sent a HUP signal. This means that it would be quite possible for the printcap cached by LPD to be different to the printcap found in the /etc/printcap file on the server if the contents of the /etc/printcap file had changed but the LPD daemon had not been restarted. Sure enough, we discovered that the LPRng component only restarted LPD when its configuration files were updated, not when the /etc/printcap file was updated. In the initial printing setup, the configuration files were updated at 1:00am and 8:00am. The printcap file was updated hourly. It will be seen therefore that for 22 hours out of the day, the printcap file could be out of sync with the printcap entry LPD was using. Restarting the LPRng component solved the problem because it restarted LPD and caused it to reread the (now correct) printcap file.

This leaves one question to be answered, why would the printcap file on the servers be incorrect in the first place? We believe that recent events have suggested a explanation for this.

There has been a longstanding bug with the network component which caused machines with multiple ethernet interfaces (such as the print servers) to intermittently assign the wrong IP address to an interface. We believe that this bug has been responsible for pcap-query getting the wrong IP address returned for the server it was running on and producing client, rather than server printcap entries. This would only cause printing problems if the faulty printcap was in place when the LPD daemon was restarted.

On the week beginning the 30th of April 2007, something greatly increased the severity of this bug causing the wrong IP address to be assigned in the majority of cases. We believe that a new network component schema distributed on the afternoon of the 29th of April may have been responsible for this. This changed behaviour was not immediately obvious since most hosts were only affected when they were rebooted.

Given this, we can now explain the events of the morning of Tuesday the 1st of May when all of the print servers started exhibiting the looping problem (As mentioned above, the problem had previously only affected individual servers). Since a link between the updating of the configuration and printcap files and the appearance of the printing problem had been suggested, the period at which the configuration and printcap files were updated had been reduced to once a day at 01:00am and 08:00am respectively. We believe the course of events was as follows. On the afternoon of the 29th, the network schema changed causing the behaviour of the network component bug to change. At 01:00am on the 30th the LPD daemon was restarted but this had no effect on the print servers since their printcap files had not changed. At 08:00am on the 30th, the printcap file on each of the servers was recreated with client rather than server printcap entries but since the LPD daemon was not restarted, this also had no immediate effect. At 01:00am on the 1st of May, the config files were once more rewritten, the LPD daemon was restarted and the incorrect printcap entries read in by LPD. The appearance of the problem on each of the print servers corresponds exactly to the time the first print job was sent to the server after LPD restarted.

Fixing the problem

The bug in the network component has now been fixed and the new version installed on all the print servers. If our theory is correct (and so far experimentation has shown no reason to think otherwise), this should prevent future re-occurrences of the problem. As a further safeguard, we are putting improved monitoring of the generated printcap files in place in case there are still unresolved issues and modifying the LPRng component to restart LPD every time a new printcap file is generated.

-- CraigStrachan - 07 May 2007

Topic revision: r6 - 08 May 2007 - 14:49:25 - CraigStrachan
 
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
This Wiki uses Cookies