You are sound asleep only to be awoken by your beeper in the middle of the night with a critical job being down. Before you start panicking, what steps would you follow to get it resolved, so you can go to bed sooner?
The first part is to narrow down the problem to where the error occurred. If this were an IMS batch job (a BMP) using COBOL, DB2 and MQ, the error can be in any of the following – COBOL, DB2, MQ or in IMS or data issues.
The first step is to look in SDSF or in JHS and to locate the job step and proc step that abended. This will tell you the driver Then if there is a SYSOUT corresponding to this step, take a look to see if there is anything that can give you a clue as to where the error might be. If the error is the COBOL program, usually S0C4 or S0C7 abends, you would see the program where the abend occurred, followed by the offset or the line number causing the abend. You can then identify what happened by taking a look at the compile listing and the offending line number.
Another good location to start digging around for information is the CEEDUMP. This will give you a dump of all the working storage and other variables that were in play when the abend happened. This is a good place to stop and search since it can give you pretty much everything, if not all, to debug the issue.
If this were a DB2 related error, one of the first steps to look for is the SQLCA statement in the CEEDUMP. As you may remember, the SQLCA (SQL Communication Area) contains a set of variables that are populated after each and every SQL call. The SQLCODE will give you the result of the SQL statement that was executed. A zero return code indicates that the SQL was executed successfully, a negative return code indicates there was an error, and a positive return code indicates that the SQL returned a warning. A +100 indicates that there are no more records to be selected or fetched. For a detailed list of the different SQLCODEs, please click this link, which will take you to the IBM DB2 website.
Another good place to look for detailed information, especially if you have a resource unavailable condition (-904), a deadlock situation (-911), or a timeout situation (-913), is the DB2 Master Log. You can see this on SDSF subsystem, and the naming convention of the DB2 Master Log is
If your job was down with a deadlock (-911 SQLCODE), you would see a message/messages similar to the one below. In the example below, the job (or in this example the IMS started task) IMSXMX01 ended abnormally because the plan QREMLPCU was deadlocked with another plan QREMLPC1 causing the job/started task IMSXMX01 to go down. The job IMSXMX01 would have received a -911 SQLCODE causing it to abnormally end. If you restart the abended job when the job using plan QREMLPC1 is complete, it should go to completion without any issues, unless of course, someone else started to use it, and the cycle continues.
DSNT375I -DB2T PLAN=PLAN0001 WITH 824
CORRELATION-ID=0064PLAN0001
CONNECTION-ID=IMSX
LUW-ID=CDN.DXXXX511.C1D49A76FB35=76163
THREAD-INFO=CSQQTRMN:*:*:*
IS DEADLOCKED WITH PLAN=PLAN0002 WITH
CORRELATION-ID=0008PLAN0002
CONNECTION-ID=IMSX
LUW-ID=CDN.DXXXX511.C1D49A76F1D7=76162
THREAD-INFO=CSQQTRMN:*:*:*
ON MEMBER DB2T
DSNT501I -DB2T DSNILMCL RESOURCE UNAVAILABLE 825
CORRELATION-ID=0064PLAN0002
CONNECTION-ID=IMSX
LUW-ID=CDN.DXXXX511.C1D49A76FB35=0
REASON 00C90088
TYPE 00000302
NAME Database.SpaceNam .X'700338'
DSN3201I -DB2T ABNORMAL EOT IN PROGRESS FOR 834
USER=CSQQTRMN CONNECTION-ID=IMSQ CORRELATION-ID=0064PLAN0002
JOBNAME=IMSXMX01 ASID=01A9 TCB=007B9490
If your job was down with a timeout (usually a -913 SQLCODE), you would see a message/messages similar to the one below. This error should disappear once you restart your abended job when the plan is released.
DSNT376I -DB2T PLAN=DSNRRSAF WITH 896
CORRELATION-ID=CORRELATION0
CONNECTION-ID=RRSAF
LUW-ID=CDN.DXXXX511.C1D49662F1CF=74270
THREAD-INFO=AB00002:*:*:*
IS TIMED OUT. ONE HOLDER OF THE RESOURCE IS PLAN=RRSAF WITH
CORRELATION-ID=CORRELATION0
CONNECTION-ID=RRSAF
LUW-ID=CDN.DHXXX511.C1D4961EF89E=71580
THREAD-INFO=AB00001:*:*:*
ON MEMBER DDT2
DSNT501I -DB2T DSNILMCL RESOURCE UNAVAILABLE 897
CORRELATION-ID=CORRELATION0
CONNECTION-ID=RRSAF
LUW-ID=CDN.DXXXX511.C1D49662F1CF=286685
REASON 00C9008E
TYPE 00000302
NAME DSNDB06 .SYSDBASE.X'014A7E'
DSNT501I -DB2T DSNIDBET RESOURCE UNAVAILABLE 557
CORRELATION-ID=IMSX
CONNECTION-ID=DB2CALL
LUW-ID=CDN.DHIPX511.C1D4BC05AA17=0
REASON 00C90097
TYPE 00000200
NAME Database.SpaceNam
There are multiple reasons for a -904. For each Reason Code, there are different steps to be followed. Usually, a Quick Reference (QW) on the Reason Code, or a Google search on the Reason Code will give you pointers on the detailed steps to fix the error situation.
The system log is another place to get a chronological sequence of information leading up to your abend. You can browse the system log by using the keyword “LOG” on the SDSF command line.
I hope the above set of examples have given some insight into debugging DB2 problems. In my next blog, I will cover basic IMS debugging techniques.
If you want additional information, or would like to see specific topics to be covered, please contact me via email at pkganapathi@yahoo.com.
1 comment:
Highly informative and well explained! I am happy that i stumbled upon this article. I got to share this with my teammates.
Post a Comment