Date and Time in Dataset Names vs GDGs (Generation Data Groups)

Recently we were discussing the difficulties of determining backup times for IBM mainframe restore jobs.

For GDG dataset names, the date and time of creation can be found in the tape management database. Another option is to use date and time system symbols in the dataset names instead of GDGs.

Since the default date and time system symbols show UTC (GMT), it is more convenient to use the local date and time symbols such as &LYYMMDD and &LHHMMSS.

GDG backups can be run more than once in the same backup window. If the date symbol is used alone, without the time, it is not suitable for repeated runs, because a rerun would produce a duplicate dataset name. In this case, it is necessary to check the datasets before each run and UNCATALOG any with the same name. The cartridges will be scratched later during tape management expiration processing.
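The pre-run check and UNCATALOG can be sketched as an IDCAMS step like the following (the dataset name is a hypothetical placeholder; DELETE NOSCRATCH removes only the catalog entry):

```jcl
//CHKDEL   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* Uncatalog a leftover backup from an earlier run in the    */
  /* same window; NOSCRATCH removes only the catalog entry,    */
  /* the cartridge is scratched later by expiration processing */
  DELETE 'BACKUP.DAILY.D210101' NOSCRATCH
  /* A missing entry is not an error for this job              */
  IF LASTCC LE 8 THEN SET MAXCC = 0
/*
```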

In our sample, we specified the backup dataset name in a SET JCL statement and used it on the following DD statement. For system symbols to work in job streams, the JES2 JOBCLASS definition must specify SYSSYM=ALLOW. We assigned one of the initiators to class S and then ran our tests.
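A minimal sketch of such a job stream follows. The dataset qualifiers, job class and unit name are illustrative assumptions, not our production names; note the double period after &LYYMMDD, where the first period ends the symbol and the second is a literal:

```jcl
//BACKUPJ  JOB (ACCT),'DAILY BACKUP',CLASS=S
//* CLASS=S is a JES2 job class defined with SYSSYM=ALLOW
//* &LYYMMDD and &LHHMMSS resolve to the local date and time
// SET BKPDSN=BACKUP.DAILY.D&LYYMMDD..T&LHHMMSS
//DUMP     EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE     DD DSN=&BKPDSN,DISP=(NEW,CATLG,DELETE),
//            UNIT=TAPE,RETPD=2
//SYSIN    DD *
  DUMP DATASET(INCLUDE(PROD.**)) -
       OUTDDNAME(TAPE)
/*
```

Because the resolved name carries both date and time, the same job can be rerun within the backup window without a duplicate-name clash.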

Datasets with a retention period (RETPD) of 2 days will be displayed in panel 3.4 until expiration processing is performed by the tape management system.

We also prepared and tried a job that lists information about the backup without RESTORE.

This job can be used on both GDG and date/time specified dataset names.

Using the PARM=’TYPRUN=NORUN’ parameter on the EXEC statement, the list of datasets together with the date and time the backup was run is displayed, and the actual restore is never performed. Only ADR031I and ADR040I messages are issued.
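A sketch of such a listing job (the dataset name is again a hypothetical placeholder; for a GDG you would code a relative generation instead):

```jcl
//LIST     EXEC PGM=ADRDSSU,PARM='TYPRUN=NORUN'
//SYSPRINT DD SYSOUT=*
//TAPE     DD DSN=BACKUP.DAILY.D210101.T123456,DISP=OLD
//SYSIN    DD *
  RESTORE DATASET(INCLUDE(**)) -
          INDDNAME(TAPE)
/*
```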


IBM Mainframes High Availability Concepts and Sysplex


High availability can be achieved in the IBM mainframe environment by utilizing Parallel Sysplex. Parallel Sysplex is a clustering technology which enables the total capacity of multiple processors to be applied against common workloads as if they were part of a single computer (see IBM Z Connectivity Handbook).

Although Parallel Sysplex implies additional hardware costs, adds another level of complexity to the software environment and does not deliver additional end user facilities, more and more enterprises go for it. A well-managed, properly configured Parallel Sysplex can decrease planned outages by 90%, decrease unplanned outages by 10% and provide near-continuous availability (see SHARE San Jose 2017, Parallel Sysplex: Achieving the Promise of High Availability, Session 20697).

Continuous operations and high availability are components of continuous availability. High availability masks unplanned outages from end users; continuous operations mask planned outages from end users. Continuous operations are achieved through non-disruptive hardware and software configuration changes together with coexistence. High availability is achieved through fault tolerance, automated failure detection, recovery and reconfiguration. As a result, continuous availability masks both types of outages from end users.

Installations with highly redundant sysplex configurations may still fail to meet availability objectives. To maintain availability objectives, it is necessary to investigate all components in …


Global Resource Serialization for zOS – To Serialize or Not To Serialize


This article summarizes the concepts of GRS in IBM zOS mainframe systems, explaining how data sharing is performed in a multi-system environment. It also illustrates the design specifications of related products. Finally, operational aspects of the different topologies are described.


As data sharing is a vital issue in the IBM system z mainframe environment, there is a separate component called GRS (Global Resource Serialization) for this purpose (1). In today’s multitasking and multiprocessing mainframe environments, users, transactions, tasks, programs, processes and jobs, whatever the units of work are, compete for access to resources. A topology to coordinate these accesses is required; otherwise integrity exposures may occur.

Before going deeper into these topologies, let me define “resource” first: data sets (files, in IBM system z mainframe terms), records of data sets, database rows, database fields, in-storage table entries; in short, any object subject to update in a multiuser, multiprocessing environment.

Please note that the access scope is also important. Some tasks just read the information in the resource, while other tasks update it. The first type of access is called “shared access”, the second type “exclusive access”. (2)
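In zOS terms, these two access types correspond to the shared and exclusive options of the ENQ macro. A rough assembler sketch (the QNAME/RNAME values are illustrative, not reserved names):

```
         ENQ   (QNAME,RNAME,S,8,SYSTEMS)   shared (read) access
*        ... read the resource ...
         DEQ   (QNAME,RNAME,8,SYSTEMS)     release it
*
         ENQ   (QNAME,RNAME,E,8,SYSTEMS)   exclusive (update) access
*        ... update the resource ...
         DEQ   (QNAME,RNAME,8,SYSTEMS)
*
QNAME    DC    CL8'FXRATES '               major name (QNAME)
RNAME    DC    CL8'EURUSD  '               minor name (RNAME)
```

The SYSTEMS scope asks for serialization across all systems in the complex, which is exactly where the topologies discussed below come into play.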


You can make a foreign exchange rates analogy at this point. Suppose that the foreign exchange rates in a core banking application are kept in records of a data set. Most transactions “read” the rate and proceed. When an “update” is required, all read-only accesses should be stopped (delayed, postponed), the update transaction executed, and read-only transactions then allowed to “read” the exchange rates again. Any design flaw will make foreign exchange transactions wait forever, time out and/or collapse, or even worse, allow foreign exchange transactions at cheaper or more expensive rates.

This is the easy part if you are in a single-system environment. Now suppose there is more than one operating system image, with transactions running on each system accessing the same resources. If no further precautions are implemented, this topology is called a shared DASD (Direct Access Storage Device) environment. Disks are shared. When a system accesses a data set on a disk volume for update, the whole disk volume is “reserved”; no other data set on it can be accessed from the other systems. When the update is completed, the disk volume is “released” and can be used by the other systems (3).

Until the 1980s this implementation was adequate for multi-system environments. As clustering activity soared among IBM mainframes to increase availability, it became insufficient. IBM mainframes were becoming members of clustered structures, and these single-system-image structures required data to be shared extensively. All members were connected to one another using high-speed CTCA (Channel To Channel Adapter) links. IBM called this the “ring” topology. When a resource was to be accessed by one of the systems, that system sent the type and name of the resource to all other systems, and every system put this type/name pair in its own queue. This was called “ENQueuing”. When a transaction or task finished with the resource, the type and name were again sent to all other systems, and they removed the resource from their own queues. That was called “DEQueuing” (4).

As you can see, there are two deficiencies in this design. First, the resource name and type had to travel to all systems before the resource could be accessed by any system; second, the resource-related information had to be stored separately in every system. These were time- and storage-consuming issues.

The type of a resource is called the QNAME (Queue name, or Major name), and the name of a resource the RNAME (Resource, or Minor name). The message carrying both pieces of information is called the RSA (Ring System Authority) message. The lists of defined resources are named RNLs (Resource Name Lists).

IBM then took this cluster idea seriously and called it the base “sysplex”, inspired by the phrase SYStems comPLEX. After this, IBM introduced devices named coupling facilities to store and transmit data much faster, integrated them into sysplexes to allow members to share data, and named the resulting clusters “parallel sysplex”. Now it was possible for each member system to query resource information in CF (Coupling Facility) structures and send it if it was not there. This topology was called the “star” topology. Star topology does not have the deficiencies of ring topology: sending data is fast, and storage is not duplicated (5).

In star topology, enqueues are faster than in ring topology. But ring topology is the only possible choice for non-parallel-sysplex zOS systems, and the more recently used channels which we call FICON CTCAs are not supported.

But not all mainframe users implemented parallel sysplexes. Data centers using shared DASD and/or base sysplex without data sharing continued to use ring topology. In this ecosystem, the MIM (Multi-Image Manager) Data Sharing for zOS product from CA (Computer Associates) continued to be used. Customers liked the ease of implementation of the product’s DASDONLY mode: it did not rely on the systems being connected via CTCA. Instead, there was a shared data set accessible by all systems in the MIMPLEX (MIM comPLEX), and each system accessed the shared data set within a fraction of a second, adding and deleting its own resources (6).

Let me get back to the concepts of GRS, which is the under-the-hood engine for data sharing on zOS. Even if a third-party data sharing product is being used, GRS is initialized at IPL (Initial Program Load) time and is active at all times. GRS has its own storage management component, and one large storage block is allocated during initialization.


After the GRS complex is started successfully, no operator intervention is required. If a contention or serialization problem occurs, either GRS or some other zOS component will detect it and notify the operator. Since the systems in a star complex must match the systems in the parallel sysplex, operation is even simpler and more straightforward.

Operators may display the status of systems in the GRS complex, change resource names and types dynamically, remove member systems if they restart them, and notify the other members of the restart.
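A few of the console commands behind these actions, as a sketch (see reference (1) for the full syntax; the RNL suffix 02 is an illustrative example):

```
D GRS,SYSTEM            status of the systems in the GRS complex
D GRS,C                 outstanding resource contention
D GRS,RES=(SYSDSN,*)    holders of and waiters for SYSDSN resources
SET GRSRNL=(02)         switch dynamically to the RNLs in member GRSRNL02
```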


(1) IBM zOS MVS Planning: Global Resource Serialization
(2) IBM zOS Basic Skills, zOS concepts, Serializing the use of resources
(3) SHARE, The Basics of GRS: An overview of GRS ENQ processing
(4) Middle East Technical University, CENG 497 Introduction to Mainframe Architectures and Computing
(5) IBM Introduction to the New Mainframe, z/OS Basics, pp 92 – 97
    IBM Redbooks, ABCs of z/OS System Programming: Volume 5, Chapter 4
(6) CA MIM Resource Sharing Overview


Dynamic Workload Management in IBM System z Mainframe Environment


The workload manager (WLM) employed in the IBM System z environment is one of the most sophisticated performance monitoring and control tools in the IT industry. In this article, WLM concepts and facilities are first summarized. After going over the evolution from machine resource level specifications to customer service level specifications, some advanced characteristics are explained in more detail.


One of the strengths of the System z mainframe environment is the ability to run multiple workloads at the same time. This is performed by dynamic workload management, implemented in the WLM component of the zOS operating system.

The idea of zOS WLM is to make a service level agreement (SLA) between service users (IT customers) and the service provider (zOS operating system). The installation classifies the work running on the zOS operating system in distinct service classes and defines goals for them that express the expectation of how the work should perform. WLM uses these goal definitions to manage the work across all systems of a SYSPLEX (SYStems comPLEX, which is zOS clustering facility) environment.

In earlier versions of mainframe operating systems, performance parameters did not reflect user expectations. They were implemented separately in the operating system and in the subsystems, which had their own separate terminologies for similar concepts and their own separate control procedures. Those uncoordinated policies and procedures were not defined in terms of customer expectations; all parameters and controls were defined in terms of system resources. This type of workload management was called compatibility mode. The components of compatibility mode were the Installation Performance Specifications (IPS) and Installation Classification Specifications (ICS) PARMLIB members.

From 2000 onwards, the workload management focus shifted from tuning at the system resource level to defining performance expectations. This brought goal mode workload management, with fewer and simpler tuning parameters reflecting customer expectations. Goal mode WLM does not use low-level system tuning parameters any more; performance goals are defined directly from the customer SLA.

In order to understand how WLM works, we can look at the traffic in cities. When we drive a car from one location to another, the time is determined by the speed we are allowed to go, the amount of traffic on the streets, and the traffic regulations at crossings and intersections. Time is thus the basic measure for going from a start point to a destination point. We can easily see that the amount of traffic is the determining factor in how fast we can travel between these two points. While the maximum driving speed is constant, the number of cars on the street determines how long we have to wait at intersections and crossings before we can pass through them. As a result, we can identify the crossings, traffic lights, and intersections as the points where we encounter delays on our way through the city.

A traffic management system can use these considerations to manage the running traffic. For example, dedicated lanes for public buses allow them to pass the waiting traffic during rush hours and therefore to travel faster than passenger cars. This is a common model of prioritizing a certain type of vehicle against others in the traffic systems. So far, we see that it is possible to use the time as a measure between two points for driving in a city, we identified the points where contention might occur, and we found a simple but efficient method of prioritizing and managing the traffic. Now we will go over WLM concepts. They are similar and analogous to traffic management systems.

Goal Mode WLM Concepts and Facilities

Managing mainframe workloads in a zOS system or across multiple systems in a Parallel SYSPLEX requires the end user to formulate their expectations as goals in different categories and associate goals with each category. This is known as workload classification. Workloads are classified into distinct service classes that define the performance goals to meet the business objectives and end-user expectations.

Classification is based on application-specific or middleware-specific qualifications that allow WLM to identify different work requests in the system. For example, a service class can be defined for batch programs or for all requests accessing DB2 stored procedures. During runtime, the subsystems of the zOS operating system, middleware components and applications inform WLM when new work arrives, through programming interfaces that all operating system and major middleware components exploit. The interfaces allow a work manager to pass the classification attributes that the component supports to WLM. WLM associates the new work request with a service class based on the user definitions and starts to manage the work request toward the affiliated goal definition.

A performance goal belongs to a certain type with an importance level between 1 and 5. That number tells WLM how important it is that the work in this service class meets its goal, if it does not already. WLM makes every attempt to ensure that work with a level 1 service class is meeting its goal before moving to work in importance level 2 and so forth down to level 5.

Goal type expresses how the end user wants to see the work perform in a service class. Three types of goals exist:

  1. Response Time: This can be expressed either as “Average Response Time” (for example, all work requests in the service class should complete, on average, in one second) or as “Response Time with a Percentile” (for example, 90 percent of all work requests in a service class should complete within 0.8 seconds). Using a response time goal assumes that the applications or middleware tell WLM when a new transaction starts. This is the case when the component supports the WLM Enclave services, and it is possible for CICS and IMS and for subsystems like the Job Entry Subsystem (JES), TSO (Time Sharing Option) and UNIX System Services. Using a response time goal also requires that a certain number of work requests constantly arrive in the system.

  2. Execution Velocity: Response time is not appropriate for all types of work, for example address spaces that do not exploit WLM services, or long-running transactions with no defined end. To manage those workloads, a formula measures a transaction’s velocity as a number between 1 and 99, quantifying how much time the work spends waiting for system resources. A higher velocity means fewer delays have been encountered.

  3. Discretionary: Assign this goal to work that can run whenever the system has extra resources. Discretionary work is not associated with an importance level, so it accesses resources only when the requirements for all work with an importance level can be satisfied.

WLM Periods

A service class can also be subdivided into multiple periods. As the work consumes more resources, it may “age” from one period to another, at which point it can be assigned a new goal and importance level. The resource consumption is expressed in Service Units, a zOS definition for work consuming resources. For example, a service class for batch jobs can be broken into three periods. The first period has a “high” execution velocity goal of 50 and an importance level of 3 for the first 2,500 Service Units. Work that takes longer would go into the second period with a defined execution velocity of 20 and an importance level of 5 for the next 10,000 Service Units. Finally, long-running batch jobs age into the third period, which is associated with a discretionary goal.

We can now classify the work and define its performance goals. The subsystems, middleware components and applications use WLM services to inform WLM when new work arrives. They also summarize work requests into uniquely identifiable entities that can be monitored and managed.

WLM constantly collects performance data by service class, compares the results with the goal definitions and changes the access to the resources for the work entities contained in the service classes based on goal achievement and demand. Data collection occurs every 250 milliseconds with goal and resource adjustment executing every 10 seconds.

WLM calculates a performance index (PI) to determine whether a service class is meeting its goal. For a response time goal, the PI is the quotient of the actually achieved response time divided by the goal value; for an execution velocity goal, it is the defined value divided by the value measured in the system. If the PI is less than one, the goal is overachieved; if the value is greater than one, the service class misses its goal. If a service class does not achieve its goal, WLM attempts to give that service class more of the specific resource it needs. As several service classes may be competing for the same resource, however, WLM must perform a constant balancing act, making trade-offs based on the business importance and goals of the different work types.
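As a worked example of the two formulas (the goal and measured values are made up for illustration):

```
Response time goal:  goal 1.0 s, achieved 1.2 s  ->  PI = 1.2 / 1.0 = 1.2  (goal missed)
Velocity goal:       defined 40, measured 50     ->  PI = 40 / 50  = 0.8  (goal overachieved)
```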

Whenever WLM decides to make a change, the current goal adjustment is completed and the system is monitored for the next 10 seconds while WLM assesses whether additional changes are required. If no change was possible, WLM may look for another service class to help, or attempt to help the selected service class with another resource.

The service class and goal definitions are part of the service definition for the entire Parallel SYSPLEX. The service definition can be subdivided into multiple service policies that allow you to dynamically change the performance goals of the service classes and have the same service definition with adapted goals in effect for certain times. For example, you may have different goals for your batch service classes during the night shift when primarily batch work runs in the system than you do during the day when the focus is online work.

Advanced WLM Characteristics

Some advanced WLM characteristics to mention are Intelligent Resource Director (IRD), dynamic channel path ID (CHPID) management, some small product enhancements (SPE) and dynamically managed batch initiators.

IRD is an important manageability enhancement. It is composed of Parallel SYSPLEX, PR/SM and WLM technologies. IRD processes work in a clustered environment in a new way: rather than distributing work to the systems, the resources are moved to where the work runs. Systems IPL’ed (Initial Program Load) in logical partitions (LPARs) on the same central processor complex (CPC) that belong to the same Parallel SYSPLEX and are running zOS form an LPAR cluster. Within such a cluster, the CPU weight (which specifies how much CPU capacity is guaranteed to an LPAR if demanded) can be shifted from one LPAR to another while the overall LPAR cluster weight remains constant.

When all LPARs are busy, their current weights are all enforced. WLM can initiate a weight adjustment in favor of one system to help the most important work not meeting its goals. Along with CPU weight management, IRD can also manage the number of logical CPUs for a system. The LPAR overhead can be high when the logical-to-physical CPU ratio in a CPC exceeds 2.0. To avoid this, WLM manages the number of logical CPUs to be close to the number of physical CPUs that can be utilized based on the partition’s current weight.

Dynamic CHPID management is based on the same idea, namely managing a set of channel paths (also called managed channels) as a pool of floating channels and assigning those channel paths dynamically to DASD control units based on the importance of the work doing the I/O and the channel path delay measured for this work. Dynamic CHPID management not only allows the system to react on changing I/O patterns based on business importance, but also helps reduce the impact of the 256 channel paths limit: fixed channel paths need only be defined for availability considerations, while additional channels can be added or removed dynamically to meet the business objectives and increase the overall I/O performance based on the workload demand.

IRD also allows control of the I/O priority from its start point within the zOS operating system, through the channel subsystem, until the request is processed within the storage controller itself. Within the zOS operating system, an I/O request is queued in front of a particular device control block; the I/O priority determines the position of a new I/O request within that queue. When the request is processed, it flows through the channel subsystem, where I/O priority queuing allows an installation to define the partition’s I/O priority relative to other partitions in the CPC. Within the storage controller, it is again the priority of the I/O request that determines how fast the data is accessed from disk if the request cannot be satisfied from the cache.

As an SPE, the scope of an LPAR cluster extends and allows it to include non-zOS members, particularly system images running Linux for System z. CPU weight can be shifted from zOS images to non-zOS images and vice versa. Non-zOS images are defined as work assigned to a service class in the WLM service definition and managed toward a velocity goal.

WLM introduced dynamically managed batch initiator address spaces. Instead of tuning the initiator address spaces manually, WLM-managed batch initiators let the system determine the optimal number of initiator address spaces.

Each batch job is associated with a job class that is either in JES or WLM mode. Accordingly, there are two types of initiators: JES-managed initiators selecting work from job classes in JES mode and WLM-managed initiators selecting work from job classes in WLM mode. Operators have full control over the JES-managed initiators but no control over WLM-managed initiators. When a job becomes ready for execution, an idle initiator associated with that job class selects the job. If no initiator exists or all initiators for that class are busy, the job must wait until an initiator becomes available.

That wait, or queue, time is factored into the actual goal achievement of the work, as part of the overall response time or as an execution delay in the case of velocity goals. If queue delay becomes a bottleneck and goals are no longer reached, WLM determines whether it can help the batch work by starting additional initiators. WLM calculates the maximum number of initiators that can be started to improve goal fulfillment without this being at the expense of higher-importance work. It also selects a system on which to start the new initiators: systems with available capacity first, then systems with enough displaceable capacity. If more initiators are available than needed to process the batch work, WLM stops the initiators or assigns them to other job classes in WLM mode.

The most extensive reporting enhancement since the introduction of goal mode in MVS 5.1 is the implementation of report class periods. Before zOS, report classes were simple containers for any kind of transaction, from those managed toward different velocities or different response time goals to those originating in different subsystems. This made general reporting possible, but not in the specific cases where reporting makes sense only for a homogeneous set of transactions: response time distributions and subsystem work manager delays. Many installations added service classes to work around the reporting deficiencies of report classes, but service classes should only be used for management purposes; we recommend using no more than 25 active service class periods at any time in a system.

To solve this dilemma, WLM implemented report class periods where the actual period number is dynamically derived from the service class period in which a transaction is currently running. Even then, there is still the possibility of mixing different kinds of transactions. However, WLM can track the transactions attributed to a particular report class, informing performance monitors such as the Resource Measurement Facility (RMF) about this when they request WLM reporting statistics. A homogeneous report class has all transactions attributed to it associated with the same service class. A heterogeneous report class has at least two transactions attributed to it that are associated with different service classes. A performance monitor can determine the homogeneity of the report class within the reporting interval and support the additional reporting capabilities.

Usually, response time distributions can only be generated for a report class when it reports on transactions managed by one service class with a response time goal. With a little trick, however, WLM can maintain the response time distribution for CICS and IMS workloads even though the CICS and IMS regions are managed towards execution velocity goals: CICS/IMS transactions are classified to a report class and a service class. The service class’s response time goal is ignored for management; WLM uses it only to maintain the response time distribution for the report class. RMF obtains the data and presents it throughout its reports.

One of the operational facilities is the capability to reset work to a different service class. Operators can reset address spaces through an operator command or another console interface, such as the System Display and Search Facility (SDSF). The RESET command allows the operator to specify a different service class, quiesce an address space (and swap it out if it is swappable), or resume it in the original service class (i.e., reclassify it according to the rules in the current service policy). In this way, address spaces can be slowed down or accelerated depending on the operator’s intention.

Until zOS V1.3, nothing could be done with Enclaves once they were created. That release provides a reset capability for Enclaves similar to that for address spaces. But rather than giving operators a new command requiring entry of a somewhat unwieldy Enclave token (Enclaves have no name comparable to an address space’s job name), WLM provides an API for authorized programs to reset an Enclave. SDSF uses that API to allow operators to reset the service class of an independent Enclave, quiesce the Enclave or resume it. That support is implemented on the ENC screen of SDSF and requires SPE PQ50025.

WLM enhanced its monitoring programming interfaces to allow middleware like WebSphere to collect and report on subsystem states. This, together with enhancing the capability of WebSphere to classify work on zOS, is to improve and ease the deployment of major middleware applications on zOS.

The installation and customization of WLM has been enhanced by integrating this function into zOS Managed System Infrastructure Setup. This integration allows users to more easily adjust the size of the WLM couple data set.


Before the introduction of zOS WLM, operating systems required customers to translate their data processing goals from high-level objectives about what work needs to be done into the extremely technical terms that the system can understand. This translation requires high-skill-level staff, and can be protracted, error-prone, and eventually in conflict with the original business goals. Multi-system, SYSPLEX, parallel processing, and data sharing environments add to the complexity.

In this article we summarized how zOS WLM provides a solution for managing workload distribution, workload balancing, and distributing resources to competing workloads like CICS, IMS/ESA, JES, APPC, TSO/E, zOS UNIX System Services, DDF, DB2, SOM, LSFM, and Internet Connection Server.

We also summarized how zOS WLM component dynamically allocates or redistributes server resources such as CPU, I/O and memory across a set of workloads based on user-defined goals and their resource demand within a zOS image.


  1. z/OS V2.1 MVS Planning: Workload Management, SC34-2662

  2. IBM Redbooks, System Programmer’s Guide to: Workload Manager, SG24-6472

  3. IBM Systems Magazine, Managing Workloads in z/OS, January 2004


Business Continuity Concepts at a Glance

Expectations from today’s business environment are high availability, continuous operation and quick disaster recovery. For these reasons, enterprises and their partners try to increase their ability to respond to risks. Although it is more difficult, they also try to exploit opportunities.

There are some items an enterprise should decide on. The first of them is the acceptable level of data loss. How much data loss is acceptable? A few seconds? A few minutes? Half an hour? Data entered after the last consistent backup and before the disaster will be lost; this period is called the Recovery Point Objective (RPO). The lost data must be re-entered manually after recovery. For financial enterprises this period should not be longer than a fraction of a second.

Another period in the disaster recovery timeline an enterprise must decide on is how long it will take to restart the business after a disaster. This period is called the Recovery Time Objective (RTO).

If the RTO is measured in minutes, it will cost a fortune. If the RTO is a few hours, the cost will be lower, and the cost decreases further as the RTO is set longer.

There are several periods of time related to disaster recovery. When an unexpected event or incident occurs, the management of the enterprise should decide whether this incident is a disaster or not. Since a capital investment is needed to recover from a disaster, management approval is required. To decide whether an incident is a disaster, management needs information about the damage, so a damage assessment is required, and it takes time. It also takes some time to start the damage assessment after the incident occurs, and some further time for the disaster assessment after the damage assessment. The sum of these time intervals is called the Maximum Tolerable Period of Downtime (MTPOD) or Maximum Tolerable Outage (MTO). Beyond this point, the enterprise is out of business.

To set these periods sensibly, an enterprise should evaluate what it stands to lose in an outage and the impact on the business:

  • Lost revenue, loss of cash flow, and loss of profits
  • Loss of clients (lifetime value of each) and market share
  • Fines, penalties, and liability claims for failure to meet regulatory compliance
  • Lost ability to respond to subsequent marketplace opportunities
  • Cost of re-creation and recovery of lost data
  • Salaries paid to staff unable to undertake billable work
  • Salaries paid to staff to recover work backlog and maintain deadlines
  • Employee idleness, labor cost, and overtime compensation
  • Lost market share, loss of share value, and loss of brand image; etc.

The components that cause an outage in a disaster are servers, storage devices, network components, software, and infrastructure components such as power. Recovery can be decomposed into hardware recovery, data integrity, and data consistency.

Hardware recovery means making hardware, operating systems, and network connections available: all storage devices and network components should be up, and operating systems should be running. Data integrity can be treated as part of hardware recovery or as a separate item.

This point is a bit confusing. IT staff tend to claim that "everything is up and running", but that is not the case: from the users' standpoint, the data has not been recovered. Application and data consistency must still be attained. Applications should be recovered from the most recent version backups; database resources should be restored from the last image copy backups and log changes applied. A database restart should then be performed. The hope is that a long and unpredictable database recovery will not be needed, and that only a few incomplete logical units of work (LUWs) will have to be rolled back. After these activities, transaction integrity is restored.

Recovery Consistency Objective (RCO) is another concept related to business continuity; it focuses on the data consistency achieved after disaster recovery. An RCO of 100% means all transaction entities are consistent after recovery; any target below that means the enterprise tolerates some data inconsistency.

The single most critical factor in successful disaster recovery is avoiding labor-intensive application and database recovery of unpredictable duration. Recovery procedures should be fast, repeatable, and consistent.

Rolling disasters are another consideration in successful business continuity. A disaster may spread out over time: partial damage may not be recognized promptly, and as time passes it causes problems, but by then it may be too late. A database, for example, may be restarted, only for a long recovery to take place because of inconsistencies, with further loss of data as a result.

Some enterprises run heterogeneous platforms. In that case, the work is duplicated for every platform, and the assessment, hardware recovery, and transaction recovery times differ from platform to platform. This can be very difficult to coordinate and very complex to manage.

After disaster recovery planning is completed, it is implemented. Implementation involves setting policies, acquiring material, staffing, and testing. Periodic tests should include switching from the primary site to the secondary site and back. Once implementation is complete, periodic maintenance takes place, and frequent testing should accompany that maintenance for disaster recovery to remain successful.


Posted in IBM zEnterprise Servers, IT Service Management, Risk Management

IT Security Audits for IBM zOS Systems from Top Secret Perspective

An information technology security audit is a manual or systematic, measurable technical assessment of an information system. Manual assessments include interviewing staff, performing security vulnerability scans, reviewing application and operating system access controls, and analyzing physical access to the systems. Automated assessments, or CAATs (Computer Assisted Auditing Tools and Techniques), include system-generated audit reports and software that monitors and reports changes to files and settings on an IT system.

IBM System z data centers and the z/OS operating system are very secure environments. IT or security people who are not familiar with the System z platform may call this a kind of "security through obscurity". In fact, mainframe servers, the largest servers available, support most open-system protocols and maintain the highest levels of IT security standards available, in addition to the highest service levels of availability. For example, IBM LPARs (Logical Partitions), which let organizations run many different applications containing confidential data on one System z box, have been awarded Common Criteria EAL5 (Evaluation Assurance Level 5), the highest security certification awarded to a commercially available system.

The z/OS operating system on the System z platform has a security server called RACF (Resource Access Control Facility). The other security server option is the Top Secret program product from CA (Computer Associates). Either can be used to protect system resources. Since Top Secret is not as widely used and known as RACF, we will explain Top Secret's listing and reporting characteristics.

APF (Authorized Program Facility) authorized programs are one of the important categories of resources. These programs are cataloged in system-defined APF load libraries with an authorization code of one (AC=1). Authorized programs can switch from problem state to supervisor state and perform authorized functions.
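As a quick check during an audit, the current APF library list can be displayed from the console with standard z/OS operator commands (the library name below is only an example):

```
D PROG,APF,ALL                      list all APF-authorized libraries
D PROG,APF,DSNAME=SYS1.LINKLIB      check whether one specific library is APF-authorized
```

Comparing this live list against the PROGxx/IEAAPFxx PARMLIB definitions is a common audit step, since libraries can also be added dynamically.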

Another category of security-related resources is the IBM-supplied system utilities. They are authorized programs and can bypass security checks. IEHPROGM (data management), IEHINITT (initializing cartridge media), IEFBR14 (a dummy program) and SPZAP (super zap) are important programs whose usage should be considered in an audit.

PPT (Program Properties Table) is a system initialization table to define trusted programs. Trusted programs bypass security checking.

JES (Job Entry Subsystem) internal readers, PROCLIBs (procedure libraries), and job-related libraries are important sources of the JCL (Job Control Language) statements submitted to the system to associate programs with data sources.

Top Secret decomposes resources into facilities such as STC (started tasks), TSO (time-sharing users), BATCH (background workload), CICS (Customer Information Control System online subsystems), MQM (WebSphere message queue management), APPC (advanced program-to-program communication through SNA), OMVS (open MVS, a built-in UNIX), TCP, FTP, and MISCx (miscellaneous).

These resources are accessed by users: operators, application developers, systems support staff, and external third-party staff. Their authority to access a resource may be read-only, update, or control.

Top Secret defines users as ACIDs (Accessor IDs). Besides user ACIDs, there are functional and organizational ACIDs: profile, group, control, department, division, and zone ACIDs. Control ACIDs define authorized users. MSCA (master security control ACID) and SCA (central security control ACID) users are system-wide authorized users and resemble RACF system-special users. LSCA (limited central security control ACID), ZCA (zone), VCA (divisional), and DCA (departmental) users are like group-special users, whose authority is limited to managing the resources and members of a group, zone, division, or department.

Top Secret has audit utilities such as TSSTRACK, TSSUTIL, TSSAUDIT, and TSSCHART. It is also possible to perform an audit using the listings of Top Secret's reporting facilities. User listings show the parameters related to a user and some statistical usage information, together with the authorized facilities and their authorization levels. Group-related information can also be integrated into a user listing. Facility listings can be created with the WHOHAS and WHOOWNS commands; these show the authorization levels of the users permitted to access the specified resource.
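As a minimal sketch of how such listings are requested (the ACID and dataset names below are hypothetical), the TSS command shapes look like this:

```
TSS LIST(USER001) DATA(ALL)         list a user ACID with full detail
TSS WHOHAS DSNAME(PROD.PAYROLL)     who is permitted to this dataset prefix
TSS WHOOWNS DSNAME(PROD.)           which ACID owns this dataset prefix
```

An auditor would typically capture such listings for a sample of users and for the sensitive resources discussed above (APF libraries, system utilities, PROCLIBs).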

Performing a security audit on a z/OS shop that uses Top Secret as its security product is as straightforward as performing one on a z/OS shop that uses RACF.

Posted in IBM zEnterprise Servers, Risk Management

BMC Control-M and Control-O Products on zOS

Job scheduling and console automation are very important operational characteristics of high-end enterprise data centers. Control-M and Control-O are nice pieces of BMC software for achieving high service levels. I will describe how to use these tools in customer installations; I believe it will be useful for consultant colleagues.

Control-M is a job scheduling solution for many platforms, including z/OS. You can start its interactive facilities with the TSO command IOAISPF, even if you do not know where the ISPF/PDF option has been inserted. In the IOA primary option menu you see the Control-M and Control-O selections; IOA (Integrated Operations Architecture) covers both products.

 ---------------------       IOA PRIMARY OPTION MENU       ------------------(1)
 OPTION ===>                                               USER        SYSUSER
                                                           DATE        28.05.13

   2   JOB SCHEDULE DEF  CTM Job Scheduling Definition
   3   ACTIVE ENV.       CTM Active Environment Display
   4   COND/RES          IOA Conditions/Resources Display
   5   LOG               IOA Log Display
   6   UTILITIES         IOA On-Line Utilities
   7   MANUAL COND       IOA Manual Conditions Display
   8   CALENDAR DEF      IOA Calendar Definition
  IV   VARIABLE DATABASE IOA Variable Database Definition Facility
   C   CMEM DEFINITION   CTM Event Manager Rule Definition
  OR   RULE DEFINITION   CTO Rule Definition
  OM   MSG STATISTICS    CTO Message Statistics Display
  OS   RULE STATUS       CTO Rule Status Display
  OL   AUTOMATION LOG    CTO Automation Log Display
  OA   AUTOMATION OPTS   CTO Automation Options Menu
  OC   COSMOS STATUS     CTO COSMOS Status Screens

 COMMANDS:  X - EXIT, HELP, INFO  OR CHOOSE A MENU OPTION               10.12.53

Using Control-M, you can start a job stream at a given time every day, run jobs periodically one after the other, on the last Saturday of a month, from 9 to 5, and so on. First select the "Schedule Definition" facility; a schedule library will be displayed.



    TABLE   ===>                     (Blank for table selection list)
    JOB     ===>                     (Blank for job selection list)

    TYPE OF TABLE         ===>       ( J  Job - default
                                       G   Group - for new tables only)


 USE THE COMMAND SHPF TO SEE PFK ASSIGNMENT                             10.14.03

Press Enter to accept it. All tables in that schedule library will be displayed.

 LIST OF TABLES IN IOA6304.CTM.SYSA.SCHEDULE                    -------------(2)
 COMMAND ===>                                                    SCROLL===> CRSR
 OPT  NAME -------- VV.MM  CREATED         CHANGED     SIZE  INIT   MOD   ID
      IDGS1         01.01 2010/04/15 2010/04/15 14:49    30    21     0 SYSUSER
      ITABLE1       01.06 2013/05/24 2013/05/27 08:23     9     9     0 SYSUSER
      IVP0          01.00 2009/09/11 2009/09/11 16:11    35    35     0 USR1006
      IVP1          01.00 2009/09/11 2009/09/11 16:11    71    71     0 USR1006
      IVP2          01.00 2009/09/11 2009/09/11 16:11    65    65     0 USR1006
      IVP3          01.00 2009/09/11 2009/09/11 16:11    85    85     0 USR1006
      IVP4          01.00 2009/09/11 2009/09/11 16:11    98    98     0 USR1006
      IVP5          01.00 2009/09/11 2009/09/11 16:11    62    62     0 USR1006
      IVP6          01.00 2009/09/11 2009/09/11 16:11   194   194     0 USR1006
      MAINDAY       01.00 2009/03/17 2009/03/17 12:00    53    53     0 IOA6304
      MAINDAY0      01.02 2009/03/17 2010/02/18 16:33    53    53     0 USR1006
      SYSYPRJ       01.09 2011/04/07 2012/04/10 09:45    44    11     0 SYSUSER
      SYSYPRJ0      02.98 1999/02/04 2010/04/07 15:03    58    10     0 SYSUSER
  ====== >>>>>>>>>>>>>>>>>>>    NO MORE TABLES IN LIBRARY   <<<<<<<<<<<<<<<< ===


Select a table by typing "S" in front of it. To create a table, type "S" and a non-existent table name on the command line. Browse through some existing members in the tables first, and then create your first scheduled job.

After defining a job, type "O" in front of it to order (activate) it. The job will be started when its time comes.

You can see job status in the "Active Environment Display" option. Most of the time, the status of each job is either "Wait Schedule" or "Ended OK".

Note that the job will not execute automatically the next day. If you would like it to execute every day, you have to add its table to the DAJOB concatenation of the CTMDAY procedure. The CTMDAY procedure runs some time after midnight, and all jobs in the tables of the DAJOB DD definition are rescheduled. Once you add a table to the CTMDAY procedure, you can add new jobs to that table, and all jobs in it will be rescheduled every day.
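For illustration, the relevant fragment of the CTMDAY procedure might look like the sketch below. The schedule library and member names are taken from the panels shown earlier and are only examples; check your installation's own copy of the procedure:

```
//* Fragment of the CTMDAY procedure: tables concatenated under DAJOB
//* are reordered every new day (names are illustrative)
//DAJOB    DD DISP=SHR,DSN=IOA6304.CTM.SYSA.SCHEDULE(MAINDAY)
//         DD DISP=SHR,DSN=IOA6304.CTM.SYSA.SCHEDULE(ITABLE1)   <- your table
```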

If you would like to remove the job before it executes, first "hold" it and then "delete" it in the "Active Environment" option.

 Filter:           ------- CONTROL-M  Active   Environment ------ UP     D  - (3)
 COMMAND ===>                                                    SCROLL ==> CRSR
 O Name     Owner    Odate  Jobname  JobID   Typ ----------- Status ------------
 ==================        Top  of  Jobs  List       =================
   DAILYPRD PRODMNGR 280513                  JOB Wait Schedule
   DAILYSYS SYSTEM   280513                  JOB Wait Schedule
   IOALDNRS PRODMNGR 280513                  JOB Wait Schedule
   IOACLCND PRODMNGR 280513                  JOB Wait Schedule
   MQPTDEF  USR1022  280513 MQPTDEF /03587   JOB Ended "OK"
   BKPWKY   USR1002  280513 BKPWKY  /03556   JOB Ended "OK"
   IRX0080  SYSUSER  280513 SYSUSERA/03623   JOB Ended "OK"
   IRX0080  SYSUSER  280513 SYSUSERA/03625   JOB Ended "OK"
   IRX0090  SYSUSER  280513 SYSUSERA/03626   JOB Ended "OK"
 ==================        Bottom of Jobs List       =================

 Commands: OPt DIsplay Show HIstory RBal REFresh Auto Jobstat SHPF Note Table
           OPt command toggles between Commands and Options display     10.27.01

If you would like to execute a job multiple times a day, define its TASKTYPE as “CYClic” and specify “FROM TIME”, “UNTIL TIME” and “INTERVAL”.

 JOB: MYDEMO2  LIB IOA6304.CTM.LP13.SCHEDULE                     TABLE: MANUAL1
 COMMAND ===>                                                    SCROLL===> CRSR
 +---------------------------------- BROWSE -----------------------------------+
 | MEMNAME MYDEMO2     MEMLIB   PRDZ.LP13.OPERLIB                              |
 | OWNER   USR1015     TASKTYPE CYC                                            |
 | APPL                                GROUP                                   |
 | DESC    SAATLI VE TEKRARLI IS BASLATMA                                      |
 | OVERLIB                                                   STAT CAL          |
 | SCHENV                         SYSTEM ID                  NJE NODE          |
 | SET VAR                                                                     |
 | CTB STEP AT         NAME            TYPE                                    |
 | DOCMEM  MYDEMO2     DOCLIB   IOA6304.CTM.LP13.DOC                           |
 | =========================================================================== |
 | DAYS    ALL                                                   DCAL          |
 |                                                                    AND/OR   |
 | WDAYS                                                         WCAL          |
 | MONTHS  1- Y 2- Y 3- Y 4- Y 5- Y 6- Y 7- Y 8- Y 9- Y 10- Y 11- Y 12- Y      |
 | DATES                                                                       |
 | CONFCAL          SHIFT       RETRO N MAXWAIT 00  D-CAT                      |
 | MINIMUM          PDS                                                        |
 | DEFINITION ACTIVE FROM          UNTIL                                       |
 | =========================================================================== |
 | IN                                                                          |
 | CONTROL                                                                     |
 | RESOURCE                                                                    |
 | FROM TIME    0900 +     DAYS    UNTIL TIME 0935 +     DAYS                  |
 | DUE OUT TIME      +     DAYS    PRIORITY     SAC    CONFIRM                 |
 | TIME ZONE:                                                                  |
 | =========================================================================== |
 | OUT                                                                         |
 | SYSOUT OP   (C,D,F,N,R)                                              FROM   |
 | MAXRERUN      RERUNMEM                           INTERVAL 00030 M FROM END  |
 | STEP RANGE         FR (PGM.PROC)          .          TO          .          |
 | ON PGMST          PROCST          CODES                               A/O   |
 |   DO                                                                        |
 | ON SYSOUT                                          FROM     TO        A/O   |
 |   DO                                                                        |
 | SHOUT WHEN          TIME       +     DAYS      TO                  URGN     |
 |   MS                                                                        |
 | =========================================================================== |
 | APPL TYPE                                  APPL VER                         |
 | APPL FORM                                  CM   VER                         |
 | INSTREAM JCL: N                                                             |
 |                                                                             |
 ======= >>>>>>>>>>>>>>>>>>> END OF SCHEDULING PARAMETERS <<<<<<<<<<<<<<<< =====

  COMMANDS:  DOC, PLAN, JOBSTAT                                         09.00.39

Another nice tool for operations staff is Control-O; it is very useful for console automation. You can create rules that issue complicated sets of commands, such as starting traces or issuing dump commands, simply by creating user commands. You can issue commands periodically, and automate IPL (Initial Program Load) procedures and system and network restarts.

To create rules, select Control-O “Rule Definition” option from IOA primary option menu. All tables in rule library are displayed.

 TABLES OF LIBRARY IOA6304.CTO.LP14.RULES                       ------------(OR)
 COMMAND ===>                                                    SCROLL===> CRSR
 OPT  NAME -------- VV.MM  CREATED         CHANGED     SIZE  INIT   MOD   ID
      $COSMOSU      01.00 2009/03/17 2009/03/17 12:00  2383  2383     0 IOA6304
      $HASP         01.00 2009/03/17 2009/03/17 12:00    50    50     0 IOA6304
      COMMANDS      01.00 2009/03/17 2009/03/17 12:00    26    26     0 IOA6304
      COSMOS        01.00 2009/03/17 2009/03/17 12:00  3739  3739     0 IOA6304
      CTDMRULE      01.00 2009/03/17 2009/03/17 12:00     8     8     0 IOA6304
      CTMAPIF       01.00 2009/03/17 2009/03/17 12:00    21    21     0 IOA6304
      CTMMRULE      01.00 2009/03/17 2009/03/17 12:00     8     8     0 IOA6304
      CTOGATEI      01.00 2009/03/17 2009/03/17 12:00   295   295     0 IOA6304
      CTOMRULE      01.00 2009/03/17 2009/03/17 12:00     8     8     0 IOA6304
      CTOSCMD       01.00 2009/03/17 2009/03/17 12:00    33    33     0 IOA6304
      DAILY         02.05 1997/04/30 2012/05/16 09:07   117    18     0 SYSUSER
      DAILY0        01.00 2009/03/17 2009/03/17 12:00     9     9     0 IOA6304
      DELQUEUE      01.26 2001/02/26 2011/01/11 09:39   373   241     0 SYSUSER
      DEVICE        01.00 2009/03/17 2009/03/17 12:00    29    29     0 IOA6304
      EVENTS        01.00 2009/03/17 2009/03/17 12:00    24    24     0 IOA6304
      IEA           01.01 2009/03/17 2011/08/10 14:10    47    38     0 SYSUSER
      IEC           01.00 2009/03/17 2009/03/17 12:00     8     8     0 IOA6304
      INFOMAN       01.00 2009/03/17 2009/03/17 12:00    49    49     0 IOA6304
      IOAVARS       01.00 2009/03/17 2009/03/17 12:00    64    64     0 IOA6304
      IOS           01.00 2009/03/17 2009/03/17 12:00    12    12     0 IOA6304
      LOGBATS       01.01 2010/09/29 2010/09/29 16:22    10    10     0 SYSUSER
      LOGTEST       01.02 2010/09/28 2012/01/20 15:41    11    10     0 SYSUSER
      NETMAN        01.00 2009/03/17 2009/03/17 12:00    77    77     0 IOA6304
      REPJHIST      01.00 2009/03/17 2009/03/17 12:00    42    42     0 IOA6304
 OPTIONS: S SELECT  O ORDER  F FORCE  B BROWSE  D DELETE                15.49.43

First browse the existing rules and then create your own rule.

In this example, we will create a user console command, DISPCONS. First the DISPLAY C command is issued; after a 5-second wait (WAIT 0005), the DISPLAY T command is issued.

RL: DISPCONS   LIB IOA6304.CTO.LP14.RULES                       TABLE: MYDEMO1
 COMMAND ===>                                                    SCROLL===> CRSR
 | ON COMMAND  = DISPCONS                                                      |
 |    JNAME          JTYPE      SMFID        SYSTEM          USERID            |
 |    ROUTE          DESC       CONSOLEID    CONSOLE                           |
 |    APPEARED       TIMES IN      MINUTES                        And/Or/Not   |
 | OWNER SYSUSER  GROUP                         MODE PROD    RUNTSEC           |
 | THRESHOLD                                                                   |
 | DESCRIPTION DISPLAY CONSOLE INFORMATION                                     |
 | DESCRIPTION                                                                 |
 | =========================================================================== |
 |    SYSTEM                                                                   |
 | DO COMMAND  = DISPLAY C                                                     |
 |    WAIT              CONSOLEID    CONSOLE          SYSTEM                   |
 |    WAITMODE   N                                                             |
 | DO COMMAND  = DISPLAY T                                                     |
 |    WAIT       0005   CONSOLEID    CONSOLE          SYSTEM                   |
 |    WAITMODE   N                                                             |
 | DO                                                                          |
 | =========================================================================== |
 | DAYS                                                          DCAL          |
 |                                                                    AND/OR   |
 | WDAYS   ALL                                                   WCAL          |
 | MONTHS  1- Y 2- Y 3- Y 4- Y 5- Y 6- Y 7- Y 8- Y 9- Y 10- Y 11- Y 12- Y      |
 | DATES                                                                       |
 | CONFCAL          SHIFT                                                      |
 | ENVIRONMENT SMFID      SYSTEM                                               |
 | =========================================================================== |
 | IN                                                                          |
 ======= >>>>>>>>>>>>>>> END OF RULE DEFINITION PARAMETERS <<<<<<<<<<<<<<< =====

Another example would be to issue the DISPLAY A,L command periodically (INTERVAL 060):

 RL: DISP-ALL   LIB IOA6304.CTO.LP14.RULES                       TABLE: MYDEMO1
 COMMAND ===>                                                    SCROLL===> CRSR
 | ON EVENT    = DISP-ALL                                                      |
 | OWNER SYSUSER  GROUP                         MODE PROD    RUNTSEC           |
 | THRESHOLD                                                                   |
 | DESCRIPTION                                                                 |
 | =========================================================================== |
 | DO COMMAND  = DISPLAY A,L                                                   |
 |    WAIT              CONSOLEID    CONSOLE          SYSTEM                   |
 |    WAITMODE   N                                                             |
 | DO                                                                          |
 | =========================================================================== |
 | DAYS    ALL                                                   DCAL          |
 |                                                                    AND/OR   |
 | WDAYS                                                         WCAL          |
 | MONTHS  1- Y 2- Y 3- Y 4- Y 5- Y 6- Y 7- Y 8- Y 9- Y 10- Y 11- Y 12- Y      |
 | DATES                                                                       |
 | CONFCAL          SHIFT                                                      |
 | ENVIRONMENT SMFID      SYSTEM                                               |
 | =========================================================================== |
 | IN                                                                          |
 ======= >>>>>>>>>>>>>>> END OF RULE DEFINITION PARAMETERS <<<<<<<<<<<<<<< =====

"Order" your rule to activate it. It will then be seen in the "Rule Status" panel.

 ---------------------------- CONTROL-O RULE STATUS ---------------------(OS)
 COMMAND ===>                                                    SCROLL===> CRSR
 RULES:     87  MSG     42  CMD      3  EVN     31  ACTIVATED     34  SMFID LP14
 O RULE       TYP ---- DESCRIPTION ---- ACTIVATED --------- STATUS -------- UP
   IEA404A     M  MVS - SEVERE WTO BUF         0  ACTIVE
   IEA405A     M  MVS - SEVERE WTO BUF         0  ACTIVE
   IEF453I DU  S  DUMPXY JCL ERROR             0  ACTIVE
 ======= >>>>>>>>>>>>>>>>>>> BOTTOM OF ACTIVE RULES <<<<<<<<<<<<<<<<<<<< =======


To be able to use your rules in the coming days, either add your table to the member referenced by the DARULLST DD statement of the Control-O startup procedure, or order the table again after the new day process is performed in the early minutes of the next day.


After testing a couple of operations scenarios, you will be proficient enough to define your own automation scenarios.

Posted in IBM zEnterprise Servers, IT Service Management

zSecure Installation and Operation

zSecure is a suite of security products that improves management of the System z security environment. zSecure Admin and zSecure Visual are the components I installed and reconfigured recently.

zSecure Admin lets a security administrator work more productively, and zSecure Visual allows the same tasks to be performed from MS Windows workstations.

Installation is pretty straightforward. There are SMP/E datasets, target libraries, distribution libraries, and configuration datasets. After APF (Authorized Program Facility) authorization, TSO (Time Sharing Option) command authorization, and PARMLIB enablement, it is ready to use.

To start using it, execute the CKR command list in the SCKRSAMP library. It is useful to add it as an option in the ISPF/PDF (Interactive System Productivity Facility/Program Development Facility) system administration panels.

One very useful facility of zSecure is its set of profiles that allow regular users to perform security administration. You can authorize a regular user just to create users and reset passwords. This is very useful for service desk applications: there is no need to give high-grade system-special or group-special authority to service operation staff.

Another useful facility is the Collect function. It periodically collects and keeps the status of all data sets (a "freeze" of the data sets) together with the RACF (Resource Access Control Facility) database. Daily, weekly, or monthly snapshots let auditors observe past access information in a practical way.

zSecure Visual is not a long-running task like other z/OS address spaces. It is started by a started task but runs as a set of OMVS (Open MVS) USS (UNIX System Services) processes. After some time, the "Accepting Logons" message is issued and Visual is operational. zSecure Visual behaves like other z/OS subsystems in a parallel sysplex (systems complex) environment: to reach Visual, you must access the TCP/IP environment of the sysplex member on which Visual was started.

The zSecure Visual server is generally started at system startup through automation and stays up all the time. The default port the Visual server uses is 8000. A client user must be identified to the server before use: the client user first logs on to zSecure Admin using a TN3270 terminal, the server creates a token on the 3270 terminal, and this token is pasted into the client logon panel and saved in the client. After this process, the client can use the Visual server.

Authorities granted in a System z environment tend to grow over time. Eventually some of them are never used and become garbage, but it is very difficult to distinguish accessed from unaccessed authorizations: traditional SMF (System Management Facility) record data is huge and very difficult to manage for this purpose. zSecure supplies the Access Monitor facility to address this problem. Access Monitor executes on all members of the sysplex, and all accesses are captured in datasets. Those datasets are consolidated daily, weekly, monthly, and even yearly, so they do not take much space. The least-used authorizations are those related to yearly applications, so after two years of monitoring it would be safe to remove never-accessed authorizations and the resources related to them.

The final facility I will mention is the CARLa language embedded in the product. Most of the product's functions are implemented in this language, and it is very easy to customize zSecure with it.
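As a rough illustration of the language's flavor only (field names here are illustrative; consult the zSecure CARLa reference for exact syntax), a query that lists user profiles from the RACF database has roughly this shape:

```
newlist type=racf
  select class=user segment=base
  sortlist key(8) name
```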

Posted in IBM zEnterprise Servers, IT Service Management

SMP/E HOLDERROR Processing in IBM System z Software Management

Recent discussions with some colleagues showed me that this is an area which requires some clarification.

SMP/E (System Modification Program/Extended) is the tool for performing maintenance on z/OS systems and System z components like DB2, CICS, WAS, etc. Program products are organized as FMIDs (Function Modification IDentifiers), and modifications as PTFs (Program Temporary Fixes). PTFs are corrective code updates for defective programs and are organized into PUTs (Program Update Tapes); they are no longer delivered on cartridges, but the name has not changed. PTFs are primarily identified by PUT level: PUT1302, for example, means the PUT tape for the second month of 2013.

Although PTFs are prepared to fix erroneous code, some of them are erroneous themselves. These are called PEs (PTFs in Error). PEs are marked in the accompanying documentation, called HOLDDATA. New superseding PTFs are prepared and included in more recent PUTs.

As time passes, some PTFs are observed to be correct and are included in a more reliable categorization, the RSU (Recommended Service Upgrade). An RSU is also identified in year-and-month format: RSU1304 means the RSU PTFs for the fourth month of 2013.

As an example, suppose that UK12344 and UK12355 are PTFs in PUT1303. In PUT1304, UK12344 is marked as a PE and superseded by a new PTF, UK12366. Suppose further that PTFs UK12355 and UK12366 are added to RSU1305. If you APPLY RSU PTFs instead of PUT PTFs, you are less likely to apply PEs.

Let me get back to HOLDDATA. I mentioned that HOLDDATA keeps information about ERROR PTFs; it also keeps information about other requirements for applying PTFs. For example, a system RESTART may be required after a PTF is applied, or the PTF may introduce a command, parameter, message, or documentation CHANGE.

Before APPLYing PTFs, we always run an APPLY CHECK and specify BYPASS(HOLDSYSTEM,HOLDERROR).

With these parameters we are saying, "I noted the SYSTEM HOLDs; go ahead and pretend to apply the PTFs with system holds" and "I noted the ERROR PTFs; go ahead and pretend to apply the PEs." Since this is a CHECK run, we observe any remaining difficulties in applying the PTFs. Without the BYPASS, we would receive GIM30206E error messages and return code 8, which may hide other errors.
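A typical APPLY CHECK job might be sketched as follows (the CSI dataset name and target zone name are hypothetical):

```
//APPLYCK  EXEC PGM=GIMSMP,REGION=0M
//SMPCSI   DD   DISP=SHR,DSN=SMPE.GLOBAL.CSI
//SMPCNTL  DD   *
  SET BOUNDARY(MVST100).
  APPLY PTFS
        GROUPEXTEND
        BYPASS(HOLDSYSTEM,HOLDERROR)
        CHECK.
/*
```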

When it comes to the real APPLY, never use BYPASS(HOLDERROR) unless you have a good reason to apply a PTF in error. A good reason may be that the PE is related to a component you do not use (cryptography, for example) while the PE is a prerequisite for another PTF you have to apply.

The majority of IBM System z customers do not apply PUT PTFs. They apply RSU PTFs because they want to minimize the risk of applying PTFs in error and introducing defects into their systems.

This may not be a frequently confronted problem, as PEs are not included in RSUs most of the time. But do not forget: specifying BYPASS(HOLDSYSTEM) means "bypass the HOLDs and APPLY PTFs that carry system HOLDs," and specifying BYPASS(HOLDERROR) means "bypass the ERRORs and APPLY the PEs."
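To summarize in SMP/E control-statement form (the zone name MVST100 is hypothetical), the real APPLY drops both CHECK and HOLDERROR:

```
SET BOUNDARY(MVST100).
APPLY PTFS
      GROUPEXTEND
      BYPASS(HOLDSYSTEM).
```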

Posted in IBM zEnterprise Servers | Tagged , , , , , , , , , , , , , , , , , , | Leave a comment

Increasing Number of CHPIDs (Channel Path Identifiers) for LCUs (Logical Control Units)

The channel subsystem (CSS) and the IBM z/OS operating system need to know what hardware resources are available in the computer system and how these resources are connected. This information is called hardware configuration. Hardware Configuration Definition (HCD) provides an interactive interface that allows you to define the hardware configuration for both a processor’s channel subsystems and the operating system running on the processor (1).

A channel is the piece of hardware with logic in the CPC (Central Processing Complex) to which you connect a cable in order to communicate with an outboard device. FICON channels are used in recent configurations. Channel path identifiers (CHPIDs) are the two-hexadecimal-digit identifiers for channels. You use a CHPID to identify a channel to the hardware and software in HCD. Although the two terms are often used interchangeably, we refer to attaching a control unit to a channel and to using the CHPID in a z/OS CONFIG command to identify the channel you want to bring online or offline (2).
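For instance, the CHPID is what the CONFIG (CF) console command takes to change the state of a channel path:

```
CF CHP(A1),OFFLINE
CF CHP(A1),ONLINE
```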

A CU (control unit) supports up to 256 devices. LCU (logical control unit) definitions are used to circumvent this limitation: multiple CUs are defined, one for each block of 256 devices. Most z/OS shops use a set of 8 channels for all LCUs. As long as channel utilization is below 90 percent, there is no problem and these definitions can be used indefinitely. If channel utilization steadily approaches the maximum, additional channel definitions should be added to the hardware configuration.
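For illustration (the device numbers are hypothetical), 512 devices would be split into two 256-device blocks, each defined as its own control unit:

```
CU 1:  devices C100-C1FF   (256 devices)
CU 2:  devices C200-C2FF   (256 devices)
```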

Following is the view of control unit definition panel in Hardware Configuration Definition:


As we cannot define more than 8 channels per LCU, we assign channels in a circular fashion. Suppose we have channels A1, A2, A3, A4, A5, A6, A7 and A8. Further suppose that the existing LCU definitions are as follows:


We will add 4 more channels like B1, B2, B3 and B4. We will add CHPIDs in a circular fashion to have channels distributed evenly. New definition would be as follows:


Since the number of LCUs may not be a multiple of the number of channels, some channels will be used one time less than others. In our case the use counts of channels A5, A6, A7, A8, B1, B2, B3 and B4 will be one less than the others.
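The circular distribution can be illustrated with five hypothetical LCUs:

```
LCU01:  A1 A2 A3 A4 A5 A6 A7 A8
LCU02:  B1 B2 B3 B4 A1 A2 A3 A4
LCU03:  A5 A6 A7 A8 B1 B2 B3 B4
LCU04:  A1 A2 A3 A4 A5 A6 A7 A8
LCU05:  B1 B2 B3 B4 A1 A2 A3 A4
```

Here A1 through A4 are each used four times, while A5 through A8 and B1 through B4 are each used three times, one less than the others.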

After performing the modifications in a new work file, create a new production I/O definition file (IODF). You can then write the I/O configuration to one of the IOCDS locations in the processor complex.

The problem may start when you try to activate I/O configuration dynamically. In a complex I/O configuration environment, you could receive system abend 878 with reason code 10:


You increase the TSO address space region until you cannot get more virtual storage, yet you still cannot activate the I/O configuration dynamically:




Even if you use the ACTIVATE console command to test or activate the configuration dynamically, you may not be able to do so because of the following messages:


But these messages remind you of channel path and device associations that are no longer used. You have to VARY the following paths OFFLINE using console commands similar to VARY PATH(C100-C1FF,A5),OFFLINE:
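A sketch of such commands, with hypothetical device ranges and channel pairings (V is the short form of VARY):

```
V PATH(C100-C1FF,A5),OFFLINE
V PATH(C100-C1FF,A6),OFFLINE
V PATH(C200-C2FF,A5),OFFLINE
```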


After varying the unused paths offline, you can test and activate the new configuration successfully. The next thing is to keep an eye on RMF (Resource Measurement Facility) or BMC MainView CMF Monitor reports (whichever you use) to observe a better-balanced I/O distribution.
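The console sequence for the test and activation described above might be sketched as follows (the IODF suffix 45 is hypothetical):

```
ACTIVATE IODF=45,TEST
ACTIVATE IODF=45
```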



(1) z/OS Hardware Configuration Definition User’s Guide

(2) I/O Configuration Using z/OS HCD and HCM

Posted in IBM zEnterprise Servers | Tagged , , , , , , , , , , , , , , , , | Leave a comment