Memories of COBOL
This post is sort of a therapy session where I dump all of the weird things that I have seen when working with several huge projects in COBOL 85.
It might be useful if you have a morbid curiosity about this language, or if you (for some reason) are deeply involved with the language and want to see how it looks from an outsider’s perspective.
As of now, I’m a developer, mostly in the web, mostly in the backend/data engineering side of things. While I spend a lot of time in Python, I might have to explain things in terms of other languages, depending on the context. I also expect at least some experience with programming and won’t go into basics of how modern computers and languages work.
*record scratch* Yep, that’s me
Imagine a young student, who got offered their first SWE job in the third year of university. Accepting a first job is a no-brainer, and most of the conditions can be ignored, even though the impact your first job has can be lasting.
So what was the job?
We were working with old COBOL codebases. Like, “last modification of this file was done in 1992” old. Like, “the first version of this file was written in 1979” old. This automatically means a couple of things:
- No VCS, since it was not in common use at the time. A file is versioned by keeping a changelog in comments at the top of it.
- No/minimal technical specification. People really treated programming and planning differently back then.
- No real way to run the code locally in any capacity*.
- Most of the people that wrote the code in the first place are 100% not available, since they likely retired 10+ years ago.
I will explore implications of working with an ancient codebase, in a language that resists being written in, using technologies that can’t really be obtained by normal people. Even though in some sense this is a nightmare, I got a set of skills that have proven to be useful for my whole career.
Bear in mind that I might be forgetful about some details of the language - I’m lucky to not have practiced any of it for about a decade at this point. The general gist of my complaints is true, but details about syntax can have mistakes. I also refuse to compile and check any of my code examples, but will appreciate feedback to fix mistakes, if any.
Anatomy of a fall COBOL
While this is not a full COBOL tutorial, it would help to briefly describe what a generic COBOL program looks like. Here’s how one would write one:
      * this is a comment
       identification division.
       program-id. hello-world.
       environment division.
       data division.
       working-storage section.
       01 USERS.
          02 USER-ENTRY OCCURS 5 TIMES.
             03 NAME PIC X(20).
             03 ZIPCODE PIC S9(5).
             03 ZIPCODE-TEXT REDEFINES ZIPCODE PICTURE X(5).
             03 COUNTRY PIC XXX.
       01 USER-IDX PIC 9 VALUE 1.
       procedure division.
       001-main-s section.
       001-main.
           display 'hello world'.
           perform 005-display-user
               varying user-idx from 1 by 1 until user-idx > 5.
       001-main-exit.
           stop run.
       005-display-s section.
       005-display-user.
           display 'name: ' NAME(USER-IDX)
           display 'country: ' COUNTRY(USER-IDX).
       005-exit.
           exit.
       end program hello-world.
Now, let’s dissect this part by part:
Basics
A COBOL program is made of divisions. There can only be one instance of each, and most of them are optional (except for identification). The most used ones are:
- identification - defines program metadata, author, version etc.
- data - defines all the variables that can be used. In 85, you don’t get the privilege of allocating memory dynamically - everything is pre-allocated based on what’s in the data division.
- environment - mostly used to define files accessed
- procedure - the actual program code that gets executed.
In summary, you get all of the perks of a language so old that it was one of the first to have the concept of structures:
- everything in a program is global
- everything is mutable. You do named constants with text replacements (COPY ... REPLACING, or the REPLACE statement), which work similarly to the C preprocessor.
- no concept of functions, only “paragraphs” and “sections”. In the sample above, 001-main is an example of a paragraph. The main usage of a paragraph is as a label for a GOTO or PERFORM statement, which is something like GOTO, for, and while loop in a trench coat.
- you get all of the wonders of C-style pointer arithmetic, more on that later.
We’ve never left the punch card land
Some of you might have already noticed that the example program has a suspicious indentation on the left throughout. This is because the first 7 characters of each line are reserved. The reason for this is extremely simple: on a typical punch card, the first 6 characters are reserved for the line sequence number, and the 7th character is an optional marker: if it is empty, the line is processed by the compiler; if it holds an asterisk, the line is treated as a comment. I assume this was kept for backwards compatibility long after punch cards stopped being used.
In the “modern” code, those symbols were usually left unused, sometimes used for marking specific lines, as additional comment space.
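To make the column layout concrete, here is a small Python sketch that cuts a source line into those areas. The function name and the returned keys are made up for illustration, and I am assuming the common layout where the code itself lives in columns 8 to 72.

```python
def split_fixed_format(line: str) -> dict:
    """Split one fixed-format source line into its column areas."""
    padded = line.ljust(72)              # pad short lines to the full card width
    indicator = padded[6]                # column 7 (index 6): the marker column
    return {
        "sequence": padded[0:6],         # columns 1-6: sequence number area
        "indicator": indicator,          # blank for code, '*' for a comment
        "code": padded[7:72].rstrip(),   # columns 8-72 (assumed): the statement
        "is_comment": indicator == "*",
    }


parsed = split_fixed_format("000100* this is a comment")
print(parsed["sequence"], parsed["is_comment"])
```

Anything the compiler does beyond skipping comment lines (continuation markers and so on) is left out on purpose.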
yeah we’ve got structs, no big deal
While not the first language to introduce the concept of “let’s pack a group of variables into one blob that contains those values”, COBOL certainly has some …unique features. Let’s start with basics: every variable has a number called “level”. If the level is larger than level of previous declaration, that field belongs to the previous one, making previous declaration a struct. For example:
01 USER.
02 USER-NAME PIC X(10).
02 USER-SURNAME PIC X(10).
This defines a memory area 20 bytes long, where the first 10 bytes are taken by USER-NAME, and the next 10 bytes are taken by USER-SURNAME.
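For readers who think in Python, the closest thing I know is the struct module with fixed-width byte fields. This is only an analogy: the helper names are hypothetical, and I use ASCII where a mainframe would use EBCDIC.

```python
import struct

# 10 bytes for USER-NAME followed by 10 bytes for USER-SURNAME
USER_LAYOUT = struct.Struct("10s10s")


def pack_user(name: str, surname: str) -> bytes:
    # alphanumeric fields are padded with trailing spaces to their full width
    return USER_LAYOUT.pack(
        name.ljust(10).encode("ascii"),
        surname.ljust(10).encode("ascii"),
    )


def unpack_user(record: bytes) -> tuple:
    name, surname = USER_LAYOUT.unpack(record)
    return name.decode("ascii").rstrip(), surname.decode("ascii").rstrip()


record = pack_user("Grace", "Hopper")
print(len(record), unpack_user(record))
```

The point is that the record is one flat blob of bytes, and the field names are just offsets into it.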
The level number should be a positive integer (obviously), with a value from 1 to 49 (what?). I’m guessing someone decided that 49 levels is enough for everyone.
Additionally, there are special level values: 66, 77, and 88:
66 level
This is the only one that allows you to RENAME one or more variables within a struct.
This means that essentially you get a pointer to a field with a different area
that belongs to it:
01 USER.
02 USER-NAME PIC X(10).
02 USER-SURNAME PIC X(10).
02 USER-CREDIT-CARD PIC 9(12).
66 USER-FULL-NAME RENAMES USER-NAME THRU USER-SURNAME.
66 USER-BLOB RENAMES USER-NAME THRU USER-CREDIT-CARD.
This means that USER-FULL-NAME points to a 20-byte range inside the USER
struct, while USER-BLOB covers every field of the USER structure. Yes,
including the number - which in this case will be a 12-byte-long character
sequence, but more on that later.
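A rough Python analogy for this aliasing is a memoryview over one buffer: every name is a window into the same bytes, and nothing is copied. The variable names and sample data below are mine, not from any real program.

```python
# 10 bytes of name + 10 bytes of surname + 12 digits stored as characters
record = bytearray(b"Grace     Hopper    123456789012")

view = memoryview(record)
user_name = view[0:10]
user_surname = view[10:20]
user_full_name = view[0:20]   # plays the role of USER-FULL-NAME

# Writing through one name is visible through every overlapping name
user_surname[0:6] = b"Murray"
print(bytes(user_full_name))
```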
77 level
For most purposes, this is the same as the 01 level, but it forbids creating underlying fields, so it can only declare a standalone top-level variable.
88 level
This is a curious one. Similarly to 66, this is a fake variable,
and it sort of reminds me of enum in a more modern language. 88 levels
always have boolean type. They work as a get* method, that looks at the
variable one level above, and returns true if it matches the predefined constant:
01 USER-CARD-TYPE PIC X(3).
88 USER-DEBIT-CARD VALUE 'DEB'.
88 USER-CREDIT-CARD VALUE 'CRE'.
If in the program you test USER-DEBIT-CARD, it will
return TRUE only if USER-CARD-TYPE holds that specific string. As far as
language features go, this is actually useful, and interesting: I can
easily imagine having something similar in a modern program, just implemented
without all of the substructuring nonsense. Good job, COBOL!
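Here is roughly how I would imitate this in Python, with read-only properties standing in for the 88 levels. The class and attribute names are invented for the example.

```python
class UserCard:
    def __init__(self, card_type: str):
        # plays the role of the 3-character USER-CARD-TYPE field
        self.card_type = card_type

    @property
    def is_debit_card(self) -> bool:
        # true only while the parent field holds 'DEB'
        return self.card_type == "DEB"

    @property
    def is_credit_card(self) -> bool:
        # true only while the parent field holds 'CRE'
        return self.card_type == "CRE"


card = UserCard("DEB")
print(card.is_debit_card, card.is_credit_card)
```

Just like in COBOL, the flags store nothing themselves: change card_type and both answers change with it.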
insert a picture pun here
Another thing I’ve omitted up until now are pictures, usually written as PIC.
Picture is a byte layout configuration of any individual variable, they can
only be defined on leaf nodes, but not on 66/88 levels. Pictures define
both the underlying memory layout, as well as string representation of a
variable if it’s being printed in a report, for example. Here are some highlights
of what can be done with them:
- XXX, X(3) - alphanumeric fields of length 3
- 99V99 - numeric field with 2 digits before and after the fixed point. The value is stored as the EBCDIC (of course it wouldn’t be ASCII) representation of its digits, each taking 1 byte. The V only marks where the point is and does not take any space in memory, so the variable is 4 bytes long. Importantly, there’s no loss of precision here as it happens for IEEE-754 floats, and you can freely store up to 18 digits in total.
- 99.99 - similar to the above, but the . is an actual character stored in the middle, so the variable is now 5 bytes long. You can also use , for the same purpose.
- 99 COMP-3, or PACKED-DECIMAL - numeric value that is stored as a binary-coded decimal integer (see wiki). In short, this means that each byte has the layout AAAABBBB, where the first 4 bits are digit A, and the last 4 bits are digit B; on IBM mainframes the last 4 bits of the value hold the sign. Yes, endianness is a factor here, and it’s usually different from what we have in current systems.
- COMP-1/2 under the hood are single and double-precision floating point
- There are also COMP-4/5/6/7/8 because why wouldn’t there be. To my memory, they didn’t do anything, except were possibly aliases for other computational types.
- S can be used to save a sign for the variable, but also sign can be specified by a separate clause, and you can specify if it should be saved at the start or the end of byte layout.
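Two of these layouts are easy to sketch in Python. I am assuming here that V is an implied decimal point that occupies no storage, and that packed decimal follows the IBM-style convention of a trailing sign nibble (C for plus, D for minus). Both function names are hypothetical.

```python
from decimal import Decimal


def decode_implied_decimal(digits: str, scale: int) -> Decimal:
    """Read stored digits such as '1234' with two implied decimals as 12.34."""
    return Decimal(int(digits)).scaleb(-scale)


def pack_decimal(value: int, digits: int) -> bytes:
    """Pack an integer as 4-bit decimal digits followed by a sign nibble."""
    sign = 0xD if value < 0 else 0xC      # assumed IBM-style sign codes
    text = str(abs(value)).rjust(digits, "0")
    if len(text) > digits:
        raise ValueError("value does not fit the picture")
    nibbles = [int(ch) for ch in text] + [sign]
    if len(nibbles) % 2:                  # pad to a whole number of bytes
        nibbles.insert(0, 0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


print(decode_implied_decimal("1234", 2))
print(pack_decimal(-123, 3).hex())
```

Using Decimal instead of float keeps the "no loss of precision" property that makes these types attractive for money.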
well that does not seem that bad right?
The overarching theme of COBOL’s problems is that every feature of the language has to be implemented as its own special little case in a special statement, with 30% of the grammar for that feature being optional. As for what this means for structure definitions, here’s a short list of things you can do right there, at the declaration:
- Repeat any of the structures on any level, making arrays
- At the same time, you can also add another integer variable to specify it will be used as an index for this array.
- Define structs that specify layout of one record in a file. Reading from a file fills that structure by default.
- Declare varying-length array, the length of which depends on a different variable. No, you still don’t allocate memory: you specify a range of values for length, and you allocate the maximum.
- Specify that empty values in a string have to be filled by leading or trailing spaces or zeroes
- “Redefine” a variable: do a reinterpret_cast of a different variable with any picture you want
- Define a pointer variable that can interpret underlying memory as specific layout
- say that if a numeric value is zero, display should print it as spaces
- similar to how it works in Pascal, you can say that an array is indexed not from 1 to len, but from an arbitrary integer to another arbitrary integer.
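The varying-length array is the one I find easiest to explain with code. A minimal Python sketch, assuming fixed-width items and a separate counter variable; the class is my own invention and only mimics the idea of "allocate the maximum, track the length elsewhere".

```python
class BoundedTable:
    """A table with a fixed maximum size and a separate current-length counter."""

    def __init__(self, max_items: int, item_width: int):
        self.max_items = max_items
        self.item_width = item_width
        self.count = 0   # the variable the array length depends on
        # the maximum is allocated once, up front, and never grows
        self.storage = bytearray(b" " * (max_items * item_width))

    def append(self, value: bytes) -> None:
        if self.count >= self.max_items:
            raise OverflowError("table is full")
        start = self.count * self.item_width
        # pad or truncate the value to exactly one slot
        slot = value.ljust(self.item_width)[: self.item_width]
        self.storage[start:start + self.item_width] = slot
        self.count += 1

    def items(self) -> list:
        w = self.item_width
        return [bytes(self.storage[i * w:(i + 1) * w]) for i in range(self.count)]


table = BoundedTable(max_items=3, item_width=4)
table.append(b"ab")
table.append(b"cdefgh")
print(table.items(), len(table.storage))
```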
The grammar for just a single level declaration takes more than one page. Grammar for the rest of declarations, including file record types (of which there are at least 3) takes another 5 pages.
Well, why?
A lot of bloat in the grammar comes from the idea that the language should be written by humans, so it should “flow” like English to allow onboarding of non-technical people. I’ll give a couple of examples of reserved keywords in the language (some of these are equivalent, some are absolutely not):
- THROUGH, THRU
- END-OF-PAGE, EOP
- ALPHABET, ALPHABETIC
- COMP, COMPUTATIONAL
- PIC, PICTURE
- INPUT-OUTPUT, I-O, INPUT, OUTPUT
- LIMIT, LIMITS
- OVERFLOW, OVERFLOWS
- EQUAL, EQUALS
- RECORD, RECORDING, RECORDS
- REPORT, REPORTING, REPORTS
- SPACE, SPACES, BLANK, LOW-VALUE, LOW-VALUES, VALUE, VALUES
- END-IF, END-ADD and about 20 other variations of END clause for every statement
and 400+ more!
where are the statements, Lebowski?
It’s not possible to cover the vast landscape of COBOL statements in one blogpost. I’ll just mention that due to the non-existence of functions and procedures as we know them, most of the things we now expect to be part of a standard library API work as grammar constructs and are baked into the compiler. This includes:
- reading files. This includes statements such as OPEN, READ, WRITE, REWRITE, CLOSE
- sorting file records
- merging multiple files into one
- searching for an item in array
- splitting string into parts by separator
- gather parts of a string/array into a new string
- counting lengths of substrings, after split by previous statement
- DIVIDE, ADD, MULTIPLY, SUBTRACT are separate statements, all with their own special cases in grammar - for modular arithmetic, error handling and all that. For cases where we need to combine more than one operation type, we have the COMPUTE statement, with its own set of grammar and semantics
- member-by-member assigning of struct values to other structs. The semantics of cases where you have two structs with different layouts are mysterious and have a ton of edge cases: you can automatically assign a struct to array of strings, array of structs to a single struct, individual members of nested structs can be assigned by name, or by copying memory blobs per-byte. You do that by using MOVE and SET commands, which are obviously completely different commands and have absolutely unique semantics.
- huge subset of commands can accept paragraph IDs to work like a callback hook.
- a separate set of commands to generate reports - which are somewhat similar to modern HTML template language like Jinja. You can define headers there, print spreadsheets, and do a bunch of other things.
In total, the COBOL-85 standard, which describes the grammar with a shallow description of semantics, spans over 800 pages. For more details on the logic of individual clauses and how exactly the edge cases are handled, you need a different ANSI document that is another 1000+ pages long. I’ll go over some of the features that stand out to me.
fall-through behaviors
Let’s look again at the example from the beginning:
       001-main-s section.
       001-main.
           display 'hello world'.
           perform 005-display-user
               varying user-idx from 1 by 1 until user-idx > 5.
       001-main-exit.
           stop run.
The important detail is that the flow of the program naturally goes to the first section, first paragraph, and then falls through into the next paragraph.
This makes paragraphs similar to labels in assembly. However, unlike assembly,
you can do PERFORM <paragraph-or-section>, which automatically returns you to
your control point after an EXIT statement is reached. Or when the paragraph
ends. Or you can specify a range of paragraphs and execute them in sequence
instead, like PERFORM 005-do-a THRU 005-do-b. All of these have slightly
different semantics about where exactly the control flow will return
if it reaches an EXIT statement in the middle of a loop, like if you
do PERFORM <something> 5 TIMES. And yes, there is GOTO, which also behaves
differently with paragraphs, EXIT statements and all that.
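To show the difference between falling through and PERFORM, here is a toy Python model where paragraphs are an ordered list of labelled chunks. It is a deliberately simplified sketch: the helper functions are hypothetical, and STOP RUN, EXIT and GOTO are not modelled at all.

```python
log = []

# Paragraphs in source order; each one is just a labelled chunk of code.
PARAGRAPHS = [
    ("001-main", lambda: log.append("main")),
    ("001-main-exit", lambda: log.append("main-exit")),
    ("005-display-user", lambda: log.append("display")),
]
NAMES = [name for name, _ in PARAGRAPHS]


def fall_through(start):
    """Run from `start` to the end of the program, like plain sequential flow."""
    for _, body in PARAGRAPHS[NAMES.index(start):]:
        body()


def perform(first, thru=None):
    """Run paragraphs first..thru (inclusive), then return to the caller."""
    last = NAMES.index(thru if thru is not None else first)
    for _, body in PARAGRAPHS[NAMES.index(first):last + 1]:
        body()


fall_through("001-main")
print(log)
```

In the real program the stop run in 001-main-exit is the only thing that prevents control from sliding into 005-display-user, which is exactly what the fall_through call above does.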
naming conventions
When dealing with programs written for mainframes (yes, just like in the movies),
you often have to deal with quirks like “oh, our system has a limit on file
name length of 7 symbols”. Now imagine that you have 3 million lines
of code spread across 1000+ programs. This forces these programs to have
descriptive names like NW101PW, FW5000X and so on. Most of these programs
do not have any documentation for what they do. Each such program
can call a different program, and the name of the callee can be set up
dynamically (and fetched from an external file, for example).
standard implementation
I’ve mentioned that the code you deal with is running on a mainframe. This implies several things:
- It uses hardware with completely different architecture from what you are using
- Most likely that mainframe uses a proprietary operating system, for example z/OS by IBM
- COBOL code was written for a proprietary implementation of COBOL that can’t be obtained. And if it can, it still wouldn’t be able to produce binaries for your precious Arch.
- the main way to test things out is to send the file to the mainframe and run tests there remotely. That might not be available, since we are likely dealing with sensitive data and the system is air-gapped, so the other option is to either send code to a person who’s physically there and has access to machines, or just deal with it and move on
- While there are open source alternatives like GnuCOBOL available, they are incomplete
- In fact, to my knowledge, none of the production implementations of COBOL implement the ANSI standard to 100% (similar to how it works with C++, I think): every vendor adds their own cool little extensions to work with their favourite proprietary tools such as the DB2 RDBMS by IBM, CICS, and so on.
- This means that without having a compiler to test out your code, you are forced to read from 2 to 5 different tech specifications, each at least 600 pages long, which sometimes contradict each other. Rarely, but it happens.
indexed files
Indexed files are a semi-proprietary file format that is described by an ANSI standard, but there are no details as to how it can/should be implemented. The standard only says what you can technically do through an interface, which consists of several special COBOL statements. Namely:
- Each file contains records of the same size, and each record has to contain a unique primary key.
- You can quickly search for a record in a file by its key
- You can sequentially read through all of the records. This will access them in the order of ascending key
- You can read only a subset of records by searching with constraints
- You can do partial matching on keys: if you search for a record 100000, it will scroll past all records with keys <100000 and then end up on the first one with key larger than or equal to 100000
- You can have a secondary key that can also be used for searching, but it can allow duplicates in some cases.
- Files are modifiable: you can add, overwrite and delete data in it.
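Since the standard only describes the interface, here is an in-memory Python stand-in that behaves the way the list above describes. It is not how any vendor stores these files; the class and its method names are mine, loosely modelled on the keyed-read and read-next operations, and secondary keys are left out.

```python
import bisect


class IndexedFile:
    """In-memory stand-in for a keyed file: records kept sorted by primary key."""

    def __init__(self):
        self._keys = []        # sorted primary keys
        self._records = {}     # key -> record
        self._cursor = 0       # position used by sequential reads

    def write(self, key: int, record: str) -> None:
        if key in self._records:
            raise KeyError(f"duplicate primary key {key}")
        bisect.insort(self._keys, key)
        self._records[key] = record

    def read(self, key: int) -> str:
        return self._records[key]          # direct lookup by exact key

    def start(self, key: int) -> None:
        # position on the first record whose key is >= the requested one
        self._cursor = bisect.bisect_left(self._keys, key)

    def read_next(self):
        if self._cursor >= len(self._keys):
            return None                    # end of file
        key = self._keys[self._cursor]
        self._cursor += 1
        return key, self._records[key]

    def delete(self, key: int) -> None:
        del self._records[key]
        self._keys.remove(key)


f = IndexedFile()
f.write(100200, "b")
f.write(99999, "a")
f.write(100500, "c")
f.start(100000)
first = f.read_next()
print(first)
```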
why do we even care about byte layout
To put it simply: I haven’t seen a language before or since, that cares so much about byte layout. There are several reasons for it:
- It’s just what people cared about back in ‘85, when memory was expensive
- Most COBOL programs generate reports. When you are bound to the output format of a punchcard (of course this is also some sort of legacy restriction), you have to care about your numbers fitting into the report, all the names fitting the specific schema and so on.
- Since the language gives you limited tools, pointer arithmetic and equivalents are abundant, if not overused. This is usually done to access the same data with a different layout, and doing reinterpret casts makes you care about every bit of data that you have.
cics
CICS is this weird ultimate GUI toolkit that allows people to make interactive software in COBOL. The most relevant example of this is likely some sort of old ATM machine, or maybe you have seen some old system for ticket ordering, where they’ve used something like it. I don’t have actual screenshots on my hands, but a very close replication can be found here in README images.
why do we even care about COBOL?
Back when I was working at that place, it was estimated that COBOL was powering more than half of banks and government sectors around the world. Someone has to care, because these are the systems that count your taxes, send you bills, and do all of the things that nowadays we tend to use Java for and laugh it off as the most boring enterprise language that exists. Trust me, Java is not that bad. At the very least,
the silver lining
Working with that codebase taught me a couple of things:
- I’ve learned to read the documentation - not the tutorials in Google, which were wrong or incomplete 60% of the time, but actual language specification. Turns out, it’s not that hard, and is rather useful
- This gave me a good perspective on what an unmaintainable codebase looks like