u/Known_Vanilla_9071

▲ 4 r/cobol

CS student tried to build a COBOL lexical analyzer — would appreciate a sanity check from someone who actually knows the language

Hi r/cobol,

Student here from Pakistan studying Theory of Programming Languages. I just wrapped up my final project: a lexical analyzer for a language I'm calling PyCOBOL, which is basically COBOL's structural syntax mixed with Python's control flow keywords.

I know that sounds weird, but the idea was to design a hybrid language and build a lexer for it from scratch as a compiler design exercise.

What the lexer currently handles on the COBOL side:

- IDENTIFICATION, DATA, PROCEDURE, ENVIRONMENT DIVISIONS

- WORKING-STORAGE, FILE, LINKAGE, INPUT-OUTPUT SECTIONS

- PIC clauses with basic format validation

- Keywords like DISPLAY, MOVE, COMPUTE, PERFORM, STOP RUN

- COBOL-style identifiers with hyphens (MY-VARIABLE)

- Level numbers 01-05
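
To make the token categories above concrete, here is a minimal sketch of how they could be modeled. It is not the project's actual code: the names `KEYWORDS`, `TOKEN_RE`, and `tokenize` are hypothetical, and the keyword set only covers the words listed above.

```python
import re

# Hypothetical keyword set, limited to words mentioned in the list above.
KEYWORDS = {
    "IDENTIFICATION", "DATA", "PROCEDURE", "ENVIRONMENT", "DIVISION",
    "WORKING-STORAGE", "FILE", "LINKAGE", "INPUT-OUTPUT", "SECTION",
    "PIC", "DISPLAY", "MOVE", "COMPUTE", "PERFORM", "STOP", "RUN",
}

# Order matters: earlier alternatives win at the same position.
TOKEN_RE = re.compile(r"""
    (?P<STRING>"[^"\n]*")                       # closed string literal
  | (?P<WORD>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)    # COBOL word, hyphens inside only
  | (?P<PERIOD>\.)                              # sentence terminator
  | (?P<SKIP>\s+)                               # whitespace
  | (?P<UNKNOWN>.)                              # anything else is a lexical error
""", re.VERBOSE)


def tokenize(line):
    """Return a list of (kind, text) pairs for one line of source."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "WORD":
            if text.isdigit():
                # Assumption: two-digit 01-05 is treated as a level number.
                is_level = len(text) == 2 and 1 <= int(text) <= 5
                kind = "LEVEL" if is_level else "NUMBER"
            elif text.upper() in KEYWORDS:
                kind = "KEYWORD"
            else:
                kind = "IDENTIFIER"
        tokens.append((kind, text))
    return tokens


print(tokenize("01 MY-VARIABLE."))
# [('LEVEL', '01'), ('IDENTIFIER', 'MY-VARIABLE'), ('PERIOD', '.')]
```

One design consequence of this sketch: a level number is recognized purely by its shape, so a literal `05` inside a statement would also come out as `LEVEL`. Telling the two apart needs context that a purely regex-based lexer does not have.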

My professor evaluated it and said it was good, but told me to get feedback from an actual COBOL developer, which as a student with no industry connections is... not easy lol.

I already know the obvious gaps:

- No column position enforcement (columns 1-6, 7, 8-72)

- No COPY statements or REDEFINES

- Very limited subset of the full COBOL standard

- No parser after this — just phase 1
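
For reference, the missing column handling could be added as a pre-pass before tokenization. The following is a minimal sketch, assuming fixed-format source with the column ranges named above; `split_fixed_format` and `is_comment_line` are hypothetical helper names, not part of the project.

```python
def split_fixed_format(line):
    """Split one fixed-format line into (sequence, indicator, text).

    Columns 1-6 hold the sequence number, column 7 the indicator,
    and columns 8-72 the program text. Anything past column 72 is dropped.
    """
    padded = line.rstrip("\n").ljust(72)  # short lines are padded with spaces
    sequence = padded[0:6]    # columns 1-6
    indicator = padded[6]     # column 7
    text = padded[7:72]       # columns 8-72
    return sequence, indicator, text


def is_comment_line(line):
    """Assumption: '*' or '/' in column 7 marks a comment line."""
    return split_fixed_format(line)[1] in ("*", "/")


seq, ind, text = split_fixed_format("000100 IDENTIFICATION DIVISION.")
print(repr(seq), repr(ind), repr(text.rstrip()))
# '000100' ' ' 'IDENTIFICATION DIVISION.'
```

With a pre-pass like this, only the `text` part would be handed to the tokenizer, and comment lines could be skipped before any tokens are produced.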

What I'm genuinely curious about from someone experienced:

Does our tokenization approach make sense for COBOL's structure?

Is there something fundamentally wrong about how we modeled COBOL tokens that would matter in a real implementation?

Happy to share the code in the comments if anyone's interested.

Thanks 🙏

u/Known_Vanilla_9071 — 8 days ago

Built a COBOL lexical analyzer as a CS student — would love 2 mins of feedback from someone who actually knows COBOL

Hey r/mainframe,

CS student here. I just finished a Theory of Programming Languages project where I built a lexical analyzer for a hybrid language called PyCOBOL. It combines COBOL's structure (DIVISIONS, SECTIONS, PIC clauses, COBOL keywords) with Python's control flow syntax.

My professor was impressed but said "go get a review from a real COBOL developer", which honestly felt impossible since I'm a student in Pakistan with zero industry connections lol.

The lexer recognizes:

- All 4 COBOL DIVISIONS and major SECTIONS

- PIC clauses with format validation

- COBOL keywords (DISPLAY, MOVE, COMPUTE, STOP RUN etc.)

- Python keywords simultaneously (hybrid design)

- Lexical errors (unclosed strings, invalid PIC chars, unknown characters)

- Builds a symbol table with scope tracking
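
As an illustration of what PIC format validation with error reporting can look like, here is a minimal sketch. It is not the project's actual code: `PIC_RE` and `check_pic` are hypothetical names, and the accepted subset (only `9`, `X`, `A`, `V`, a leading `S`, and repeat counts like `(3)`) is an assumption. Real COBOL picture strings allow many more symbols, for example editing characters such as `Z`.

```python
import re

# Assumed subset: optional leading S, then 9/X/A/V symbols,
# each with an optional repeat count such as (3).
PIC_RE = re.compile(r"S?(?:[9XAV](?:\([0-9]+\))?)+")
ALLOWED_CHARS = set("9XAVS()0123456789")


def check_pic(picture):
    """Return None if the picture string is valid in this subset,
    otherwise a short error message."""
    picture = picture.upper()
    bad = sorted(set(picture) - ALLOWED_CHARS)
    if bad:
        return f"invalid PIC character(s): {''.join(bad)}"
    if not PIC_RE.fullmatch(picture):
        return "malformed PIC string"
    if picture.count("V") > 1:
        return "more than one V"
    return None


print(check_pic("9(3)V99"))  # None
print(check_pic("9#9"))      # invalid PIC character(s): #
print(check_pic("9(3"))      # malformed PIC string
```

Separating the "invalid character" check from the "malformed shape" check keeps the two error messages distinct, which matches the error categories listed above.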

It's definitely a prototype and not anywhere near real COBOL standards. I know we're missing column rules, COPY statements, REDEFINES and a lot more. But the question for someone experienced is basically:

"Does this make sense as a lexical approach? What's the most wrong thing about how we modeled COBOL tokens?"

Even one sentence from someone who's actually touched a mainframe would genuinely help. Happy to share the GitHub link or a quick demo video.

Thanks for reading 🙏

u/Known_Vanilla_9071 — 8 days ago