CS student tried to build a COBOL lexical analyzer — would appreciate a sanity check from someone who actually knows the language
Hi r/cobol,
Student here from Pakistan studying Theory of Programming Languages.
Just wrapped up my final project — a lexical analyzer for a language
I'm calling PyCOBOL, which is basically COBOL's structural syntax
mixed with Python's control flow keywords.
I know that sounds weird, but the idea was to design a hybrid language
and build a lexer for it from scratch as a compiler design exercise.
What the lexer currently handles on the COBOL side:
- IDENTIFICATION, DATA, PROCEDURE, ENVIRONMENT DIVISIONS
- WORKING-STORAGE, FILE, LINKAGE, INPUT-OUTPUT SECTIONS
- PIC clauses with basic format validation
- Keywords like DISPLAY, MOVE, COMPUTE, PERFORM, STOP RUN
- COBOL-style identifiers with hyphens (MY-VARIABLE)
- Level numbers 01-05
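To make that list concrete, here is a stripped-down Python sketch of the tokenization idea. It is not the actual project code: the names `KEYWORDS`, `LEVELS`, and `tokenize` and the token kinds are made up for illustration, and it assumes one line at a time with no column rules. The part I'm least sure about is treating whatever follows PIC as a single PICTURE token instead of lexing it like normal text.

```python
import re

# Illustrative subset only; the names and token kinds here are hypothetical.
KEYWORDS = {
    "IDENTIFICATION", "ENVIRONMENT", "DATA", "PROCEDURE", "DIVISION",
    "WORKING-STORAGE", "FILE", "LINKAGE", "INPUT-OUTPUT", "SECTION",
    "PIC", "IS", "DISPLAY", "MOVE", "TO", "COMPUTE", "PERFORM", "STOP", "RUN",
}
LEVELS = {f"{n:02d}" for n in range(1, 6)}  # 01-05, the subset handled so far

WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")  # allows MY-VARIABLE
STRING = re.compile(r'"[^"]*"|\'[^\']*\'')
PICTURE = re.compile(r"\S+")  # a picture string runs to the next whitespace


def tokenize(line):
    """Return (kind, text) pairs for one line of source."""
    tokens = []
    pos = 0
    expect_picture = False  # set right after PIC: the lexer switches mode
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        if expect_picture:
            run = PICTURE.match(line, pos).group()
            pos += len(run)
            if run.upper() == "IS":  # "PIC IS 9(3)" keeps us in picture mode
                tokens.append(("KEYWORD", "IS"))
                continue
            expect_picture = False
            # A trailing period ends the entry; it is not part of the picture.
            picture = run[:-1] if run.endswith(".") else run
            if picture:
                tokens.append(("PICTURE", picture))
            if run.endswith("."):
                tokens.append(("PERIOD", "."))
            continue
        match = STRING.match(line, pos)
        if match:
            tokens.append(("STRING", match.group()))
            pos = match.end()
            continue
        match = WORD.match(line, pos)
        if match:
            text = match.group()
            upper = text.upper()
            if upper in KEYWORDS:
                tokens.append(("KEYWORD", upper))
                expect_picture = upper == "PIC"
            elif not tokens and text in LEVELS:
                tokens.append(("LEVEL", text))  # only as first token on a line
            elif text.isdigit():
                tokens.append(("NUMBER", text))
            else:
                tokens.append(("IDENTIFIER", upper))
            pos = match.end()
            continue
        tokens.append(("PERIOD" if line[pos] == "." else "SYMBOL", line[pos]))
        pos += 1
    return tokens


print(tokenize("01 MY-VARIABLE PIC 9(3)V99."))
```

The reason for the mode switch is that a picture like `9(3)V99` would otherwise get split into numbers, parentheses, and an identifier, which felt wrong to me.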
My professor evaluated it and said it was good but told me to get
feedback from an actual COBOL developer — which as a student with
no industry connections is... not easy lol.
I already know the obvious gaps:
- No column position enforcement (columns 1-6, 7, 8-72)
- No COPY statements or REDEFINES
- Very limited subset of the full COBOL standard
- No parser after this — just phase 1
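On the column gap specifically, this is roughly what I imagine a pre-pass would look like if I added it. It's a sketch, not something the lexer does today: `FixedLine`, `split_fixed_format`, and `is_comment` are hypothetical names, and it assumes the fixed-format layout as I understand it (columns 1-6 sequence area, column 7 indicator, columns 8-72 program text, with `*` or `/` in column 7 marking a comment line).

```python
from typing import NamedTuple


class FixedLine(NamedTuple):
    sequence: str   # columns 1-6
    indicator: str  # column 7
    text: str       # columns 8-72


def split_fixed_format(line):
    """Split one fixed-format source line into its column areas."""
    # Pad short lines so the slices below always exist.
    padded = line.rstrip("\n").ljust(72)
    return FixedLine(padded[0:6], padded[6], padded[7:72].rstrip())


def is_comment(line):
    """True when column 7 marks the whole line as a comment."""
    return split_fixed_format(line).indicator in ("*", "/")


parts = split_fixed_format("000100 01 MY-VARIABLE PIC 9(3).")
print(parts.text)
```

The idea would be to run this first and hand only `text` from non-comment lines to the tokenizer, so the lexer itself never has to care about columns.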
What I'm genuinely curious about from someone experienced:
Does my tokenization approach make sense for COBOL's structure?
Is there something fundamentally wrong about how I modeled
COBOL tokens that would matter in a real implementation?
Happy to share the code in the comments if anyone's interested.
Thanks 🙏