So, how complicated do you think htmled might be? This simple 1000-line program does very simple content extraction and transformation (see "htmled, the handbook editor blog and publisher" and "htmled revisited – Handbook entries to blog posts automated"). It never leaves the HTML format, and it involves no concurrency, no database and no distribution. So, if you're a Python developer, you might simply look at the code.
But, as an ArgoUML developer and user, before starting to do TDD on htmled I made some simple diagrams, which I kept up to date for documentation purposes. The following UML class diagram shows the basic data structure, with corresponding classes existing in htmled.py. The HbFile represents a Handbook file and uses HbFileParser to parse the file into HbDailyEntries, which are themselves composed of HbSubjectEntries. I normally load htmled in the Python shell, create HbFile(s) and then use a PostExtractor to extract Posts from the handbook files.
Picture 4 – htmled main classes diagram.
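For readers without the diagram at hand, the containment relationships described above can be sketched roughly as follows. This is only an illustration, not the code in htmled.py: the singular class names HbDailyEntry and HbSubjectEntry, every attribute (path, date, subject, body_html, title) and the extract() method are assumptions made for the example, and the parsing step done by HbFileParser is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HbSubjectEntry:
    subject: str      # hypothetical attribute: heading of the subject entry
    body_html: str    # hypothetical attribute: HTML content of the entry


@dataclass
class HbDailyEntry:
    date: str         # hypothetical attribute: the day the entry belongs to
    subjects: List[HbSubjectEntry] = field(default_factory=list)


@dataclass
class HbFile:
    path: str         # hypothetical attribute: location of the handbook file
    daily_entries: List[HbDailyEntry] = field(default_factory=list)


@dataclass
class Post:
    title: str
    body_html: str


class PostExtractor:
    """Turns every subject entry of a handbook file into a Post (assumed behaviour)."""

    def extract(self, hb_file: HbFile) -> List[Post]:
        return [
            Post(title=f"{day.date}: {entry.subject}", body_html=entry.body_html)
            for day in hb_file.daily_entries
            for entry in day.subjects
        ]
```

With such a structure, extracting posts is a walk over the daily entries and their subject entries, which matches the "a bit of complexity" attributed to the PostExtractor.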
The complexity lies a bit in the PostExtractor and very much in the HbFileParser and its associated classes. Specifically, to parse a handbook file I used the HTMLParser module contained in the Python standard library and the State design pattern, which I read about in Robert C. Martin's Agile Software Development book. For this you may check the UML state diagram I drew when trying to model the Finite State Machine for handbook file parsing – check the HbFileParsing class in htmled.
Picture 5 – htmled Handbook File Parsing Finite State Machine.
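To give an idea of how the State pattern can be combined with the standard library parser, here is a minimal sketch. It is not the state machine from htmled: it uses the Python 3 location of the parser (html.parser.HTMLParser), and it assumes a made-up handbook layout in which an h2 element holds a date and an h3 element holds a subject. The names State, Idle, InDate, InSubject and SketchParser are hypothetical.

```python
from html.parser import HTMLParser


class State:
    """Base state: ignore every event and stay in the same state."""

    def start_tag(self, ctx, tag):
        return self

    def end_tag(self, ctx, tag):
        return self

    def data(self, ctx, text):
        return self


class Idle(State):
    def start_tag(self, ctx, tag):
        # Assumed layout: <h2> opens a daily entry, <h3> opens a subject entry.
        if tag == "h2":
            return IN_DATE
        if tag == "h3":
            return IN_SUBJECT
        return self


class InDate(State):
    def data(self, ctx, text):
        ctx.entries.append({"date": text.strip(), "subjects": []})
        return self

    def end_tag(self, ctx, tag):
        return IDLE if tag == "h2" else self


class InSubject(State):
    def data(self, ctx, text):
        if ctx.entries:  # a subject only makes sense inside a daily entry
            ctx.entries[-1]["subjects"].append(text.strip())
        return self

    def end_tag(self, ctx, tag):
        return IDLE if tag == "h3" else self


# The states carry no data of their own, so one shared instance of each is enough.
IDLE, IN_DATE, IN_SUBJECT = Idle(), InDate(), InSubject()


class SketchParser(HTMLParser):
    """The context: forwards parser events to the current state object."""

    def __init__(self):
        super().__init__()
        self.state = IDLE
        self.entries = []

    def handle_starttag(self, tag, attrs):
        self.state = self.state.start_tag(self, tag)

    def handle_endtag(self, tag):
        self.state = self.state.end_tag(self, tag)

    def handle_data(self, data):
        self.state = self.state.data(self, data)
```

The parser itself contains no conditionals about where it is in the file. Each state decides how to react to an event and which state comes next, so adding a new state does not mean touching the existing ones.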
The use of the State pattern and HTMLParser for parsing the handbook files is definitely better than a "reinvent the wheel" approach, but I think it isn't the most appropriate way of solving this problem. I think that I should have tried to use ANTLR for parsing, creating an AST and then extracting the required info from there, e.g., DailyEntries. If I revisit this project with the ambition of improving its functionality, I might very well do just this as a warm-up!