So, how complicated do you think htmled might be? This simple 1000-line program does very simple content extraction and transformation (see htmled, the handbook editor blog and publisher and htmled revisited – Handbook entries to blog posts automated). It never leaves the HTML format and involves no concurrency, no database and no distribution. So, if you're a Python developer, you might simply look at the code.
But, as an ArgoUML developer and user, before starting to do TDD on htmled I made some simple diagrams, which I kept up to date for documentation purposes. The following UML class diagram shows the basic data structure, with corresponding classes existing in htmled.py. The HbFile represents a Handbook file, using HbFileParser to parse the file into HbDailyEntries, which are themselves composed of HbSubjectEntries. I normally load htmled in the Python shell, create HbFile(s) and then use a PostExtractor to extract Posts from the handbook files.
Picture 4 – htmled main classes diagram.
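For readers without the diagram at hand, the containment relationships it describes can be approximated in a few lines of Python. This is only a minimal sketch, not the code in htmled.py: the class names come from the text above (in singular form, which is an assumption), while every field name, the extract method and the way a Post title is built are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HbSubjectEntry:
    # Hypothetical fields: one subject discussed on a given day.
    subject: str
    body: str


@dataclass
class HbDailyEntry:
    # Hypothetical fields: a day is composed of subject entries.
    date: str
    subjects: List[HbSubjectEntry] = field(default_factory=list)


@dataclass
class HbFile:
    # Hypothetical fields: a handbook file holds daily entries.
    path: str
    daily_entries: List[HbDailyEntry] = field(default_factory=list)


@dataclass
class Post:
    # Hypothetical fields: what ends up published on the blog.
    title: str
    content: str


class PostExtractor:
    """Turns every subject entry of a handbook file into a Post (assumed rule)."""

    def extract(self, hb_file: HbFile) -> List[Post]:
        posts = []
        for day in hb_file.daily_entries:
            for entry in day.subjects:
                title = f"{day.date}: {entry.subject}"
                posts.append(Post(title=title, content=entry.body))
        return posts


# Usage resembling a Python shell session; the data is made up.
hb = HbFile(
    path="handbook.html",
    daily_entries=[
        HbDailyEntry("2009-01-01", [HbSubjectEntry("htmled", "Wrote tests.")]),
    ],
)
posts = PostExtractor().extract(hb)
```

In this sketch the HbFile is built by hand; in htmled that job belongs to HbFileParser, which is where most of the complexity sits.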
The complexity lies a bit in the PostExtractor and very much in the HbFileParser and its associated classes. Specifically, for parsing a handbook file I used the HTMLParser module contained in the Python standard library and the State design pattern I read about in Robert C. Martin's Agile Software Development book. For this, you may check the UML state diagram I drew when trying to model the Finite State Machine for handbook file parsing; see also the HbFileParsing class in htmled.
Picture 5 – htmled Handbook File Parsing Finite State Machine.
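To give a feel for how the State pattern combines with the standard library parser, here is a minimal sketch. It is not the state machine from the diagram: the three states, the SketchParser class, the parse helper and the assumed handbook layout (an h1 heading per day, an h2 heading per subject) are all hypothetical. It uses the Python 3 module html.parser, which is the successor of the Python 2 HTMLParser module mentioned above.

```python
from html.parser import HTMLParser


class State:
    """Base state: ignores every event and stays where it is."""

    def start_tag(self, ctx, tag):
        return self

    def end_tag(self, ctx, tag):
        return self

    def data(self, ctx, text):
        return self


class Idle(State):
    """Outside any heading; waits for a day or subject heading."""

    def start_tag(self, ctx, tag):
        if tag == "h1":
            return IN_DATE
        if tag == "h2" and ctx.entries:
            return IN_SUBJECT
        return self


class InDate(State):
    """Inside an h1: the text is taken as the date of a new daily entry."""

    def data(self, ctx, text):
        # Simplification: assumes the heading text arrives in one piece.
        ctx.entries.append({"date": text.strip(), "subjects": []})
        return self

    def end_tag(self, ctx, tag):
        return IDLE if tag == "h1" else self


class InSubject(State):
    """Inside an h2: the text is a subject of the current daily entry."""

    def data(self, ctx, text):
        ctx.entries[-1]["subjects"].append(text.strip())
        return self

    def end_tag(self, ctx, tag):
        return IDLE if tag == "h2" else self


# States carry no data of their own, so one shared instance each is enough.
IDLE, IN_DATE, IN_SUBJECT = Idle(), InDate(), InSubject()


class SketchParser(HTMLParser):
    """Forwards each parser event to the current state object."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self.state = IDLE

    def handle_starttag(self, tag, attrs):
        self.state = self.state.start_tag(self, tag)

    def handle_endtag(self, tag):
        self.state = self.state.end_tag(self, tag)

    def handle_data(self, data):
        self.state = self.state.data(self, data)


def parse(html_text):
    parser = SketchParser()
    parser.feed(html_text)
    parser.close()
    return parser.entries
```

The point of the pattern is that the parser class contains no conditionals about where it is in the document; each state decides what an event means and which state comes next, so adding a new transition to the machine means touching one small class.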
The use of the State pattern and HTMLParser for parsing the handbook files is definitely better than a "reinvent the wheel" approach, but I think it isn't the most appropriate way of solving this problem. I think I should have tried to use ANTLR for parsing, creating an AST and then extracting the required info from there, e.g., DailyEntries. If I revisit this project to improve its functionality, I might very well do just this as a warm-up!