Port Watchers № 4
Happy New Year! Below, two projects of my own, and one by someone else. This issue is a bit more on the technical side, though (I hope) not tediously so for this audience.
Be Tidy
I started a new project this past month: an HTML5 writer for Racket. If you have a nested list representation of some HTML content (what we call an X-expression), this library will print it out nicely indented and wrapped at a given line length. It’s still in early stages, but it is working, and once I have figured out some design specifics (see the link, input welcome), I will publish it as a package on the Racket package server.
For accidental historical reasons, most of what I know about computers and programming is self-taught. And much of what I taught myself was related to my interest in the World Wide Web. The web is a great field for an autodidact, partly because web developers enjoy sharing their craft, but also because you can right-click any web page in your browser, choose View Source, and see how its authors did whatever they did. It’s the nature of the medium.
I learned a lot of things by looking at the HTML and CSS behind sites I liked and trying to replicate their results. If you follow some of the same creators I do, you’ve probably seen their influence on my projects, including this newsletter. (I could add “and if you are the View-Source type of person”, but you are subscribed to this newsletter, after all.) So it’s important to me to make my own web projects “welcoming” to those who peek under the hood: easy for eyeballs to walk around in, and with handles and levers for curious hands to grab or pull.
The simplest expression of this kind of consideration is not publishing sites where all of the HTML is mashed onto a single line a million columns long. Before now I’ve satisfied this pedantic preference using HTML Tidy, an elder utility that dates from the 1990s. But the “problem” of how to pretty-print HTML accurately is an interesting one (see below), so I wanted to see if I could solve it on my own.
A technical aside about what the HTML5 writer does. Some people call this “pretty-printing”, because you’re adding extra steps to make the output pretty for humans, but without changing how it would be treated by your computer. That last part is often the tricky part, especially with HTML. Consider this snippet:
<p><span>one two</span><i>three</i></p>
If you want to wrap this snippet at 20 columns, the straightforward “word processor” approach of printing “words” of non-whitespace characters and breaking between those words would yield this (rule added at the top for column-count clarity):
----|----1----|----2----|----3----|
<p><span>one two
</span><i>three</i>
</p>
…but the break after “two” is wrong, because it introduces whitespace between “two” and “three” where none exists in the input.
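To make the failure concrete, here is a minimal sketch in Python (not the Racket writer itself). It assumes a deliberately crude tokenizer that treats every tag and every run of text as a “word”. The names naive_wrap and visible_text are made up for this illustration, and visible_text only approximates how a browser collapses whitespace in inline content.

```python
import re

# Every tag, every run of text, and every run of whitespace is one token.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+|\s+")

def naive_wrap(html, width):
    """Greedy fill that allows a line break between ANY two tokens."""
    lines, line, pending_space = [], "", False
    for tok in TOKEN.findall(html):
        if tok.isspace():
            pending_space = True  # remember the space, emit it lazily
            continue
        candidate = line + (" " if pending_space and line else "") + tok
        if line and len(candidate) > width:
            lines.append(line)  # token does not fit: start a new line
            line = tok
        else:
            line = candidate
        pending_space = False
    if line:
        lines.append(line)
    return "\n".join(lines)

def visible_text(html):
    """Rough stand-in for rendering: drop tags, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", "", html).split())

source = "<p><span>one two</span><i>three</i></p>"
wrapped = naive_wrap(source, 20)
print(wrapped)
print(visible_text(source))   # one twothree
print(visible_text(wrapped))  # one two three
```

The wrapped source fits in 20 columns, but the visible text has changed, which is exactly the kind of damage a pretty-printer must never do.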
So doing this correctly requires more than just looping through the elements: to print any given element correctly, you need some visibility into the contents of adjacent elements. The solution I’ve landed on is accurate, and visually satisfactory for nearly all the content you are likely to throw at it. But it is not rigorous: it will occasionally let a line run longer than it needs to. The example above? My writer simply prints it all on one line:
----|----1----|----2----|----3----|
<p><span>one two</span><i>three</i></p>
because by the time it reaches the closing </span> tag (which would put it over the 20-column limit used here), it has no remaining opportunity for a line break. Normally, though, you’ll be wrapping content at something like 80 or 100 columns, not 20, so this doesn’t crop up often, and when it does it will not be very noticeable.
Nonetheless, I will be putting in a bit more work to try to get the printer to produce the most-correct result for this example:
----|----1----|----2----|----3----|
<p><span>one
two</span><i>three</i></p>
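One way to arrive at that output, sketched in Python: only ever break where the source already has whitespace outside of a tag. This illustrates the idea and is not necessarily how the finished writer will work. The name wrap_at_existing_whitespace is hypothetical, and the sketch assumes well-formed markup while ignoring <pre> content, comments, and attribute values that contain “>”.

```python
import re

# A chunk is a run of tags and non-whitespace characters. Whitespace inside
# a tag (between attributes) does not end a chunk.
CHUNK = re.compile(r"(?:<[^>]*>|\S)+")

def wrap_at_existing_whitespace(html, width):
    """Greedy fill that breaks only where the input already had whitespace."""
    lines, line = [], ""
    for chunk in CHUNK.findall(html):
        if not line:
            line = chunk
        elif len(line) + 1 + len(chunk) <= width:
            line += " " + chunk
        else:
            lines.append(line)  # chunk does not fit: start a new line
            line = chunk
    if line:
        lines.append(line)
    return "\n".join(lines)

source = "<p><span>one two</span><i>three</i></p>"
print(wrap_at_existing_whitespace(source, 20))
```

The second line of the result still runs to 26 columns, because no safe break remains after “two”. A line that is too long is a cosmetic flaw. A line break in the wrong place changes the content.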
(If this still looks wrong to you, please think on it for a few minutes before you @ me 😉)
I could not find any blog posts or papers about sound approaches to this problem. Maybe they are out there, but search engines are not showing them to me (they all think I want to know about line-wrapping HTML content, not HTML source code). Or perhaps the few people who think about keeping their source looking good have been content to let HTML Tidy handle it, as I have been until now. If you know of any such posts or papers, please send them to me!
View-Source…on this newsletter
One more thing, in the spirit of view-source: I have (as promised) published the source code to raco-news, the program I wrote as my front-end for publishing this newsletter. As written, you will need your own installation of Sendy to use it (full instructions are in the README), but you could take its HTML and plain-text output and use them in another newsletter-sending scheme of your design or choosing.
A curiosity
Soupault calls itself a static site generator, but is also (or more of?) a static site manipulator.
Soupault is not like other static site generators — it works on the HTML element tree level. Most SSGs treat HTML as an opaque format that can be generated with templates but cannot be read or manipulated.
Soupault treats HTML as a first-class format and that enables many use cases and features that are impossible for other SSGs.
When I first read through Soupault’s documentation, it was a real head-scratcher, but I think I get it now. Say you have a blog: you could write your individual articles in Markdown or AsciiDoc or whatever. You could then use Soupault to manage the conversion of those files to HTML (running them through Pandoc or whatever converter you use), manipulate individual elements within that HTML output, and build your list of posts.
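To give a feel for what “working on the element tree” means, here is a tiny Python sketch using the standard library’s xml.etree.ElementTree. It is not Soupault’s interface, just the general idea. The function build_post_list and the sample pages are invented for illustration, and the sketch assumes each page is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

def build_post_list(pages):
    """Read each page as a tree, pull out its first <h1>, and build a <ul>."""
    ul = ET.Element("ul")
    for href, markup in pages.items():
        root = ET.fromstring(markup)
        h1 = root.find(".//h1")
        li = ET.SubElement(ul, "li")
        a = ET.SubElement(li, "a", href=href)
        # Use the heading's text (including nested inline elements) as the
        # link text, falling back to the file name when there is no <h1>.
        a.text = "".join(h1.itertext()) if h1 is not None else href
    return ET.tostring(ul, encoding="unicode")

pages = {
    "first.html": "<article><h1>First <em>post</em></h1><p>Hi</p></article>",
    "second.html": "<article><p>No heading here</p></article>",
}
print(build_post_list(pages))
```

The point is that the list of posts is derived by reading the generated HTML, not by consulting a template or front matter.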
It is a pretty unique approach. If I had run into this in 2002 (when I was writing HTML files by hand) or 2015 (when I was experimenting pretty hard trying to glue stuff together with Pandoc and shell scripts), I might have made heavy use of Soupault. Instead, I was introduced by Pollen to a better paradigm: authoring in a programming environment, where the source texts themselves are programs that can expose an arbitrary data model directly usable by other code. But I am still glad to know about Soupault, if only for its having opened my eyes to a different approach to web publishing mechanics. I thought I knew them all already!
That’s all for now! More soon, but not sooner than a month from now. If you have anything to say or share, just reply to this email, or find me on Mastodon: @joeld@tilde.zone.
— Joel