Store as HTML, Edit as LML

Page created: 2025-06-18

After the disappointing failure of my low-effort WYSIWYG editor (see Death to WYSIWYG!), I got my chance to try out the other idea I’d had.

Normally, a wiki will have its own mini-language which gets converted to HTML for viewing. For example, Wiki Creole (wikipedia.org).

Let’s use the term LML, short for Lightweight Markup Language (wikipedia.org), as the name for these wiki mini-languages.

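In broad strokes, this is what the process normally looks like: pages are stored as LML and converted to HTML for viewing.

The idea I’ve been brewing for a while is to do this instead: store pages as HTML and convert them to LML only for editing.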

No "rendering" step needed, page is already HTML!

In this model, the LML form of the page only temporarily exists as the editing format.

If you’re feeling somewhat philosophical, you could say that this method uses an LML as an interface for editing HTML.

From HTML it comes, to HTML it returns

In slightly more detail, the lifecycle of a page would look like this:

  • HTML loaded from storage directly into page.

  • HTML converted once to LML for editing.

  • As the LML is edited, the HTML view is updated immediately to display the changes.

  • Saving changes is simply writing the HTML preview back to storage!

drawing depicts editing with arrows and boxes as described above

In the drawing above, the split-pane editor has the HTML page "preview" on the left side and the LML code editor on the right side.

The so-called "preview" is initially the HTML page exactly as loaded from disk. It gets updated when you make changes to the LML. When you save your changes, the "preview" is saved back to disk.
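As a rough sketch of that lifecycle (not HTML WARDen’s actual code), the editor state can be modeled as two strings. The names `makeEditor`, `htmlToLml`, and `lmlToHtml` are hypothetical; the two converters are passed in so the sketch doesn’t depend on any particular LML.

```javascript
// Hypothetical model of the split-pane editor's state.
// htmlToLml and lmlToHtml are assumed converter functions supplied by the caller.
function makeEditor(stored_html, htmlToLml, lmlToHtml){
    return {
        // The "preview" starts as the stored page, untouched.
        preview: stored_html,
        // One-time conversion to populate the code pane.
        code: htmlToLml(stored_html),
        // Every edit re-renders the preview from the LML.
        edit: function(new_code){
            this.code = new_code;
            this.preview = lmlToHtml(new_code);
        },
        // Saving hands back the preview as-is; there is no separate render step.
        save: function(){
            return this.preview;
        }
    };
}
```

One consequence of this shape: if nothing is edited, `save()` returns the stored HTML unchanged, because the preview was never regenerated.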

Okay, simple enough. But isn’t this just shifting complexity around, not eliminating it? Maybe. That’s a fair question.

In practice, this does seem to have some real benefits. Read on!

Trying it out

Incredibly, I was able to get the basics working in just one evening. The key to success: Extreme simplicity!

The two-pane editor remains identical in appearance, but I scrapped the two-way WYSIWYG and HTML code sync and replaced it with a more traditional one-way sync from the LML code view to the "rendered" HTML view.

As I often do with DSLs, I have restricted the HTML WARDen LML to a line-based format, like the roff (wikipedia.org) formats of yore. This means that there is no tricky inline formatting (like '*' for bold). EXTREME SIMPLICITY!

Putting the Lightweight in Lightweight Markup Language

The HTML WARDen LML supports a tiny number of elements. Just what I need for this project and not a single thing extra:

  • A page title (the first line of the document).

  • Paragraphs of text separated by blank lines.

  • Internal and external links.

  • Two heading levels.

Here’s a syntax cheatsheet:

# Heading
## Sub-Heading
internal_page_link[]
page_link[Page Link With Display Text]
http://example.com[External Link]

The page editing button interface also needed an update, but mostly I removed formatting features and replaced them with a Help button that displays the above cheatsheet.
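As an illustration of the link syntax, here’s a minimal sketch of turning one link line into an anchor tag. The pattern is the `rx_a` regular expression from the parser excerpt later in this post. The function name `linkLineToHtml`, using the link target directly as the `href`, and falling back to the target as display text when the brackets are empty are all assumptions, not necessarily what HTML WARDen does.

```javascript
// Same pattern as rx_a in the parser excerpt: target, then [display text]
var rx_a = /^([^[]+)\[(.*)\]/;

// Hypothetical helper: returns an <a> tag for a link line, or null otherwise.
function linkLineToHtml(line){
    var m = rx_a.exec(line);
    if(!m){ return null; }
    var target = m[1];
    // Assumption: empty brackets mean "use the target as the link text"
    var text = m[2].length > 0 ? m[2] : target;
    return '<a href="' + target + '">' + text + '</a>';
}
```

Because the pattern is anchored to the start of the line, a link only counts when it begins a line, which fits the line-based design.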

Arbitrary HTML still okay

Wait, but isn’t that LML too limited? What if I need to create or include something a little more complex in a page - can I still edit it?

Absolutely. The final syntax feature of the LML lets arbitrary HTML with no direct LML equivalent stay in the page, untouched. Any HTML not understood by HTML WARDen’s editor is automatically enclosed in a protective bubble of start and end <html></html> tags and left verbatim. You can edit the raw HTML if you want, but the editor itself won’t mess with it.

You can see an example of that in this screenshot of the new editor interface:

screenshot of the htmlwarden split-screen editing interface

As you can see on the left side, there’s a green box. And on the right side, you can see that it’s just a chunk of raw HTML with inline style like so:

<html>
<div style="background:#bfb;padding:10px;">...</div>
</html>

This "escape hatch" may feel like cheating, but it’s not.
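Here’s a minimal sketch of how a line-based reader could recognize those protective bubbles. It is an illustration under assumptions, not the editor’s real code: the function name `splitRawBlocks` and the chunk objects it returns are hypothetical. The idea is simply that everything between a `<html>` line and a `</html>` line is collected verbatim and never interpreted as LML.

```javascript
var rx_html = /^<html>$/;
var rx_end_html = /^<\/html>$/;

// Hypothetical helper: split LML text into chunks of ordinary LML lines
// and chunks of raw HTML lines (the contents of <html>...</html> bubbles).
function splitRawBlocks(lml){
    var lines = lml.split(/\r\n|\r|\n/);
    var chunks = [];
    var in_raw = false;
    for(var i=0; i<lines.length; i++){
        var line = lines[i];
        var last = chunks[chunks.length - 1];
        if(!in_raw && rx_html.test(line)){
            // Start of a bubble: open a new raw chunk
            in_raw = true;
            chunks.push({raw: true, lines: []});
            continue;
        }
        if(in_raw && rx_end_html.test(line)){
            // End of a bubble: back to ordinary LML
            in_raw = false;
            continue;
        }
        if(in_raw){
            // Inside a bubble: keep the line exactly as written
            last.lines.push(line);
            continue;
        }
        // Ordinary LML line: start a new LML chunk if needed
        if(!last || last.raw){
            last = {raw: false, lines: []};
            chunks.push(last);
        }
        last.lines.push(line);
    }
    return chunks;
}
```

A converter built on this could emit the raw chunks unchanged and only run the LML rules over the rest.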

Line-based parsers are so easy

The LML parser is 71 lines of, honestly, pretty trashy code.

Here’s the bulk of it:

var rx_h2=/^# (.*)$/;
var rx_h3=/^## (.*)$/;
var rx_a=/^([^[]+)\[(.*)\]/;
var rx_html=/^\<html\>$/;
var rx_end_html=/^\<\/html\>$/;
var rx_blank=/^\s*$/;

...

var lines = code_area.value.split(/\r\n|\r|\n/);
var para_open = false;

// Title always first
var title = lines[0];
if(title.length < 1){ title = "Untitled"; }
var html = "<h1>" + title + "</h1>\n";

for(var i=1; i<lines.length; i++){
    var line = lines[i];
    var m; // holds the current regex match, if any
    if(m = rx_h2.exec(line)){ close_p(); html += '<h2>' + m[1] + '</h2>\n'; continue; }
    if(m = rx_h3.exec(line)){ close_p(); html += '<h3>' + m[1] + '</h3>\n'; continue; }

    ...

    // Blank line ends current p, if we're in one
    if(rx_blank.test(line)){
        close_p();
        continue;
    }

    // Else, we've got paragraph content
    open_p();
    html += line + "\n";
}

html_view.innerHTML = html;

The hardest part, as is usually the case with these things, is just keeping track of whether or not we’re currently in a paragraph.
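The excerpt calls `open_p()` and `close_p()` without showing them. Here’s a guess at what such helpers might look like, wrapped in a self-contained function so it can run outside the editor. The function name `lmlToHtml`, the exact `<p>` output, and the final `close_p()` call are assumptions. Link and raw-HTML handling are left out to keep the focus on paragraph tracking.

```javascript
var rx_h2 = /^# (.*)$/;
var rx_h3 = /^## (.*)$/;
var rx_blank = /^\s*$/;

// Hypothetical standalone version of the LML-to-HTML loop
function lmlToHtml(lml){
    var lines = lml.split(/\r\n|\r|\n/);
    var para_open = false;
    var m;

    // Title always first
    var title = lines[0].length < 1 ? "Untitled" : lines[0];
    var html = "<h1>" + title + "</h1>\n";

    // Only open a paragraph if one isn't already open
    function open_p(){
        if(!para_open){ html += "<p>\n"; para_open = true; }
    }
    // Only close a paragraph if one is open
    function close_p(){
        if(para_open){ html += "</p>\n"; para_open = false; }
    }

    for(var i=1; i<lines.length; i++){
        var line = lines[i];
        if(m = rx_h2.exec(line)){ close_p(); html += '<h2>' + m[1] + '</h2>\n'; continue; }
        if(m = rx_h3.exec(line)){ close_p(); html += '<h3>' + m[1] + '</h3>\n'; continue; }
        if(rx_blank.test(line)){ close_p(); continue; }
        open_p();
        html += line + "\n";
    }
    close_p(); // don't leave a dangling <p> at the end of input
    return html;
}
```

Making both helpers safe to call at any time is what keeps the main loop simple: every non-paragraph rule just calls `close_p()` without checking state first.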

So that’s the LML to HTML conversion, but…

Wait, but one does not simply parse HTML!

There’s still the initial one-time conversion of HTML to LML to populate the code editor pane. Trying to parse even a tiny subset of HTML is going to be fraught with danger, right?

So… the other fun thing about working with HTML in a browser is… you have the most powerful piece of HTML parsing software ever created right at your fingertips.

How do I parse the HTML? I don’t! I’m accessing the DOM of the page from JavaScript. It’s just 43 lines of, again, trashy code. Here’s most of it:

// Start by writing title as the first line
var title_tag = html_view.querySelector('h1');
var txt = title_tag ? title_tag.textContent : 'Untitled';
if(txt.length < 1){ txt = 'Untitled'; }
txt += "\n\n";

// old skool 'for' loop required for element collection
for(var i=0; i<html_view.children.length; i++){
    var e = html_view.children[i];
    var p = ''; // LML text for this element
    switch(e.nodeName){
    case 'H1': continue; // We've already taken care of the title
    case 'H2': p = '# ' + e.textContent + "\n\n"; break;
    case 'H3': p = '## ' + e.textContent + "\n\n"; break;

    ...

    case 'P':
        p='';
        for(var j=0; j<e.childNodes.length; j++){
            var e2 = e.childNodes[j];
            if(e2.nodeType === Node.TEXT_NODE){ ... }
            if(e2.nodeType === Node.ELEMENT_NODE){
                if(e2.nodeName === 'A'){
                    ...
                }
            }
        }
        p += "\n";
        break;
    default: p = '<html>\n' + e.outerHTML + '\n</html>\n\n';
    }
    txt += p;
}

code_area.value = txt;
}

The DOM API ain’t pretty, but compared to parsing HTML myself, it’s very, very nice.
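The paragraph case above elides how text and links are turned back into LML. Here’s one way it might go, as a sketch rather than the real code. The function name `paragraphToLml`, reading the link target from the `href` attribute, and writing empty brackets when the link text equals the target are assumptions. The numeric constants match the browser’s `Node.TEXT_NODE` and `Node.ELEMENT_NODE` values, so the function also runs on plain stand-in objects outside a browser.

```javascript
var TEXT_NODE = 3;    // same value as Node.TEXT_NODE
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

// Hypothetical helper: convert one <p> element (or a look-alike object)
// into LML text, handling plain text and <a> children.
function paragraphToLml(e){
    var p = '';
    for(var j=0; j<e.childNodes.length; j++){
        var e2 = e.childNodes[j];
        if(e2.nodeType === TEXT_NODE){
            // Plain text passes straight through
            p += e2.textContent;
        }
        else if(e2.nodeType === ELEMENT_NODE && e2.nodeName === 'A'){
            // Assumption: the href attribute is the LML link target
            var href = e2.getAttribute('href');
            var text = e2.textContent;
            p += href + '[' + (text === href ? '' : text) + ']';
        }
        else if(e2.nodeType === ELEMENT_NODE){
            // Assumption: unknown inline elements are kept as raw HTML
            p += e2.outerHTML;
        }
    }
    return p + "\n";
}
```

In a browser this could be called as `paragraphToLml(html_view.querySelector('p'))`. Note that `textContent` returns decoded text, so entities in the stored HTML come back as literal characters.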

It works!

It’s not as "easy" as just using the browser-supplied WYSIWYG editor and doing everything as HTML with no conversion at all. But it’s way nicer to use and very effective. I actually enjoy editing pages like this. I don’t feel like I’m fighting it.

Even though I’ve introduced a new LML and a conversion process to the situation, it was surprisingly painless to change direction.

Now I feel like I’m getting the benefits of traditional wiki quick authoring and the benefits of HTML storage.

Benefit: No conversion needed to view

The HTML seen in the "preview" pane of the editor is the page. It’s the exact source that is saved when you’re done editing.

Zero processing is needed to view the wiki. I’m still just including the content of the file with PHP’s include statement.

Benefit: HTML is the One True Final storage format for HTML

Incredibly, making this huge change to page editing had zero impact on the rest of the wiki. The only file affected by that commit was edit.php, and only the editor portion of that.

Using HTML as the storage format means not having to convert between storage formats ever again.

(This is one reason I’ve been excited about maybe someday moving to this method for ratfactor.com. If I switched from AsciiDoc to, say, Markdown or some other LML, I’d have to convert all of my stored pages. Yuck! But if I convert to HTML… I’m already done! I already have the exported static site.)

Benefit: Static pages for a potential static site

As I mentioned in Wiki Weekend Part 3, the pages are stored as incomplete HTML documents missing a head and foot.

Browsers, tolerant beasts that they are, will happily render the partial pages despite the missing tags. Which means that, in some sense, the stored page fragments are a static website.

I’ve been mulling over several different ways I might handle a bigger and more complex static HTML site with real, complete pages. I’m not ready for that just yet and there are a lot of ways I would want to improve the LML experience for that. But it’s clearly possible!

So, HTML WARDen is also proof of concept for a future wiki-style editor for a complete, proper static website where documents sit on disk as HTML ready to be served as-is and ready to be edited as LML.

What’s next?

Next is image handling. There will be one more bit of syntax in the LML to represent images. I’ll have file uploads with visual progress. I’ll also need to generate thumbnails and make a mini gallery with a search feature to find previously uploaded images.

Stay tuned.