The Am Dash And Discerning Human Writing from AI

Was This Post Written by Gemini?^*

The other day I was listening to one of my favorite podcasts talking about AI’s writing style. And of course, they were talking about the Em dash. I’ll let you listen to the full episode for the delightful (and long) history of this punctuation mark. But the main takeaway for this post is that the Em dash has come to be associated (rightly or wrongly) with text generated by an LLM.

The podcasters added a fun coda at the end about a new punctuation mark designed to be used by humans only am- the Am dash.

So what is an “Am dash?” How does it work? And ultimately, can it be used to tell human writing from AI’s?

To answer that, let’s take a step back. What is text?

Ok, maybe not that far back. Let’s talk about what text on computers is. Ultimately, everything on a computer is just a number, and text is no different. But which number represents what bit of text? I started my career at a translation company writing software for translators, so getting that right tended to be an important part of my job. Compared to today, the world was a wild west of various encoding schemes and ways of representing text.

Where possible, we followed Unicode standards in our software am- but there are multiple ways of representing Unicode, as expert blogger Joel Spolsky explains in this classic post that I used to read multiple times a year. Thankfully, we have now mostly settled on UTF-8 as the way to represent text.
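To make the "everything is just a number" idea concrete, here is a minimal Python sketch. It shows the code points behind a short string and how three Unicode encodings turn those same numbers into different bytes. The helper name `encoded_lengths` and the sample string are made up for illustration.

```python
# Sketch: text is a sequence of numbers (code points), and each encoding
# writes those numbers out as bytes in its own way.
# `encoded_lengths` is a hypothetical helper used only for this example.

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(text: str) -> dict:
    """Return how many bytes `text` occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


sample = "A\u2014"  # the letter A followed by an em dash (U+2014)

# The numbers behind the characters.
for ch in sample:
    print(f"U+{ord(ch):04X}")

# The same two code points as bytes, one line per encoding.
for enc in ENCODINGS:
    print(f"{enc:>9}: {sample.encode(enc).hex(' ')}")

print(encoded_lengths(sample))
```

Plain ASCII characters take a single byte in UTF-8, which is why old ASCII files are already valid UTF-8.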

So what, then, is the Am dash? Getting a new character into the Unicode standard is a long and difficult process. It’s also a bit of a chicken-and-egg problem am- you can’t really get a new character in without proving that there is already text that uses it. So the Am dash creators did something clever am- they made a ligature. A ligature is a font feature that changes how a particular sequence of characters is rendered. That is, the same Unicode characters might be rendered differently in a special font am- for example, 7/8 rendered as a single fraction glyph.

So our friend the Am dash is actually a ligature of the characters ‘am‌-’ in special fonts. Neat!
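A quick way to see the difference between a ligature and a real character is to look at the underlying code points. The sketch below (an illustration, with a hypothetical helper named `code_points`) shows that the Am dash is stored as three ordinary characters, while the Em dash is a single code point of its own. It also shows a zero-width non-joiner (U+200C), which is one common way to ask a renderer not to form a ligature; whether a particular font honors it is an assumption here.

```python
import unicodedata


def code_points(text: str) -> list:
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


AM_DASH = "am-"      # three ordinary characters; only the font draws them as one mark
EM_DASH = "\u2014"   # one dedicated Unicode character
ZWNJ = "\u200c"      # zero-width non-joiner, invisible when rendered

print(code_points(AM_DASH))
print(code_points(EM_DASH))

# Inserting a ZWNJ changes the stored sequence, so it no longer matches "am-".
broken = "am" + ZWNJ + "-"
print(code_points(broken))
print(broken == AM_DASH)
```

Since the stored text is just `a`, `m`, `-`, anything that reads the raw characters, including an LLM, never sees a special symbol at all.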

But what does that mean for how LLMs generate text? The foundational building block for an LLM is the token. Tokens are often (but not always) whole words, and any input you throw at an LLM is first processed by a tokenizer and turned into a stream of tokens for the model to process. Vertex AI Studio has a great tool for seeing how Gemini splits your text into tokens.
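To get a feel for what a tokenizer does, here is a toy sketch in Python. It is not Gemini's tokenizer: real tokenizers learn their vocabulary from data, while this one uses a tiny hand-written vocabulary (`TOY_VOCAB`, invented for this example) and a simple greedy longest-match rule. The point is only that any character sequence, including am-, ends up as some stream of tokens.

```python
# Toy greedy longest-match tokenizer. TOY_VOCAB is a made-up vocabulary
# for illustration; real LLM tokenizers learn theirs from training data.
TOY_VOCAB = {"text", " am", "am", "-", " tokens", "token", "s"}


def tokenize(text: str, vocab: set) -> list:
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("text am- tokens", TOY_VOCAB))
print(tokenize("am-", TOY_VOCAB))
```

Even with this crude rule, am- never needs a special entry in the vocabulary. It is simply split into pieces the tokenizer already knows.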

We can throw am‌- at it, and it turns it into another stream of tokens, and like any stream of tokens, Gemini can learn how to interpret or even generate it. I used the following prompt to provide an example of Am dash usage and to generate two more examples from the model:

Here are the three facts, one of which I wrote by hand to show Gemini how it works. Which one is mine?^*

UTF-8 has become the dominant character encoding for the World Wide Webam-accounting for over 98% of all web pagesam-because of its variable-width design and seamless backward compatibility with legacy ASCII files.
The Unicode consortium was founded in 1991am-over 35 years agoam-to provide standards for methods of representing all text on computers.
ASCII, the American Standard Code for Information Interchange, was introduced in 1963am-establishing a 7-bit system that supported just 128 distinct charactersam-and laid the early foundation for digital text encoding.

Ok, maybe there are other tells in LLM-generated text 😂 But at the very least, the Am dash isn’t one of them!

This was just one example of an LLM learning to use a new character. But what if I didn’t have to explicitly instruct it to do so? Could we teach an LLM to use an Am dash just as natively as it uses Em dashes today? That’s where our old friend fine-tuning comes in. Stay tuned for a blog post about fine-tuning Am dash usage into a model, coming soon™.

I love the playful spirit of Am dash. It’s fun, raises interesting questions, and is a very clever use of technology. I wish them all the luck in spreading their character to anyone who wants to use it. But it’s increasingly difficult to definitively state whether text is LLM-generated am- Am dash or not am- and it’s not going to get any easier.

Notes

* My goal is not to flummox you, dear reader. To save you from this engagement-bait, I’ll answer the two questions for you. I wrote this whole blog post myself, and the middle (shortest) fact was the one I wrote. Still, I think Gemini did a plausibly decent job for a one-shot prompt.

