Welcome To The Brave New World Of An Inauthentic Web

Ian Kallen
8 min read · Jan 7, 2023
Brave New World (cc attribution)

A generation ago (well, almost 20 years ago), I developed a driving passion around the potential of connective tissue for the web to usher in an age of elevated human discourse. The thinking at the time was that given an engaged population of citizen publishers, conversing on their blogs by posting and linking to each other, a meritocratic system of reputation built on citations would elevate new insights and deeper understandings of the human condition. Eventually, my bubble was burst by reality — read on.

When I first heard about Technorati, I was skeptical of what seemed to be driven merely by vanity (“who is linking to me? how popular are the things I’m posting?”) but in his recruiting pitch, Dave Sifry infected me with the bigger vision and the passion was born. I joined Technorati and grew engaged in what became known as “Web 2.0” (where “1.0” was the read-only web that ordinary people contributed very little to).

As blogs developed more traction, SEO abusers realized that the basic mechanics of linking from diverse sources, which for Technorati drove reputational stratification in the blogosphere, were like fentanyl for Google’s PageRank (back then, it was probably “crack,” since fentanyl wasn’t yet a thing). If an emerging system’s attractiveness for abuse were a measure of its success, the blogosphere was looking very successful. However, I grew a new passion around protecting and incentivizing quality in the blogosphere’s discourse. I hacked together a number of measures to identify and thwart web spam on the blogosphere. These were all predicated on a common premise:

In order to be effective, a spam publisher will exhibit behavior that is an outlier from the norms of authentic publishers; the metrics around an abuser’s content cannot look normal while still meeting the goal of duping people into paying attention.
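In concrete terms, that premise suggests flagging publishers whose metrics are statistical outliers. As a minimal, hypothetical sketch (not the actual system, and with made-up metric and function names), a z-score over a per-publisher posting rate might look like this:

```python
from statistics import mean, stdev

def flag_outliers(posts_per_day, threshold=3.0):
    """Flag publishers whose posting rate is an extreme outlier.

    posts_per_day: dict mapping publisher -> posts/day metric.
    Returns the set of publishers more than `threshold` standard
    deviations above the mean -- a rate authentic publishers
    rarely sustain. Illustrative only; a real system would combine
    many signals (link counts, link targets, tag churn, etc.).
    """
    rates = list(posts_per_day.values())
    mu, sigma = mean(rates), stdev(rates)
    if sigma == 0:
        return set()
    return {pub for pub, rate in posts_per_day.items()
            if (rate - mu) / sigma > threshold}
```

A single metric like this is easy to game; the point is only that a spammer’s numbers have to live somewhere in the distribution, and meeting spam goals tends to push them into the tail.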

With that as the foundational thought, I worked with some of my colleagues there to surface abusive behavior by developing ranked aggregations of both ends of a citation (the publisher and the link destination), tags and other measures. The most prolific publishers (by number of posts, number of links, the common FQDNs of those links, etc.) and link targets were, more often than not, spam. We also started feeding identified web spam into a Bayesian system to classify content based on the data and metadata of the inauthentic content; the approach was nominally effective for email spam, so why not web spam? Or so the thinking went. We raised the ire of blogger groups who were (mostly) publishing authentic content but explicitly linking to each other, banding together for the sole purpose of raising their blogs’ profiles.

Around the same time (2005/2006), we had a few “web spam summits” that included folks from Google, Yahoo, Ask Jeeves, WordPress and other stakeholders of the day to look for common interests in maintaining quality on the web. While little came of that as far as collaboration, it was an interesting opportunity to share our learnings. Over the subsequent years, new platforms, competitive positioning and industry consolidation kiboshed any prospects for collaboration; Google’s real-time indexing eclipsed what we had developed at Technorati (their blog search eventually just folded into news) and a lot of conversations migrated to Twitter.

Web spam has remained a fact of life; there’s a lot of garbage on the web, but Google has been, if not great, pretty good about down-ranking the worst of it. In my opinion, most inauthentic content needs more than down-ranking; it needs to go completely dark. Hem and haw about free speech if you like, but I really don’t care to spend cognitive cycles weeding out work that is not just derivative but straight-up pastiche of the real deal in my explorations of the web. Twitter evolved into the connective tissue for many specialized communities and journalists. In addition to connecting people in conversation, Twitter became a major marketing and customer relationship venue for brands. Politicians and aspiring politicians discovered Twitter’s power to directly connect with audiences (constituents, potential ones and non-constituents alike). Twitter eventually proved that, instead of bringing forth a meritocratic arena of ideas, a venue where anyone can say anything would see its discourse swamped by spammers, trolls, bullies and demagogues. While highly imperfect, the trust and safety functions at Twitter helped keep the most corrosive forces off the platform.

And this state of affairs for Web 2.0 has stood for about 15 years or so, depending on where you lay your markers. But a few things have happened in the last few months that are upending the status quo:

  1. Elon Musk seemingly decided that kiboshing trust and safety on the altar of “free speech” was worth making a $44B bet; he bought the company, made himself the central player on the platform and has systematically reinstated some of its most egregious abusers.
  2. In response, many people have migrated to Mastodon (very Twitter-like in many respects but decentralized) and some other venues. Some (myself included) maintain a presence on Twitter but, keeping a DNFTT (do not feed the trolls) stance, refrain from engaging. However, in doing so, a lot of communities are at risk of disappearing or, at least, becoming so fragmented that they are obscure social backwaters.
  3. ChatGPT has emerged as a publicly available tool to generate text that can, in many cases, pass a Turing test. Artificial intelligence, deep learning and large language models have advanced greatly of late. We’ve seen text prompts generate images with DALL-E that raised questions about what art is, claims that Google’s LaMDA had gained self-awareness and other such stories.

Twitter’s descent into a tire fire didn’t start with Musk’s acquisition. Just as Roosevelt (Franklin, not Teddy) initiated a sea change by having fireside chats with Americans via radio, Donald Trump exploited Twitter to parlay celebrity and unfiltered access into a movement that (along with some not insignificant social media manipulation help from the Russians) propelled him into the White House. With the United States’ official troll-in-chief aboard, Twitter became a venue for driving hyper-polarization in the country’s political divides. Musk just poured fuel on the fire by taking the guardrails (imperfect as they were) almost completely off the venue. As the trolls increasingly have free run of the asylum, Twitter’s descent into irrelevance and toxicity has accelerated; the platform seems destined to continue devolving into something as commercially valuable as AOL and as socially valuable as 4chan. In other words, totally worthless. I like your transportation and energy investments, Elon, but Twitter… ¯\_(ツ)_/¯ $44B well spent, way to go. 🙄

While Mastodon has been around for a while, it still has some of that “new social network smell” where the discourse is (generally) more civil and it’s not polluted with the problems that swamped Twitter. However, much as the blogosphere benefited from connective tissue, the fediverse leaves much to be desired. Some of the benefits of a centralized service like Twitter include visibility into a broader zeitgeist (trending terms and hashtags) as well as search for content, people and topics. I have mused aloud about the need for a Technorati TNG to emerge to fill those gaps.

Having worked on the systems powering Technorati and Klout (and retrospecting on their business model collapses), I can attest to the big spend required to provide the infrastructure for that connective tissue. To say nothing of how it spawned an industry built around inauthenticity (“influencer marketing” has arguably brought out some of the worst instincts in humans). The challenge is to keep those costs low enough so as not to require tilting the functionality towards excess commercial exploitation and incentivizing the production of bullshit. I think by now we can see where mining people’s data to manipulate their buying decisions gets us; I haven’t disavowed Facebook yet, but I get close sometimes. If connective tissue for the fediverse can’t be built in a non-exploitative way (and without incentivizing platform abuse), its value is questionable. For the time being, you’ll find me there, but there will remain things I grieve about what has become of Twitter.

And now this lands me back on ChatGPT. If abusers’ content is passing a Turing test, appearing to be authored authentically by a human, one can only imagine how outlying behavior may grow increasingly difficult to identify. The web is at severe risk of being further diluted with inauthentic content as spammers raise link farms full of human-plausible text. The risk is sufficiently high that Google declared it a “code red” condition. Authentic content is going to be caught in the crossfire of an AI arms race on the web, with abusive publishers on the offense and indexers such as Google mounting the defense of content relevance.

Of course (as far as we know), there are no sentient AIs as of yet. In the ChatGPT case, it just predicts what words should be produced based on analysis of a huge corpus of text. But just for giggles, I asked it to “explain how large language models and generative AI can make the web spam and SEO abuse problem worse.” Thankfully, ChatGPT didn’t show any signs of self-awareness, though it would have been funny if it had closed with a Hippocratic promise to do no harm. Here’s its response:

It is possible that large language models and generative AI could potentially be used to generate spam content or manipulate search engine results in an attempt to improve a website’s ranking. These techniques are known as spamdexing and black hat SEO, and they are generally seen as harmful and deceptive practices that can undermine the integrity of the web.

However, it is also worth noting that these types of language models and AI systems can also be used to help detect and prevent spam and SEO abuse. For example, they could potentially be used to analyze website content and identify patterns or characteristics that are commonly associated with spam or manipulative practices. Additionally, they could be used to generate algorithms or rules that could help filter out or flag suspicious content or activities.

Overall, the impact of large language models and generative AI on web spam and SEO abuse will depend on how they are used and the measures that are put in place to prevent their abuse.

The Russians reportedly spent millions paying people to pump inauthentic content and social interactions onto the web during the 2016 presidential election; now we can foresee the cost of boosting provocative, divisive and incendiary content dropping dramatically. For that tidbit and other, much more thought-provoking ideas, I recommend giving Ezra Klein’s recent conversation with Gary Marcus a listen.

I’ll posit that the emerging inauthentic web won’t be confined to textual content. Just over the horizon, you’ll be hearing “music” generated by an AI (“play a song about peanut butter cups that grow on trees on an island inhabited by Godzilla, in the style of Mariah Carey”). Plan on seeing some egregious deep fakes: videos of known figures seemingly doing and saying incriminating things, with a synthesized voice uttering disgusting absurdities generated by an AI (“show me Hunter Biden talking about how much he enjoyed raping children in Ukraine”). The AI genie is out of the bottle; one might imagine what that bodes for our cultural maladies and political discourse in the months and years ahead.
