Blog Archives
AI Ouroboros, Reddit Edition
Last year, if you recall, there was a mod-led protest at Reddit over some ham-fisted changes from the admins. Specifically, the admins implemented API pricing and throttling so severe that no 3rd-party Reddit app could survive. Even back then it was understood that the admins were snuffing out competition ahead of an eventual Reddit IPO.
Well, that time is nigh. If you want a piece of an 18-year-old social media company that has never posted a profit – $804m revenue, $90m net loss last year – you can (eventually) purchase $RDDT.
But that’s not the interesting thing. What’s interesting is that Google just purchased a license to harvest AI training material from Reddit, to the tune of $60 million/year. And who is Reddit’s 3rd-largest shareholder currently? Sam Altman, of OpenAI (aka ChatGPT) fame. It’s not immediately clear whether OpenAI has or even needs a similar license, but Altman owns twice as many shares as the current CEO of Reddit, so it probably doesn’t matter. In any case, that’s two of the largest AI players feeding off Reddit.
In many ways, leveraging Reddit was inevitable. It’s been an open secret for years that Google search results have been in decline, even before Google started plastering advertisements six layers deep. Who knew that once you let people get certified in Search Engine Optimization, search results would eventually turn to shit? Yeah, basically everyone. One of the few ways around that, though, was to append +Reddit to your search, which returned Reddit posts on the topic at hand. Were these intrinsically better results? Actually… yes. A site with weaponized SEO wins the moment it gets your click. But even though there are bots and karma whores and reposts and all manner of other nonsense on Reddit, posts fundamentally must receive upvotes to rise to the top, and that is a layer of friction SEO alone cannot game. Real human input from people who otherwise have no monetary incentive to contribute is much more likely to float to the top and be noticed.
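For anyone who somehow never picked up the trick: it is literally just scoping the query to Reddit. Here is a throwaway sketch of building such a search programmatically – the site: operator is the more robust form of typing +Reddit, and the example query is my own:

```python
# Build a Google search URL scoped to Reddit -- the "+Reddit" trick
# in its modern site:-operator form.
from urllib.parse import urlencode

def reddit_search_url(query: str) -> str:
    """Return a Google search URL restricted to reddit.com results."""
    return "https://www.google.com/search?" + urlencode(
        {"q": f"{query} site:reddit.com"}
    )

print(reddit_search_url("best budget office chair"))
# https://www.google.com/search?q=best+budget+office+chair+site%3Areddit.com
```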
Of course, anyone who actually spends any amount of time on Reddit will understand the downsides of using it for AI training purposes. One of the most upvoted comments on the Reddit post about this:
starstarstar42 3237 points 1 day ago*
Good luck with that, because vinyl siding eats winter squid and obsequious ladyhawk construction twice; first on truck conditioners and then with presidential urology.
Edit: I people my found have
That’s all a bit of cheeky fun, which will undoubtedly be filtered away by the training program. Probably.
What may not be filtered away as easily are the many hundreds/thousands of posts made by bot accounts that repost someone else’s comment from the same thread, verbatim. I’m not sure how or why it works, but the reposted content sometimes ends up rated higher than the original; perhaps there is some algorithm to detect a trending comment, which then gets copied and boosted with upvotes from other bot accounts? In any case, karma farming in this automated way allows the account to later be sold to others who need such (disposable) accounts to post in more specialized sub-Reddits that gate posting behind certain thresholds (e.g. account has to be 6+ months old and/or have 200+ karma, etc). Posts from these “mature” accounts are less obviously from bots.
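For what it’s worth, the verbatim flavor of this scam is trivial to catch. Here is a minimal sketch of the idea, assuming you have a thread’s comments in chronological order – the tuple layout is my own stand-in, not Reddit’s actual API:

```python
# Sketch: flag verbatim repost bots in a comment thread.
# Comments are assumed to be (author, text, score) tuples in
# chronological order -- a hypothetical layout for illustration.

def normalize(text: str) -> str:
    """Collapse case and whitespace so trivial edits don't dodge the match."""
    return " ".join(text.lower().split())

def find_reposts(comments: list[tuple[str, str, int]]) -> list[tuple[str, str]]:
    """Return (original_author, reposter) pairs for copied comments."""
    seen: dict[str, str] = {}  # normalized text -> first author to post it
    reposts = []
    for author, text, _score in comments:
        key = normalize(text)
        if key in seen and seen[key] != author:
            reposts.append((seen[key], author))
        else:
            seen.setdefault(key, author)
    return reposts

thread = [
    ("alice", "This is the best explanation I've seen.", 1200),
    ("karma_bot_77", "This is the best explanation  I've seen.", 2400),
]
print(find_reposts(thread))  # [('alice', 'karma_bot_77')]
```

Of course, that only catches lazy copy-paste; the moment the bots start paraphrasing – say, with an LLM – exact matching is useless.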
While that may not seem like a big deal at first, the endgame is the same as with SEO: gaming the system. The current bots hijack human posts to farm karma. The future bots will post human-like responses generated by AI to farm karma. Hell, the reinforcement mechanism is already there: upvotes! Meanwhile, Google and OpenAI will be consuming Reddit content that itself consists of more and more of their own AI output. The mythological Ouroboros was supposed to represent a cycle of death and rebirth, but the AI version is more akin to a dog eating its own shit.
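To make that feedback loop concrete, here is a toy simulation of my own devising – emphatically not anyone’s actual training pipeline – where each generation of “content” is sampled from a model fit to the previous generation, with upvotes acting as a filter that favors the middle of the road:

```python
# Toy ouroboros: each generation is sampled from a model fit to the
# previous generation, and only the most conventional half (a crude
# stand-in for "highly upvoted") survives to be trained on next.
# Purely illustrative numbers -- not a real training pipeline.
import random
import statistics

random.seed(1)
content = [random.gauss(0.0, 1.0) for _ in range(1000)]  # "human" content

for generation in range(8):
    mu = statistics.fmean(content)
    sigma = statistics.stdev(content)
    print(f"gen {generation}: diversity (sigma) = {sigma:.3f}")
    synthetic = [random.gauss(mu, sigma) for _ in range(1000)]
    # "Upvotes" reward conformity: keep only the half nearest the mean.
    synthetic.sort(key=lambda x: abs(x - mu))
    content = synthetic[:500]
```

In this cartoon version the measured diversity drops by more than half every generation; within a handful of cycles the “content” is indistinguishable mush. Researchers studying the unfiltered variant of this call it “model collapse.”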
I suppose sometime in the future it’s possible for the tech-bro handlers, or perhaps the AI itself, to recognize (via reinforcement) that they need to roll back an iteration after consuming too much self-generated content. Perhaps long-buried AOL chatroom logs and similar backups would become the new low-background steel, worth their weight in gold Bitcoin.
Then again, it may soon be an open question of how much non-AI content even exists on the internet anymore, by volume. This article mentions experts expect 90% of the internet to be “synthetically generated” by 2026. As in, like, 2 years from now. Or maybe it’s already happened, aka Dead Internet.
[Fake Edit] So… I wrote almost exactly this same post a year ago. I guess the update is: it’s happening.
Dead Internet
There are two ways to destroy something: break it outright, or dilute its utility down to nothing. The latter may be happening with the internet.
Let’s back up. I was browsing a Reddit Ask Me Anything (AMA) thread by a researcher who worked on creating “AI invisibility cloak” sweaters. The goal was to design “adversarial patterns” that essentially tricked AI-based cameras into no longer recognizing that a person was, in fact, a person (more on how that works in a moment). During the AMA, they were asked what they thought about language-model AI like GPT-3. The reply was:
I have a few major concerns about large language models.
– Language models could be used to flood the web with social media content to promote fake news. For example, they could be used to generate millions of unique twitter or reddit responses from sockpuppet accounts to promote a conspiracy theory or manipulate an election. In this respect, I think language models are far more dangerous than image-based deep fakes.
This struck me as interesting, as I would have assumed deep-faked celebrity endorsements – or even straight-up criminal framing – would have been a bigger issue for society. But… I think they are right.
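As an aside, the “invisibility cloak” half of that AMA is built on adversarial examples. The researchers’ actual method optimized a printable pattern against object detectors across many scenes; below is merely the textbook single-image version of the idea (FGSM, the fast gradient sign method), with a stand-in model and camera frame of my own invention:

```python
# Textbook FGSM adversarial example: nudge the input in the direction
# that *increases* the model's loss, so it stops recognizing what it is
# looking at. The real "cloak" research optimized a printable pattern
# across many scenes; the tiny model and 32x32 frame here are stand-ins.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.03):
    """Return a copy of `image` perturbed to increase the loss on `label`."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    return (image + epsilon * image.grad.sign()).clamp(0.0, 1.0).detach()

# Hypothetical two-class "person / not-a-person" classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
frame = torch.rand(1, 3, 32, 32)   # pretend camera frame
person = torch.tensor([1])         # class 1 = "person"
adversarial = fgsm(model, frame, person)
print(F.softmax(model(adversarial), dim=1))  # "person" confidence drops
```

Anyway, back to the language models.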
There is a conspiracy theory floating around for a number of years called “The Dead Internet Theory.” This Atlantic article explains in more detail, but the premise is that the internet “died” in 2016-2017 and almost all content since then has been generated by AI and propagated by bots. That is clearly absurd… mostly. First, I feel like articles written by AI today are pretty recognizable as being “off,” let alone what the quality would have been five years ago.
Second, in a moment of supreme irony, we’re already pretty inundated with vacuous articles written by human beings trying to trick algorithms, to the detriment of human readers. It’s called “Search Engine Optimization” and it’s everywhere. Ever wonder why cooking recipes on the internet have paragraphs of banal family history before giving you the steps? SEO. Are you annoyed when a piece of video game news that could have been summed up with two sentences takes three paragraphs to get to the point? SEO. Things have gotten so bad though that you pretty much have to engage in SEO defensively these days, lest you get buried on Page 27 of the search results.
And all of this is (presumably) before AI has gotten involved belting out 10,000 articles a second.
A lot has already been said about polarization in US politics and misinformation in general, but I do feel like the dilution of the internet’s utility has played a part in that. People have their own confirmation biases, yes, but it is also true that when there is so much nonsense everywhere, you retreat to the familiar. Can you trust this news outlet? Can you trust this expert citing that study? After a while, it simply becomes too much to research, and you end up choosing 1-2 sources that you thereafter defer to. Bam. Polarization. Well, that and certain topics – such as whether you should force a 10-year-old girl to give birth – afford no ready compromises.
In any case, I do see a potential nightmare scenario: a Cyberpunk-esque duel between AIs engaging in auto-SEO and AIs desperately trying to filter out the millions of posts/articles/tweets crafted to capture the attention of whatever human observers are left braving the madness between the pockets of “trusted” information. I would like to imagine saner heads would prevail before unleashing such AI, but… well… *gestures at everything in general.*
Blarghest
Aug 12
Posted by Azuriel
The last time I officially joined Blaugust was back in 2015. The conclusion I came to back then was that it wasn’t really worth the effort: posting every single day for a month did not meaningfully increase page views. I’m not trying to chase page views per se, but you can’t become a fan of something you don’t know about. Discoverability is a real issue, especially if you don’t want to juice SEO metrics in suspect ways. So, on a lark, I decided to rejoin Blaugust nine years later (i.e. this year) to at least throw my hat back in the ring and try to expand my (and others’) horizons.
What I’m finding is not particularly encouraging.
More specifically, I was looking at the list of participants. I’m not going to name names, but more than a few of the dozen I’ve browsed thus far appear to be almost nakedly commercial blogs (e.g. affiliate-linked), AI-based news aggregator sites, and similar nonsense. I’m not trying to be the blogging gatekeeper here, but is there no vetting process to keep out the spam? I suppose that may be a bit much to ask when 100+ people/bots sign up, but it also seems deeply counter-productive to the event’s stated mission.
Ahem. The calls are coming from inside the house, my friends.
[Fake Edit] In fairness, after getting through all 76 entries on the original list, the number of spam blogs did not increase much. Perhaps a non-standard ordering mechanism would have left a better first impression.
Anyway, we’ll have to see how this Blaugust plays out. I have added 10-20 new blogs to my Feedly roll and am interested to see where they go from here. Their initial stuff was good enough to pique my curiosity. The real trick, though, is who is still posting come September.