I started working on my theme for Micro.blog in the hope of making it accessible and parseable through schema and microformats. It was spurred by wanting my posts to reach as many people as possible. I’m not saying that I’m a literary genius or have something deep or profound to say, but I do have some really cool pictures of my dog as well.
Years have gone by now, and I’ve refined and enhanced what I made initially. Posts can show previews in Bluesky, Mastodon, and Discord without me constantly worrying about formats. This technical project has been completed (for the most part).
And now, I’m faced with another technical itch.
Corporations are now crawling the World Wide Web to collect data they can use to train LLMs. That’s not to say they value the writings of a 40+ year old man from California, whose average post runs around 45 words, more than those of the thousands of artists and writers available, but having perfectly structured data¹ would be nice. In the grand scheme of things, I’m neither significant nor singular as far as data points go. I regularly see people who walk, talk, and dress like me. I listen to at least two podcasts whose hosts not only share my views but talk in a similar vocal range and cadence².
But what should I do?
I’ve seen articles about editing the robots.txt file that your website serves up to tell the various bots not to use the site for data. This feels like a whack-a-mole solution, and I came across a message saying that it only stops the companies that have standards. At this point in development, there seems to be a “better to ask for forgiveness than permission” strategy in place for these corporations as they pursue the collection of data. Unless you are rich and/or famous, the information is taken with a middle finger pointed at the EULA.
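For anyone curious what that kind of robots.txt edit looks like, here is a minimal sketch in Python. The rules are an illustrative assumption, not a complete or current list: GPTBot and CCBot are two crawler names commonly blocked, and the example.com URL and the `is_allowed` helper are made up for the demonstration. It only shows how a well-behaved crawler would read the file; nothing here forces a bot to obey it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: block two known AI-related crawlers and
# leave the site open to everything else. Real lists are much longer
# and change often (hence the whack-a-mole feeling).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical post URL, used only for the demonstration.
    post = "https://example.com/2024/01/01/my-dog.html"
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(agent, post) else "blocked")
```

The check is entirely on the crawler’s side, which is the weakness described above: a bot that never reads the file, or ignores it, is not stopped by any of this.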
The summaries of my articles are already distributed/federated to services that I have no control over. This was understood when I set the different systems up and I’m still ok with that to a degree.
But, seriously, what am I going to do?
At the moment, I’m not making any changes to robots.txt. I feel that some of my writing might be important enough to include in the data set for our new robot overlords protectors.
I talk about mental health.
I talk about media.
I talk about inconsequential items and sometimes important things.
And, yes, I do post about my dog as well.
All in the hope that -maybe- someone out there gets it.