delirium happy

Just keep on trying till you run out of cake

Google blogsearch
Apparently, google now has a blogsearch feature. And this is all nice and good, except for the fact that I trust google about as far as I could throw them. If they were carrying all of their stored data. Written out in full. On stone tablets. Which isn't very far. So, I went to look at their FAQ page to see if they listed any way to get out of their index. And lo and behold, there are! Three of them, in fact. If you don't know better, this looks fairly promising, even if most people would be rather blinded by the science. Fortunately for me, as the former admin of LJ's syndication support category, I'm not most people, and I do know better. Handy that. As it happens, all three of the options they supply suck. One of them is akin to saying "if you take your website off the web, then we'll stop indexing it" and the other two just wouldn't work at all. Unimpressed, google, unimpressed.

So I sent them an email:

To whom it may concern,

I was recently made aware of your blog search service, and wish to have my blog removed from your index. Your help page (http://www.google.com/help/about_blogsearch.html) states that I can do so in one of three ways: not publishing a data feed of my blog in the first place, using a robots.txt file, or using noindex and nofollow meta tags. I find each of these solutions unsatisfactory, for the following reasons:

Removing my RSS and atom feeds entirely would also prevent them from being used for other purposes, such as desktop RSS aggregators. Since RSS and atom are both primarily designed for syndication and aggregation rather than for archival, I find it unreasonable that I should have to block the primary uses to also block your secondary uses.

Adding a robots.txt file is simply not possible in my case. Due to the nature of blogging, many people have blogs that are hosted by large sites, with all blogs being served from the same domain. In these cases, it is not possible to produce a robots.txt file listing the blogs of all users who do not wish to be archived due to the sheer size that would be required. (For instance, the blog host that I use, LiveJournal, has over eight million accounts, all of them accessible at http://www.livejournal.com/users/username.)

Finally, you provide as a third option the use of meta tags. This is also unsatisfactory, since the syndication formats involved do not include any space for such meta tags within their specifications. While atom and later versions of RSS are designed to be extensible, I am not aware of any namespace which allows for the addition of these meta elements that are standard in HTML files. If you are aware of such, I would be happy to hear of it.

With these points in mind, can you please tell me what steps I should take to remove my blog from your listing?

Many Thanks,
Rachel Walmsley

Thus far, all I've got back from them is an automated reply:

Thank you for your note about Blog Search (beta). Due to the large volume
of email requests we receive, we may not be able to personally respond to
your email. We hope you'll visit our FAQ at
http://www.google.com/help/about_blogsearch.html for our most recently
updated information.

Thank you for using Blog Search.

The Google Team

I'm not holding my breath for any sort of helpful response.

More disturbing are the Support requests that indicate Google has somehow indexed friends-only entries. I'm kind of WTFing on those.

I rather suspect that in most if not all of those cases, the person would have made the entry public initially, then changed it to friends only, but google managed to sneak in while it was still public, and has stubbornly refused to let go since.

Why does it bother you that Google is indexing LJ, specifically your LJ? I'm inclined to think it's a good thing; it's always annoyed me how hard it is to find stuff on LJ with Google, and working without Google feels to me like working blindfolded and with both hands tied behind my back. It's not about whether I "trust" Google; I can't think of a malicious use of my LJ entries that would really bother me.

With LJ's friends locking system in place, which I do trust within the limits of the second law of thermodynamics, I don't see why this is a problem. Can you explain further why you are so keen to opt out of Google?

Well, I'd be lying if I tried to claim that the fact that they don't provide any way to opt out wasn't a big factor. I don't like anything that tries to use my data for its own purposes without giving me any say over it. I'll accept that things like search engines wouldn't work if they were opt in, but I do think that they need to be easy enough to opt out of.

Secondly, it makes it pretty much impossible to write anything that isn't friends locked without having it show up in search engines. Now, for most of the things I write, I'm either happy for them to be searchable, or I want them to be friends locked, but I like to retain the option, for things like this.

If you delete an entry, or make it non-public, then it's still there in their archive. With a regular web-crawl, a deleted webpage will eventually be recrawled, the deletion noted, and it'll be gone. (From anywhere publicly accessible, at least, although apparently google doesn't delete any of their old data; and there's also their cache feature which I equally dislike. But in theory, once it's recrawled, it's gone.)

On the other hand, RSS is designed so that old entries will drop off the feed. So there's no reason for them to recrawl, and no indication that it will ever be deleted. In fact, their FAQ explicitly says "However, if you previously published a site feed that was included, the old posts will remain in the index, even though new ones are not added." This doesn't fill me with confidence that they're respecting my right to control my copyrighted material.
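To make that concrete, here's a toy sketch of a feed that only shows the newest few entries and an indexer that only ever reads the feed. Everything in it (the feed length, the names, the dict standing in for an index) is made up for illustration; it says nothing about how google actually stores things.

```python
from collections import deque

FEED_SIZE = 3                    # assumed window; real feeds pick their own length
feed = deque(maxlen=FEED_SIZE)   # oldest items fall off automatically
index = {}                       # what a feed-only indexer has stored so far


def publish(entry_id, text):
    """Add a new entry to the feed, pushing out the oldest if it is full."""
    feed.append((entry_id, text))


def crawl_feed():
    """Store whatever is in the feed right now; nothing is ever removed."""
    for entry_id, text in feed:
        index[entry_id] = text


# Five entries get published, with a crawl after each one.
for n in range(1, 6):
    publish(n, f"entry {n}")
    crawl_feed()

# Suppose the author now deletes entry 1. It left the feed long ago,
# so no later crawl of the feed can carry that deletion to the index.
crawl_feed()
still_indexed = 1 in index
in_feed = any(entry_id == 1 for entry_id, _ in feed)
```

In this toy model the entry is gone from the feed but still sits in the index, which is the situation the FAQ quote describes.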

And of course, the fact that I generally don't trust google very far anyway also doesn't help.

Essentially, in the vast majority of cases, I don't care about my public entries being archived, but I do like to still have an option in those few cases where that isn't the case. Google seems to be trying to take this option away from me.

I like to not use my full real name on LJ. If parents / employers / etc want to stalk me I'd rather it took them effort. But a long time ago, I stupidly wanted to rant about a set of email correspondence, which I just copied and pasted into my blog. The emails were signed with my full name, because that's what I do in important emails.

Now the first hit for my full name on google blogs is my LJ. I don't want this to be the case.

I admit, this is my own stupid fault. But it was one slip over a year ago, and as far as I can tell (I've edited the post now so it only says my first name instead of my surname) google blogs will *never* recache the page, and there is *no way* I can decouple my LJ from my real name now. I will always remain the first hit. Which is a Right Royal Pain.

Then again, I'm sure if I fixed the problem, googling for my first name, and enough of the people / interests / places I have would show up my blog anyway. Maybe I should go completely friends only, but I like being able to talk to random passers by who stumble across my LJ.

Gah. Damn google blogs, for reminding us that if we put stuff on the internet it's on the internet

Why doesn't robots.txt work? What about "Block Robots/Spiders from indexing your journal - If you check this option, robots will be told to go away. Not all robots respect the rules."
I believe it works with http://blogs.yandex.ru - a blog search that appeared earlier this year.

robots.txt would have to go in the root directory of the webserver, that is, at http://www.livejournal.com/robots.txt. This file would have to list every single user who wanted to block robots and spiders, which would be a huge list, and above the size limit for robots.txt. There are individual per-user robots.txt files for paid accounts at http://username.livejournal.com/robots.txt, but they won't work for URLs of the form: http://www.livejournal.com/users/username.
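A rough sketch of both halves of that, using Python's standard urllib.robotparser. The usernames, the bot name and the rule layout are all invented for the example, and the size figure assumes the worst case where all eight million accounts opt out; this is not LiveJournal's real robots.txt.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(url):
    """robots.txt is looked up at the root of whichever host serves the URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical root file: one Disallow line per opted-out user.
opted_out = ["alice", "bob"]  # made-up usernames
root_rules = ["User-agent: *"] + [f"Disallow: /users/{name}/" for name in opted_out]

root = RobotFileParser()
root.parse(root_rules)


def allowed(url):
    """Ask whether a (hypothetical) well-behaved bot may fetch this URL."""
    return root.can_fetch("ExampleBot", url)


alice_ok = allowed("http://www.livejournal.com/users/alice/data/rss")
carol_ok = allowed("http://www.livejournal.com/users/carol/data/rss")

# Worst-case size if every account needed its own line (assumed figures).
bytes_per_line = len("Disallow: /users/username/\n")
worst_case_bytes = bytes_per_line * 8_000_000
```

Note that robots_url_for gives http://www.livejournal.com/robots.txt for any /users/username URL, which is why a per-user file on username.livejournal.com never comes into it.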

So what that option does instead is put meta tags on the individual html pages to tell search engines to go away. But if you look at an RSS feed (eg http://www.livejournal.com/users/rho/data/rss) then you'll see that that doesn't include the meta tags. Which is quite right, because there's nothing in the RSS specifications to allow for the inclusion of meta tags.

So google is saying "hey, look! no robots.txt telling us to go away, and no meta tags in the file! this means we can archive!" But the problem is that there's no way to opt out of it.
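The meta tag point can be sketched like this. The HTML page and the RSS feed below are tiny made-up samples, not real LiveJournal output; the idea is only that a robots meta tag has somewhere to live in an HTML head, while a plain RSS 2.0 feed has no element that plays that role.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample documents for illustration only.
HTML_PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body>entry text</body></html>"
)
RSS_FEED = (
    '<rss version="2.0"><channel><title>example</title>'
    "<link>http://example.invalid/</link><description>sample</description>"
    "<item><title>entry</title><description>entry text</description></item>"
    "</channel></rss>"
)


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def html_directives(page):
    finder = RobotsMetaFinder()
    finder.feed(page)
    return finder.directives


def rss_meta_elements(feed):
    """List any elements named 'meta' in the feed (the sample has none)."""
    root = ET.fromstring(feed)
    return [el.tag for el in root.iter() if el.tag.lower() == "meta"]
```

A crawler that reads only the feed gets the second document, so the noindex on the HTML page never reaches it.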

I have that option set, and at least one of my journal entries is in the search results.

I'm still reading support_interim and have thus learned that there's a "synlevel" option that can be set via the console; set synlevel 'level' where level is title, summary, or full. If I understand this correctly, this enables you to sharply limit how much of your public entries is syndicated via RSS.

Yes, Google ought to be more responsive, but this is something. Now to suggest that LJ include this in the FAQs.

Yes, but this similarly falls under the first clause of her email: "Affecting the primary use (aggregation) to account for the secondary/archival use is not cool."

Yes. I remembered that this option existed, but couldn't remember what it was. It's kind of a pain, because it blocks legitimate use as well though. I've not yet decided whether I'm planning on using it or not. Probably depends on whether anyone actually does read my journal via its RSS/atom feeds. It's definitely Not Cool though, as Chris pointed out.

I have a vague recollection that a decision was made not to include information about this in the FAQ, but I'm not even remotely confident in my memory on that one. You'd have to ask someone who's actually currently involved in lj_userdoc if you want a more concrete answer.

I asked in support_interim and was told that whoever decides these things had decided not to publicize it because they hadn't wanted to add it in the first place.

I would guess the policy decision was made somewhere around the same time as http://www.livejournal.com/community/lj_dev/681766.html was announced - amusingly enough, Rahaeli, who fought so hard on the "If it's syndicated, it's no different than HTML" side, is now Indicating that Google Is Slimy for doing this.

Yawn. "People are going to copy your content" thrown back at you, and you wonder why they were upset in the first place...

The ideal way to solve this would be the same as the way you tell Google Groups not to index your usenet posts. All you'd do is make a public post containing the line
X-No-Archive: Yes
and the feed would be unindexed from then on.

my own letter to Google

From: Thomas Thurman
To: suggestions@google.com
Subject: X-No-Archive equivalent for Google Blog Search
Date: Wed, 14 Sep 2005 21:36:41 -0400

Since before its acquisition by Google, Google Groups has supported a
feature whereby the string "X-No-Archive: yes" as the first line of a
news post causes its contents not to appear in the permanent index.

I'm writing to suggest that you implement a similar feature in the new
Google Blog Search. In the FAQ, you suggest certain methods of hiding a
public RSS feed from the spider, but these are problematic for some
bloggers, particularly users of very large sites like LiveJournal. I
suggest a rule where a blogger who does not want their RSS feed indexed
should post a string such as "X-No-Archive: yes" in an entry, causing
any RSS feed which contains that post not to be indexed.
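
A minimal sketch of how the suggested rule might be checked on the indexing side. This is only an illustration of the proposal: the function name and sample feeds are invented, and nothing here is an existing Google feature. It assumes an RSS 2.0 feed with the entry text in each item's description.

```python
import re
import xml.etree.ElementTree as ET

# Case-insensitive match for the marker string proposed in the letter.
MARKER = re.compile(r"X-No-Archive:\s*yes", re.IGNORECASE)


def feed_opts_out(feed_xml):
    """Return True if any item in the feed carries the opt-out marker."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        text = item.findtext("description") or ""
        if MARKER.search(text):
            return True
    return False


# Invented sample feeds for illustration.
OPTED_OUT_FEED = (
    '<rss version="2.0"><channel><title>example</title>'
    "<item><title>notice</title>"
    "<description>X-No-Archive: Yes</description></item>"
    "<item><title>entry</title><description>ordinary post</description></item>"
    "</channel></rss>"
)
ORDINARY_FEED = (
    '<rss version="2.0"><channel><title>example</title>'
    "<item><title>entry</title><description>ordinary post</description></item>"
    "</channel></rss>"
)
```

An indexer following this rule would skip any feed for which feed_opts_out returns True. Once the marker post scrolls off the feed the check stops firing, so a real implementation would have to remember the opt-out rather than retest every crawl.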

