The last two days, I’ve talked about Internet Ethics. I don’t know that there’s much more that I can say on that side of the subject. However, today’s technology does offer one more avenue to address, this time from the other side of the equation. That avenue is prevention.
In some cases, the use of the robots.txt file will keep your site safe. Robots are pieces of automated software used to index and scrape sites, and robots.txt is the standard way of telling them which parts of your site they may visit.
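As a sketch of how this works, here is a hypothetical robots.txt that lets a search engine index everything but asks all other robots to stay out of a /comics/ directory, checked with Python's standard urllib.robotparser (the directory and user-agent names are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch anything (an empty
# Disallow means "allow all"), everyone else must skip /comics/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /comics/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/comics/strip.gif"))     # True
print(parser.can_fetch("SomeScraper", "/comics/strip.gif"))   # False
```

Of course, this only describes the rules; nothing in the protocol enforces them.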
If you operate a web site, you will often see entries in your log from Googlebot or Slurp (robots from Google and Inktomi, respectively). If you run a blog, you may also see robots from other search engines. Seems we are getting more robots on the web, not fewer.
Keep in mind, however, that abiding by the robots.txt file is purely voluntary, so for poorly or maliciously written software, it might not do the job. In that case, it might be time to do something on your side.
As one example, King Features Syndicate seems to do some checking on the source of the requester (perhaps by checking the Referer header) to make sure you are allowed access to the data. If you view one of their strips on a valid licensee’s page, you’ll see the image for that strip. If you try to access the strip directly, as would be the case through an RSS feed, you get a generic “blank” image and no comic.
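I don’t know exactly how their servers decide, but a referer check of this sort could be sketched in a few lines. The licensee hostnames and image paths below are entirely hypothetical:

```python
from urllib.parse import urlparse

# Hypothetical list of sites licensed to display the strips.
LICENSED_SITES = {"www.example-newspaper.com", "comics.example.org"}
BLANK_IMAGE = "/images/blank.gif"

def image_for_request(strip_path, referer):
    """Serve the real strip only when the Referer names a licensed site;
    direct requests (no Referer) or unknown sites get the blank image."""
    if referer and urlparse(referer).hostname in LICENSED_SITES:
        return strip_path
    return BLANK_IMAGE
```

A request coming from a licensee’s page gets the strip; a direct fetch, such as one made by an RSS reader, gets the blank image. (Referer headers can be forged, so this is a deterrent, not real security.)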
In my time on the web, I’ve also seen other solutions, for instance sites that use CGI applications to generate the image URL dynamically so that this sort of screenscraping won’t work. Instead of a fixed, typical URL to an image, a program processes certain input and returns a somewhat randomized string pointing to the image. That is very difficult to reproduce programmatically; at the very least, the scraping script would need to change, and if some sort of encrypting pattern is used on the URL, it might not be possible to change it correctly at all.
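One way such unguessable URLs might be built is by signing them with a secret that only the server knows; an HMAC is a common choice. This is my own sketch of the idea, not any particular site’s implementation, and the CGI script name and parameters are invented:

```python
import hashlib
import hmac

SECRET = b"keep-this-on-the-server"  # hypothetical server-side secret

def signed_image_url(strip_id, date):
    """Build an image URL whose token only the server can compute."""
    token = hmac.new(SECRET, f"{strip_id}:{date}".encode(),
                     hashlib.sha256).hexdigest()
    return f"/strip.cgi?id={strip_id}&date={date}&token={token}"

def token_is_valid(strip_id, date, token):
    """Recompute the token server-side and compare in constant time."""
    expected = hmac.new(SECRET, f"{strip_id}:{date}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)
```

A scraper that tries to guess tomorrow’s URL, or to reuse today’s token on a different date, fails the check, since it can’t compute the token without the secret.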
Other producers of material have added watermarks to their images.
Perhaps issuing strips in alternate formats, such as Shockwave Flash, would be a solution. This would allow additional data, such as URLs or advertisements, to be embedded in the feed. Flash, at least at the moment, doesn’t seem to be supported through RSS, so that could solve the problem in an entirely different way. Of course, it might also prevent some users from viewing the content directly from the web site, as they would need the Flash plugin (which is free, but nonetheless, some people won’t have it).
Just because someone can scrape a page doesn’t necessarily mean that they should. But when alternatives exist that would prevent people from doing just that, perhaps the content producers should implement some or all of these ideas. Just in case.