Opened 5 years ago

Closed 5 years ago

#11958 closed Patch - Feature (Won't Fix)

PATCH: Robots.txt for mythweb

Reported by: skd5aner <skd5aner@…> Owned by: Rob Smith
Priority: minor Milestone: unknown
Component: Plugin - MythWeb Version: 0.27-fixes
Severity: medium Keywords:
Cc: Ticket locked: no

Description

Over the years, several people have accidentally left MythWeb exposed to the internet. Although this is strongly discouraged for a multitude of reasons, one of the biggest problems in this situation is that search engines can wreak havoc by crawling the site and following links that modify the system. There are dozens of reports of bots such as Googlebot modifying all recording rules, etc.

At the very least, including a robots.txt file will keep the majority of well-behaved bots from causing havoc when crawling the site. It will not prevent bots that ignore robots.txt from causing issues. This is not a magic bullet, but it is a simple way to add a small layer of protection against something that could otherwise cause someone to have a really bad day.

The attached patch simply creates a robots.txt file at the root of the mythweb directory and disallows all user-agents from crawling the site. The patch can be applied to master as well as to 0.27 and 0.26.
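For illustration only, here is a minimal sketch of how a compliant crawler would interpret a blanket "disallow everything" robots.txt. The attachment's exact contents are not reproduced in this ticket, so the two-line file below is an assumption based on the description, and the host name and path are hypothetical. It uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the patched robots.txt (not copied from the attachment):
# a blanket rule asking every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical host and path, used only for the demonstration.
    print(is_allowed("Googlebot", "http://mythbox.example/tv/upcoming"))
```

This only models crawlers that choose to honor the file; a bot that ignores robots.txt never performs this check at all.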

Attachments (1)

0001-Add-Robots.txt-to-prevent-some-search-engines-from-c.patch (589 bytes) - added by skd5aner <skd5aner@…> 5 years ago.


Change History (3)

Changed 5 years ago by skd5aner <skd5aner@…>

comment:1 Changed 5 years ago by sphery

FWIW, I feel adding a robots.txt to MythWeb will add a false sense of security for users, and allow them to continue to ignorantly run an open MythWeb site (at great risk to all their recordings, settings, ...). And, since a robots.txt is completely useless on a properly-configured MythWeb install (which includes authentication, which will prevent crawlers from accessing the site, anyway), including a robots.txt may add confusion about whether MythWeb is "safe" to run without authentication on the WWW.

The robots.txt will only work if MythWeb is installed as the root application on the web server (or virtual server), and only if the client actually chooses to adhere to the advice in the robots.txt file. Users who know the limitations of robots.txt are free to add one to their servers (but should also realize that robots.txt is unnecessary because they need to add authentication to MythWeb), but including one by default may make users who don't understand those limitations feel protected from danger (when I'll argue it actually does the opposite).
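The root-application limitation follows from where crawlers look for the file: they request /robots.txt at the top level of the host, not inside a subdirectory. A minimal sketch of that lookup, using a hypothetical host and a hypothetical /mythweb/ install path:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would consult for the given page."""
    parts = urlsplit(page_url)
    # Only the scheme and host are kept; the page's own path is discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical install under /mythweb/ on a hypothetical host.
    print(robots_url("http://mythbox.example/mythweb/tv/list"))
```

For a MythWeb installed under /mythweb/, a file shipped at /mythweb/robots.txt would therefore never be consulted.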

I have not once seen a single report of MythWeb's lockdown feature ( http://www.gossamer-threads.com/lists/mythtv/commits/338813#338813 ) being triggered since it was added in June 2008. I am also pretty certain that a panicking user whose MythWeb stopped working would not find discussion of the feature in the README without requesting help on the list. So I am relatively confident in saying that it seems unlikely that any MythWeb user's site has been crawled since then, especially since distros are doing such a good job of properly securing MythWeb installs and since directed queries on a couple of the major search engines return no results showing open MythWeb sites (for all practical purposes: I found one that seems to be an old one, as the site no longer exists).

I'd argue that the active lockdown feature is a better solution in that it will a) trigger when a MythWeb site is crawled and b) lock MythWeb down until the user manually unlocks it (ideally after fixing his site to include proper authentication, as described in the README's answer to "how do I unlock the install?" - http://code.mythtv.org/cgit/mythweb/tree/README#n155 ). Therefore, a robots.txt that excludes web crawlers from accessing the site will only prevent the user from being (forcibly) informed that he needs to fix his insecure configuration to include authentication.

comment:2 Changed 5 years ago by Rob Smith

Resolution: Won't Fix
Status: new → closed

The current default template has meta tags to prevent robots from indexing the content:

<meta name="robots" content="noindex, nofollow">

This is the same as having a robots.txt.
