I Object to Google's Sitemap Generator

2013/10/16
2013/11/19
2018/06/28 fix old link

I oppose Google's sitemap generator (sitemap_gen.py) because it exposes every file under the document root unless the file is explicitly excluded. An alternative sitemap generator is presented here; it follows links and (by default) never exposes unlinked files to the outside world.

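When I retired I moved the web server, and the links had to be corrected as a result. I also rewrote every occurrence of sitemap.xml.gz in the original article to sitemap.xml, because compression apparently is no longer supported. In practice there is no benefit in compressing the file anyway.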

Sitemap Generator

2013/10/16

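Crawlers descend on my server from time to time. Not just Google's: all sorts of crawlers. When they come, they generate a large number of requests. Is that welcome or a nuisance? If crawlers do not pick up my data, the pages cannot be found by search, so it cannot be helped; but for an underpowered server like mine, a flood of requests is a heavy burden. The solution is to write the last-modified date next to each page in a list of pages, have the crawlers read that list, and let them fetch only the new pages. This is a good solution for the crawlers and for me alike. The method has apparently been in use since around 2005 as the Sitemap Protocol.
How to build the list of pages is explained on the following page. There is also a tool written by Google.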
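
As a rough illustration of the idea, the sketch below writes one <url> entry with a <lastmod> date per page, following the XML layout of the Sitemap protocol. The helper names lastmod and sitemap_xml and the example.com URL are assumptions made for this example; they do not come from any tool mentioned here.

```python
import time
from xml.sax.saxutils import escape

def lastmod(mtime):
    """Format a Unix modification time as a UTC timestamp for <lastmod>."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(mtime))

def sitemap_xml(entries):
    """entries: iterable of (url, mtime) pairs. Returns the sitemap text."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, mtime in entries:
        lines.append(" <url>")
        lines.append("  <loc>%s</loc>" % escape(url))   # escape &, <, >
        lines.append("  <lastmod>%s</lastmod>" % lastmod(mtime))
        lines.append(" </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

A crawler that reads such a file can compare each <lastmod> with the date of its previous visit and skip the pages that have not changed.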

Sitemaps XML format
http://www.sitemaps.org/ja/protocol.html
Webmaster Tools
http://sitemap-generators.googlecode.com/svn/trunk/docs/en/sitemap-generator.html
google-sitemap_gen
http://sourceforge.net/projects/goog-sitemapgen/files/sitemapgen/

Every file in plain view

2013/10/16

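Once my work had settled down, I installed Google's Sitemap Generator (sitemap_gen.py) on my server and looked over the list it produced. I was taken aback.
I had assumed that only linked pages would be gathered into the list. If I were writing such a tool, that is naturally how I would design it. But that is not what happens. sitemap_gen.py does not follow links; it examines every file below the document root and lists them all. (That is the easier way.) This is cutting corners, and the side effect is large. Far too large!
The space on my server that is reachable from a browser holds not only the properly linked files but also half-written articles and semi-private files that I would rather not have accessed. For example, I sometimes place files (photos or documents) there that I want to hand to a particular person. All of these become visible and, naturally, may end up in the hands of unintended third parties. Google's Sitemap Generator is therefore out of the question.
Oh well. I will write my own. (I have written it.)
It is currently on a trial run. I will release it before long. (It has been released.)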

Users of Sitemap Generator should take great care. Be aware that every file below the document root is made public. Never use the web server to hand private files to someone.

sitemapgen

2013/10/19

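The sitemapgen introduced here is a Python script that registers only a list of files and their last-modified dates in sitemap.xml. It is available at
http://p9.nyx.link/netlib/sitemapgen/
(sitemapgen-1.3.tgz)
My server runs Plan9, so the usage examples in the package are written for Plan9, but the script should run on unix without problems.
On unix, however, the first line will probably have to be changed from

#!/bin/python

to

#!/usr/bin/env python

(This depends on the system, so take care.)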

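Unless told otherwise, sitemapgen follows links and publishes only the files that are visible from a browser. These include HTML files, of course, and also image files and the like.
It can also register every file under specified directories in sitemap.xml; more than one directory can be given. However, the need for this arises only in very special cases (when you want to show a search engine files that a CGI hides from users). Google's sitemap generator applies this special need to every file below the document root.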
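
The difference between the two approaches can be sketched as follows. This is not the code of either tool: it is a minimal illustration that builds a throwaway document root with one unlinked draft, then lists files once by walking the directory and once by following links. The function names and the simplified href pattern (double-quoted, relative links only) are assumptions of the sketch.

```python
import os
import re
import tempfile

HREF = re.compile(r'href="([^"]+)"')  # simplified: double-quoted href only

def walk_all(docroot):
    """List every file under docroot, linked or not."""
    found = set()
    for dirpath, _dirs, files in os.walk(docroot):
        for name in files:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, docroot).replace(os.sep, "/"))
    return found

def follow_links(docroot, root="index.html"):
    """List only the files reachable from root through relative links."""
    seen, todo = set(), [root]
    while todo:
        rel = todo.pop()
        path = os.path.join(docroot, rel)
        if rel in seen or not os.path.isfile(path):
            continue
        seen.add(rel)
        if rel.endswith(".html"):
            with open(path, encoding="utf-8") as f:
                for href in HREF.findall(f.read()):
                    target = os.path.join(os.path.dirname(rel), href)
                    todo.append(os.path.normpath(target).replace(os.sep, "/"))
    return seen

def demo():
    """Return (files found by walking, files found by following links)."""
    with tempfile.TemporaryDirectory() as docroot:
        pages = {"index.html": '<a href="a.html">a</a>',
                 "a.html": "linked page",
                 "draft.html": "unlinked draft"}
        for name, text in pages.items():
            with open(os.path.join(docroot, name), "w", encoding="utf-8") as f:
                f.write(text)
        return walk_all(docroot), follow_links(docroot)
```

In demo(), the directory walk reports draft.html while the link-following pass does not, which is exactly the exposure discussed above.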

Note: Googlebot also accesses image files, presumably because Google image search needs them.

At present sitemapgen examines the following tags in HTML files.

a	area	img	embed	iframe	object

For details, see MAN_SITEMAPGEN in the package. (It is written in English.)
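
One way to collect links from these six tags is sketched below with the standard html.parser module. I have not checked how sitemapgen itself parses HTML, so treat this as an assumption-laden illustration: in particular, the mapping from each tag to the attribute that carries the link (href, src, data) is my own choice, as are the names LINK_ATTR, LinkCollector and extract_links.

```python
from html.parser import HTMLParser

# Tag -> attribute that carries the link. The tag list comes from the text;
# the attribute chosen for each tag is an assumption of this sketch.
LINK_ATTR = {"a": "href", "area": "href", "img": "src",
             "embed": "src", "iframe": "src", "object": "data"}

class LinkCollector(HTMLParser):
    """Collect link targets from the supported tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTR.get(tag)          # None for unsupported tags
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Supporting another tag in this sketch only means adding one entry to LINK_ATTR.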

A fix for the vulnerability?

2013/10/20

I found the following page.

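　The vulnerability in question is that Google's crawler registers even information about pages that exist on your site but that you do not want registered with Google, with the result that a third party can learn of the existence of every page on the site. To prevent this problem, Google asks the site owner, when registering with Sitemaps, to upload an HTML file with a unique file name specified by Google, and crawling via Google Sitemaps starts only after the existence of that file has been confirmed.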
http://internet.watch.impress.co.jp/cda/news/2005/11/21/9922.html

Note: assuming you are already registered with Google Webmaster, the "upload" mentioned in this article can be done at
https://www.google.com/webmasters/tools/home?hl=ja
by choosing
→ "Add a property"

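It is an old page, but it makes the problem clear.
The article says "an HTML file with a unique file name", but this is probably a mistake for "an XML file with a unique file name".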

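Google has adopted a rather odd solution: unless you tell Google the name of the file that holds the page list, sitemap.xml (any other name will do), Google will not fetch it. Indeed, from what I have observed, Google's crawler does not access sitemap.xml until it has been told the file name.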

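The file name needs to be made available so that any search engine can refer to it. To do that, write a line such as

Sitemap: http://ar.nyx.link/sitemap.xml

in robots.txt. sitemap.xml does not exist for Google alone. Saying that Google will not access the file until asked to is no solution at all.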
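
To show why the robots.txt line is enough for any crawler, here is a minimal sketch of how a crawler could pick the sitemap location out of robots.txt. The function name sitemap_urls is hypothetical, and the parsing is deliberately simple (it treats "#" as the start of a comment and matches the field name case-insensitively).

```python
def sitemap_urls(robots_txt):
    """Return the URLs named by Sitemap: lines in a robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        field, sep, value = line.partition(":")   # split at the first colon
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```

No registration with a particular search engine is involved: whoever reads robots.txt learns where the sitemap is.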

Manual: sitemapgen

2013/10/24
2013/11/19 updated

SITEMAPGEN(8)

NAME
	sitemapgen (ver.1.4)
	mksitemap

SYNOPSIS
	sitemapgen sitemap [-t type] sitemap.conf
	mksitemap


DESCRIPTION
	Sitemapgen is a tool that generates a sitemap of a host.
	The argument "sitemap.conf" is the path to sitemap.conf, which
	should be located in the document root of the host.
	The option for sitemapgen is:
	-t type
	where type is one of "xml" or "list" which denotes the output format.

	By default, sitemapgen follows links in HTML files and never exposes unlinked
	files to the outside world.

	In some special cases, you might want to expose files under specified directories.
	Directories that are controlled by a CGI program are such an example.
	Sitemapgen (ver.1.3 or above) provides a variable <expose> for the purpose.
	Note: files that begin with "." are not exposed.

	The below is an example sitemap.conf of one of my hosts:
		urlbase="http://ar.aichi-u.ac.jp"
		docroot="/usr/arisawa/http/doc"
		default="index.html"	# if url ends with "/".
		chkdir=False	# examine if the url is directory.
		root="index.html"	# analysis starts from docroot+"/"+root
		exclude=None	# default None
		#exclude=(r"^(semi/index\.html|semi/chat/index\.html)$")
		expose=None		# default None
		#expose=("netlib",)	# you can expose files under those directories.

	These lines are Python code that is executed by sitemapgen.
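
A configuration file that is itself Python can be loaded roughly as below. This is only a guess at the mechanism, not an excerpt from sitemapgen: the names SAMPLE_CONF, DEFAULTS and load_conf are made up for the sketch, and the default values are copied from the comments in the example above.

```python
SAMPLE_CONF = '''
urlbase="http://ar.aichi-u.ac.jp"
docroot="/usr/arisawa/http/doc"
default="index.html"
chkdir=False
root="index.html"
exclude=None
expose=("netlib",)
'''

# Values used when the configuration does not set a variable (assumed).
DEFAULTS = {"default": "index.html", "chkdir": False,
            "root": "index.html", "exclude": None, "expose": None}

def load_conf(text):
    """Execute configuration text as Python and return its variables."""
    conf = dict(DEFAULTS)
    exec(text, {}, conf)   # runs arbitrary code: only load files you trust
    return conf
```

Because the file is executed, a trailing comma matters: expose=("netlib",) is a one-element tuple, while expose=("netlib") would be a plain string.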

	Note that if "foo/index.html" is excluded, then the file:
		docroot+"/"+ "foo/index.html"
	is not evaluated. Thus the links in that file are not followed, and files
	reachable only through it are not listed in the sitemap.
	The link chains are cut off at "foo/index.html".
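
The effect of an exclusion on the link chain can be sketched on a small in-memory link graph. The graph LINKS, the function name reachable and the breadth-first order are assumptions of this illustration; only the regular-expression form of "exclude" comes from the manual.

```python
import re

def reachable(links, root, exclude=None):
    """links: dict mapping a path to the list of paths it links to."""
    pattern = re.compile(exclude) if exclude else None
    seen, todo = [], [root]
    while todo:
        path = todo.pop(0)
        if path in seen or (pattern and pattern.search(path)):
            continue          # excluded files are neither listed nor parsed
        seen.append(path)
        todo.extend(links.get(path, []))
    return seen

# Hypothetical site: foo/deep.html is linked only from foo/index.html.
LINKS = {"index.html": ["foo/index.html", "bar.html"],
         "foo/index.html": ["foo/deep.html"]}
```

With exclude=r"^foo/index\.html$" the result contains neither foo/index.html nor foo/deep.html; the latter would reappear only if some other listed page linked to it.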

	Sitemapgen lists files and their last-modified dates. An example entry
	for a file is as follows:
	 <url>
	  <loc>http://ar.aichi-u.ac.jp/photo/1980/terminal.jpg</loc>
	  <lastmod>2005-03-13T13:33:33Z</lastmod>
	 </url>
	This is extracted from an img tag such as
		<img src="terminal.jpg" width=320>
	in an HTML file.

	There are several variations of URL expressions. For example,
		src="../1980/terminal.jpg"
		src="/photo/1980/terminal.jpg"
		src="//ar.aichi-u.ac.jp/photo/1980/terminal.jpg"
		src="http://ar.aichi-u.ac.jp/photo/1980/terminal.jpg"
	All of them are valid URLs and supported by sitemapgen.
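
All of these forms can be reduced to one path with the standard urllib.parse module, as in the sketch below. It is an illustration, not sitemapgen's code: the function name to_docroot_path is hypothetical, and the page URL PAGE is an assumed location for the HTML file that contains the img tag.

```python
from urllib.parse import urljoin, urlsplit

def to_docroot_path(page_url, ref, host="ar.aichi-u.ac.jp"):
    """Resolve ref against page_url; return its path if it is on host, else None."""
    absolute = urljoin(page_url, ref)     # handles relative, //host and full URLs
    parts = urlsplit(absolute)
    if parts.netloc != host:
        return None                       # link to another site: not ours to list
    return parts.path

# Assumed URL of the page that holds the <img> tag.
PAGE = "http://ar.aichi-u.ac.jp/photo/1980/index.html"
```

Every variant shown above, as well as the bare "terminal.jpg", resolves to /photo/1980/terminal.jpg under this assumption.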
	In addition, HTML5 allows the following syntaxes for an attribute and its value.
		attr="value"
		attr='value'
		attr=value		# unquoted expression, allowed under certain conditions.
	They are also supported by sitemapgen.
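
A single regular expression can cover the three syntaxes. The pattern below is a simplified sketch written for this page, not the expression used inside sitemapgen; the names ATTR and parse_attrs are hypothetical, and corner cases of the HTML5 grammar are ignored.

```python
import re

# name = "double quoted" | 'single quoted' | unquoted
ATTR = re.compile(r'''([a-zA-Z][-\w]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))''')

def parse_attrs(tag_text):
    """Return a dict of attribute names (lowercased) to values for one tag."""
    result = {}
    for m in ATTR.finditer(tag_text):
        name = m.group(1).lower()
        # Exactly one of the three value groups is not None.
        value = next(g for g in m.group(2, 3, 4) if g is not None)
        result[name] = value
    return result
```

The unquoted alternative stops at whitespace and at the characters that HTML5 forbids in unquoted values, so width=320> yields "320".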

	Currently sitemapgen supports the following tags:
		a	area	img	embed	iframe	object
	If you want more tags, change the relevant lines of the python code.
	I think the modification is very easy if you have some knowledge of HTML.

	You need a description in robots.txt. The following is my example:
		Sitemap: http://ar.aichi-u.ac.jp/sitemap.xml
		User-agent: *
		Disallow: /semi/

	Mksitemap is an example script that is used on my server.
	Note that my server is very special. You need your own script.


	WARNING:
	Google's sitemap tool (sitemap_gen.py) does not follow links
	and exposes all files under the document root to the outside world.
	Please check that the output of sitemapgen is really what you want.

CHANGE LOG
	ver.1.3 to ver.1.4
	Changed the format of "exclude". It is now a regular expression.

	ver.1.2 to ver.1.3
	Added "expose", which is a list of directories.

	ver.1.1 to ver.1.2
	Added "chkdir". The value is True or False.
	If False, a url in href must end with "/" for a directory.
	The value True removes this restriction at the cost of speed in generating sitemap.

	ver.1.0 to ver.1.1
	Added "exclude", which is a list of files.

How effective is it?

2013/11/02
2013/11/12

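The following is a list of the robots that accessed sitemap.xml within the last 100,000 lines of my web server logs (http://ar.aichi-u.ac.jp and http://plan9.aichi-u.ac.jp). The log records both each request and its result, so the number of accesses can be estimated at half that figure. The list shows requests only. The dates show that it covers roughly one week.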

Oct 26 06:04:59 google
Oct 26 06:17:18 yandex
Oct 26 07:42:14 yandex
Oct 26 09:37:01 dynamic
Oct 26 18:54:50 google
Oct 26 21:33:36 google
Oct 27 04:41:13 google
Oct 27 17:37:03 google
Oct 28 01:30:17 google
Oct 28 06:01:51 dynamic
Oct 28 06:50:20 google
Oct 28 08:07:07 msn
Oct 28 18:58:49 google
Oct 29 23:23:19 yandex
Oct 30 01:30:57 dynamic
Oct 30 01:31:04 dynamic
Oct 30 05:53:00 yandex
Oct 30 19:17:51 msn
Oct 30 21:36:27 yandex
Oct 31 04:33:31 dynamic
Oct 31 04:33:39 dynamic
Oct 31 05:04:37 msn
Nov 2 00:34:44 google
Nov 2 03:44:02 yandex
Nov 2 05:34:36 yandex
Nov 2 06:52:56 dynamic
Nov 2 06:53:02 dynamic
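
A listing in this form is easy to tally. The sketch below counts entries per robot label; SAMPLE holds only a few lines copied from the list above, and the function name count_robots is made up for the example.

```python
from collections import Counter

SAMPLE = """\
Oct 26 06:04:59 google
Oct 26 06:17:18 yandex
Oct 26 07:42:14 yandex
Oct 26 09:37:01 dynamic
Oct 28 08:07:07 msn
Nov 2 00:34:44 google
"""

def count_robots(listing):
    """Count entries per robot label (the last field of each line)."""
    return Counter(line.split()[-1]
                   for line in listing.splitlines() if line.strip())
```

Counting the full list above by hand gives google 9, dynamic 8, yandex 7 and msn 3.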

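Not shown in this list, but robots that accessed sitemap.xml in the past include goo and mesh, as well as minor engines (others).
The entries marked dynamic clearly come from dynamic IP addresses. It is hard to believe that a proper search engine would use dynamic IP addresses, so I take their aim to be the "tasty information" held on the server.
Also absent from this list are robots that are not registered in DNS (unknown) but access sitemap.xml. Like the robots on dynamic IP addresses, they are suspicious.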

Now, the question is whether sitemap.xml has reduced wasted accesses from robots. Hard to say.
Judging from the logs, Google appears to make use of it. But the other robots...

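The entries written as msn are msnbot (Microsoft's robot). From the list above Google looks the most active, but in overall activity msnbot is far ahead of the rest. msnbot arrives from many different IP addresses, but the visits are uncoordinated and show no teamwork at all. Microsoft is well funded and may have no reason to run its robots economically, but it is a nuisance for me.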

Google shows consideration for server load, for example by not accessing a large number of files at once^*. I wish the other robots would follow Google's example.

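Note*: https://support.google.com/webmasters/answer/182072
Although it is not mentioned on that page, googlebot receives a page only when it has changed since the previous crawl. It does this by sending the HTTP/1.1 request header "If-Modified-Since: xxxx". (Surprisingly, other search engines make little use of this feature.)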
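
The server side of this exchange can be sketched as follows: compare the date in If-Modified-Since with the file's modification time and answer 304 (Not Modified) when nothing has changed. This is a generic illustration, not the code of my server; the function names conditional_status and http_date are hypothetical.

```python
from email.utils import formatdate, parsedate_to_datetime

def http_date(mtime):
    """Format a Unix time as an HTTP date, e.g. for a Last-Modified header."""
    return formatdate(mtime, usegmt=True)

def conditional_status(if_modified_since, file_mtime):
    """Return 304 if the file is unchanged since the header date, else 200."""
    if not if_modified_since:
        return 200                # no header: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since).timestamp()
    except (TypeError, ValueError):
        return 200                # unparsable header: send the full page
    # HTTP dates have one-second resolution, so compare whole seconds.
    return 304 if int(file_mtime) <= int(since) else 200
```

A 304 response carries no body, so a crawler that sends the header costs the server almost nothing for unchanged pages.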

An SEO measure?

2013/11/02

Is Google's sitemap generator effective as an SEO measure?
It has nothing to do with it!

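Many articles on my server land on the first page of search results, yet I do no SEO whatsoever. The originality of an article is the biggest factor in raising its hit rate. Articles without originality are, in the end, just litter on the net, and it is only natural that a good search engine tries to push them down to the lowest rank.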