Scripts for crawling the website.
This lands a script that acts as a simple web crawler.
It'll crawl http://www.chromium.org and recursively fetch
all of the content on the site, rewriting any URLs to point
to the local copies of the pages.
The script relies on a number of hard-coded values:
- the list of URLs to use as starting points (in paths_to_crawl.txt)
- a list of URLs to skip (because they don't exist)
- alternate ways of writing equivalent URLs (e.g. /foo and
  /a/chromium.org/dev/foo)
so it is not a general-purpose crawler, but it would not be
too hard to generalize it into one.
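
To illustrate how a skip list and a table of equivalent URL spellings
might be applied during a crawl, here is a minimal sketch. It is not the
code in this change: the names `ALTERNATE_PREFIXES`, `URLS_TO_SKIP`,
`normalize_path`, and `should_crawl`, as well as the sample values, are
hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; the real script hard-codes its own.
ALTERNATE_PREFIXES = ('/a/chromium.org/dev',)
URLS_TO_SKIP = {'/no-such-page'}


def normalize_path(url):
    """Map equivalent spellings of a URL to one canonical path."""
    path = urlsplit(url).path or '/'
    for alt in ALTERNATE_PREFIXES:
        # Strip the alternate prefix so /a/chromium.org/dev/foo becomes /foo.
        if path == alt or path.startswith(alt + '/'):
            path = path[len(alt):] or '/'
            break
    return path


def should_crawl(url, seen):
    """Return True if the URL is neither skipped nor already visited."""
    path = normalize_path(url)
    if path in URLS_TO_SKIP or path in seen:
        return False
    seen.add(path)
    return True
```

Normalizing before checking the `seen` set means a page reachable under
two spellings is only fetched once.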
It also supports a `--prefix` option that rewrites all of the URLs
so that they start at an offset from the root directory of the
web server, e.g., instead of fetching /Home from ./home, it'll fetch
it from $prefix/home. This is useful if you want to embed the
crawled content as part of a larger site.
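
As a rough sketch of the kind of rewriting a prefix implies, the helper
below moves site-absolute paths under a prefix and leaves everything else
alone. The function name `add_prefix` and its exact behavior are
assumptions made for illustration, not the script's actual implementation.

```python
def add_prefix(path, prefix=''):
    """Rewrite a site-absolute path so that it lives under `prefix`.

    Hypothetical helper: relative paths and protocol-relative URLs
    (those starting with '//') are returned unchanged.
    """
    prefix = prefix.strip('/')
    if not prefix or not path.startswith('/') or path.startswith('//'):
        return path
    return '/' + prefix + path
```

For example, `add_prefix('/home', 'old')` yields `/old/home`, while an
empty prefix returns the path untouched.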
In the context of the chromium.org migration, this crawler and the
prefix option will allow us to nest the original content inside
the new content, in order to be able to easily compare the two.
Bug: 1267643
Change-Id: I9b917f4a1dba795bcda8286611e1d631d7d27518
Reviewed-on: https://chromium-review.googlesource.com/c/website/+/3266881
Commit-Queue: Dirk Pranke <dpranke@google.com>
Auto-Submit: Dirk Pranke <dpranke@google.com>
Reviewed-by: Struan Shrimpton <sshrimp@google.com>
diff --git a/scripts/common.py b/scripts/common.py
index f5c8e3d..a911f15 100644
--- a/scripts/common.py
+++ b/scripts/common.py
@@ -83,7 +83,7 @@
return page + '/index' + ext
if os.path.exists(top + page):
return page
- if os.path.exists(top + page + ext):
+ if ext and os.path.exists(top + page + ext):
return page + ext
return page