author | Yury Melnichek <melnichek@gmail.com> | 2012-09-17 14:13:25 +0400 |
---|---|---|
committer | Alex Zolotarev <alex@maps.me> | 2015-09-23 01:43:34 +0300 |
commit | f8b8a13a870024f5a20e0efdfa9c12239fa1cad9 | |
tree | 26c8bd4dbc380e5aadb40a9e5069297c42498e54 /crawler | |
parent | f8d90e92ce791650dc89944fca009fc36d9e3a90 |
[crawler] Download full wikitravel images, not thumbnails.
Diffstat (limited to 'crawler')
-rwxr-xr-x | crawler/normalize-image-urls.sh | 4 |
-rwxr-xr-x | crawler/wikitravel-crawler.sh | 4 |
2 files changed, 7 insertions, 1 deletions
diff --git a/crawler/normalize-image-urls.sh b/crawler/normalize-image-urls.sh
new file mode 100755
index 0000000000..ee045b1df6
--- /dev/null
+++ b/crawler/normalize-image-urls.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+set -e -u -x
+
+cat $1 | sed 's:/thumb\(/.*\)/[0-9][0-9]*px-.*$:\1:' | sort -u > $2
diff --git a/crawler/wikitravel-crawler.sh b/crawler/wikitravel-crawler.sh
index 58fd1a2f3f..dee0e843a1 100755
--- a/crawler/wikitravel-crawler.sh
+++ b/crawler/wikitravel-crawler.sh
@@ -28,6 +28,8 @@ cat wikitravel-pages.json | python $MY_PATH/wikitravel-optimize-articles.py
 $MY_PATH/extract-image-urls.sh wikitravel-images.urls
-wget --wait=1 --no-clobber -i wikitravel-images.urls
+$MY_PATH/normalize-image-urls.sh wikitravel-images.urls wikitravel-images-normalized.url
+
+wget --wait=1 --random-wait --no-clobber -i wikitravel-images-normalized.urls
 # TODO: Run publisher.
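
To illustrate what the `sed` expression in `normalize-image-urls.sh` does, here is a minimal sketch that applies it to a few made-up URLs. The `normalize_urls` function name and the `example.org` URLs are assumptions for illustration only. They are not part of the commit, and the real wikitravel URL layout may differ.

```shell
#!/bin/bash
set -e -u

# Hypothetical wrapper around the sed expression from the commit.
# It drops the "/thumb" path component and the trailing "/<N>px-<name>"
# segment, then removes duplicate lines.
normalize_urls() {
  sed 's:/thumb\(/.*\)/[0-9][0-9]*px-.*$:\1:' | sort -u
}

# Made-up sample input: two thumbnail sizes of one image, plus one URL
# that is already a full-size image and should pass through unchanged.
printf '%s\n' \
  'http://example.org/upload/shared/thumb/a/ab/Photo.jpg/200px-Photo.jpg' \
  'http://example.org/upload/shared/thumb/a/ab/Photo.jpg/400px-Photo.jpg' \
  'http://example.org/upload/shared/c/cd/Map.png' \
  | normalize_urls
```

Both thumbnail URLs collapse to the same full-size path, which is why the script pipes through `sort -u` before the list reaches `wget`. Note that the second hunk writes the list to `wikitravel-images-normalized.url` but passes `wikitravel-images-normalized.urls` to `wget`. Keeping the file name in a single shell variable would presumably avoid that kind of mismatch.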