author | Yury Melnichek <melnichek@gmail.com> | 2012-05-28 16:31:07 +0400 |
---|---|---|
committer | Alex Zolotarev <alex@maps.me> | 2015-09-23 01:40:14 +0300 |
commit | 157cbfe78c67b6188b21cfd28cf414292c10d37c | |
tree | 9541a9d66f7401f19a3c0f0fdf723297e6e0198e /crawler | |
parent | 37c71bed8dd1a961aaea2ae8a3b57483aaeac76d |
[guide] Extract images from html files.
Diffstat (limited to 'crawler')
-rwxr-xr-x | crawler/extract-image-urls.sh | 8 |
-rwxr-xr-x | crawler/wikitravel-crawler.sh | 2 |
2 files changed, 10 insertions, 0 deletions
diff --git a/crawler/extract-image-urls.sh b/crawler/extract-image-urls.sh
new file mode 100755
index 0000000000..e63d10085d
--- /dev/null
+++ b/crawler/extract-image-urls.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -e -u -x
+
+grep --ignore-case --only-matching --no-filename '<img[^/]*src=\"[^">]*"' *.opt \
+  | sed 's/<img.*src="//g' \
+  | sed 's/"$//g' \
+  | sort -u \
+  > $1
diff --git a/crawler/wikitravel-crawler.sh b/crawler/wikitravel-crawler.sh
index d6e8406bd3..7f3eb802fe 100755
--- a/crawler/wikitravel-crawler.sh
+++ b/crawler/wikitravel-crawler.sh
@@ -26,4 +26,6 @@
 cat wikitravel-pages.json | python $MY_PATH/wikitravel-process-articles.py
 cat wikitravel-pages.json | python $MY_PATH/wikitravel-optimize-articles.py
 
+$MY_PATH/extract-image-urls.sh wikitravel-images.urls
+
 # TODO: Run publisher.
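
To see what the new script's pipeline produces, here is a minimal sketch that runs the same grep/sed/sort stages against a hand-written sample file. The scratch directory, the file name `sample.opt`, the `example.com` URLs, and the output name `images.urls` are all assumptions made for illustration; none of them come from the commit.

```shell
#!/bin/bash
set -e -u

# Hypothetical scratch directory and sample page (not part of the commit).
workdir=$(mktemp -d)
cd "$workdir"

cat > sample.opt <<'EOF'
<p>Intro</p><img alt="map" src="http://example.com/a.png">
<img src="http://example.com/b.jpg" width="10">
<img src="http://example.com/a.png">
EOF

# Pipeline modeled on extract-image-urls.sh. The backslash before the quote
# is dropped here because it is not needed inside single quotes.
#   grep: print only the '<img ... src="URL"' part of each matching line
#   sed:  strip everything up to and including src=", then the closing quote
#   sort: order the URLs and drop duplicates
grep --ignore-case --only-matching --no-filename '<img[^/]*src="[^">]*"' *.opt \
  | sed 's/<img.*src="//' \
  | sed 's/"$//' \
  | sort -u \
  > images.urls

cat images.urls
```

With this sample, `images.urls` ends up holding two lines, `http://example.com/a.png` and `http://example.com/b.jpg`, because the repeated `a.png` reference is collapsed by `sort -u`. Two properties of the pattern are worth keeping in mind: `[^/]*` means a tag with a `/` anywhere before `src` is skipped, and the `sed` stage is case-sensitive while `grep` is not, so an uppercase `<IMG` tag would be matched but its prefix would not be stripped.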