[Two stars] Programming task, any language: grab the images from a web page

oneleaf · #1

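1 Task: using any language or script, given a web page URL, grab all the images on that page and save them locally.

2 Difficulty: two stars

3 Purpose: learn programming on Linux

4 Software involved: any language

5 Estimated time: 1 to 6 hours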
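
The task comes down to three steps: fetch the page, find the image URLs, and save each image. As a rough sketch of the last step only, in Python 3 with the standard library: the names `fetch`, `save_images` and `fetcher`, and the fallback file name `image`, are made up for illustration and are not part of the task description.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch(url):
    """Download url and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def save_images(image_urls, out_dir, fetcher=fetch):
    """Save each image under out_dir and return the list of written paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for url in image_urls:
        # Name the file after the last path segment of the URL.
        name = posixpath.basename(urlsplit(url).path) or "image"
        path = os.path.join(out_dir, name)
        # Image data is binary, so the file is opened in "wb" mode.
        with open(path, "wb") as f:
            f.write(fetcher(url))
        written.append(path)
    return written
```

Passing the download function in as `fetcher` keeps the saving logic testable without network access; calling `save_images(urls, "pics")` with the default would download for real.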

tenzu · #2

I've done this with wget...

qkbeyond · #3

man wget says: Very Advanced Usage

* If you wish Wget to keep a mirror of a page (or ftp subdirectories), use ‘--mirror’ (‘-m’), which is the shorthand for ‘-r -l inf -N’. You can put Wget in the crontab file asking it to recheck a site each Sunday:

crontab
0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog

* In addition to the above, you want the links to be converted for local viewing. But, after having read this manual, you know that link conversion doesn't play well with timestamping, so you also want Wget to back up the original html files before the conversion. Wget invocation would look like this:

wget --mirror --convert-links --backup-converted \
http://www.gnu.org/ -o /home/me/weeklog

* But you've also noticed that local viewing doesn't work all that well when html files are saved under extensions other than ‘.html’, perhaps because they were served as index.cgi. So you'd like Wget to rename all the files served with content-type ‘text/html’ or ‘application/xhtml+xml’ to name.html.

wget --mirror --convert-links --backup-converted \
--html-extension -o /home/me/weeklog \
http://www.gnu.org/

Or, with less typing:

wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog

Does the part in red count?

oneleaf · #4

No. Only the images referenced by the single page are needed; there is no need to mirror the whole site.
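
A minimal Python 3 sketch of that restriction, assuming only the standard library: it parses the HTML of one page and collects the `src` of each `<img>` tag, without following any links. The names `ImgCollector` and `image_sources` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Only <img> tags are of interest; <a href> links are never followed,
        # so nothing beyond this one page is touched.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html):
    """Return the img src values found in html, in document order."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = """
<html><body>
  <a href="other.html">another page</a>
  <img alt="logo" src="images/logo.png">
  <img src='/static/banner.gif' />
</body></html>
"""
print(image_sources(sample))
```

Unlike a pattern such as `img src="..."`, the parser also handles single-quoted attributes and tags where `src` is not the first attribute.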

速腾1994 · #5

xiooli · #6

Code:

#!/bin/bash

# Name:     pic_catch.sh
# Author:   xiooli <xioooli[at]yahoo.com.cn>
# Site:     http://joolix.com
# Licence:  GPLv3
# Version:  100130

url="$1"
# Directory part of the page URL and its parent, used to resolve ./ and ../
# (dirname would turn "http://host/" into "http:")
url_cwd="${url%/*}"
url_fd="${url_cwd%/*}"
html="$(wget "$url" -q -O-)"
mkdir -p ./pics

echo "Downloading..."
echo "$html"|grep -o "img src=\"[^\"]*\"" \
|awk -F"\"" '{print $2}'| while read -r line; do
	# Only rewrite a leading ../ or ./, not one in the middle of the path
	full_url="$(sed "s|^\.\./|$url_fd/|" <<< "$line")"
	full_url="$(sed "s|^\./|$url_cwd/|" <<< "$full_url")"
	# No "://" means a relative path: prefix the page's directory
	[ "${full_url/:\/\/}" = "$full_url" ] && full_url="$url_cwd/$full_url"
	wget "$full_url" -O ./pics/"$(basename "$full_url")"
done && echo "Downloading pictures finished."

tusooa · #7

The one from the post above isn't very general.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sgmllib
import sys
import pycurl
import StringIO
import os

def download(fileName):
    buf = StringIO.StringIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, fileName)
    curl.setopt(pycurl.WRITEFUNCTION, buf.write)
    curl.setopt(pycurl.FOLLOWLOCATION, 1)
    curl.setopt(pycurl.MAXREDIRS, 5)
    curl.perform()
    return buf.getvalue()

def getTopDirname(name):
    oldName = name
    dirname = os.path.dirname(oldName)
    while dirname != 'http:':
        oldName = dirname
        dirname = os.path.dirname(oldName)
    return oldName

def getFullName(fileName, dirname, net):
    if fileName.startswith('/') and net:
        fullName = os.path.join(getTopDirname(dirname), fileName[1:])
    else:
        fullName = os.path.join(dirname, fileName)
    return fullName

def getDownloadDir():
    dwdir = os.getenv('DOWNLOAD_DIR')
    if not dwdir:
        dwdir = os.path.join(os.getenv('HOME'), '个人/下载/网页图片')
    return dwdir

def createDownloadDir():
    dwdir = getDownloadDir()
    if not os.path.exists(dwdir):
        os.makedirs(dwdir)

def downloadPicture(fileName, dirname):
    fullName = getFullName(fileName, dirname, True)
    # Images are binary data, so write in binary mode; reuse download()
    f = open(os.path.join(getDownloadDir(), os.path.basename(fileName)), 'wb')
    f.write(download(fullName))
    f.close()

class imgSrcLister(sgmllib.SGMLParser):
    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.picture_uris = []
    
    def start_img(self, attrs):
        for key, value in attrs:
            if key == 'src':
                if not value in self.picture_uris:
                    self.picture_uris.append(value)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        fileName = sys.argv[1]
    else:
        fileName = raw_input('Input the file name: ')
    if fileName.startswith('http://'):
        c = download(fileName)
    else:
        f = open(fileName, 'r')
        c = f.read()
        f.close()
    
    lister = imgSrcLister()
    lister.feed(c)
    if fileName.startswith('http://'):
        net = True
        dirname = os.path.dirname(fileName)
        if dirname == 'http:':
            dirname = fileName
    else:
        net = False
        dirname = os.path.dirname(fileName)
    
    if '-p' in sys.argv or '--print' in sys.argv or not net:
        for item in lister.picture_uris:
            print getFullName(item, dirname, net)
    else:
        createDownloadDir()
        for item in lister.picture_uris:
            downloadPicture(item, dirname)

Result:

Code:

[tlcr: 0] [01/02/2010 17:24:14] [tusooa@tusooa-laptop] [~]
>> cat /tmp/test.htm
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/1999/xhtml">
<html>
        <head>
                <title>test</title>
        </head>
        <body>
                <p>
                        <img alt='test pic' src='/usr/share/icons/default.kde4/16x16/mimetypes/application-pdf.png' />
                </p>
        </body>
</html>
[tlcr: 0] [01/02/2010 17:24:41] [tusooa@tusooa-laptop] [~]
>> get_html_pictures /tmp/test.htm
/usr/share/icons/default.kde4/16x16/mimetypes/application-pdf.png
[tlcr: 0] [01/02/2010 17:25:10] [tusooa@tusooa-laptop] [~]
>> cat /tmp/test.htm
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/1999/xhtml">
<html>
        <head>
                <title>test</title>
        </head>
        <body>
                <p>
                        <img alt='test pic' src='usr/share/icons/default.kde4/16x16/mimetypes/application-pdf.png' />
                </p>
        </body>
</html>
[tlcr: 0] [01/02/2010 17:25:12] [tusooa@tusooa-laptop] [~]
>> get_html_pictures /tmp/test.htm
/tmp/usr/share/icons/default.kde4/16x16/mimetypes/application-pdf.png
[tlcr: 0] [01/02/2010 17:26:35] [tusooa@tusooa-laptop] [~]
>> get_html_pictures http://forum.ubuntu.org.cn/ -p
http://forum.ubuntu.org.cn/./styles/UbuntuCN/theme/images/icon_mini_login.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/theme/images/icon_mini_register.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/theme/images/icon_mini_faq.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/imageset/forum_read.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/imageset/icon_topic_latest.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/imageset/forum_read_subforum.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/theme/images/whosonline.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/imageset/forum_unread.gif
http://forum.ubuntu.org.cn/./styles/UbuntuCN/imageset/forum_read_locked.gif

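There is no --help.
The first argument is always the path.
The second (optional) argument can be -p or --print, which means print only and do not download. A local file is also print-only.
Environment variable: DOWNLOAD_DIR sets the download directory. The default is $HOME/个人/下载/网页图片

I added a check to avoid duplicates.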
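
The duplicate check in the script compares the raw `src` strings, so the same image written two different ways would still be listed twice, and paths keep segments such as `/./`. As a sketch of one way around that, in Python 3 (the script above is Python 2): `urljoin` resolves each `src` against the page URL first, and duplicates are dropped afterwards. The name `resolve_unique`, the page name `index.php` and the `example.com` URL are assumptions for illustration.

```python
from urllib.parse import urljoin


def resolve_unique(page_url, sources):
    """Resolve each src against page_url and drop duplicates, keeping order."""
    resolved = (urljoin(page_url, src) for src in sources)
    # dict keys keep insertion order, so this removes repeats without sorting.
    return list(dict.fromkeys(resolved))


page = "http://forum.ubuntu.org.cn/index.php"
srcs = [
    "./styles/UbuntuCN/imageset/forum_read.gif",
    "styles/UbuntuCN/imageset/forum_read.gif",  # same file, written differently
    "/styles/UbuntuCN/theme/images/whosonline.gif",
    "http://example.com/a.png",
]
for url in resolve_unique(page, srcs):
    print(url)
```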

roylez · #8

Did this purely out of boredom.

Code:

#!/usr/bin/env ruby
# coding: utf-8
#Author: Roy L Zuo (roylzuo at gmail dot com)
require 'open-uri'
require 'hpricot'

def parse_img_list(url)
    u = URI.parse(url)
    p = Hpricot.parse(open(url).read)
    img = (p/"img").collect {|v| u.merge(v.attributes['src']).to_s}.uniq.sort
end

def buster_img_list(*list)
    list.collect {|i| 
        # write, not puts: puts may append a newline and corrupt binary image data
        Thread.new { open(i.split("/").last,'wb') {|f| f.write open(i).read } } 
    }.each {|t| t.join}
end

if __FILE__==$0
    require 'optparse'
    options = {}
    parser = OptionParser.new { |opts|
        opts.banner = "Usage: #{$0} [-l] URL"
        options[:list_only] = false
        opts.on('-l','--list-only','显示图片列表') { options[:list_only] = true }
    }
    parser.parse!
    exit unless ARGV[0]
    l = parse_img_list ARGV[0]
    options[:list_only] ? l.each{|u| puts u } : (buster_img_list *l )
end

Code:

roylez@Lancelot> ruby imgbuster.rb -h
Usage: imgbuster.rb [-l] URL
    -l, --list-only                  显示图片列表
roylez@Lancelot> ruby imgbuster.rb -l http://forum.ubuntu.org.cn
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read_locked.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_read_subforum.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/forum_unread.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/imageset/icon_topic_latest.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_faq.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_login.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/icon_mini_register.gif
http://forum.ubuntu.org.cn/styles/UbuntuCN/theme/images/whosonline.gif

Stupid kid · #9

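Here is my Perl version. A newbie making some noise; experts, please give me some pointers, heh...

Update: there is a problem. It seems to run in the background, and when it finishes the shell prompt does not come back. It is probably related to the system call.

I only wrote a few quick lines. It handles a single page only, and the images end up in the current directory.

Code:

cat getImage.pl

Code: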

#!/usr/bin/perl -w

use strict;

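my $url = $ARGV[0];
die "Usage: $0 URL\n" unless defined $url;
# Scheme and host with a trailing slash, used as the prefix for relative image paths
my ($url_base) = $url =~ m{^((?:http://)?[^/]*/)};
die "Cannot determine the base URL of $url\n" unless defined $url_base;

# system returns 0 on success, so compare with 0 instead of using ||
system('/usr/bin/wget', '-O/tmp/index.html', '-q', $url) == 0
    or die "Cannot get the $url page: $?";

open(FILE, '<', "/tmp/index.html") || die "Cannot open index.html file: $!";

while (<FILE>) {
   # /g picks up every matching src on the line, not just the first
   while (/src=\"([^? ]+\.\w{3,})\"/gi) {
	
	my $img = $1;
	$img = $url_base . $img unless $img =~ /^http/;
	
	# A failed image is reported but does not stop the remaining downloads
	system('/usr/bin/wget', '-c',  '-T1', '-t1', $img) == 0
	    or warn "Cannot get the $img image: $?\n";
	#print "$img\n";
   }
}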

close(FILE);

Example:

Code:

perl getImage.pl 'http://tianxiamm.com/viewthread.php?tid=57326&extra=page%3D1'

Tianxia MM, lots of pretty girls

awper361 · #10

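Experts, I'd like to learn general-purpose programming, something like sockets in C++, rather than programs that can only grab web pages on Linux. Can you recommend any good tutorials or books?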

tenzu · #11

The avatar of the poster above...

jyf1987 · #12

Since there is already a Python version, I won't embarrass myself with mine.

xep007 · #13

How come the Python script is the longest?

oneleaf · #14

OK, here is a simple Python one.

Code:

import urllib, urlparse, re
u = 'http://forum.ubuntu.org.cn/'
html = urllib.urlopen(u).read()
# Capture just the src value of each img tag
li = re.findall(r'img src="(.*?)"', html, re.S)
for item in li:
    print item
    # urljoin resolves relative paths and leaves absolute URLs untouched
    urllib.urlretrieve(urlparse.urljoin(u, item), item.split('/')[-1])

xep007 · #15

pocoyo wrote:

oneleaf wrote: OK, here is a simple Python one.


Thanks, boss

Thanks, boss. But it fails when the URL looks like the following one.
http://image.baidu.com/i?ct=201326592&c ... %C4%B8&s=0
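
A likely reason, judging from the quoted script: it builds the image URL by appending to the page URL as a plain string, which lands after the `?query` part, and it takes the local file name from everything after the last `/`, query string included. A rough Python 3 sketch of handling both, using only the standard library; the name `local_name`, the fallback `image` and the `example.com` URLs are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def local_name(image_url, fallback="image"):
    """Pick a file name from the URL path only, ignoring ?query and #fragment."""
    path = urlsplit(image_url).path
    name = posixpath.basename(unquote(path))
    return name or fallback


# A page URL that carries a query string, like the one in the post above.
page = "http://example.com/i?ct=1&word=a%2Fb"
# urljoin resolves against the page's path and drops the page's query string.
img = urljoin(page, "thumb/cat.jpg?size=small")
print(img)
print(local_name(img))
```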
