Modified a crawler: htcap

Published: June 24, 2017 // Categories: dev notes, code study, python // No Comments

Switching htcap's database to MySQL

I've been lazy and procrastinating; this should have been finished two months ago. Prodded by a friend, I hurried up and got it done. The crawler in question is htcap, whose main strength is that it is built on PhantomJS. It originally stored its data in SQLite3. Since the crawler itself is quite capable, I switched its database to MySQL and added a few features along the way.

The new features, roughly:

0. Prevent repeated DNS lookups (in-process DNS cache) done

1. Switch htcap's database to MySQL done

2. Filter out common analytics and share-widget domains done

3. Recognize common static-file extensions done

4. URL gathering: on top of the existing robots.txt parsing, add directory brute-forcing and search-engine harvesting, and recognize directories that cannot be accessed done

5. Cut the sqlmap and Arachni scanning features. done

6. Add page-information fingerprinting.

7. Add exact-duplicate and similarity-based deduplication.

Some details, item by item.

0x00. Preventing repeated DNS lookups (in-process DNS cache)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import socket  
import requests
_dnscache={}  
def _setDNSCache():  
    """ 
    Makes a cached version of socket._getaddrinfo to avoid subsequent DNS requests. 
    """  

    def _getaddrinfo(*args, **kwargs):  
        global _dnscache  
        if args in _dnscache:  
            print str(args)+" in cache"  
            return _dnscache[args]  

        else:  
            print str(args)+" not in cache"    
            _dnscache[args] = socket._getaddrinfo(*args, **kwargs)  
            return _dnscache[args]  

    if not hasattr(socket, '_getaddrinfo'):  
        socket._getaddrinfo = socket.getaddrinfo  
        socket.getaddrinfo = _getaddrinfo  



def test(url):  
    _setDNSCache()  
    import time
    start = time.time()
    requests.get(url)
    end = time.time()
    print "first \t"+str(end-start)
    requests.get(url)
    sect = time.time()
    print "second \t"+str(sect - end)
    requests.get(url)
    print "third \t"+str(time.time() - sect)


if __name__ == '__main__':  
    url = "http://testphp.vulnweb.com"
    test(url)  

The results are fairly satisfying:

('0cx.cc', 80, 0, 1) not in cache
first   1.11360812187
('0cx.cc', 80, 0, 1) in cache
second  0.933846950531
('0cx.cc', 80, 0, 1) in cache
third   0.733961105347

But it is not a win every time:

('github.com', 443, 0, 1) not in cache
first   1.77086091042
('github.com', 443, 0, 1) in cache
second  2.0764131546
('github.com', 443, 0, 1) in cache
third   3.75542092323

This approach works, but it has two drawbacks:
1. It only patches socket.getaddrinfo; socket.gethostbyname and socket.gethostbyname_ex still take the old path (a sketch of closing that gap follows).
2. It only affects this process, whereas editing /etc/hosts affects every program, including ping.
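
A minimal sketch of addressing the first drawback, assuming the same memoize-then-patch pattern also works for socket.gethostbyname (this helper is mine, not part of htcap):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Hypothetical companion to _setDNSCache: memoize socket.gethostbyname
# with the same save-the-original-then-patch trick.
import socket

_hostcache = {}

def _set_gethostbyname_cache():
    def _gethostbyname(host):
        if host not in _hostcache:
            _hostcache[host] = socket._gethostbyname(host)
        return _hostcache[host]

    if not hasattr(socket, '_gethostbyname'):
        socket._gethostbyname = socket.gethostbyname
        socket.gethostbyname = _gethostbyname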

0x01. Switching htcap's database to MySQL

vi core/lib/DB_config.py
    # database settings
    'host' : 'localhost',
    'user' : 'root',
    'port' : '3306',
    'password' : 'mysqlroot',
    'db' : 'w3a_scan',
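
For illustration, a minimal sketch of turning that config into a connection, assuming MySQLdb is the driver (the get_connection helper is mine, not htcap_mysql's actual code):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch: open a MySQL connection from the config above
# (assumes the MySQLdb driver; the helper name is hypothetical).
import MySQLdb

DB = {
    'host' : 'localhost',
    'user' : 'root',
    'port' : '3306',
    'password' : 'mysqlroot',
    'db' : 'w3a_scan',
}

def get_connection():
    return MySQLdb.connect(host=DB['host'], user=DB['user'],
                           port=int(DB['port']),  # MySQLdb expects an int port
                           passwd=DB['password'], db=DB['db'],
                           charset='utf8')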

0x02. Filtering common analytics and sharing domains

    def urlfilter(self):
        IGNORE_DOMAIN = [   # hard-filtered domains: big portals plus common analytics/sharing sites
            'taobao.com','51.la','weibo.com','qq.com','t.cn','baidu.com','gravatar.com','cnzz.com','51yes.com',
            'google-analytics.com','tanx.com','360.cn','yy.com','163.com','263.com','eqxiu.com','gnu.org','github.com',
            'facebook.com','twitter.com','google.com'
        ]

... and so on (a sketch of the check follows).
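
A minimal sketch of how a blacklist like this can be applied, matching a URL's host against the listed domains and their subdomains (the is_ignored helper is hypothetical, not the actual htcap_mysql code):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch: check a URL's host against IGNORE_DOMAIN, including subdomains.
import urlparse

def is_ignored(url, ignore_domains):
    host = urlparse.urlparse(url)[1].lower().split(':')[0]  # netloc minus port
    for domain in ignore_domains:
        # match "domain.com" itself and any subdomain of it
        if host == domain or host.endswith('.' + domain):
            return True
    return False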

Test site: testphp.vulnweb.com

Results:

1. No harvesting, no deduplication:

. . initialized, crawl started with 10 threads
[=================================]   66 of 66 pages processed in 8 minutes
Crawl finished, 66 pages analyzed in 8 minutes


2. No harvesting, but with deduplication:

3. With harvesting and deduplication:

. initialized, crawl started with 10 threads
[=================================]   108 of 108 pages processed in 43 minutes
Crawl finished, 108 pages analyzed in 43 minutes

PS:
Pages before vs. after harvesting: 66 vs. 108.
Pages before vs. after deduplication: 85 vs. 65.

The only real drawback is that PhantomJS is a serious memory hog.

https://github.com/0xa-saline/htcap_mysql

PS: since the author of PhantomJS no longer maintains the project, this fork probably won't see further updates either.

Also, I make no promises about fixing bugs this crawler causes.

Combining a crawler with sqlmapapi to detect injection

Published: September 5, 2015 // Categories: dev notes, code study, python, daily life // No Comments

Lately I've been tinkering with crawling and scanning, with the scanning handed off to sqlmapapi. Documentation is sparse, but some exists:

"Batch scanning practice with sqlmapapi.py" http://drops.wooyun.org/tips/6653

Here is the wrapper class it builds around sqlmapapi:

#!/usr/bin/python
#-*-coding:utf-8-*-
import requests
import time
import json


class AutoSqli(object):

    """
    Interact with the server that sqlmapapi starts, using sqlmapapi's methods.

    By Manning
    """

    def __init__(self, server='', target='',data = '',referer = '',cookie = ''):
        super(AutoSqli, self).__init__()
        self.server = server
        if self.server[-1] != '/':
            self.server = self.server + '/'
        self.target = target
        self.taskid = ''
        self.engineid = ''
        self.status = ''
        self.data = data
        self.referer = referer
        self.cookie = cookie
        self.start_time = time.time()

    # create a new scan task
    def task_new(self):
        self.taskid = json.loads(
            requests.get(self.server + 'task/new').text)['taskid']
        print 'Created new task: ' + self.taskid
        # we get a taskid and drive everything else with it
        if len(self.taskid) > 0:
            return True
        return False

    # delete a scan task
    def task_delete(self):
        if json.loads(requests.get(self.server + 'task/' + self.taskid + '/delete').text)['success']:
            print '[%s] Deleted task' % (self.taskid)
            return True
        return False

    # start the scan task
    def scan_start(self):
        headers = {'Content-Type': 'application/json'}
        # the address to scan
        payload = {'url': self.target}
        url = self.server + 'scan/' + self.taskid + '/start'
        # e.g. http://127.0.0.1:8557/scan/xxxxxxxxxx/start
        t = json.loads(
            requests.post(url, data=json.dumps(payload), headers=headers).text)
        self.engineid = t['engineid']
        if len(str(self.engineid)) > 0 and t['success']:
            print 'Started scan'
            return True
        return False

    # scan task status
    def scan_status(self):
        self.status = json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/status').text)['status']
        if self.status == 'running':
            return 'running'
        elif self.status == 'terminated':
            return 'terminated'
        else:
            return 'error'

    # fetch the scan results
    def scan_data(self):
        self.data = json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/data').text)['data']
        if len(self.data) == 0:
            print 'not injection:\t' + self.target
        else:
            print 'injection:\t' + self.target

    # scan options; the key part is setting the right parameters
    def option_set(self):
        headers = {'Content-Type': 'application/json'}
        option = {"options": {
                    "smart": True,
                    ...
                    }
                 }
        url = self.server + 'option/' + self.taskid + '/set'
        t = json.loads(
            requests.post(url, data=json.dumps(option), headers=headers).text)
        print t

    # stop the scan task
    def scan_stop(self):
        json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/stop').text)['success']

    # kill the scan engine process
    def scan_kill(self):
        json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/kill').text)['success']

    def run(self):
        if not self.task_new():
            return False
        self.option_set()
        if not self.scan_start():
            return False
        while True:
            if self.scan_status() == 'running':
                time.sleep(10)
            elif self.scan_status() == 'terminated':
                break
            else:
                break
            print time.time() - self.start_time
            if time.time() - self.start_time > 3000:
                error = True
                self.scan_stop()
                self.scan_kill()
                break
        self.scan_data()
        self.task_delete()
        print time.time() - self.start_time

if __name__ == '__main__':
    t = AutoSqli('http://127.0.0.1:8774', 'http://192.168.3.171/1.php?id=1')
    t.run()

Its workflow (a raw-requests sketch follows the list):

    a GET to task/new creates a task and returns its taskid
    a POST to option/<taskid>/set sets the scan options
    a POST to scan/<taskid>/start starts scanning the given url
    a GET to scan/<taskid>/status polls the task status
    a GET to scan/<taskid>/data fetches the test results
    a GET to task/<taskid>/delete removes the task
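
The same sequence can be exercised with requests alone. A minimal sketch, assuming sqlmapapi is listening on 127.0.0.1:8775 and using a throwaway target (both values are examples, not from the original post):

#!/usr/bin/python
#-*-coding:utf-8-*-
# Sketch: drive sqlmapapi endpoint by endpoint.
import json
import requests

server = 'http://127.0.0.1:8775/'
taskid = requests.get(server + 'task/new').json()['taskid']
requests.post(server + 'scan/' + taskid + '/start',
              data=json.dumps({'url': 'http://testphp.vulnweb.com/artists.php?artist=1'}),
              headers={'Content-Type': 'application/json'})
print requests.get(server + 'scan/' + taskid + '/status').json()['status']
print requests.get(server + 'scan/' + taskid + '/data').json()['data']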

Looking into the Server class in lib/utils/api.py, you can see that interaction with the service happens by submitting data to the server. The handlers fall into three groups:

    Users' methods
    Admin functions
    sqlmap core interact functions

The endpoints that accept data are as follows.

Users' methods

    @get("/task/new")
    @get("/task/<taskid>/delete")

Admin functions

    @get("/admin/<taskid>/list")
    @get("/admin/<taskid>/flush")

Core interact functions

    @get("/option/<taskid>/list")
    @post("/option/<taskid>/get")
    @post("/option/<taskid>/set")
    @post("/scan/<taskid>/start")
    @get("/scan/<taskid>/stop")
    @get("/scan/<taskid>/kill")
    @get("/scan/<taskid>/status")
    @get("/scan/<taskid>/data")
    @get("/scan/<taskid>/log/<start>/<end>")
    @get("/scan/<taskid>/log")
    @get("/download/<taskid>/<target>/<filename:path>")
Finally, as to whether there is an injection vulnerability, the code decides it like this: if the data field in the returned dict is non-empty, the target is injectable.

Then I took the spider module from https://github.com/smarttang/w3a_Scan_Console/blob/master/module/sprider_module.py and integrated it with a few tweaks:

#!/usr/bin/python
# vim: set fileencoding=utf-8:

import sys
import urllib2
import re
from BeautifulSoup import BeautifulSoup

import autosql

class SpriderUrl:
    # initialization
    def __init__(self,url):
        self.url=url
        #self.con=Db_Connector('sprider.ini')

# get the first batch of URLs from the target
    def get_self(self):
        urls=[]
        try:
            body_text=urllib2.urlopen(self.url).read()
        except:
            print "[*] Web Get Error: checking the Url"
            return
        soup=BeautifulSoup(body_text)
        links=soup.findAll('a')
        for link in links:
            # we have a candidate url, but it still needs handling:
            _url=link.get('href')
            # skip None values and meaningless prefixes,
            # and skip static-file suffixes we don't crawl
            if _url is None or re.match('^(javascript|:;|#)',_url) or re.search('\.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf|txt|db)$',_url):
                continue
            # for http|https links, stay on the same site; no crawling beyond it
            if re.match('^(http|https)',_url):
                if not re.match('^'+self.url,_url):
                    continue
                else:
                    urls.append(_url)
            else:
                urls.append(self.url+_url)
        rst=list(set(urls))
        for rurl in rst:
            try:
                self.sprider_self_all(rurl)
                # recurse; the obvious flaw is that every page gets re-crawled
                # repeatedly. Each url is then handed off to autosql:
                # AutoSqli('http://127.0.0.1:8775', rurl).run()
            except:
                print "spider error"

    def sprider_self_all(self,domain):
        urls=[]
        try:
            body_text=urllib2.urlopen(domain).read()
        except:
            print "[*] Web Get Error: checking the Url"
            sys.exit(0)
        soup=BeautifulSoup(body_text)
        links=soup.findAll('a')
        for link in links:
            # we have a candidate url, but it still needs handling:
            _url=link.get('href')
            # skip None values and meaningless prefixes,
            # and skip static-file suffixes we don't crawl
            try:
                if _url is None or re.match('^(javascript|:;|#)',str(_url)) or re.search('\.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf|txt|db)$',str(_url)):
                    continue
            except TypeError:
                print "[*] Type is Error! :"+str(_url)
                continue
            # for http|https links, stay on the same site; no crawling beyond it
            if re.match('^(http|https)',_url):
                if not re.match('^'+self.url,_url):
                    continue
                else:
                    urls.append(_url)
            else:
                urls.append(self.url+_url)
        res=list(set(urls))
        for rurl in res:
            try:
                print rurl
                #AutoSqli('http://127.0.0.1:8775', rurl).run()
            except:
                print "spider error"

spi="http://0day5.com/"
t=SpriderUrl(spi)
# first pass
t.get_self()

The best approach is still to store the URLs in a database and check for duplicates:

        for rurl in res:
            if self.con.find_item("select * from url_sprider where url='"+rurl+"' and domain='"+self.url+"'"):
                continue
            else:
                try:
                    self.con.insert_item("insert into url_sprider(url,tag,domain)values('"+rurl+"',0,'"+self.url+"')")
                except:
                    print "[*] insert error!"
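
Since those statements are built by string concatenation, the spider itself is injectable. A parameterized sketch, assuming a MySQLdb-style connection rather than the Db_Connector wrapper above (save_url is a hypothetical helper):

# Sketch: the same check-then-insert, with bound parameters instead of
# string concatenation (assumes a MySQLdb-style connection object).
def save_url(conn, rurl, domain):
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM url_sprider WHERE url=%s AND domain=%s",
                (rurl, domain))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO url_sprider(url,tag,domain) VALUES (%s,0,%s)",
                    (rurl, domain))
        conn.commit()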

 

 

I'm still collecting notes on this crawler:

1. Many crawlers have an obvious fingerprint; the countermeasure is to set a matching User-Agent.

2. Some WAFs can be bypassed via the Referer, e.g. by pretending the visit comes from Baidu.

Improved it a bit (a usage sketch follows the snippet):

    USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    REFERERS = [
    "https://www.baidu.com",
    "http://www.baidu.com",
    "https://www.google.com.hk",
    "http://www.so.com",
    "http://www.sogou.com",
    "http://www.soso.com",
    "http://www.bing.com",
    ]

    default_cookies = {}
    # pick a random User-Agent and Referer
    default_headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Cache-Control': 'max-age=0',
        'referer': random.choice(REFERERS),
        'Accept-Charset': 'GBK,utf-8;q=0.7,*;q=0.3',
    }
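
A minimal usage sketch (imports and the build_headers helper are mine; rebuilding the headers per request re-randomizes the fingerprint, whereas the dict above is randomized only once at definition):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch: attach the randomized headers to a request (requests assumed;
# USER_AGENTS and REFERERS are the lists defined above).
import random
import requests

def build_headers():
    # rebuild on every call so each request gets a fresh UA/referer pair
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'referer': random.choice(REFERERS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }

resp = requests.get('http://testphp.vulnweb.com/', headers=build_headers())
print resp.status_code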

Then there is the filtering problem again, namely similarity checking, so that an effective crawl yields more precise results. The algorithm relies on decomposing a URL and hashing the decomposition, which fits this kind of requirement well. A URL is split into three dimensions: the first is the netloc, the second is the length of each path segment, and the third is the sorted list of query parameter names. A data structure combines these three dimensions into a hashable object. [From: http://drops.wooyun.org/tips/5462]

#!/usr/bin/env python
# coding:utf-8
import time
import os
import urlparse
import hashlib
import sys
sys.path.append("..")

from config.config import *
reload(sys) 
sys.setdefaultencoding("utf-8") 

SIMILAR_SET = set()
REPEAT_SET = set()

'''
2015.3.30
Be clear, for a crawler, about what focusing and filtering mean:
focus:  return True if the keyword is in the url, else False
filter: return False if the keyword is in the url, else True
'''


def format(url):
    '''
    The strategy is to build a triple:
    the first item is the url's netloc;
    the second is the length of each path segment;
    the third is the name of each query parameter (sorted alphabetically,
    so the same parameters in a different order are not counted as distinct)
    '''
    if urlparse.urlparse(url)[2] == '':
        url = url+'/'

    url_structure = urlparse.urlparse(url)
    netloc = url_structure[1]
    path = url_structure[2]
    query = url_structure[4]
    
    temp = (netloc,tuple([len(i) for i in path.split('/')]),tuple(sorted([i.split('=')[0] for i in query.split('&')])))
    #print temp
    return temp
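
# Worked example (mine, not from the original post):
#   format('http://example.com/shop/item.php?id=1&cat=2')
#   -> ('example.com', (0, 4, 8), ('cat', 'id'))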


def check_netloc_is_ip(netloc):
    '''
    return True if the url's netloc is in IP form,
    otherwise return False
    '''
    flag =0
    t = netloc.split('.')
    for i in t:
        try:
            int(i)
            flag += 1
        except Exception, e:
            break
    if flag == 4:
        return True
    
    return False
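
# Examples (mine): check_netloc_is_ip('192.168.3.171')   -> True
#                  check_netloc_is_ip('www.gznu.edu.cn') -> False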

def url_domain_control(url,keyword):
    '''
    URL domain control (focus)

    True: the url passes the domain check
    False: it does not

    1. keyword may be a list or a str
    2. if the url's netloc is in IP form, return True
    '''
    t = format(url)
    if check_netloc_is_ip(t[0]):
        return True

    elif isinstance(keyword, list):
        for i in keyword:
            if i.lower() in t[0].lower():
                return True

    elif isinstance(keyword, str):
        if keyword.lower() in t[0].lower():
            return True
    return False

def url_domain_control_ignore(url,keyword):
    '''
    URL domain control (filter)

    True: no ignore keyword is in the url's netloc
    False: an ignore keyword is in it

    e.g. if 'blog' is ignored and the netloc contains 'blog',
    return False
    '''
    t = format(url)
    for i in keyword:
        if i in t[0].lower():
            return False
    return True

def url_similar_control(url):
    '''
    URL similarity control

    True: the url's structure has not been seen before
    False: the url is a structural duplicate
    '''
    t = format(url)
    if t not in SIMILAR_SET:
        SIMILAR_SET.add(t)
        return True
    return False
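
# Example (mine): these two URLs share the same triple, so the second
# call returns False and the url is dropped:
#   url_similar_control('http://a.com/item.php?id=1')  -> True
#   url_similar_control('http://a.com/item.php?id=2')  -> False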


def url_format_control(url):
    '''
    URL format control (filter)

    True: the url passes the format checks
    False: it does not
    '''

    if '}' not in url and '404' not in url and url[0].lower() == 'h' and '/////' not in url and len(format(url)[1]) < 6:
        if len(format(url)[2]) > 0:
            for i in format(url)[2]:
                if len(i) > 20:
                    return False
        if 'viewthread' in url or 'forumdisplay' in url:
            return False
        return True
    return False

def url_custom_control(url):
    '''
    URL custom keyword control (filter)
    True: no custom keyword in the url
    False: a custom keyword matched
    '''
    for i in CUSTOM_KEY:
        if i in url:
            return False
    return True

def url_custom_focus_control(url,focuskey):
    '''
    URL custom keyword control (focus)
    True: matches the focus policy (or no focus keys given)
    False: otherwise
    '''
    if len(focuskey) == 0:
        return True
    for i in focuskey:
        if i in url:
            return True
    return False

def url_repeat_control(url):
    '''
    URL repeat control

    True: the url has not been seen before (exact match)
    False: the url is a repeat
    '''
    if url not in REPEAT_SET:
        REPEAT_SET.add(url)
        return True
    return False

def url_filter_similarity(url,keyword,ignore_keyword,focuskey):
    if url_format_control(url) and url_similar_control(url) \
                and url_domain_control(url,keyword) and url_domain_control_ignore(url,IGNORE_KEY_WORD) \
                    and url_custom_control(url) and url_custom_focus_control(url,focuskey):
        return True
    else:
        return False
def url_filter_no_similarity(url,keyword,ignore_keyword,focuskey):
    if url_format_control(url) and url_repeat_control(url) \
                and url_domain_control(url,keyword) and url_domain_control_ignore(url,IGNORE_KEY_WORD) \
                    and url_custom_control(url) and url_custom_focus_control(url,focuskey):
        return True
    else:
        return False


if __name__ == "__main__": 
    print url_format_control("http://www.gznu.edu.cn")

A bugscan crawler in Python

Published: April 26, 2015 // Categories: code study, linux, python, windows // No Comments

Lately I've been submitting plugins to bugscan and noticing the gap between me and the pros. I wanted to pull things home to study at my own pace, so the idea arose to fetch every exploit and take a look.

First, let's capture a request:

POST /rest HTTP/1.1
Host: www.bugscan.net
Connection: keep-alive
Content-Length: 45
Origin: https://www.bugscan.net
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Content-Type: application/json;charset=UTF-8
Referer: https://www.bugscan.net/
Accept-Encoding: gzip, deflate
Accept: application/json, text/plain, */*
Accept-Language: zh-CN,zh;q=0.8,es;q=0.6,fr;q=0.4,vi;q=0.2
Cookie: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 
{"method":"GetPluginDetail","params":[579]}

You can see the data is submitted as JSON.

Then we replay this packet with Burp.

The submission clearly succeeds. After trimming it down bit by bit, it turns out this minimal request also works:

POST /rest HTTP/1.1
Host: www.bugscan.net
Content-Type: application/json;charset=UTF-8
Referer: https://www.bugscan.net/
Cookie: xxxxxxxxxxxxxxxxxxxxxx
Content-Length: 45
 
{"method":"GetPluginDetail","params":[579]}

So it is now clear what we need to send: host, url, referer, content-type, cookie, and the POST body.

Since the payload is JSON, we import the json module, and we use urllib2 to fetch the page.

urllib2 is Python's component for fetching URLs (Uniform Resource Locators).

It offers a very simple interface in the form of the urlopen function:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
 
import json   
import urllib2
 
url = 'https://www.bugscan.net/rest' # the endpoint that receives the parameters
headers = { 'Referer':'https://www.bugscan.net/', # spoofed referer
            'Content-Type': 'application/json;charset=UTF-8', # submission type
            'cookie':'xxxxxxxxxxxxxxxxx'  # your own cookie
        }
data = {"method":"GetPluginDetail","params":[579]}
postdata = json.dumps(data)
 
req = urllib2.Request(url, postdata, headers)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

We now successfully fetch the content.

Next, pick out the pieces one by one: the source, the file name, and so on.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
 
import json   
import urllib2
 
url = 'https://www.bugscan.net/rest' # the endpoint that receives the parameters
headers = { 'Referer':'https://www.bugscan.net/', # spoofed referer
            'Content-Type': 'application/json;charset=UTF-8', # submission type
            'cookie':'xxxxxxxxxxxxxxxxxxxxxxxxxxx' # your own cookie
        }
data = {"method":"GetPluginDetail","params":[579]}
postdata = json.dumps(data)
 
req = urllib2.Request(url, postdata, headers)
response = urllib2.urlopen(req)
html_page = response.read()
json_html = json.loads(html_page) # decode the raw JSON
source = json_html['result']['plugin']['source'] # the source code we want
name = json_html['result']['plugin']['fname'] # the submitted file name
f = open('f:/bugscan/'+name,'w')
f.write(source)
f.close()

Then do what we came for:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
 
import json   
import urllib2
 
url = 'https://www.bugscan.net/rest'
headers = { 'Referer':'https://www.bugscan.net/',
            'Content-Type': 'application/json;charset=UTF-8',
            'cookie':'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        }
for i in range(1, 579+1): # only 579 plugins at the time of writing
    datas = {"method":"GetPluginDetail","params":[i]}
    postdata = json.dumps(datas)
    req = urllib2.Request(url, postdata, headers)
    response = urllib2.urlopen(req)
    html_page = response.read()
    json_html = json.loads(html_page) # decode the raw JSON
    source = json_html['result']['plugin']['source'] # the source code we want
    name = json_html['result']['plugin']['fname'] # the submitted file name
    f = open('f:/bugscan/'+str(i)+name,'w') # str(i) avoids int+str TypeError
    f.write(source)
    f.close()

Next change: user Zero's plugins cannot be viewed, so detect that case when fetching and simply skip writing them:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
 
import json   
import urllib2
 
url = 'https://www.bugscan.net/rest'
headers = { 'Referer':'https://www.bugscan.net/',
            'Content-Type': 'application/json;charset=UTF-8',
            'cookie':'xxxxxxxxxxxxxxxxxxxxxx'
        }
for i in range(1, 579+1):
    datas = {"method":"GetPluginDetail","params":[i]}
    postdata = json.dumps(datas)
    req = urllib2.Request(url, postdata, headers)
    response = urllib2.urlopen(req)
    html_page = response.read()
    if html_page.find('未登录或者是加密插件') == -1: # "not logged in or encrypted plugin" means unreadable
        json_html = json.loads(html_page) # decode the raw JSON
        source = json_html['result']['plugin']['source'] # the source code we want
        name = json_html['result']['plugin']['fname'] # the submitted file name
        f = open('f:/bugscan/'+str(i)+name,'w') # str(i) avoids int+str TypeError
        f.write(source)
        f.close()

Update 2015.8.15:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
  
import json   
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
  
url = 'https://www.bugscan.net/rest'
# custom http headers
headers = { 'Referer':'https://www.bugscan.net/',
            'Content-Type': 'application/json;charset=UTF-8',
            'cookie':'here is cookie which is yours'
        }

for i in range(39, 1341+1):
    # the request body the API expects
    data = {"method":"GetPluginDetail","params":[i]}
    # serialize it as JSON
    postdata = json.dumps(data)
    req = urllib2.Request(url, postdata, headers)
    response = urllib2.urlopen(req)
    html_page = response.read()
    json_html = json.loads(html_page)
    # 'sql: no rows in result set' in the response
    # means this one cannot be read
    if html_page.find("sql: no rows in result set") == -1: 
        # if a "source": field shows up, extract it
        if html_page.find('"source":') != -1: 
            #print i
            # grab the source
            source = json_html['result']['plugin']['source']
            # grab fname, the submitted file name
            name = json_html['result']['plugin']['fname']
            # prefix with i since duplicate names occur
            f = open('bugscan/'+bytes(i)+'_'+name,'w')
            f.write(source)
            f.close()

Update 2016/1/16: the old links now all redirect to q.bugscan.net, so q.bugscan.net gets crawled separately:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re

# custom http headers
qian = re.compile(r'<pre><code>([\s\S]*?)</code></pre>')
headers = { 'Referer':'http://q.bugscan.net/',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19',
            'cookie':'cookie',}
url = 'http://q.bugscan.net/t/1945'
response = requests.get(url,headers=headers)
resu = qian.findall(response.content.replace('&#34;', '"'))
# look inside the captured string, not for exact list membership
if len(resu) > 0 and "def assign(service, arg):" in resu[0]:
    content = resu[0].replace("&#39;", "'")
    print content

Then I turned it into a multithreaded version:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#Author: rookie
import requests,re
import time
import thread

def scan(i):
    print i
    qian = re.compile(r'<pre><code>([\s\S]*?)</code></pre>')
    headers = { 'Referer':'http://q.bugscan.net/',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19',
            'cookie':'CNZZDATA1256927980=200621908-1450077676-http%253A%252F%252Fq.bugscan.net%252F%7C1450106517; gsid=a8971c8ce53bc573809080eb34e7b03d88ed0dcfb733f1c48aceeeb1b6c8c574; _xsrf=2|1f20d955|cd1c41b6601fd0225ea5c8641af3028c|1451976836',}
    url = 'http://q.bugscan.net/t/'+str(i)
    response = requests.get(url,headers=headers)
    resu = qian.findall(response.content.replace('&#34;', '"'))
    # look inside the captured string, not for exact list membership
    if len(resu) > 0 and "def assign(service, arg):" in resu[0]:
        content = resu[0].replace("&#39;", "'")
        content = content.replace('&gt;','>')
        content = content.replace('&lt;','<')
        content = content.replace('&amp;','&')
        f = open('bugscan/'+bytes(i)+".py",'w')
        f.write(content)
        f.close()

def find_id():
    for i in range(1,2187+1):
        thread.start_new_thread(scan, (i,))
        time.sleep(3)

if __name__ == "__main__":
    find_id()
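
For reference, a bounded worker-pool variant avoids spawning one thread per id plus a fixed sleep. A sketch using the stdlib threading and Queue modules, wrapping the scan() function above (this is my addition, not from the original post):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch: run scan() from a small fixed pool of workers instead of
# thread.start_new_thread per id (threading/Queue are Python 2 stdlib).
import threading
import Queue

def worker(q):
    while True:
        try:
            i = q.get_nowait()
        except Queue.Empty:
            return
        scan(i)  # the scan() defined in the post above

def run_pool(last_id, nthreads=10):
    q = Queue.Queue()
    for i in range(1, last_id + 1):
        q.put(i)
    workers = [threading.Thread(target=worker, args=(q,)) for _ in range(nthreads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

if __name__ == "__main__":
    run_pool(2187)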