Crawling Sogou

Key parameters: SUID, SNUID, SUV

SUID

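The exact meaning of SUID is easy to look up; this section only covers how it is generated. When we visit the sogou search home page, the response sets an SUID value via Set-Cookie. Unless the browser is restarted, the SUID does not change in the short term. The value appears to be assigned on the sogou server side, and it is only refreshed when a new session is started.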

SNUID

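SNUID is the core of sogou's anti-crawler mechanism. sogou limits the number of requests made with the same SNUID; once the limit is exceeded, requests are redirected to a CAPTCHA page, and only after the CAPTCHA has been entered and verified is the SNUID refreshed so that access can continue. So how is the SNUID generated? Testing suggests that it is produced by JavaScript, and that an SUID must already exist: the SUID is the basis for generating the SNUID.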
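
Because the limit is tied to the SNUID, a crawler can keep several SNUID values on hand and move to a fresh one before any single value is used too often. The following is a minimal Python 3 sketch of that rotation. The class name `SnuidPool` and the per-SNUID budget (`max_uses`) are assumptions made for illustration; the actual threshold enforced by sogou is not stated here.

```python
from collections import deque


class SnuidPool:
    """Hand out SNUID values, retiring each one after a fixed number of uses."""

    def __init__(self, max_uses=20):
        # max_uses is a guess; the real limit enforced by sogou is unknown.
        self.max_uses = max_uses
        self._pool = deque()   # fresh, unused SNUID values
        self._current = None
        self._uses = 0

    def add(self, snuid):
        self._pool.append(snuid)

    def get(self):
        """Return the SNUID for the next request, or None if none are left."""
        if self._current is None or self._uses >= self.max_uses:
            if not self._pool:
                self._current = None
                return None
            self._current = self._pool.popleft()
            self._uses = 0
        self._uses += 1
        return self._current

    def discard_current(self):
        """Drop the current SNUID, e.g. after a redirect to the CAPTCHA page."""
        self._current = None


pool = SnuidPool(max_uses=2)
pool.add("AAAA")
pool.add("BBBB")
print([pool.get() for _ in range(5)])
```

When `get()` returns None, the crawler has to obtain new SNUID values before it continues.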

SUV

The SUV value is generated by JavaScript. In testing it showed no effect on the anti-crawler behaviour, so this article does not cover it in detail.

sct

The visit count.

ld

ld changes on every request, but content is still returned even when its value is wrong.

Symptoms of being blocked

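As usual, before working around the anti-crawler mechanism, let us look at what happens when it is triggered. Once the visit count (sct) for a single SNUID gets too high, further requests to sogou are redirected to a CAPTCHA page.
The redirect URL, followed by its decoded form: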
http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe%26ie%3dutf8%26_ast%3d1488957312%26_asf%3dnull%26w%3d01029901%26p%3d40040100%26dp%3d1%26cid%3d%26cid%3d%26sut%3d578%26sst0%3d1488957299160%26lkt%3d3%2C1488957298718%2C1488957298893
http://www.sogou.com/antispider/?from=/web?query=152512wqe&ie=utf8&_ast=1488957312&_asf=null&w=01029901&p=40040100&dp=1&cid=&cid=&sut=578&sst0=1488957299160&lkt=3,1488957298718,1488957298893
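
A crawler can recognise this redirect from the final URL and recover the request that was blocked. The sketch below, in Python 3 with only the standard library, checks for the `/antispider` path and decodes the `from` parameter seen in the URL above. The function name `blocked_target` is hypothetical, and the sample is a shortened version of the captured URL.

```python
from urllib.parse import urlsplit, parse_qs


def blocked_target(url):
    """Return the originally requested path if url is an anti-spider redirect, else None."""
    parts = urlsplit(url)
    if not parts.path.startswith("/antispider"):
        return None
    # parse_qs percent-decodes the value, so `from` comes back as a readable path.
    values = parse_qs(parts.query).get("from")
    return values[0] if values else None


redirect = ("http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe"
            "%26ie%3dutf8%26_ast%3d1488957312")
print(blocked_target(redirect))
print(blocked_target("https://www.sogou.com/web?query=test"))
```

The first call prints the decoded search path and the second prints None, because that URL is an ordinary results page.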

Obtaining SUID

When fetching https://www.sogou.com/, the SUID can be found in the response headers, as shown below:


-----Request start-----https://www.sogou.com/
Response status: HTTP/1.1 200 OK
Server    nginx
Date    Fri, 10 Nov 2017 08:33:45 GMT
Content-Type    text/html; charset=UTF-8
Transfer-Encoding    chunked
Connection    keep-alive
Vary    Accept-Encoding
Set-Cookie    ABTEST=0|1510302825|v17; expires=Sun, 10-Dec-17 08:33:45 GMT; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    IPLOC=CN1100; expires=Sat, 10-Nov-18 08:33:45 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    SUID=62869B271810990A000000005A056469; expires=Thu, 05-Nov-37 08:33:45 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Cache-Control    max-age=0
Content-Language    zh-CN
Set-Cookie    black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires    Fri, 10 Nov 2017 08:33:45 GMT
-----Request end-----


-----Request start-----https://www.sogou.com/
Response status: HTTP/1.1 200 OK
Server    nginx
Date    Fri, 10 Nov 2017 08:36:43 GMT
Content-Type    text/html; charset=UTF-8
Transfer-Encoding    chunked
Connection    keep-alive
Vary    Accept-Encoding
Set-Cookie    ABTEST=4|1510303003|v17; expires=Sun, 10-Dec-17 08:36:43 GMT; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    IPLOC=CN1100; expires=Sat, 10-Nov-18 08:36:43 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    SUID=62869B271810990A000000005A05651B; expires=Thu, 05-Nov-37 08:36:43 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Cache-Control    max-age=0
Content-Language    zh-CN
Set-Cookie    black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires    Fri, 10 Nov 2017 08:36:43 GMT
-----Request end-----


-----Request start-----https://www.sogou.com/
Response status: HTTP/1.1 200 OK
Server    nginx
Date    Mon, 13 Nov 2017 02:30:45 GMT
Content-Type    text/html; charset=UTF-8
Transfer-Encoding    chunked
Connection    keep-alive
Vary    Accept-Encoding
Set-Cookie    ABTEST=0|1510540245|v17; expires=Wed, 13-Dec-17 02:30:45 GMT; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    IPLOC=CN1100; expires=Tue, 13-Nov-18 02:30:45 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    SUID=62869B271810990A000000005A0903D5; expires=Sun, 08-Nov-37 02:30:45 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Cache-Control    max-age=0
Content-Language    zh-CN
Set-Cookie    black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires    Mon, 13 Nov 2017 02:30:45 GMT
-----Request end-----
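
One pattern is visible in the three responses above: the last 8 hex digits of each SUID, read as a hexadecimal integer, equal the Unix timestamp carried in the ABTEST cookie of the same response. This is only an observation from these captures, not documented behaviour, and the function name `suid_timestamp` is hypothetical. A short Python 3 check:

```python
from datetime import datetime, timezone


def suid_timestamp(suid):
    """Read the last 8 hex digits of an SUID as a Unix timestamp (observed pattern only)."""
    return int(suid[-8:], 16)


# SUID values and ABTEST timestamps copied from the three responses above.
samples = {
    "62869B271810990A000000005A056469": 1510302825,  # ABTEST=0|1510302825|v17
    "62869B271810990A000000005A05651B": 1510303003,  # ABTEST=4|1510303003|v17
    "62869B271810990A000000005A0903D5": 1510540245,  # ABTEST=0|1510540245|v17
}

for suid, abtest_ts in samples.items():
    ts = suid_timestamp(suid)
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(suid, ts, ts == abtest_ts, when.isoformat())
```

If the pattern holds in general, the tail of an SUID records when it was issued, which is consistent with the value changing on each new session.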

Obtaining SNUID

HTML source of the page on which sogou regenerates the SNUID

The steps involved in regenerating the SNUID can be seen in the source:

<!DOCTYPE HTML>
<html>
<head>
    <meta charset="utf-8">
    <link rel="shortcut icon" href="//www.sogou.com/images/logo2014/new/favicon.ico" type="image/x-icon">
    <title>搜狗搜索</title>
    <link rel="stylesheet" href="static/css/anti.min.css?v=1"/>
    <script src="//dlweb.sogoucdn.com/common/lib/jquery/jquery-1.11.0.min.js"></script>
    <script src="static/js/antispider.min.js?v=2"></script>
    <script>
        var domain = getDomain();
        window.imgCode = -1;

        (function() {
            function checkSNUID() {
                var cookieArr = document.cookie.split('; '),
                    count = 0;

                for(var i = 0, len = cookieArr.length; i < len; i++) {
                    if (cookieArr[i].indexOf('SNUID=') > -1) {
                        count++;
                    }
                }

                return count > 1;
            }

            if(checkSNUID()) {
                var date = new Date(), expires;
                date.setTime(date.getTime() -100000);

                expires = date.toGMTString();

                document.cookie = 'SNUID=1;path=/;expires=' + expires;
                document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.www.sogou.com';
                document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.weixin.sogou.com';
                document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.sogou.com';
                document.cookie = 'SNUID=1;path=/;expires=' + expires + ';domain=.snapshot.sogoucdn.com';

                sendLog('delSNUID');
            }

            if(getCookie('seccodeRight') === 'success') {
                sendLog('verifyLoop');

                setCookie('seccodeRight', 1, getUTCString(-1), location.hostname, '/');
            }

            if(getCookie('refresh')) {
                sendLog('refresh');
            }
        })();

        function setImgCode(code) {
            try {
                var t = new Date().getTime() - imgRequestTime.getTime();
                sendLog('imgCost',"cost="+t);
            } catch (e) {
            }
            window.imgCode = code;
        }
        sendLog('index');

        function changeImg2() {
        	if(window.event) {
        		window.event.returnValue=false
        	}
        }
    </script>
</head>
<body>
<div class="header">
    <div class="logo"><a href="/"><img width="180" height="60" src="//www.sogou.com/images/logo2014/error180x60.png"></a></div>
    <div class="other"><span class="s1">您的访问出错了</span><span class="s2"><a href="/">返回首页&gt;&gt;</a></span></div>
</div>
<div class="content-box">
    <p class="ip-time-p">IP:xxx.xxx.xxx.xxx<br>访问时间:2017.11.10 16:16:01</p>
    <p class="p2">用户您好,您的访问过于频繁,为确认本次访问为正常用户行为,需要您协助验证。</p>
    <p class="p3"><label for="seccodeInput">验证码:</label></p>
    <form name="authform" method="POST" id="seccodeForm" action="/">
        <p class="p4">
            <input type=text name="c" value="" placeholder="请输入验证码" id="seccodeInput">
            <input type="hidden" name="tc" id="tc" value="">
            <input type="hidden" name="r" id="from" value="%2Fweb%3Fquery%3D152512wqe" >
            <input type="hidden" name="m" value="0" >            <span class="s1">
                <script>imgRequestTime=new Date();</script>
                <a onclick="changeImg2();" href="javascript:void(0)">
                    <img id="seccodeImage" onload="setImgCode(1)" onerror="setImgCode(0)" src="util/seccode.php?tc=1510301761" width="100" height="40" alt="请输入图中的验证码" title="请输入图中的验证码">
                </a>
            </span>
            <a href="javascript:void(0);" id="change-img" onclick="changeImg2();" style="padding-left:50px;">换一张</a>
            <span class="s2" id="error-tips" style="display: none;"></span>
        </p>
    </form>
    <p class="p5">
        <a href="javascript:void(0);" id="submit">提交</a>
        <span>提交后没解决问题?欢迎<a href="http://fankui.help.sogou.com/index.php/web/web/index?type=10&anti_time=1510301761&domain=www.sogou.com" target="_blank">反馈</a>。</span>
    </p>
</div>
<div id="ft"><a href="http://fuwu.sogou.com/" target="_blank">企业推广</a><a href="http://corp.sogou.com/" target="_blank">关于搜狗</a><a href="/docs/terms.htm?v=1" target="_blank">免责声明</a><a href="http://fankui.help.sogou.com/index.php/web/web/index?type=10&anti_time=1510301761&domain=www.sogou.com" target="_blank">意见反馈</a><br>&nbsp;&copy;&nbsp;2017<span id="footer-year"></span>&nbsp;Sogou Inc.&nbsp;-&nbsp;<a href="http://www.miibeian.gov.cn" target="_blank" class="g">京ICP证050897号</a>&nbsp;-&nbsp;京公网安备1100<span class="ba">00000025号</span></div>
<script src="static/js/index.min.js?v=0.1.4"></script>
</body>
</html><!--zly-->
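
A crawler needs to recognise this page when it comes back in place of search results. The Python 3 sketch below uses the standard-library `html.parser` to look for the form with id `seccodeForm` and collect its input fields (`c`, `tc`, `r` and `m` in the page above). The class and function names are hypothetical, and the sample is a trimmed copy of the form markup.

```python
from html.parser import HTMLParser


class CaptchaFormParser(HTMLParser):
    """Collect the <input> fields of the form with id="seccodeForm"."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "seccodeForm":
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def captcha_fields(html):
    """Return the form fields if html is the verification page, else None."""
    parser = CaptchaFormParser()
    parser.feed(html)
    return parser.fields or None


sample = '''<form name="authform" method="POST" id="seccodeForm" action="/">
<input type=text name="c" value="" id="seccodeInput">
<input type="hidden" name="tc" id="tc" value="">
<input type="hidden" name="r" id="from" value="%2Fweb%3Fquery%3D152512wqe" >
<input type="hidden" name="m" value="0" >
</form>'''
print(captcha_fields(sample))
print(captcha_fields("<html><body>search results</body></html>"))
```

The `r` field holds the percent-encoded path of the blocked request, so it also tells the crawler which query to retry.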

Visiting a sogou URL and reading the SNUID from the response headers

-----Request start-----https://www.sogou.com/sogou?ori=%E6%B5%8B%E8%AF%95+&site=news.qq.com&query=%E6%B5%8B%E8%AF%95%E5%B7%A5%E5%85%B7&pid=sogou-wsse-b58ac8403eb9cf17-0004&idx=f&page=2&duppid=1&ie=utf8
Invalid cookie header: "set-cookie: ld=qyllllllll2zjStRlllllVoJVZGlllllJimpYkllll9lllllRylll5@@@@@@@@@@; path=/; expires=Wed, 13 Dec 2017 03:52:35 GMT; domain=.sogou.com". Invalid 'expires' attribute: Wed, 13 Dec 2017 03:52:35 GMT
Response status: HTTP/1.1 200 OK
Server    nginx
Date    Mon, 13 Nov 2017 03:52:35 GMT
Content-Type    text/html; charset=utf-8
Transfer-Encoding    chunked
Connection    keep-alive
Vary    Accept-Encoding
Set-Cookie    ABTEST=5|1510545155|v17; expires=Wed, 13-Dec-17 03:52:35 GMT; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    SNUID=0DF6EB4870752DA55C005265702D6FF2; expires=Tue, 13-Nov-18 03:52:35 GMT; domain=.sogou.com; path=/
Set-Cookie    IPLOC=CN1100; expires=Tue, 13-Nov-18 03:52:35 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
Set-Cookie    SUID=62869B27541C940A000000005A091703; expires=Sun, 08-Nov-37 03:52:35 GMT; domain=.sogou.com; path=/
P3P    CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
set-cookie    ld=qyllllllll2zjStRlllllVoJVZGlllllJimpYkllll9lllllRylll5@@@@@@@@@@; path=/; expires=Wed, 13 Dec 2017 03:52:35 GMT; domain=.sogou.com
Cache-Control    max-age=0
x_ad_pagesize    adpagesize=1059
Set-Cookie    black_passportid=1; domain=.sogou.com; path=/; expires=Thu, 01-Dec-1994 16:00:00 GMT
Expires    Mon, 13 Nov 2017 03:52:35 GMT
-----Request end-----

Note: the header names Set-Cookie and set-cookie differ in case.
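
HTTP header names are case-insensitive, so both spellings should be treated as the same header. The "Invalid cookie header" warning in the capture also shows that a strict cookie parser may reject the `expires` format used by the lowercase header. The Python 3 sketch below avoids both problems by comparing names in lowercase and reading only the `name=value` part of each cookie. The function name `cookies_from_headers` is hypothetical, and the sample data is copied from the response above.

```python
def cookies_from_headers(headers):
    """Build a {name: value} dict from (header_name, header_value) pairs.

    Header names are compared case-insensitively, and the attributes after
    the first ';' (expires, domain, path) are ignored rather than validated.
    """
    cookies = {}
    for name, value in headers:
        if name.lower() != "set-cookie":
            continue
        pair = value.split(";", 1)[0]
        key, sep, val = pair.partition("=")
        if sep:
            cookies[key.strip()] = val.strip()
    return cookies


captured = [
    ("Set-Cookie", "SNUID=0DF6EB4870752DA55C005265702D6FF2; "
                   "expires=Tue, 13-Nov-18 03:52:35 GMT; domain=.sogou.com; path=/"),
    ("set-cookie", "ld=qyllllllll2zjStRlllllVoJVZGlllllJimpYkllll9lllllRylll5@@@@@@@@@@; "
                   "path=/; expires=Wed, 13 Dec 2017 03:52:35 GMT; domain=.sogou.com"),
    ("Content-Type", "text/html; charset=utf-8"),
]
cookies = cookies_from_headers(captured)
print(cookies["SNUID"])
print(cookies["ld"])
```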

# -*- coding: utf-8 -*-
'''
Get the SNUID value
'''
import requests
import json
import time
import random

'''
Method 1: visit a sogou search results page with PhantomJS and read the SNUID value
'''
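def phantomjs_getsnuid():
    from selenium import webdriver
    # Raw string so that the backslashes in the Windows path are taken literally.
    d=webdriver.PhantomJS(r'D:\python27\Scripts\phantomjs.exe',service_args=['--load-images=no','--disk-cache=yes'])
    Snuid=""
    try:
        d.get("https://www.sogou.com/web?query=")
        # Look the cookie up by name; its position in the list is not guaranteed.
        for cookie in d.get_cookies():
            if cookie["name"]=="SNUID":
                Snuid=cookie["value"]
                break
    except Exception:
        Snuid=""
    finally:
        # Always shut the browser down, even if the page load failed.
        d.quit()
    return Snuid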
	
'''
Method 2: request a specific URL and read the id field from the response body
'''
def Method_one():
    url="http://www.sogou.com/antispider/detect.php?sn=E9DA81B7290B940A0000000058BFAB0&wdqz22=12&4c3kbr=12&ymqk4p=37&qhw71j=42&mfo5i5=7&3rqpqk=14&6p4tvk=27&eiac26=29&iozwml=44&urfya2=38&1bkeul=41&jugazb=31&qihm0q=8&lplrbr=10&wo65sp=11&2pev4x=23&4eyk88=16&q27tij=27&65l75p=40&fb3gwq=27&azt9t4=45&yeyqjo=47&kpyzva=31&haeihs=7&lw0u7o=33&tu49bk=42&f9c5r5=12&gooklm=11&_=1488956271683"
    headers={"Cookie":
    "ABTEST=0|1488956269|v17;\
    IPLOC=CN3301;\
    SUID=E9DA81B7290B940A0000000058BFAB6D;\
    PHPSESSID=rfrcqafv5v74hbgpt98ah20vf3;\
    SUIR=1488956269"
    }
    try:
        f=requests.get(url,headers=headers).content
        f=json.loads(f)
        Snuid=f["id"]
    except Exception:
        # Any network, JSON or missing-key error yields an empty result.
        Snuid=""
    return Snuid
	
'''
Method 3: request a specific URL and read the response headers
'''
def Method_two():
    url="https://www.sogou.com/web?query=333&_asf=www.sogou.com&_ast=1488955851&w=01019900&p=40040100&ie=utf8&from=index-nologin"
    headers={"Cookie":
    "ABTEST=0|1488956269|v17;\
    IPLOC=CN3301;\
    SUID=E9DA81B7290B940A0000000058BFAB6D;\
    PHPSESSID=rfrcqafv5v74hbgpt98ah20vf3;\
    SUIR=1488956269"
    }
    # The SNUID, if one is issued, is in the Set-Cookie response header.
    f=requests.head(url,headers=headers).headers
    print(f)
'''
Method 4: the SNUID can be obtained by submitting the CAPTCHA page that lifts the block
'''
def Method_three():
    '''
    CAPTCHA image URL: http://www.sogou.com/antispider/util/seccode.php?tc=1488958062
    '''
    '''
    http://www.sogou.com/antispider/?from=%2fweb%3Fquery%3d152512wqe%26ie%3dutf8%26_ast%3d1488957312%26_asf%3dnull%26w%3d01029901%26p%3d40040100%26dp%3d1%26cid%3d%26cid%3d%26sut%3d578%26sst0%3d1488957299160%26lkt%3d3%2C1488957298718%2C1488957298893
    Visit this URL and fill in the CAPTCHA. Submitting it sends the request below, and the SNUID can be read from the response.
    '''
    import socket
    import re
    res=r"id\"\: \"([^\"]*)\""
    # The head and the body are kept separate so that they can be joined with
    # CRLF line endings and the blank line HTTP requires between them.
    head='''POST http://www.sogou.com/antispider/thank.php HTTP/1.1
Host: www.sogou.com
Content-Length: 223
X-Requested-With: XMLHttpRequest
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: CXID=65B8AE6BEE1CE37D4C63855D92AF339C; SUV=006B71D7B781DAE95800816584135075; IPLOC=CN3301; pgv_pvi=3190912000; GOTO=Af12315; ABTEST=8|1488945458|v17; PHPSESSID=f78qomvob1fq1robqkduu7v7p3; SUIR=D0E3BB8E393F794B2B1B02733A162729; SNUID=B182D8EF595C126A7D67E4E359B12C38; sct=2; sst0=958; ld=AXrrGZllll2Ysfa1lllllVA@rLolllllHc4zfyllllYllllljllll5@@@@@@@@@@; browerV=3; osV=1; LSTMV=673%2C447; LCLKINT=6022; ad=6FwTnyllll2g@popQlSGTVA@7VCYx98tLueNukllll9llllljpJ62s@@@@@@@@@@; SUID=EADA81B7516C860A57B28911000DA424; successCount=1|Wed, 08 Mar 2017 07:51:18 GMT; seccodeErrorCount=1|Wed, 08 Mar 2017 07:51:45 GMT'''
    body='c=6exp2e&r=%252Fweb%253Fquery%253Djs%2B%25E6%25A0%25BC%25E5%25BC%258F%25E5%258C%2596%2526ie%253Dutf8%2526_ast%253D1488957312%2526_asf%253Dnull%2526w%253D01029901%2526p%253D40040100%2526dp%253D1%2526cid%253D%2526cid%253D&v=5'
    s=socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(('www.sogou.com',80))
    s.sendall((head.replace('\n','\r\n')+'\r\n\r\n'+body).encode('utf-8'))
    buf=s.recv(1024).decode('utf-8','ignore')
    s.close()
    p=re.compile(res)
    L=p.findall(buf)
    if len(L)>0:
        Snuid=L[0]
    else:
        Snuid=""
    return Snuid
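def getsnuid(q):
    while 1:
        if q.qsize()<10:
            # Pick one method at random and call only that one.
            Snuid=random.choice([Method_one,Method_three,phantomjs_getsnuid])()
            if Snuid!="":
                q.put(Snuid)
                print(Snuid)
        # Pause on every pass so the loop does not spin while the queue is full.
        time.sleep(0.5)
if __name__=="__main__":
    try:
        import Queue            # Python 2
    except ImportError:
        import queue as Queue   # Python 3
    q=Queue.Queue()
    getsnuid(q)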

SUV

Generated by JavaScript. The value already present in the cookies can be used directly, since it generally does not change.

References
[1] https://thief.one/2017/03/19/爬取搜索引擎之搜狗/
[2] https://blog.gaoqixhb.com/p/56e92e1e7b71cea107c700ba Notes on the anti-crawler measures in sogou WeChat account search