I am trying to unshorten a lot of URLs that I have in urlset. The following code works most of the time, but sometimes it takes a very long time to finish. For example, I have 2950 URLs in urlset. stderr tells me that 2900 are done, but geturlmapping does not finish.
    import sys
    import grequests

    def geturlmapping(urlset):
        # url mapping: original (short) URL -> final URL
        urlmapping = {}
        #rs = (grequests.get(u) for u in urlset)
        rs = (grequests.head(u) for u in urlset)
        res = grequests.imap(rs, size = 100)
        counter = 0
        for x in res:
            counter += 1
            if counter % 50 == 0:
                sys.stderr.write('doing %d url_mapping length %d \n' %(counter, len(urlmapping)))
            urlmapping[ getoriginalurl(x) ] = getgoalurl(x)
        return urlmapping

    def getgoalurl(resp):
        # URL the response finally came from
        url=''
        try:
            url = resp.url
        except:
            url = 'null'
        return url

    def getoriginalurl(resp):
        # First URL in the redirect history, or the response URL if there was no redirect
        url=''
        try:
            url = resp.history[0].url
        except IndexError:
            url = resp.url
        except:
            url = 'null'
        return url
It probably won't finish: a long time has passed and it is still running.
I was having issues with requests similar to the ones you are having. For me, the problem was that requests took ages to download some pages, while other software (browsers, curl, wget, Python's urllib) worked fine.

After a lot of wasted time, I noticed that the server was sending some invalid headers. For example, in one of the "slow" pages, after Content-Type: text/html it began to send headers in the form Header-Name : header-value (notice the space before the colon). This somehow breaks Python's email.header functionality, which is used to parse HTTP headers, so the Transfer-Encoding: chunked header that came after it wasn't being parsed and requests did not know the body was chunked.
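To see the header problem in isolation, here is a minimal sketch that uses only the standard library's email.parser, the same parser http.client relies on for response headers. The header block and the X-Broken name are made up for illustration; they are not taken from the real server. On the CPython versions I am aware of, the parser stops treating lines as headers at the first malformed one, so everything after it, including Transfer-Encoding, should be missing from the result.

```python
from email.parser import Parser

# Hypothetical header block: the second line has a space before the colon,
# which is not valid HTTP header syntax.
RAW_HEADERS = (
    "Content-Type: text/html\r\n"
    "X-Broken : oops\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
)


def parsed_header_names(raw_headers):
    """Return the lowercased names of the headers the email parser recognised."""
    message = Parser().parsestr(raw_headers)
    return [name.lower() for name in message.keys()]


if __name__ == "__main__":
    # Lines after the malformed header are treated as body text, not headers.
    print(parsed_header_names(RAW_HEADERS))
```

If Transfer-Encoding is absent from the printed list, that matches what I saw: the client never learns that the body is chunked.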
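Long story short: manually setting the chunked property of the response's raw object to True before asking for the content solved the issue. For example:

    response = requests.get('http://my-slow-url')
    print(response.text)

took ages, but

    # stream=True defers reading the body until .text is accessed,
    # so the flag is set before the content is fetched
    response = requests.get('http://my-slow-url', stream=True)
    response.raw.chunked = True
    print(response.text)

worked great!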
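If you need this for many URLs, the workaround can be wrapped in a small helper. This is only a sketch under a few assumptions: get_text_forcing_chunked is a name I made up, session is anything with a requests-style get() (for example requests.Session()), and the 10 second timeout is an arbitrary value added so that a stalled server cannot block the whole batch. Forcing chunked on a response that is not actually chunked will probably fail, so only use it for servers known to have this problem.

```python
def get_text_forcing_chunked(session, url, timeout=10.0):
    """Fetch url and decode the body as chunked, even if the header was lost.

    session: an object with a requests-style get(), e.g. requests.Session().
    timeout: seconds to wait on the socket (an assumed, arbitrary default).
    """
    # stream=True keeps the body unread until .text is accessed below.
    response = session.get(url, stream=True, timeout=timeout)
    try:
        # Tell urllib3 to read the body as chunked transfer encoding.
        response.raw.chunked = True
        return response.text
    finally:
        # Release the connection whether or not decoding succeeded.
        response.close()
```

Usage would look like get_text_forcing_chunked(requests.Session(), 'http://my-slow-url').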