I have some simple code:
    #!/usr/bin/python
    from bs4 import BeautifulSoup
    import requests
    import tldextract

    def scrap(url):
        main_domain = tldextract.extract(url)
        r = requests.get(url)
        data = r.text
        soup = BeautifulSoup(data)
        list = []
        for href in soup.find_all('a'):
            link_domain = tldextract.extract(href.get('href'))
            print link_domain

But I am getting this error:
    Traceback (most recent call last):
      File "cloud.py", line 20, in <module>
        scrap("--- url here -- ")
      File "cloud.py", line 14, in scrap
        link_domain = tldextract.extract(href.get('href'))
      File "/usr/lib/python2.6/site-packages/tldextract/tldextract.py", line 196, in extract
        return TLD_EXTRACTOR(url)
      File "/usr/lib/python2.6/site-packages/tldextract/tldextract.py", line 127, in __call__
        netloc = SCHEME_RE.sub("", url) \
    TypeError: expected string or buffer

How can I fix it?
Some of the a tags do not have an href attribute, so .get('href') returns None for them, and tldextract.extract() then fails because it expects a string.

Use:

    link_domain = tldextract.extract(href.get('href', ''))

to return an empty string in that case, or test for the attribute first, inside the loop:

    href = href.get('href')
    if not href:
        continue
    link_domain = tldextract.extract(href)
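The filtering step can also be pulled out into a small helper. The following is only a sketch: `iter_hrefs` and `sample_tags` are names made up for illustration, and plain dicts stand in for the a tags so that it runs without bs4 or tldextract installed. It relies only on the tags offering a dict-style .get(), as BeautifulSoup tags do.

```python
def iter_hrefs(tags):
    """Yield the href of each tag, skipping tags whose href is missing or empty.

    `tags` is any iterable of objects with a dict-style .get() method,
    for example the result of soup.find_all('a') in BeautifulSoup.
    """
    for tag in tags:
        href = tag.get('href')  # None when the attribute is absent
        if not href:            # skips both None and the empty string
            continue
        yield href


# Hypothetical stand-ins for <a> tags; real code would pass soup.find_all('a').
sample_tags = [
    {'href': 'http://example.com/page'},
    {'name': 'anchor-only'},  # no href attribute at all
    {'href': ''},             # href present but empty
    {'href': 'https://example.org/'},
]

hrefs = list(iter_hrefs(sample_tags))
print(hrefs)
```

In the original function the loop would then read `for href in iter_hrefs(soup.find_all('a')):`, with each href passed straight to tldextract.extract(). BeautifulSoup can also do part of this filtering itself: soup.find_all('a', href=True) returns only the tags that have an href attribute, though an empty href would still need the check above.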