Hello. There are two scripts in Python 3. The first script works without problems (its goal is to collect the necessary links from the page):

    import requests
    from bs4 import BeautifulSoup as bs
    import re
    from itertools import groupby

    r = requests.get('http://www.mmtitalia.it/directory_edile/rivenditori_macchine/ammann/index.htm')
    soup = bs(r.text, 'lxml')
    print(soup)
    link = soup.find('div', class_='elenco').find_all('a', href=re.compile('azienda'))
    links = [i.get('href') for i in link]
    new_links = [el for el, _ in groupby(links)]
    for i in new_links:
        print('http://www.mmtitalia.it/directory_edile/rivenditori_macchine/ammann/' + i)

The second script extends the first one: its goal is to read addresses from an input file (file_1), visit each of them, and collect the required information (that is, more links). But an error occurs (see the question title). Question: what is the problem? For some reason the page HTML that the soup variable receives differs between the two scripts, even though the variable r requests the same address. The second (problem) script:

    import requests
    from bs4 import BeautifulSoup as bs
    import re
    from itertools import groupby

    file_1 = 'links.txt'
    file_2 = 'links2.txt'
    myfile_1 = open(file_1, mode='r', encoding='ascii')
    myfile_2 = open(file_2, mode='w', encoding='ascii')
    for link in myfile_1:
        r = requests.get(link)
        soup = bs(r.text, 'lxml')
        url = soup.find('div', class_='elenco').find_all('a', href=re.compile('azienda'))
        print(url)
        urls = [i.get('href') for i in url]
        new_urls = [el for el, _ in groupby(urls)]
        for i in new_urls:
            myfile_2.write('http://www.mmtitalia.it/directory_edile/rivenditori_macchine/ammann/' + i)

The first three lines in links.txt:

    http://www.mmtitalia.it/directory_edile/rivenditori_macchine/ammann/index.htm
    http://www.mmtitalia.it/directory_edile/rivenditori_macchine/astra/index.htm
    http: www.mmtitalia.it/directory_edile/rivenditori_macchine/atlas/index.htm

1 answer

Asked it myself, answered it myself. First, you need to change the call to r = requests.get(link.strip()), which removes the trailing line feed ('\n') from each line read from the file. Secondly, add exception handling in case some of the links are 'broken'. Thirdly, build the correct string when concatenating the base URL with the collected links before writing them to the file. Working code:

    import requests
    from bs4 import BeautifulSoup as bs
    import re
    from itertools import groupby

    file_1 = 'links.txt'
    file_2 = 'links2.txt'
    myfile_1 = open(file_1, mode='r', encoding='ascii')
    myfile_2 = open(file_2, mode='w', encoding='ascii')
    for link in myfile_1:
        try:
            print(link)
            r = requests.get(link.strip())
            soup = bs(r.text, 'lxml')
            url = soup.find('div', class_='elenco').find_all('a', href=re.compile('azienda'))
            #print(url)
            urls = [i.get('href') for i in url]
            new_urls = [el for el, _ in groupby(urls)]
            for i in new_urls:
                url_base = 'http://www.mmtitalia.it/directory_edile/rivenditori_macchine/'
                uls_current = link.split('/')[5]
                myfile_2.write(url_base + uls_current + '/' + i + '\n')
        except AttributeError:
            continue
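As a side note, here is a minimal sketch (assuming the same links.txt file) of why .strip() is needed: iterating over a file object yields each line with its trailing newline still attached, so the URL handed to requests.get() is not quite what print() makes it look like.

    # Each line read from the file keeps its trailing '\n'
    with open('links.txt', mode='r', encoding='ascii') as f:
        for line in f:
            print(repr(line))          # e.g. 'http://.../ammann/index.htm\n'
            print(repr(line.strip()))  # e.g. 'http://.../ammann/index.htm'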

I would be grateful for comments on improving the code, since I only started learning Python recently.
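If it helps, here is one possible cleanup (a sketch, not a drop-in replacement: it keeps your selectors and file names but makes a few assumptions of its own, such as a 10-second timeout): with blocks close the files automatically, urljoin builds the absolute URL from the page each link was found on instead of slicing the path manually, and network errors from requests are handled alongside pages that lack the expected div.

    import re
    from itertools import groupby
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup as bs

    with open('links.txt', encoding='ascii') as infile, \
         open('links2.txt', mode='w', encoding='ascii') as outfile:
        for line in infile:
            base_url = line.strip()              # remove the trailing '\n'
            if not base_url:
                continue                         # skip blank lines
            try:
                r = requests.get(base_url, timeout=10)
                r.raise_for_status()             # turn HTTP error codes into exceptions
            except requests.RequestException:
                continue                         # skip broken or unreachable links
            soup = bs(r.text, 'lxml')
            elenco = soup.find('div', class_='elenco')
            if elenco is None:                   # page without the expected block
                continue
            hrefs = [a.get('href') for a in elenco.find_all('a', href=re.compile('azienda'))]
            unique_hrefs = [el for el, _ in groupby(hrefs)]
            for href in unique_hrefs:
                # urljoin resolves the relative href against the page it came from
                outfile.write(urljoin(base_url, href) + '\n')

Note also that itertools.groupby only collapses consecutive duplicates; if the same href can appear in non-adjacent positions, something like list(dict.fromkeys(hrefs)) would deduplicate the whole list while preserving order.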