The original task is:
- Compare the links in 2.txt against the links in 1.txt by domain (*)
- Save the links unique to 2.txt in a separate file, 3.txt
- Append those unique links to 1.txt
(*) When comparing by domain, only the site's domain is checked, without the http:// or https:// prefix. Everything after the domain zone (.com, .ru, etc.) is ignored as well.
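To make the comparison rule concrete, here is a minimal sketch of it. The helper name `domain_key` and the two short link lists are made up for illustration. Like `get_domains` in the program, it also drops the last dot-separated label (the zone itself), so `site.com` is compared as `site`.

```python
from urllib.parse import urlparse

def domain_key(url):
    """Return the host of `url` without scheme, path, or the final zone label."""
    host = urlparse(url.strip()).netloc
    # Drop the last dot-separated label (.com, .ru, ...), mirroring get_domains.
    return '.'.join(host.split('.')[:-1])

# Hypothetical sample data, taken from the example files.
links_1 = ['http://site.com/', 'https://sit1e.com/vwdsvdfw/vwdefei/userpanel?cid=2']
links_2 = ['https://site.com/index.php?id=1', 'http://sit222e.com/']

known = {domain_key(u) for u in links_1}
unique = [u for u in links_2 if domain_key(u) not in known]
print(unique)  # ['http://sit222e.com/']
```

Comparing keys in a set keeps the check independent of the scheme and of anything after the host.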
The program works reliably and writes plain lines when the input data is well-formed, like this:
Input data:
1.txt
    http://site.com/
    https://sit1e.com/vwdsvdfw/vwdefei/userpanel?cid=2
    https://site.com/index.php?id=1
    http://sit2e.com/
    http://site.com/vwifow/fwviiwf?
2.txt
    http://site.com/
    https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
    https://site.com/index.php?id=1
    http://sit222e.com/
    http://site.com/vwifow/fwviiwf?
Program Code:
    import re
    import urllib.parse

    def get_domains(filename):
        # Host of each line without its last dot-separated label ("site" for site.com)
        with open(filename) as f:
            return ['.'.join(urllib.parse.urlparse(line.strip()).netloc.split('.')[:-1]) for line in f]

    dom1 = get_domains(r'1.txt')
    dom2 = get_domains(r'2.txt')
    doms = set(dom1) ^ set(dom2)  # domains present in only one of the two files

    def search_domains(filename, doms):
        # Find every link in the file whose host contains one of the given domains
        with open(filename) as f:
            text = f.read()
        pat = r'(https?://[^./\r\n]*?\b(?:{})\b[^\r\n]*)'.format('|'.join(doms))
        return re.findall(pat, text)

    outin13 = search_domains(r'2.txt', doms)

    with open('1.txt', 'a') as result:
        for i in outin13:
            result.write('\n' + i)

    with open('3.txt', 'w') as result:
        for i in outin13:
            result.write(i + '\n')

    input('Press Enter to exit the program')
The result of the program:
1.txt
    http://site.com/
    https://sit1e.com/vwdsvdfw/vwdefei/userpanel?cid=2
    https://site.com/index.php?id=1
    http://sit2e.com/
    http://site.com/vwifow/fwviiwf?
    https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
    http://sit222e.com/
3.txt
    https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
    http://sit222e.com/
When I ran the program on real data, it raised an error:
TypeError: can only concatenate str (not "tuple") to str
So I switched to writing the files this way:
    with open('1.txt', 'a') as file:
        print(*outin13, sep='\n', file=file)

    with open('3.txt', 'w') as file:
        print(*outin13, sep='\n', file=file)
Sample tuple output:
3.txt
    ('http://site254.com/product-detail.php?pid=86 ', '')
    ('http://www.site345.com/seatrip07/place.php?tid=2 ', '')
    ('http://site1234.hr/moto-grip/index.php?m=135 ', '')
How can I make the program write plain strings rather than tuples?
P.S. I tried to find lines that would break the program and force it to output tuples, but without success: I could not pin down any particular kind of line that causes it.
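One thing that might explain it, offered as an assumption rather than a confirmed diagnosis: `re.findall` returns tuples instead of strings whenever the pattern contains more than one capturing group. The domains are inserted into the pattern unescaped, so a real line whose host part contains parentheses would add a second group. The sketch below reproduces that shape of output. The domain `shop(1)` and the helper `build_pattern` are hypothetical and exist only for this illustration.

```python
import re

def build_pattern(doms, escape):
    # Same template as in the program; `escape` toggles re.escape on each domain.
    parts = [re.escape(d) if escape else d for d in doms]
    return r'(https?://[^./\r\n]*?\b(?:{})\b[^\r\n]*)'.format('|'.join(parts))

text = 'http://site254.com/product-detail.php?pid=86\n'
doms = ['site254', 'shop(1)']  # 'shop(1)' is a made-up domain containing parentheses

# Unescaped: "(1)" becomes a second capturing group, so findall yields tuples.
raw = re.findall(build_pattern(doms, escape=False), text)
print(raw)   # [('http://site254.com/product-detail.php?pid=86', '')]

# Escaped: only the outer group remains, so findall yields plain strings.
safe = re.findall(build_pattern(doms, escape=True), text)
print(safe)  # ['http://site254.com/product-detail.php?pid=86']
```

If this is indeed the cause, escaping each domain with `re.escape` before joining would keep the results as strings, and the original `result.write` loops would work unchanged.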