Initially, the task is this:

  1. Compare the links in TXT #2 against the links in TXT #1 by domain (*)
  2. Save the unique links to a separate file, TXT #3
  3. Append the unique links to TXT #1

(*) When comparing by domain, only the site's domain is checked, without the http:// or https:// prefix; the domain zone (.com, .ru, etc.) and everything after it are also ignored.
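A normalization rule like the one described above could be sketched as follows. This is a minimal sketch using only the standard library; domain_key is a hypothetical helper name, not part of the asker's program:

```python
from urllib.parse import urlparse

def domain_key(url):
    # Hypothetical helper: urlparse drops the http:// or https:// scheme,
    # and slicing off the last dot-separated label removes the domain
    # zone (.com, .ru, etc.).
    netloc = urlparse(url.strip()).netloc
    return '.'.join(netloc.split('.')[:-1])

print(domain_key('http://site.com/'))                 # site
print(domain_key('https://site.com/index.php?id=1'))  # site
```

With this key, http://site.com/ and https://site.ru/page compare as the same domain, which matches the rule stated above.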

The program works reliably and prints the lines as long as the input data is standardized and looks like this:

Input data:

 1.txt
 http://site.com/
 https://sit1e.com/vwdsvdfw/vwdefei/userpanel?cid=2
 https://site.com/index.php?id=1
 http://sit2e.com/
 http://site.com/vwifow/fwviiwf?

 2.txt
 http://site.com/
 https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
 https://site.com/index.php?id=1
 http://sit222e.com/
 http://site.com/vwifow/fwviiwf?

Program Code:

 import urllib.parse
 import re

 def get_domains(filename):
     # Build a comparison key per line: the netloc without its last label
     with open(filename) as f:
         return ['.'.join(urllib.parse.urlparse(line.strip()).netloc.split('.')[:-1])
                 for line in f]

 dom1 = get_domains(r'1.txt')
 dom2 = get_domains(r'2.txt')
 # Symmetric difference: domains present in only one of the two files
 doms = set(dom1) ^ set(dom2)

 def search_domains(filename, doms):
     with open(filename) as f:
         text = f.read()
     pat = r'(https?://[^./\r\n]*?\b(?:{})\b[^\r\n]*)'.format('|'.join(doms))
     return re.findall(pat, text)

 outin13 = search_domains(r'2.txt', doms)

 with open('1.txt', 'a') as result:
     for i in outin13:
         result.write('\n' + i)

 with open('3.txt', 'w') as result:
     for i in outin13:
         result.write(i + '\n')

 input('Press Enter to exit the program')

The result of the program:

 1.txt
 http://site.com/
 https://sit1e.com/vwdsvdfw/vwdefei/userpanel?cid=2
 https://site.com/index.php?id=1
 http://sit2e.com/
 http://site.com/vwifow/fwviiwf?
 https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
 http://sit222e.com/

 3.txt
 https://sit1e213.com/vwdsvdfw/vwdefei/userpanel?cid=2
 http://sit222e.com/

When we ran the program on real-world lines, it raised an error:

 TypeError: can only concatenate str (not "tuple") to str 

So I switched to writing to the files like this:

 with open('1.txt', 'a') as file:
     print(*outin13, sep='\n', file=file)

 with open('3.txt', 'w') as file:
     print(*outin13, sep='\n', file=file)

Sample tuple output:

 3.txt
 ('http://site254.com/product-detail.php?pid=86 ', '')
 ('http://www.site345.com/seatrip07/place.php?tid=2 ', '')
 ('http://site1234.hr/moto-grip/index.php?m=135 ', '')

How can I make it output strings instead of tuples? P.S. I tried to find specific lines that would break the program and force it to output tuples, but I had no success: I could not identify any particular kind of line that triggers it.

  • Can you add to the question the data that reproduces the problem with the tuples? - MaxU 8:39 pm
  • Which line fails? Better yet, post the full text of the error. - Xander
  • Traceback (most recent call last): File "C:/Web/Python/v10.py", line 25, in <module> result.write('\n' + i) TypeError: can only concatenate str (not "tuple") to str - Ian

1 answer

In real-world lines, a trailing comma sometimes appears after the address. For example:

http://site254.com/product-detail.php?pid=86 ,

This is what produces the tuples.

To fix it, change these lines:

 for i in outin13:
     result.write('\n' + i)

to:

 for i in outin13:
     if isinstance(i, tuple):
         # findall returned a tuple: take the matched URL, element 0
         result.write('\n' + i[0])
     else:
         result.write('\n' + i)
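For context: re.findall returns plain strings when the pattern has at most one capturing group, but tuples (one element per group) as soon as there are two or more groups. A minimal sketch with made-up data, illustrating both cases and a groupless alternative:

```python
import re

text = 'http://site254.com/product-detail.php?pid=86 ,'

# One capturing group: findall returns a list of strings.
print(re.findall(r'(https?://\S+)', text))
# ['http://site254.com/product-detail.php?pid=86']

# Two capturing groups: findall returns a list of tuples, one element
# per group, even when a group matched the empty string.
print(re.findall(r'(https?://\S+)(,?)', text))
# [('http://site254.com/product-detail.php?pid=86', '')]

# With no groups (or only non-capturing (?:...) groups),
# findall returns the full matches as strings again.
print(re.findall(r'https?://\S+', text))
# ['http://site254.com/product-detail.php?pid=86']
```

So instead of special-casing tuples in the loop, another option is to make every extra group in the pattern non-capturing with (?:...), so findall keeps returning strings.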
  • Used your code - it works as it should, thanks! - Ian
  • @Ian Since the code works, please mark the answer as accepted. The author will receive reputation points, and everyone else will see that the question has been resolved. - Sergey Nudnov
  • I searched the internet for an answer myself, and it turns out that re.findall returns tuples when the regular expression contains more than one capturing group, so you need to take element zero of each tuple in the loop. It outputs tuples of the form ('what matched the regular expression', 'what did not match'); the second part was always empty, which confirms the method works - you just need to output the result in a special way. - Ian
  • @Ian Thanks, good to know. - Ivan Kramarchuk 8:39 pm