I am preprocessing more than 10 million links in a Jupyter notebook with Python, and I would like to know the fastest way to check whether a link is valid. Here is what I have tried so far:

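    import urllib.parse

    # Attributes that must be non-empty for a URL to count as valid
    # (assumed here; min_attributes was not defined in the original snippet).
    min_attributes = ('scheme', 'netloc')

    def is_valid(url, qualifying=None):
        qualifying = min_attributes if qualifying is None else qualifying
        token = urllib.parse.urlparse(url)
        return all(getattr(token, qualifying_attr) for qualifying_attr in qualifying)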

This splits the link into its components and works quickly, but it produces results like this:

    is_valid('http://http://апревлупупц')
    True
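To see why the parser accepts this string, it can help to print the components that `urllib.parse.urlparse` extracts. The snippet below is only an illustration; the variable name `parts` is an assumption and not part of the original code.

```python
import urllib.parse

# Split the URL from the example into its components.
parts = urllib.parse.urlparse('http://http://апревлупупц')

print(parts.scheme)    # the part before the first "://"
print(parts.netloc)    # host plus the (empty) port after the colon
print(parts.hostname)  # host name without the trailing colon
print(parts.port)      # None, because no digits follow the colon
print(parts.path)      # everything after the network location
```

Both `scheme` and `netloc` are non-empty, so a check that only requires those two attributes returns True even though nothing was ever requested from the network.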

    import urllib.request

    def is_valid(url):
        try:
            urllib.request.urlopen(url)
            return True
        except Exception:
            return False

This opens each link and gives correct results, but it is very slow.

P.S. Django does not work in Jupyter, and neither do its libraries.

  • Why do you think that http://http://апревлупупц is incorrect? I see a link to a site whose address is http, with the optional port number omitted. The site http://http/ may well exist on a local network, so the link you gave could work. - andreymal
  • Quote from the WHATWG specification: “A URL-port string must be zero or more ASCII digits.” The word “zero” implies that an empty port after the colon is acceptable. Browsers also treat such a link as valid; I checked. - andreymal
  • Hmm, well, I meant that a link counts as correct if it can be opened. That matters especially here, where the goal is to find out which product the customer was viewing, and that is impossible without a link that actually opens. - Marija
  • Then the slow urllib.request.urlopen is the only possible option - andreymal
  • Thanks, good to know. - Marija
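If every link really has to be opened, one common way to reduce the total wall-clock time is to issue the requests concurrently instead of one by one. The following is a minimal sketch of that idea using a thread pool from the standard library. It is not something proposed in the thread, and the names `is_reachable` and `check_urls`, the `timeout` of 5 seconds, and the `max_workers` value are all assumptions chosen for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def is_reachable(url, timeout=5):
    """Return True if the URL can be opened within `timeout` seconds."""
    try:
        # HEAD avoids downloading the response body. Some servers reject
        # HEAD requests, so a fallback to GET may be needed in practice.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except Exception:
        # Malformed URLs, DNS failures, timeouts and HTTP errors all land here.
        return False


def check_urls(urls, checker=is_reachable, max_workers=64):
    """Run `checker` on every URL in parallel threads; return {url: bool}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = pool.map(checker, urls)
        return dict(zip(urls, results))
```

Calling `check_urls(list_of_links)` would return a dictionary mapping each link to True or False. The requests are I/O-bound, so threads spend most of their time waiting on the network; a suitable `max_workers` depends on the machine and on how much load the target servers tolerate, so it is worth tuning on a small sample first.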
