At work, I needed a page parser. I know there are plenty of ready-made solutions, such as Grab, but I wanted to write my own hack for practice. I wrote the code that logs in to the site and fetches the page, but it behaves a bit strangely.

Code:

    import pycurl
    from StringIO import StringIO

    c = pycurl.Curl()
    url = 'https://site.ru/index.php'
    url1 = 'https://site.ru/index.php?_m=tickets&_a=manage&departmentid=17&ticketstatusid=1'
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.POSTFIELDS, 'username=user&password=pass&_ca=login')
    c.setopt(pycurl.COOKIEJAR, "/tmp/cookie.txt")
    c.setopt(pycurl.COOKIEFILE, "/tmp/cookie.txt")

    def __list(url):
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.COOKIEJAR, "/tmp/cookie.txt")
        c.setopt(pycurl.COOKIEFILE, "/tmp/cookie.txt")
        c.bodyio = StringIO()
        c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
        c.get_body = c.bodyio.getvalue
        c.perform()
        return c.get_body()

    print __list(url1)

As a result, I should get the HTML of the ticket list. However, in the browser a redirect occurs after logging in, and the code in the form shown above returns the redirect page. If I comment out the part responsible for logging in and creating cookies and use the already saved cookie file, I get the page I need.

Please tell me why this happens and how to get rid of it. To be honest, this is the first time I have seen Python.

  • I don't quite understand the question. Why does the redirect happen? Because the site developers wanted it that way. How to get rid of it? Ask the developers to rework the site so that it works without redirects :) Or handle redirects properly in your code. I have never worked with pycurl (why is it needed when there are urllib and requests?), but a quick Google search suggests c.setopt(c.FOLLOWLOCATION, True) - andreymal

1 answer

Redirecting to the [current] page is common behavior when submitting web forms. This is the so-called Post/Redirect/Get pattern, which avoids resubmitting the form when the user refreshes the page or returns to the tab.

Changing the method from POST to GET when following a redirect was common in browsers, but it did not conform to the standard until RFC 7231 legitimized this behavior. pycurl follows the browser behavior unless a CURL_REDIR_POST_* value (set through CURLOPT_POSTREDIR) overrides it; this only makes sense together with CURLOPT_FOLLOWLOCATION.
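The Post/Redirect/Get flow can be observed without touching the real site. The following is only a minimal sketch, assuming Python 3 and the standard library; the toy server, its paths (`/login`, `/home`, `/tickets`) and the cookie value are made up for illustration and do not describe how site.ru works. It shows that the response to the login POST is the redirect target, fetched with GET, and that the ticket page needs a second, separate request that carries the cookie.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site (hypothetical): a POST sets a cookie and redirects."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form body
        self.send_response(302)  # Post/Redirect/Get: answer with a redirect
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Location", "/home")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        if self.path == "/tickets" and logged_in:
            body = b"ticket list"
        elif self.path == "/home":
            body = b"home page"
        else:
            body = b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port

    # The opener stores cookies from responses and sends them back later.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    form = urllib.parse.urlencode({"username": "user", "password": "pass"}).encode()
    # Step 1: POST the form. The 302 is followed with a GET, so we get /home.
    after_login = opener.open(base + "/login", data=form).read()
    # Step 2: a separate GET for the page we actually want, with the cookie.
    tickets = opener.open(base + "/tickets").read()

    server.shutdown()
    server.server_close()
    return after_login, tickets
```

Calling `run_demo()` returns the body seen right after the login and the body of the second request, which makes the difference between the two steps visible.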

To log in to the site (to receive cookies) and then, with those cookies, request the page you need:

    import requests

    with requests.Session() as s:
        # get cookies
        s.post(login_url, data=dict(username='user', password='pass', _ca='login'))
        # use cookies
        html = s.get(ticket_url).content

You can also pass cookies explicitly or save/load them from a file.
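As an illustration of saving and loading cookies from a file, here is a rough sketch that uses only the standard library's `http.cookiejar.MozillaCookieJar` (Python 3 assumed). It writes the cookies.txt format, which is also the format curl uses for its cookie files. The cookie name, value and file location below are hypothetical; in practice the jar would be filled by the login response rather than by hand.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (stand-in for one set by a login)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_cookies(jar, path):
    # ignore_discard=True keeps session cookies, which are skipped by default.
    jar.save(path, ignore_discard=True)


def load_cookies(path):
    jar = MozillaCookieJar()
    jar.load(path, ignore_discard=True)
    return jar


# Hypothetical file location, placed in a fresh temporary directory.
cookie_path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

jar = MozillaCookieJar()
jar.set_cookie(make_session_cookie("session", "abc123", "site.ru"))
save_cookies(jar, cookie_path)

# Later (for example in another run) the cookies can be read back.
restored = load_cookies(cookie_path)
restored_values = {c.name: c.value for c in restored}
```

The `ignore_discard=True` flag matters here: login cookies are often session cookies without an expiry date, and those are not written to or read from the file unless the flag is set.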