Hello! Can you tell me how to set a User-Agent in the HTTP request header? An example would be much appreciated. I just want my little robot to identify itself somehow when it visits a site, so that it shows up in the server logs. Maybe there is an article with code examples?
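From what I have found so far, it seems the header can be set on a URLConnection with setRequestProperty before the stream is opened. Is something like this sketch the right direction? (The address and the bot name here are just placeholders I made up.)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentTest {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://example.ru/");                   // placeholder address
        URLConnection conn = url.openConnection();
        // Set the User-Agent request header before anything is read from the connection.
        conn.setRequestProperty("User-Agent", "MyLittleBot/0.1");  // placeholder name
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}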
UPD: here is the code for my little crawler:
package crawler;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ivan
 * @version 2.2
 */
public class crawler {

    private static final String CRAWLER_SOURSE =
            "C:\\apache-tomcat-6.0.36\\webapps\\wasks\\WEB-INF\\indexer\\urllist.txt";
    private static final String CRAWLER_WRITE =
            "C:\\apache-tomcat-6.0.36\\webapps\\wasks\\WEB-INF\\indexer\\urllist.txt";
    private static final Pattern p = Pattern.compile("http://(.+?.ru)/");
    private static final Set<String> urls = new HashSet<String>();

    public static void main(String[] args) {
        BufferedReader reader = null;
        Scanner scanner = null;
        URL url = null;
        FileWriter wr = null;
        String sRead = null;
        File file = new File(CRAWLER_SOURSE);

        try {
            // Read the seed URLs from the source file.
            scanner = new Scanner(file);
            while (scanner.hasNext()) {
                urls.add(scanner.nextLine());
            }
            scanner.close();

            // Collect newly found URLs in a separate set: adding to "urls" while
            // iterating over it would throw a ConcurrentModificationException.
            Set<String> found = new HashSet<String>();

            for (String s : urls) {
                StringBuffer buffer = new StringBuffer();
                try {
                    try {
                        url = new URL(s);
                    } catch (MalformedURLException e) {
                        e.printStackTrace();
                    }
                    reader = new BufferedReader(new InputStreamReader(url.openStream()));
                } catch (IOException e) {
                    e.printStackTrace();
                    continue;
                }

                // Download the whole page into the buffer.
                while (true) {
                    sRead = reader.readLine();
                    if (sRead == null) {
                        break;
                    }
                    buffer.append(sRead);
                }

                // Extract every link that matches the pattern.
                Matcher m = p.matcher(buffer.toString());
                while (m.find()) {
                    String group = m.group();
                    found.add(group);
                    System.out.println(group);
                }
            }
            urls.addAll(found);

            // Write the full URL list back out.
            wr = new FileWriter(CRAWLER_WRITE);
            for (String s : urls) {
                wr.write(s + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (wr != null) {
                try {
                    wr.flush();
                    wr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
As you can see, it downloads pages and saves the URL list. It would be nice if the webmaster could see in his logs that my bot visited the site. I also want to give the crawler a proper name, the way the big search engines do.
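If that is the right approach, then in my crawler I suppose the url.openStream() call would be replaced with something like this (the crawler name and the info address in the User-Agent string are just placeholders I invented, and java.net.URLConnection would need to be imported):

// Instead of: reader = new BufferedReader(new InputStreamReader(url.openStream()));
URLConnection conn = url.openConnection();
// A name/version plus an address where the webmaster can read about the bot,
// the way the big search engine crawlers identify themselves.
conn.setRequestProperty("User-Agent",
        "wasks-crawler/2.2 (+http://example.com/about-my-bot)");
reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));

Would the webmaster then see that string in his access logs, or do I need to do something more?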