How to download all the pictures from a site with C#?

Question

Hello! How can I download all the pictures from a site? I understand how to download a single picture: I found the code below on this site, and it is clear to me.

    using System;
    using System.Net;

    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            Uri uri = new Uri("");
            client.DownloadFileAsync(uri, "picture.jpg");
            Console.WriteLine("Picture downloaded");
            Console.Read();
        }
    }

And how can I download all the pictures, for example, from Instagram? I think I need to create an array, put all the pictures in it, and wrap the code above either in a foreach or in something like this:

    for (int i = 1; i <= pictures; i++)
    {
        WebClient client = new WebClient();
        Uri uri = new Uri(" ");
        client.DownloadFileAsync(uri, "picture.jpg");
        Console.WriteLine("Picture downloaded");
        Console.Read();
    }

But how do I find all the pictures on the site?

@dreenline: Wow, how do you suggest distinguishing a photo from just a design element?
@VladD Actually, that is the question :) How to find all the photos on a site, and whether it is possible at all.

Accepted Answer · 2016-11-27T06:50:22

The question does not name a specific site, so I will suggest a technique that works for any site.

The solution below only handles images in the src attribute of img tags, but a solution for images set through background-image can be built on the same basis; it is more complicated, but possible. The JavaScript calls use jQuery for simplicity, assuming that it is loaded on the target site. Plain JavaScript, or any other library that the site actually loads, would work as well.

Use the CefSharp library for such tasks.

What is it?

It is a managed wrapper around CEF (Chromium Embedded Framework). That is, you get the power of Chromium under programmatic control.

Why choose CEF / CefSharp?

You do not have to parse pages yourself (a difficult and thankless task that I strongly advise against).
You can work with a fully loaded page (after its scripts have run).
You can execute arbitrary JavaScript with the latest language features.
You can issue AJAX requests from JavaScript and, on success, raise events in the C# code with the AJAX result.

CefSharp Varieties

CefSharp.WinForms
CefSharp.Wpf
CefSharp.OffScreen

The first two are used when you need to give users a browser control. They are conceptually similar to WebBrowser in Windows Forms, except that WebBrowser wraps IE rather than Chromium.

We do not need a visible control here, so we will use the CefSharp.OffScreen version.

Writing the code

Suppose we have a console application, although the application type is up to you.

Install version 51 of the CefSharp.OffScreen NuGet package:

 Install-Package CefSharp.OffScreen -Version 51.0.0

In C#, every JavaScript array is mapped to List<object>. The result of a script is wrapped in object, which holds a List<object>, string, bool, or int depending on what the script returned. To make the results strongly typed, create a small ConvertHelper:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ConvertHelper
    {
        public static T[] GetArrayFromObjectList<T>(object obj)
        {
            return ((IEnumerable<object>)obj)
                .Cast<T>()
                .ToArray();
        }

        public static List<T> GetListFromObjectList<T>(object obj)
        {
            return ((IEnumerable<object>)obj)
                .Cast<T>()
                .ToList();
        }

        public static T ToTypedVariable<T>(object obj)
        {
            if (obj == null)
            {
                // null for reference types; also safe when T is a value type
                return default(T);
            }

            Type type = typeof(T);

            // T is an array: convert the List<object> element by element
            if (type.IsArray)
            {
                dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetArrayFromObjectList))
                    .MakeGenericMethod(type.GetElementType())
                    .Invoke(null, new[] { obj });
                return dynamicResult;
            }

            // T is List<X>: same idea, but keep the result as a list
            if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
            {
                dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetListFromObjectList))
                    .MakeGenericMethod(type.GetGenericArguments().Single())
                    .Invoke(null, new[] { obj });
                return dynamicResult;
            }

            // Simple values (string, bool, int) can be cast directly
            return (T)obj;
        }
    }
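To see why such a helper is needed, here is a minimal sketch that does not touch CefSharp at all. It assumes, as described above, that a JavaScript array of strings arrives on the C# side as a List<object> boxed in object. The class name CastDemo and the sample values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CastDemo
{
    // Simulates a script result (assumption): a List<object> boxed in object.
    public static object FakeScriptResult()
    {
        return new List<object> { "a.png", "b.jpg" };
    }

    // A direct cast fails, because List<object> is not string[].
    public static bool DirectCastWorks(object scriptResult)
    {
        try
        {
            var unused = (string[])scriptResult;
            return true;
        }
        catch (InvalidCastException)
        {
            return false;
        }
    }

    // Casting element by element succeeds when every element really is a string.
    public static string[] ElementwiseCast(object scriptResult)
    {
        return ((IEnumerable<object>)scriptResult).Cast<string>().ToArray();
    }
}
```

Here DirectCastWorks reports false for the simulated result, while ElementwiseCast returns the two strings; ConvertHelper applies the same element-by-element idea, choosing the element type through reflection.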

Create a CefSharpWrapper class:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using CefSharp;
    using CefSharp.OffScreen;

    public sealed class CefSharpWrapper
    {
        private ChromiumWebBrowser _browser;

        public void InitializeBrowser()
        {
            CefSettings settings = new CefSettings();

            // Disable GPU in WPF and Offscreen until GPU issues have been resolved
            settings.CefCommandLineArgs.Add("disable-gpu", "1");

            // Perform dependency check to make sure all relevant resources are in our output directory.
            Cef.Initialize(settings, shutdownOnProcessExit: true, performDependencyCheck: true);

            _browser = new ChromiumWebBrowser();

            // Wait until the browser is initialized
            AutoResetEvent waitHandle = new AutoResetEvent(false);
            EventHandler onBrowserInitialized = null;
            onBrowserInitialized = (sender, e) =>
            {
                _browser.BrowserInitialized -= onBrowserInitialized;
                waitHandle.Set();
            };
            _browser.BrowserInitialized += onBrowserInitialized;
            waitHandle.WaitOne();
        }

        public void ShutdownBrowser()
        {
            // Clean up Chromium objects. You need to call this in your application,
            // otherwise you will get a crash when closing.
            Cef.Shutdown();
        }

        public Task<T> GetResultAfterPageLoad<T>(string pageUrl, Func<Task<T>> onLoadCallback)
        {
            TaskCompletionSource<T> tcs = new TaskCompletionSource<T>();
            EventHandler<LoadingStateChangedEventArgs> onPageLoaded = null;

            // An event that is fired when the first page is finished loading.
            // This returns to us from another thread.
            onPageLoaded = async (sender, e) =>
            {
                // Check to see if loading is complete - this event is called twice,
                // once when loading starts and a second time when it has finished
                // (rather than an iframe within the main frame).
                if (!e.IsLoading)
                {
                    // Remove the load event handler, because we only want one snapshot of the initial page.
                    _browser.LoadingStateChanged -= onPageLoaded;
                    try
                    {
                        tcs.SetResult(await onLoadCallback());
                    }
                    catch (Exception ex)
                    {
                        // Pass callback failures on to the awaiting caller instead of losing them
                        tcs.SetException(ex);
                    }
                }
            };
            _browser.LoadingStateChanged += onPageLoaded;
            _browser.Load(pageUrl);
            return tcs.Task;
        }

        public async Task<T> EvaluateJavascript<T>(string script)
        {
            JavascriptResponse javascriptResponse = await _browser.EvaluateScriptAsync(script);
            if (javascriptResponse.Success)
            {
                object scriptResult = javascriptResponse.Result;
                return ConvertHelper.ToTypedVariable<T>(scriptResult);
            }
            throw new ScriptException(javascriptResponse.Message);
        }
    }

Next we call our CefSharpWrapper class from the Main method.

    using System.IO;
    using System.Net;
    using System.Threading.Tasks;

    public class Program
    {
        private static void Main()
        {
            MainAsync().Wait();
        }

        private static async Task MainAsync()
        {
            CefSharpWrapper wrapper = new CefSharpWrapper();
            wrapper.InitializeBrowser();

            // Load the page, then collect the src of every img element with jQuery
            string[] imageUrls = await wrapper.GetResultAfterPageLoad("https://yandex.ru", async () =>
                await wrapper.EvaluateJavascript<string[]>(
                    "$('img').map((index, element) => $(element).prop('src')).toArray()"));

            string imageFolder = @"C:\Test";
            if (!Directory.Exists(imageFolder))
            {
                Directory.CreateDirectory(imageFolder);
            }

            using (WebClient client = new WebClient())
            {
                for (int i = 0; i < imageUrls.Length; i++)
                {
                    string imageUrl = imageUrls[i];
                    byte[] fileBytes = await client.DownloadDataTaskAsync(imageUrl);

                    // You could write an algorithm that picks the right extension
                    string imagePath = Path.Combine(imageFolder, i + ".jpg");
                    File.WriteAllBytes(imagePath, fileBytes);
                }
            }

            wrapper.ShutdownBrowser();
        }
    }
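The code above saves every file as .jpg and leaves the choice of extension as an exercise. One possible approach, sketched here under my own assumptions rather than taken from the answer, is to look at the first bytes of the downloaded data. The class name ImageExtensionGuesser and the ".bin" fallback are hypothetical, and only a handful of common formats are checked.

```csharp
using System;

public static class ImageExtensionGuesser
{
    // Returns an extension such as ".png" based on the file signature,
    // or ".bin" (an arbitrary fallback) when the signature is not recognized.
    public static string GuessExtension(byte[] data)
    {
        if (data == null) return ".bin";
        if (HasBytesAt(data, 0, 0xFF, 0xD8, 0xFF)) return ".jpg";
        if (HasBytesAt(data, 0, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return ".png";
        if (HasBytesAt(data, 0, 0x47, 0x49, 0x46, 0x38)) return ".gif"; // "GIF8"
        if (HasBytesAt(data, 0, 0x42, 0x4D)) return ".bmp";             // "BM"
        // WebP: "RIFF", four size bytes, then "WEBP"
        if (HasBytesAt(data, 0, 0x52, 0x49, 0x46, 0x46) &&
            HasBytesAt(data, 8, 0x57, 0x45, 0x42, 0x50)) return ".webp";
        return ".bin";
    }

    // True when data contains the given signature starting at offset.
    private static bool HasBytesAt(byte[] data, int offset, params byte[] signature)
    {
        if (data.Length < offset + signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[offset + i] != signature[i]) return false;
        }
        return true;
    }
}
```

In the loop above, the file name could then be built as i + ImageExtensionGuesser.GuessExtension(fileBytes) instead of i + ".jpg".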

Answer 2 · 2016-11-26T14:10:28

Download the page of the target site.
Parse this page (that is the whole catch, heh) to find all the links to the pictures you need, and save the links to a list.
Loop through this list and download each picture.
P.S. Some sites have an API: in response to a request they return information in XML or JSON, which is much easier to work with than parsing the site's pages.
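The three steps above can be sketched with nothing but the base class library. This is a deliberately naive illustration, not a robust scraper: it uses a regular expression where a real HTML parser would be safer, it only sees img tags present in the raw markup (nothing added by scripts), and the class name NaiveImageScraper, the pattern, and the ".jpg" placeholder extension are my own assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class NaiveImageScraper
{
    // Rough pattern for <img ... src="..."> or src='...'; unusual HTML can defeat it.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*([""'])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Step 2: collect distinct absolute http(s) image URLs from the markup.
    public static List<string> ExtractImageUrls(string html, Uri pageUri)
    {
        var seen = new HashSet<string>();
        var urls = new List<string>();
        foreach (Match match in ImgSrc.Matches(html))
        {
            string raw = match.Groups["url"].Value;
            Uri absolute;
            // Resolve relative links such as "img/a.png" against the page address.
            if (raw.Length > 0
                && Uri.TryCreate(pageUri, raw, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                && seen.Add(absolute.AbsoluteUri))
            {
                urls.Add(absolute.AbsoluteUri);
            }
        }
        return urls;
    }

    // Steps 1 and 3: download the page, then every image found in it.
    public static async Task DownloadAllAsync(string pageUrl, string folder)
    {
        var pageUri = new Uri(pageUrl);
        Directory.CreateDirectory(folder);
        using (var http = new HttpClient())
        {
            string html = await http.GetStringAsync(pageUri);
            List<string> urls = ExtractImageUrls(html, pageUri);
            for (int i = 0; i < urls.Count; i++)
            {
                byte[] bytes = await http.GetByteArrayAsync(urls[i]);
                // ".jpg" is a placeholder; the real image format is not checked here.
                File.WriteAllBytes(Path.Combine(folder, i + ".jpg"), bytes);
            }
        }
    }
}
```

A call such as NaiveImageScraper.DownloadAllAsync("https://example.com/", "images").Wait() from Main would run all three steps; the URL and folder name here are placeholders.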

Vladislav Khapin · Answer 3 · 2016-11-27T14:12:20

I will also add my own solution, which uses AngleSharp to parse HTML/CSS and build a DOM tree in C#.

The library includes an HTML/CSS parser (BrowsingContext, HtmlParser, CssParser) as well as a JavaScript plugin based on Jint, from the AngleSharp.Scripting.JavaScript package, which lets you subscribe to events from C# and call JavaScript code from C# (although I did not manage to call jQuery code with it :), so apparently there are limitations).

This example does not handle authorization, so it will not be able to pull everything from, say, Instagram, but it works fine for the main pages of pikabu or imgur.

If we collect the list of URLs from the background-image and img.src properties, the code looks like this:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using AngleSharp;
    using AngleSharp.Dom.Html;

    class WebImageElementParser
    {
        private readonly IBrowsingContext _context;

        public WebImageElementParser()
        {
            // AngleSharp BrowsingContext
            _context = BrowsingContext.New(new Configuration().WithDefaultLoader().WithCss().WithJavaScript());
        }

        public async Task<IEnumerable<string>> GetImageUrlsAsync(string siteUrl)
        {
            var documentResult = await _context.OpenAsync(siteUrl);

            // Look at every element of the DOM tree and take the value of its background-image property
            var backgroundImagesUrls = documentResult.QuerySelectorAll("*")
                .Where(htmlElement => !String.IsNullOrWhiteSpace(htmlElement.Style.BackgroundImage))
                .Select(htmlElement => htmlElement.Style.BackgroundImage)
                // url(\"http://s8.pikabu.ru/video/2016/11/27/7/1480245149289358252.jpg\")
                .Select(styleValue => Regex.Match(styleValue, "\\\"(?<url>.*)\\\""))
                // http://s8.pikabu.ru/video/2016/11/27/7/1480245149289358252.jpg
                .Where(regex => regex.Success)
                .Select(regex => regex.Groups["url"].Value);

            var imgElementUrls = documentResult.Body.QuerySelectorAll("img")
                .Cast<IHtmlImageElement>()
                .Select(img => img.Source);

            return imgElementUrls.Union(backgroundImagesUrls).ToList();
        }
    }
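The regular expression above expects the URL to be wrapped in double quotes. CSS also allows single quotes or no quotes inside url(...), so a slightly more tolerant extraction might look like the sketch below. It is an illustration under my own assumptions, not part of the answer: the class name CssUrlExtractor and the pattern are made up, and escaped characters inside the URL are not handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssUrlExtractor
{
    // Matches url("..."), url('...') and url(...), with optional spaces inside the brackets.
    private static readonly Regex UrlFunction = new Regex(
        @"url\(\s*['""]?(?<url>[^'"")]+?)['""]?\s*\)",
        RegexOptions.IgnoreCase);

    // Returns every URL found in a background-image value; empty for "none" or gradients.
    public static List<string> ExtractUrls(string backgroundImageValue)
    {
        var urls = new List<string>();
        if (string.IsNullOrWhiteSpace(backgroundImageValue)) return urls;
        foreach (Match match in UrlFunction.Matches(backgroundImageValue))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }
}
```

Because it returns a list, the same method also copes with values that contain several comma-separated url(...) layers.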

As an example, I also wrote a way to download all the images at the URLs found (provided the URL contains a file extension) and save them grouped by host name:

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    namespace Test
    {
        class WebImageDownloader : IDisposable
        {
            private readonly HttpClient _httpClient = new HttpClient() { Timeout = TimeSpan.FromSeconds(30) };

            public async Task SaveImageAsync(string imageUrl)
            {
                try
                {
                    Uri imageUri;
                    // We do need to be able to reach the resource, right? :)
                    if (Uri.TryCreate(imageUrl, UriKind.Absolute, out imageUri))
                    {
                        using (HttpResponseMessage imageResponse = await _httpClient.GetAsync(imageUri))
                        {
                            // HTTP result != 200 OK -> HttpRequestException
                            imageResponse.EnsureSuccessStatusCode();
                            using (Stream imageStream = await imageResponse.Content.ReadAsStreamAsync())
                            {
                                await SaveImageAsync(imageUri, imageStream);
                            }
                        }
                    }
                    else
                    {
                        // Handle invalid URLs
                        Console.ForegroundColor = ConsoleColor.Cyan;
                        Console.WriteLine($"Not an absolute URI: {imageUrl}");
                        Console.ForegroundColor = ConsoleColor.Gray;
                    }
                }
                catch (HttpRequestException e)
                {
                    // Handle request errors
                    Console.ForegroundColor = ConsoleColor.Cyan;
                    Console.WriteLine($"{e.Message} : {imageUrl}");
                    Console.ForegroundColor = ConsoleColor.Gray;
                }
            }

            private async Task SaveImageAsync(Uri imageUri, Stream imageStream)
            {
                /* Take the path segment (https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)
                   and try to extract its last part (after the last slash) together with its extension:
                   https://i.stack.imgur.com/pnAAg.jpg?s=32&g=1 -> /pnAAg.jpg (path) -> pnAAg.jpg (regexp) */
                Match fileExtensionMatch = Regex.Match(imageUri.AbsolutePath, @"(?!/)[\w\d]+\.\w+", RegexOptions.RightToLeft);
                if (fileExtensionMatch.Success)
                {
                    // Create a directory for the image's host, to group the files at least somehow
                    string imageDirectory = Path.Combine(Environment.CurrentDirectory, $"Images_{imageUri.Host.Replace('.', '_')}");
                    if (!Directory.Exists(imageDirectory))
                        Directory.CreateDirectory(imageDirectory);

                    string fileName = fileExtensionMatch.Value;
                    string fullPathForFile = Path.Combine(imageDirectory, fileName);
                    using (FileStream newFile = File.Create(fullPathForFile))
                    {
                        await imageStream.CopyToAsync(newFile);
                        Console.WriteLine($"{imageUri.AbsoluteUri} ----> {fullPathForFile}");
                    }
                }
                else
                {
                    // Handle a missing file extension
                    Console.ForegroundColor = ConsoleColor.Cyan;
                    Console.WriteLine($"No match for file name and extension in URL {imageUri.AbsoluteUri}");
                    Console.ForegroundColor = ConsoleColor.Gray;
                }
            }

            public void Dispose()
            {
                _httpClient.Dispose();
            }
        }
    }
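As an alternative to the regular expression, the file name can also be taken from the URL path with System.IO.Path. The sketch below swaps in that technique and is not the answer's method; the class name ImageFileNames and the choice to return null when there is no extension are my own assumptions.

```csharp
using System;
using System.IO;

public static class ImageFileNames
{
    // Returns the last path segment of the URL if it has an extension, otherwise null.
    // The query string is ignored because only Uri.AbsolutePath is inspected.
    public static string TryGetFileName(Uri imageUri)
    {
        string name = Path.GetFileName(imageUri.AbsolutePath);
        if (string.IsNullOrEmpty(name) || !Path.HasExtension(name))
        {
            return null;
        }
        return name;
    }
}
```

For the URL from the comment above, https://i.stack.imgur.com/pnAAg.jpg?s=32&g=1, this yields pnAAg.jpg; for a URL whose path has no extension it yields null, which corresponds to the "no match" branch of the downloader.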

Main method of the console application:

    using System.Threading.Tasks;

    namespace Test
    {
        class Program
        {
            static void Main(string[] args)
            {
                LoadImages().Wait();
            }

            static async Task LoadImages()
            {
                var imageLoader = new WebImageElementParser();
                var urls = await imageLoader.GetImageUrlsAsync("https://imgur.com/");
                using (var loader = new WebImageDownloader())
                {
                    foreach (var url in urls)
                    {
                        await loader.SaveImageAsync(url);
                    }
                }
            }
        }
    }
