How can I best optimize this algorithm? (Working with a CSV file)

Question

I want to write code that works with a large database. As a first task, I set out to write a simple program that reads the database in chunks (my computer is not very powerful and cannot open the whole file at once) and pulls out some data, for example email addresses. I used the org.apache.commons:commons-csv:1.2 library. The database contains large blocks of data (for example, letters that contain addresses). I ended up with something like this:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    public class MainClass {
        static public final Pattern PATTERN_MAIL = Pattern.compile(
                "([a-z0-9_-]+\\.)*[a-z0-9_-]+@[a-z0-9_-]+(\\.[a-z0-9_-]+)*\\.[a-z]{2,6}");

        public static void main(String[] args) throws Exception {
            Matcher getMail;
            String path = "cpath";
            Reader in = new FileReader(path);
            File file = new File("data.text");
            FileWriter fileWriter = new FileWriter(file.getAbsoluteFile());
            BufferedWriter bufWriter = new BufferedWriter(fileWriter);
            int k = 1;
            Iterable<CSVRecord> records = CSVFormat.EXCEL.parse(in);
            List<CSVRecord> tableStr = new ArrayList<>();
            for (CSVRecord record : records) {
                tableStr.add(record);
                // Process the records in batches of 5
                if (tableStr.size() == 5) {
                    System.out.println("list" + k);
                    for (int i = 0; i < tableStr.size(); i++) {
                        getMail = PATTERN_MAIL.matcher(tableStr.get(i).toString());
                        while (getMail.find()) {
                            bufWriter.write(tableStr.get(i).toString()
                                    .substring(getMail.start(), getMail.end()) + "\n");
                        }
                    }
                    tableStr.clear();
                    System.out.println("tableStr deleted");
                    k++;
                }
            }
            bufWriter.close();
            System.out.println("buf was closed");
        }
    }

The code works: the regular expression does extract the addresses. The problem is that processing a large block of text takes a long time. Experienced programmers, please tell me how I can speed this code up. I know the code is clumsy, but I'm new to this and still learning, so any advice is welcome.

javax.mail.internet.InternetAddress, if you need a minimal sanity check without going overboard.
Not a bad option, but can this class be used if there are, say, a hundred letters?
That is, 100 blocks of information containing the texts of letters, their subjects, and the addresses of senders and recipients.
If there is a column holding the text of a letter that contains one or more emails, do those need to be found as well?
Understand that the CSVRecord record object contains an array of data strings.
There may be several lines with table fields, as well as lines containing, for example, a letter.
And such a line may hold, for example, 5 letters, each with several paragraphs of text, a subject, and addresses.
When my code reaches such blocks of information, it runs the regular expression over them, but it takes a fair amount of time.

Sergey Mitrofanov · Answer 1 · 2016-04-25T14:32:53

Judging by the task description and the clarifications, CSV is not needed at all. There is a text file containing emails that need to be found and saved. CSV would matter only if we were looking for (or rather, checking) an email in a specific column. In our case CSV is used only to collect 5 records at a time, and only then do we start searching every column for emails as substrings. In the end we still scan the whole line.

So simply read the file line by line and search each line.

    // RFC 2822 pattern, wrapped in a group for Matcher
    static public final Pattern PATTERN_MAIL = Pattern.compile(
            "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
            + "|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]"
            + "|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")"
            + "@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
            + "|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}"
            + "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:"
            + "(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]"
            + "|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])");

    // ---

    File inFile = new File("in.txt");
    BufferedReader br = new BufferedReader(new FileReader(inFile));
    String line;
    // Read one line at a time so the whole file never sits in memory
    while ((line = br.readLine()) != null) {
        Matcher m = PATTERN_MAIL.matcher(line);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
    br.close();
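For reference, the same line-by-line idea can be sketched as a small self-contained program. This is only an illustration under some assumptions, not a tuned implementation: the class name EmailScan and the method extract are hypothetical, the pattern is a simplified version of the one from the question (not the full RFC 2822 expression), the CASE_INSENSITIVE flag is an addition, and the files are assumed to be UTF-8. It adds two cheap speed-ups: lines without an '@' are skipped before the regex runs, and one Matcher is reused for every line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class EmailScan {
    // Simplified pattern based on the one in the question (not full RFC 2822).
    // CASE_INSENSITIVE is an added assumption so upper-case addresses match too.
    static final Pattern MAIL = Pattern.compile(
            "[a-z0-9_-]+(?:\\.[a-z0-9_-]+)*@[a-z0-9_-]+(?:\\.[a-z0-9_-]+)*\\.[a-z]{2,6}",
            Pattern.CASE_INSENSITIVE);

    // Reads the input line by line, writes each address found on its own
    // line, and returns the number of addresses written.
    static int extract(BufferedReader in, Writer out) throws IOException {
        Matcher m = MAIL.matcher("");      // one Matcher, reused for every line
        int found = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.indexOf('@') < 0) {
                continue;                  // no '@' means no address: skip the regex
            }
            m.reset(line);
            while (m.find()) {
                out.write(m.group());
                out.write('\n');
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: EmailScan <input> <output>");
            return;
        }
        // UTF-8 is an assumption; use the charset your file was written in.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            System.out.println(extract(in, out) + " addresses written");
        }
    }
}
```

Because the scan works on plain lines rather than parsed records, the same method can be pointed at a CSV file or at a text dump. The trade-off is that it no longer knows which column an address came from.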

No, the code will be working with large CSV databases... Here is someone who did a similar thing in C#: retifrav.imtqy.com/blog/2014/08/13/insert-from-huge-csv-into-db. That is what I want for Java: to write code that takes a large CSV file and pulls addresses out of it, and that works with databases of any kind, both structured CSV and database dumps.
