How to parse a line with maximum speed?

Question

There is a collection of strings List<String> data = new ArrayList<>(); it contains lines of this kind:

 book="book-1" operation="SELL" price="100.50" volume="81" orderId="1"

I need to get all the values in quotes, I do it like this:

 private String getOrderParam(String line, String param) { String[] orderArr = line.split(param); return orderArr[1].split("\"")[0]; }

That is, I first cut the line in half so that the desired value would be under the first index, and then I cut it again with quotes from the right of the part of the line I need. But all this is very slow. I fill the array String[] result = new String[5]; from the first parameter from the string to the last, each time calling this method.

It is necessary that the method return an array of strings, and at the same time do it as quickly as possible up to 5 seconds, the data collection is large - 3 million objects. I use the data to create an Order object with the fields: book , operation , price , volume , orderId I would be grateful for ideas, or code how to achieve high performance in solving the problem.

It would be nice to explain what is “fast” for you and what is “slow” for you, and give an estimate of the number of lines, since the question is about performance.
Is it processing streaming data or is there an offline rowset as a file or collection?
If there are a lot of requests like "return the X parameter in the Y line", first you need to parse the string into the object, with the fields book , operation , etc., and think about what data structure they should be folded.
Data approximately 3 million lines, quickly - average time 5 seconds

Accepted Answer · 2017-03-09T07:01:21

I think it is worth giving an example of a finite state machine (based on the answer Nofate)

We assume that pairs are separated by a single space, inside the values there are no quotes and so on. Those. We consider the simplest option without surprises.

  String src = "book=\"book-1\" operation=\"SELL\" price=\"100.50\" volume=\"81\" orderId=\"1\""; src += " "; char[] charArr = src.toCharArray(); boolean isKey = true; String key = ""; String value = ""; for (char c : charArr) { switch (c) { case '"': break; case '=': isKey = false; break; case ' ': isKey = true; System.out.println(key + ":" + value); //Выводим пару key = ""; value = ""; break; default: if (isKey) key += c; else value += c; } }

At the request of workers, ousted version. Execution speed on an office PC for 3,000,000 lines +/- 3 seconds:

 public class Main { public static void main(String[] args) { Map<String, String> out = new HashMap<>(); String[] input = new String[3000000]; for (int i = 0; i < 3000000; i++) { input[i] = "book=\"book-" + Math.random() + "\" operation=\"SELL\" price=\"" + Math.random() + "\" volume=\"" + Math.random() + "\" orderId=\"" + Math.random() + "\""; } long startTime = System.currentTimeMillis(); for (String src : input) { char[] buffer = new char[30]; char[] charArr = src.toCharArray(); String key = ""; int i = 0; for (char c : charArr) { if (c == '"') continue; if (c == '=') { key = String.valueOf(buffer).trim(); buffer = new char[30]; i = 0; continue; } if (c == ' ') { out.put(key, (new String(buffer)).trim()); new String(buffer); buffer = new char[30]; i = 0; continue; } buffer[i++] = c; } out.put(key, (new String(buffer)).trim()); } long stopTime = System.currentTimeMillis(); long elapsedTime = stopTime - startTime; System.out.println(elapsedTime); System.out.println(out.size()); } }

So, string concatenation is the easiest way to "kill" performance.
@Regent is a very simplified example, for a general understanding of how it works.
This, in my opinion, is enough for thinking about optimization.
I did not take measurements, but there is a suspicion that even with short lines there will be problems with such a number of them.

Answer 2 · 2017-03-08T23:16:56

To parse a line, it is enough to go through it from left to right with a state machine.

We start with the state "Property".
We collect readable characters in the temporary buffer.
By the sign = , if the state is now "Property", we take the property name from the buffer to a variable, go to the "Value" state, clear the buffer.
In the state of "Value" collect readable characters in the buffer.
On the closing quotation mark, we take the value into the variable, clear the buffer, save the property-value pair in the object field or HashMap, go to the "Property" state.
If the input is unexpected for the current state, we throw an exception.

Ps. If inside the values may be quotes, we think of screening.

On the other hand, while nothing is known about the requirements for speed and the nature of the data, I would just do this:

 Map<String, String> bookInfo = new HashMap(); String src = "book=\"book-1\" operation=\"SELL\" price=\"100.50\" volume=\"81\" orderId=\"1\""; for (String chunk : src.split("\" \"")) { String[] pair = chunk.split("="); bookInfo.put(pair[0], pair[1]); }

It is known that there are a lot of data in 3 million lines in the collection.
Clarified the question, I apologize for not accurate wording.

Answer 3 · 2017-03-10T15:37:38

One Turn Solution

 package me.etki.perfground; import java.util.LinkedList; import java.util.List; /** * @author Etki {@literal <etki@etki.name>} */ public class StringParser { enum State { EXTERIOR { @Override public State next(char c) { return c == ' ' ? EXTERIOR : KEY; } @Override public boolean expects(char c) { return c != '"' && c != '='; } }, KEY { @Override public State next(char c) { return c == '=' ? EQUALITY_SIGN : KEY; } @Override public boolean expects(char c) { return c != '"'; } }, EQUALITY_SIGN { @Override public State next(char c) { return OPENING_QUOTE; } @Override public boolean expects(char c) { return c == '"'; } }, OPENING_QUOTE { @Override public State next(char c) { return c == '"' ? CLOSING_QUOTE : VALUE; } }, VALUE { @Override public State next(char c) { return c == '"' ? CLOSING_QUOTE : VALUE; } }, CLOSING_QUOTE { @Override public State next(char c) { return c == ' ' ? EXTERIOR : KEY; } @Override public boolean expects(char c) { return c != '=' && c != '"'; } }; public State next(char c) { throw new UnsupportedOperationException("Somebody has forgotten to implement something"); } public boolean expects(char c) { return true; } } public static class Tuple { private final CharSequence key; private final CharSequence value; public Tuple(CharSequence key, CharSequence value) { this.key = key; this.value = value; } public CharSequence getKey() { return key; } public CharSequence getValue() { return value; } } public static class Region implements CharSequence { private final CharSequence parent; private final int start; private final int end; public Region(CharSequence parent, int start, int end) { this.parent = parent; this.start = start; this.end = end; } @Override public int length() { return end - start; } @Override public char charAt(int index) { return parent.charAt(start + index); } @Override public CharSequence subSequence(int start, int end) { if (end > length()) { throw new IndexOutOfBoundsException(); } return parent.subSequence(this.start + start, this.start + end); } @Override public String toString() { return parent.subSequence(start, end).toString(); } } public static List<Tuple> parse(CharSequence input) { State state = State.EXTERIOR; List<Tuple> result = new LinkedList<>(); int mark = -1; CharSequence key = null; for (int i = 0; i < input.length(); i++) { char c = input.charAt(i); if (!state.expects(c)) { String s = "Provided string contains char " + c + " right after " + state + " state, and that's " + "prohibited"; throw new IllegalArgumentException(s); } State next = state.next(c); if (next.equals(state)) { continue; } if (next.equals(State.KEY) || next.equals(State.VALUE)) { mark = i; } if (state.equals(State.KEY)) { key = new Region(input, mark, i); } else if (state.equals(State.VALUE)) { result.add(new Tuple(key, new Region(input, mark, i))); } state = next; } return result; } }

Which does not work safely

 // dummy - пустой метод, который берет аргумент и передает его в blackhole # Run complete. Total time: 00:01:09 Benchmark (input) Mode Samples Score Score error Units mepStringParserBenchmark.dummy book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 316041178.489 NaN ops/s mepStringParserBenchmark.parse book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 1084975.810 NaN ops/s mepStringParserBenchmark.warmThenParse book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 1259713.331 NaN ops/s

If I find the time to find out the reason (why my performance is at the same level as in other examples without creating new lines and a single pass), then I will update the answer.

update

It seems that the distribution of methods according to different enum is not much in the liking of the optimizer (and it can be understood - it cannot predict which method will be called). Just a little upsurge:

 enum State { ... public static State next(char c, State current) { switch (current) { case EXTERIOR: return c == ' ' ? EXTERIOR : KEY; case KEY: return c == '=' ? EQUALITY_SIGN : KEY; case EQUALITY_SIGN: return OPENING_QUOTE; case OPENING_QUOTE: return c == '"' ? CLOSING_QUOTE : VALUE; case VALUE: return c == '"' ? CLOSING_QUOTE : VALUE; case CLOSING_QUOTE: return c == ' ' ? EXTERIOR : KEY; default: throw new IllegalStateException(); } } public static boolean expects(char c, State current) { switch (current) { case EXTERIOR: return c != '"' && c != '='; case KEY: return c != '"'; case EQUALITY_SIGN: return c == '"'; case OPENING_QUOTE: return true; case VALUE: return true; case CLOSING_QUOTE: return c != '=' && c != '"'; default: throw new IllegalStateException(); } } }

almost doubles the speed of work:

 Benchmark (input) Mode Samples Score Score error Units mepStringParserBenchmark.dummy book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 319058428.945 NaN ops/s mepStringParserBenchmark.parse book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 1176753.244 NaN ops/s mepStringParserBenchmark.warmThenAltParse book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 2057734.054 NaN ops/s mepStringParserBenchmark.warmThenParse book="book-1" operation="SELL" price="100.50" volume="81" orderId="1" thrpt 1 1112356.422 NaN ops/s

Answer 4 · 2017-03-10T14:04:33

This option is also possible, the speed is not much different.

  Pattern p = Pattern.compile("\"(.*?)\""); for (String src : input) { Matcher m = p.matcher(src); int i = 1; while (m.find()) { switch (i) { case 1: out.put("book", m.group()); case 2: out.put("operation", m.group()); case 3: out.put("price", m.group()); case 4: out.put("volume", m.group()); case 5: out.put("orderId", m.group()); } i++; } }

ps: crept error int i = 0; but you need = 1 (or case to do from scratch)

How to parse a line with maximum speed?

4 answers 4

More articles: