String split optimization

Question

There is a piece of code that loads lines from a file into a collection with unique values.

    const fs = require('fs')
    const iconv = require('iconv-lite') // assumed: the call below matches iconv-lite's decode(buffer, encoding)

    var WORDS = new Set()

    let file = fs.readFileSync('file.txt') // 1,500,000+ lines
    // ~10 ms elapsed
    let text = iconv.decode(file, 'windows-1251')
    // ~100 ms elapsed
    let list = text.split('\n')
    // ~500 ms elapsed
    let i = 0
    while (list[i] != null) { // Faster than "WORDS = new Set(list)"
        WORDS.add(list[i++])
    }
    // ~1100 ms elapsed

The slowest parts of the code are splitting the text into array elements and iterating over them. Is it possible to optimize this?

UPD:
The whole point is to be able to look up a value in the collection quickly:

    WORDS.has('string') // true or false

So if there are other ways to store and look up unique values, I'm open to them.
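One thing that may be worth checking before optimizing, as a minimal sketch: if file.txt uses Windows line endings (an assumption, the question does not say), split('\n') leaves a trailing '\r' on every entry, and WORDS.has('string') then returns false for a word that is in the file. The sample text and the helper name toWordSet are made up for illustration.

```javascript
// Hypothetical helper: builds a Set of lines, tolerating both "\n" and "\r\n".
function toWordSet(text) {
    const words = new Set();
    for (const line of text.split('\n')) {
        // Drop a trailing "\r" left over from a Windows line ending.
        const word = line.endsWith('\r') ? line.slice(0, -1) : line;
        if (word !== '') {
            words.add(word);
        }
    }
    return words;
}

const sample = 'alpha\r\nbeta\r\nalpha\r\n'; // made-up input
const naive = new Set(sample.split('\n'));
const cleaned = toWordSet(sample);

console.log(naive.has('alpha'));   // false: the stored value is "alpha\r"
console.log(cleaned.has('alpha')); // true
console.log(cleaned.size);         // 2
```

The extra endsWith check costs time per line, so it only makes sense if the file really contains '\r' characters.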

Accepted Answer · 2017-01-20T20:07:59

Joking aside, you access the same array element twice per iteration: once in the loop condition and once in add(). So this will definitely be faster:

    let length = list.length;
    for (var i = 0; i < length; i++) {
        WORDS.add(list[i]);
    }

As for the existence check alone, it is hard to say without measuring. Set works with values of any type, which in theory should make it slower than a plain object, the base type for everything in JS, whose keys are strings. I mean something like this:

    let hash = {};
    hash[list[i]] = true;         // add a value as a key
    console.log(hash['string']);  // true if present, undefined otherwise
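One caveat with a plain object used as a hash, shown as a minimal sketch: keys inherited from Object.prototype also look like hits. Creating the object with Object.create(null) is one way around that; the variable names here are illustrative only.

```javascript
const plain = {};
const bare = Object.create(null); // no prototype, so no inherited keys

plain['word'] = true;
bare['word'] = true;

// "constructor" was never added, yet the plain object reports a truthy value
// because the lookup falls through to Object.prototype.
console.log(Boolean(plain['constructor'])); // true
console.log(Boolean(bare['constructor']));  // false
console.log(bare['word'] === true);         // true
```

This matters only if the word list can contain strings such as "constructor" or "toString".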

Adding to a Set:

    const CharFactory = {
        count: 0,
        getChar() { return 'some text' + this.count++; },
        reset() { this.count = 0; }
    };
    const ITERATION = 1000000;
    const set = new Set();

    console.time('add in Set');
    for (let i = 0; i < ITERATION; i++) {
        set.add(CharFactory.getChar());
    }
    console.timeEnd('add in Set');

Adding to an Object:

    const CharFactory = {
        count: 0,
        getChar() { return 'some text' + this.count++; },
        reset() { this.count = 0; }
    };
    const ITERATION = 1000000;
    const hash = {};

    console.time('add in Object');
    for (let i = 0; i < ITERATION; i++) {
        hash[CharFactory.getChar()] = true;
    }
    console.timeEnd('add in Object');

Checking in a Set:

    const CharFactory = {
        count: 0,
        getChar() { return 'some text' + this.count++; },
        reset() { this.count = 0; }
    };
    const ITERATION = 1000000;
    const set = new Set();

    // Fill the Set first, outside the timed section
    for (let i = 0; i < ITERATION; i++) {
        set.add(CharFactory.getChar());
    }
    CharFactory.reset();

    console.time('has in Set');
    for (let i = 0; i < ITERATION; i++) {
        let isCharExistValid = set.has(CharFactory.getChar());
    }
    console.timeEnd('has in Set');

Checking in an Object:

    const CharFactory = {
        count: 0,
        getChar() { return 'some text' + this.count++; },
        reset() { this.count = 0; }
    };
    const ITERATION = 1000000;
    const hash = {};

    // Fill the object first, outside the timed section
    for (let i = 0; i < ITERATION; i++) {
        hash[CharFactory.getChar()] = true;
    }
    CharFactory.reset();

    console.time('has in Object');
    for (let i = 0; i < ITERATION; i++) {
        let isCharExistValid = hash[CharFactory.getChar()];
    }
    console.timeEnd('has in Object');

Although this really should be tested in the Node version you are running, I could not resist and wrote the tests for the browser.

The code did get faster, but unfortunately only by 50 ms.
Besides, I always thought that length is a getter and that reading it recounts the number of elements, which takes time.
As for objects, the code was originally written that way, but paradoxically Set turned out to be one and a half times faster, so I switched to it.
As for length, you are right, and not only that it is a getter; it should be recalculated when the array changes rather than when it is read, but caching it is already a habit of mine.
I tested your examples in Node, and objects are indeed faster.
But with my code the results are different for some reason: rextester.com/GKVS16394
Also, your example that creates the object has no assignments: the keys are not created, only empty cells are.
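On the length question raised in the comments above: as far as the language specification goes, an array's length is an ordinary data property that is kept up to date when the array changes, not a getter that recounts elements on every read. A small check, with illustrative variable names:

```javascript
const list = ['a', 'b', 'c'];
const descriptor = Object.getOwnPropertyDescriptor(list, 'length');

// A data property has "value" and "writable"; an accessor would have "get" and "set".
console.log(descriptor.value);     // 3
console.log('get' in descriptor);  // false

list.push('d');                    // length is updated when the array is modified
console.log(list.length);          // 4
```

Caching length in a local variable is therefore mostly a matter of style; any speed difference would have to be measured in the engine at hand.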

Victor Khovanskiy · Answer 2 · 2017-01-20T20:49:57

For performance, it is worth making only one pass over the string.

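    let WORDS = {};
    let file = fs.readFileSync('file.txt'); // 1,500,000+ lines
    let text = iconv.decode(file, 'windows-1251');
    let len = text.length;
    let offset = 0;
    for (let i = 0; i < len; ++i) {
        if (text[i] == '\n' || text[i] == '\r') {
            // Store the line only if it is not empty
            if (offset != i) {
                WORDS[text.substring(offset, i)] = true;
            }
            // Always step past the line break, so that "\r\n" and blank
            // lines do not leak into the next key
            offset = i + 1;
        }
    }
    // Store the last line if the text does not end with a line break
    if (offset < len) {
        WORDS[text.substring(offset, len)] = true;
    }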

I also thought about that and tried it, but in the end it was twice as slow.
Apparently the native split() is faster than checking for "\n" by hand.
You can also join the discussion about the speed of objects versus Set collections in the comments on the previous answer (rextester.com/GKVS16394).
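A possible middle ground between split() and a character-by-character loop, offered as a sketch whose speed has not been measured: let the native indexOf find each line break and add the lines straight into a Set, so no intermediate array is built. The function name collectLines is hypothetical, and whether this beats split() in a given Node version would have to be tested.

```javascript
// Hypothetical helper: collects unique non-empty lines without building an array.
// Handles "\n" and "\r\n"; a lone "\r" used as a line break is not handled.
function collectLines(text) {
    const words = new Set();
    let offset = 0;
    while (offset < text.length) {
        let end = text.indexOf('\n', offset);
        if (end === -1) {
            end = text.length; // last line without a trailing line break
        }
        // Exclude a "\r" that precedes the line break (Windows line endings).
        const stop = end > offset && text[end - 1] === '\r' ? end - 1 : end;
        if (stop > offset) {
            words.add(text.substring(offset, stop));
        }
        offset = end + 1;
    }
    return words;
}

const words = collectLines('one\r\ntwo\n\nthree'); // made-up input
console.log(words.has('two')); // true
console.log(words.size);       // 3
```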
