How is text wrapping with hyphenation implemented in e-readers? How can I dynamically create a WebView and display text in it, moving on to the next page when the current one is filled? How is this implemented in Foliant or in CoolReader? Please help me figure it out.

UPD:
Here is a slightly modified question. I have screen swiping implemented as in http://habrahabr.ru/post/131889/. How can I cut the text so that each new page shows the next unread portion, that is, so that whatever does not fit on the current screen is carried over to the next one?

UPD2:
Let's rephrase the question. How can I cut the text and calculate how much of it will fit on the screen? Even just that... Help, please.

  • Well, I need to read an HTML file that contains everything: headings, bold text, and so on - dajver
  • Write your own browser? - KoVadim
  • No, writing my own browser would be overkill... I just need to figure out how to carry the text over when the screen ends, and then everything will be fine... - dajver
  • Read how any fb2 reader does it and think. In my answer I gave a link to one of them; others can be found online. - KoVadim

3 answers

The task you are talking about is quite complicated, so let's divide it into parts.

Part 1. Finding hyphenation points.

To begin with, there are dictionaries that list words together with all their possible hyphenation points. Compiling such dictionaries is, of course, laborious. In addition, dictionaries usually also store tables of affixes and the rules for attaching them to roots, because storing all 6 case forms of every word would be burdensome.

Then there is the classic hyphenation algorithm by F. Liang, which the great Donald Knuth used in his TeX system. It is based on patterns, each of which is assigned a weight, and it is designed so that the dictionary of exceptions stays minimal (for example, the exception table for English contains just 14 entries). You can find pattern tables for many languages, along with their exception tables, in any TeX distribution (TeX is packaged in many Linux distributions).
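To make the pattern idea concrete, here is a minimal sketch of Liang-style matching in Java. The class name `LiangHyphenator` and its methods are made up for illustration, and the usage example feeds it only a handful of patterns chosen for one word; a real hyphenator would load a full pattern table and an exception list from a TeX distribution. The minimum of two letters on each side of a break is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of Liang-style pattern hyphenation, not production code.
class LiangHyphenator {
    // Maps the letters of a pattern to the weights of the gaps around those letters.
    private final Map<String, int[]> patterns = new HashMap<>();
    private int maxLen = 0;

    LiangHyphenator(String... rawPatterns) {
        for (String raw : rawPatterns) {
            addPattern(raw);
        }
    }

    // "hy3ph" becomes letters "hyph" with weights [0, 0, 3, 0, 0].
    private void addPattern(String raw) {
        String letters = raw.replaceAll("[0-9]", "");
        int[] weights = new int[letters.length() + 1];
        int pos = 0;
        for (char c : raw.toCharArray()) {
            if (Character.isDigit(c)) {
                weights[pos] = c - '0';
            } else {
                pos++;
            }
        }
        patterns.put(letters, weights);
        maxLen = Math.max(maxLen, letters.length());
    }

    // Returns the word with '-' inserted at every allowed break.
    String hyphenate(String word) {
        String s = "." + word.toLowerCase() + "."; // dots mark the word boundaries
        int[] points = new int[s.length() + 1];    // points[k] is the gap before s[k]
        for (int start = 0; start < s.length(); start++) {
            int limit = Math.min(s.length(), start + maxLen);
            for (int end = start + 1; end <= limit; end++) {
                int[] w = patterns.get(s.substring(start, end));
                if (w == null) continue;
                // Where several patterns compete for a gap, the highest weight wins.
                for (int i = 0; i < w.length; i++) {
                    points[start + i] = Math.max(points[start + i], w[i]);
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            // Assumption: keep at least two letters on each side of a break.
            boolean allowed = i >= 2 && i <= word.length() - 2;
            // An odd weight permits a break, an even weight forbids it.
            if (allowed && points[i + 1] % 2 == 1) out.append('-');
            out.append(word.charAt(i));
        }
        return out.toString();
    }
}
```

For example, `new LiangHyphenator("hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n").hyphenate("hyphenation")` returns `hy-phen-ation`.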

The same algorithm, with minor modifications, is used in many popular open-source spell-checking systems. For example, hunspell, which is used in OS X, OpenOffice, Firefox, Chrome, Opera, Eclipse, and a bunch of other programs, uses Liang's algorithm.

There are also proprietary, closed hyphenation algorithms; I can't say anything about them.

How exactly hyphenation is done in each specific reader, only experts can say. However, the list of programs that use TeX-style hyphenation is impressive, isn't it?

Part 2. Breaking the text into lines.

Here, too, there is a simple, naive approach: knowing the possible hyphenation points, we "stuff" pieces into the current line for as long as they fit, then move on to the next line, and so on to the end. My impression is that Microsoft Word still works this way.
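A minimal sketch of this greedy approach in Java, under simplifying assumptions: width is measured in characters rather than pixels (on Android one would probably measure with something like `Paint.measureText`), breaks happen only at spaces, and a word longer than a whole line is simply cut into line-sized pieces. The class name `GreedyWrapper` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: greedy line breaking with width counted in characters.
class GreedyWrapper {
    // Fills each line with as many words as fit into maxWidth characters.
    static List<String> wrap(String text, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Special case: a word longer than a whole line is cut into pieces.
            while (word.length() > maxWidth) {
                if (line.length() > 0) {
                    lines.add(line.toString());
                    line.setLength(0);
                }
                lines.add(word.substring(0, maxWidth));
                word = word.substring(maxWidth);
            }
            if (word.isEmpty()) continue;
            int needed = line.length() == 0
                    ? word.length()
                    : line.length() + 1 + word.length();
            if (needed > maxWidth) {
                // The word does not fit: finish the current line and start a new one.
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            line.append(word);
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }
}
```

For example, `GreedyWrapper.wrap("the quick brown fox jumps", 10)` yields the lines `the quick`, `brown fox`, and `jumps`.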

The algorithm used in TeX is much more elegant: its goal is to maximize the total "quality" of all lines in the paragraph. For each candidate split into lines, the algorithm estimates how much each line would have to be stretched or compressed. Any deviation from the “ideal” case reduces the quality of the line, and so do other factors such as hyphenation. The algorithm picks the split in quadratic time using a variant of dynamic programming. The resulting layout is noticeably better than that of the naive algorithm, and the advantage grows with paragraph length; however, the algorithm itself is not very fast.
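The dynamic-programming idea can be sketched in Java with a much simpler cost function than TeX's. This is not the real TeX algorithm: it assumes width measured in characters, no hyphenation, and a penalty equal to the square of the unused space on every line except the last. The class name `OptimalWrapper` is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: line breaking that minimizes total squared slack.
class OptimalWrapper {
    static List<String> wrap(String text, int maxWidth) {
        List<String> lines = new ArrayList<>();
        if (text.trim().isEmpty()) return lines;
        String[] words = text.trim().split("\\s+");
        int n = words.length;
        long[] best = new long[n + 1]; // best[i]: minimal penalty for words i..n-1
        int[] next = new int[n + 1];   // next[i]: first word of the following line
        Arrays.fill(best, Long.MAX_VALUE);
        best[n] = 0;
        // Quadratic in the worst case: every start i is tried with every end j.
        for (int i = n - 1; i >= 0; i--) {
            int width = -1; // cancels the space counted before the first word
            for (int j = i; j < n; j++) {
                width += 1 + words[j].length();
                // An overlong single word is still allowed a line of its own.
                if (width > maxWidth && j > i) break;
                long slack = Math.max(0, maxWidth - width);
                // The last line is not penalized for being short.
                long penalty = (j == n - 1) ? 0 : slack * slack;
                if (penalty + best[j + 1] < best[i]) {
                    best[i] = penalty + best[j + 1];
                    next[i] = j + 1;
                }
            }
        }
        // Follow the recorded break positions from the first word onward.
        for (int i = 0; i < n; i = next[i]) {
            lines.add(String.join(" ", Arrays.copyOfRange(words, i, next[i])));
        }
        return lines;
    }
}
```

With `OptimalWrapper.wrap("aaa bb cc ddddd", 6)` the result is `aaa`, `bb cc`, `ddddd`, whereas filling each line greedily would give `aaa bb`, `cc`, `ddddd` with a very ragged second line.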

Again, which algorithm a particular reader uses, only experts can say.

Part 3. What to do?

First, talk to your superiors and find out what they want. Move pagination into a separate module and, to begin with, spend half a day with coffee and buns sketching out a "greedy" line breaker without hyphenation. (Mind the special case where a single word is longer than the entire line.)
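As a rough sketch of what a separate pagination module might look like in Java: once the text has been broken into lines, pages are just consecutive groups of lines. The class name `Paginator`, its methods, and the assumption that every line has the same pixel height are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: splitting already wrapped lines into pages.
class Paginator {
    // Assumption: all lines have the same height; at least one line per page.
    static int linesPerPage(int viewHeightPx, int lineHeightPx) {
        return Math.max(1, viewHeightPx / lineHeightPx);
    }

    // Groups the lines into pages of at most linesPerPage lines each.
    static List<List<String>> paginate(List<String> lines, int linesPerPage) {
        if (linesPerPage <= 0) {
            throw new IllegalArgumentException("linesPerPage must be positive");
        }
        List<List<String>> pages = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerPage) {
            int end = Math.min(lines.size(), start + linesPerPage);
            pages.add(new ArrayList<>(lines.subList(start, end)));
        }
        return pages;
    }
}
```

Keeping this step apart from line breaking means the line breaker can later be swapped for a better one without touching the paging code.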

Second, for hyphenation, check whether hunspell's license is compatible with yours, and hook it in. This does not take long and immediately gives hyphenation no different from OpenOffice's, which is already not bad. Budget a couple of days for this.

Third, try either implementing TeX's line-breaking algorithm yourself or finding a ready-made implementation (even one under an incompatible license). Test its speed, which may be critical for your program. If the speed is acceptable, look for an implementation whose license is compatible with your program (or write it yourself; the information can be found on Google, albeit in English). If not, perhaps this layout quality at the cost of speed is not what you need?

  • Thank you so much for the answer, it's the most detailed of all :) - dajver
  • @dajver: You're welcome! This topic was once very interesting to me. - VladD
  • Well, now your interest has passed on to me (: - dajver

Come on, a dictionary table is too big for a task like this.

It seems to me there should be a relatively simple algorithm. A little googling turns up such an option written in Pascal, but any programmer of average skill can quickly port it to Java.

  • That will be difficult without knowing Pascal =( I last used it 6 years ago... And what I need is to carry the text over to the next activity. - dajver

It's very simple: there is a dictionary file that lists all the words together with the places where they can be hyphenated. Something like this:

чи-тал-ках реа-ли-зо-ван пе-ре-нос текс-та 
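A minimal sketch of reading such a dictionary in Java, assuming the plain format shown above (whitespace-separated words with '-' at each allowed break), which is not necessarily how a real .pdb hyphenation file is laid out. The class name `HyphenDictionary` and its methods are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a word list with '-' marking allowed breaks.
class HyphenDictionary {
    // Maps a word to the character offsets at which it may be broken.
    private final Map<String, List<Integer>> breaks = new HashMap<>();

    // Loads entries such as "hy-phen-ation", separated by whitespace.
    void load(String contents) {
        for (String entry : contents.trim().split("\\s+")) {
            StringBuilder word = new StringBuilder();
            List<Integer> positions = new ArrayList<>();
            for (char c : entry.toCharArray()) {
                if (c == '-') {
                    positions.add(word.length()); // break before the next letter
                } else {
                    word.append(c);
                }
            }
            if (word.length() > 0) breaks.put(word.toString(), positions);
        }
    }

    // Returns the allowed break offsets, or an empty list for unknown words.
    List<Integer> breakPoints(String word) {
        return breaks.getOrDefault(word, new ArrayList<>());
    }
}
```

After `load("hy-phen-ation")`, the call `breakPoints("hyphenation")` returns `[2, 6]`, so a line breaker can try to end the line after "hy" or after "hyphen".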

Using a webview for this is a waste of resources.

The hyphenation file is easy to find by googling 'Russian_EnUS_hyphen_(Alan).pdb'. To see how it is read and used, look at the CoolReader code: hyphman.h and hyphman.cpp.