Java source code consists of whitespace, identifiers, literals, comments, operators, separators, and keywords.

What does the compiler do with each of these elements? Is anything filtered out or transformed along the way?

  • Have you read the Dragon Book? - VladD
  • @VladD: No, I haven't read it, thanks, I'll definitely look it up - TimurVI

1 answer

The usual practice when writing a compiler is to divide it into phases. Traditionally, the first phase is lexical analysis, which splits the source text into lexemes. That is, the code is read as a sequence of characters and turned into a sequence of tokens.

A token consists of a token type and a value (packed together in one class).
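As a rough illustration of "type plus value in one class", a token could be modeled like this. The names `TokenType`, `Token`, and `TokenDemo` are hypothetical and chosen only for this sketch (real compilers use far more token types), and the `record` syntax assumes Java 16 or later.

```java
import java.util.Objects;

// Hypothetical token categories, chosen for illustration only.
enum TokenType { KEYWORD, IDENTIFIER, STRING_LITERAL, SEPARATOR }

// A token pairs its type with the text it carries.
record Token(TokenType type, String value) {
    Token {
        // Reject incomplete tokens early.
        Objects.requireNonNull(type);
        Objects.requireNonNull(value);
    }

    @Override
    public String toString() {
        return "[" + type + " \"" + value + "\"]";
    }
}

class TokenDemo {
    public static void main(String[] args) {
        Token t = new Token(TokenType.IDENTIFIER, "Example");
        System.out.println(t); // prints [IDENTIFIER "Example"]
    }
}
```

A record is used here only because it gives `equals` and `hashCode` for free; an ordinary final class with two fields would serve equally well.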

During this phase, whitespace (other than inside character and string literals) is usually discarded. Identifiers become tokens of type "Identifier" whose value is the string with the identifier's name. Literals become tokens as well. Comments usually do not survive lexical analysis and are simply discarded. Separators such as parentheses and punctuation marks each get their own token type, and keywords are likewise usually given token types of their own.
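The rules just described can be sketched as a very small hand-written lexer. This is not how javac is implemented; it is a minimal sketch under stated assumptions: the class `MiniLexer`, its nested `Kind` and `Token` types, and the tiny keyword and separator sets are all invented for illustration, string escapes, numeric literals, and operators are not handled, and Java 16 or later is assumed for the nested record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class MiniLexer {
    enum Kind { KEYWORD, IDENTIFIER, STRING_LITERAL, SEPARATOR }

    record Token(Kind kind, String text) {}

    // A deliberately tiny subset of Java's keywords and separators.
    private static final Set<String> KEYWORDS =
            Set.of("public", "class", "static", "void");
    private static final String SEPARATORS = "{}()[];.,";

    static List<Token> tokenize(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                 // whitespace is dropped
            } else if (src.startsWith("//", i)) {
                int end = src.indexOf('\n', i);      // line comment: skip to end of line
                i = (end < 0) ? src.length() : end;
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);  // block comment: skip past "*/"
                if (end < 0) throw new IllegalArgumentException("unterminated comment");
                i = end + 2;
            } else if (c == '"') {
                int end = src.indexOf('"', i + 1);   // no escape handling in this sketch
                if (end < 0) throw new IllegalArgumentException("unterminated string");
                tokens.add(new Token(Kind.STRING_LITERAL, src.substring(i + 1, end)));
                i = end + 1;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                String word = src.substring(start, i);
                // Keywords look like identifiers; a lookup decides which one it is.
                tokens.add(new Token(KEYWORDS.contains(word) ? Kind.KEYWORD : Kind.IDENTIFIER, word));
            } else if (SEPARATORS.indexOf(c) >= 0) {
                tokens.add(new Token(Kind.SEPARATOR, String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String src = "class A { // note\n void f() { g(/* hidden */\"hi\"); } }";
        tokenize(src).forEach(System.out::println);
    }
}
```

Note that neither comment in the sample input yields a token, while the string literal next to the block comment does.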

Example:

Source text

public class Example { // example
    public static void main(String[] args) {
        System.out.println(/* this text will be printed */"hello world");
    }
}

produces the following sequence of lexical tokens:

 [public-keyword] [class-keyword] [ident "Example"] [separator-left-brace]
 [public-keyword] [static-keyword] [void-keyword] [ident "main"] [separator-left-paren] [ident "String"] [separator-left-brack] [separator-right-brack] [ident "args"] [separator-right-paren] [separator-left-brace]
 [ident "System"] [separator-dot] [ident "out"] [separator-dot] [ident "println"] [separator-left-paren] [string-literal "hello world"] [separator-right-paren] [separator-semicolon]
 [separator-right-brace]
 [separator-right-brace]

The later compilation phases parse this sequence into definitions of classes, methods, and operations, check that names match, bind names to the entities they denote, check that the program is meaningful, optimize, and compile to bytecode.

Lexical analysis is the simplest compilation phase.


Yes, it is theoretically possible (and sometimes necessary) to build compilers in which lexical analysis is effectively merged with the later compilation phases. In principle, nothing forces compiler authors to make lexical analysis a separate phase, but doing so is still common practice.

  • @VladD: very interesting answer, thank you - TimurVI
  • @TimurVI: You're welcome! Glad if it proves useful. - VladD
  • @TimurVI: Here is a simple example of a lexical analyzer (the tokenizer class). - VladD
  • @TimurVI, and here they go straight to it: Let's create a compiler! - avp