Chapter 4. Lexical Conventions

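This chapter discusses lexical conventions for these topics:

  • Tokens

  • Comments

  • Identifiers

  • Constants

  • Multiple Lines Per Physical Line

  • Section and Location Counters

  • Statements

  • Expressions

  • Relocations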

This chapter uses the following notation to describe syntax:

Tokens

The assembler has these tokens:

  • Identifiers

  • Constants

  • Operators

The assembler lets you put blank characters and tab characters anywhere between tokens; however, it does not allow these characters within tokens (except for character constants). A blank or tab must separate adjacent identifiers or constants that are not otherwise separated.

Comments

The pound sign character (#) introduces a comment. Comments that start with a # extend through the end of the line on which they appear. You can also use C-language notation /*...*/ to delimit comments.

The assembler uses cpp (the C language preprocessor) to preprocess assembler code. Because cpp interprets a # in the first column as the start of a preprocessor directive, do not start a # comment in the first column.

Identifiers

An identifier consists of a case-sensitive sequence of alphanumeric characters, including these:

  • . (period)

  • _ (underscore)

  • $ (dollar sign)

The first character of an identifier cannot be numeric.
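The identifier rules above can be checked with a short Python sketch. The name `is_identifier` is hypothetical, the pattern assumes ASCII letters and digits, and it makes no attempt to tell identifiers apart from register names such as $4.

```python
import re

# Letters, digits, period, underscore, and dollar sign are allowed;
# the first character must not be a digit. ASCII-only is an assumption.
IDENTIFIER = re.compile(r"[A-Za-z._$][A-Za-z0-9._$]*")

def is_identifier(text):
    """Return True if text is well formed under the rules above."""
    return IDENTIFIER.fullmatch(text) is not None
```

Because identifiers are case-sensitive, `Main` and `main` both pass the check but name two different symbols.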

If an identifier is not defined to the assembler (only referenced), the assembler assumes that the identifier is an external symbol. The assembler treats the identifier like a .globl pseudo-operation (see Chapter 8, “Pseudo Op-Codes (Directives)”). If the identifier is defined to the assembler and the identifier has not been specified as global, the assembler assumes that the identifier is a local symbol.

Constants

The assembler has these constants:

  • Scalar constants

  • Floating-point constants

  • String constants

Scalar Constants

The assembler interprets all scalar constants as twos-complement numbers. A scalar constant is 32 bits in 32-bit mode and 64 bits in 64-bit mode. Scalar constants are written with the characters 0123456789abcdefABCDEF. You can use a ULL or LL suffix to identify a 64-bit constant.

Scalar constants can be one of the following:

  • Decimal constants, which consist of a sequence of decimal digits without a leading zero.

  • Hexadecimal constants, which consist of the characters 0x (or 0X) followed by a sequence of hexadecimal digits (0..9, a..f, A..F).

  • Octal constants, which consist of a leading zero followed by a sequence of digits in the range 0..7.
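As a rough illustration of the three forms above, the following Python sketch picks the base from the prefix and then wraps the value to a signed twos-complement number. The function name `parse_scalar` and the `bits` parameter are hypothetical, and suffix handling is deliberately left out.

```python
def parse_scalar(text, bits=32):
    """Parse a decimal, hexadecimal, or octal constant; return a signed value."""
    if text[:2] in ("0x", "0X"):
        value = int(text[2:], 16)      # 0x or 0X prefix: hexadecimal
    elif len(text) > 1 and text[0] == "0":
        value = int(text[1:], 8)       # leading zero: octal, digits 0..7 only
    else:
        value = int(text, 10)          # no leading zero: decimal
    # Wrap to the word size and reinterpret as twos-complement.
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value
```

Under these assumptions, `parse_scalar("0xffffffff")` yields -1 in 32-bit mode, while the same text with `bits=64` yields 4294967295.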

Floating-Point Constants

Floating-point constants can appear only in .float and .double pseudo-operations (directives) (see Chapter 8, “Pseudo Op-Codes (Directives)”), and in the floating-point Load Immediate instructions (see Chapter 6, “Coprocessor Instruction Set”). Floating-point constants have this format:

±d1[.d2][e|E±d3]

where:

  • d1 is written as a decimal integer and denotes the integral part of the floating-point value.

  • d2 is written as a decimal integer and denotes the fractional part of the floating-point value.

  • d3 is written as a decimal integer and denotes a power of 10.

  • The "±" symbol (a + or - sign) is optional.

For example:

21.73E-3

represents the number 0.02173.
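The format above maps onto a regular expression fairly directly. The Python sketch below is only an approximation of what the assembler accepts: it treats d1 as mandatory, as the format is written, and the name `parse_float_constant` is hypothetical.

```python
import re

# Optional sign, integral part d1, optional ".d2", optional exponent with its own sign.
FLOAT_CONSTANT = re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")

def parse_float_constant(text):
    """Validate text against the format above and return its value."""
    if FLOAT_CONSTANT.fullmatch(text) is None:
        raise ValueError("not a floating-point constant: " + text)
    return float(text)
```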

Optionally, .float and .double directives may use hexadecimal floating-point constants instead of decimal ones. A hexadecimal floating-point constant consists of:

<+ or -> 0x <1 or 0 or nothing> . <hex digits> H 0x <hex digits>

The assembler places the first set of hex digits (excluding the 0 or 1 preceding the point) in the mantissa field of the floating-point format without attempting to normalize it. It stores the second set of hex digits into the exponent field without biasing them. It checks that the exponent is appropriate if the mantissa appears to be denormalized. Hexadecimal floating-point constants are useful for generating IEEE special values and for writing hardware diagnostics.

For example, either of the following generates a single-precision "1.0":

.float 1.0e+0
.float 0x1.0h0x7f
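To see why 0x1.0h0x7f is 1.0, the two digit groups can be packed by hand into the IEEE single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits). In this Python sketch the function names are hypothetical, and left-aligning the mantissa digits in the 23-bit field is an assumption about how the assembler positions them.

```python
import struct

def hex_float_single(sign, mantissa_hex, exponent):
    """Pack sign, raw mantissa digits, and raw exponent into 32 bits."""
    digits = mantissa_hex or "0"
    value, width = int(digits, 16), 4 * len(digits)
    # Assumption: the digits are left-aligned in the 23-bit mantissa field.
    if width >= 23:
        mantissa = value >> (width - 23)
    else:
        mantissa = value << (23 - width)
    # The exponent is stored as given, with no bias added.
    return (sign << 31) | ((exponent & 0xFF) << 23) | mantissa

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

With mantissa digits "0" and exponent 0x7f the pattern is 0x3f800000, which is 1.0. An exponent of 0xff with a zero mantissa gives the pattern for infinity, one of the special values this notation makes easy to write.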

String Constants

String constants begin and end with double quotation marks (").

The assembler observes C language backslash conventions. For octal notation, the backslash conventions require three characters when the next character can be confused with the octal number. For hexadecimal notation, the backslash conventions require two characters when the next character can be confused with the hexadecimal number (that is, use a 0 for the first character of a single character hex number).

The assembler follows the backslash conventions shown in Table 4-1.

Table 4-1. Backslash Conventions

Convention   Meaning
\a           Alert (0x07)
\b           Backspace (0x08)
\f           Form feed (0x0c)
\n           Newline (0x0a)
\r           Carriage return (0x0d)
\t           Horizontal tab (0x09)
\v           Vertical tab (0x0b)
\\           Backslash (0x5c)
\"           Double quotation mark (0x22)
\'           Single quotation mark (0x27)
\000         Character whose octal value is 000
\Xnn         Character whose hexadecimal value is nn
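The padding rule for octal and hexadecimal escapes is easiest to see in a small decoder. This Python sketch is a minimal model of the conventions in Table 4-1, not the assembler's implementation. The name `decode_escapes` is hypothetical, the input is assumed to be ASCII, and accepting a lowercase \x alongside \X is an assumption borrowed from C.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D,
          "t": 0x09, "v": 0x0B, "\\": 0x5C, '"': 0x22, "'": 0x27}
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"

def decode_escapes(body):
    """Decode the text between the quotation marks into bytes."""
    out, i = bytearray(), 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ord(ch))
            continue
        if i >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i]
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
            i += 1
        elif esc in OCTAL:
            j = i                      # consume at most three octal digits
            while j < len(body) and j - i < 3 and body[j] in OCTAL:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif esc in "xX":
            j = i + 1                  # consume at most two hex digits
            while j < len(body) and j - (i + 1) < 2 and body[j] in HEX:
                j += 1
            if j == i + 1:
                raise ValueError("missing hex digits")
            out.append(int(body[i + 1:j], 16))
            i = j
        else:
            raise ValueError("unknown escape: \\" + esc)
    return bytes(out)
```

In this model `\0072` decodes to the alert character followed by the digit 2, whereas the unpadded `\72` absorbs the 2 and decodes to a single colon (octal 72). That is the confusion the padding rule is meant to prevent.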


Multiple Lines Per Physical Line

You can include multiple statements on the same line by separating the statements with semicolons. The assembler does not recognize semicolons as separators when they follow comment symbols (# or /*).
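A minimal Python sketch of this rule follows. It splits one physical line into statements, stops at a # comment, and skips a /*...*/ comment that closes on the same line. The name `split_statements` is hypothetical, and treating semicolons inside string constants as ordinary characters is an assumption.

```python
def split_statements(line):
    """Split a physical line into statements at top-level semicolons."""
    statements, current = [], []
    i, in_string = 0, False
    while i < len(line):
        ch = line[i]
        if in_string:
            current.append(ch)
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])   # keep the escaped character
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            current.append(ch)
        elif ch == "#":
            break                             # comment runs to end of line
        elif line.startswith("/*", i):
            end = line.find("*/", i + 2)
            if end < 0:
                break                         # comment continues past this line
            i = end + 1                       # skip to the closing "*/"
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    statements.append("".join(current).strip())
    return statements
```

For example, `split_statements("nop # a ; b")` returns a single statement, because the semicolon follows the comment symbol.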

Section and Location Counters

Assembled code and data fall in one of the sections shown in Figure 4-1.

Figure 4-1. Section and Location Counters

The assembler always generates the text section before other sections. Additions to the text section happen in four-byte units. Each section has an implicit location counter, which begins at zero and increments by one for each byte assembled in the section.

The bss section holds zero-initialized data. If a .lcomm pseudo-op defines a variable (see Chapter 8, “Pseudo Op-Codes (Directives)”), the assembler assigns that variable to the bss (block started by storage) section or to the sbss (short block started by storage) section depending on the variable's size. The default variable size for sbss is 8 or fewer bytes.

The -G command line option of each compiler (C, Pascal, Fortran 77, or the assembler) can raise the sbss size threshold to cover all but extremely large data items. The link editor issues an error message when the -G value gets too large. If a -G value is not specified to the compiler, 8 is the default. Items smaller than, or equal to, the specified size go in sbss. Items greater than the specified size go in bss.
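The size rule in this paragraph reduces to a single comparison. The Python sketch below follows the "smaller than, or equal to" wording used here, and the names `bss_section` and `g_value` are hypothetical.

```python
def bss_section(size, g_value=8):
    """Pick the section for a zero-initialized item of the given size in bytes."""
    # Items at or below the -G threshold go in sbss; larger items go in bss.
    return "sbss" if size <= g_value else "bss"
```

With the default threshold, an 8-byte item lands in sbss and a 12-byte item in bss. Raising the threshold to 16 moves the 12-byte item into sbss as well.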

Because you can address items much more quickly through $gp than through a more general method, put as many items as possible in sdata or sbss. The size of sdata and sbss combined must not exceed 64 KB.

Statements

Each statement consists of an optional label, an operation code, and the operand(s). The system allows these statements:

  • Null statements

  • Keyword statements

Label Definitions

A label definition consists of an identifier followed by a colon. Label definitions assign the current value and type of the location counter to the name. An error results when the name is already defined, the assigned value changes the label definition, or both conditions exist.

Label definitions always end with a colon. You can put a label definition on a line by itself.

A generated label is a single numeric value (1...255). To reference a generated label, put an f (forward) or a b (backward) immediately after the digit. The reference tells the assembler to look for the nearest generated label that corresponds to the number in the lexically forward or backward direction.
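The lookup for generated labels can be modeled as a nearest-match search. In this Python sketch, `resolve_generated_label` is a hypothetical name and statement indexes stand in for source positions. Treating a label defined on the referencing statement itself as lying in the backward direction is an assumption.

```python
def resolve_generated_label(reference, definitions, position):
    """Find the definition that a reference such as "1f" or "1b" names.

    definitions is a list of (statement_index, label_number) pairs and
    position is the statement index of the reference.
    """
    number, direction = int(reference[:-1]), reference[-1]
    if direction == "f":
        later = [p for p, n in definitions if n == number and p > position]
        return min(later) if later else None
    if direction == "b":
        earlier = [p for p, n in definitions if n == number and p <= position]
        return max(earlier) if earlier else None
    raise ValueError("reference must end in f or b")
```

Because the search picks the nearest match, the same number can label many places in one file without the references becoming ambiguous.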

Null Statements

A null statement is an empty statement that the assembler ignores. Null statements can have label definitions. For example, this line has three null statements in it:

label: ; ;

Keyword Statements

A keyword statement begins with a predefined keyword. The syntax for the rest of the statement depends on the keyword. All instruction opcodes are keywords. All other keywords are assembler pseudo-operations (directives).

Expressions

An expression is a sequence of symbols that represent a value. Each expression and its result have data types. The assembler does arithmetic in twos-complement integers (32 bits of precision in 32-bit mode; 64 bits of precision in 64-bit mode). Expressions follow precedence rules and consist of:

  • Operators

  • Identifiers

  • Constants

Also, you may use a single character string in place of an integer within an expression. Thus:

.byte "a" ; .word "a"+0x19

is equivalent to:

.byte 0x61 ; .word 0x7a

Precedence

Unless parentheses enforce precedence, the assembler evaluates all operators of the same precedence strictly from left to right. Because parentheses also designate index-registers, ambiguity can arise from parentheses in expressions. To resolve this ambiguity, put a unary + in front of parentheses in expressions.

The assembler has three precedence levels, which are listed here from lowest to highest precedence:

Precedence                          Operators
Lowest (least binding)              binary +, -
Middle                              binary *, /, %, <<, >>, ^, &, |
Highest (most binding)              unary -, +, ~


Note: The assembler's precedence scheme differs from that of the C language.
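To make the difference from C concrete, the following Python sketch evaluates absolute expressions with the three levels described above. It is a model written for illustration, not the assembler's parser. The name `evaluate` is hypothetical, 32-bit mode is assumed for the right shift, and division is assumed to round as Python's `//` does.

```python
import re

TOKEN = re.compile(r"\s*(0[xX][0-9a-fA-F]+|\d+|<<|>>|[-+*/%^&|~()])")
MASK = 0xFFFFFFFF  # assumption: 32-bit mode

LOW = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
HIGH = {"*": lambda a, b: a * b, "/": lambda a, b: a // b,
        "%": lambda a, b: a % b, "<<": lambda a, b: a << b,
        ">>": lambda a, b: (a & MASK) >> b,  # sign not extended
        "^": lambda a, b: a ^ b, "&": lambda a, b: a & b,
        "|": lambda a, b: a | b}

def evaluate(text):
    """Evaluate an expression using the three precedence levels."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("bad token at column %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    index = 0

    def peek():
        return tokens[index] if index < len(tokens) else None

    def take():
        nonlocal index
        index += 1
        return tokens[index - 1]

    def number(tok):
        if tok[:2] in ("0x", "0X"):
            return int(tok, 16)
        return int(tok, 8) if len(tok) > 1 and tok[0] == "0" else int(tok)

    def unary():                      # highest precedence
        tok = take()
        if tok == "-":
            return -unary()
        if tok == "+":
            return unary()
        if tok == "~":
            return ~unary()
        if tok == "(":
            value = low()
            if take() != ")":
                raise ValueError("missing )")
            return value
        return number(tok)

    def high():                       # * / % << >> ^ & | share one level
        value = unary()
        while peek() in HIGH:
            value = HIGH[take()](value, unary())
        return value

    def low():                        # binary + and - bind least tightly
        value = high()
        while peek() in LOW:
            value = LOW[take()](value, high())
        return value

    result = low()
    if index != len(tokens):
        raise ValueError("unexpected token")
    return result
```

Under this model `1 | 2 * 3` is (1 | 2) * 3 = 9, where C computes 1 | (2 * 3) = 7. Likewise `8 >> 1 + 1` is (8 >> 1) + 1 = 5, where C computes 8 >> 2 = 2.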


Expression Operators

For expressions, you can rely on the precedence rules, or you can group expressions with parentheses. The assembler recognizes the operators listed in Table 4-2.

Table 4-2. Expression Operators

Operator     Meaning
+            Addition
-            Subtraction
*            Multiplication
/            Division
%            Remainder
<<           Shift Left
>>           Shift Right (sign NOT extended)
^            Bitwise Exclusive-OR
&            Bitwise AND
|            Bitwise OR
-            Minus (unary)
+            Identity (unary)
~            Complement


Data Types

The assembler manipulates several types of expressions. Each symbol you reference or define belongs to one of the categories shown in Table 4-3.

Table 4-3. Data Types

Type

Description

undefined

Any symbol that is referenced but not defined becomes global undefined, and this module will attempt to import it. The assembler uses 32-bit addressing to access these symbols. (Declaring such a symbol in a .globl pseudo-op merely makes its status clearer).

sundefined

A symbol defined by a .extern pseudo-op becomes global small undefined if its size is greater than zero but less than the number of bytes specified by the -G option on the command line (which defaults to 8). The linker places these symbols within a 64KB region pointed to by the $gp register, so that the assembler can use economical 16-bit addressing to access them.

absolute

A constant defined in an "=" expression.

text

The text section contains the program's instructions, which are not modifiable during execution. Any symbol defined while the .text pseudo-op is in effect belongs to the text section.

data

The data section contains memory that the linker can initialize to nonzero values before your program begins to execute. Any symbol defined while the .data pseudo-op is in effect belongs to the data section. The assembler uses 32-bit or 64-bit addressing to access these symbols (depending on whether you are in 32-bit or 64-bit mode).

sdata

This category is similar to data, except that defining a symbol while the .sdata ("small data") pseudo-op is in effect causes the linker to place it within a 64KB region pointed to by the $gp register, so that the assembler can use economical 16-bit addressing to access it.

rdata

Any symbol defined while the .rdata pseudo-op is in effect belongs to this category, which is similar to data, but may not be modified during execution.

bss and sbss

The bss and sbss sections consist of memory which the kernel loader initializes to zero before your program begins to execute. Any symbol defined in a .comm or .lcomm pseudo-op belongs to these sections (except that a .data, .sdata, or .rdata pseudo-op can override a .comm directive). If its size is less than the number of bytes specified by the -G option on the command line (which defaults to 8), it belongs to sbss ("small bss"), and the linker places it within a 64 KB region pointed to by the $gp register so that the assembler can use economical 16-bit addressing to access it. Otherwise, it belongs to bss and the assembler uses 32-bit or 64-bit addressing (depending on whether you are in 32-bit or 64-bit mode). Local symbols in bss or sbss defined by .lcomm are allocated memory by the assembler; global symbols are allocated memory by the link editor; and symbols defined by .comm are overlaid upon like-named symbols (in the fashion of Fortran COMMON blocks) by the link editor.

Symbols in the undefined and small undefined categories are always global (that is, they are visible to the link editor and can be shared with other modules of your program). Symbols in the absolute, text, data, sdata, rdata, bss, and sbss categories are local unless declared in a .globl pseudo-op.

Type Propagation in Expressions

When expression operators combine expression operands, the result's type depends on the types of the operands and on the operator. Expressions follow these type propagation rules:

  • If an operand is undefined, the result is undefined.

  • If both operands are absolute, the result is absolute.

  • If the operator is + and the first operand refers to a relocatable text-section, data-section, or bss-section symbol, or to an undefined external, the result has the postulated type and the other operand must be absolute.

  • If the operator is - and the first operand refers to a relocatable text-section, data-section, or bss-section symbol, the second operand can be absolute (if it was previously defined) and the result has the first operand's type; or the second operand can have the same type as the first operand and the result is absolute. If the first operand is external undefined, the second operand must be absolute.

  • The operators * , /, % , << , >> , ~, ^ , & , and | apply only to absolute symbols.
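The rules above can be read as a small decision function. This Python sketch is a strict, simplified reading of them: types are plain strings, `result_type` is a hypothetical name, and any combination the rules do not spell out (for example, an absolute first operand added to a text-section symbol) is rejected.

```python
RELOCATABLE = {"text", "data", "bss"}
ABSOLUTE_ONLY = {"*", "/", "%", "<<", ">>", "^", "&", "|"}

def result_type(op, left, right):
    """Return the type of 'left op right', or raise TypeError if not allowed."""
    if left == "absolute" and right == "absolute":
        return "absolute"
    if op in ABSOLUTE_ONLY:
        raise TypeError(op + " applies only to absolute operands")
    if op == "+" and right == "absolute":
        if left in RELOCATABLE or left == "undefined":
            return left               # result keeps the first operand's type
    if op == "-":
        if left in RELOCATABLE and right == "absolute":
            return left
        if left in RELOCATABLE and right == left:
            return "absolute"         # difference of two same-section symbols
        if left == "undefined" and right == "absolute":
            return "undefined"
    raise TypeError("combination not covered by the rules")
```

The difference of two data-section symbols is therefore absolute, which is what makes an expression such as end - start usable as a size.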

Relocations

With -n32 and -64 compiles, it is possible to specify a relocation explicitly in assembly. For example:

lui $24,%hi(.data)

This example emits a lui $24,0 instruction with an R_MIPS_HI16 relocation that references the .data symbol.
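The value that eventually replaces the 0 in the lui is the high half of the symbol's address, adjusted so that a later instruction carrying the matching %lo half reconstructs the full address. The Python sketch below shows the usual MIPS convention for 32-bit addresses. That convention is not stated in this chapter, so treat it as background, and the function names are hypothetical.

```python
def split_hi_lo(address):
    """Split a 32-bit address into its 16-bit %hi and %lo halves."""
    lo = address & 0xFFFF
    # Adding 0x8000 first compensates for the sign extension of %lo.
    hi = ((address + 0x8000) >> 16) & 0xFFFF
    return hi, lo

def rebuild(hi, lo):
    """Mimic lui followed by an instruction that sign-extends its immediate."""
    signed_lo = lo - 0x10000 if lo & 0x8000 else lo
    return ((hi << 16) + signed_lo) & 0xFFFFFFFF
```

For the address 0x10008abc the halves are 0x1001 and 0x8abc. The high half is one more than the plain upper 16 bits because the low half is negative once sign-extended.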

The following table lists the available relocations:

Assembler Syntax   ELF Relocation
%hi                R_MIPS_HI16
%lo                R_MIPS_LO16
%gp_rel            R_MIPS_GPREL
%half              R_MIPS_16
%call16            R_MIPS_CALL16
%call_hi           R_MIPS_CALL_HI16
%call_lo           R_MIPS_CALL_LO16
%got               R_MIPS_GOT
%got_disp          R_MIPS_GOT_DISP
%got_hi            R_MIPS_GOT_HI16
%got_lo            R_MIPS_GOT_LO16
%got_page          R_MIPS_GOT_PAGE
%got_ofst          R_MIPS_GOT_OFST
%neg               R_MIPS_SUB
%higher            R_MIPS_HIGHER
%highest           R_MIPS_HIGHEST