Issue 150108.1: Accelerated Access

Cary Coutant

Introduction
------------

This proposal is an attempt to unify DWARF issues 130410.1
("Accelerated Access") and 140817.1 ("Fast Name Lookup"), and add some
support for split DWARF files that is missing from both earlier
proposals.

In DWARF version 4, the specification describes three tables
("aranges", "pubnames", and "pubtypes") supporting two types of lookup
(by address, and by name). The "aranges" table has proven to be
reasonable and adequate support for lookup by address, and neither
issue proposes to change it. The "pubnames" and "pubtypes" tables have
well-documented deficiencies, noted in both of the aforementioned
issues, and it is these tables we propose to replace in their
entirety.

The primary goal for accelerated lookup by name is to provide an index
that helps a DWARF consumer (e.g., a debugger) find the debugging
information entry or entries associated with a given name, as used in
the source program. Additional goals are driven by assumed
characteristics of the application and the consumer:

  * Because a program may have many compilation units (CUs), we
    desire an index that not only supports per-CU lookup, but
    also an index that can easily be combined at link time (or by
    a post-link utility) into a per-load-module index.

  * So that the consumer can defer the processing of debugging
    information entries as much as possible, the index should
    provide, for each name, a minimal list of CUs that it will
    need to consult to find the information for that name.

  * Because the consumer may have additional context that could
    help narrow the search, each entry in the index should be
    able to provide some basic information about the debugging
    information represented by that index entry (e.g., whether
    the name is global or local; whether it's a variable,
    subroutine, type, namespace).

  * Because a program may have many load modules, we desire an
    index that can quickly answer whether or not a name can be
    found in any CU within that load module.

  * Furthermore, the consumer must be able to trust that the
    index is complete with respect to the names we expect to find
    in the index, and we must have well-defined expectations for
    what names will be indexed.

  * Because different producer/consumer combinations may have
    varying needs, the format should be flexible and extensible,
    allowing the producer to provide a customizable set of
    properties in each index entry.

Differences between this and the Earlier Proposals

The "Accelerated Access" (AA) proposal called for three separate name
tables -- type names, namespaces, and other names --while the "Fast
Name Lookup" (FNL) proposal unified these into a single table. This
proposal uses a single table, adding the DWARF tag to each index entry
so that a consumer can quickly decide whether a particular index entry
is of interest for a particular query.

Both earlier proposals described hash tables. In AA, the hash table
was laid out with fast lookup in mind, and in FNL, the hash table was
optional. In this proposal, the hash table is optional, but the layout
is closer to that from AA.

In both AA and FNL, the format of each index entry is specified in the
section header, and must be homogeneous throughout the section. This
proposal uses an abbreviations-style encoding to allow for
heterogeneous index entries (similar to the "atom" enumerations in AA,
but extended to be more like the DWARF abbreviations table).

In AA, the set of names that is required in the index was rigorously
specified, while in FNL, there were two options: all names, or just
public names. This proposal uses the rigorous specification from AA.

Both per-CU and per-module indexes are supported by AA, while FNL (as
written) does not support per-module indexes. This proposal supports
both types of indexes, allowing a linker or post-link utility to merge
or regenerate a combined index for the module (rather than simply
concatenate the per-CU indexes).

Type units in split DWARF objects are supported by AA, but only by
including a signature field in every index entry, and are not
supported by FNL at all. This proposal adds a list of type signatures
that may be referenced by an index entry to support the case where the
index is in the executable object file, but a symbol refers to a type
unit that is found only in a split DWARF object.

Proposed Changes to the Spec
----------------------------

Replace Section 6.1.1 in its entirety with the following:

6.1.1  Lookup by Name

For lookup by name, a name index is maintained in a separate
object file section named .debug_names. The index consists
primarily of two parts: a list of names, and a list of index
entries. A name, such as a subprogram name, type name, or
variable name, may have several defining declarations in the
debugging information. In this case, the entry for that name in
the list of names will refer to a sequence of index entries in
the second part of the table, each corresponding to one defining
declaration in the .debug_info section.

The name index may also contain an optional hash table for faster
lookup.

6.1.1.1  Contents of the Name Index

The name index must contain an entry for each DIE that defines a
named subprogram, label, variable, type, or namespace, subject to
the following rules:

  * All non-defining declarations (i.e., DIEs with a
    DW_AT_declaration attribute) are excluded.

  * DW_AT_namespace DIEs without a DW_AT_name attribute are
    included with the name "(anonymous namespace)".

  * All other DIEs without a DW_AT_name attribute are excluded.

  * DW_TAG_subprogram, DW_TAG_inlined_subroutine, and
    DW_TAG_label DIEs without an address attribute (DW_AT_low_pc,
    DW_AT_high_pc, DW_AT_ranges, or DW_AT_entry_pc) are excluded.

  * DW_TAG_variable DIEs with a DW_AT_location attribute that
    includes a DW_OP_addr or DW_OP_form_tls_address operator are
    included; otherwise, they are excluded.

  * If a subprogram or inlined subroutine is included, and has a
    DW_AT_linkage_name attribute, there will be an additional
    index entry for the linkage name.

The complete list of the DWARF tags to be indexed is as follows:

    DW_TAG_subprogram
    DW_TAG_inlined_subroutine
    DW_TAG_label
    DW_TAG_variable
    DW_TAG_array_type
    DW_TAG_class_type
    DW_TAG_enumeration_type
    DW_TAG_enumerator
    DW_TAG_pointer_type
    DW_TAG_reference_type
    DW_TAG_string_type
    DW_TAG_structure_type
    DW_TAG_subroutine_type
    DW_TAG_typedef
    DW_TAG_union_type
    DW_TAG_ptr_to_member_type
    DW_TAG_set_type
    DW_TAG_subrange_type
    DW_TAG_base_type
    DW_TAG_const_type
    DW_TAG_constant
    DW_TAG_file_type
    DW_TAG_namelist
    DW_TAG_packed_type
    DW_TAG_volatile_type
    DW_TAG_restrict_type
    DW_TAG_interface_type
    DW_TAG_unspecified_type
    DW_TAG_shared_type
    DW_TAG_namespace
    DW_TAG_coarray_type
    DW_TAG_dynamic_type
    DW_TAG_atomic_type

6.1.1.2  Structure of the Name Index

Logically, the name index can be viewed as a list of names, with a
list of index entries for each name. Each index entry corresponds to a
DIE that matches the criteria given in the previous section. For
example, if one compilation unit has a function named "fred" and
another has a struct named "fred", a lookup for "fred" will find the
list containing those two index entries.

The index section contains eight individual parts:

  * A header, describing the layout of the section.

  * A list of compile units (CUs) referenced by this index.

  * A list of local type units (TUs) referenced by this index
    that are present in this object file.

  * A list of foreign type units (TUs) referenced by this index
    that are not present in this object file (i.e., that have
    been placed in a split DWARF object).

  * An optional hash lookup table.

  * The name table.

  * An abbreviations table, similar to the one used by the
    .debug_info section.

  * The entry pool, containing a list of index entries for each
    name in the name list.

[[[
Pretty diagram here:
http://www.bayarea.net/~cary/dwarf/Accelerated%20Access%20Diagram.png
]]]

The formats of the header and the hash lookup table are described
below, in Section 6.1.1.4 (Data Representation).

The list of CUs and the list of local TUs are each an array of
offsets, each of which is the offset of a compile unit or a type unit
in the .debug_info section. For a per-CU index, there is a single CU
entry, and there may be a TU entry for each type unit generated in the
same translation unit as the single CU. For a per-module index, there
will be one CU entry for each compile unit in the module, and one TU
entry for each unique type unit in the module. Each list is indexed
starting at 0.

The list of foreign TUs is an array of 8-byte (DW_FORM_ref_sig8) type
signatures, representing types referenced by the index whose
definitions have been placed in a different object file (i.e., a split
DWARF object). This list may be empty. This list is indexed starting
with the size of the local TU list, so that the two lists of TUs are
logically combined into one list that can be indexed contiguously.

The name table is logically a table with a row for each unique name in
the index, and two columns. The first column contains a reference to
the name, as a string. The second column contains the offset within
the entry pool of the list of index entries for the name.

The abbreviations table describes the formats of the entries in the
entry pool. Like the DWARF abbreviations table in the .debug_abbrev
section, it defines one or more abbreviation codes. Each abbreviation
code provides a DWARF tag value followed by a list of pairs that
defines an attribute and form code used by entries with that
abbreviation code.

The entry pool contains all the index entries, grouped by name. The
second column of the name list points to the first index entry for the
name, and all the index entries for that name are placed one after the
other.

Each index entry begins with an unsigned LEB128 abbreviation code.
The  abbreviation list for that code provides the DWARF tag value for
the entry as well as the set of attributes provided by the entry and
their forms.

The standard attributes are:

  * Compilation Unit (CU), a reference to an entry in the list of
    CUs. In a per-CU index, index entries without this attribute
    implicitly refer to the single CU.

  * Type Unit (TU), a reference to an entry in the list of local
    or foreign TUs.

  * DIE offset within the CU or TU.

  * Parent DIE, a reference to the index entry for the parent.

  * Type hash, an 8-byte hash of the type declaration.

A producer may define additional vendor-specific attributes, and a
consumer will be able to ignore and skip over any attributes it is not
prepared to handle.

When an index entry refers to a foreign type unit, it may have
attributes for both CU and (foreign) TU. For such entries, the CU
attribute gives the consumer a reference to the CU that may be used to
locate a split DWARF object that contains the type unit.

The type hash attribute, not to be confused with the type signature
for a TU, may be provided for type entries whose declarations are not
in a type unit, for the convenience of link-time or post-link
utilities that wish to de-duplicate type declarations across
compilation units. The type hash should, however, be computed by the
same method as specified for type signatures.

The last entry for each name is followed by a zero byte that
terminates the list. There may be gaps between the lists.

6.1.1.3  Per-CU vs. Per-Module Indexes (Non-Normative)

In a per-CU index, the CU list may have only a single entry, and index
entries may omit the CU attribute. (Cross-module or link-time
optimization, however, may produce an object file with several compile
units in one object. A compiler in this case may produce a separate
index for each CU, or a combined index for all CUs. In the latter
case, index entries will require the CU attribute.) Most name table
entries may have only a single index entry for each, but sometimes a
name may be used in more than one context and will require multiple
index entries, each pointing to a different debugging information
entry.

When linking object files containing per-CU indexes, the linker may
choose to concatenate the indexes as ordinary sections, or it may
choose to combine the input indexes into a single per-module index.

A per-module index will contain a number of CUs, and each index entry
should contain a CU attribute or a TU attribute to identify which CU
or TU contains the debugging information entry being indexed. When a
given name is used in multiple CUs or TUs, it will typically have a
series of index entries pointing to each CU or TU where it is
declared. For example, an index entry for a C++ namespace will need to
list each occurrence, since each CU may contribute additional names to
the namespace, and the consumer will need to find them all. On the
other hand, some index entries do not need to list more than one
definition; for example, with the one-definition rule in C++,
duplicate entries for a function may be omitted, since the consumer
only needs to find one declaration. Likewise, a per-module index needs
to list only a single copy of a type declaration contained in a type
unit.

For the benefit of link-time or post-link utilities that consume
per-CU indexes and produce a per-module index, the per-CU index
entries provide the tag encoding for the original debugging
information entry, and may provide a type hash for certain types that
may benefit from de-duplication. For example, the standard declaration
of the typedef uint32_t is likely to occur in many CUs, but a
combined per-module index needs to retain only one; a user declaration
of a typedef mytype may refer to a different type at each
occurrence, and a combined per-module index should retain each unique
declaration of that type.

6.1.1.4  Data Representation

The name index is placed in a section named .debug_names, and
consists of the eight parts described below.

Section Header

The section header contains the following fields:

  * unit_length (initial length). A 4- or 12-byte field, as in
    other DWARF sections, giving the size of this contribution to
    the index section, not including the unit_length field.

  * version (uhalf). For this proposal, the value of this field
    is 5.

  * padding (uhalf).

  * comp_unit_count (4-byte unsigned integer). The number of CUs
    in the CU list.

  * local_type_unit_count (4-byte unsigned integer). The number
    of TUs in the first TU list.

  * foreign_type_unit_count (4-byte unsigned integer). The number
    of TUs in the second TU list.

  * bucket_count (4-byte unsigned integer). The number of hash
    buckets in the hash lookup table. If there is no hash lookup
    table, this field should contain 0.

  * name_count (4-byte unsigned integer). The number of unique
    names in the index.

  * abbrev_table_size (4-byte unsigned integer). The size in
    bytes of the abbreviations table.

List of CUs

The list of CUs immediately follows the header. Each entry in the list
is a section offset. In the DWARF-32 format, a section offset is 4
bytes, while in the DWARF-64 format, a section offset is 8 bytes.

The total number of entries in the list is given by comp_unit_count.
There must be at least one CU.

List of Local TUs

The list of local TUs immediately follows the list of CUs. Each entry
in the list is a section offset. In the DWARF-32 format, a section
offset is 4 bytes, while in the DWARF-64 format, a section offset is 8
bytes.

The total number of entries in the list is given by
local_type_unit_count. This list may be empty.

List of Foreign TUs

The list of foreign TUs immediately follows the list of local TUs.
Each entry in the list is an 8-byte type signature (as described by
DW_FORM_ref_sig8).

The number of entries in the list is given by foreign_type_unit_count.
This list may be empty.

Hash Lookup Table

The optional hash lookup table immediately follows the list of type signatures.

The hash lookup table is actually two separate arrays: an array of
buckets, followed immediately by an array of hashes. The number of
entries in the buckets array is given by bucket_count, and the number
of entries in the hashes array is given by name_count. Each array
contains 4-byte unsigned integers.

Symbols are entered into the hash table by first computing a hash
value from the symbol name. The hash is computed by the DJB
hash function (the same hash function used for the DT_GNU_HASH
table in ELF object files). Given a hash value for the symbol,
the symbol is entered into a bucket whose index is the hash value
modulo bucket_count. The buckets array is indexed starting at 0.

Each bucket contains the index of an entry in the hashes array. The
hashes array is indexed starting at 1, and an empty bucket is
represented by the value 0.

The hashes array contains a list of the full hash values for each
symbol. All symbols that fall into the same bucket must be grouped
together in the hashes array, and the bucket refers to the first
symbol in the group. When searching for a symbol, the search should
start at the index given by the bucket, and continue either until a
matching symbol is found or until a hash value from a different bucket
is found. If two different symbol names produce the same hash value,
that hash value will occur twice in the hashes array. Thus, if a
matching hash value is found, but the name does not match, the search
should continue visiting subsequent entries in the hashes table.

When a matching hash value is found in the hashes array, the index of
that entry in the hashes array is used to find the corresponding entry
in the name table.

Name Table

The name table immediately follows the hash lookup table. The name
table is laid out in column-major order (i.e., the first column,
followed by the second column). Each entry in the first column
contains the string table offset (DW_FORM_strp) of the name in the
.debug_str (or .debug_str.dwo) section. Each entry in the second
column contains the offset (as a section offset) within the entry pool
of the list of index entries for the name. Rows in the name table are
indexed starting at 1 (to match the hashes array).

The number of rows in the name table is given by name_count.

If there is a hash lookup table, the entries in the name table must be
grouped by bucket: all names that fall into the same hash bucket must
be grouped together. The row number of an entry in the name table must
match the row number of its corresponding entry in the hashes array.

If there is no hash lookup table, there is no ordering or grouping
requirement for the name table.

Abbreviations Table

The abbreviations table immediately follows the name table. Like the
abbreviations table for debugging information entries, this table
consists of a series of abbreviation declarations. Its size is given
by abbrev_table_size.

Each abbreviation declaration defines the tag and other attributes for
a particular form of index entry. Each declaration starts with an
unsigned LEB128 number representing the abbreviation code itself. It
is this code that appears at the beginning of an index entry. The
abbreviation code must not be 0.

The abbreviation code is followed by another unsigned LEB128 number
that encodes the tag of the debugging information entry corresponding
to the index entry.

Following the tag encoding is a series of attribute specifications.
Each attribute consists of two parts: an unsigned LEB128 number that
represents the index attribute, and another unsigned LEB128 number
that represents the attribute's form (as described in Section 7.5.4,
Attribute Encodings). The series of attribute specifications ends
with an entry containing 0 for the attribute and 0 for the form.

The index attributes are listed in the following table:

  Attribute Name     Value  Meaning                   Form/Class

  DW_IDX_compile_unit   1   Index of CU               const

  DW_IDX_type_unit      2   Index of TU (local or     const
   foreign)

  DW_IDX_die_offset     3   Offset of DIE within CU   reference
   or TU

  DW_IDX_parent         4   Index of name table       const
   entry for parent

  DW_IDX_type_hash      5   Hash of type declaration  DW_FORM_data8

  DW_IDX_lo_user   0x2000   Start of user-defined range

  DW_IDX_hi_user   0x3fff   End of user-defined range

The abbreviations table ends with an entry consisting of a single 0
byte for the abbreviation code. The size of the table given by
abbrev_table_size may include optional padding following the
terminating 0 byte.

Entry Pool

The entry pool immediately follows the abbreviations table. The second
column of each row of the name table points to an offset in the entry
pool, where a series of index entries for that name is located.

Each index entry in the series begins with an abbreviation code, and is
followed by the attributes described by the abbreviation declaration
for that code. The last index entry in the series is followed by a
terminating entry whose abbreviation code is 0.

Gaps are not allowed between entries in a series (i.e., the entries
for a single name must all be contiguous), but there may be gaps
between series (if, for example, a producer/consumer combination finds
it useful to maintain alignment).

The size of the entry pool is limited by the size of the contribution
to the index section, as defined by the unit_length header field.

--
2/23/2015 -- Accepted

Author:	Cary Coutant
Champion:	Cary Coutant
Date submitted:	2015-01-08
Date revised:
Date closed:
Type:	Enhancement
Status:	Accepted
DWARF Version:	5