Issue 011228.1: UTF-8 Clarification

Author: David Anderson
Champion: Dave Anderson
Date submitted: 2001-12-28
Date revised:
Date closed:
Type: Clarification
Status: Accepted with modifications
DWARF Version: 3
BACKGROUND
----------

Dave Anderson posted several questions on 28-Dec-2001 about the
UTF-8 attribute and how it interacts with sections other than
debug_info (in which it is contained). Brender responded with
some tentative answers and further suggested that at minimum
additional italics commentary seemed warranted in several places.


THE DIALOG SO FAR
-----------------

Dave Anderson wrote:
>I wonder if there is a problem with the DW_AT_UTF8 attribute.
>An introduction to UTF8 is on
>http://www.unicode.org/unicode/uni2book/u2.html
>in Chapter 2, General Structure, on page 12 of 26 in that pdf chapter.
>
>I am afflicted with ignorance on this in spite of reading
>the above info...
>
>One point: for all values in the range [0..127] unicode and
>ASCII are identical, which may make a difference...

By design, of course...


>Perhaps there is no problem, but I wonder.
>
>UTF8 contains solely 8bit characters, so ordinary
>string processing won't go haywire reading them, but
>on the other hand, I worry that in a couple cases below
>it might be *necessary* to know that
>a string is UTF8 but with no appropriate way to tell.
>
>I would be happy to learn my concerns are unfounded, so
>feel free to clear this up for me...
>
>3.1  DW_AT_UTF8 says that "all strings use UTF8" in that CU.
>
>This is clearly fine for
>    .debug_info
>    .debug_strp
>
> .debug_line
>    This has file names etc.
>    This is perhaps sensibly looked at as an 'extension' of
>    .debug_info.  Are these utf8? I guess so.
>        It's not read independent of .debug_info so this seems ok.

debug_line does have its own version number field. So, if it version 3
then I think it reasonable to expect that it depends on the use_UTF8
attribute of the compilation unit.

An italics comment to this effect seems appropriate someplace.


> .debug_macinfo
>    has strings.
>    This is perhaps sensibly looked at as an 'extension' of
>    .debug_info.  Are these utf8? I guess so.
>       It's not read independent of .debug_info so this seems ok.

debug_macinfo does not have it's own version number, so the argument
is not quite so direct as for .debug_line. Still, I would expect that
debug_macinfo is equally dependent on the unit use_UTF8 attribute.

An italics comment to this effect seems appropriate someplace.

    Aside: Of course, C (C99) has its own quite different source scheme
    for representing multibyte characters. But, since that scheme is
    defined in terms of the "C character set", which is a subset of
    7-bit ASCII (and its international equivalents), I think there is
    nothing that prevents its use in combination with UTF-8 (although
    one might hope that no producer would do so!).


>However, the following sections also have strings and
>things are  slightly less clear to me.
>
> .debug_frame can have an augmentation string
>    Is this utf8?

Hmmm, very interesting...

>    What if .debug_frame is used for exceptions, so exists
>    in stripped executables? .debug_frame may be
>    gone!  So then what?

Clearly the behavior of exception handling better not depend on whether
the executable is stripped or not!

If .debug_frame is gone, then the question is moot, yes?

>    Document/dwarf3 mistake or
>    quality-of-implementation issue or ???
>    .debug_frame is 'independent' of .debug_info.... No direct
>    connection to a CU in .debug_info.

Me thinks there is a real DWARF3 issue. Not sure how best to resolve it.


> .debug_pubnames has name strings
> .debug_pubtypes has name strings.
>    Till one actually has *found* the relevant
>    compilation unit then these strings cannot
>    be usefully printed?
>    Will searches work if some CUs are utf8 but not all?
>    Is this a quality of implementation issue or a
>    dwarf3 issue?
>    These *are* searched independent of .debug_info, though
>    the do reference .debug_info.
>    Document/dwarf3 mistake or quality-of-implementation issue or ???

Well, searches can work given a mixture of UTF8 and non-UTF-8; to do so
requires that the consumer look into the corresponding compilation unit
header before doing the search. Not hard, but definitely a new wrinkle.

Worth at least an italics explanation.


RECOMMENDED ACTION
------------------

The email Ron responds to has a typo.
I meant to say that since .debug_frame can be present
yet .debug_info absent, what is supposed to happen?
Which Ron has noted as a key issue.

So I'm floating the following proposal for consideration.
(the existing ISSUE has no proposal)


PROPOSAL (ref draft 8/9, dwarf3):


x means the text is to orient the proposal reader and
x document editor, the
x literal proposed text below has no leading >.

xThe proposal is that the .debug_frame augmentation always
xbe ISO-IEC 10646-1:1993, never UFT8.  
xGiven the string is a compiler flag, 
xthere is no need to handle user-level names.




x6.4.1

x4. Augmentation (string)

xAdd the following to bullet 4:

Augmentation string must use ISO/IEC 10646-1:1993, not UTF8.
The string is (thus) independent of DW_AT_use_UTF8.


x7.5.4  Attribute Encodings
xAt the end of the 'string' attribute discussion, add the following:

Because .debug_frame is useful independently of .debug_info the
augmentation string of .debug_frame must be ISO/IEC 10646-1:1993, not UTF8
(the augmentation format is thus independent of DW_AT_use_UTF8).

=============================================================
5/17/2005:  Accepted with modification:
  In the proposal, where it says ISO-10646, it should read
  ISO-646.  The latter is LATIN-1, the former is UCS-2/4.

  The Augmentation string will be UTF-8 encoded in 6.4.1. 
  No changed text in 7.5.4.