Patch to support extended characters in C/C++ identifiers

Message ID 20190812220121.GA9251@ldh.local
State New

Commit Message

Lewis Hyatt Aug. 12, 2019, 10:01 p.m.
Hello-

The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
in identifiers. A preliminary version of the patch was posted on PR c/67224 as
Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
discussed with Joseph Myers. Here is an updated patch incorporating all
feedback received so far. I hope it is suitable now; please let me know if I
can do anything else to make it ready for you to apply. I am happy to work on
it further, whatever is needed. I can't easily test on anything other than
x86_64-linux though. I did bootstrap all languages and run all tests on that
platform, everything was good.
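
For readers unfamiliar with the encoding, here is a minimal sketch of what accepting a UTF-8 character involves: decoding one multibyte sequence into a code point and rejecting malformed input. The name decode_utf8 is hypothetical and this is not the libcpp code (the patch itself relies on the existing one_utf8_to_cppchar); it only illustrates why stray latin-1 bytes do not decode as UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the libcpp implementation.  Decode one UTF-8
   sequence from S (at most LEN bytes).  Returns the number of bytes
   consumed and stores the code point in *CP, or returns 0 if the bytes
   do not form a valid UTF-8 sequence.  */
static size_t
decode_utf8 (const unsigned char *s, size_t len, uint32_t *cp)
{
  size_t n, i;
  uint32_t c, min;

  if (len == 0)
    return 0;

  if (s[0] < 0x80)
    {
      *cp = s[0];                /* plain ASCII */
      return 1;
    }
  else if (s[0] >= 0xC0 && s[0] < 0xE0)
    { n = 2; c = s[0] & 0x1F; min = 0x80; }
  else if (s[0] >= 0xE0 && s[0] < 0xF0)
    { n = 3; c = s[0] & 0x0F; min = 0x800; }
  else if (s[0] >= 0xF0 && s[0] < 0xF8)
    { n = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;                    /* stray continuation or invalid lead byte */

  if (len < n)
    return 0;                    /* sequence is truncated */

  for (i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;                /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates and out-of-range values.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return n;
}
```

With this sketch, the two bytes 0xC3 0xA1 decode to U+00E1 (the same character as the UCN \u00e1), while a lone latin-1 byte 0xE1 followed by ASCII is rejected.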

The (relatively short) changes to libcpp are included inline here. I attached
the test cases as a gzipped patch to avoid any problems with the encoding (the
test cases contain some invalid UTF-8 and also other encodings such as latin-1
as part of the testing).

Thanks for taking a look at it!

-Lewis

libcpp/ChangeLog

	PR c/67224
	* charset.c (_cpp_valid_utf8): New function to help lex UTF-8 tokens.
	* internal.h (_cpp_valid_utf8): Declare.
	* lex.c (forms_identifier_p): Use it to recognize UTF-8 identifiers.
	(_cpp_lex_direct): Handle UTF-8 in identifiers and CPP_OTHER tokens.
	Do all work in "default" case to avoid slowing down typical code paths.
	Also handle $ and UCN in the default case for consistency.
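
The idea behind doing the work in the "default" case can be sketched as follows. This is a simplified illustration with hypothetical names (tok_kind, classify_first_byte), not the actual _cpp_lex_direct, and it assumes an ASCII-compatible source encoding: common characters are dispatched first, and only bytes that fall through to the default case pay for the extended-character check (the patch tests for a lead byte of 0xC0 or above).

```c
/* Hypothetical token classes for this sketch only.  */
enum tok_kind { TOK_NAME, TOK_PUNCT, TOK_MAYBE_EXTENDED, TOK_OTHER };

/* Classify a token by its first byte.  The cheap, common cases come
   first; '$', '\\' (possible UCN) and possible UTF-8 lead bytes are
   only examined in the default case.  */
static enum tok_kind
classify_first_byte (unsigned char c)
{
  if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
    return TOK_NAME;

  switch (c)
    {
    case '+': case '-': case '*': case '/': case ';':
      return TOK_PUNCT;

    default:
      /* Slow path: might start an extended identifier.  */
      if (c == '$' || c == '\\' || c >= 0xC0)
        return TOK_MAYBE_EXTENDED;
      return TOK_OTHER;
    }
}
```

A byte in the range 0x80 to 0xBF can only be a UTF-8 continuation byte, so it never starts an extended character and is classified as TOK_OTHER here.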

gcc/testsuite/ChangeLog

	PR c/67224
	* c-c++-common/cpp/ucnid-2011-1-utf8.c: New test.
	* g++.dg/cpp/ucnid-1-utf8.C: New test.
	* g++.dg/cpp/ucnid-2-utf8.C: New test.
	* g++.dg/cpp/ucnid-3-utf8.C: New test.
	* g++.dg/cpp/ucnid-4-utf8.C: New test.
	* g++.dg/other/ucnid-1-utf8.C: New test.
	* gcc.dg/cpp/ucnid-1-utf8.c: New test.
	* gcc.dg/cpp/ucnid-10-utf8.c: New test.
	* gcc.dg/cpp/ucnid-11-utf8.c: New test.
	* gcc.dg/cpp/ucnid-12-utf8.c: New test.
	* gcc.dg/cpp/ucnid-13-utf8.c: New test.
	* gcc.dg/cpp/ucnid-14-utf8.c: New test.
	* gcc.dg/cpp/ucnid-15-utf8.c: New test.
	* gcc.dg/cpp/ucnid-2-utf8.c: New test.
	* gcc.dg/cpp/ucnid-3-utf8.c: New test.
	* gcc.dg/cpp/ucnid-4-utf8.c: New test.
	* gcc.dg/cpp/ucnid-6-utf8.c: New test.
	* gcc.dg/cpp/ucnid-7-utf8.c: New test.
	* gcc.dg/cpp/ucnid-9-utf8.c: New test.
	* gcc.dg/ucnid-1-utf8.c: New test.
	* gcc.dg/ucnid-10-utf8.c: New test.
	* gcc.dg/ucnid-11-utf8.c: New test.
	* gcc.dg/ucnid-12-utf8.c: New test.
	* gcc.dg/ucnid-13-utf8.c: New test.
	* gcc.dg/ucnid-14-utf8.c: New test.
	* gcc.dg/ucnid-15-utf8.c: New test.
	* gcc.dg/ucnid-16-utf8.c: New test.
	* gcc.dg/ucnid-2-utf8.c: New test.
	* gcc.dg/ucnid-3-utf8.c: New test.
	* gcc.dg/ucnid-4-utf8.c: New test.
	* gcc.dg/ucnid-5-utf8.c: New test.
	* gcc.dg/ucnid-6-utf8.c: New test.
	* gcc.dg/ucnid-7-utf8.c: New test.
	* gcc.dg/ucnid-8-utf8.c: New test.
	* gcc.dg/ucnid-9-utf8.c: New test.

Comments

Jason Merrill Aug. 15, 2019, 3:48 a.m. | #1
On 8/12/19 6:01 PM, Lewis Hyatt wrote:
> Hello-
>
> The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> discussed with Joseph Myers. Here is an updated patch incorporating all
> feedback received so far. I hope it is suitable now; please let me know if I
> can do anything else to make it ready for you to apply. I am happy to work on
> it further, whatever is needed. I can't easily test on anything other than
> x86_64-linux though. I did bootstrap all languages and run all tests on that
> platform, everything was good.
>
> The (relatively short) changes to libcpp are included inline here. I attached
> the test cases as a gzipped patch to avoid any problems with the encoding (the
> test cases contain some invalid UTF-8 and also other encodings such as latin-1
> as part of the testing).
>
> Thanks for taking a look at it!

Looks good to me.  Joseph?

Jason
Joseph Myers Aug. 15, 2019, 12:23 p.m. | #2
On Thu, 15 Aug 2019, Jason Merrill wrote:

> On 8/12/19 6:01 PM, Lewis Hyatt wrote:
> > Hello-
> >
> > The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> > in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> > Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> > discussed with Joseph Myers. Here is an updated patch incorporating all
> > feedback received so far. I hope it is suitable now; please let me know if I
> > can do anything else to make it ready for you to apply. I am happy to work on
> > it further, whatever is needed. I can't easily test on anything other than
> > x86_64-linux though. I did bootstrap all languages and run all tests on that
> > platform, everything was good.
> >
> > The (relatively short) changes to libcpp are included inline here. I attached
> > the test cases as a gzipped patch to avoid any problems with the encoding (the
> > test cases contain some invalid UTF-8 and also other encodings such as latin-1
> > as part of the testing).
> >
> > Thanks for taking a look at it!
>
> Looks good to me.  Joseph?

I'm a month behind on gcc-patches at present.  It will take me a while to 
get to this for detailed review.

-- 
Joseph S. Myers
joseph@codesourcery.com
Joseph Myers Sept. 10, 2019, 11:47 p.m. | #3
On Mon, 12 Aug 2019, Lewis Hyatt wrote:

> Hello-
>
> The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> discussed with Joseph Myers. Here is an updated patch incorporating all
> feedback received so far. I hope it is suitable now; please let me know if I
> can do anything else to make it ready for you to apply. I am happy to work on
> it further, whatever is needed. I can't easily test on anything other than
> x86_64-linux though. I did bootstrap all languages and run all tests on that
> platform, everything was good.
>
> The (relatively short) changes to libcpp are included inline here. I attached
> the test cases as a gzipped patch to avoid any problems with the encoding (the
> test cases contain some invalid UTF-8 and also other encodings such as latin-1
> as part of the testing).
>
> Thanks for taking a look at it!

Thanks, I think this is OK with a few updates to the documentation.  
Specifically:

cpp.texi says:

  In the 1999 C standard, identifiers may contain letters which are not
  part of the ``basic source character set'', at the implementation's
  discretion (such as accented Latin letters, Greek letters, or Chinese
  ideograms).  This may be done with an extended character set, or the
  @samp{\u} and @samp{\U} escape sequences.  GCC only accepts such
  characters in the @samp{\u} and @samp{\U} forms.

and it's no longer accurate to say that only the \u and \U forms are 
accepted.

cpp.texi, section "Implementation-defined behavior", discusses 
implementation-defined characters in identifiers.  It should say that GCC 
accepts exactly those multibyte characters that correspond to UCNs for 
characters permitted by the chosen version of the C or C++ standard.

cppopts.texi documents -fextended-identifiers as "Accept universal 
character names in identifiers.".  That needs to say the characters are 
also accepted directly in the identifiers.
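
As an illustration of what the updated documentation would describe, the sketch below spells one identifier two ways. It assumes a UTF-8 source file and a compiler that accepts UTF-8 directly in identifiers (such as GCC with this patch applied); the variable and function names are made up for the example.

```c
/* Declared using the UCN spelling of U+00E9.  */
static int caf\u00e9 = 42;

/* Read back using the UTF-8 spelling; both spellings are expected to
   name the same variable.  */
static int
read_via_utf8_spelling (void)
{
  return café;
}
```

If the compiler treated the two spellings as different identifiers, the function would fail to compile because the UTF-8 name would be undeclared.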


I should also note that a few of the tests added by the patch are testing 
things that are properties of the implementation that might arguably be 
bugs, rather than standard features, and so perhaps should at least have 
comments added saying they are testing those implementation properties.

gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c, testing invalid UTF-8, is relying 
on GCC, in its default -finput-charset=utf-8 mode, not actually checking 
that the input is valid UTF-8.  It's clear that avoiding such a check 
makes sense in strings and comments, both as a matter of efficiency and 
because it's likely to do the right thing for a lot of user programs that 
use non-UTF-8 character sets in those places and just need the bytes in 
the strings to be passed through to the compiler output (rather than 
requiring users to specify -finput-charset and -fexec-charset for those 
programs).  Outside those contexts it's less obvious what's the best way 
to behave (this sort of test, where the stray non-UTF-8 bytes are in text 
that disappears as a result of macro expansion, is certainly a corner 
case).

gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and 
gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in 
C++, where strictly the results they expect show that GCC does not conform 
to the C++ standard requirement to convert all extended characters to UCNs 
(because C++ does not have the special C rule making it 
implementation-defined whether the \ of a UCN in a string literal is 
doubled when stringizing).
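
For reference, here is a sketch in C of what double stringizing means, using hypothetical macro names STR and XSTR. The exact result of the second stringizing is deliberately not fixed, since in C it is implementation-defined whether the backslash of a UCN in a string literal is doubled.

```c
#include <string.h>

#define STR(x) #x
#define XSTR(x) STR (x)

/* Stringized once: a string literal holding the single character U+00C1.  */
static const char *
once (void)
{
  return STR (\u00c1);
}

/* Stringized twice: the argument is expanded to a string literal first,
   and that literal is then stringized again.  */
static const char *
twice (void)
{
  return XSTR (STR (\u00c1));
}
```

Assuming a UTF-8 execution character set, strlen (twice ()) would be 4 if the character ends up in the result (two quote characters plus a two-byte character) and 8 if the backslash is doubled so that the six characters of the UCN spelling survive as text.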

-- 
Joseph S. Myers
joseph@codesourcery.com
Lewis Hyatt Sept. 11, 2019, 2:31 p.m. | #4
On Tue, Sep 10, 2019 at 7:47 PM Joseph Myers <joseph@codesourcery.com> wrote:
> Thanks, I think this is OK with a few updates to the documentation.

Thanks for looking through this, I'm glad it will be acceptable. I
will make the documentation adjustments as you suggest.

Speaking of documentation, one other thing occurred to me. When I made
these changes, I tried to make them as minimally disruptive as
possible, so they are the smallest changes I could find to the current
overall architecture to make this work. As a result there are some
things that may be a little surprising. For instance, you can take a
UTF-8 encoded file and insert a backslash line continuation in the
middle of a multibyte sequence, and gcc will happily paste it back
together and then interpret the resulting UTF-8. I think it's
technically OK standardwise since the conversion from extended
characters to the source character set is implementation-defined, but
it's hardly a straightforward definition. It is sort of consistent
with the treatment of undefined behavior with UCN escapes though,
which gcc already permits to be pasted together over a line
continuation. Anyway, should this behavior be documented as well? I
doubt anyone would be happy with a full-blown solution that involves
doing the UTF-8 conversion at initial parse time, given how much of
the libcpp code is devoted to optimizing the performance of scanning
input files, so I presume this is the way it will end up working.
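
To make the line-continuation case concrete, here is a small sketch of line splicing applied before any UTF-8 decoding. The function name splice_lines is hypothetical and the code is not taken from libcpp; it only shows how a multibyte sequence that was split by a backslash-newline becomes contiguous again.

```c
#include <stddef.h>

/* Hypothetical helper: copy LEN bytes from SRC to DST, dropping every
   backslash-newline pair (line splicing).  Returns the number of bytes
   written.  DST must have room for LEN bytes.  */
static size_t
splice_lines (const unsigned char *src, size_t len, unsigned char *dst)
{
  size_t out = 0;
  size_t i;

  for (i = 0; i < len; i++)
    {
      if (src[i] == '\\' && i + 1 < len && src[i + 1] == '\n')
        {
          i++;                   /* skip the newline as well */
          continue;
        }
      dst[out++] = src[i];
    }
  return out;
}
```

Given the bytes 'c' 'a' 'f' 0xC3 '\\' '\n' 0xA9, the output is 'c' 'a' 'f' 0xC3 0xA9, which is the UTF-8 encoding of "café" and would then be lexed as a single identifier.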

> I should also note that a few of the tests added by the patch are testing
> things that are properties of the implementation that might arguably be
> bugs, rather than standard features, and so perhaps should at least have
> comments added saying they are testing those implementation properties.
>
> gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c, testing invalid UTF-8, is relying
> on GCC, in its default -finput-charset=utf-8 mode, not actually checking
> that the input is valid UTF-8.  It's clear that avoiding such a check
> makes sense in strings and comments, both as a matter of efficiency and
> because it's likely to do the right thing for a lot of user programs that
> use non-UTF-8 character sets in those places and just need the bytes in
> the strings to be passed through to the compiler output (rather than
> requiring users to specify -finput-charset and -fexec-charset for those
> programs).  Outside those contexts it's less obvious what's the best way
> to behave (this sort of test, where the stray non-UTF-8 bytes are in text
> that disappears as a result of macro expansion, is certainly a corner
> case).

My main reason for including this test was to demonstrate that
existing behavior is unchanged by the patch. If you think it makes
more sense, I could omit the test altogether, otherwise I will add a
comment like you suggested.

> gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and
> gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in
> C++, where strictly the results they expect show that GCC does not conform
> to the C++ standard requirement to convert all extended characters to UCNs
> (because C++ does not have the special C rule making it
> implementation-defined whether the \ of a UCN in a string literal is
> doubled when stringizing).

Thanks, I didn't mean to ignore this point when you made it on the PR
comments, I just wasn't sure what was the best way to handle it. Do
you find it preferable to just add a comment, or should I rather
change the test to look for the standard-conforming output, and make
it an XFAIL?

Finally, one general question, when I submit these last changes, is it
better to send them as a new patch relative to what I already sent, or
is it better to send the whole thing updated from scratch? Thanks
again.

-Lewis
Joseph Myers Sept. 12, 2019, 12:33 a.m. | #5
On Wed, 11 Sep 2019, Lewis Hyatt wrote:

> things that may be a little surprising. For instance, you can take a
> UTF-8 encoded file and insert a backslash line continuation in the
> middle of a multibyte sequence, and gcc will happily paste it back
> together and then interpret the resulting UTF-8. I think it's
> technically OK standardwise since the conversion from extended
> characters to the source character set is implementation-defined, but
> it's hardly a straightforward definition. It is sort of consistent
> with the treatment of undefined behavior with UCN escapes though,
> which gcc already permits to be pasted together over a line
> continuation. Anyway, should this behavior be documented as well? I

I don't think that peculiarity should be documented.  (Whereas accepting 
arbitrary bytes inside comments and strings by default is arguably 
actually a feature.)

> > gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and
> > gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C are testing double stringizing in
> > C++, where strictly the results they expect show that GCC does not conform
> > to the C++ standard requirement to convert all extended characters to UCNs
> > (because C++ does not have the special C rule making it
> > implementation-defined whether the \ of a UCN in a string literal is
> > doubled when stringizing).
>
> Thanks, I didn't mean to ignore this point when you made it on the PR
> comments, I just wasn't sure what was the best way to handle it. Do
> you find it preferable to just add a comment, or should I rather
> change the test to look for the standard-conforming output, and make
> it an XFAIL?

My inclination would be a comment, with reference to a bug filed for this 
issue in Bugzilla.

> Finally, one general question, when I submit these last changes, is it
> better to send them as a new patch relative to what I already sent, or
> is it better to send the whole thing updated from scratch? Thanks
> again.

A complete patch that can be applied to trunk is best.

-- 
Joseph S. Myers
joseph@codesourcery.com
Lewis Hyatt Sept. 12, 2019, 8:30 p.m. | #6
On Tue, Sep 10, 2019 at 11:47:22PM +0000, Joseph Myers wrote:
> On Mon, 12 Aug 2019, Lewis Hyatt wrote:
>
> > Hello-
> >
> > The attached patch for libcpp adds support for extended characters (e.g. UTF-8)
> > in identifiers. A preliminary version of the patch was posted on PR c/67224 as
> > Comment 26 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26) and
> > discussed with Joseph Myers. Here is an updated patch incorporating all
> > feedback received so far. I hope it is suitable now; please let me know if I
> > can do anything else to make it ready for you to apply. I am happy to work on
> > it further, whatever is needed. I can't easily test on anything other than
> > x86_64-linux though. I did bootstrap all languages and run all tests on that
> > platform, everything was good.
> >
> > The (relatively short) changes to libcpp are included inline here. I attached
> > the test cases as a gzipped patch to avoid any problems with the encoding (the
> > test cases contain some invalid UTF-8 and also other encodings such as latin-1
> > as part of the testing).
> >
> > Thanks for taking a look at it!
>
> Thanks, I think this is OK with a few updates to the documentation.

Attached is a single patch relative to current trunk that incorporates all of
your feedback. I gzipped it like last time just in case the invalid UTF-8 in
the tests presents a problem. The code changes are the same as before other
than comments.

The documentation is now updated... there were a couple other places that also
seemed reasonable to me to update, hope it sounds OK.

I also created the PR about UCN conversion (PR 91755) and added a reference in
the comments for those tests.

Bootstrap was done on Linux x86-64, testing results:

before patch:
36 XPASS
72 FAIL
1452 XFAIL
9624 UNSUPPORTED
359529 PASS

after patch:
36 XPASS
72 FAIL
1452 XFAIL
9624 UNSUPPORTED
359633 PASS

Thank you.

-Lewis

libcpp/ChangeLog
2019-09-12  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* charset.c (_cpp_valid_utf8): New function to help lex UTF-8 tokens.
	* internal.h (_cpp_valid_utf8): Declare.
	* lex.c (forms_identifier_p): Use it to recognize UTF-8 identifiers.
	(_cpp_lex_direct): Handle UTF-8 in identifiers and CPP_OTHER tokens.
	Do all work in "default" case to avoid slowing down typical code paths.
	Also handle $ and UCN in the default case for consistency.

gcc/ChangeLog
2019-09-12  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* doc/cpp.texi: Document support for extended characters in
	identifiers.
	* doc/cppopts.texi: Likewise.

gcc/testsuite/ChangeLog
2019-09-12  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* c-c++-common/cpp/ucnid-2011-1-utf8.c: New test.
	* g++.dg/cpp/ucnid-1-utf8.C: New test.
	* g++.dg/cpp/ucnid-2-utf8.C: New test.
	* g++.dg/cpp/ucnid-3-utf8.C: New test.
	* g++.dg/cpp/ucnid-4-utf8.C: New test.
	* g++.dg/other/ucnid-1-utf8.C: New test.
	* gcc.dg/cpp/ucnid-1-utf8.c: New test.
	* gcc.dg/cpp/ucnid-10-utf8.c: New test.
	* gcc.dg/cpp/ucnid-11-utf8.c: New test.
	* gcc.dg/cpp/ucnid-12-utf8.c: New test.
	* gcc.dg/cpp/ucnid-13-utf8.c: New test.
	* gcc.dg/cpp/ucnid-14-utf8.c: New test.
	* gcc.dg/cpp/ucnid-15-utf8.c: New test.
	* gcc.dg/cpp/ucnid-2-utf8.c: New test.
	* gcc.dg/cpp/ucnid-3-utf8.c: New test.
	* gcc.dg/cpp/ucnid-4-utf8.c: New test.
	* gcc.dg/cpp/ucnid-6-utf8.c: New test.
	* gcc.dg/cpp/ucnid-7-utf8.c: New test.
	* gcc.dg/cpp/ucnid-9-utf8.c: New test.
	* gcc.dg/ucnid-1-utf8.c: New test.
	* gcc.dg/ucnid-10-utf8.c: New test.
	* gcc.dg/ucnid-11-utf8.c: New test.
	* gcc.dg/ucnid-12-utf8.c: New test.
	* gcc.dg/ucnid-13-utf8.c: New test.
	* gcc.dg/ucnid-14-utf8.c: New test.
	* gcc.dg/ucnid-15-utf8.c: New test.
	* gcc.dg/ucnid-16-utf8.c: New test.
	* gcc.dg/ucnid-2-utf8.c: New test.
	* gcc.dg/ucnid-3-utf8.c: New test.
	* gcc.dg/ucnid-4-utf8.c: New test.
	* gcc.dg/ucnid-5-utf8.c: New test.
	* gcc.dg/ucnid-6-utf8.c: New test.
	* gcc.dg/ucnid-7-utf8.c: New test.
	* gcc.dg/ucnid-8-utf8.c: New test.
	* gcc.dg/ucnid-9-utf8.c: New test.
Joseph Myers Sept. 19, 2019, 7:57 p.m. | #7
On Thu, 12 Sep 2019, Lewis Hyatt wrote:

> Attached is a single patch relative to current trunk that incorporates all of
> your feedback. I gzipped it like last time just in case the invalid UTF-8 in
> the tests presents a problem. The code changes are the same as before other
> than comments.
>
> The documentation is now updated... there were a couple other places that also
> seemed reasonable to me to update, hope it sounds OK.

Thanks, I've now committed this patch.

-- 
Joseph S. Myers
joseph@codesourcery.com
Lewis Hyatt Sept. 19, 2019, 8:08 p.m. | #8
On Thu, Sep 19, 2019 at 3:57 PM Joseph Myers <joseph@codesourcery.com> wrote:
> On Thu, 12 Sep 2019, Lewis Hyatt wrote:
>
> > Attached is a single patch relative to current trunk that incorporates all of
> > your feedback. I gzipped it like last time just in case the invalid UTF-8 in
> > the tests presents a problem. The code changes are the same as before other
> > than comments.
> >
> > The documentation is now updated... there were a couple other places that also
> > seemed reasonable to me to update, hope it sounds OK.
>
> Thanks, I've now committed this patch.

Thank you very much.

-Lewis

Patch

diff --git a/libcpp/charset.c b/libcpp/charset.c
index 8a0e5cbb29b..4f1bee96cee 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -1198,6 +1198,84 @@  convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
   return from;
 }
 
+/*  Performs a similar task as _cpp_valid_ucn, but parses UTF-8-encoded
+    extended characters rather than UCNs.  If the return value is TRUE, then a
+    character was successfully decoded and stored in *CP; *PSTR has been
+    updated to point one past the valid UTF-8 sequence.  Diagnostics may have
+    been emitted if the character parsed is not allowed in the current context.
+    If the return value is FALSE, then *PSTR has not been modified and *CP may
+    equal 0, to indicate that *PSTR does not form a valid UTF-8 sequence, or it
+    may, when processing an identifier in C mode, equal a codepoint that was
+    validly encoded but is not allowed to appear in an identifier.  In either
+    case, no diagnostic is emitted, and the return value of FALSE should cause
+    a new token to be formed.
+
+    Unlike _cpp_valid_ucn, this will never be called when lexing a string; only
+    a potential identifier, or a CPP_OTHER token.  NST is unused in the latter
+    case.
+
+    As in _cpp_valid_ucn, IDENTIFIER_POS is 0 when not in an identifier, 1 for
+    the start of an identifier, or 2 otherwise.  */
+
+extern bool
+_cpp_valid_utf8 (cpp_reader *pfile,
+		 const uchar **pstr,
+		 const uchar *limit,
+		 int identifier_pos,
+		 struct normalize_state *nst,
+		 cppchar_t *cp)
+{
+  const uchar *base = *pstr;
+  size_t inbytesleft = limit - base;
+  if (one_utf8_to_cppchar (pstr, &inbytesleft, cp))
+    {
+      /* No diagnostic here as this byte will rather become a
+	 new token.  */
+      *cp = 0;
+      return false;
+    }
+
+  if (identifier_pos)
+    {
+      switch (ucn_valid_in_identifier (pfile, *cp, nst))
+	{
+
+	case 0:
+	  /* In C++, this is an error for invalid character in an identifier
+	     because logically, the UTF-8 was converted to a UCN during
+	     translation phase 1 (even though we don't physically do it that
+	     way). In C, this byte rather becomes grammatically a separate
+	     token.  */
+
+	  if (CPP_OPTION (pfile, cplusplus))
+	    cpp_error (pfile, CPP_DL_ERROR,
+		       "extended character %.*s is not valid in an identifier",
+		       (int) (*pstr - base), base);
+	  else
+	    {
+	      *pstr = base;
+	      return false;
+	    }
+
+	  break;
+
+	case 2:
+	  if (identifier_pos == 1)
+	    {
+	      /* This is treated the same way in C++ or C99 -- lexed as an
+		 identifier which is then invalid because an identifier is
+		 not allowed to start with this character.  */
+	      cpp_error (pfile, CPP_DL_ERROR,
+	  "extended character %.*s is not valid at the start of an identifier",
+			 (int) (*pstr - base), base);
+	    }
+	  break;
+	}
+    }
+
+  return true;
+}
+
 /* Subroutine of convert_hex and convert_oct.  N is the representation
    in the execution character set of a numeric escape; write it into the
    string buffer TBUF and update the end-of-string pointer therein.  WIDE
@@ -1956,8 +2034,9 @@  cpp_interpret_charconst (cpp_reader *pfile, const cpp_token *token,
 }
 
 /* Convert an identifier denoted by ID and LEN, which might contain
-   UCN escapes, to the source character set, either UTF-8 or
-   UTF-EBCDIC.  Assumes that the identifier is actually a valid identifier.  */
+   UCN escapes or UTF-8 multibyte chars, to the source character set,
+   either UTF-8 or UTF-EBCDIC.  Assumes that the identifier is actually
+   a valid identifier.  */
 cpp_hashnode *
 _cpp_interpret_identifier (cpp_reader *pfile, const uchar *id, size_t len)
 {
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 45167a9500e..d2158426b1f 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -777,6 +777,14 @@  extern bool _cpp_valid_ucn (cpp_reader *, const unsigned char **,
 			    cppchar_t *,
 			    source_range *char_range,
 			    cpp_string_location_reader *loc_reader);
+
+extern bool _cpp_valid_utf8 (cpp_reader *pfile,
+			     const uchar **pstr,
+			     const uchar *limit,
+			     int identifier_pos,
+			     struct normalize_state *nst,
+			     cppchar_t *cp);
+
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,
diff --git a/libcpp/lex.c b/libcpp/lex.c
index 16ded6e9b05..15b10cb3f01 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -1313,7 +1313,9 @@  warn_about_normalization (cpp_reader *pfile,
     }
 }
 
-/* Returns TRUE if the sequence starting at buffer->cur is invalid in
+static const cppchar_t utf8_signifier = 0xC0;
+
+/* Returns TRUE if the sequence starting at buffer->cur is valid in
    an identifier.  FIRST is TRUE if this starts an identifier.  */
 static bool
 forms_identifier_p (cpp_reader *pfile, int first,
@@ -1336,17 +1338,25 @@  forms_identifier_p (cpp_reader *pfile, int first,
       return true;
     }
 
-  /* Is this a syntactically valid UCN?  */
-  if (CPP_OPTION (pfile, extended_identifiers)
-      && *buffer->cur == '\\'
-      && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U'))
+  /* Is this a syntactically valid UCN or a valid UTF-8 char?  */
+  if (CPP_OPTION (pfile, extended_identifiers))
     {
       cppchar_t s;
-      buffer->cur += 2;
-      if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
-			  state, &s, NULL, NULL))
-	return true;
-      buffer->cur -= 2;
+      if (*buffer->cur >= utf8_signifier)
+	{
+	  if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
+			       state, &s))
+	    return true;
+	}
+      else if (*buffer->cur == '\\'
+	       && (buffer->cur[1] == 'u' || buffer->cur[1] == 'U'))
+	{
+	  buffer->cur += 2;
+	  if (_cpp_valid_ucn (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
+			      state, &s, NULL, NULL))
+	    return true;
+	  buffer->cur -= 2;
+	}
     }
 
   return false;
@@ -1464,7 +1474,8 @@  lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
   pfile->buffer->cur = cur;
   if (starts_ucn || forms_identifier_p (pfile, false, nst))
     {
-      /* Slower version for identifiers containing UCNs (or $).  */
+      /* Slower version for identifiers containing UCNs
+	 or extended chars (including $).  */
       do {
 	while (ISIDNUM (*pfile->buffer->cur))
 	  {
@@ -3117,12 +3128,12 @@  _cpp_lex_direct (cpp_reader *pfile)
       /* @ is a punctuator in Objective-C.  */
     case '@': result->type = CPP_ATSIGN; break;
 
-    case '$':
-    case '\\':
+    default:
       {
 	const uchar *base = --buffer->cur;
-	struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 
+	/* Check for an extended identifier ($ or UCN or UTF-8).  */
+	struct normalize_state nst = INITIAL_NORMALIZE_STATE;
 	if (forms_identifier_p (pfile, true, &nst))
 	  {
 	    result->type = CPP_NAME;
@@ -3131,13 +3142,21 @@  _cpp_lex_direct (cpp_reader *pfile)
 	    warn_about_normalization (pfile, result, &nst);
 	    break;
 	  }
+
+	/* Otherwise this will form a CPP_OTHER token.  Parse valid UTF-8 as a
+	   single token.  */
 	buffer->cur++;
+	if (c >= utf8_signifier)
+	  {
+	    const uchar *pstr = base;
+	    cppchar_t s;
+	    if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s))
+	      buffer->cur = pstr;
+	  }
+	create_literal (pfile, result, base, buffer->cur - base, CPP_OTHER);
+	break;
       }
-      /* FALLTHRU */
 
-    default:
-      create_literal (pfile, result, buffer->cur - 1, 1, CPP_OTHER);
-      break;
     }
 
   /* Potentially convert the location of the token to a range.  */