[v4,11/14,BZ,#14095] update collation data from Unicode / ISO 14651

Message ID s9defl7aird.fsf@taka.site
State New
Series
  • Untitled series #1852

Commit Message

Mike FABIAN Feb. 26, 2018, 3:11 p.m.

Comments

Carlos O'Donell Feb. 26, 2018, 6:27 p.m. | #1
On 02/26/2018 07:11 AM, Mike FABIAN wrote:
> From 19460537f923c9b1ba7668de3b7ac7fa75ce687b Mon Sep 17 00:00:00 2001
> From: Mike FABIAN <mfabian@redhat.com>
> Date: Tue, 23 Jan 2018 17:29:36 +0100
> Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
>  iso14651_t1_common file.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 

Excellent comments. This version looks good and keeps the testing we were
already doing.

OK to commit.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>



> See:
> 
> http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
> 
>> A range expression represents the set of collating elements that fall
>> between two elements in the current collation sequence,
>> inclusively. It is expressed as the starting point and the ending
>> point separated by a hyphen (-).
>>
>> Range expressions must not be used in portable applications because
>> their behaviour is dependent on the collating sequence. Ranges will be
>> treated according to the current collating sequence, and include such
>> characters that fall within the range based on that collating
>> sequence, regardless of character values. This, however, means that
>> the interpretation will differ depending on collating sequence. If,
>> for instance, one collating sequence defines ä as a variant of a,
>> while another defines it as a letter following z, then the expression
>> [ä-z] is valid in the first language and invalid in the second.
> Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
> The new iso14651_t1_common lists upper case and lower case Latin characters
> in a different order than the old one which causes surprising results
> for example in the de_DE locale: [a-z] now includes A because A comes
> after a in iso14651_t1_common but does not include Z because that comes
> after z in iso14651_t1_common.
> 
> 	* posix/tst-fnmatch.input: Use range expressions only in C locale.
> 	* posix/tst-regexloc.c: Do not use a range expression for
>         de_DE.ISO-8859-1 locale.
> ---
>  posix/tst-fnmatch.input | 58 +++++++++++++++++++++++++++++++++++--------------
>  posix/tst-regexloc.c    |  4 ++--
>  2 files changed, 44 insertions(+), 18 deletions(-)
> 
> diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
> index 88b3f739a5..589fb2a940 100644
> --- a/posix/tst-fnmatch.input
> +++ b/posix/tst-fnmatch.input
> @@ -418,21 +418,47 @@ C		"-"			"[Z-\\]]"	       NOMATCH
>  # Following are tests outside the scope of IEEE 2003.2 since they are using
>  # locales other than the C locale.  The main focus of the tests is on the
>  # handling of ranges and the recognition of character (vs bytes).
> +#
> +# See:
> +#
> +# http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
> +#
> +# > A range expression represents the set of collating elements that fall
> +# > between two elements in the current collation sequence,
> +# > inclusively. It is expressed as the starting point and the ending
> +# > point separated by a hyphen (-).
> +# >
> +# > Range expressions must not be used in portable applications because
> +# > their behaviour is dependent on the collating sequence. Ranges will be
> +# > treated according to the current collating sequence, and include such
> +# > characters that fall within the range based on that collating
> +# > sequence, regardless of character values. This, however, means that
> +# > the interpretation will differ depending on collating sequence. If,
> +# > for instance, one collating sequence defines ä as a variant of a,
> +# > while another defines it as a letter following z, then the expression
> +# > [ä-z] is valid in the first language and invalid in the second.
> +#
> +# Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
> +# The new iso14651_t1_common lists upper case and lower case Latin characters
> +# in a different order than the old one which causes surprising results
> +# for example in the de_DE locale: [a-z] now includes A because A comes
> +# after a in iso14651_t1_common but does not include Z because that comes
> +# after z in iso14651_t1_common.

OK. Great comment!

>  de_DE.ISO-8859-1 "a"			"[a-z]"		       0
>  de_DE.ISO-8859-1 "z"			"[a-z]"		       0
>  de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
>  de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
>  de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
> -de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH
> +de_DE.ISO-8859-1 "A"			"[a-z]"		       0 # surprising but correct!

OK.

>  de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH
> -de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
> -de_DE.ISO-8859-1 "Ö"			"[a-z]"		       NOMATCH
> -de_DE.ISO-8859-1 "Ü"			"[a-z]"		       NOMATCH
> +de_DE.ISO-8859-1 "Ä"			"[a-z]"		       0 # surprising but correct!
> +de_DE.ISO-8859-1 "Ö"			"[a-z]"		       0 # surprising but correct!
> +de_DE.ISO-8859-1 "Ü"			"[a-z]"		       0 # surprising but correct!

OK.

>  de_DE.ISO-8859-1 "a"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "z"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ä"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ö"			"[A-Z]"		       NOMATCH
> -de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
> +de_DE.ISO-8859-1 "z"			"[A-Z]"		       0 # surprising but correct!
> +de_DE.ISO-8859-1 "ä"			"[A-Z]"		       0 # surprising but correct!
> +de_DE.ISO-8859-1 "ö"			"[A-Z]"		       0 # surprising but correct!
> +de_DE.ISO-8859-1 "ü"			"[A-Z]"		       0 # surprising but correct!

OK.

>  de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
>  de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
>  de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
> @@ -515,16 +541,16 @@ de_DE.UTF-8	 "z"			"[a-z]"		       0
>  de_DE.UTF-8	 "ä"			"[a-z]"		       0
>  de_DE.UTF-8	 "ö"			"[a-z]"		       0
>  de_DE.UTF-8	 "ü"			"[a-z]"		       0
> -de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
> +de_DE.UTF-8	 "A"			"[a-z]"		       0 # surprising but correct!

OK.

>  de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ö"			"[a-z]"		       NOMATCH
> -de_DE.UTF-8	 "Ü"			"[a-z]"		       NOMATCH
> +de_DE.UTF-8	 "Ä"			"[a-z]"	       0 # surprising but correct!
> +de_DE.UTF-8	 "Ö"			"[a-z]"	       0 # surprising but correct!
> +de_DE.UTF-8	 "Ü"			"[a-z]"	       0 # surprising but correct!

OK.

>  de_DE.UTF-8	 "a"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "z"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ä"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
> -de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
> +de_DE.UTF-8	 "z"			"[A-Z]"		       0 # surprising but correct!
> +de_DE.UTF-8	 "ä"			"[A-Z]"	       0 # surprising but correct!
> +de_DE.UTF-8	 "ö"			"[A-Z]"	       0 # surprising but correct!
> +de_DE.UTF-8	 "ü"			"[A-Z]"	       0 # surprising but correct!

OK.

>  de_DE.UTF-8	 "A"			"[A-Z]"		       0
>  de_DE.UTF-8	 "Z"			"[A-Z]"		       0
>  de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
> diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
> index 60235b4d3b..7fbc496d0c 100644
> --- a/posix/tst-regexloc.c
> +++ b/posix/tst-regexloc.c
> @@ -29,8 +29,8 @@ do_test (void)
>  
>    if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
>      puts ("cannot set locale");
> -  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
> -    puts ("cannot compile expression \"[a-f]*\"");
> +  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
> +    puts ("cannot compile expression \"[abcdef]*\"");

OK.

>    else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
>      puts ("no match");
>    else
> -- 2.14.3


-- 
Cheers,
Carlos.

Patch

From 19460537f923c9b1ba7668de3b7ac7fa75ce687b Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Tue, 23 Jan 2018 17:29:36 +0100
Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
 iso14651_t1_common file.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

See:

http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html

> A range expression represents the set of collating elements that fall
> between two elements in the current collation sequence,
> inclusively. It is expressed as the starting point and the ending
> point separated by a hyphen (-).
>
> Range expressions must not be used in portable applications because
> their behaviour is dependent on the collating sequence. Ranges will be
> treated according to the current collating sequence, and include such
> characters that fall within the range based on that collating
> sequence, regardless of character values. This, however, means that
> the interpretation will differ depending on collating sequence. If,
> for instance, one collating sequence defines ä as a variant of a,
> while another defines it as a letter following z, then the expression
> [ä-z] is valid in the first language and invalid in the second.

Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
The new iso14651_t1_common lists upper case and lower case Latin characters
in a different order than the old one which causes surprising results
for example in the de_DE locale: [a-z] now includes A because A comes
after a in iso14651_t1_common but does not include Z because that comes
after z in iso14651_t1_common.

	* posix/tst-fnmatch.input: Use range expressions only in C locale.
	* posix/tst-regexloc.c: Do not use a range expression for
        de_DE.ISO-8859-1 locale.
---
 posix/tst-fnmatch.input | 58 +++++++++++++++++++++++++++++++++++--------------
 posix/tst-regexloc.c    |  4 ++--
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 88b3f739a5..589fb2a940 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -418,21 +418,47 @@  C		"-"			"[Z-\\]]"	       NOMATCH
 # Following are tests outside the scope of IEEE 2003.2 since they are using
 # locales other than the C locale.  The main focus of the tests is on the
 # handling of ranges and the recognition of character (vs bytes).
+#
+# See:
+#
+# http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
+#
+# > A range expression represents the set of collating elements that fall
+# > between two elements in the current collation sequence,
+# > inclusively. It is expressed as the starting point and the ending
+# > point separated by a hyphen (-).
+# >
+# > Range expressions must not be used in portable applications because
+# > their behaviour is dependent on the collating sequence. Ranges will be
+# > treated according to the current collating sequence, and include such
+# > characters that fall within the range based on that collating
+# > sequence, regardless of character values. This, however, means that
+# > the interpretation will differ depending on collating sequence. If,
+# > for instance, one collating sequence defines ä as a variant of a,
+# > while another defines it as a letter following z, then the expression
+# > [ä-z] is valid in the first language and invalid in the second.
+#
+# Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
+# The new iso14651_t1_common lists upper case and lower case Latin characters
+# in a different order than the old one which causes surprising results
+# for example in the de_DE locale: [a-z] now includes A because A comes
+# after a in iso14651_t1_common but does not include Z because that comes
+# after z in iso14651_t1_common.
 de_DE.ISO-8859-1 "a"			"[a-z]"		       0
 de_DE.ISO-8859-1 "z"			"[a-z]"		       0
 de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
 de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
 de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
-de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "A"			"[a-z]"		       0 # surprising but correct!
 de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ö"			"[a-z]"		       NOMATCH
-de_DE.ISO-8859-1 "Ü"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "Ä"			"[a-z]"		       0 # surprising but correct!
+de_DE.ISO-8859-1 "Ö"			"[a-z]"		       0 # surprising but correct!
+de_DE.ISO-8859-1 "Ü"			"[a-z]"		       0 # surprising but correct!
 de_DE.ISO-8859-1 "a"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "z"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ä"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ö"			"[A-Z]"		       NOMATCH
-de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
+de_DE.ISO-8859-1 "z"			"[A-Z]"		       0 # surprising but correct!
+de_DE.ISO-8859-1 "ä"			"[A-Z]"		       0 # surprising but correct!
+de_DE.ISO-8859-1 "ö"			"[A-Z]"		       0 # surprising but correct!
+de_DE.ISO-8859-1 "ü"			"[A-Z]"		       0 # surprising but correct!
 de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
 de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
 de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
@@ -515,16 +541,16 @@  de_DE.UTF-8	 "z"			"[a-z]"		       0
 de_DE.UTF-8	 "ä"			"[a-z]"		       0
 de_DE.UTF-8	 "ö"			"[a-z]"		       0
 de_DE.UTF-8	 "ü"			"[a-z]"		       0
-de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "A"			"[a-z]"		       0 # surprising but correct!
 de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ö"			"[a-z]"		       NOMATCH
-de_DE.UTF-8	 "Ü"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "Ä"			"[a-z]"	       0 # surprising but correct!
+de_DE.UTF-8	 "Ö"			"[a-z]"	       0 # surprising but correct!
+de_DE.UTF-8	 "Ü"			"[a-z]"	       0 # surprising but correct!
 de_DE.UTF-8	 "a"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "z"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ä"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
-de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
+de_DE.UTF-8	 "z"			"[A-Z]"		       0 # surprising but correct!
+de_DE.UTF-8	 "ä"			"[A-Z]"	       0 # surprising but correct!
+de_DE.UTF-8	 "ö"			"[A-Z]"	       0 # surprising but correct!
+de_DE.UTF-8	 "ü"			"[A-Z]"	       0 # surprising but correct!
 de_DE.UTF-8	 "A"			"[A-Z]"		       0
 de_DE.UTF-8	 "Z"			"[A-Z]"		       0
 de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
index 60235b4d3b..7fbc496d0c 100644
--- a/posix/tst-regexloc.c
+++ b/posix/tst-regexloc.c
@@ -29,8 +29,8 @@  do_test (void)
 
   if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
     puts ("cannot set locale");
-  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
-    puts ("cannot compile expression \"[a-f]*\"");
+  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
+    puts ("cannot compile expression \"[abcdef]*\"");
   else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
     puts ("no match");
   else
-- 
2.14.3
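
As a rough illustration of the point made in the commit message above, the
following C sketch contrasts a range expression with an explicit bracket
list. The helper names glob_matches and regex_prefix_length are hypothetical
and are not part of glibc or of this patch; they only wrap the standard
fnmatch, regcomp, regexec, and regfree calls. The result for a range such as
[a-z] depends on the collation data of the locale selected with setlocale,
whereas an explicit list such as [abcdef] names its members directly.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if STRING matches the glob PATTERN under
   the locale currently selected with setlocale, 0 otherwise.  */
int
glob_matches (const char *pattern, const char *string)
{
  assert (pattern != NULL && string != NULL);
  return fnmatch (pattern, string, 0) == 0;
}

/* Hypothetical helper: compile PATTERN as a POSIX basic regular expression
   and return the length of the match that starts at offset 0 of STRING.
   Return -1 if compilation fails or nothing matches at offset 0.  */
int
regex_prefix_length (const char *pattern, const char *string)
{
  regex_t re;
  regmatch_t mat[1];
  int len = -1;

  assert (pattern != NULL && string != NULL);
  if (regcomp (&re, pattern, 0) != 0)
    return -1;
  if (regexec (&re, string, 1, mat, 0) == 0 && mat[0].rm_so == 0)
    len = (int) mat[0].rm_eo;
  regfree (&re);
  return len;
}
```

In the C locale, glob_matches ("[a-z]", "A") is expected to return 0.
According to the updated expectations in posix/tst-fnmatch.input, the same
call would return 1 after a successful setlocale (LC_ALL, "de_DE.UTF-8")
with the new iso14651_t1_common, provided that locale is installed.
regex_prefix_length ("[abcdef]*", "abcdefCDEF") should give 6, because the
explicit list does not pick up the upper-case letters that follow.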