[Project_owners] Re: Validating javascript identifiers

Eric T. cougio at gmail.com
Fri Sep 9 13:41:27 EDT 2005


I've found a blog post
(http://www.paraesthesia.com/blog/comments.php?id=809_0_1_0_C) with a
.NET snippet that generates a list of Unicode character ranges. Just what I
wanted... so after fixing a bug in the script and fixing a bug in my
patch (eh), here's what I'm looking for, according to Microsoft .NET
System.Globalization:

\u0041-\u005a\u0061-\u007a\u00aa\u00b5\u00ba\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u01ba\u01bc-\u01bf\u01c4-\u021f\u0222-\u0233\u0250-\u02ad\u02b0-\u02b8\u02bb-\u02c1\u02d0-\u02d1\u02e0-\u02e4\u02ee\u037a\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03ce\u03d0-\u03d7\u03da-\u03f3\u0400-\u0481\u048c-\u04c4\u04c7-\u04c8\u04cb-\u04cc\u04d0-\u04f5\u04f8-\u04f9\u0531-\u0556\u0559\u0561-\u0587\u0640\u06e5-\u06e6\u0e46\u0ec6\u10a0-\u10c5\u1843\u1e00-\u1e9b\u1ea0-\u1ef9\u1f00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u207f\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2131\u2133-\u2134\u2139\u3005\u3031-\u3035\u309d-\u309e\u30fc-\u30fe\ufb00-\ufb06\ufb13-\ufb17\uff21-\uff3a\uff41-\uff5a\uff70

Used something like this:

const UNI_LETTER_BLOCKS =
"\\u0041-\\u005a\\u0061-\\u007a\u00aa\u00b5\u00ba\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u01ba\\u01bc-\\u01bf\\u01c4-\\u021f\\u0222-\\u0233\\u0250-\\u02ad\\u02b0-\\u02b8\\u02bb-\\u02c1\\u02d0-\\u02d1\\u02e0-\\u02e4\u02ee\u037a\u0386\\u0388-\\u038a\u038c\\u038e-\\u03a1\\u03a3-\\u03ce\\u03d0-\\u03d7\\u03da-\\u03f3\\u0400-\\u0481\\u048c-\\u04c4\\u04c7-\\u04c8\\u04cb-\\u04cc\\u04d0-\\u04f5\\u04f8-\\u04f9\\u0531-\\u0556\u0559\\u0561-\\u0587\u0640\\u06e5-\\u06e6\u0e46\u0ec6\\u10a0-\\u10c5\u1843\\u1e00-\\u1e9b\\u1ea0-\\u1ef9\\u1f00-\\u1f15\\u1f18-\\u1f1d\\u1f20-\\u1f45\\u1f48-\\u1f4d\\u1f50-\\u1f57\u1f59\u1f5b\u1f5d\\u1f5f-\\u1f7d\\u1f80-\\u1fb4\\u1fb6-\\u1fbc\u1fbe\\u1fc2-\\u1fc4\\u1fc6-\\u1fcc\\u1fd0-\\u1fd3\\u1fd6-\\u1fdb\\u1fe0-\\u1fec\\u1ff2-\\u1ff4\\u1ff6-\\u1ffc\u207f\u2102\u2107\\u210a-\\u2113\u2115\\u2119-\\u211d\u2124\u2126\u2128\\u212a-\\u212d\\u212f-\\u2131\\u2133-\\u2134\u2139\u3005\\u3031-\\u3035\\u309d-\\u309e\\u30fc-\\u30fe\\ufb00-\\ufb06\\ufb13-\\ufb17\\uff21-\\uff3a\\uff41-\\uff5a\uff70";

var isWordReg = new RegExp("^[" + UNI_LETTER_BLOCKS + "]+$");

if (isWordReg.test("Éphémère")) alert('"Éphémère" is a word!');

Ignoring the fact that the name gives me shivers, can anyone think of
an easy way of making sure MS followed the standard this time (and
Mozilla too, for that matter...)? I can of course rerun the loop and
match each character against the generated regexp to make sure there's
no bug in my code, which I will do, but that says nothing about whether
every character is in the right category... or even whether the official
categories make sense in this context. The above should match
Uppercase letter, Lowercase letter, Titlecase letter and Modifier
letter. I left out Other letter because its block list is by itself
twice the size of the rest and matches stuff like a 2 with a dash in it,
which I don't think I want to match...
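
The loop check mentioned above could be sketched roughly like this. It
is only a sketch under assumptions: the ranges are kept as [first, last]
pairs, SAMPLE_RANGES holds just the first few entries of the list above,
and the helper names (hex4, buildCharClass, inRanges, findMismatches)
are made up for illustration. It only proves the regexp agrees with the
range list, not that the list itself is right.

```javascript
// Assumption: a short sample of the ranges listed above, as
// [first, last] pairs of UTF-16 code units.
const SAMPLE_RANGES = [
  [0x0041, 0x005a], [0x0061, 0x007a], [0x00aa, 0x00aa],
  [0x00b5, 0x00b5], [0x00ba, 0x00ba], [0x00c0, 0x00d6],
  [0x00d8, 0x00f6]
];

// Format a number as four lowercase hex digits, e.g. 0xaa -> "00aa".
function hex4(n) {
  return ("0000" + n.toString(16)).slice(-4);
}

// Turn the pairs into the body of a regexp character class.
function buildCharClass(ranges) {
  return ranges.map(function (r) {
    return r[0] === r[1]
      ? "\\u" + hex4(r[0])
      : "\\u" + hex4(r[0]) + "-\\u" + hex4(r[1]);
  }).join("");
}

// Plain numeric membership test, independent of the regexp engine.
function inRanges(code, ranges) {
  return ranges.some(function (r) {
    return code >= r[0] && code <= r[1];
  });
}

// Walk every BMP code unit and collect those where the regexp and the
// numeric test disagree.
function findMismatches(ranges) {
  const re = new RegExp("^[" + buildCharClass(ranges) + "]$");
  const mismatches = [];
  for (let code = 0; code <= 0xffff; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch) !== inRanges(code, ranges)) {
      mismatches.push(code);
    }
  }
  return mismatches;
}
```

An empty array from findMismatches(SAMPLE_RANGES) means the generated
character class and the range list agree for every code unit.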

And that doesn't solve the problem of checking whether it's usable as an
identifier (I could just try {} it and catch the syntax error telling me
it can't, I guess...).
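
The try/catch idea could look something like the sketch below. The name
isUsableIdentifier is hypothetical, and the pre-filter is my own
assumption: without it, strings such as "a; b" or "a = 1" would parse
happily inside a var statement and be reported as identifiers. Escape
sequences inside identifiers are deliberately rejected to keep it short.

```javascript
// Hypothetical helper, not part of any library.
function isUsableIdentifier(name) {
  // Assumption: let through only identifier-like ASCII characters plus
  // non-ASCII, non-whitespace characters; the parser judges the rest.
  if (!/^[A-Za-z0-9_$\u0080-\uffff]+$/.test(name) || /\s/.test(name)) {
    return false;
  }
  try {
    // new Function only compiles the body; it is never called here.
    new Function("var " + name + ";");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) {
      return false;
    }
    throw e; // anything else is not a verdict on the name
  }
}
```

With this, isUsableIdentifier("if") and isUsableIdentifier("1abc") both
return false, since the parser rejects reserved words and leading
digits. The answer reflects whatever the host engine accepts, which is
exactly the point, but it also means results can differ between engines.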

But I really wish someone would address bug 258974 following the
official guidelines: http://www.unicode.org/reports/tr18/tr18-3.html

I wrote a little GUI for the generator (in .NET...): you pick the
categories you want and it outputs the ranges you need to match for
testing. I'll upload it somewhere if someone is interested in taking a
look.

Boy oh boy the more I look the more I'm lost. Who ever said text was simple??

