For a project I needed to parse characters from specific Unicode categories.
Whilst it seems simple to create regexes for them, some work is involved. This post will tell you how you can do it quickly...
Firstly, we need a website displaying all the characters per category. Luckily fileformat.info
has very extensive lists for exactly this. However, we need that list as a regex. And no, the site won't just do that for you.
So while you can start doing this yourself, I should perhaps warn you that there are a few thousand characters
in the Unicode standard. Shall we just use YQL
instead? Finally an interesting project to use it for :)
We use YQL to filter out the codes and get them back as a list. The query goes something like this:
and href like '/info/unicode/char/%'
and href like '%/index.htm'
This fetches the document and picks out all the <a> tags. It then checks whether the link target matches a specific URL pattern. YQL only supports LIKE with a leading or trailing % sign, so we have to check twice: once for the prefix and once for the suffix. Finally, the query only returns the text from each tag, because that's what we're interested in.
The result shows a list of Unicode character codes (in U+XXXX form).
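If you'd rather do the same filtering offline, here is a minimal Python sketch of the idea, using only the standard library. It mirrors the two LIKE checks with a prefix test and a suffix test on each href. The class and function names and the sample markup are made up for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser


class CharLinkParser(HTMLParser):
    """Collect the text of <a> tags whose href looks like a character page."""

    def __init__(self):
        super().__init__()
        self.codes = []          # collected link texts, e.g. "U+0041"
        self._in_match = False   # True while inside a matching <a> tag

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Same two checks as the LIKE clauses: a fixed prefix and a fixed suffix.
        if href.startswith("/info/unicode/char/") and href.endswith("/index.htm"):
            self._in_match = True
            self.codes.append("")

    def handle_data(self, data):
        if self._in_match:
            self.codes[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_match = False


def extract_codes(html):
    """Return the link texts of all matching <a> tags in the given HTML."""
    parser = CharLinkParser()
    parser.feed(html)
    parser.close()
    return [code.strip() for code in parser.codes]


# Hypothetical sample markup, not copied from the real site.
SAMPLE = """
<table>
  <tr><td><a href="/info/unicode/char/0041/index.htm">U+0041</a></td></tr>
  <tr><td><a href="/info/unicode/char/0042/index.htm">U+0042</a></td></tr>
  <tr><td><a href="/info/unicode/category/index.htm">Categories</a></td></tr>
  <tr><td><a href="/info/unicode/char/0043/browsertest.htm">test</a></td></tr>
</table>
"""

print(extract_codes(SAMPLE))  # ['U+0041', 'U+0042']
```

Only the first two links pass both checks; the other two fail the prefix or the suffix test.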
Now, to format this list into a regex, go to your favorite text editor, paste the list and replace all instances of ","U+ with |\u
With the example
you should end up with this snippet:
And that's the regex :)
Note that there's now an updated version, see this post
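The search-and-replace step above can also be scripted. Below is a rough Python sketch, assuming you already have the codes as a list of strings like "U+0041". The function name is hypothetical. One assumption to be aware of: \u takes exactly four hex digits, so for longer codes this sketch falls back to Python's eight-digit \U escape, which other regex flavours may not understand.

```python
import re


def codes_to_pattern(codes):
    """Join code points such as "U+0041" into an alternation of escapes."""
    escapes = []
    for code in codes:
        value = int(code[2:], 16)  # drop the "U+" prefix, parse the hex digits
        if value <= 0xFFFF:
            escapes.append("\\u%04X" % value)
        else:
            # Python-specific escape for code points beyond four hex digits.
            escapes.append("\\U%08X" % value)
    return "|".join(escapes)


pattern = codes_to_pattern(["U+0041", "U+0042", "U+00C0"])
print(pattern)  # \u0041|\u0042|\u00C0

# Quick check with Python's own regex engine.
print(re.fullmatch(pattern, "B") is not None)  # True
print(re.fullmatch(pattern, "b") is not None)  # False
```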
You can find some examples here
(56k script..). That script contains the regex for Unicode categories "Uppercase Letter" (Lu), "Lowercase Letter" (Ll), "Titlecase letter" (Lt), "Modifier letter" (Lm), "Other letter" (Lo), "Letter number" (Nl), "Non-spacing mark" (Mn), "Combining spacing mark" (Mc), "Decimal number" (Nd) and "Connector punctuation" (Pc). Bonus question: Do you see what parser I'm writing? ;)
Now, these regexes could be improved by combining ranges together. That's not needed for my project so I didn't.
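For anyone who does want the shorter form, here is one possible way to merge consecutive code points into ranges, sketched in Python. This is not what the script linked above does; the helper names are invented, and the \U fallback for codes beyond four hex digits is again Python-specific.

```python
import re


def merge_ranges(code_points):
    """Group code points into (start, end) runs of consecutive values."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges


def escape(cp):
    """Format one code point as a regex escape."""
    return "\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp


def ranges_to_class(ranges):
    """Render the runs as a single character class."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(escape(start))
        else:
            parts.append(escape(start) + "-" + escape(end))
    return "[" + "".join(parts) + "]"


points = [0x41, 0x42, 0x43, 0xC0, 0xC1, 0x100]
char_class = ranges_to_class(merge_ranges(points))
print(char_class)  # [\u0041-\u0043\u00C0-\u00C1\u0100]
print(re.fullmatch(char_class, "B") is not None)  # True
```

Six alternatives collapse into three entries here; on a full category the saving is much larger, since most categories contain long consecutive runs.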
Hope it helps you!