Text-processing tools: 「Unicode encoding and decoding」

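While browsing some posts, I saw people asking how to turn certain escaped strings back into readable text. Most replies suggested using an editor's find-and-replace feature. I can't run that editor here, so I can't say how well it works. Here is a quicker way that needs only a single command.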

The tool used here is uni2ascii. Install it first with:

sudo apt install uni2ascii

Once installed, it is ready to use. For example:

  • Chinese text to Unicode escapes
echo -n  "你好" | uni2ascii -a C  -q

Result:
[image: screenshot of the output]

  • On web pages you often run into text in the \u6df1\u5733 form
echo '\u4F60\u597D' | ascii2uni -a U -q

Result:
[image: screenshot of the output]

  • On web pages you also often run into text in the &#x4F60;&#x597D; form
echo '&#x4F60;&#x597D;' | ascii2uni -a H -q

Result:
[image: screenshot of the output]
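
The examples above can be wrapped into a small round-trip check. This is only a rough sketch: the function names `encode_u`, `decode_u` and `roundtrip_ok` are made up for illustration, and it assumes that uni2ascii accepts the same `-a U` format code that ascii2uni does.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only the uni2ascii / ascii2uni calls come from the post.

# Encode non-ASCII characters from stdin as \uXXXX escapes (format U).
encode_u() { uni2ascii -a U -q; }

# Decode \uXXXX escapes from stdin back into UTF-8 text (format U).
decode_u() { ascii2uni -a U -q; }

# Succeeds when encoding and then decoding returns the original string.
roundtrip_ok() {
    local original="$1" restored
    restored="$(printf '%s' "$original" | encode_u | decode_u)"
    [ "$restored" = "$original" ]
}
```

For example, `roundtrip_ok "你好" && echo "round trip OK"` should print the message if both tools are installed and behave as assumed.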

Run ascii2uni -L to list the supported formats:

raw hexadecimal numbers
        (00E9)  R
 standard form hexadecimal numbers
        (0x00E9)        X
 prefix v decimal (Perl format)
        (v233)  2
 prefix $ hexadecimal ($00E9)   3
 prefix 16# hexadecimal (16#00E9)       4
 prefix #x hexadecimal (Common Lisp format) (#x00E9)    1
 prefix #16r hexadecimal (#16r00E9)     5
 prefix \u decimal (\u0233)     V
 prefix \u hexadecimal (\u00E9) U
 prefix \U outside BMP, \u within, hexadecimal (U+0000-U+FFFF)  L
 prefix U hexadecimal (U00E9)   E
 prefix u hexadecimal (u00E9)   F
 prefix %u hexadecimal (%u00E9) 9
 prefix U+ hexadecimal (U+00E9) P
 prefix X with hexadecimal in single quotes (X'00E9')   G
 prefix 16# and suffix # hexadecimal (16#00E9#) 6
 prefix U in anglebrackets hexadecimal (<U00E9>)        A
 prefix backslash-x hexadecimal (\x00E9)        B
 prefix backslash-x hexadecimal in braces (\x{00E9})    C
 HTML numeric character references - decimal (&#0233;)  D
 HTML numeric character references - hexadecimal (&#x00E9;)     H
 SGML numeric character references -decimal (\#0233;)   N
 SGML numeric character references - hexadecimal (\#x00E9;)     M
 octal escapes for 3 low bytes in big-endian order (\000\000\351)       O
 hexadecimal escapes for 3 low bytes in big-endian order
        (\x00\x00\xE9)  S
 decimal escapes for 3 low bytes in big-endian order (\d000\d000\d233)  T
 hexadecimal UTF-8 with each byte's hex preceded by an =-sign (=C3=A9).
                RFC 2045 Quoted Printable.      I
 hexadecimal UTF-8 with each byte's hex preceded by a %-sign  (%C3%A9)
                RFC 2396  URI escape format.    J
 hexadecimal UTF-8 with each byte's hex preceded by a backslash-x  (\xC3\xA9)
                Apache log format.      7
 hexadecimal UTF-8 with each byte's hex surrounded by angle brackets  (<C3><A9>)
                        0
 octal UTF-8 with backslash escapes (\303\251) K
 HTML character entities        Q
 all three types of HTML  escape:  hexadecimal  character references,
        decimal character references, and character entities    Y

The tools can read from standard input and also directly from a file. On Linux, conversion really is that simple: one command does it.
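
Reading from a file could look roughly like this. It is a sketch under the assumption that ascii2uni takes the input file name as its last argument; the function name `decode_file` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode a file of \uXXXX escapes into UTF-8 text.
# $1: input file containing the escapes
# $2: output file that receives the decoded text
decode_file() {
    # Assumption: the input file name is passed as the final argument.
    ascii2uni -a U -q "$1" > "$2"
}
```

A possible use: `printf '%s\n' '\u4F60\u597D' > escaped.txt`, then `decode_file escaped.txt decoded.txt`, then `cat decoded.txt`.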

So jealous, 李内思!