How can I run my own instance of this

263 2020-06-01 01:20

Now that tickets are accessible and I can actually read the issue, here's what happens. In https://bbs.jp.net/sexp/prog/39 the text of >>194 starts with "お疲れさん.", whatever that is, sent as the bytes:

0002ca50  6f 6e 74 65 6e 74 20 28  70 20 28 61 20 28 40 20  |ontent (p (a (@ |
0002ca60  28 68 72 65 66 20 22 2f  70 72 6f 67 2f 33 39 2f  |(href "/prog/39/|
0002ca70  31 39 32 22 29 29 20 22  3e 3e 31 39 32 22 29 20  |192")) ">>192") |
0002ca80  28 62 72 29 20 22 e3 5c  32 30 31 5c 32 31 32 e7  |(br) ".\201\212.|
0002ca90  5c 32 32 36 b2 e3 5c 32  30 32 5c 32 31 34 e3 5c  |\226..\202\214.\|
0002caa0  32 30 31 5c 32 32 35 e3  5c 32 30 32 5c 32 32 33  |201\225.\202\223|
0002cab0  2e 22 20 28 62 72 29 20  22 4e 6f 77 2c 20 49 27  |." (br) "Now, I'|
0002cac0  76 65 20 62 65 65 6e 20  6d 65 61 6e 69 6e 67 20  |ve been meaning |

The relevant bytes are:

>>> s = "e3 5c  32 30 31 5c 32 31 32 e7 5c 32 32 36 b2 e3 5c 32  30 32 5c 32 31 34 e3 5c 32 30 31 5c 32 32 35 e3  5c 32 30 32 5c 32 32 33 2e"
>>> b = bytes (int (t, base = 16) for t in s.split ())
>>> b
b'\xe3\\201\\212\xe7\\226\xb2\xe3\\202\\214\xe3\\201\\225\xe3\\202\\223.'

The original string in utf8 is:

>>> "お疲れさん.".encode ("utf8")
b'\xe3\x81\x8a\xe7\x96\xb2\xe3\x82\x8c\xe3\x81\x95\xe3\x82\x93.'

so it is obvious that we have high bytes followed by backslashed octal escapes. In the bytes of >>64 a textual backslash can be seen to be doubled.

0000deb0  6e 20 20 28 6c 65 74 2a  20 28 28 72 31 20 28 73  |n  (let* ((r1 (s|
0000dec0  74 72 69 6e 67 2d 73 70  6c 69 74 20 72 61 6e 67  |tring-split rang|
0000ded0  65 20 23 5c 5c 2c 29 29  5c 6e 20 20 20 20 20 20  |e #\\,))\n      |

So we just need to process the octals before the utf8 decoding:

>>> import re
>>> f = lambda b: bytes (int (b [4*k+1 : 4*k+4].decode ("ascii"), base=8) for k in range (len (b) // 4))
>>> g = lambda b: re.sub (rb"([\x80-\xff])((\\[0-7]{3})+)", lambda mo: mo.group (1) + f (mo.group (2)), b).decode ("utf-8")
>>> g (b)
'お疲れさん.'

Just do the equivalent of this in elisp and you can have your weeb characters. Someone might send this to the sbbs.el person.

267 2020-06-01 11:35

Thanks to whoever linked >>263 in the ticket.
https://fossil.textboard.org/sbbs/tktview?name=ee2e075a98

>>266

> non-pythonistas

What is a pythonista?

> your code is incomprehensible

Input: raw byte array
Output: unicode characters
1. ([\x80-\xff])((\\[0-7]{3})+)
Scan the input and identify locations where a byte of 0x80 or higher is followed by one or more groups of "\DDD" where the Ds are octal digits.
2. Pass everything else through.
3. For each location, emit that first high byte, then loop over the "\DDD" groups.
4. For each group dump the backslash, take DDD to be an ascii string of three characters, parse that string as an integer in base 8, emit that integer as a byte.
5. After each location has been processed, decode the resulting byte array as utf-8.
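
If the one-liners in >>263 are still hard to read, here is a minimal sketch of the same five steps as plain Python with a named function. The names ESCAPED_RUN and unescape_octal are made up for illustration and are not part of sbbs.el or anything else. It only handles the octal escapes after high bytes; the doubled textual backslash seen in >>64 is left alone.

```python
import re

# Step 1: a byte in 0x80..0xff followed by one or more "\DDD" octal groups.
# (ESCAPED_RUN and unescape_octal are hypothetical names for this sketch.)
ESCAPED_RUN = re.compile(rb"([\x80-\xff])((?:\\[0-7]{3})+)")

def unescape_octal(raw: bytes) -> str:
    def fix(mo):
        # Step 3: keep the leading high byte as-is.
        out = bytearray(mo.group(1))
        groups = mo.group(2)
        # Step 4: each group is 4 bytes: a backslash plus three octal digits.
        for k in range(0, len(groups), 4):
            digits = groups[k + 1 : k + 4].decode("ascii")
            # Raises ValueError if the value exceeds 255 (e.g. "\777").
            out.append(int(digits, 8))
        return bytes(out)

    # Step 2: re.sub passes everything outside the matches through unchanged.
    # Step 5: decode the repaired byte array as utf-8.
    return ESCAPED_RUN.sub(fix, raw).decode("utf-8")

# The bytes of >>194 as they appear in the dump from >>263.
sample = b'\xe3\\201\\212\xe7\\226\xb2\xe3\\202\\214\xe3\\201\\225\xe3\\202\\223.'
print(unescape_octal(sample))
```

Run on the bytes from >>263 this should print the same string that g (b) gave there. Porting it to elisp is a matter of doing the same regexp replacement on a unibyte string before decoding.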
