Everything is Unicode, until the exploits started rolling in

1 2021-01-21 13:00

https://news.ycombinator.com/item?id=25857729
ASCII Chads win again.

2 2021-01-21 17:04 *

The new generation: Unicode=Text ASCII=Byte #foke emoji strike force#

3 2021-01-21 19:12

The problem itself was raised because the implementation assumed UTF-8 where, by the standard, it should have expected ASCII.
The problem is not a Unicode problem, but one of implementation.

4 2021-01-21 21:22

Go is a bad programming language, designed by the same band of idiots responsible for forcing UTF-8 on the world, but ASCII is also a terrible encoding. I've written at length about this topic.

Now, when I look at the discussions here, I don't expect to see a link to Hacker News accompanied by a single throwaway sentence. This discussion should be deleted by the owner, because it's low quality.

5 2021-01-22 04:41

>>4
Why do you think UTF-8 is bad?
Among the encodings I've seen (not many, granted) it's so far the best...

6 2021-01-22 08:24

>>4
Anything BCD is terrible; compared to that, ASCII is the greatest thing I've ever seen.

7 2021-01-22 08:42 *

>>5
He's probably just a contrarian. Everyone says UTF-8 is good, so he's gotta say "well, actshuly ...".

8 2021-01-22 09:36 *

>>7
There are ways to make it better: no BOM, dynamic space and bit sizes, something about "text width". But the theme of this place is being contrarian against things like Plan 9. Read the Unix-Haters Handbook for reference.

9 2021-01-22 15:10

>>5,7
Using UTF-8 for everything is sort of like using lists for everything without building higher abstractions or considering alternative data structures. If you're lucky it works, but it makes mistakes far easier and is often inefficient (without a high-level architecture). The argument made by Verisimilitudes and accepted by at least a few others (including myself) is that using a well-designed binary encoding is more correct than using character streams with serialization, parsing, coercion, etc.

Concerning written language, at first glance characters would seem to be the most natural and correct representation, but if one is willing to look, one will find that there can be much more efficient encodings (depending to some extent on the morphology of the language). In English and other nearly isolating languages, encoding text as words rather than characters, for example, would be far more efficient in the base case, while encoding as characters could be preserved for the exceptional situation of encoding language which can't be found in a dictionary, for example the colloquial speech often used in Mark Twain's dialogues.
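
Roughly, a word-level encoder might look like this toy Python sketch (the dictionary, fallback rule, and code layout here are all invented for illustration, not anyone's actual proposal):

    # Toy word-level encoding: one small integer code per dictionary word,
    # with a character-by-character fallback for anything outside the dictionary.
    DICTIONARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
    WORD_CODE = {w: i for i, w in enumerate(DICTIONARY)}

    def encode(text):
        out = []
        for word in text.lower().split():
            if word in WORD_CODE:
                out.append(WORD_CODE[word])   # dictionary hit: a single code
            else:
                out.extend(word)              # miss: fall back to spelling it out
        return out

    print(encode("the quick brown fox jumps over the lazy dog"))
    print(encode("the frabjous dog"))         # "frabjous" falls back to characters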

>>6
BCD is an issue, yes, as is the (insane) complexity of normalization, rendering, and properties, but even beyond the implementation of Unicode there are considerable issues with the fundamental principle. Unifying the encoding system of many different languages horizontally results in some languages being considerably more efficient than others, even when using hacks to increase the efficiency of the inefficient languages. This is best seen in CJK: despite Han Unification, which fuses Chinese, Japanese, and Korean and makes it impossible to mix any combination of the three languages in a single file (unless you're writing Korean or Japanese without Han characters), the encoding of these languages still takes up an extra byte (or two!) more than necessary per glyph, as a comparison with, for example, Shift-JIS shows.
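
For a concrete byte count (using nothing beyond Python's standard codecs):

    # The same Japanese text under UTF-8 and Shift-JIS: three bytes per
    # character versus two, the "extra byte per glyph" mentioned above.
    text = "日本語のテキスト"
    print(len(text.encode("utf-8")))      # 24 bytes
    print(len(text.encode("shift_jis")))  # 16 bytes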

10 2021-01-22 16:45

english could be encoded unambiguously as sequences of phoneme bundles
https://www.dyslexia-reading-well.com/44-phonemes-in-english.html
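
With only 44 phonemes, six bits per phoneme would already be enough; a toy packing in Python (with a made-up, incomplete inventory, purely to show the arithmetic):

    # Pack a phoneme sequence into an integer, 6 bits per phoneme.
    PHONEMES = ["p", "b", "t", "d", "k", "g", "s", "z", "m", "n", "ih", "ae"]
    INDEX = {p: i for i, p in enumerate(PHONEMES)}

    def pack(phonemes):
        value = 0
        for p in phonemes:
            value = (value << 6) | INDEX[p]
        return value

    print(bin(pack(["k", "ae", "t"])))   # "cat" as /k ae t/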

11 2021-01-22 17:33 *

>>10
Or we could just collectively agree on using a sane language instead of this abomination. I vote for Esperanto.

12 2021-01-22 19:09

>>9
You're saying that, say, Big5 is more efficient than ascii?

13 2021-01-23 01:57 *

>>11
I vote protoarabic, the language of biology.

14 2021-01-23 02:36

>>10
You can only sort of do this; English doesn't have standardized pronunciation, and you still have to deal with ambiguity in the language itself.

>>11
Half of the abomination is the lack of inflection. Why not Russian, Latin, or Ancient Greek? (and this is why we can't have a standard...)

>>12
ASCII and Big5 are both similar to Shift-JIS as far as the point I was trying to make about CJK goes. If you were referring to encoding language as words in nearly isolating languages: interestingly, Chinese characters aren't each words, since compounds are so common that most words are composed of a couple of characters; in a 16-bit encoding this would often make the efficiency fairly close to that of English in ASCII. Verisimilitude in http://verisimilitudes.net/2018-06-06 proposes that English be encoded using a 24-bit encoding, allowing for 16,777,216 words to be stored in the dictionary and out-competing 8-bit ASCII for words longer than three letters (or longer than two, once you count the advantage with regards to encoding word separators and punctuation in the higher-level encoding).
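
Back-of-the-envelope, the comparison for an ordinary English sentence (helper names invented; this is only the arithmetic, not the actual proposal):

    # 8-bit ASCII (1 byte per character, separators included in the string)
    # versus a hypothetical 24-bit word code (3 bytes per word, separators implicit).
    def ascii_bytes(sentence):
        return len(sentence)

    def word_code_bytes(sentence):
        return 3 * len(sentence.split())

    s = "the encoding of language as words rather than characters"
    print(ascii_bytes(s), word_code_bytes(s))   # 56 vs 27 bytes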

15 2021-01-23 02:45 *

>>14
Sorry for the mistakes, my home has been filled with neurotoxic gas for a few days and I'm not quite operating completely correctly because of this.

16 2021-01-23 05:24

>>12
might as well use toki-pona, if we are to limit vocabulary to a fixed number of words.
Honestly, you do realize words come and go, right? People who complain that emojis are taking up precious space in the unicode standard ought to be aware that exactly this would happen with a dictionary-styled encoding.
Moreover, are we really that pressed in space (memory/storage) that we need to optimize language for it?
Encoding root morphemes might be a better idea. But then you realize that upon mixing they transform in so many different ways (a change of vowel here, dropping/adding a consonant, etc.) that implementing that would be 1000x more cumbersome than the whole unicode standard.

Remember that oftentimes, the simpler idea is the more correct.

17 2021-01-23 05:36 *

>>14 >>http://verisimilitudes.net/2018-06-06
Wouldn't text rendering still be expensive?

18 2021-01-23 07:49

tfw you can't invent neologisms or new compound words due to them not being in the 3-byte word collection and have to revert back to alphabetic encodings (Not Safe for China)

19 2021-01-23 07:50

tfw Communist party reallocates politically incorrect words from their positions to /dev/null and you can't talk about them anymore since they don't exist.

20 2021-01-23 14:19

>>16

Honestly, you do realize words come and go, right? People who complain that emojis are taking up precious space in the unicode standard ought to be aware that exactly this would happen with a dictionary-styled encoding.

I don't have an issue with adding words to the dictionary, old words are still useful for when they show up in old texts, and new when they show up in new. Archaic spellings I would have an issue with, but spelling reforms are relatively rare, and can often be handled by a translation layer.

Moreover, are we really that pressed in space (memory/storage) that we need to optimize language for it?

The reason I had to buy 4gb of ram for my machine and upgrade in the first place was because of how exceptionally inefficient MathJax/MathML are. My machine could play high-definition video just fine, but a hundred posts in a discussion forum using mathematical notation would force me into swap and take several minutes. My understanding is that Verisimilitudes also plans on extending the encoding to high-level typesetting information.

Encoding root morphemes might be a better idea. But then you realize that upon mixing they transform in so many different ways (a change of vowel here, dropping/adding a consonant, etc) that implementing that would be 1000x more cumbersome that the whole unicode standard.

I believe there are plans to add some degree of morpheme comprehension for non-isolating languages. Irregularities in natural languages are common, and I agree that this would be the most complex part. The difference between this and the complexity of the Unicode standard (in e.g. character properties) is that this information is to some extent necessary. So the information encoded here is often going to exist on your machine anyway for semantic analysis, and it makes some sorts of analysis trivial, such as spell checking and word counting. Additionally, there are problems with Unicode that the user does not need to worry about in this system, namely normalization, so the call site would likely still be simpler.

>>17

Wouldn't text rendering still be expensive?

This mostly just depends on the implementation. In the most basic case I think you would drop support for exclusively vertical scripts and have no combining characters (ideally the latter would not exist in the most complex case either). You're then doing three things: each text block is assigned an encoding, and based on that encoding you render either right to left or left to right; you then look up words in the dictionary; and lastly you render the glyphs as bitmaps. I don't recall Verisimilitudes commenting on this, so these are just my personal thoughts, and in general the things I say on this topic are just my interpretation of his ideas.
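
In rough Python, with entirely invented tables, those three steps might be as little as:

    # 1. per-block direction, 2. dictionary lookup, 3. glyphs to bitmaps.
    GLYPHS = {"a": "<bitmap a>", "b": "<bitmap b>"}   # stand-in glyph store
    DICTIONARY = {0: "ab", 1: "ba"}                   # stand-in word table

    def render_block(direction, word_codes):
        words = [DICTIONARY[code] for code in word_codes]
        if direction == "rtl":
            words.reverse()
        return [[GLYPHS[ch] for ch in word] for word in words]

    print(render_block("ltr", [0, 1]))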

>>18,19

tfw you can't invent neologisms or new compound words due to them not being in the 3-byte word collection and have to revert back to alphabetic encodings (Not Safe for China)
tfw Communist party reallocates politically incorrect words from their positions to /dev/null and you can't talk about them anymore since they don't exist.

tfw US businesses box up and ship all of US industry to China, and then buy up all the news and media outlets along with most of the modes of communication and convince the US that it was China's fault that the US no longer has skilled labor, commodities are trash, poisonous, etc. They then manage to convince the US that any overt use of force by any non-US-proxy state is grounds for foreign intervention and loads of war-booty for the khan and his cronies directing it all.
Joking aside, part of the beauty of this system is that it treats different subencodings differently, so for Chinese characters the fallback encoding wouldn't be pinyin but simply a 16-bit encoding of Chinese characters. Avoiding “regulatory capture”, if you like, is not really the responsibility of the standard but of the standards body, and while perhaps an interesting discussion, it is not one I have a particular opinion on. Pinyin would likely be an entirely different subencoding altogether.

21 2021-01-23 14:39

Let's just make Classical Chinese the standard language for the internet and the world. Just look at the benefits:
Native support for the most patrician language https://wy-lang.org/
Embrace our new overlords.
Still fight against the totalitarian PRC (in spirit, at least) by using the classical characters in use at Taiwan and HK. Fight da powah.
Maximum economy of expression.
Be able to read Tang poetry.
Money.

22 2021-01-23 14:42

Oh and I forgot to mention: completely isolating language. No need for morphological analysis of any kind.
What are you waiting for? Reach out to your favorite standards organization now!

23 2021-01-23 16:17 *

>>21

Embrace our new overlords.

Still fight against the totalitarian PRC

eh?

24 2021-01-23 17:45

>>23
You can have your cake and eat it, too!

25 2021-01-23 19:17 *

>>21
Digitalization is killing traditional languages. r/neography and r/conlang welcome a new ideogram for another fantasy

26 2021-01-23 21:22

Remove charsets altogether! Let users write vector graphics directly. Freedom of the pen!

27 2021-01-23 21:43 *

>>20
I've spent far too much time writing for unknown reasons. I'm going to disconnect for a while.

28 2021-01-23 23:31

>>5
>>7
My thoughts are summarized here: http://verisimilitudes.net/2019-11-22

It's amusing to be called a contrarian, since there seems to be little resistance against going with the flow in these discussions. The people who think UTF-8 is fine probably also think UNIX is good.

>>9
>>14
>>15
It's pleasant to see my work mentioned by another. Be well, considering that gas issue.

>>16

Honestly, you do realize words come and go, right?

Words can be considered eternal.

People who complain that emojis are taking up precious space in the unicode standard ought to be aware that exactly this would happen with a dictionary-styled encoding.

There's the threat of a dictionary containing degenerations, such as ebonics, but that's a wider social issue.

Moreover, are we really that pressed in space (memory/storage) that we need to optimize language for it?

This is omnipresent in these discussions. Anything new need prove it's not inefficient, but the status quo receives no such skepticism, even when obviously grossly inefficient. Computers are far more capable, and sending individual characters as if they were teletypewriters is just as obscene as using an operating system half a century old which emulates them.

Remember that oftentimes, the simpler idea is the more correct.

What's beautiful and correct about Unicode? What's beautiful and correct about ASCII, a quarter of which is control characters which must be treated specially? There's no beauty and correctness, instead mere familiarity.

>>17
No. It could be better, because special cases would be eliminated. Consider the text rendering flaw of iOS which crashed the machines. It's obscene it was even conceivable.

>>18
>>19
I see these people following in the thread starter's footsteps. Don't be mental midgets. I need to, and will, rewrite that article, but it includes the following sentence:

In any case, this is one important reason why most any implementation of this system should permit circumventing the dictionary, with yet another reason being related to freedom of expression.

I've given a great deal of thought to this, and the best solution is having an alternative dictionary where necessary; I'd considered inlining such words, but this was a terrible solution. A single bit would determine whether a code uses the standard dictionary or the alternative, and this avoids unnecessarily disadvantaging the use of words not in the former, also avoiding breaking the nicer qualities of the representation.
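
Purely as an illustration of that one-bit scheme (and assuming the selector is the top bit of a 24-bit code, which isn't specified above):

    STANDARD = {0: "house"}           # stand-in standard dictionary
    ALTERNATIVE = {0: "frobnicate"}   # stand-in alternative dictionary

    def decode(code24):
        use_alt = (code24 >> 23) & 1              # top bit selects the dictionary
        index = code24 & 0x7FFFFF                 # remaining bits index a word
        return (ALTERNATIVE if use_alt else STANDARD)[index]

    print(decode(0))          # house
    print(decode(1 << 23))    # frobnicate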

Part of the folly of unicode is the homoglyph. Text shouldn't be shown without its language being apparent, which defeats this issue.

>>25
Part of the reason for this is because humanity still thinks the idiots programming the computers should have more say than the entirety of the humans which preceded them. Programmers simply aren't accustomed to writing programs which actually work, or which bend over backwards for the environment they serve.

>>26
I don't suggest so, but this ties into my thoughts on character sets. Obviously, character sets are generally better than representing text as a mass of glyphs; the idea I propose merely takes it further, rather than calling character sets the final form for forever.

29 2021-01-24 00:05 *

>>28
Verisimilitude, remember when characters were hard coded in the video card and before the teletype client? Wonder if any video card manufacturer was inane enough to make enough space for utf-8.

30 2021-01-24 00:06 *

>>27
Off topic anyway... We are jamming too many things into the encoding system. For archiving, there could be a redesign of PDF for a better compression ratio.
>>26
Join the emoji strike force, you retard; they are already using SVG font glyphs. I heard that SVG is Turing complete.

31 2021-01-24 01:27 *

>>28
About editing and rendering: do you need a database of basic chars and then another database of words, roughly an extension of the Unicode database?

32 2021-01-24 05:04

>>28

Words can be considered eternal.
There's the threat of a dictionary containing degenerations, such as ebonics, but that's a wider social issue.

Are you trolling? I would be hard pressed to think of something as elusive and transient as natural language. Words are but symbols which describe the human experience in fuzzy ways that are hard to make precise. Synonyms in a language have slight differences in meaning, with somewhat fuzzy boundaries and idiosyncratic use in one specific community as opposed to another. Natural language is continuously changing, and words are constantly moving in a cloud of meaning, with complex interactions with similar words, commonly paired terms, cultural impact... Words are coined every year; some of them survive, some don't; many words fall out of use more and more; some become obsolete with the advent of new technologies; some have radical shifts in meaning within a short timespan... There's absolutely nothing eternal about a word. Not even concepts are eternal; they are loose attempts at signifying a set of ideas in our experience, and more often than not, people just can't agree on their precise definitions.
Then there is the concept of a lexicon: mathematicians use a set of words, as do programmers, biologists, chemists, physicists, musicians, and linguists. All of them have words that are shared across disciplines (and indeed influence each other), while other words have totally different meanings across different disciplines.
What I'm trying to get at is, under what criteria would you delimit a word in an encoding? Clearly not by meaning. Morphologically? Words also change in their spelling; would you use the british or the american spelling of a word? Even if you argue there are minimal differences there, what would it be in 100 or 300 or 500 years, when different dialects of english each start becoming their own language? I may be looking too far into the future, but you're the one who said "words are eternal."
Which takes me to the other weird claim, that language change among the blacks is somehow the exception rather than the rule, and to the idea of separating a social issue from language, which is itself the main vehicle and reflection of any social issue whatsoever. It is not only blacks or some specific community that evolves language; every single one does, and just as disciplines develop their own lingo, so does every community that forms for any reason.
Furthermore, how would you catalogue the words? How would you make space for the new words that are coined each year, or which are inevitably overlooked by the initial cataloging scheme, without disorderly pushing them at the end of the list? How could you devise sane boundaries and placing of the different words, especially considering homophones and words which can serve as different syntactic elements (both a noun and a verb, or an adjective and an adverb)? I can think of nothing short of a full phylogenetic tree. But what about words for which there is no consensus on their origin? What if new information reveals a word has been misplaced in the tree?
I can see a few (contrived) ways that the approach could arguably work in the real world: 1. The dictionary is essentially the same as the Oxford dictionary of the english language, being updated periodically (say each year or 5 years at most) to reflect current usage of the language, 2. The dictionary describes an official version of the english language out of which any entity would make its own extensions to accommodate idiosyncrasies of their own use of the language (thereby inhibiting exchange by groups which are set apart by geography or even domain of discourse), 3. The dictionary is but an extension to a character encoding scheme, a sort of library of words which can be inserted in a character stream wherever a word is thus available. 4. The dictionary is made for a language such as lojban where most of these issues are simply non-existent, and which has well-defined rules for making new words and the meaning attached to them.
All things considered, making such a scheme beyond a trivial collection of high-frequency words seems way more complex than every encoding standard put together.
Sorry for the rant, I just couldn't not take the bait.

33 2021-01-24 07:46

>>29
I wouldn't know.

>>31
A database of basic chars would contain just a few entries. The dictionary of words isn't comparable to the Unicode database, I don't believe.

>>32

Are you trolling?

No. My thoughts are summarized in-part here: http://verisimilitudes.net/2020-11-11

Natural language is continuously changing, and words are constantly moving in a cloud of meaning, with complex interaction with similar words, commonly paired terms, cultural impact...

Sure, but I've found the same people who defend bastardizing language also tend to be the ones telling others how to use it when it suits them. Observe that the cretins who use "they" singularly, or try to push "latinx" on Spanish, also tend to claim language evolves to justify these changes, while advocating them without room for disagreement.

There's absolutely nothing eternal about a word.

I disagree. I believe most people around me use English incorrectly. Even if everyone were to use English incorrectly, that wouldn't make it right.

Not even concepts are eternal

Is that concept eternal?

What I'm trying to get at is, under what criteria would you delimit a word in an encoding?

They would be organized by form only. While my system could help to correct errors, that's not the primary purpose.

Words also change in their spelling; would you use the british or the american spelling of a word?

I'd include both, and an auxiliary table of variant words could make a connection, were such valuable.

Not only blacks or some specific community evolves language

My point is ebonics isn't an evolution. It's a sickening degeneration, and I don't respect it. I'd prefer it didn't exist.

How would you make space for the new words that are coined each year, or which are inevitably overlooked by the initial cataloging scheme, without disorderly pushing them at the end of the list?

A new dictionary would be published, and rules for translation could be automatically derived.

1. The dictionary is essentially the same as the Oxford dictionary of the english language

The end goal would be a comprehensive document, yes.

2. The dictionary describes an official version of the english language

There's no issue with groups having specialized dictionaries, and this could even be valuable. I respect groups which have formed their own words, such as hackers, and a programming forum could include those words.

3. The dictionary is but an extension to a character encoding scheme

No; the goal is superseding such.

4. The dictionary is made for a language such as lojban where most of these issues are simply non-existent, and which has well-defined rules for making new words and the meaning attached to them.

I'm still playing with toki pona as a test bed: http://verisimilitudes.net/2020-06-18

That also needs updating, but the tiny language will work for a demonstration.

All things considered, making such a scheme beyond a trivial collection of high-frequency words seems way more complex than every encoding standard put together.

It's a shame it seems that way, since that's incorrect.

Sorry for the rant, I just couldn't not take the bait.

None of this is bait.

34 2021-01-24 08:57 *

>>33

I wouldn't know.

It was a joke about bad design, using a rhetorical question. I do know modern teletype clients that store unifonts, an even worse design, with the teletype client being a completely generic machine.

35 2021-01-24 10:28

Unicode is bloated
let's invent our own unicode, but this time cram in the entire world's dictionaries for extra encoding efficiency

36 2021-01-24 11:01

I'm still playing with toki pona as a test bed

1984 Newspeak
Animal Farm retardation of syntax
Brain damage from xenoestrogens
Semantic poverty and
Emoji-tier linguistics
by your power combined, Toki Pona.

37 2021-01-24 11:27

Imagine a future of Eloi NPCs speaking Toki Pona and an Ithkuil-speaking technological Morlock elite.

38 2021-01-24 19:00 *

>>35
As I see it, it means keeping a list of code points (like Unicode but simplified) and then using dictionaries of words for compression of semi-rich-text documents. The bloat may not come from the system's built-in dictionaries, since they can be compressed and structured in a backward- and forward-compatible way. The additional dictionaries bundled with each document may add up to a great amount, but the total storage efficiency is still to be estimated.
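
Roughly, a document under such a scheme might look like this (all names invented for illustration):

    BUILT_IN = {42: "hello", 7: "world"}              # stand-in system dictionary

    document = {
        "extra_dictionary": {1000: "frobnicate"},     # small dictionary bundled with the document
        "body": [42, 7, 1000],                        # codes into built-in + bundled tables
    }

    def decode(doc):
        table = {**BUILT_IN, **doc["extra_dictionary"]}
        return " ".join(table[code] for code in doc["body"])

    print(decode(document))                           # hello world frobnicate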

39 2021-01-26 03:06 *

>>28

http://verisimilitudes.net/2019-11-22
It's pleasant to see my work mentioned by another. Be well, considering that gas issue.

I had forgotten about the switch to a solely dictionary-based system, and I haven't fully decided on the relative merits of variable-width encodings yet. The gas is simply paint fumes and other artifacts of modern construction. I tend to react poorly to VOCs etc. and generally try to avoid them, but I'll be fine of course.

40 2021-01-26 04:52

>>36
Damn bro, far out. That's so deep. Have you read the books Animal Crossing and 1964 by this guy Joe Orville? You probably haven't heard of these hidden gems. They're totally far out, man. Accurately describes the society we live in. Far out.

41 2021-01-26 06:42

>>40

Brain damage from xenoestrogens

42 2021-01-26 06:53

Toki Pona creator >>41
https://en.wikipedia.org/w/index.php?title=Toki_Pona&oldid=48603
https://dic.academic.ru/dic.nsf/ruwiki/99927

43 2021-01-26 06:55

How do you translate "You will eat the bugs, you will live in a pod and you will be happy"
into Toki Pona?

44 2021-01-26 07:25

>>43
sina moku e pipi li lon tomo lili li pona

45 2021-01-26 07:30

>>43
Actually, here's a better translation:
sina moku e pipi la sina lon tomo lili la sina pona

46 2021-01-26 08:20

Back-translated it means: "you your meal eat such bug it's said you your be in/at/on be there indoor constructed space little it's said you your good"

indoor constructed space small

sounds like autistic pygmy describing alien sights to his tribe

47 2021-01-26 08:26 *

test

48 2021-01-29 05:23

>>33
A dictionary-based encoding is definitely a mistake. Even if a language were static and strictly defined, you would not solve any problem by storing words instead of characters, because speech is not composed of words; it is composed of verbs, nouns, adjectives, subjects, articles, etc. Their arrangement follows a set of rules, and it is possible to make an encoding out of them. In fact, everything I have said already exists; its use, however, is data mining and surveillance, and the use of auto-correctors has been forced onto the populace to ease said purposes; grammar nazis are useful idiots. That is why I will always oppose an encoding that isn't dumb: it doesn't make communication better for anyone, but makes it easier for computers to recognize. If you don't understand what is being said, ask the other person to explain; there is no way around it.
A good idea, however, would be to not let computers handle written language at all: let IPA be our only encoding, or write everything IRL and OCR it when needed.

49 2021-01-29 08:12

>>48

speech is not composed of words
it is composed of verbs, nouns, adjectives, subjects, articles, etc.

We agree that speech is composed of words, then.

While I can agree that having such a system for a new writer would handicap him, it would only benefit experienced writers such as myself. As for machine manipulation, that's part of the point. A character stream is difficult to manipulate in comparison; this idea could form the basis of a very complex and efficient text manipulation system. I don't like 4chan's /r9k/, but imagine an /r9k/ that can't be trivially bypassed by creating invalid strings of text. The very purpose is to add structure and enable greater things.

50 2021-01-29 15:33

I don't know if an encoding would be a good way to go about this. It feels like confusing the map with the territory.
A formal or semi-formal grammar enforced on top of a sane character encoding could do that job just as well, by parsing (and maybe tokenizing!) the text.
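
For instance, a trivial layer of that kind over plain text, in Python (the word list here is made up, just to show the shape of the idea):

    import re

    KNOWN_WORDS = {"speech", "is", "composed", "of", "words"}   # hypothetical lexicon

    def tokenize(text):
        return re.findall(r"[a-z]+|[^\sa-z]", text.lower())     # words and punctuation

    def unknown(text):
        return [t for t in tokenize(text) if t.isalpha() and t not in KNOWN_WORDS]

    print(tokenize("Speech is composed of words."))
    print(unknown("Speech is composed of wordz."))               # ['wordz']
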
We have always separated jobs through layers of abstraction; in this case written text is at the lowest level, and encoding single characters is the simplest solution, even if better devices could be developed. I say characters are the simplest solution (and less prone to error) because of morphological and grammatical rules regarding case, conjugation, word composition, etc.
What we really transmit through text is ideas; the content of our text is semantic. But semantics are elusive and always changing. Some people don't seem to understand that a word's meaning is not entirely the same across time. At any rate, the lexical shape of text is of minor importance. Enforcing a spelling or limiting the play on words such as puns, intentional misspellings, agglutination, morphological shifts, coinage of words such as 'smog' (smoke-fog), etc., would choke natural language. It would work marvels for formal writing, such as documents issued by an authority, legal contracts and whatnot. But for most writing, especially informal writing on the internet, it would be too constraining.
On the other hand, I would really love to see a sort of database of words implemented as a network of links, both etymological and semantic. Thinking of it, the entries would be abstract concepts and words would be the output; the specific spelling would be the end product of a combination of all semantic components and grammatical/morphological rules, where available (a character set being but a set of symbols to perform an output, kind of what pixels would be to a mathematical function (plot)). For instance, you could have "walk, walked" as a set of outputs (an infinitive and a perfective), and "tread, treaded" as another with slightly different semantics.
Another thought just occurred to me. In order to more accurately reflect how language actually works, the database wouldn't be in the hands of an authority, but it would be distributed, much in the spirit that has spawned decentralized systems such as ipfs. Nodes would be able to coin new words, expressions, and grammar rules (like gender-neutral pronouns), and they would be adopted on demand by those who agree, not imposed by an authority.
Consider gender neutral pronouns. If the authority enforces them, half the populace will not be happy, if the authority decides not to, the other half will be unhappy (granted, perhaps 60% of the population don't actually care, so it's really 20%/20%).
Eventually languages will diverge as they have since the dawn of time into dialects and subfamilies, with the "encoding" reflecting not only that, but also their evolution and thus proving resilient to time.
It is better to adopt a solution that has a better chance to persist over the long term than one that is bound to a specific decade or century.
