Niyati Bafna
English-CS grad from Ashoka, now specialising in linguistics/CL

The L in NLP: Why NLP Scientists Need to Linguistics Up


The NLP Revolution

Linguistics, or rather, modern linguistics, dates back about 250 years. It was sparked by the discovery of regular sound correspondences between languages previously thought to be unrelated, and it developed as a field to study the other structural systems – phonetic, phonological, morphological, syntactic, semantic – that languages exhibit. For example, morphology studies the processes of word formation in languages: some languages agglutinate morphemes, or bits of meaning, to form words, like Turkish:

Anlamıyorum: I don’t understand (roughly anla-mı-yor-um: understand-NEG-PROG-1SG)

and others don’t, or do different things with their words.

Syntax studies things like word order and agreement between the different words in a sentence – what makes a sentence well-formed and able to express an idea coherently, which the following, for instance, is not:

This article should getting to the points.

In short, different arenas of classical linguistics attempt to, among other things, ‘comprehend’, explain, and sometimes generate language with all of its facets, usually with handwritten rules enumerated with great attention to the structure of specific languages or language groups.

Around 30 or so years ago, with the availability of more compute power, we shifted our m.o. to statistics – we used machine learning algorithms to process large amounts of data and detect statistically prominent patterns. These kinds of machines could then generate language using a probabilistic approach based on what they had ‘seen’. For example, a Bayesian n-gram model (with n = 3) calculates the frequency of all triplets of consecutive words. When given a two-word ‘prompt’, it looks, with some Bayesian magic, for the word most likely to complete the triplet given the prompt. It then uses the most recent two words to incrementally predict the next word, and so on. Some statistical models, like decision trees, were in principle similar to hand-crafted rules, although they were automatically ‘learned’.
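For concreteness, here is a minimal sketch of such a count-based trigram model in Python. It is purely illustrative – a toy corpus standing in for ‘large amounts of data’, no smoothing, no handling of unseen prompts – and all the names in it are my own, not from any particular library.

```python
# A minimal count-based trigram language model, as described above.
# Toy corpus, no smoothing, no handling of unseen prompts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count every triplet of consecutive words.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the word most likely to complete the triplet, given a two-word prompt."""
    candidates = trigram_counts.get((w1, w2))
    return candidates.most_common(1)[0][0] if candidates else None

def generate(prompt, length=5):
    """Incrementally predict the next word from the most recent two words."""
    words = list(prompt)
    for _ in range(length):
        nxt = predict_next(words[-2], words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate(("the", "cat")))  # e.g. "the cat sat on the mat ."
```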

Even more recently, we found Deep Learning (DL), or neural nets, and were instantly smitten. What’s not to love about a machine that requires no domain expertise, no intelligent execution, and is so opaque that no-one can find faults in it? We’ll get to that in a minute. With DL came more and more mind-blowing and data-intensive models of learning language, each with some professed idea of its specialized function: we had CNNs; RNNs, which tried to remember sequential information; LSTMs, which addressed the vanishing gradient problem; CRFs, BiLSTMs, GRUs, GPTs…

And finally, about half a decade ago, there was an even bigger revolution: that of BERT, or Bidirectional Encoder Representations from Transformers. We also invented word2vec, with the staggering, and beautiful, discovery that we could represent words as high-dimensional vectors and retain some of their semantic characteristics. We invented attention, transformers, multilingual BERT, character embeddings, GloVe, CoVe, ELMo and so forth. The face of NLP drastically evolved as principles-first methods turned into more and more elegant and complicated neural architectures, which often used equally complicated tools, like word embeddings. Since it was impossible to actually follow the numbers in those matrices as they were processed millions of times, all eyes turned to the F1 score. Or BLEU, if you were doing translation, or GLUE, or whatever. No +3 F1, no go.
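To get a taste of that ‘semantic characteristics’ claim, here is a small, hedged sketch using gensim and one conveniently downloadable set of pretrained GloVe vectors; it assumes gensim is installed and the model can be downloaded, and any pretrained word vectors would do just as well.

```python
# Words as vectors: similarity and analogy, using pretrained GloVe vectors.
# Assumes gensim is installed and the model below can be downloaded.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors

# Words that occur in similar contexts end up close together...
print(vectors.similarity("linguistics", "syntax"))

# ...and some relations survive as vector arithmetic: king - man + woman ≈ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```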

BLEU skies? [1]

This is where the third-year Computer Science student comes in. She takes Machine Learning. She tries out NLP. She reads some research papers in NLP that get 98.8% F1 (wow). She joins some NLP circles that are all about using neural nets like three-year-olds use Lego blocks, so that they can grab that +2 F1 and sneak in a paper. She learns the anthem of the field: you can be an NLP scientist and build machines for languages you don’t know anything about (wow)! All you need is to grab a corpus in some language (let it be large) and start pumping those numbers. How tedious it would be, right, if you actually had to take a meaningful, scientific look at the data you’re working with? The data which, actually, is one of the world’s 6500+ languages, with a probably unique set of linguistic characteristics? But that, she comes to believe, is unnecessary, because these large, powerful neural architectures are, in fact, language-agnostic, right – in the sense that they capture and work equally well on different kinds of linguistic features? They just deal in numbers, right? They don’t have any biases, unlike those complex rules linguists once constructed (and still do, by the way).

All of the above ideas, of course, are mistaken – the rather heavy-handed italicization helps to highlight the sarcasm, for those who missed it. At some point, our CS major will have a rude shock, i.e. when she realizes just how large the performance gap of these techniques on English versus on other languages can be. Contemporary wisdom will tell her “It’s just the data issue”; however, contemporary wisdom is not telling her the whole story. To be honest, contemporary wisdom simply hasn’t invested in finding other scientific causes for the fact that the same machines perform differently on different languages. It is simply too easy to point to data disparity, bypass that conversation about the typological differences between languages (icky!), and move on to trying other standard, blindly implementable techniques, like transfer learning, using multilingual embeddings, etc. Frankly, no-one knows how to characterize or interpret multilingual embeddings (or word embeddings in general). Here and here are two papers that conduct research on multilingual embeddings themselves, and there are a bunch more. Have we spent any time on trying to understand why, or on what sort of linguistic data, transfer learning works best? Because it certainly doesn’t do equally well on all languages…and still, I don’t see any papers attempting to scientifically understand this, other than throwaway comments such as ‘works best on “similar” languages’. Ask an NLP scientist what ‘similar’ means for languages and watch her flounder.

Note that I am talking about two different problems here (and I’ll soon talk about a couple more). Firstly, a utilitarian issue: there is enough evidence to show that NLP tasks and models do discriminate by language, specifically by typology (in this paper, for example). By ignoring the value that linguistic information gives us, NLP scientists are losing out on potentially valuable insights that may improve their models. Secondly, the scientific issue: whether or not such information or insights help in ‘practical’ terms, by not investigating them we are simply performing bad science. We should not need the promise of higher BLEU scores to make a scientific enquiry into the impact that our models have. We need to be informed about how languages work simply for the explainability of our model architectures, and for scientific insights on languages and Maths. Remember when we performed research for the sake of understanding and not demonstrating? It was a lot more fun. And a lot more responsible.

The Data Issue

I probably don’t need to spend time on why blind science is at best bad and at worst dangerous – the field of ethical AI is a good place to look for those debates. I will spend a little time on the data-intensiveness of some of these NLP tools, however. NLP has definitely played its part in convincing anyone who ventured to think that we live in a postcolonial world otherwise. How many languages of the world are amenable to mainstream NLP techniques? Conservatively, much less than 1%. Do we think this 1% is distributed randomly across languages and not concentrated among languages with historical colonial power, hard and soft? Hm. This makes for, obviously, a much more rigorous discussion, but in my opinion, NLP should certainly introspect on whether it is structurally propagating linguistic hegemony with the tools it chooses to invest in, along with simply reflecting it. Yes, demand and supply, blah blah, but we do postcolonial literature, you know. And postcolonial art. A lot of research historically does have roots in colonialism (anthropology, sociology, literature, translation studies), but their postcolonial revolutions occurred in parallel with those of the world, and their academicians are concerned with the question of how to dismantle academia’s systemic propagation of power structures. Why are NLP and Computer Science so many steps behind? Because machine learning scientists are too cool to require a social consciousness, or too Mathy to read, please? Bad science, bad politics, what’s not to like. Which is all to point out that ‘low-resource’ NLP – which, of course, should be the norm, as opposed to freakishly-high-resource NLP for mainstream English, Spanish, etc. variants – does rely to a greater extent on understanding the linguistic structures of languages…NLP needs to ask itself why its funds and bestseller techniques are geared towards a very specific one-hundredth of the world’s languages. And mainstream NLP scientists should invest in linguistic knowledge as well, or any other more equitable epistemology than data, in fact, so as to contribute to the balancing of the scales – and, of course, do good NLP.

I don’t buy-is

Lastly, consider the idea that NLP machines have no biases because they are not human, operate solely with numbers (pure, soulful numbers), etc. This paper points out something both important and obvious: ignorance of bias is not absence of bias (the paper also says a lot of very interesting things about free lunch). Firstly, neural architectures that process words are not free of bias by virtue of being non-human, numerically based entities. A small example: one can intuitively guess that an n-gram model expects, or will perform better on, languages with rigid word orders (like English!) [2]. I don’t know whether this is true (a toy illustration of the intuition follows below), but I do find it strange that there aren’t any papers about it (that I can find – if you do know of some, get in touch ☺). Neither do I find papers about how to adapt word embeddings to morphologically rich languages (MRLs), which might have a more dynamic or productive vocabulary as compared to isolating languages (like English!). Secondly, data. Given their complete dependence on data, neural nets are highly susceptible to what you feed them, which means that the NLP scientist, or the feeder, should be aware of the characteristics of the data, and should be capable of reading the results in context. Again, this is not possible if the scientist cannot dependency parse “I need to read Fromkin”. Which is to say that whether or not the current mainstream technologies are biased, NLP scientists with no linguistics background are in no position to make that claim. Note that there are some multilingual metrics that, by virtue of evaluating models on diversified languages and tasks, should end up giving you an accurate idea of a model’s biases or lack thereof. This doesn’t mean, however, that NLP scientists invest in understanding this diversity in any meaningful way at the ground level, and inform their work accordingly. There is a rather burgeoning linguistics-is-unnecessary view in the NLP community that, much like the medieval Church, is dismissing a field of information that it has no clue about. At worst, linguistics will only leave NLP more informed. At best, it will help us build better science – and that sounds like a hell of a good deal to me.
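The promised toy illustration of the word-order intuition – a sketch, not evidence: the same tiny invented corpus is turned into a crude ‘free word order’ variant by randomly permuting each sentence, and we count how many distinct trigrams each version produces. More distinct trigram types over the same amount of text means sparser counts for an n-gram model to estimate. Real free-word-order languages are, of course, not random permutations of English.

```python
# Toy illustration: rigid vs. (crudely simulated) free word order, seen by a trigram model.
# More distinct trigram types over the same text means sparser, harder-to-estimate counts.
import random

random.seed(0)

rigid_corpus = [
    "the teacher gave the student a book".split(),
    "the student gave the teacher a pen".split(),
    "the teacher gave the student a pen".split(),
]

# Crude stand-in for a free-word-order language: the same words, scrambled per sentence.
free_corpus = [random.sample(sentence, len(sentence)) for sentence in rigid_corpus]

def distinct_trigrams(corpus):
    """Count distinct triplets of consecutive words across all sentences."""
    return len({tuple(s[i:i + 3]) for s in corpus for i in range(len(s) - 2)})

print("rigid word order:", distinct_trigrams(rigid_corpus), "distinct trigram types")
print("free word order: ", distinct_trigrams(free_corpus), "distinct trigram types")
```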

Niyati Bafna

niyatibafna13@gmail.com

(PS: Thanks @Tanmai Khanna and @Hercules Singh Munda for your contribution and suggestions!)

Related articles!

  1. https://www.linguisticsociety.org/resource/history-modern-linguistics
  2. https://www.dataversity.net/a-brief-history-of-natural-language-processing-nlp/
  3. https://analyticsindiamag.com/why-are-some-languages-harder-to-learn-for-ai/
  4. https://venturebeat.com/2020/08/04/the-problem-of-underrepresented-languages-snowballs-from-data-sets-to-nlp-models/

[1] Sometimes, people argue in favour of BLEU because it’s accessible and easy to use. In response, other people narrate the following story: “There’s a person who realizes at their doorstep on a dark night that they’ve dropped their keys. They start searching for them on the ground. Someone else comes up and asks what’s up. Person #1 says ‘I’ve dropped my keys’. Person #2 says ‘Okay, I can help you. Did you drop them here, under this streetlight?’ Person #1 responds ‘No, I dropped them over there in the dark, but it’s easier to look here’.”

[2] We do have, by the way, Byte Pair Encoding (BPE) or char grams, useful in processing MRLs. However, BPE also sometimes becomes the end of the morphology discussion rather than the beginning, especially when NLP scientists do not know how to theoretically juxtapose char grams with the concept of morphology. This paper suggests that we do need stronger tools than BPE for MRLs.
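For readers who haven’t met it, here is a minimal sketch of the iterative pair-merging at the heart of BPE (in the style of Sennrich et al.), run on a few invented Turkish-flavoured word frequencies. The point to notice is that merges follow corpus frequency, not grammar, so the learned subwords need not line up with real morpheme boundaries like anla-mı-yor-um – which is exactly why BPE should be the beginning of the morphology discussion rather than its end.

```python
# Minimal BPE-style pair merging (after Sennrich et al.) on a toy vocabulary.
# Merges follow frequency, not morphology, so learned subwords need not align
# with real morpheme boundaries.
import re
from collections import Counter

# Toy word frequencies; symbols are space-separated, '_' marks end of word.
vocab = Counter({
    "a n l a m ı y o r u m _": 5,   # anlamıyorum, 'I don't understand'
    "a n l ı y o r u m _": 4,       # anlıyorum, 'I understand'
    "g ö r ü y o r u m _": 3,       # görüyorum, 'I see'
    "g ö r m ü y o r u m _": 2,     # görmüyorum, 'I don't see'
})

def most_frequent_pair(vocab):
    """Find the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return Counter({pattern.sub("".join(pair), word): freq for word, freq in vocab.items()})

for step in range(8):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print(f"merge {step + 1}: {pair}")

print(list(vocab))  # the resulting segmentations of the toy words
```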