There are vastly more collocations than idioms. (The largest explanatory collocation dictionary in existence covers only about 1% of all possible vocabulary.)
Most importantly, collocations don’t follow clear rules. There is no apparent reason why we should say a burning thirst and not a blazing thirst, except that most people say the former and not the latter. In a way, these whimsical word patterns are like an unexplored realm at the edges of grammar — a lush rainforest with all sorts of curious and apparently random species. At the edge of the forest, the human language learner and the MT system both face the same problem — how to chart it?
Playing Linnaeus to the human language “biosphere” is no trivial task, but fortunately there is help: massive computational power applied to vast collections of text (linguistic corpora) is producing resources for us all. For machine translation systems, these take the form of statistical language models, that is, tables assigning a likelihood score to word sequences based on their frequency. An MT system usually produces several candidate translations for a given source sentence. Each is checked against the language model, and the one with the highest score (the most frequent phrasing in the language) is chosen.
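The idea can be sketched with a toy bigram language model. This is a minimal illustration, not any particular MT system's implementation: the corpus is invented, and real models are trained on billions of words with far more sophisticated smoothing. The point is simply that “a burning thirst” outscores “a blazing thirst” because the former's word pairs actually occur in the training text.

```python
from collections import Counter

# Tiny invented corpus; a real language model is trained on billions of words.
corpus = (
    "a burning thirst gripped him "
    "she felt a burning thirst "
    "a burning desire to learn"
)

def train_bigram_model(text):
    """Count unigrams and bigrams from a whitespace-tokenized corpus."""
    tokens = text.split()
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def score(phrase, unigrams, bigrams):
    """Product of conditional probabilities P(word | previous word),
    with add-one smoothing so unseen pairs get a small nonzero score."""
    tokens = phrase.split()
    vocab_size = len(unigrams)
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return p

unigrams, bigrams = train_bigram_model(corpus)
candidates = ["a burning thirst", "a blazing thirst"]
best = max(candidates, key=lambda c: score(c, unigrams, bigrams))
print(best)  # the collocation attested in the corpus wins
```

The model never consults a rule explaining *why* thirsts burn rather than blaze; it simply prefers the sequence it has seen more often, which is exactly how frequency stands in for the missing rules.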
The work on collocations is far from over. For MT, the challenge is finding enough corpora: except for a few languages, such as English, French, and Spanish, most don’t have enough online text to build accurate models. For human learners, there is the additional problem of analyzing and describing this vast amount of data in terms useful to the language student.
The good news is that here, as in other areas, human linguists and MT systems can leverage each other’s efforts. Every new language model provides helpful data that can be used by the next generation of dictionaries, and every dictionary throws new light on the relationship patterns between words that MT can incorporate.