What Are Trojan Source Attacks and How to Deal with Them

Most software developers are already all too familiar with Trojan horse malware. Far fewer are familiar with a more recent technique that borrows its name: the Trojan Source attack.

This clever attack method was discovered and announced by researchers from the University of Cambridge, UK. They have since cautioned developers in businesses and government organizations against it, because it can affect source code written in most widely used programming languages.

So what exactly are Trojan Source attacks? How do they work, and how can developers detect and mitigate them?

Trojan Source Attacks Defined

A Trojan Source attack is a theoretical software supply chain vulnerability, of particular concern for open-source projects, in which an attacker hides malicious but permissible modifications in legitimate source code, typically inside comments and string literals.

In this attack, malicious actors manipulate the source code so that what human reviewers see differs from what the compiler interprets. Call it a case of hiding a vulnerability in plain sight.

Types of Trojan Source Attacks

Bi-Di Character Control Attacks

The first type of Trojan Source attack, as described by the Cambridge research team, relies on Unicode's bidirectional (or bi-di) control characters. Unicode is a standard character encoding system that assigns a unique code point to every character, digit, symbol and emoji it covers, across the world's writing systems.

This lets developers mix scripts from different languages in a single document. To handle such mixed text, Unicode defines bi-di control characters. These are formatting characters used in a block of text that combines languages with different reading directions, for instance, Arabic and English. They signal the shift between right-to-left (RTL) and left-to-right (LTR) reading directions, making the text easier for humans to read.

The catch is that these characters are processed by software but are not displayed, so human reviewers cannot see them. Malicious actors may take advantage of this to write code that appears harmless to the human eye but is interpreted differently by compilers.

Although compilers generally don't allow these control characters in the code itself, they do allow them in comments and string literals. The researchers noted that attackers could hide bi-di control characters in either of these places to reorder how the surrounding code is displayed, without changing what the compiler actually reads.
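Because the bi-di controls are ordinary code points, a short script can find them even though a reviewer cannot see them. The following is a minimal Python sketch, not a tool from the report. The function name find_bidi_controls and the sample snippet are hypothetical, and the set covers only the embedding, override and isolate controls.

```python
import unicodedata

# Embedding, override and isolate controls from the Unicode Bidirectional Algorithm.
BIDI_CONTROLS = {
    "\u202A",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B",  # RIGHT-TO-LEFT EMBEDDING
    "\u202C",  # POP DIRECTIONAL FORMATTING
    "\u202D",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(source: str):
    """Return a (line, column, character name) tuple for every bi-di control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col_no, unicodedata.name(ch)))
    return findings


# Hypothetical sample: the comment hides an override, written here as an escape
# so that this listing itself contains no invisible characters.
sample = "access = 'user'  # check \u202E admin\nprint(access)\n"

for line_no, col_no, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col_no}: {name}")
```

Run against the sample, the loop reports the RIGHT-TO-LEFT OVERRIDE hidden in the comment on line 1. A check like this could run in a pre-commit hook or a CI job, under the assumption that the project has no legitimate need for these characters in its source files.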

Homoglyph Attacks

The second type of Trojan Source attack involves the use of homoglyphs. Homoglyphs are characters with identical or very similar shapes, hence the name.

These attacks are similar to homoglyph phishing, which involves creating phony domains using characters that appear identical. For instance, an inattentive user may not notice the difference between learners.com and Iearners.com (spoiler: the first letter of the second domain is actually a capital “i”).

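Similarly, the 15-page Trojan Source Attacks report states that threat actors may use the same technique to write deceptive code using visually identical characters, for example by defining a function whose name looks the same as an existing one but contains a character from another alphabet.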
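A crude way to surface such lookalikes is to list every identifier-like word that contains a non-ASCII character, together with the Unicode name of that character. The sketch below is an illustration, not the report's method. The function name find_suspicious_identifiers is hypothetical, it also scans words inside comments and strings, and it will flag legitimate non-English identifiers, so it assumes a codebase whose identifiers are meant to be ASCII only.

```python
import re
import unicodedata


def find_suspicious_identifiers(source: str):
    """Map each identifier-like word with non-ASCII characters to those characters' names."""
    findings = {}
    # A word starts with a letter or underscore, then letters, digits or underscores.
    for word in re.findall(r"[^\W\d]\w*", source):
        odd = [unicodedata.name(ch, "UNKNOWN") for ch in word if ord(ch) > 127]
        if odd:
            findings[word] = odd
    return findings


# Hypothetical sample: the first function name uses CYRILLIC CAPITAL LETTER EN
# (written as an escape) in place of the Latin letter H.
sample = "def say\u041dello(): pass\ndef sayHello(): pass\n"

for word, names in find_suspicious_identifiers(sample).items():
    print(ascii(word), names)
```

Printing the identifier with ascii() shows the escaped code point instead of the lookalike glyph, which makes the difference between the two function names visible at a glance.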

Trojan Source Attacks: Which Way Out?

Ross Anderson, a co-author of Trojan Source: Invisible Vulnerabilities, notes that this is one of the biggest threats facing developers today, considering its ability to affect many widely used programming languages, including C, C#, C++, JavaScript, Java, Rust, Python and Go. And while there are currently no known exploits of this technique in the wild, chances are malicious actors will start weaponizing it sooner or later.

This reinforces the need for engineers and DevSecOps teams to take their cyber hygiene even more seriously. This means looking beyond the usual vulnerabilities to review the source code for backdoors and exploitable flaws. Let’s discuss several possible ways of dealing with this problem.

Using ESLint to Prevent Trojan Source Attacks in JavaScript

DevSecOps researchers at Snyk recommend using ESLint as a possible way of mitigating Trojan Source attacks in JavaScript. ESLint is an open-source static code analysis tool that checks code for errors and flaws and ensures that it adheres to the project's coding standard.

Essentially, once it is configured with a rule that flags bi-di control characters, ESLint can both detect existing Trojan Source vulnerabilities and stop new ones from making it into the codebase in the first place.

Consider Rewriting Your Source Code

As it stands, the Trojan Source vulnerability is a potential threat for everyone. However, it presents an especially severe problem for development teams that are in the habit of copy-pasting code from open source projects.

On the surface, copy-pasting pieces of code may make it easier for developers to meet tight project deadlines. But it has never been a safe practice, and its dangers just got worse with the discovery of Trojan Source attacks.

Besides licensing issues, another reason to avoid copy-pasting code is that it may carry invisible characters that change how the code is displayed or how it behaves.

The lesson from the Trojan Source Attacks report is not to be too comfortable copying code when you don't know how it works. It's good practice to retype the code yourself so that you don't copy hidden vulnerabilities along with it. As you do so, remember to enable your IDE's option for displaying Unicode control characters, if it has one.

If you have to copy-paste code, do yourself a favor and use a binary file editor (a.k.a. a byte or hex editor) to inspect the characters that make up the file.
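Where a hex editor is not at hand, a few lines of Python can approximate the same inspection. This is a minimal sketch with hypothetical helper names (reveal_non_ascii and hex_bytes). It assumes the copied snippet is available as a string, for example one read from a file opened with UTF-8 encoding.

```python
def reveal_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a visible <U+XXXX> marker."""
    return "".join(ch if ord(ch) < 128 else f"<U+{ord(ch):04X}>" for ch in text)


def hex_bytes(text: str) -> str:
    """Return the UTF-8 bytes of the text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")


# Hypothetical pasted line; the hidden override is written here as an escape.
pasted = "if a\u202E:"

print(reveal_non_ascii(pasted))  # if a<U+202E>:
print(hex_bytes(pasted))         # 69 66 20 61 e2 80 ae 3a
```

The three bytes e2 80 ae are the UTF-8 encoding of the right-to-left override, the same sequence a hex editor would show in the file.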

Stop Using Text Directionality Control Characters

On the issue of how to lower the risk of Trojan Source attacks, the researchers say that “the simplest defense is to ban the use of text directionality control characters both in language specifications and in compilers implementing these languages.”

They add that where an application needs to produce text that requires bidirectional overrides, “developers can generate those characters using escape sequences rather than embedding potentially dangerous characters into source code.”
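As an illustration of that advice, the Python sketch below produces a right-to-left override at run time while the source file itself stays plain ASCII. The constant names and the force_rtl helper are hypothetical and chosen only for this example.

```python
# The control characters are written as escape sequences, so the source file
# contains nothing invisible. Python's named form "\N{RIGHT-TO-LEFT OVERRIDE}"
# would work the same way.
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING, which ends the override


def force_rtl(text: str) -> str:
    """Wrap text in an override so it renders right to left, then restore direction."""
    return f"{RLO}{text}{PDF}"


label = force_rtl("abc")
print(ascii(label))  # shows the escapes instead of rendering the override
```

A reviewer reading this file sees exactly which control characters are used and where, which is the point of the recommendation.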