👋 Hi, I’m Melissa and welcome to my biweekly Note, Language Processor. Every other Tuesday, I dig deep into language & behavior, the limits of technologies, and the connection between what people say and do. Once a month, Signal Lab takes over and covers these topics for startups and early-stage investors.
By 1995, the Unabomber had been on a 17-year bombing rampage. (“Who’s the Unabomber?”) No one had any clue as to his identity.
That year, he sent a manifesto to The New York Times and The Washington Post, threatening to plant another bomb if it was not published. The manifesto, "Industrial Society and Its Future," was published that September and subsequently read by David Kaczynski, who recognized in its style the writing of his brother, Ted.
David cooperated with the FBI, providing samples of his brother’s writing for comparison by profiler James Fitzgerald and other linguistic experts. A search warrant was issued on the basis of that work, and Ted Kaczynski was later arrested at his remote Montana cabin (where extensive bomb-making equipment was also found).
Idiolect
The Unabomber was caught because his writing style was recognized, something linguists call idiolect, i.e., the speech and language habits peculiar to one particular person.
Idiolect can be key evidence in forensics. There have been many instances of murderers texting from a victim’s phone to make it seem as if the victim is still alive, but because the idiolects don’t match, the recipient is alerted that something is wrong. Idiolect serves as a linguistic fingerprint and is more difficult to imitate than it might appear, because it spans punctuation, grammatical structures, lexical choice, shorthands, and even emojis.
Even when the suspect is unknown, linguistic profilers can help narrow the search, much as forensic artists do with composite drawings. They pinpoint idiosyncratic psychological, dialectal, regional, and socio-economic clues to the identity of a writer or speaker, which together comprise idiolect. Age and gender can be readily inferred, as well as many surprising personal details (like the famous “devil strip” tip-off). All that information can be used to identify a suspect, bring them in, or convict them.
The Linguistic Work of Threat Analysis
Text analysis is central to the work of forensics, and specifically to threat analysis. There are many kinds of threats: cyber threats, terroristic threats, insider threats. But what about threats of physical violence, the verbal or written threats of harm against an individual person? Whenever these kinds of threats are received, the recipient must first assess intent.
A threat, if reported to law enforcement, undergoes a process of threat assessment whereby investigators evaluate various context-dependent social, psychological, and linguistic factors to assess whether the threat is real and also whether it is likely to be carried out.
Threats escalated to the investigators at the FBI’s National Center for the Analysis of Violent Crime are designated as high, moderate, or low according to seven equally weighted factors: degree of anger expressed, level of personalization, level of specificity, evidence of technical knowledge, evidence of commitment, existence of ancillary incidents, and level of escalation if multiple texts or events exist. Low-level threats may contain lexically mitigated or conditional language, a lack of detail, or unrealistic plans. High-level threats are credible and contain facts that can be readily verified or detailed descriptions of how the threat will be carried out.
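To make the idea of "equally weighted factors" concrete, here is a minimal sketch in Python. Only the seven factor names and the equal weighting come from the description above. The 0 to 2 rating scale, the cutoffs, and the function name `designate` are assumptions made for illustration and do not represent the FBI's actual procedure.

```python
from statistics import mean

# The seven factors named in the text; each carries the same weight.
FACTORS = (
    "anger",
    "personalization",
    "specificity",
    "technical_knowledge",
    "commitment",
    "ancillary_incidents",
    "escalation",
)


def designate(ratings: dict[str, int]) -> str:
    """Map per-factor ratings (0 = absent, 1 = some, 2 = strong) to a level.

    The 0-2 scale and the cutoffs below are illustrative assumptions,
    not a documented scoring rule.
    """
    missing = [f for f in FACTORS if f not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    score = mean(ratings[f] for f in FACTORS)  # equal weighting: a plain average
    if score >= 1.5:
        return "high"
    if score >= 0.75:
        return "moderate"
    return "low"
```

In this sketch no single factor can dominate: a text rated 2 on anger but 0 on everything else averages about 0.29 and lands in the low band, which is the practical meaning of equal weighting.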
The Encyclopedia of Applied Linguistics lays out the process that follows, which relies on idiolect analysis:
Threat assessors code a text for markers of potential authorship. For example, they may highlight orthographical or grammatical information that is commonly misused but is used correctly according to a standard variety by the author (e.g., they’re/their/there), regional spelling variations (e.g., analyze vs. analyse, theater vs. theatre), and other idiosyncratic words or phrases that may reveal information about the writer’s age, gender, education level, native language, or profession (Fitzgerald, 2005).
…
After coding, the identified language patterns are used to support assessments of intent or viability by revealing information about an author’s apparent level of knowledge or intelligence (e.g., are words used specific enough to the field of chemistry to demonstrate expertise to make a bomb?), personal—and potentially dangerous—affiliations (e.g., is language used that is affiliated with known terrorist organizations?), or emotional stance (e.g., is anger linguistically exhibited that provides motivation for fulfilling the threatened act?) (Rugala & Fitzgerald, 2003).
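The coding step described in the passage above can be sketched as a simple tagger. This is a rough illustration, not a forensic tool: the marker lists contain only the examples quoted above, and the names `MARKERS` and `code_text` are hypothetical. The sketch only locates candidate markers. Judging whether "their" or "there" is used correctly, and what that says about the author, remains the assessor's job.

```python
import re

# Hypothetical marker lists, seeded only with the examples quoted above.
MARKERS = {
    "homophone": ["they're", "their", "there"],
    "us_spelling": ["analyze", "theater"],
    "uk_spelling": ["analyse", "theatre"],
}


def code_text(text: str) -> list[tuple[str, str, int]]:
    """Return (category, matched word, character offset) for each marker found."""
    hits = []
    for category, words in MARKERS.items():
        # Whole-word match, so "there" is not found inside "therefore".
        pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((category, m.group(0).lower(), m.start()))
    return sorted(hits, key=lambda h: h[2])  # order of appearance in the text
```

Note that the sketch matches the straight apostrophe in "they're" only. Text containing typographic apostrophes would need normalizing first.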
Evaluating the intent and plausibility of the threat is the primary concern.1 Will this threat (of the thousands that occur every day) be realized? It was only fifteen years ago that threat analysis turned to linguistics to establish empirical methods for assessing threat intent and risk.
Previously, threat risk was calculated using circumstantial factors such as the threatener’s history of substance abuse, prior criminal record, gender, education level, age, mental state, or personality disorder. But these behavioral and demographic factors do not usually provide an accurate measure of the level of intent or danger in a threat.
And sometimes the linguistic content of the threat is all there is to analyze (i.e., the threatener is anonymous), nullifying the possible contributions of the threatener’s behavioral or demographic information to an evaluation of the level of risk.
Linguistic Forms of Threat
While social, demographic, and psychological characteristics are not reliably predictive of threat realization, linguistic patterns in threats are.
There are a few strong “tells” that a threatening communication may be realized. The FBI has spent years developing these linguistic risk-enhancing factors, some of which are listed below:
TOPICAL/DISCOURSE-LEVEL
Threats which repeatedly mention love, marriage, or romance
Threats to stalk or reveal detrimental information (whether true or false)
The use of persuasion (analyzed rhetorically) in the threatening communication
More linguistic forms of politeness, perhaps surprisingly
More detail included in threat articulation
LEXICAL-LEVEL
Inclusion of pejorative or offensive language
“If/then” statements or suggestions of ways the threatener could or will achieve their aims
Prediction modals (e.g., “could”)
Verb-controlled that complement clauses (e.g., “you knew that I…”)
Certainty adverbials (e.g., “definitely”, “probably”) — occur among realized threats at four times the rate of non-realized threats
And if you see any of these words in a threat (would, will, never, and try to), it is high alert. Amazingly, just these words, used in the right way as modals, can be some of the biggest red flags for threat: 85% of these threats go on to be attempted or realized.
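As a rough illustration of how the lexical-level clues above could be surfaced for a human reviewer, here is a minimal Python sketch. The word lists hold only the examples given in the list, and the names `LEXICAL_MARKERS` and `flag_markers` are hypothetical. The sketch counts surface forms and cannot tell whether a word is being used "in the right way," so its output is a prompt for closer reading rather than a risk score.

```python
import re
from collections import Counter

# Word lists taken from the examples in the list above; real feature sets are larger.
LEXICAL_MARKERS = {
    "high_alert": ["would", "will", "never", "try to"],
    "certainty_adverbial": ["definitely", "probably"],
}


def flag_markers(text: str) -> Counter:
    """Count whole-word occurrences of each marker category in a text."""
    counts = Counter()
    lowered = text.lower()
    for category, phrases in LEXICAL_MARKERS.items():
        for phrase in phrases:
            # Allow any whitespace between the words of a multi-word phrase.
            body = r"\s+".join(re.escape(word) for word in phrase.split())
            counts[category] += len(re.findall(r"\b" + body + r"\b", lowered))
    return counts
```

For example, `flag_markers("I will definitely try to find you.")` counts two high-alert items ("will" and "try to") and one certainty adverbial.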
Idiolect and Threat Realization
A cheat sheet of lexical clues is handy, but research has demonstrated that the most reliable indications of intent and danger from threat come from deeper analysis, beyond lexical and topic-level forms into structural and attitudinal forms.
For instance, the little-known linguistic concepts of stance (Biber et al., 1999) and appraisal (Martin & White, 2005) have demonstrated some of the strongest connections to threat realization. These representations of attitudinal meaning and social positioning are the basis for a predictive algorithm in use today with over 70% accuracy. Stance and appraisal analyses group lexical items not superficially but rather as diagnostics for more dynamic forms of meaning, allowing for contextualized insights with heightened predictive value.
Stance and appraisal are two examples of the deeper linguistic metrics required for a full analysis of idiolect. These can work in collaboration with the lexical clues above, where they have also been known to catch false negatives left by simpler, more straightforward lexical methods.
Idiolect analysis is multivalent and impossible to fully automate, given the form of the linguistic inputs. Accordingly, the assessment of high-risk personal threats always involves a human in the loop.
See you in two weeks.
1 Much data on the linguistic qualities of threat was gleaned from research using the FBI’s Communicated Threat Assessment Database (CTAD), which contained approximately 4,000 threatening communications. In 2012, CTAD was reintroduced as the Threatening Communication Database (TCD). This new database expanded beyond criminally oriented communications sent to the FBI to include letters and envelopes sent through USPS and reported to the FBI. Full texts are no longer included in the database, only searchable fragments such as phrases, misspellings, and odd word choices, intended to enhance investigations across cases. These changes have hugely limited the utility of the resource.
