Meet the people warning the world about new covid variants


When the WHO declared a pandemic in March 2020, the public sequencing database GISAID held 524 covid sequences. Over the next month, scientists uploaded 6,000 more. By the end of May, the total was over 35,000. (By contrast, scientists worldwide added 40,000 flu sequences to GISAID in all of 2019.)

As the number of covid sequences mounted, researchers trying to study them had to create entirely new infrastructure and standards on the fly. A universal naming system has been a key element of this effort: without it, scientists would struggle to talk to each other about how the virus’s descendants are traveling and changing, whether to flag a question or, more critically, to sound the alarm.

“Without a name, forget it, we can’t understand what other people are saying,” says Anderson Brito, a postdoctoral researcher in genomic epidemiology at the Yale School of Public Health who contributed to the Pango effort.

Where Pango came from

In April 2020, a handful of leading virologists in the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family tree. Although the names it produced (like B.1.1.7) were a bit of a mouthful, the system had a logic and a hierarchy.
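To make that hierarchy concrete, here is a minimal Python sketch that reads a dotted name such as B.1.1.7 as a path through the family tree. The function names are illustrative only and are not taken from the Pango software, and the sketch ignores the shorthand aliases the real system uses for long names.

```python
from typing import List, Optional


def parent(lineage: str) -> Optional[str]:
    """Return the lineage one level up, or None for a root such as 'B'."""
    head, sep, _ = lineage.rpartition(".")
    return head if sep else None


def ancestors(lineage: str) -> List[str]:
    """List every ancestor from the root down to the direct parent."""
    chain = []
    current = parent(lineage)
    while current is not None:
        chain.append(current)
        current = parent(current)
    return chain[::-1]


def is_descendant(lineage: str, ancestor: str) -> bool:
    """True if `lineage` is `ancestor` itself or sits beneath it."""
    # Compare with a trailing dot so that B.1.10 is not mistaken
    # for a child of B.1.1.
    return lineage == ancestor or lineage.startswith(ancestor + ".")


if __name__ == "__main__":
    print(ancestors("B.1.1.7"))
    print(is_descendant("B.1.1.7", "B.1"))
```

Read this way, each extra number in a name marks one more branching point below the lineage it descends from.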

One of the authors of the paper was Áine O’Toole, a PhD candidate at the University of Edinburgh. She would soon become the primary person doing this sorting and classification, eventually combing through hundreds of thousands of sequences by hand.

“Very early on, I was just the person who was available to curate the sequences,” she says. “That ended up being my job for a while. I guess I never quite understood the scale we were going to reach.”

She quickly began building software to assign new genomes to the right lineages. Shortly after that, another researcher, postdoctoral fellow Emily Scher, developed a machine learning algorithm to speed things up even more.

They named the software Pangolin, a nod to the debate over the animal origin of covid. (The entire system is now simply known as Pango.)

The naming system, along with the software to implement it, quickly became a global essential. Although the WHO has recently started using Greek letters for variants that seem particularly concerning, such as delta, those nicknames are for the public and the media. Delta actually refers to a growing family of variants, which scientists call by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
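The link between a Greek-letter nickname and its Pango names can be sketched in a few lines of Python. This is an illustration, not the Pango team's code: the alias table holds a single entry (in Pango, the prefix AY is shorthand for B.1.617.2, which is why AY.1 counts as a descendant), and the function names are made up for this example.

```python
# Illustrative alias table with a single entry; the real list is much
# longer and is maintained by the Pango team.
ALIASES = {"AY": "B.1.617.2"}

# The lineage at the root of the family the WHO calls delta.
DELTA_ROOT = "B.1.617.2"


def expand(lineage: str) -> str:
    """Replace a shorthand prefix such as 'AY' with its full lineage name."""
    prefix, sep, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix)
    if full_prefix is None:
        return lineage
    return full_prefix + sep + rest


def is_delta(lineage: str) -> bool:
    """True if the lineage is B.1.617.2 or one of its descendants."""
    full = expand(lineage)
    return full == DELTA_ROOT or full.startswith(DELTA_ROOT + ".")


if __name__ == "__main__":
    for name in ["B.1.617.2", "AY.1", "AY.2", "AY.3", "B.1.1.7"]:
        print(name, expand(name), is_delta(name))
```

The four names listed above all resolve to the same branch, while B.1.1.7 (alpha) does not.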

“When alpha appeared in the UK, Pango made it very easy for us to search for those mutations in our genomes to see if we had that lineage in our country too,” says Jolly. “Since then, Pango has been used as the basis for reporting and surveillance of variants in India.”

Because Pango offers a rational and orderly approach to what would otherwise be chaos, it could forever change the way scientists name viral strains and allow experts from all over the world to work together with a shared vocabulary. “Most likely, this will be the format we’ll use to monitor any other new virus,” Brito says.

Many of the key tools for tracking covid genomes have been developed and maintained over the past year and a half by early-career scientists like O’Toole and Scher. When the need for worldwide covid collaboration exploded, scientists rushed to support it with ad hoc infrastructure like Pango. Much of that work fell to tech-savvy young researchers in their 20s and 30s. They used informal networks and open-source tools, meaning the tools were free to use and anyone could volunteer tweaks and improvements.

“The people at the cutting edge of new technologies tend to be grad students and postdocs,” says Angie Hinrichs, a bioinformatician at UC Santa Cruz who joined the project earlier this year. For example, O’Toole and Scher work in the lab of Andrew Rambaut, a genomic epidemiologist who posted the first publicly available covid sequences online after receiving them from Chinese scientists. “They are perfectly placed to provide these tools that have become absolutely critical,” says Hinrichs.

Building fast

It hasn’t been easy. For most of 2020, O’Toole took on the bulk of the responsibility for identifying and naming new lineages. The university was closed, but she and Verity Hill, another of Rambaut’s doctoral students, obtained permission to come into the office. The 40-minute walk there from the apartment where she lived alone gave her a sense of normalcy.

Every few weeks, O’Toole would download the entire covid repository from the GISAID database, which had grown exponentially each time. She would then look for groups of genomes with mutations that looked similar, or for things that looked odd and might have been mislabeled.

When she got particularly stuck, Hill, Rambaut, and other members of the lab would step in to discuss the designations. But the grunt work fell to her.

“Imagine watching 20,000 sequences from 100 different parts of the world. I’ve seen sequences from places I’ve never heard of.”

Áine O’Toole, University of Edinburgh

Deciding when the virus’s descendants deserve a new family name can be as much art as science. It was a painstaking process to sift through an unprecedented number of genomes and ask, again and again: Is this a new variant of covid or not?

“It was pretty tedious,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I saw sequences from places I’d never even heard of.”

As time went on, O’Toole struggled to keep up with the volume of new genomes to sort and name.

In June 2020, there were more than 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 lineages. In November 2020, a month after she was due to submit her thesis, O’Toole made her last solo pass through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (She is putting a chapter on Pango in her dissertation, though covid has eclipsed her research on other viruses.)

Fortunately, the Pango software is designed to be collaborative, and others have stepped up. An online community, the one Jolly turned to after noticing the variant spreading across India, sprouted and grew. This year, O’Toole’s work has been much more hands-off. New lineages are now designated mostly when epidemiologists around the world contact O’Toole and the rest of the team via Twitter, email, or her preferred method, GitHub.

“It’s more reactive now,” O’Toole says. “If a group of researchers anywhere in the world is working on some data and they believe they have identified a new lineage, they can put in a request.”

The flood of data has continued. This past spring, the team held a “pangothon,” a kind of hackathon in which they sorted 800,000 sequences into about 1,200 lineages.

“We gave ourselves three solid days,” O’Toole says. “It took two weeks.”

Since then, the Pango team has recruited a few more volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, who both got involved initially by adding their two cents on Twitter and the team’s GitHub page. Chris Ruis, a postdoctoral fellow at the University of Cambridge, has turned his attention to helping O’Toole clear the backlog of GitHub requests.

O’Toole recently asked them to formally join the organization as part of the newly created Pango Network. Its lineage designation committee discusses and makes decisions about lineage names. Another committee, which includes lab leader Rambaut, makes higher-level decisions.

“We have a website and an email that isn’t just my email,” O’Toole says. “It has become much more formal, and I think that will really help it scale.”


As the data has grown, a few cracks have started to show around the edges. As of today, GISAID holds approximately 2.5 million covid sequences, which the Pango team has split into 1,300 branches. Each branch corresponds to a variant. Eight of these are ones to watch, according to the WHO.

With so much to process, the software is starting to buckle. Things are getting mislabeled. Many strains look similar, because the virus repeatedly evolves the most advantageous mutations.

As a workaround, the team developed new software that uses a different sorting method and can catch things Pango might miss.

