import React from "react";
import TrainingPipeline from "../assets/img/researchPage/training_pipeline.jpeg";
import PredictionPipeline from "../assets/img/researchPage/prediction_pipeline.jpeg";
import ModelWithoutAge from "../assets/img/researchPage/model_without_age.jpeg";
import ModelWithAge from "../assets/img/researchPage/model_with_age.jpeg";
import DeepModelWithoutAge from "../assets/img/researchPage/deep_without_age.jpeg";
import DeepModelWithAge from "../assets/img/researchPage/deep_with_age.jpeg";
import FeaturesWithoutAge from "../assets/img/researchPage/features_without_age.png";
import FeaturesWithAge from "../assets/img/researchPage/features_with_age.png";
import ReactTooltip from "react-tooltip";
import Sidebar from "./Sidebar";
import ContentBlock from "./ContentBlock";

const Content = () => {
  return (
    <>
      <div className="research__body">
        <div className="research__body-content section">
          <Sidebar />
          <div className="research__page-content">
            <ContentBlock header="Overview">
             <p>
              CovidOutcome2 is a SARS-CoV-2 mutation identification pipeline
              and a corresponding machine learning tool that is capable of
              predicting the outcome of the disease using the genome
              information and the age of a given patient.
             </p>
             <p>
              Our research goal was to apply state-of-the-art machine learning
              techniques to reveal possible links between the mutation status
              of the virus and the disease outcome, and to use these links to
              predict the outcome.
             </p>
             <p>
              The models are based on 67708 SARS-CoV-2 genomes and the
              corresponding patient data from the GISAID database. After
              rigorous data cleaning and preprocessing, the machine learning
              models were trained not only on single nucleotide substitutions
              but also on mutations affecting UTR regions. The training set
              was further stratified by time period and age group.
             </p>
             <p>
              CovidOutcome2 also provides a prediction pipeline that predicts
              the outcome of the disease from a genome sample. The uploaded
              genome is analyzed and a prediction is made by one of the
              suitable models, selected by the user. Alongside the
              prediction, the annotated mutations found in the sample are
              also reported.
             </p>
            </ContentBlock>

            <ContentBlock header="Training pipeline">
              <img src={TrainingPipeline} alt="Training pipeline" />
            </ContentBlock>

            <ContentBlock header="Data">
              <p>
                The SARS-CoV-2 samples are downloaded from the GISAID database.
                Samples that have patient status information and were uploaded
                to the database between 01/01/2020 and 12/31/2021 are used to
                generate the training set. Patient statuses are mapped to two
                values: severe and deceased statuses are mapped to severe,
                while non-hospitalized and asymptomatic statuses are mapped to
                mild. Samples with hospitalized or other patient statuses are
                filtered out, as are samples with an invalid patient age or a
                non-human host. The NCBI SARS-CoV-2 (hCoV-19/Wuhan/WIV04/2019)
                genome is used as the reference genome in the quality control
                and mutation detection parts of the pipeline. The annotations
                are produced by SnpEff with reference ID NC_045512.2.
              </p>
            </ContentBlock>

            <ContentBlock header="Quality check">
              <p>
                The sequence IDs are replaced with unique IDs. Sequences that
                do not meet the following requirements are filtered out:
              </p>
              <ul>
                <li>The length of the sequence is between 5 and 35000 bases.</li>
                <li>The ACGT ratio of the sequences is higher than 0.95.</li>
                <li>
                  The congruence score of the alignment to the reference genome
                  is higher than 0.75.
                </li>
              </ul>
              <p>
                The congruence score is computed from a MUMmer alignment of the
                sequence to the reference genome. It is defined as
                1 - (D + F) / l, where D is the number of nucleotides that
                differ from the reference, F is the number of bases by which
                the sequence exceeds the reference, and l is the length of the
                reference.
              </p>
            </ContentBlock>

            <ContentBlock header="Mutation detection">
              <p>
                Mutation calling starts by running MAFFT to align the sequences
                in the FASTA files with the reference sequence. Based on the
                alignment, SNVs and indels are called for each sample, and
                consecutive mutations are then grouped together. Large
                insertions and deletions at the beginning and at the end of the
                genome are filtered out, and an aggregated mutation table is
                generated.
              </p>
              <p>
                The aggregated mutation table is converted to VCF format and
                annotated with SnpEff based on the reference genome. The
                annotations are converted to a standard mutation format:
              </p>
              <ul>
                <li>
                  in case of coding regions: &#123;protein
                  name&#125;_&#123;mutation in HGVS format&#125;
                </li>
                <li>
                  in case of UTR regions: &#123;protein name
                  before&#125;-&#123;protein name after&#125;_n.&#123;mutation
                  in HGVS format&#125;
                </li>
              </ul>
              <p>
                Synonymous mutations and rare mutations (those that appear in
                less than 10% of the severe samples and in less than 10% of the
                mild samples) are filtered out. In total, 251 mutations were
                kept, including insertions, deletions and substitutions in the
                coding and non-coding regions. Neighboring SNVs are grouped
                together as one mutation.
              </p>
            </ContentBlock>

            <ContentBlock header="Model: CovidOutcome with Jadbio including age">
              <img
                src={ModelWithAge}
                alt="CovidOutcome with Jadbio including age"
              />
              <p>
                The training set includes 5892 severe and 5892 mild samples,
                with 251 mutations and the age of the patient as the features.
              </p>
              <p>
                The training data is stratified by the collection date of the
                sample: for each severe sample, a mild sample collected in the
                same quarter is added to the training set.
              </p>
              <p>
                The Statistically Equivalent Signature (SES) algorithm (with
                hyper-parameters maxK = 2, alpha = 0.05 and budget = 3 * nvars)
                is used for feature selection. It selects 25 features.
              </p>
              <p>
                Ridge Logistic Regression (with penalty hyper-parameter lambda =
                1.0) is used for prediction. The target of the prediction is the
                cohort of the sample (mild or severe), where 1 means severe and
                0 means mild. The relative strength of the predictors in the
                logistic model is shown in the following figure.
              </p>
              <img
                src={FeaturesWithAge}
                alt="Relative strength of the predictors in CovidOutcome with Jadbio including age"
              />

            </ContentBlock>

            <ContentBlock header="Model: CovidOutcome with Jadbio excluding age">
              <img
                src={ModelWithoutAge}
                alt="CovidOutcome with Jadbio excluding age"
              />
              <p>
                The training set includes 5271 severe and 5271 mild samples,
                with 251 mutations as the features.
              </p>
              <p>
                The training data is stratified by the collection date of the
                sample and the age category of the patient (0-39, 40-49, 50-59,
                60-69, 70-79, 80-89, 90-): for each severe sample, a mild
                sample collected in the same quarter and from the same age
                category is added to the training set.
              </p>
              <p>
                The Statistically Equivalent Signature (SES) algorithm (with
                hyper-parameters maxK = 2, alpha = 0.05 and budget = 3 * nvars)
                is used for feature selection. It selects 25 features.
              </p>
              <p>
                Ridge Logistic Regression (with penalty hyper-parameter lambda =
                1.0) is used for prediction. The target of the prediction is the
                cohort of the sample (mild or severe), where 1 means severe and
                0 means mild. The relative strength of the predictors in the
                logistic model is shown in the following figure.
              </p>
              <img
                src={FeaturesWithoutAge}
                alt="Relative strength of the predictors in CovidOutcome with Jadbio excluding age"
              />

            </ContentBlock>

            <ContentBlock header="Model: CovidOutcome with deep learning including age">
              <img
                src={DeepModelWithAge}
                alt="CovidOutcome with deep learning including age"
              />
              <p>
               The training set includes 5271 severe and 5271 mild samples, with 251
                mutations as the features.
              </p>
              <p>
                The training data is stratified by the collection date of the
                sample and the age category of the patient (0-39, 40-49, 50-59,
                60-69, 70-79, 80-89, 90-): for each severe sample, a mild
                sample collected in the same quarter and from the same age
                category is added to the training set.
              </p>
              <p>
                We applied a fully connected, multilayer neural network
                consisting of seven linear layers with ReLU activation
                functions. For each sample <i>i</i>, the output of layer{" "}
                <i>j+1</i>, <i>h<sub>i</sub><sup>(j+1)</sup></i>, is defined as
                a nonlinear function of the output of layer <i>j</i>:{" "}
                <i>h<sub>i</sub><sup>(j+1)</sup> = ReLU(Linear(h<sub>i</sub><sup>(j)</sup>))</i>,
                where <i>Linear(h<sub>i</sub><sup>(j)</sup>)</i> is the linear
                function <i>W<sup>(j)</sup> &times; h<sub>i</sub><sup>(j)</sup> + b<sup>(j)</sup></i>,
                ReLU is the rectified linear activation function,{" "}
                <i>W<sup>(j)</sup></i> is the weight matrix of layer <i>j</i>,
                and <i>b<sup>(j)</sup></i> is its bias vector. The number of
                neurons in each layer was &#123;251, 256, 256, 256, 128, 64,
                32&#125;. The models were trained for 500 epochs with a batch
                size of 128, using binary cross-entropy with logits as the loss
                function and the Adam optimizer, and were implemented with the
                PyTorch library.
              </p>

            </ContentBlock>

            <ContentBlock header="Model: CovidOutcome with deep learning excluding age">
              <img
                src={DeepModelWithoutAge}
                alt="CovidOutcome with deep learning excluding age"
              />
              <p>
                The training set includes 5271 severe and 5271 mild samples,
                with 251 mutations as the features.
              </p>
              <p>
                The training data is stratified by the collection date of the
                sample and the age category of the patient (0-39, 40-49, 50-59,
                60-69, 70-79, 80-89, 90-): for each severe sample, a mild
                sample collected in the same quarter and from the same age
                category is added to the training set.
              </p>
              <p>
                We applied a fully connected, multilayer neural network
                consisting of seven linear layers with ReLU activation
                functions. For each sample <i>i</i>, the output of layer{" "}
                <i>j+1</i>, <i>h<sub>i</sub><sup>(j+1)</sup></i>, is defined as
                a nonlinear function of the output of layer <i>j</i>:{" "}
                <i>h<sub>i</sub><sup>(j+1)</sup> = ReLU(Linear(h<sub>i</sub><sup>(j)</sup>))</i>,
                where <i>Linear(h<sub>i</sub><sup>(j)</sup>)</i> is the linear
                function <i>W<sup>(j)</sup> &times; h<sub>i</sub><sup>(j)</sup> + b<sup>(j)</sup></i>,
                ReLU is the rectified linear activation function,{" "}
                <i>W<sup>(j)</sup></i> is the weight matrix of layer <i>j</i>,
                and <i>b<sup>(j)</sup></i> is its bias vector. The number of
                neurons in each layer was &#123;251, 256, 256, 256, 128, 64,
                32&#125;. The models were trained for 500 epochs with a batch
                size of 128, using binary cross-entropy with logits as the loss
                function and the Adam optimizer, and were implemented with the
                PyTorch library.
              </p>

            </ContentBlock>

            <ContentBlock header="Comparison of the models">
              <p>
                The estimated model performances (area under the ROC curve and
                prediction accuracy), obtained with 10 times repeated 10-fold
                cross-validation, are shown in the following table.
              </p>
              <table>
                <thead>
                  <tr>
                    <td>Model</td>
                    <td>Algorithms</td>
                    <td>AUC</td>
                    <td>Accuracy</td>
                  </tr>
                </thead>
                <tbody>
                  <tr>
                    <td>CovidOutcome with Jadbio including age</td>
                    <td>Statistically Equivalent Signature / Ridge Logistic Regression</td>
                    <td>0.88</td>
                    <td>0.81</td>
                  </tr>
                  <tr>
                    <td>CovidOutcome with Jadbio excluding age</td>
                    <td>Statistically Equivalent Signature / Ridge Logistic Regression</td>
                    <td>0.75</td>
                    <td>0.67</td>
                  </tr>
                  <tr>
                    <td>CovidOutcome with deep learning including age</td>
                    <td>Fully connected, multilayer neural network</td>
                    <td>0.89</td>
                    <td>0.82</td>
                  </tr>
                  <tr>
                    <td>CovidOutcome with deep learning excluding age</td>
                    <td>Fully connected, multilayer neural network</td>
                    <td>0.83</td>
                    <td>0.75</td>
                  </tr>
                </tbody>
              </table>
            </ContentBlock>

            <ContentBlock header="Prediction pipeline">
              <img src={PredictionPipeline} alt="Prediction pipeline" />
              <p>
                The prediction pipeline uses the same quality check and mutation
                detection steps as the training pipeline on the input sequences.
              </p>
              <p>
                If age data is provided, the{" "}
                <i>CovidOutcome with Jadbio including age</i> model is applied
                to the detected mutations; otherwise, the{" "}
                <i>CovidOutcome with Jadbio excluding age</i> model is used.
                Either model returns a prediction value for each sample.
              </p>
              <p>The prediction values are classified as follows:</p>
              <ul>
                <li>0-0.2 high confidence mild</li>
                <li>0.2-0.4 low confidence mild</li>
                <li>0.4-0.6 undefined</li>
                <li>0.6-0.8 low confidence severe</li>
                <li>0.8-1 high confidence severe</li>
              </ul>
            </ContentBlock>

            <ContentBlock header="Limitations">
              <p>
                As many studies have shown since the start of the COVID-19
                pandemic, the main factor in the outcome of the disease is the
                age of the patient. We trained on data that included the
                patient’s age as well as on data that was stratified by age
                group. We found that the models trained without age reached
                significantly lower performance (accuracy of 0.826 with age
                versus 0.714 without). This shows that the mutations present
                in the virus affect the outcome much less than the patient’s
                age does.
              </p>
              <p>
                Analyzing the metadata of the samples, we found that the ratio
                of severe samples to all samples varies considerably between
                countries. For example, the ratio of severe samples is much
                lower in France (0.004) than in Italy (0.1112) in the studied
                samples. The models have found some mutations that are mostly
                prevalent in countries whose severe ratio differs significantly
                from the average (0.0895), for example NSP2 K81N or NS3a
                P104dup. We cannot conclude whether these mutations play a role
                in the disease outcome or not.
              </p>
              <p>
                As with any machine learning technique, our models are limited
                by the data they were trained on. They can make predictions on
                newer data only with limitations, and they cannot foresee new
                mutations and variants. This is a major drawback considering
                the reported mutability of SARS-CoV-2. Frequently updating the
                models with new samples could resolve this problem at least
                partly.
              </p>
            </ContentBlock>

            <ContentBlock header="Citations">
              <p>
                CovidOutcome2: a tool for SARS-CoV2 mutation identification
                 and for disease severity prediction. Regina Kalcsevszki,
                 András Horváth, Balázs Győrffy, Sándor Pongor, Balázs Ligeti,
                 bioRxiv 2022.07.01.496571; doi: https://doi.org/10.1101/2022.07.01.496571
              </p>
              <p>
                COVIDOUTCOME—estimating COVID severity based on mutation
                signatures in the SARS-CoV-2 genome. Ádám Nagy, Balázs Ligeti,
                János Szebeni, Sándor Pongor and Balázs Győrffy, 2021 Database,
                Volume 2021, 2021, baab020, DOI:10.1093/database/baab020
                https://academic.oup.com/database/article/doi/10.1093/database/baab020/6272506
              </p>
              <p>
                Different mutations in SARS-CoV-2 associate with severe and mild
                outcome. Nagy, Ádám, Sándor Pongor, and Balázs Győrffy, 2021
                International journal of antimicrobial agents 57.2 (2021):
                106272. DOI:10.1016/j.ijantimicag.2020.106272
                https://www.sciencedirect.com/science/article/pii/S0924857920305008
              </p>
            </ContentBlock>
          </div>
        </div>
      </div>
      <ReactTooltip place="right" type="dark" effect="solid" />
    </>
  );
};

export default Content;
