[ prog / sol / mona ]

prog


Longest common substring and subsequence of HIV and COVID

1 2020-04-27 04:50

Pr. Luc Montagnier, Nobel Laureate for his discovery of the HIV virus, claims that COVID-19 was created in a lab and contains sequences of HIV. Let's check that.

COVID sequence: https://www.ncbi.nlm.nih.gov/nuccore/MN908947
HIV sequence: https://www.ncbi.nlm.nih.gov/nuccore/9629357

CL-USER> (length hiv)
9181
CL-USER> (length covid)
29903
CL-USER> (defun longest-common-substring (a b)
  (let ((L (make-array (list (length a) (length b)) :initial-element 0))
        (z 0)
        (result '()))
    (dotimes (i (length a))
      (dotimes (j (length b))
        (when (char= (char a i) (char b j))
          (setf (aref L i j)
                (if (or (zerop i) (zerop j))
                    1
                    (1+ (aref L (1- i) (1- j)))))
          (when (> (aref L i j) z)
            (setf z (aref L i j)
                  result '()))
          (when (= (aref L i j) z)
            (pushnew (subseq a (1+ (- i z)) (1+ i))
                     result :test #'equal)))))
    result))
CL-USER> (defun longest-common-subsequence (array1 array2)
  (let* ((l1 (length array1))
         (l2 (length array2))
         (results (make-array (list l1 l2) :initial-element nil)))
    (declare (dynamic-extent results))
    (labels ((lcs (start1 start2)
               ;; if either sequence is empty, return (() 0)
               (if (or (eql start1 l1) (eql start2 l2)) (list '() 0)
                 ;; otherwise, return any memoized value
                 (let ((result (aref results start1 start2)))
                   (if (not (null result)) result
                     ;; otherwise, compute and store a value
                     (setf (aref results start1 start2)
                           (if (eql (aref array1 start1) (aref array2 start2))
                             ;; if they start with the same element,
                             ;; move forward in both sequences
                             (destructuring-bind (seq len)
                                 (lcs (1+ start1) (1+ start2))
                               (list (cons (aref array1 start1) seq) (1+ len)))
                             ;; otherwise, move ahead in each separately,
                             ;; and return the better result.
                             (let ((a (lcs (1+ start1) start2))
                                   (b (lcs start1 (1+ start2))))
                               (if (> (second a) (second b))
                                 a
                                 b)))))))))
      (destructuring-bind (seq len) (lcs 0 0)
        (values (coerce seq 'string) len)))))
CL-USER> (time (longest-common-substring hiv covid))
Evaluation took:
  6.390 seconds of real time
  6.385874 seconds of total run time (5.463049 user, 0.922825 system)
  [ Run times consist of 1.310 seconds GC time, and 5.076 seconds non-GC time. ]
  99.94% CPU
  12,115,641,079 processor cycles
  2,196,550,272 bytes consed

("atgtgtaaaattaa")
CL-USER> (time (longest-common-subsequence hiv covid))
Evaluation took:
  16.705 seconds of real time
  16.687692 seconds of total run time (15.005942 user, 1.681750 system)
  [ Run times consist of 10.308 seconds GC time, and 6.380 seconds non-GC time. ]
  99.90% CPU
  31,673,258,588 processor cycles
  4,153,634,880 bytes consed

"ggtctctctggttagaccagatctgagcctgggagctctctggctactagggaacc...gatcctgcatataagcagtgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggcaactagggaacccactgttaagcccaataaagcttcttgagtgc"
9013

The length of the longest common substring is 14. Nothing impressive, I'd probably get a similar result with two random sequences of same size and an alphabet of 4. The length of the longest common subsequence is 9013 (length of HIV is 9181). What does it mean? Probably nothing and I don't know shit about biology. Algorithms are from Rosetta code. Pump up the heap and the control stack size for the longest common subsequence

2 2020-04-27 06:15

>>1
Try at least to find all the common substrings and use BLAST :)

https://www.researchgate.net/post/Third_Sequence_COVID-19_AATGGTACTAAGAGG_HIV-1_isolate_1966324H9_from_Netherlands_envelope_glycoprotein_env_gene_sequence_ID_GU4555031

3


VIP:

do not edit these