Sulston score

Revision as of 14:37, 9 January 2009


The Sulston Score is an equation used in DNA mapping to numerically assess the likelihood that a given "fingerprint" similarity between two DNA clones is merely a result of chance. Used as such, it is a test of statistical significance. That is, low values imply that similarity is significant, suggesting that two DNA clones overlap one another and that the given similarity is not just a chance event. The name is an eponym that refers to John Sulston by virtue of his being the lead author of the paper that first proposed the equation's use[1].

The Overlap Problem in Mapping

Each clone in a DNA mapping project has a "fingerprint", i.e. a set of DNA fragment lengths inferred from (1) enzymatically digesting the clone, (2) separating these fragments on a gel, and (3) estimating their lengths based on gel location. For each pairwise clone comparison, one can establish how many lengths from each set match up. Cases having at least one match indicate that the clones might overlap, because matches may represent the same DNA. However, the underlying sequences for each match are not known. Consequently, two fragments whose lengths match may still represent different sequences. In other words, matches do not conclusively indicate overlaps. The problem is instead one of using matches to probabilistically classify overlap status.

Mathematical Scores in Overlap Assessment

Biologists have used a variety of means (often in combination) to discern clone overlaps in DNA mapping projects. While many are biological, i.e. looking for shared markers, others are basically mathematical, usually adopting probabilistic and/or statistical approaches.

Sulston Score Exposition

The Sulston Score is rooted in the concepts of Bernoulli and Binomial processes, as follows. Consider two clones, <math>\alpha</math> and <math>\beta</math>, having <math>m</math> and <math>n</math> measured fragment lengths, respectively, where <math>m \ge n</math>. That is, clone <math>\alpha</math> has at least as many fragments as clone <math>\beta</math>, but usually more. The Sulston score is the probability that at least <math>h</math> fragment lengths on clone <math>\beta</math> will be matched by any combination of lengths on <math>\alpha</math>. Intuitively, we see that, at most, there can be <math>n</math> matches. Thus, for a given comparison between two clones, one can measure the statistical significance of a match of <math>h</math> fragments, i.e. how likely it is that this match occurred simply as a result of random chance. Very low values would indicate a significant match that is highly unlikely to have arisen by pure chance, while higher values would suggest that the given match could be just a coincidence.

One of the basic assumptions is that fragments are uniformly distributed on a gel, i.e. a fragment has an equal likelihood of appearing anywhere on the gel. Since gel position is an indicator of fragment length, this assumption is equivalent to presuming that the fragment lengths are uniformly distributed. The measured location <math>x</math> of any fragment has an associated error tolerance of <math>\pm t</math>, so that its true location is only known to lie within the segment <math>x \pm t</math>.
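The matching rule implied by the tolerance can be sketched in a few lines of Python. This is only an illustration: the function name `count_matches` and the fragment values are hypothetical, and the sketch assumes that a length on clone <math>\beta</math> counts as matched when at least one length on clone <math>\alpha</math> lies within <math>\pm t</math> of it.

```python
from typing import Sequence


def count_matches(alpha: Sequence[float], beta: Sequence[float], t: float) -> int:
    """Count the lengths on beta matched by at least one length on alpha.

    A length b on beta is treated as matched if some length a on alpha
    satisfies |a - b| <= t (an assumed reading of the tolerance rule).
    """
    return sum(1 for b in beta if any(abs(a - b) <= t for a in alpha))


# Hypothetical fingerprints, in arbitrary gel-position units.
alpha_lengths = [120.0, 305.0, 410.0, 777.0, 950.0]  # m = 5
beta_lengths = [118.5, 400.0, 778.0]                 # n = 3

h = count_matches(alpha_lengths, beta_lengths, t=2.0)
print(h)
```

The returned count plays the role of the observed number of matches <math>h</math>, which can never exceed <math>n</math>.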

Derivation

In what follows, let us refer to individual fragment lengths simply as lengths. Consider a specific length <math>j</math> on clone <math>\beta</math> and a specific length <math>i</math> on clone <math>\alpha</math>. These two lengths are arbitrarily selected from their respective sets <math>i \in \{1, 2, \dots, m\}</math> and <math>j \in \{1, 2, \dots, n\}</math>. We assume that the gel location of fragment <math>j</math> has been determined and we want the probability of the event <math>E_{i,j}</math> that the location of fragment <math>i</math> will match that of <math>j</math>. Geometrically, <math>i</math> will be declared to match <math>j</math> if it falls inside the window of size <math>2 t</math> around <math>j</math>. Since fragment <math>i</math> could occur anywhere in the gel of length <math>G</math>, we have <math>P \langle E_{i,j} \rangle = 2 t / G</math>. The probability that <math>i</math> does not match <math>j</math> is simply the complement, i.e. <math>P \langle E_{i,j}^C \rangle = 1 - 2 t / G</math>, since it must either match or not match.

Now, let us expand this to compute the probability that no length on clone <math>\alpha</math> matches the single particular length <math>j</math> on clone <math>\beta</math>. This is simply the intersection of all individual trials <math>i \in \{1, 2, \dots, m\}</math> where the event <math>E_{i,j}^C</math> occurs, i.e. <math>P \langle E_{1,j}^C \cap E_{2,j}^C \cap \cdots \cap E_{m,j}^C \rangle</math>. This can be restated verbally as: length 1 on clone <math>\alpha</math> does not match length <math>j</math> on clone <math>\beta</math> and length 2 does not match length <math>j</math> and length 3 does not match, etc. Since each of these trials is assumed to be independent, the probability is simply

<math>P \langle E_{1,j}^C \rangle \times P \langle E_{2,j}^C \rangle \times \cdots \times P \langle E_{m,j}^C \rangle = \left(1 - 2 t / G\right)^m</math>.

Of course, the actual event of interest is the complement, i.e. that at least one length on clone <math>\alpha</math> matches. The probability of one or more matches is therefore <math>p = 1 - \left(1 - 2 t / G\right)^m</math>. Formally, <math>p</math> is the probability that at least one length on clone <math>\alpha</math> matches length <math>j</math> on clone <math>\beta</math>.
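As a numerical illustration of this expression for <math>p</math>, a minimal Python sketch follows. The function name and the values of <math>m</math>, <math>t</math> and <math>G</math> are hypothetical, chosen only to show the arithmetic. The sketch evaluates the formula through `log1p` and `expm1`, which is an implementation choice made here to limit rounding error when <math>2t/G</math> is small, not part of the original derivation.

```python
import math


def match_probability(m: int, t: float, G: float) -> float:
    """Return p = 1 - (1 - 2t/G)^m under the independence assumption."""
    q = 2.0 * t / G  # probability that one length on alpha matches length j
    if m < 0 or not 0.0 <= q < 1.0:
        raise ValueError("need m >= 0 and 0 <= 2t/G < 1")
    # Algebraically identical to 1 - (1 - q)**m.
    return -math.expm1(m * math.log1p(-q))


# Hypothetical values: 10 lengths on alpha, tolerance 1, gel length 100.
p = match_probability(m=10, t=1.0, G=100.0)
print(round(p, 5))
```

With a single length on clone <math>\alpha</math> (<math>m = 1</math>) the function reduces to <math>2t/G</math>, as expected from the single-pair case.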

This event is taken as a Bernoulli trial having a "success" (matching) probability of <math>p</math> for length <math>j</math>. However, we want to describe the process over all <math>n</math> lengths on clone <math>\beta</math>. Since <math>p</math> is constant, the number of matches is distributed binomially. Given <math>h</math> observed matches, the Sulston score <math>S</math> is simply the probability of obtaining at least <math>h</math> matches by chance according to

<math>S = \sum_{k=h}^n C_{n,k} p^k (1-p)^{n-k},</math>

where <math>C_{n,k}</math> are binomial coefficients and the summation index <math>k</math> counts the number of matched lengths.
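The binomial tail sum can be evaluated directly. The Python sketch below is a plain transcription of the formula for <math>S</math>, not the implementation used by any particular mapping program. The function name `sulston_score` and the numerical values of <math>m</math>, <math>n</math>, <math>t</math> and <math>G</math> are hypothetical.

```python
from math import comb, fsum


def sulston_score(h: int, n: int, p: float) -> float:
    """Probability of at least h matches among n Bernoulli(p) trials."""
    if not 0 <= h <= n:
        raise ValueError("need 0 <= h <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("need 0 <= p <= 1")
    # Sum the binomial probabilities from h up to n matches.
    return fsum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(h, n + 1))


# Hypothetical comparison: m = 25 and n = 20 lengths, tolerance 1, gel length 1000.
m, n, t, G = 25, 20, 1.0, 1000.0
p = 1.0 - (1.0 - 2.0 * t / G) ** m

for h in (1, 5, 10):
    print(h, sulston_score(h, n, p))
```

The score decreases as <math>h</math> grows, reflecting that many coincidental matches are less likely than few. For very small scores, summing individual terms in floating point may lose relative accuracy, so a production implementation might prefer a library routine for the binomial survival function.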

Mathematical Refinement

In a 2005 paper[2], Michael Wendl gave an example showing that the assumption of independent trials is not valid. So, although the traditional Sulston score does indeed represent a probability distribution, it is not the distribution characteristic of the fingerprint problem. Wendl went on to give the general solution for this problem in terms of the Bell polynomials, showing that the traditional score overpredicts P-values by orders of magnitude. (P-values are very small in this problem: for example, probabilities on the order of <math>10^{-14}</math> versus <math>10^{-12}</math>, the latter Sulston value being 2 orders of magnitude too high.) This solution provides a basis for determining when a problem has sufficient information content to be treated by the probabilistic approach and is also a general solution to the birthday problem of 2 types.

A disadvantage of the exact solution is that its evaluation is computationally intensive and, in fact, is not feasible for comparing large clones[2]. Some fast approximations for this problem have been proposed[3].

References

  1. Sulston J, Mallett F, Staden R, Durbin R, Horsnell T, Coulson A (1988). "Software for genome mapping by fingerprinting techniques". Comput Appl Biosci. 4 (1): 125–32. PMID 2838135.
  2. Wendl MC (2005). "Probabilistic assessment of clone overlaps in DNA fingerprint mapping via a priori models". J Comput Biol. 12 (3): 283–97. doi:10.1089/cmb.2005.12.283. PMID 15857243.
  3. Wendl MC (2007). "Algebraic correction methods for computational assessment of clone overlaps in DNA fingerprint mapping". BMC Bioinformatics. 8: 127. doi:10.1186/1471-2105-8-127. PMC 1868038. PMID 17442113.

See also

  • FPC: a widely-used fingerprint mapping program that utilizes the Sulston Score
