Nicole G. answered 19d
R Studio & Data Analysis | Biochemistry, A&P, Biology | Exams/Projects
Threshold - decides which sequences get into the model
PSI-BLAST threshold values (E values)
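E value = the number of hits with a score at least this good that you would expect to see by chance in a database of this size. The smaller the E value, the less likely the hit is a chance match.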
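smaller number (e.g. 0.0001) = stricter, only a few reliable hits, but could miss real homologs
default = 0.005 on the NCBI web PSI-BLAST (the standalone psiblast program defaults to 0.002 for -inclusion_ethresh)
larger number (e.g. 0.1) = looser, many hits, but unrelated sequences are more likely to get into the model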
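A minimal sketch of how the threshold acts as a filter. The hit names and E values are made up for illustration, and select_for_model is a hypothetical helper, not part of any BLAST library.

```python
# Hypothetical hits from one search round: (sequence name, E value).
hits = [("seqA", 1e-30), ("seqB", 0.003), ("seqC", 0.05), ("seqD", 2.0)]

def select_for_model(hits, inclusion_threshold=0.005):
    """Keep only hits whose E value is at or below the inclusion threshold."""
    return [name for name, e_value in hits if e_value <= inclusion_threshold]

strict = select_for_model(hits, inclusion_threshold=0.0001)
default = select_for_model(hits)  # 0.005
loose = select_for_model(hits, inclusion_threshold=0.1)

print("strict :", strict)
print("default:", default)
print("loose  :", loose)
```

The strict setting keeps only the strongest hit, while the loose setting also lets in the borderline seqC, which may or may not be a real homolog.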
PSI-BLAST first runs like a normal BLAST search.
- It finds matching sequences (hits).
- Trusted hits (based on the threshold) are aligned into a multiple sequence alignment (MSA).
- The MSA is used to build a PSSM (position-specific scoring matrix).
- PSI-BLAST searches the database again using the PSSM instead of a general scoring matrix.
- This makes the search more sensitive.
- New trusted hits may be added.
- The PSSM is updated.
- Another iteration begins.
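The loop above can be sketched with a toy example. This is not how PSI-BLAST scores sequences: it assumes short, pre-aligned, equal-length sequences, uses a simple frequency-based score (higher = better, the opposite direction of an E value), and every name and threshold here is hypothetical.

```python
from collections import Counter

def build_profile(aligned):
    """Column-wise residue frequencies from equal-length, pre-aligned sequences."""
    n = len(aligned)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*aligned)
    ]

def profile_score(profile, seq):
    """Toy score in [0, 1]: mean profile frequency of the residues in seq."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq)) / len(profile)

def iterative_search(query, database, min_score=0.45, max_rounds=5):
    """Search, rebuild the profile from trusted hits, and repeat until stable."""
    included = [query]
    history = []  # trusted hits found in each round
    for _ in range(max_rounds):
        profile = build_profile(included)
        hits = [s for s in database if profile_score(profile, s) >= min_score]
        history.append(hits)
        if set(hits) | {query} == set(included):
            break  # no change in the trusted set: converged
        included = [query] + hits
    return included, history

query = "ACDEFG"
database = ["ACDEYG", "ACNEYG", "TCNEYW", "WWWWWW"]
included, history = iterative_search(query, database)

for round_no, hits in enumerate(history, start=1):
    print(f"round {round_no}: {hits}")
```

In this toy run the distant sequence TCNEYW is missed in round 1 (query only) but picked up in round 2, once the profile includes the closer relatives. That is the sense in which iterating makes the search more sensitive.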
Pseudocount - a small extra value added to the observed counts so PSI-BLAST doesn't assume an amino acid is impossible at a position just because it wasn't observed there.
Low pseudocount = very specific profile, but sensitive to noise; reasonable when many sequences agree (e.g. position 10 = A in all 5 sequences)
High pseudocount = smoother, more conservative profile; useful when there is little data (e.g. position 10 = A in only 2 sequences)
The pseudocount isn't a fixed number. It is calculated based on:
(1) number of aligned sequences at each position
(2) how conserved or variable that position is
(3) background amino-acid frequencies derived from known databases.
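A rough sketch of how factors (1) and (3) can interact. It is a simplification, not PSI-BLAST's actual calculation: alpha = number of sequences - 1 is an assumed stand-in for "how much data there is", beta is a hypothetical pseudocount weight, and the background is assumed uniform (1/20).

```python
def observed_weight(n_sequences, beta=0.5):
    """Fraction of trust placed in the observed data (0 = none, near 1 = almost all)."""
    alpha = n_sequences - 1  # assumed stand-in for the amount of evidence
    return alpha / (alpha + beta)

def blended_probability(observed_freq, background_freq, n_sequences, beta=0.5):
    """Mix observed and background frequencies, same as (alpha*f + beta*g) / (alpha + beta)."""
    w = observed_weight(n_sequences, beta)
    return w * observed_freq + (1 - w) * background_freq

UNIFORM_BACKGROUND = 1 / 20  # assumption: all 20 amino acids equally likely

# Position where every aligned sequence has A (observed frequency of A = 1.0).
p_a_five = blended_probability(1.0, UNIFORM_BACKGROUND, n_sequences=5)
p_a_two = blended_probability(1.0, UNIFORM_BACKGROUND, n_sequences=2)

print(f"5 sequences: P(A) = {p_a_five:.3f}")
print(f"2 sequences: P(A) = {p_a_two:.3f}")
```

With only 2 sequences the background gets more say, so P(A) is pulled further away from 1.0 than with 5 sequences.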
Example:
Suppose PSI-BLAST aligns 5 sequences and position 10 is A in all 5 of them.
Observed counts (no pseudocount):
A = 5
All other amino acids = 0
Probabilities without pseudocount:
P(A) = 1.0
P(any other amino acid) = 0
This means PSI-BLAST would treat all other amino acids as impossible at this position.
Same example with a pseudocount applied:
Adjusted counts (conceptual):
A = 5 + small extra
Other amino acids = 0 + very small values
Example probabilities with pseudocount:
P(A) ≈ 0.90
P(G) ≈ 0.005
P(S) ≈ 0.005
P(other amino acids) ≈ small but nonzero
This means A is strongly preferred, but other amino acids are still possible.
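The same example can be reproduced with a short sketch. The numbers are illustrative only: the total pseudocount of 0.6 and the uniform background are assumptions chosen to land near the probabilities above, and the log-odds score is an unscaled simplification of a PSSM entry.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(AMINO_ACIDS)  # assumption: uniform background frequencies

def column_probabilities(column, pseudocount_total=0.6):
    """Probability of each amino acid at one alignment column.

    The pseudocount total is spread over all amino acids in proportion
    to the background, then added to the observed counts.
    """
    n = len(column)
    return {
        aa: (column.count(aa) + pseudocount_total * BACKGROUND) / (n + pseudocount_total)
        for aa in AMINO_ACIDS
    }

def log_odds(p):
    """Unscaled log-odds score versus background; -inf when p is zero."""
    return math.log2(p / BACKGROUND) if p > 0 else float("-inf")

column = "AAAAA"  # position 10 = A in all 5 sequences
without = column_probabilities(column, pseudocount_total=0.0)
with_pc = column_probabilities(column)

print("no pseudocount  : P(A) =", without["A"], " P(G) =", without["G"])
print("with pseudocount: P(A) =", round(with_pc["A"], 3), " P(G) =", round(with_pc["G"], 4))
print("score for G:", log_odds(without["G"]), "->", round(log_odds(with_pc["G"]), 2))
```

With these assumed settings P(A) comes out at roughly 0.90 and each other amino acid at about 0.005. Without the pseudocount, the score for G is minus infinity, so a single G at that position would rule a sequence out completely.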