Conserved Immunoglobulin Domain Similarities of Higher Plant Proteins

Traces of immunoglobulin domain similarities were searched for in sequences of higher plants using bioinformatic tools, in order to look for possible early phylogenic structural relationships. A total of 280 thousand sequence IDs, obtained by sixteen types of primary BLAST searches, were processed by seventeen different selection procedures and an anti-redundant sequence-related approach using JavaScript, PHP, Windows programs and conserved domain searches by means of CDD. The resulting seventeen sets of records describing conserved domain similarities of 1323 different sequence IDs yielded a next-generation set (final set) comprising forty-nine records containing superior (“non-refutable”) conserved immunoglobulin domain similarities. The selected sets and their subsets were mapped and subsequently statistically compared with respect to immunoglobulin-related as well as other reciprocal domain linkages. The list of frequently occurring conserved domain similarities concerned first of all domains important for plant and metazoan immunity, e.g. tyrosine kinases accompanying variable immunoglobulin domains in early Metazoa, toll-like receptors, lectin and leucine-rich repeat domains. The detailed description of immunoglobulin domain similarities occurring in the final set was completed by fold analysis of the restricted segments. The data were then discussed with respect to i) immunoglobulin fold evolution, ii) the possible structural importance of domains cd14066 (IRAK) and PLN00113 (LRR-associated kinase) for the deep evolution of catalytic serine/threonine/tyrosine kinase domains, iii) interatomic, structural and specificity standpoints and iv) traces of antibody-like phosphorylation sites described in our previous paper.


Introduction
In accordance with recent opinions, we distinguish two main layers of plant immunity, i.e. pathogenic pattern-triggered immunity (PTI) and effector-triggered immunity (ETI). These layers are frequently accompanied by activation of kinase cascades, Ca2+ influxes, generation of reactive oxygen species, transcriptional reprogramming, phytohormone signalling, proteasome degradation pathways and, more specifically (i.e. in special cases of virus infections), also RNA silencing machinery [1] [2]. PTI is initiated by interactions of pathogenic motifs (patterns) with cell surface signaling molecules exposed to the cell wall, mostly belonging to the superfamily of receptor-like kinases (RLK; [3] [4] [5] [6]). Among familiar groups of RLK, serine/threonine receptor-like kinases (STRK) are frequent. In addition to their specific catalytic domains, these STRK can contain binding sites formed by e.g. lectin or leucine-rich repeat (LRR) domains [7] [8] [9] [10]. ETI is mostly mediated by intracellular receptors called nucleotide-binding site/LRR proteins (NLR). Molecules of NLR are composed of i) a nucleotide-binding LRR domain and ii) either a coiled-coil (CC) or a Toll-like/interleukin 1 receptor (TIR) domain [11] [12] [13].
The domains mentioned above, i.e. catalytic domains of STRK and binding TIR or LRR domains, also take part in mechanisms of metazoan immunity [14] [15]. In accordance with this parallel occurrence, we can pose the question of whether at least some ancestor-like traces of typical animal superfamilies can be found in higher plants (Embryophyta), as consequences of horizontal transfer or co-evolution of superfamily ancestors. This concerns first of all traces of the 800-million-year-old immunoglobulin (Ig) superfamily, representing a very important group of immune proteins in most metazoans, including sponges and vertebrates [16] [17] [18] [19].
In our four preceding papers [20] [21] [22] [23], we hypothesized step-by-step a possible role of antibody-like phosphorylation sites (ALPS) or the corresponding nucleotide repeats in the evolution of antigen receptors and their ancestors. Consequently, ALPS sequences were used together with specifically restricted segments of conserved Ig domain sequences (Ig-cd) to compose four multiple protein sequence queries (MPSQ) inputted in our starting procedural step including BLAST searches. More precisely, the corresponding five-step selection procedures and the following data analysis were performed as described in Figure 1. Concerning superior ("non-refutable") conserved domain similarities (cds) of Ig-cd (i.e. Ig-cds) present in the set NRI2 (cf. Figure 1), we found cds-recording files (cds-files) representing protein sequences attaining i) significant Ig-cds (p < 0.01), ii) quasi-significant Ig-cds (0.01 ≤ p < 0.1) and iii) less significant Ig-cds supported by FFAS-fold searches or literature contexts. Two types of Ig-cds were distinguished in the NRI set. Dominant Ig-cds attained superior regional evaluation of cds. Recessive Ig-cds co-located with cds of non-Ig domains that achieved a higher bit score, but sometimes occurred in an interesting context. Two significant dominant Ig-cds included bacterial Ig domains. In addition to Ig-cds evaluation, we indicated multiple associations of robust cds of catalytic-kinase-, lectin-, LRR- and TIR-related domains, broadly occurring in the selected cds-files.
Figure 1. Five-step selection and the following analysis of Ig-similar and Ig-domain-related sequences. A concise methodological overview related to the content of our paper can be seen here. E-Expects, S-sample sizes.

Overall Description of Softwares and Procedures
For an overall scheme of procedures and approaches used in this paper see Figure 1. Selection of sequence IDs and cds-files was performed with the assistance of the online programs BLASTP, PHI-BLAST [24] [25] and conserved domain searches of BLAST (CDD searches [26] [27] [28] [29]) based on domain structures selected with the contribution of three-dimensional analysis [30]. Further analysis of the data required online bioinformatic programs such as FFAS03 and HMMER [31] [32] [33] and a downloadable active web page with Fisher's exact test for 2 × 2 tables [34]. Our JavaScript and PHP codes of active web pages were written in the freely downloadable PSPad editor. Easy PHP 12.1 then enabled runs of PHP codes. For some purposes, Word, Windows Explorer, or conditional formatting and histograms generated by Excel were necessary. For details see Sections of Chapter WP1.

Multiple Protein Sequence Queries (MPSQ)
Conserved variable Ig domains (IgV-cd) were used to form two MPSQ (MQI), i.e. MQI1 and MQI2. Segments forming our MQI composed the sequence block presented in Figure 2 of our previous paper [23]. Besides local high sequence similarities in the selected sequence sub-blocks, additional phylogenic parameters determined the final restriction of MQI segments. MQI1 included segments of sub-block (pm3) restricted based on fold relationships (block positions 15-42), the corresponding consensus segment and the accompanying pattern WXXQXP.
DOI: 10.4236/cmb.2020.101002 J. Kubrycht, K. Sigler
MQI2 comprised the pattern DX(3)YXC and the segments containing common amino acids L, D, Y, C (block positions 81-97), restricting also the corresponding conserved sub-block of IgV-cd-related invertebrate sequences [21]. Consensus was not used to form MQI2 due to its identity with the participating segment of cd00099.
Two ALPS-related MPSQ (MQP), i.e. MQP1 and MQP2, were composed of the two different groups of ALPS achieving superior evaluation in database confirmation or prediction of ALPS. For the ALPS groups and the protocol describing MPSQ formation from segment sequences and spacers see [23].
Figure 2. Quasi-ternary approaches (Q3A). In accordance with Methods, Q3A added a weak condition (quasi-third dimension) to the selection of common sequence IDs coming from certain pairs of specific BLAST mega-records, always obtained with two different MPSQ. This weak condition was realized in four alternative ways, when requiring: i) additional participation of a PHI-BLAST-associated sequence pattern in the selection process (this condition restricts the collection of F-sets), ii) occurrence of the selected sequence ID in two mega-records of searches differing only in the adjusted matrix (double matrix sets, i.e. DM-sets), iii) the presence of one or two words or abbreviations in molecular titles associated with the pre-selected sequence IDs (TI-sets) and iv) additional combinations of set fusion and conjugation in cases considered as spreading or analytical for our data processing (sets CUFC, NPoP and ter+). Ø-BLAST mega-records are formally assumed but in fact do not exist (this status followed from the absence of sequence patterns associated with MQP1 or MQP2 and the impossibility of performing PHI-BLAST with the matrix PAM250); panI1, panI2-sets prepared by conjugation of five sets (cf. Section 2.6); q3dim-condition representing the "quasi-third" dimension; oP-sets obtained with ALPS-related MPSQ (i.e. MQP1 or MQP2); -P, -NP, -NPoP-disjunctive subsets obtained only with, only without or simultaneously with and without ALPS-related MPSQ, respectively (similarly to the unique subset -NPoP, the set NPoP exists); r_partial-only an incomplete list of r-values was used due to incompatibilities described in the explanation of abbreviation Ø.

Two Different Restrictions of Sequence IDs Coming from Top Sequence Items of BLAST Records
Five parallel "sample records" differing in limiting Expect values or the download date, but belonging to the same type of BLAST records, were obtained in each of sixteen possible cases (cf. Figure 2). "Sample records" differed in the numbers of extracted items when forming top10 and t100 samples composed of the upper 10 or 100 sequence items, respectively. These records were specifically [...] This means that non-redundant fusion of ALPS-unrelated (k = 1-10) and ALPS-related (k = 11-16) C[k] immediately formed the disjunctive subsets t100dm-NP and t100dm-P, respectively.
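The set operations behind the -NP, -P and -NPoP subsets can be illustrated by a minimal Python sketch. It assumes that each C[k] is simply a collection of sequence IDs and that the subsets follow the definitions given with Figure 2 (IDs obtained only without, only with, or both with and without ALPS-related MPSQ). The function names and the toy IDs are hypothetical and are not taken from the authors' JavaScript or PHP code.

```python
def fuse(records):
    """Non-redundant fusion: the union of sequence IDs over several records."""
    fused = set()
    for ids in records:
        fused |= set(ids)
    return fused


def disjunctive_subsets(unrelated_records, related_records):
    """Split fused IDs into three mutually exclusive subsets.

    NP   - IDs found only with ALPS-unrelated queries
    P    - IDs found only with ALPS-related queries
    NPoP - IDs found with both kinds of queries
    """
    np_ids = fuse(unrelated_records)
    p_ids = fuse(related_records)
    return {
        "NP": np_ids - p_ids,
        "P": p_ids - np_ids,
        "NPoP": np_ids & p_ids,
    }


# Toy input (hypothetical IDs): two ALPS-unrelated and one ALPS-related record.
c_unrelated = [["A1", "A2"], ["A2", "A3"]]
c_related = [["A3", "A4"]]
subsets = disjunctive_subsets(c_unrelated, c_related)
```

In the paper's setting the first argument would hold the ten records with k = 1-10 and the second the six records with k = 11-16.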

BLAST-Derived Mega-Records Necessary for Quasi-Ternary and Multi-Cumulative Approaches
The aim of the searches was to obtain BLAST records containing as many sequence items as possible. Consequently, the productive Expect limits ranged from 2 × 10^5 to 2 × 10^7. This yielded 20,000 sequence IDs for each BLASTP record, whereas PHI-BLAST searches yielded lower maximum numbers of sequence IDs, i.e. 6899 to 15,300. Each of the sixteen types of mega-records (cf. Figure 2) was downloaded in at least three versions differing in the date of download or limiting Expects.

Selection of Sequence IDs Based on Multi-Cumulative Approach
We selected sequence IDs simultaneously found by the same MPSQ in five required BLAST mega-records (cf. Figure 2).
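As an illustration only, the requirement that a sequence ID occurs in all five mega-records obtained with one MPSQ corresponds to a set intersection. The sketch below is written in Python and is not the authors' JavaScript or PHP code. The function name and the toy IDs are hypothetical.

```python
def multi_cumulative(mega_records):
    """Return the sequence IDs present in every one of the given mega-records."""
    id_sets = [set(record) for record in mega_records]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Toy input: five mega-records of one MPSQ, with hypothetical sequence IDs.
records = [
    ["X1", "X2", "X3"],
    ["X2", "X3", "X4"],
    ["X3", "X2"],
    ["X2", "X3", "X9"],
    ["X3", "X5", "X2"],
]
selected = multi_cumulative(records)
```

Only IDs that survive all five records are kept, so a single missing hit is enough to exclude a sequence.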

Regularity of cds-Related Subsets
To avoid a false selection of cds-files containing only domain references but not searched cds, three keyword candidates (i.e. Pssm-ID, accession and name) were tested for numbers of selected cds-files. At least two of these numbers had to achieve minimum values to confirm regularity of the keywords associated with the minima (cf. WP2.1).

Two Set Families Were Derived When Removing Sequence Redundancy of cds-Files
Current conserved domain searches (see above) determining cds restricted by p < 0.01 and sample size 500 were performed with all items pre-selected by the preceding procedures (Figure 2). This resulted in a class of sets with starting sets of cds-files.

Three-Step Selection of Model Tyrosine Kinase Domains Forming Robust cds with Sequences of Plant Proteins and the Proteins of Early Metazoans Containing Variable Ig Domains in Addition
In the first step, we searched for NS1F-related files including receptor terms: "Ig) domain", "Ig domain", "Ig)-like", "Ig-like", "Immunoglobulin", "B-cell receptor" and "T-cell receptor". In the second step, the records of conserved domains extracted from the files were reduced to those including the required cell type or cell-type-associated process, i.e. looking for the terms: "B-cell", "T-cell", "lymphocyte", "lymphoma", "autoimmune", "leukemia", "macrophage", "phagocyte", "immune system", "immunity", "hematopoietic". In the third step, we kept only such regular records of conserved domains whose names compose cds-files of three well-defined signaling molecules containing both IgV-cd and tyrosine-kinase activity (RTK, SRTK, GCTK2) endogenously expressed in the immunologically important model living fossil Geodia cydonium [...] we substituted the preceding selection procedures with the knowledge about the linkage of this domain to specific immunity and the fossil IgV-cd mentioned above, and then complemented the former set of nine selected cdigvtk-related accessions with cd05034 (for the list of ten selected cdigvtk see Results). Four types of strategies comprising searches with ten selected regular domain keywords (see above) of cdigvtk (dci) were employed, i.e. i) enumeration of the most frequently selected subsets of the same cds-files (mfs) when using dci, ii) identification of the item number determined by any of dci (cdigvtk(max)) and iii) enumeration of cds-files containing all dci, i.e. total coincidence of dci (cdigvtk(tc)).
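The difference between cdigvtk(max) and cdigvtk(tc) can be sketched in Python as an "any" versus "all" keyword test over the text of each cds-file. This is a simplified illustration under the assumption that keywords are matched as plain substrings. The file names, the file texts and the keyword "domain_b" are hypothetical placeholders, and only cd05034 is an accession named in the text.

```python
def files_with_any(cds_files, keywords):
    """Names of cds-files containing at least one keyword (cdigvtk(max)-like count)."""
    return {name for name, text in cds_files.items()
            if any(keyword in text for keyword in keywords)}


def files_with_all(cds_files, keywords):
    """Names of cds-files containing every keyword (cdigvtk(tc)-like count)."""
    return {name for name, text in cds_files.items()
            if all(keyword in text for keyword in keywords)}


# Toy data: file name -> file text; "domain_b" stands in for a second accession.
toy_files = {
    "file1": "hit cd05034 PTKc_Src_like; hit domain_b",
    "file2": "hit cd05034 PTKc_Src_like",
    "file3": "hit pfam01453 lectin",
}
toy_keywords = ["cd05034", "domain_b"]

any_hits = files_with_any(toy_files, toy_keywords)
all_hits = files_with_all(toy_files, toy_keywords)
```

By construction the "all" subset is always contained in the "any" subset, which matches the relation between the two counts.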

Statistical Evaluations
For multiple notes on statistical processing, including enumeration of odds ratio values (OR, and OR* for 2 × 2 tables including zero values) and the t-test, see our preceding paper [23], its important associated sources [...] E[max] enabled us to assemble the Ig-cds-related set selected from all sets of NSI2F. This set was called the NRI2 set (cf. Figure 1). For evaluation of Expect-related specificity and the overall hierarchy of levels classifying cds in our paper see WP2.5 and WP2.6, respectively.
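For orientation, the evaluation of a 2 × 2 table by an odds ratio and Fisher's exact test can be sketched in Python as follows. This is not the web page cited as [34]. The exact definition of OR* is given in [23] and WP2.4 and is not reproduced here, so the sketch substitutes the common Haldane-Anscombe correction (adding 0.5 to every cell when a zero occurs), which is an assumption. The two-tailed p-value sums all tables no more probable than the observed one, which is one usual convention.

```python
from math import comb


def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio (a*d)/(b*c) of the table [[a, b], [c, d]].

    If any cell is zero, `correction` is added to every cell
    (Haldane-Anscombe correction, an assumed stand-in for OR*).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)


def fisher_exact(a, b, c, d):
    """Return (one-tailed, two-tailed) p-values of Fisher's exact test.

    The one-tailed value tests for an excess in cell a.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    one_tailed = sum(prob(x) for x in range(a, high + 1))
    two_tailed = sum(p for p in (prob(x) for x in range(low, high + 1))
                     if p <= p_observed * (1 + 1e-9))
    return min(1.0, one_tailed), min(1.0, two_tailed)


# Toy table (hypothetical counts).
p_one, p_two = fisher_exact(3, 1, 1, 3)
or_plain = odds_ratio(3, 1, 1, 3)
or_zero = odds_ratio(5, 0, 2, 4)
```

Using exact integer binomial coefficients keeps the sketch free of external dependencies, at the cost of speed for very large tables.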

Terms, Acronyms, Abbreviations, Texts Denoted by WP-Associated References and the Reasons for Color Grading in Map-Like Pictures or Absence of Phylogenic Trees
The corresponding information composes our web page supplement, i.e. a pdf-file accessible in the corresponding section of our web page http://www.papersatellitesjk.com or via e-mail correspondence.

Cds-Files, Whose Descriptions of Individual Cds Contain the Term Immunoglobulin
The robust cds with catalytic tyrosine kinase domains frequently included the term "immunoglobulin" in their descriptions. This also concerned most descriptions of cds with evolutionarily important model IgV-domain-associated catalytic tyrosine kinase domains from Geodia cydonium (cdigvtk; preselected here according to Methods). The corresponding coincidence with the term immunoglobulin was demonstrated using cds of domain cd05034 (representing the Src-kinase-like family), which achieved the highest score among cdigvtk-derived cds (Figure 3 and Figure 4). In spite of these favorable results and the robust score values evaluating cds with cdigvtk (55-140 bits), the other kinases (serine/threonine kinases without immunoglobulin-related contexts) achieved considerably higher scores in their cds with the same sequence regions (Figure 4). In addition, the hierarchy of cds demonstrated in Figure 4 appeared to be consistent with the hierarchy of the cds-file-related subsets composing the receptor-kinase-domain-rich set RoK, i.e. the set attaining the most statistically deviated occurrences of cdigvtk (cf. [...] where square brackets include numbers of cds-files (NCF) specifically double- [...] 2; cf. also [35]). This enabled us to make two color maps of statistical relationships. As follows from these maps and cds-records, PLN00113 achieved extreme variability of bit score values in its cds (maximum SD and Rel-SD), accompanied by considerable differences in cds lengths (data not shown) and maximum bit score values in both mentioned subsets (cf. black elements). In contrast, the minimum Rel-SD was found in rows of cd14066, which was mostly superior in individual cases of evaluated sequence IDs. Regular records Ras (cd00882), IRAK3 (pseudogene), Lyk3 and Sek (neither found among CDD sequence IDs) were not found in our cds-files.
Abbreviations: b, u-bottom and upper bit score values within cdigvtk set in evaluated cds-file; BS-bit score; cdigvtk/out-domain cds excluded from the compared model set due to repeated Tie specificity; gray background-slight differences in mean-score-derived order were observed in RoK-rc and NRI-mc subsets; in interval of mean(MBS(i))-the evaluated value occurs within minimum interval containing all mean values of MBS(i), i.e. Mean BS values occurring in cdigvtk-related bottom parts of both maps; Max(max), Min(max)-global picture-related maximum or minimum of column-related maxima, respectively; Mean-mean score value in color representation; NRI-mc-multi-connective files (cf. the Section 3.3) representing independently also all cds-files containing the observed kinase domains present in NRI1; Rel-SD-relative standard deviation (Rel-SD = SD/Mean); RoK-rc-cds-files randomly selected from RoK subset of cds-files including the term immunoglobulin but not any of the five files composing NRI-mc subset (number of selected sequences corresponded to the formula n = ceil(N/10), where N = 121 denoted number of cds-files in RoK set); STK/STKc-catalytic domains specifically phosphorylating mainly serines or threonines (cf. Section 4.3); SD-standard deviation of selective mean value; slightly > max(mean(MBS(i))), slightly < min(mean(MBS(i)))-non-significantly different values slightly higher or lower than maximum or minimum mean(MBS(i)) values (see above), respectively; ↓, ↑-bottom and upper number limits following from the adequate two-tailed t-test statistics were used, respectively. For additional abbreviations see WP4 or WP5.
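Two of the quantities defined in these abbreviations, the sample size n = ceil(N/10) with N = 121 and Rel-SD = SD/Mean, can be checked with a short Python sketch. It makes two assumptions: SD is taken as the sample standard deviation, and the random draw is simplified to a plain sample without the additional exclusion of NRI-mc files. The file names and the toy scores are hypothetical.

```python
import math
import random
import statistics


def sample_size(n_files):
    """Number of files to draw, n = ceil(N/10)."""
    return math.ceil(n_files / 10)


def relative_sd(scores):
    """Rel-SD = SD / Mean, with SD taken as the sample standard deviation."""
    return statistics.stdev(scores) / statistics.mean(scores)


# Hypothetical file names standing in for the 121 cds-files of the RoK set.
rok_files = [f"rok_file_{i:03d}" for i in range(1, 122)]
n = sample_size(len(rok_files))      # ceil(121 / 10) = 13
rng = random.Random(0)               # fixed seed only to make the sketch repeatable
rok_rc = rng.sample(rok_files, n)    # simplified random subset without replacement

toy_scores = [90.0, 100.0, 110.0]    # hypothetical bit scores
rel_sd = relative_sd(toy_scores)     # SD = 10, Mean = 100, so Rel-SD = 0.1
```

For N = 121 the formula gives thirteen randomly selected cds-files.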

Monitoring of Conserved Domain Similarities and Terms in NS1F-Related Sets of cds-Files
In accordance with the results in Table 1 [...] Figure 2) represented here a model reference entity due to the lucid and simple definition of its non-tendentiously selected sets.
Abbreviations: all_num-number of cds-files (NCF) in an evaluated set; cdigvtk-tyrosine kinases associated with IgV domains broadly occurring in Metazoa (cf. Section 2.9); cdigvtk(max), cdigvtk(tc)-NCF containing at least one or all cdigvtk-derived cds, respectively (cf. WP2.1); mc-number of most frequently selected subsets (mfs) with the same file content, when using ten different cdigvtk (the expression 2x denotes the existence of two different maximum subsets); mc-, mc+-frequency of subsets which can be exclusively derived when either diminishing or extending the mfs, respectively; mc+/--frequency of subsets which can be derived only when simultaneously diminishing and extending the mfs; bold in the column all_num-sufficient sample sizes. Colors in background: gray-compared F-sets; black-significantly different values. For abbreviations see also [...].
Monitoring of cds- and term-related subsets of cds-files in four selected NS1F-related sets and NRI1. Three lengthwise sections are separated by rows with "**" in each of the five set-related pictures forming this figure. The first sections describe the occurrence (number) of Ig-related terms (terms and cds-associated domain identifiers) in cds-files of the evaluated sets. The second sections concern immunologically important terms, whereas the third sections comprise interesting or somewhere frequent terms. For our solution of problems with dissociation of conserved domain keywords/identifiers from cds records see Section 2.7 and WP2.1. Icons defining the extent of conjugations between subset pairs defined by rows and columns: ■-subset identities; #-at least 80% of cds-files are present in both subsets; Θ-one subset is fully included in the counterpart subset (cf. also all_num values).
Abbreviations: all, all_num-color and numbers related to all items in the corresponding row or column, respectively; antifungal2, antibacterial2, antimicrobial2, antiviral2-all terms ahead of number 2 were scanned two times (with or without dash after the term anti) to get a fusion subset; anti-all4-all four preceding sets were fused to a unique set; bac_ig-bacterial Ig-like domains whose accessions begin with big_ or BID_; cdigvtk(max), cdigvtk(tc)-the presence of any or all cds recording cdigvtk domains was required, respectively (cf. [...]
This subset comprised a unique cds-file independently containing all cdigvtk-related accessions and terms of S2E antiviral2 but not the terms of S2E antimicrobial2. A cds-file of the same content as this file also composed the RoK set. The terms of S2E antimicrobial2 were not observed in RoK, but it still exhibited significantly increased occurrence in the t100dm set (NCF = 18; p < 0.01 in t-test).
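The double scan behind subsets such as antifungal2 (the term with and without a dash after "anti") and their fusion into anti-all4 can be sketched in Python with one regular expression per term. This is a minimal sketch that assumes case-insensitive substring matching over the text of each cds-file. The function names, file names and file texts are hypothetical.

```python
import re


def fusion_subset(cds_files, stem):
    """Names of files containing 'anti<stem>' or 'anti-<stem>' (case-insensitive)."""
    pattern = re.compile(r"anti-?" + re.escape(stem), re.IGNORECASE)
    return {name for name, text in cds_files.items() if pattern.search(text)}


def anti_all4(cds_files):
    """Fusion (union) of the four term-related subsets."""
    fused = set()
    for stem in ("fungal", "bacterial", "microbial", "viral"):
        fused |= fusion_subset(cds_files, stem)
    return fused


# Toy data: file name -> file text.
toy_files = {
    "f1": "putative antifungal protein",
    "f2": "Anti-viral defence domain",
    "f3": "ABC-type antimicrobial peptide transport system",
    "f4": "mannose-specific lectin",
}
antifungal2 = fusion_subset(toy_files, "fungal")
antiviral2 = fusion_subset(toy_files, "viral")
all4 = anti_all4(toy_files)
```

The optional dash in the pattern replaces the two separate scans by a single pass with the same result.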
The corresponding subset of exclusive term occurrence, i.e. t100dm-NP, did not include cds-files with cdigvtk-related accessions and the other compared terms of S2E antiviral2 or antifungal2. However, only six of the selected cds-files realized in fact the aim of this S2E indicating robust cds with ABC-type antimicrobial peptide transport system.
Twenty cds-files with TIR domains (item tir dom in Figure 5) were found in the resis set composed of cds-records generated by sequence IDs which contained the term resistance in the molecular title (cf. Figure 2). Other sets significantly differed from such extreme TIR domain occurrence in the number and density values (at least p < 0.001 in t-test; cf. Figure [...]
Most of the cds-files containing the term "-lectin" comprised the domain with accession pfam01453 (i.e. mannose-specific lectin). The term lrr* (leucine-rich repeats) revealed mostly kinase domains associated with LRR, but much less frequently LRR domains themselves (for details concerning the distribution of LRR domains see WP3.3 and Figure 5).
Seven multi-connective cds-files (five of RoK origin) were independently found in the NRI1 set when: i) individually searching for the terms antiviral, killer, lymphocyte, macrophage, or the term programmed cell death; ii) combining the term resistance and strategy immun3 (cf. I3R mentioned above) or using conjugation strategy death4; and iii) looking for the occurrence of the most frequent cdigvtk cd05034 or co-locating cds with cd14066 (Figure 5). The cds-records with TIR domains and aaa+ (superfamily of ATPases) were found neither in the NRI1 nor in the NRI2 set (Figure 5 and WP2.1). Apparent contradictions between NRI-related data present in Figure 5 and the last table in fact followed only from the difference in sample sizes restricting the content of cds-files composing NRI1 and NRI2 (see [...]
Specifically shortened versions of abbreviations otherwise used in this paper: antifu2-antifungal/anti-fungal; antivir2-antiviral/anti-viral; C(m), C(t)-cdigvtk(max), cdigvtk(tc) described in Table 1, respectively; C*-case, if C(m) = C(t). For additional abbreviations see Figure 2 or Figure 5 and WP4 or WP5.
b Odds ratio values were enumerated conventionally or as zero-associated Bayesian OR* (cf. Section 2.10). Statistical evaluation of the corresponding 2 × 2 tables was performed by Fisher's exact test. The two-tailed Fisher's exact test can always be assumed to be valid, whereas the one-tailed test relies on additional orienting conditions (and sometimes depends on interpretation or context). #for the details see Section 3.6; aE-n = a × 10^−n; One-t, Two-t-values following from the one-tailed and two-tailed Fisher's test, respectively.
c Semi-quantitative odds-ratio-related evaluation represents a lucid insight into the presented data. Different comments on the one- and two-tailed tests are separated by "/", if such a difference exists. Classification of significance levels: poor-insufficient or no validity (p ≥ 0.1); q-quasi-significant (in this case 0.05 ≤ p < 0.1; cf. WP2.6); si-minimally significant (0.005 ≤ p < 0.05); si2-of improved significance (10^−6 ≤ p < 0.005; cf. PSI BLAST limit); ssl-of superior significance (p < 10^−6). Classification of linkages according to OR values: C-causal linkage (OR ≥ 10); inv-inverse linkage (OR < 1/1.
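The classification of significance levels listed in this footnote can be written as a small Python function. This is only an illustrative sketch with hypothetical function names. The OR-based linkage classes are limited here to the causal class (OR ≥ 10), because the remaining thresholds are not fully legible in the text above.

```python
def significance_class(p):
    """Map a p-value to the semi-quantitative classes of the footnote."""
    if p >= 0.1:
        return "poor"   # insufficient or no validity
    if p >= 0.05:
        return "q"      # quasi-significant
    if p >= 0.005:
        return "si"     # minimally significant
    if p >= 1e-6:
        return "si2"    # improved significance
    return "ssl"        # superior significance


def is_causal_linkage(odds_ratio):
    """Class C of the footnote: causal linkage, OR >= 10."""
    return odds_ratio >= 10


# Toy p-values covering every class.
classes = [significance_class(p) for p in (0.2, 0.07, 0.01, 1e-4, 1e-7)]
```

The boundaries are handled as half-open intervals, so a value such as p = 0.05 falls into class q and p = 0.005 into class si.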

Statistical Analysis of Linkages between Sets and Subsets of NS1F Based on Evaluation of 2 × 2 Tables
Three types of strong significant statistical associations following from robust cds can be seen in the different segments of Table 2 [...] co-occurrences in individual cds-files of the resis set (segment v) and Figure 5).
Though the linkages between cdigvtk and ALPS-related segments were mostly strong, they achieved only borderline significance (segment ii).

Statistics of ALPS Occurrence
The occurrence of ALPS-selected cds-files (ASC; ASC represents molecules or sequence IDs selected with the assistance of sequences of ALPS forming two sequence queries MQP; cf. Section 2.2) was evaluated with respect to the number of selected cds-files (four sets with oP in the name and six subsets with -P in the name) and their relative occurrence (six pairs of subsets). Two values were significantly higher in the t-test than the values in the compared collection of the residual sets (cf. WP2.2), i.e. i) the maximum ratio (13) between ALPS-positive and ALPS-negative samples in the CUFC set (p < 0.01) and ii) the number (104) of ASC in the RoK set (p < 0.05). The significantly deviated maximum odds ratio (OR) and the best significance level of OR evaluation identified the CUFC and RoK sets, respectively, as significantly ALPS-associated sets. For the data and methodology concerning model odds ratios see the sixth section of Table 2 and WP2.4, respectively. Figure 6 represented not only a map of similarities in a broad range of Expects but also a starting point for screening of the corresponding Ig-cds (see also WP3.5).

Conserved Ig Domain Similarities in the Maximum Expect Extension of NS2F and NRI2 Sets
Among others the domain cd04982 (IgV_TCR_gamma) achieved significantly increased NCF (p < 0.05) in the t100dm-NP subset of NS2F (NCF = 18 holds also for the set t100dm). Twelve of the corresponding eighteen cd04982-related sequence IDs restricted cds-files specifically including cds with cd03053 (glutathione S-transferase GST_N_Phi frequent in Arabidopsis and Oryza genomes) in NS1F-related t100dm-NP set. This indicated a significant molecular association of short TCR-gamma-related peptides and cds with cd03053 (Table 2). These immunologically interesting peptides were found among others in certain species of vegetable and tobacco (for further comments see WP3.6).

More Detailed Description of the Selected "Non-Refutable" Ig-cds Forming NRI2 Set
Only seven sequences included "non-refutable" Ig-cds called here dominant Ig-cds, i.e. Ig-cds that achieved the highest score and minimum Expect among cds co-locating with the same segments of individually evaluated sequences (cf. Table 3). Three and four of the selected sequences contained bacterial and metazoan dominant Ig-cds, respectively. Two significant and one quasi-significant dominant Ig-cds including bacterial Ig domains occurred in different sequence regions of the nuclear pore complex protein GP210 (sequence ID XP_010248630.1; Table 3). All three segments restricted by Ig-cds were approved by our Ig-fold evaluation. Another protein, LOC105056499 (XP_010937019.2), was the only one selected by three different procedures. Its Ig-cds was composed of the typical metazoan and dominant Ig domain Ig-cd Ig1_IL1R_like. The segment restricted by Ig-cds achieved the maximum fold score 6.81 in the case of Ig light chain, but not the prevailing number of required Ig folds (rule Q1 in Table 3). Nevertheless, the Ig-cds-related specificity of 96.3% was enumerated based on Expect values of all cds records overlapping the sequence segment in the cds-file of NRI2 origin (limited by E ≤ 100; for details see WP2.5). The other three dominant typical metazoan Ig-cds with sequences XP_021903093.1, XP_017226294.1, XP_022544643.1 achieved higher Expects than XP_010937019.2, but were fully approved by our folding analysis with FFAS03 (Table 3). A lower fraction (23.8%) of FFAS03-approved sequences was selected among the forty-two "non-refutable" Ig-cds called here recessive Ig-cds, representing Ig-cds co-locating with more valid non-Ig cds. Three types of interesting phenomena were observed in the case of recessive "non-refutable" Ig-cds: i) FFAS- and CDD-confirmed chimera of filamin domains and Ig-cds (see Table 3 and Section 4.1), ii) the existence of a plant self-incompatibility protein looking like a functional and structural analogue of a plant T-cell receptor (E = 5.0 for co-locating weak cds with IgV-TCR_gamma) and iii) extremely frequent co-locations of weak Ig-cds (4.605 ≤ E ≤ 100) with "non-refutable" Ig-cds. The last phenomenon concerned mainly thirteen and fourteen weak Ig-cds co-locating with five and two also co-locating "non-refutable" Ig-cds found by CDD searches in cds-records of sequences XP_021662681.1 (ALE2 STRK) and XP_009381707.1 (potassium transporter), respectively (cf. Table 3). In addition to FFAS and CDD studies, the searches for maximum sequence similarities of non-plant sequences to the Ig-cds-derived plant segments dealt with here were performed using HMMER. Six non-plant sequences, including two of fungal origin, were classified here as hot candidates for recent horizontal transfer exchanges (Table 3).
Figure 6. Occurrence of Ig-cds, Ig-cds-overlapping cds and Ig-domain-related terms in cds-files forming "expanded" NS2F and NRI2 sets. The cds limited by Expect values E ≤ 100, i.e. cds at least comparable with dense similarities between segments achieving lengths of secondary structures (cf. WP2.6), are described here. For distribution of re-selected "non-refutable" Ig-cds see Table 3. Abbreviations: big_/BID_-terms denoting the major part of bacterial domains; cl26464-atrophin-1 family; igv_-cluster of IgV-related terms; NS2F-Ig, NRI2-Ig-content of Ig-cds and related terms or term clusters within the non-redundant set NS2F-Ig (comprising 223 cds-files extracted from all sets of NS2F as the subset of cds-files including Ig-cds records) and NRI2 described in Section 3.1, respectively; P1453-pfam01453 (domains of mannose-specific lectins); P5938-pfam05938, i.e. domain of plant self-incompatibility protein S1; sma406-smart00406, i.e. IgV domain frequent in non-chordate Metazoa. For additional information and abbreviations see Figure 5 or Sections 2.11, WP1.5, and WP4.

Selected Ig-Domain-Related Sequence Segments
In fact, at least three main groups of Ig-cds can be considered with respect to our data, as ensues from the three following paragraphs.
The presence of two dominant cds recording significant bacterial Ig domains was indicated in Results and Table 3 in case of protein GP210 (XP_010248630.1; [38] [39]). Bacterial Ig domains perhaps evolved from eukaryotic Ig domains [40].
In plants, they even contribute to evolution of cell surface proteins in lineages from a common ancestor to glaucophytes, rhodophytes and viridiplants [50].
Since mannose-specific lectin domains prevail among the selected lectin domains, the possible structural relationship between the Ig-like beta-sandwich fold and the beta-Prism I fold found in plant mannose- and galactose-specific lectins [51] could explain the occurrence of Ig-like structures within lectin domains. The consistent presence of Ig domains in channels (cf. Table 3) has as yet been described only in mammals [52] [53]. The common molecular origin of the Ig superfamily and the self-incompatibility protein (SIP) recorded in Table 3 represents an important and interesting clue for authors working with these proteins because structural models were developed [54] [55]. Similarly to the compared TCR, SIP is involved in cell death events [56]. The inclusions of Ig-cds within several catalytic or transporter domains described also in Table 3 [...] and Ig-cds located in its different subsegments (Table 3). For comments on important significant Ig-cds of ALE2 see the next section.
In contrast to the high occurrence of cds-files recording cds with cdigvtk and STRK-related catalytic domains in the RoK set, only a low file fraction of such cds forms the NRI sets including top Ig-cds (Table 3). Nevertheless, a relatively increased density but low number of cds with cdigvtk was observed in cds-files of the ALPS-related fraction of NRI (Table 1; cf. also Sections 4.2 and 5). This indicates considerable independence between occurrences of top Ig-cds and cds of the discussed kinase domains in sequences of higher plant protein origin. For comments on incomplete Ig-like domain folds composing also the "non-refutable" Ig-cds-derived segments described here see the paper of Berisio [58] (cf. also WP3.7).

Sequence Similarities of Antibody-Like Phosphorylation Sites
As is well known, the major part of phosphorylation sites with specificity cor-  [23].
Similarly, members of the Atrophin-1 superfamily (domain co-located here) participate in a progressive neurodegenerative disorder in vertebrates [67]. This possible agreement raises the question whether the described coincidence of the three types of sequence similarities mentioned here may be of interest for the phylogeny of ageing.

RoK Set as a Set of Phylogenic Interest
The ALPS-associated RoK set is composed of molecules selected by combined procedures and reselected according to the presence of the terms receptor and kinase in the molecular title (cf. Figure 2). In accordance with this selection strategy and Results, this set contained mainly serine/threonine receptor/receptor-like kinases (STRK). Many plant STRK exhibit additional tyrosine specificity, forming a group of STY (i.e. Ser/Thr/Tyr) kinases, and also the corresponding sequence chimerism [68]. We can even consider a deep evolution of STY kinases leading back to the last universal common ancestor (LUCA) of Archaea, Bacteria and Eukarya [69] [70]. In accordance with this opinion, STY kinases show a structural relationship to the animal non-receptor tyrosine kinases Src, Abl, Lyn, Fes, Sek, Kin and Ras as well as to the receptor-like kinase Lyk3, thus suggesting a common superfamily origin of the compared kinases [71] [72]. In comparison with these data, our subsets of cds-files specifically compared in Figure 4 did not contain the catalytic domains of Ras and Lyk3 kinases.
The occurrence of global score maxima of PLN00113 (Figure 4), the superiority of its subset in the RoK set (Formula (4)), its lower cds densities and lower frequencies of individual superior cds than in the case of cd14066 (Figure 4) appear to be in agreement with a broader and more distant, i.e. ancient ancestor-like structural con- achieves the densest cds and most frequently superior kinase-related scores in individual cds-files. These scores sometimes even represent score maxima among all cds in the evaluated cds-file and determine extremely low relative standard deviations (Figure 4). These extreme data may indicate a closer but later structural relationship of cd14066 to the ancestor structure than in the case of domain PLN00113. The described cds-derived domain chimerism appears to be interesting in two aspects, i.e. i) possible structural divergence between protozoan or early metazoan RTK (including also IgV-associated cdigvtk evaluated here) and plant STY kinases

Structures in the Light of Statistics
Removal of sequence redundancy reduced the number of sequence IDs from 2217 to 1323. In fact, this processing reflected the presence of sequences encoding duplicated isoforms and alternative splicing products. In agreement with the monitored associations between variously selected sets and subsets, many significant domain linkages and disjunctions also exist (cf. Table 2 and the corresponding section of Results) as possible consequences of diversified selective processes of domain shuffling (cf. [15]). In accordance with the statistics in Table 2 and the relationships described in Results, we can, among others, pose the question whether the third superfamily of LRR domains represented by cl27891 is specifically involved in recognition by ETI-related NLR composed also of TIR domains (cf. [12]).
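The redundancy removal mentioned above can be illustrated by a minimal Python sketch. It is not the procedure used in this work (which relied on JavaScript, PHP and Windows programs); it only assumes, for illustration, that two records are redundant when their sequences are identical after removing whitespace and ignoring case. The function name `remove_redundancy` and the sequence IDs and sequences below are hypothetical.

```python
def remove_redundancy(records):
    """Return one representative ID per distinct sequence.

    `records` is an iterable of (sequence_id, sequence) pairs.
    Assumption for this sketch: redundancy means exact sequence identity
    after stripping whitespace and ignoring letter case.
    The first ID seen for each sequence is kept.
    """
    representatives = {}  # normalized sequence -> first sequence ID seen
    for seq_id, sequence in records:
        key = "".join(sequence.split()).upper()
        representatives.setdefault(key, seq_id)
    # dicts preserve insertion order, so the input order of IDs is kept
    return list(representatives.values())


# Hypothetical toy input: three records share one sequence.
records = [
    ("ID_A", "MKTLLVAG"),
    ("ID_B", "mktllvag"),
    ("ID_C", "MKTLL VAG"),
    ("ID_D", "MSSPQRTA"),
]
unique_ids = remove_redundancy(records)
print(len(records), "->", len(unique_ids))
```

A similarity threshold instead of exact identity would require a sequence clustering tool and is outside the scope of this sketch.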
Several significantly increased occurrences of terms in the sets were partially explained here. These comprised the occurrences of: i) the term immunoglobulin, ii) terms linked in strategy I3R (related to plant immunity) in the RoK set and iii) the term antifungal2 in the set F2oP1 (see Section 3.2). The statistically predicted hot candidate partners for recent horizontal transfer among genes of fungal origin displayed in Table 3 remain, for the time being, a question for further analysis and critical comment exceeding the extent of this paper.

Conclusions
The described procedures selected many non-Ig molecules important for plant immunity. To explain this relationship for cases of Ig-similar proteins and proteins with weak Ig-cds, we considered mainly similarities following from motif-motif, motif-secondary structure or ligand-secondary structure interactions often generated via convergent mutation changes, repeat effects and recombination changes on the DNA level [ [79]. Due to the phylogenic distance of higher plants from animals, which include most of the Ig domains, we enlarged the limiting Expect values to those approximating i) significant rejection of domain similarity (Section 2.10) and, among others, also ii) dense similarities of short fold lengths (cf. WP2.6). Since the uncertainty of such weakened restriction increased, complementary evaluation of cds specificity (uniquely proved here in the case of the quasi-significant Ig-cds of XP_010937019.2; cf. Sections 3.7 and WP2.5) or the use of alternative methods accessible online (cf. FFASfold-derived analysis in Table 3) became important. The function of the selected Ig-cds-related segments displayed in Table 3 remains unclear. We can only recall the discussed possible relationships to interleukin-1-receptor-related structures and to structures involved in ageing or self-recognition, which are of interest for further structural investigation. Several statistically extremely strong and significant domain linkages or disjunctions in multi-domain molecules follow from our data. Nevertheless, it remains to be determined whether all these linkages are independent of our Ig-related selection machinery. This could be decided in a future revision of the described associations on larger sets keeping only domain relationships. Similarly, a more detailed phylogenic evaluation of relationships between cds including cd14066, PLN00113 and cdigvtk segments could be important for studies concerning the deep evolution of STRK, RTK and perhaps also Ig domains (see Figure 4 and below).
In accordance with the data presented in Table 2 and Table 3, an alternative role of ALPS-related segments in recognition was considered in Section 4.2. This raises the question whether phylogenic (intergenic or intragenic) interactions between at least certain genes encoding i) Ig domain ancestors or Ig domains and ii) STRK could include the transfer of ALPS-related structures (or the corresponding repeats; cf. Section 4.2) from STRK, sometimes improving recognition mediated by pre-Ig or Ig domains.