A Personalized Digital Code from Unique Genome Fingerprinting Pattern for Use in Identification and Application on Blockchain
1. Introduction
There are approximately 8 billion people in the world and a large majority own computers, smartphones, and other internet-connected devices [1] [2]. According to a US survey in 2020, 87 percent of individuals have access to a computer in their households, and there are about 5 billion active internet users in the world today [3]. This number is expected to grow exponentially with the recent revolution of IoT, Web3.0 and Metaverse applications [4]. Traditional forms of identity in use today have limited security and are often fragmented and inconvenient. A traditional username and password are commonly used to identify individuals, but password theft and phishing attacks raise serious concerns for many, and forgetting one’s login information can be a significant inconvenience [5] [6]. Popular biometric identification methods such as facial recognition are increasingly common and at times can be more convenient than a traditional username, but they function properly only under ideal conditions, and using facial data for security purposes may raise concerns about government or corporate overreach [7]. Blockchain technology-enabled identification methods could potentially allow for more secure management and storage of digital identities by providing a unified, interoperable, and tamper-proof infrastructure with key benefits to enterprises, users, and IoT management systems [8] [9]. A unique, personalized identification code allows individuals to safely identify themselves without having to reveal other personal information to other parties, with the added benefit that the code is truly unique to that individual.
Individual identity and personalization are integral to a functioning society and economy [10] . Having a safe, secure, and robust way to identify and personalize ourselves and our possessions is becoming increasingly crucial in today’s digital society and global markets [11] [12] . At its most basic level, identity is a collection of claims about a person, place, or thing. For people, this usually consists of first and last name, date of birth, nationality, and some form of national identifier such as one’s passport number, social security number (SSN), driving license, etc. [13] [14] . These data points are issued by centralized entities (governments) and are stored in centralized databases (central government servers) [15] . A digital identity arises organically from the use of personal information on the web and from shadow data created by the individual’s actions online [16] . More robust identity and personalization management systems could be used to eradicate current identity issues such as inaccessibility, data insecurity, and fraudulent identities [8] [17] [18] . Security and identity are complex and ever-evolving issues for enterprise and government systems alike. Blockchain-based solutions could provide exceptional utility in solving issues common in current identity and digital systems [19] [20] . Blockchain technology allows users to create and manage digital identities through the combination of decentralized identifiers, identity management and embedded encryption. However, current blockchain-based identification methods rely on randomly generated digital code that is not unique to the individual [21] [22] . The genome fingerprint-based personalized digital code is unique to each individual and could potentially be used to verify ownership of identity even in the case when a private key or passcode is not available for the associated data.
The human genome holds roughly 3 billion base pairs [23] [24]. Each person’s genome sequence holds a unique combination produced by random recombination of the parental genomes [25] [26]. Half of the individual genome is inherited from the father and the other half from the mother. The genetic makeup is composed of four nucleotides (ACGT) and is more than 99% identical across all people [27]. Most genetic variations are in the form of single nucleotide polymorphisms (SNPs), which have been extensively studied since the completion of the Human Genome Project (HGP) in 2003 [28] [29]. SNPs are commonly obtained through several methods such as genome sequencing [30], genotyping arrays [31], or polymerase chain reaction (PCR) methods [2]. When using such methods, it is crucial to verify and validate SNP sites because validation success can vary with the SNPs chosen and across different ethnic populations [32]. Most of these specific variations are rather rare and show an individual and population-specific occurrence, but some fraction of these variations arise very frequently in every population studied [33] [34]. These common, pan-ethnic SNPs with high allele frequencies (close to 50%) are considered natural results of evolutionary history and are not associated with any known phenotype or disease [35]. Most of these variations are bi-allelic, which means only two choices of nucleotides are observed at a given specific site [36] [37]. The patterns of these high allele frequency bi-allelic SNP variations can be used to generate a personalized genetic code that can potentially identify and track individuals for use in forensics, similar to the CODES method, which uses fragment length polymorphisms [38]. SNPs have also been used to track animals such as U.S. beef cattle when suppliers require cattle identification for paternity analysis and breeding information [39].
We employ these bi-allelic, high frequency, pan-ethnic common SNP variations in the genome to prepare and create a unique genomic pattern akin to fingerprint authentication (Genome Fingerprinting). This information can provide accurate and powerful means to identify and distinguish individual people in a digital format by simply decoding the digital code without revealing other identifiable personal information. We employ a panel of selective, universal, high allele frequency bi-allele human SNP markers to efficiently generate a unique combination code that can be used in digital identification and personalization.
2. Methods
The proposed digital identification method comprises the following steps: selecting genetic markers for identification, assembling information for the unique genetic identification, assigning digital codes to each of the features, creating the digital identification codes, decoding the digital identification code, registering the codes on a blockchain, and implementing the codes in various applications.
2.1. Digital Identification Format
In order to formulate a digital identification which holds all the necessary information from a wide selection of species of animals and plants with the possibility of a rare event (identical twins or cloning), we propose the following digital identification format:
Species Code (2 - 4 digits), Gender Code (1 digit, with Natural or Virtual birth separation), Identification Codes (24 digits for numerical, or 12 digits for alphabetical), Auxiliary Code (1 or 2 digits).
2.2. Species Codes
Every animal, plant, bacterium, and virus on earth has a genetic makeup (DNA and RNA) vital for growth, reproduction, and multiplication. Although the current study is primarily focused on using genetic code to identify human individuals, the same concept of genetic-based blockchain identification can be used in other species for identification, parental linkage, tracking, or management analysis in agricultural businesses. Therefore, a species identification code should be included in the digital format. Below is an example of the proposed codes for some of the well-known species of interest (Table 1). The two-digit combination of the alphabetical code can cover up to 576 (24 × 24) different species. Three-digit (13,824 species) or four-digit (331,776 species) species identification codes may be implemented in a future expansion.
2.3. Gender Codes
With respect to biological sex, humans are born as either male or female. However, gender classifications are not simply binary, so other types (gender neutral, transgender, genetic abnormalities) should be considered in the gender code assignment to cover a wide spectrum of people. Furthermore, we are also exploring the possibility of non-natural birth (Artificial or Virtual) individuals, created for example through cloning or digital creation in some applications, who could be differentiated from real (Natural) individuals (Table 2).
Table 2. Gender code with possible outcomes for both natural and virtual birth.
2.4. 24-Digit Numerical Codes
Because SNPs are mostly bi-allelic, each marker position has three possible genotype combinations, as summarized in the table below. We assign the code “1” to the known reference allele, “2” to the alternative allele, and “3” to the mixture of the two alleles, called a heterozygote. However, there is a possibility of an incomplete or failed sequencing result at a particular site, stemming from either a low-confidence call or a missing call for a certain SNP [40] [41]. A missing call makes it difficult to assign correct allele information to the given site. Therefore, missing genotypes are labeled “NA” in the sequencing output file and assigned a “0” in the code output. Typically, standard whole genome sequencing covers roughly 95% of the entire genome when sequenced at 30X read depth (the standard Whole Genome Sequencing QC criterion), meaning that there is always a chance that one or more SNPs may not produce a satisfactory result for assigning an appropriate code [42] [43]. To establish a cutoff criterion and a measure of quality control, the minimum number of data points needed to assign a reliable identity code is set to 21. Because 24 SNPs are used for human identification code generation, the encoding system can accommodate and produce a valid result even when up to three SNP data points are missing (Table 3).
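The digit assignment and the 21-call quality cutoff described above can be illustrated with a minimal Python sketch. The input layout (one allele pair per SNP, `None` for an "NA" call) and the function names `encode_site` and `encode_panel` are assumptions made for illustration and are not part of the published method.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[Tuple[str, str]]  # None stands for an "NA" (missing) call


def encode_site(genotype: Genotype, ref: str, alt: str) -> str:
    """Return 1 (ref/ref), 2 (alt/alt), 3 (heterozygote) or 0 (no call)."""
    if genotype is None:
        return "0"
    alleles = set(genotype)
    if alleles == {ref}:
        return "1"
    if alleles == {alt}:
        return "2"
    if alleles == {ref, alt}:
        return "3"
    # Assumption: an unexpected allele is treated like a failed call.
    return "0"


def encode_panel(
    genotypes: Sequence[Genotype],
    markers: Sequence[Tuple[str, str]],  # (REF, ALT) for each panel SNP
    min_valid_calls: int = 21,           # cutoff stated in the text
) -> str:
    """Encode a whole panel and enforce the minimum number of valid calls."""
    if len(genotypes) != len(markers):
        raise ValueError("one genotype is required per marker")
    code = "".join(
        encode_site(g, ref, alt) for g, (ref, alt) in zip(genotypes, markers)
    )
    valid = len(code) - code.count("0")
    if valid < min_valid_calls:
        raise ValueError(f"only {valid} valid calls, need {min_valid_calls}")
    return code
```

With a 24-SNP panel, this sketch accepts up to three missing calls and rejects a sample with four or more, which matches the cutoff given in the text.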
2.5. 12-Digit Alphabetical Codes
To accommodate the limitations of the numerical code while also shortening it, we converted the 24-digit numerical code into a 12-digit alphabetical code by combining two numerical digits into one alphabetical character (Table 4).
The numerical code has only four options (0 - 3) at each digit site, whereas the alphabetical code has 16 options at each digit site, which gives more options for downstream applications.
The 12-digit alphabetical code can easily be converted back to the 24-digit numerical code and decoded to recover the original SNP sequence of each individual.
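The pairwise conversion can be sketched in Python as follows. The full 16-entry conversion table (Table 4) is not reproduced in this text, so the mapping below contains only the nine pairs that can be read off the two worked examples in Section 3. It is a partial table used for illustration, and the function names are hypothetical.

```python
# Pairs recoverable from the two worked examples in Section 3.
# Assumption: the remaining seven entries of Table 4 are unknown here,
# so any other pair raises an error instead of being guessed.
PAIR_TO_LETTER = {
    "33": "A", "32": "B", "31": "C", "23": "D", "13": "E",
    "21": "G", "12": "H", "11": "I", "02": "N",
}
LETTER_TO_PAIR = {letter: pair for pair, letter in PAIR_TO_LETTER.items()}


def to_alphabetical(numeric: str) -> str:
    """Collapse each pair of digits into one letter (24 digits -> 12 letters)."""
    if len(numeric) % 2 != 0:
        raise ValueError("numerical code must have an even number of digits")
    pairs = [numeric[i:i + 2] for i in range(0, len(numeric), 2)]
    missing = [p for p in pairs if p not in PAIR_TO_LETTER]
    if missing:
        raise KeyError(f"pairs not in the partial table: {missing}")
    return "".join(PAIR_TO_LETTER[p] for p in pairs)


def to_numerical(alphabetical: str) -> str:
    """Expand each letter back into its two digits (12 letters -> 24 digits)."""
    return "".join(LETTER_TO_PAIR[c] for c in alphabetical)
```

Because the mapping is one-to-one, converting a code to letters and back returns the original digits, which is the reversibility property the text relies on.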
2.6. Auxiliary Codes
One of the drawbacks of using genetic-based identification codes is their inability to separate and distinguish individuals with identical genetic makeup. Such cases occur with identical twins, triplets, quadruplets, and human clones [44]. Therefore, an auxiliary identification code is needed when two individuals with the same genetic makeup are to be identified. The auxiliary code can also be used to separate real (natural birth) individuals from artificial (virtual birth) individuals that are created in virtual environments through games or other computer algorithms. The auxiliary code can be numeric or alphabetic, and either single-digit or double-digit, depending on the potential number of genetically identical individuals created in a natural (identical twins) or artificial (cloning) environment. For example, we assign the number “1” to single-born real individuals and “2” to twins with identical genetic makeup (Table 5).
Table 3. Possible outcomes of each genetic marker’s digital code, including three possible combinations plus no-result calls.
Table 4. Conversion table for two numeric codes to one alphabetical code. The alphabet order is based on the occurrence rate of each combination, with A being the most common and P being the least common in the general population.
Table 5. Auxiliary code for genetically identical individuals such as twins and clones.
Note that non-identical twins have different genetic makeup and therefore will produce different codes. The auxiliary code is also designed to take into account the very rare event (one out of 280 billion cases when 24 SNPs are being used in the panel) when two unrelated individuals have the same sequencing code by accident.
2.7. SNP Identity Marker Selection
It is well established that about 0.1% of the genome accounts for all the genetic variations that are found from person to person [45] [46] . This means that there are potentially around three million base pair differences in a particular genome between two non-related individuals [47] . All of these variations in the genome can be used as a fingerprint for patterns when differentiating from other individuals. Therefore, each individual, except identical twins, has a unique genomic fingerprint that solely belongs to that person [48] . Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation, and they are easy to detect with most genotyping and sequencing technology platforms [49] [50] . Extensive research in the human genome project has led to more than 20 million variations being studied in different ethnicity and nationality backgrounds [51] [52] . Most of these variations are quite rare and show individual and population-specific distribution patterns, but some of these fractions arise very frequently in every population studied [53] [54] . The majority of these variations are bi-allelic, which means only two choices of nucleotides are observed within a given specific SNP site. The patterns and fingerprints of these high-frequency bi-allelic SNPs have been used in other applications, most notably to identify individuals for forensic purposes [55] [56] .
To select the most informative markers for identification and personalization purposes, SNPs that show a high minor allele frequency (>0.45) in all population studies were selected from the HapMap and 1000 Genomes sequencing databases [57] [58]. For example, over 4 million SNPs have been genotyped in the CEU cohort within HapMap, and 218,000 of them showed minor allele frequencies greater than 45% [59]. The higher the minor allele frequencies are, the more informative the result will be, based on the marker combinations present in the genomes [60]. Therefore, fewer markers are required when higher-frequency SNP markers are used for testing, analysis, and interpretation for identification and personalization purposes [61].
2.8. Marker Selection Workflow
Markers are selected from the published HapMap and 1000 Genomes sequencing databases using the criteria below:
- Identify marker sets with high minor allele frequency (e.g., >45%) within the study population
- Select only bi-allelic SNP markers with good Hardy-Weinberg distribution
- Avoid markers with adjacent known polymorphisms to minimize potential allele dropout in sequencing
- Find markers that are well separated from each other
- Avoid markers in a region containing duplicated sequence motifs
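Part of the selection workflow above can be expressed as a simple filter. The following Python sketch covers only the criteria that can be checked from genotype counts and coordinates: bi-allelic sites, minor allele frequency, Hardy-Weinberg fit, and spacing. The `Marker` fields, the 1 Mb minimum spacing, and the use of a one-degree-of-freedom chi-square test at the 0.05 level (critical value 3.841) are assumptions chosen for illustration. The criteria on adjacent polymorphisms and duplicated sequence motifs need annotation data that is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Marker:
    rsid: str
    chrom: str
    position: int
    n_alleles: int   # number of distinct alleles observed at the site
    n_ref_hom: int   # observed genotype counts in the study population
    n_het: int
    n_alt_hom: int


def minor_allele_frequency(m: Marker) -> float:
    total_alleles = 2 * (m.n_ref_hom + m.n_het + m.n_alt_hom)
    alt_freq = (2 * m.n_alt_hom + m.n_het) / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


def hwe_chi_square(m: Marker) -> float:
    """Chi-square statistic of observed vs. Hardy-Weinberg expected counts."""
    n = m.n_ref_hom + m.n_het + m.n_alt_hom
    q = (2 * m.n_alt_hom + m.n_het) / (2 * n)
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (m.n_ref_hom, m.n_het, m.n_alt_hom)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def select_markers(
    candidates: List[Marker],
    min_maf: float = 0.45,       # threshold given in the text
    max_chi2: float = 3.841,     # assumed cutoff: df = 1, alpha = 0.05
    min_gap: int = 1_000_000,    # assumed minimum spacing in base pairs
) -> List[Marker]:
    kept: List[Marker] = []
    for m in sorted(candidates, key=lambda c: (c.chrom, c.position)):
        if m.n_alleles != 2:
            continue  # keep bi-allelic SNPs only
        if minor_allele_frequency(m) <= min_maf:
            continue
        if hwe_chi_square(m) > max_chi2:
            continue
        too_close = (
            kept
            and kept[-1].chrom == m.chrom
            and m.position - kept[-1].position < min_gap
        )
        if too_close:
            continue
        kept.append(m)
    return kept
```

The spacing rule here keeps the first qualifying marker on each chromosome and skips later ones that fall within the assumed gap. A production pipeline would more likely rank candidates by population-wide frequency before thinning them.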
Number of SNP markers for digital identification code generation
Because each bi-allelic SNP has three different allele combinations (Reference, Alternative, Heterozygote), each SNP analysis could generate three distinct identification codes [62]. The combination of two unrelated SNP markers could generate 9 (3 × 3) different codes, and three distinct SNP markers would result in 27 (3 × 3 × 3) different combination codes. Given a world population of 8.1 billion people in a recent world population survey, a 21-SNP composition (over 10 billion possible combination codes) could potentially differentiate all the people in the world (Table 6). Therefore, we propose to use a 24-SNP combination for human identification to create a sufficient buffer and accommodate future population growth, as well as the virtual creation of artificial individuals.
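The counting argument above is a power of three, and it can be checked with a few lines of Python. The function names are hypothetical, and the population figure is the 8.1 billion quoted in the text.

```python
def code_space(n_snps: int) -> int:
    """Number of distinct codes from n bi-allelic SNPs (three genotypes each)."""
    return 3 ** n_snps


def min_snps_for(population: int) -> int:
    """Smallest panel size whose code space exceeds the given population."""
    n = 0
    while code_space(n) <= population:
        n += 1
    return n


WORLD_POPULATION = 8_100_000_000  # figure quoted in the text

print(min_snps_for(WORLD_POPULATION))  # 21
print(code_space(21))                  # 10460353203
print(code_space(24))                  # 282429536481
```

This count treats all three genotypes at a site as equally likely. For a SNP near 50% allele frequency in Hardy-Weinberg proportions, heterozygotes are about twice as frequent as either homozygote, so the effective discriminating power of a panel is somewhat lower than the raw count suggests.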
2.9. Candidate SNP Marker Selection for Digital Identification
The table below lists the 100 proposed markers, all of which have an average minor allele frequency of about 50%, ranging from 0.498 to 0.508, across all populations tested by the 1000 Genomes sequencing project (Table 7). Therefore, each of these SNPs is expected to be common in all populations. The results from a sufficiently large subset of these markers are expected to provide enough statistical power to reliably identify and distinguish individuals in most of the genome data available.
2.10. Digital Identification Panel Design
We further selected 24 SNPs from the 100 candidate SNPs for our digital identity panel that showed the least ethnicity differences in all populations tested (Table 8). The chosen SNPs show a minor allele frequency between 0.43 and 0.58 in every population tested, allowing for a wide variety of combinations for every individual across all populations. The fixed universal marker set can be used in all individuals of the same species for the digital identification panel.
Table 6. Theoretical combination codes from the given SNP marker analysis based on three different allele combinations (bi-allelic SNP analysis).
Table 7. An example of a 100-candidate SNP marker selection from the public HapMap and 1000 Genomes sequencing databases for digital identification and personalization purposes. RSID: Reference SNP Cluster ID, #CHR: Chromosome Number, position: SNP position in the chromosome.
2.11. Personal Digital Identity Code Assignment from 24 SNPs Panel Collection
Table 9 below shows the panel data used to generate a personalized digital code for two real individuals (Person 1, male, and Person 2, female) using the 24 selected SNP markers in the panel.
Table 8. 24-SNP marker selection for the development of a digital identification panel. RSID: Reference SNP Cluster ID, #CHR: Chromosome Number, position: SNP position in the chromosome, REF: Reference Allele, ALT: Alternative Allele, EAS: East Asian, SAS: South Asian.
3. Results
- Individual 1:
Numerical Codes: 233331123133313113232333
Alphabetical Codes: DACHCACCEDDA
- Individual 2:
Numerical Codes: 133223333321112302211231
Alphabetical Codes: EBDAAGIDNGHC
Table 9. Digital identification code assignment from the SNP panel data.
Generation of the Personalized Digital ID Code
Format: Species Code (2), Gender Code (1), Identification Codes (24), Auxiliary Code (1). Below are the two people used to generate the digital identification code using the format and composition of this paper. Individual 1 is a male human and single born. Individual 2 is a female human and single born. A 2D barcode can be created from the following code and can be linked to an HTML file or web page to link with additional information (Figure 1 and Figure 2).
- Individual 1: Human, Male, Single born
HS,1,233331123133313113232333,1
HS,1,DACHCACCEDDA,1
- Individual 2: Human, Female, Single born
HS,2,133223333321112302211231,1
HS,2,EBDAAGIDNGHC,1
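The comma-separated layout used in these examples (species, gender, identification code, auxiliary code) can be assembled and parsed with a small helper. The Python sketch below is illustrative only. The `DigitalID` type and function names are hypothetical, and the length check assumes the 24-digit numerical or 12-digit alphabetical form described in Section 2.1.

```python
from typing import NamedTuple


class DigitalID(NamedTuple):
    species: str         # e.g. "HS" for human, as in the examples
    gender: str          # "1" male, "2" female in the examples
    identification: str  # 24-digit numerical or 12-digit alphabetical code
    auxiliary: str       # "1" for a single-born individual


def format_id(digital_id: DigitalID) -> str:
    """Join the four fields with commas, as in the worked examples."""
    return ",".join(digital_id)


def parse_id(text: str) -> DigitalID:
    """Split a formatted code back into its four fields."""
    parts = text.split(",")
    if len(parts) != 4:
        raise ValueError("expected species, gender, identification, auxiliary")
    digital_id = DigitalID(*parts)
    if len(digital_id.identification) not in (12, 24):
        raise ValueError("identification code must have 12 or 24 characters")
    return digital_id


person_1 = DigitalID("HS", "1", "233331123133313113232333", "1")
print(format_id(person_1))
```

The resulting string is plain text, so it can be passed unchanged to any QR code generator to produce barcodes like those in Figure 1 and Figure 2.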
Figure 1. Generated 2D QR Code from the numerical genetic identity code from each individual using the UTF-8 encoding.
Figure 2. Generated 2D QR Code from the alphabetical genetic identity code from each individual using the UTF-8 encoding.
4. Discussion
The era of the Internet and the continual rise of social media platforms has shown the true value of secure identification and personalization. A large majority of consumers now expect companies to deliver personalized products and services, while at the same time, ensuring maximum privacy and control over their data [63] .
In this paper, a detailed process of generating a personalized digital code from an individual’s genome data was proposed. Through a process called genome fingerprinting, this code can be used as part of a decentralized identification (DID) system [64] . The genetic information obtained from genome fingerprinting can then be securely stored on blockchain, creating a genome based decentralized identifier (gDID) [65] [66] . This allows for secure authentication and verification of identity based on the individual's genomic information.
A genome-based code can be useful in a variety of ways in the current Internet infrastructure as well as in a decentralized one [67]. For example, genome data can be used to personalize virtual experiences through custom avatars and characters. A real-world example that captures a similar concept is CryptoKitties, released in 2017 [68] [69]. CryptoKitties is a decentralized platform and digital collectible game built on the Ethereum blockchain. The game has a community of players that breed and raise virtual cats. Although the platform did not use actual genome data, each CryptoKitty, to a certain extent, had its own “virtual” genome, making each virtual cat unique, with its own set of traits and attributes that determine its appearance and abilities. CryptoKitties retained its popularity through 2019, by which time the platform had accumulated over 1.5 million users and over $40 million worth of transactions. The game’s popularity has shown that personalized virtual experiences using blockchain technology have immense potential, and using real genome data to augment such platforms could bring new levels of identity and personalization [69]. But perhaps a more universal use case for a genome-based code is enhanced authentication and identity on Blockchain [70]. Genetic data is unique to an individual and cannot be changed or altered, inherently making it a highly secure form of identification [71]. Genome data can be used for biometric authentication in addition to finger or iris scanning [72]. A study has shown that a large majority of internet consumers frequently report password problems and frustrations, supporting the idea that novel forms of biometric authentication could provide a more secure alternative to traditional passwords and usernames [73] [74].
Many users are also hesitant to put their genome data on the Internet, so a decentralized storage solution like Blockchain can help prevent the data breaches that can compromise traditional centralized databases [75].
However, it is important to note that while genome fingerprinting provides a unique identifier, there are privacy and ethical concerns associated with the storage and use of genetic information, particularly in the context of decentralized identification [65] [76] . Careful consideration must be given to the security, privacy, and ethical implications of using genomic information in this manner, and appropriate safeguards must be put in place to ensure that the information is protected and used responsibly. In addition, the lack of clear regulations of DIDs at the moment can create uncertainty for businesses and individuals who want to use the technology [77] . This can cause further implications with scalability, as the average user might struggle with widespread adoption especially when the technology cannot keep up with the demand.
But as with almost every mass technological adoption, society tends to appreciate technological revolutions only in hindsight. DIDs carry with them numerous advantages in comparison to existing forms of identification. First, DIDs enable individuals to have a greater degree of control over their personal data due to their reduced reliance on centralized intermediaries [78] [79]. Second, DIDs are based on open standards, allowing them to be used on a wide range of platforms and streamlining password authentication processes [80]. Such benefits are paramount in modern society because they improve the overall security and privacy of digital interactions, a key concern for many new technology innovations.
5. Conclusion
With the internet and social networking becoming an integral part of people’s daily lives, secure identification and personalization are of utmost importance to many consumers and businesses. In this paper, we presented a novel way of securely identifying an individual online by using universal, high allele frequency bi-allele human SNP markers to efficiently generate a unique combination code. We also presented real-world use cases for applying a genome-based DID in the industry as well as implications and technical challenges. When generating the code, it was important to select genetic markers that were common in all populations, so that any individual, no matter their background, is able to generate an identifier unique only to that person. Novel forms of identification and personalization based on genome data can open up new opportunities for personalization and offer better security and a better user experience for Internet users in the present as well as those in the decentralized future.