Blockchain based general data protection regulation compliant data breach detection system

PeerJ Computer Science

Introduction

Data breaches are a serious concern for every organization. A data breach is defined as any event in which data are viewed, erased, modified, or transferred by an unauthorized person, or by an authorized person mistakenly or maliciously (Information Commissioners Office, 2022). In the context of the General Data Protection Regulation (GDPR), a security incident that compromises the integrity or confidentiality of any covered personal data that an organization is in charge of protecting constitutes a GDPR data breach (Mehta, 2022). Many factors can lead to a data breach, such as hardware issues, software crashes, phishing, malware, ransomware, distributed denial-of-service attacks, human error, misplaced or lost data storage devices (such as universal serial bus (USB) drives, laptops, and portable drives), malicious insiders, and external problems such as power outages (Mehta, 2022). However, our focus in this research work is on malicious insider threats. Insider data breaches are becoming more frequent and have a greater financial impact on organizations (Chavali, 2022). According to a recent report, 60% of data breaches are reportedly caused primarily by insider threats (Storchak, 2022). Insider security events have increased by 47% since 2018, and insider threat costs have increased by 31% during the same period. Currently, insider threats cost $11.5 million on average per year (Ponemon Institute, 2022).

An insider is typically someone who either intentionally or unintentionally damages the organization while having access to its resources. Current or former workers, contractors, partners, or employees who have access to an organization’s systems or data may cause damage (Ponemon Institute, 2022). Because insiders need a higher level of access and trust to complete their tasks, insider threats are particularly challenging to defend against (Smith, 2022). System administrators and other information technology (IT) experts, for instance, may legitimately need access to sensitive systems and data. They are intimately familiar with an organization’s infrastructure and cybersecurity technologies. Thus, an attacker with administrator privileges can also alter logs and login records to erase the proof of the attack. Because of this, insider attacks are difficult to detect, and we see a large number of malicious and unintentional insider attacks each month that cause data breaches. Such attacks frequently lead to losses in money and reputation and may even cause organizations to collapse.

These days, it has become essential for hospitals to gather and process patient data. Almost all hospital departments deal with patients’ personally identifiable information and electronic health records. When a patient’s private information is compromised due to an insider attack, it is impossible to recover privacy or undo psychosocial harm. In addition, these attacks not only put the patient’s identity at risk, but tampered data can also slow down hospital operations and be harmful to the patient’s health and well-being. Due to these operational delays, if immediate medical attention is not provided, this may cause death or permanent disability.

A paradigm shift in data privacy has occurred since the GDPR came into effect in May 2018 (European Parliament, 2022). To address the challenges of insider threats and data protection, the GDPR specifies processes and rules that organizations must follow. In particular, the data controller is required to notify the data protection authorities and the affected data owners when a data breach occurs. A controller that fails to report a breach within the prescribed time faces heavy fines. According to the GDPR, organizations that suffer a data breach might have to pay up to four percent of annual turnover or €20 million, whichever is greater (PrivazyPlan, 2022). To avoid heavy penalties, a system that can effectively detect data breaches is needed.
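The "whichever is greater" rule quoted above is simple arithmetic, and a short worked example makes the threshold visible. The sketch below is illustrative only: the function name and the sample turnover figures are made up, and amounts are assumed to be whole euros.

```python
FLAT_CAP_EUR = 20_000_000    # fixed upper bound quoted in the text (assumed to be euros)
TURNOVER_SHARE_PERCENT = 4   # four percent of turnover


def max_fine_eur(annual_turnover_eur: int) -> int:
    """Return the upper bound of the fine: the greater of the two caps."""
    turnover_cap = annual_turnover_eur * TURNOVER_SHARE_PERCENT // 100
    return max(FLAT_CAP_EUR, turnover_cap)


# Hypothetical turnovers: 4% of 100 million is 4 million, so the flat cap applies;
# 4% of 1 billion is 40 million, which exceeds the flat cap.
print(max_fine_eur(100_000_000))
print(max_fine_eur(1_000_000_000))
```

For any turnover below 500 million, four percent is less than 20 million, so the flat amount is the binding limit.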

Blockchain, on the other hand, is a cutting-edge technology that delivers a “network of value”, in contrast to traditional internet technology, which only provides a “network of information” (Farhan, 2020). The Ethereum blockchain is fully programmable through dedicated languages such as Solidity (Chris, 2019), which enables the development of modern decentralized applications. These decentralized applications use smart contracts. Smart contracts are scripts that allow users to conduct transactions without the risk of fraud or third-party interference (Jeza, 2021). Finck (2018) explored how the European Union’s General Data Protection Regulation can be applied to blockchain technologies. The same work also stated that blockchains can provide benefits in terms of data security. It is important to note that this is not always the case; rather, the blockchain system must be purposely designed for this to occur. To overcome the data breach issue, this article proposes a GDPR-compliant detection system that leverages the benefits of smart contracts and blockchain technologies. The steps of the proposed system methodology and the functionality of the smart contracts are elaborated on in the subsequent sections.

Contributions

The major contributions of our work are as follows:

  1. A blockchain based data breach detection system is developed. Every row is turned into an evidence hash and stored with the unique id of the off-chain data inside the smart storage contract. This allows the application to detect any alteration in the database record and verify the authenticity of any record 24/7 without depending on the internal database in a centralized organization.

  2. An access control mechanism is developed, where only authorized entities can invoke sensitive functions such as adding proofs. These functions are highly restricted and can only be called by the authorized person (the Data Controller in our case) using the Data Controller’s wallet key pair (public and private key cryptography).

  3. To verify the authenticity of the data owner, a digital consent model is developed, where the Data Owner digitally signs their consent with a private key using public key infrastructure (PKI) cryptography.

  4. Since deployed smart contracts are immutable, proxy patterns with separate dynamic data storage are developed to support contract versioning and future feature updates.

  5. Before the Data Controller stores the Data Owner’s data off-chain, the system generates irreversible evidence with a timestamp and stores it on the blockchain network. The system keeps every record’s proof on the network, which is used to verify the authenticity of an individual row of the data. It also helps the Data Controller determine the necessary mitigation measures for data breach events.

  6. The system presented in this work also complies with the GDPR articles 6, 7, 8, 12, 14, 15, 16, 17, 18, 20 and 33.
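The access control idea in contribution 2 can be illustrated with a minimal Python sketch. It is an in-memory stand-in, not the system's smart contract: the class `ProofRegistry`, its method names, and the example addresses are hypothetical. The caller address is passed in explicitly here, whereas on Ethereum it would be established by the transaction signature.

```python
from typing import Dict, Optional


class Unauthorized(Exception):
    """Raised when a caller other than the Data Controller tries to add a proof."""


class ProofRegistry:
    """Hypothetical in-memory stand-in for a contract with one restricted function."""

    def __init__(self, controller_address: str) -> None:
        # Assumption: a single controller address is allowed to write.
        self._controller = controller_address.lower()
        self._proofs: Dict[str, str] = {}

    def add_record_proof(self, sender: str, record_id: str, proof: str) -> None:
        # Restricted write: only the Data Controller's address may store a proof.
        if sender.lower() != self._controller:
            raise Unauthorized(f"{sender} may not add proofs")
        self._proofs[record_id] = proof

    def get_record_proof(self, record_id: str) -> Optional[str]:
        # Unrestricted read: anyone may fetch a stored proof.
        return self._proofs.get(record_id)


registry = ProofRegistry("0xC0FFEE")  # placeholder address, not a real wallet
registry.add_record_proof("0xC0FFEE", "1", "abc123")
try:
    registry.add_record_proof("0xBAD", "1", "forged")
except Unauthorized:
    print("rejected")
print(registry.get_record_proof("1"))
```

The forged write is rejected and the original proof stays in place, which is the property the restricted function is meant to provide.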

Theoretical and practical significance

Our proposed system provides data controllers with an automated tool to detect data breaches instantly. That will help them identify malicious activities and mitigate them before escalating further. Before the Data Controller stores Data Owner’s data off-chain, it generates irreversible evidence with a timestamp and stores it on the blockchain network. This system keeps every record’s proof on the network, which is used to verify the authenticity of an individual row of data. It also helps the Data Controller in the process of determining the necessary mitigation measures for data breach events.

The rest of the article is organized as follows: ‘Literature Review’ presents a detailed review of the closely related research works. In ‘Case Study’, a case study is presented to illustrate the approach and methodology. A comprehensive discussion of the proposed system is provided in ‘System Design and Proposed Framework’. ‘Experimental Results and Discussion’ presents detailed information about the simulation setup and simulation results. In ‘Conclusions and Future Work’, the conclusion and future work are discussed.

Literature Review

Malicious insider threats have recently become one of the most damaging types of breach attacks on organizations. A data breach is a security incident in which an attacker infiltrates an organization’s network, application, or database and performs malicious activities. While organizations that experience a data breach like to believe they are secure, the truth is that insider threats are already happening, and organizations may not learn about a breach for months or even years. The increasing number of breach events has drawn more attention to data breach detection and prevention, and numerous research works have addressed this issue. In this article, we categorize related research works into blockchain-based and non-blockchain-based techniques. Table 1 summarizes the non-blockchain-based techniques, while Table 2 summarizes the blockchain-based techniques.

Table 1: Summary of non-blockchain-based techniques.

| Ref | Technique | Evaluation measures | Dataset used | Language/tool used | Achievements | Limitations |
|---|---|---|---|---|---|---|
| Hanan, Traore & Woungang (2021) | Document semantic signature | Detection rate, false positive | Enron email dataset | Protege | Very strong in detecting modified and rephrased data. | Encrypted data cannot be detected. |
| Daren, Bertino & Sallam (2020) | Hidden Markov model | Accuracy, precision, recall | Client applications dataset | Java, Dyninst and Jahmm libraries | Very low false positive rates. | It might produce false positive alerts if the training dataset is not sufficient. |
| Cesar, Santos & Lopez (2017) | Task sequences and probabilities algorithm | False positive, false negative, true positive, true negative | Real data provided by a governmental institution of Ecuador | Not reported | Based on the identification of users’ unusual behavior, the article provides an algorithm for data leakage detection. | Because it uses historical data and a supervised algorithm, we need information from the consumers to create their behavioral pattern. |
| Desai & Gaikwad (2016) | Signature matching | Detection rate | Not reported | Not reported | Implemented two algorithms for attack detection. | Modified data cannot be detected. |
| Lin, Yang & Lin (2020) | Machine learning | Accuracy | Randomly generated | MATLAB | Machine learning based approach for data breach protection. | System was tested on randomly generated data. |
| Moghaddam & Zincir-Heywood (2020) | Artificial neural network | Accuracy, precision, recall, F-score | Digital image documents provided by insurance company | R, Python, R-Studio, Spyder IDE | Artificial neural network-based technique is proposed to protect the customers’ sensitive information. | System relies on training dataset. |
| Krishnaveni et al. (2020) | Support vector machine | Accuracy, response time | NSL-KDD dataset | Python | Anomaly detection system is presented for cloud computing. | Training can be more time consuming for SVM. |
| Le & Zincir-Heywood (2021a) | Machine learning | FP, FN, TP, TN | CERT dataset | Python 3.7 | Machine learning based threat detection model. | Lacking semantic understanding, system relies on training dataset. |
| Squicciarini, Sundareswaran & Lin (2020) | Portable data binding technique | Total time taken to encrypt and decrypt files | None | Not reported | A three-layer data protection framework. | Requires the categorization of the data beforehand. Data that has been incorrectly classified can leak. |
| Gomez-Hidalgo et al. (2010) | Named entity recognition | False positive, false negative, accuracy | Twitter comments dataset | Freeling | Named entity recognition-based data leak prevention scheme. | Did not use semantic technologies to give meaning to entities. It might be impacted by misspelled words. |

DOI: 10.7717/peerjcs.1882/table-1

Table 2: Summary of blockchain-based techniques.

| Ref | Technique | Evaluation measures | Dataset used | Language/tool used | Achievements | Limitations |
|---|---|---|---|---|---|---|
| Hu et al. (2020) | Consensus algorithm | Response time | Synthetic data | Not mentioned | Traceability system for insider threat detection. | No practical solution is identified, lack of proper framework. |
| Srivastava et al. (2021) | RSA algorithm | Transaction added to database | Not mentioned | HTML, CSS | Event-driven data alteration detection system. | Partially implemented. |
| Srivastava et al. (2019) | Fingerprint | Time taken to run the queries | Synthetic data was generated. | REST API, Python version 3.5.2 | Verity framework to detect data tampering in database. | Performance optimization aspects for increasing the system throughput were neglected. |

DOI: 10.7717/peerjcs.1882/table-2

Non-blockchain-based techniques

Numerous research works have addressed the data breach detection challenges using non-blockchain-based solutions. Hanan, Traore & Woungang (2021) proposed a data leakage prevention system that uses semantic signatures of documents to detect leakages. The system detects a data leak when an outgoing document’s semantic signature matches the original document’s semantic signature. However, a sensitive file can circumvent the detection system if an adversary encrypts it and sends it via email, because the detection system cannot recognize the encrypted data as a sensitive file. Hence, sensitive data can be exposed. Additionally, this system is domain-specific: it is designed for the financial industry business domain and cannot be deployed in any other organization or domain (e.g., hospital, bank, etc.). Our system does not have this constraint because it depends on smart contracts, which use a public blockchain to store the proof, which cannot be tampered with or controlled by any single entity.

For the protection of the database, an anomaly detection model is presented in Daren, Bertino & Sallam (2020). The authors used the hidden Markov model (HMM) for prediction and achieved low false-positive rates. However, the HMM-based system relies on the training dataset. The system might produce false-positive alerts if the training dataset is insufficient. In contrast to this limitation, our system does not have this constraint because it depends on smart contracts, regardless of the data and future changes.

In Cesar, Santos & Lopez (2017), the authors proposed a data leakage detection algorithm to detect anomalous user behavior in computer sessions. The proposed algorithm works based on the probability of sequences of tasks. Through simulations, they show that the proposed algorithm outperforms other techniques in terms of accuracy with low false-positive rates. However, the limitation of this work is that it relies on historical data and requires user data to develop behavioral patterns. If a user completes too few tasks in a session, the system will not be able to tell whether that session is normal or not.

The authors presented the data breach challenges in Barona & Anita (2017). Data breaches or leaks can hamper organizations’ reputation and credibility if the breach compromises sensitive information. A semantic rule-based fraud detection system is proposed in Ahmed et al. (2021). A method for sensitive data breach detection and prevention is presented in Gaikwad, Chougule & Charhate (2016). The authors used Shingling and Rabin filters for data leakage detection tasks. The approaches based on Rabin algorithms show some benefits over traditional approaches. However, they have some constraints, such as coverage and unavoidable false-positive rates.

A signature- and pattern-based data leakage detection system is proposed in Desai & Gaikwad (2016). The system detects data breaches when the outgoing data’s pattern matches the original data’s signatures, and it raises an alarm when a high similarity is found. However, an adversary can modify sensitive files and data by substituting, adding, and removing words before sending the data. Furthermore, the content of sensitive files can be rewritten as a summary. These modifications can change the identity of the original sensitive data. Hence, sensitive data can be exposed.

In recent years, several insider threat mitigation techniques and data breach detection schemes have been proposed that are based on machine learning techniques (Lin, Yang & Lin, 2020; Moghaddam & Zincir-Heywood, 2020; Ferreira, Le & Zincir-Heywood, 2019; Meng et al., 2018; Le, Zincir-Heywood & Heywood, 2020; Sun et al., 2021; Al-Shehari & Alsowail, 2021; Le & Zincir-Heywood, 2021a; Le & Zincir-Heywood, 2019; Hu et al., 2019). Yan et al. (2022) proposed a field theory-based risk calculation approach for further development of three-dimensional risk assessment. In Krishnaveni et al. (2020), the authors proposed an effective anomaly detection system with the help of support vector machine algorithms. However, the main limitation of this work is that the classifier may need to be retrained each time there is a sizable change in the monitored data for an anomaly detection system based on machine learning, and there must be sufficiently representative samples of the monitored data for training.

In Le & Zincir-Heywood (2021b), the authors proposed a machine learning-based user-centered system for insider threat detection, but there is no learning-based approach to achieve early detection. The proposed system was tested on the CERT (Software Engineering Institute, 2021) dataset. However, the limitation of this work is that once the system is trained, it is unable to detect other new types of attacks. A hybrid intelligent system for insider threat detection is proposed in Ren & Wang (2020). The hybrid system comprises pre-processing, rule matching, entity portrait, and iterative attention unit. The proposed system was evaluated on the CERT dataset.

Considering the information leakage problem caused by indexing in the cloud, the authors presented a three-layer data protection framework in Squicciarini, Sundareswaran & Lin (2020). However, the methodology requires a pre-defined data classification, so misclassified data can be leaked. In Gomez-Hidalgo et al. (2010), the authors proposed a named entity recognition (NER) based data leak prevention approach. However, the presented approach did not use semantic technologies to give meaning to entities. Consequently, NER could be affected by spelling mistakes and connected words.

Blockchain-based techniques

Blockchain technology has been applied in various fields (Ali et al., 2020; Jamil et al., 2021; Iqbal et al., 2021; Jamil et al., 2020), and numerous research works have used it to address data breach detection. However, some issues are not addressed by existing schemes. For example, a blockchain-based traceability system for insider threat detection is designed in Hu et al. (2020). However, the article does not explain how the framework works technically, and it identifies no practical application example or solution. Storing any data in a structured form on a blockchain requires a smart contract, yet the model does not describe any smart contract at all. Long-lasting storage likewise requires a smart contract on the blockchain network, which the authors do not mention in the study. Furthermore, the article does not specify how data evidence or fingerprints are stored on the blockchain network. The lack of a proper framework makes this solution vague and impractical.

A blockchain-based event-driven data alteration detection system is proposed in Srivastava et al. (2021). However, the solution does not use smart contracts on any public blockchain network. Additionally, it does not comprehensively comply with GDPR requirements such as “Forgetting Data”. Wallet keys are generated by the centralized application, which exposes them to server-side cyber-attacks. Furthermore, the article does not explain how the entire solution is technically built or which network and technologies are used to develop the system. Finally, the storage of data in blocks is inaccurately explained.

The authors proposed a blockchain-based framework in Srivastava et al. (2019) to detect insider attacks in relational database systems. However, the solution addresses only private data and a centralized control system, where a private blockchain network is created in a privately controlled environment without democratic participants. Additionally, attackers can tamper with any data or network within an organization regardless of whether the network is built on blockchain technology. Storing all proofs within the same centralized control system creates an insider risk. The authors used a fully private blockchain network, which means that, technically and logically, anyone inside the organization who has access can also manipulate the private blockchain network. In that case, the whole organization becomes compromised. This is why we consider a public blockchain network to be the safer and more secure option.

Furthermore, the existing research lacks real GDPR-compliant solutions for data breach detection that Data Controllers and data protection authorities could use to determine whether there is a need to notify the affected Data Owners.

Summarizing the above discussion, we conclude that existing blockchain-based data breach detection systems suffer from several issues, such as non-democratic control in private blockchain networks, questionable data trust under private organizations, lack of smart contracts, lack of owner authenticity, and security issues in private blockchain networks. Existing non-blockchain-based systems suffer from problems such as non-transparency, non-immutability, training overhead, inability to detect encrypted data, domain dependence, dataset dependence, impractical usage, vague frameworks, and GDPR non-compliance. Hence, designing a system capable of handling the abovementioned issues is challenging. Considering the limitations of the research works mentioned above, we propose a novel data breach detection system in this work. In our proposed model, we have comprehensively developed a GDPR-compliant system with properly identified technologies using the Ethereum blockchain network.

Case Study

In this section, a case study is presented to illustrate how the proposed system works.

General overview

To show how the system works, we discuss a use case scenario from the health sector. In hospitals, the collection and processing of patients’ personal data have become a necessity. Almost every department in a hospital deals with patients’ personally identifiable information and protected health information. When a patient’s sensitive data are breached as a result of an insider attack, it is not possible to restore privacy or reverse psychosocial harm. Moreover, these types of attacks not only pose a risk to the patient’s identity; tampered data can also delay hospital operations and harm the patient’s health and well-being. If emergency treatment is not received because of these operating delays, the result might be death or lifelong disability.

Use case scenario

The objective of this section is to discuss the use case scenario of our proposed system. We assume that the data owner, Alice, is the patient in this scenario, and the data processor, Bob, is a surgeon who often requests patients’ medical records for operation purposes. Figure 1 provides a high-level overview of the hospital use case, where Alice’s medical data are stored on the blockchain after receiving consent from Alice. Bob can obtain the patient data he needs from the patient database by sending a request to the data controller, Mike. Mike uses our proposed system to perform necessary tasks such as consent validation and data verification. Our proposed system allows Mike to detect any alteration in the database record and verify the authenticity of any record 24/7 before sharing data with Bob. The step-by-step method of the proposed system is discussed in the subsequent section.

Figure 1: Overview of the hospital use case scenario.

System Design and Proposed Framework

A comprehensive discussion of the proposed system is presented in this section, along with important assumptions. Figure 2 illustrates the proposed data breach detection system architecture and its components.

Figure 2: Proposed system architecture.

Assumptions

The following assumptions are considered in this article while designing the proposed system.

  1. The Data Controller and data protection authority are fully honest entities and do not have malicious intentions.

  2. The database has one or more Data Controllers, and they have full access to the database. Thus, the Data Controllers can also modify the records of database tables and logs.

  3. The attacker phishes the Data Controller and uses stolen credentials to access the Data Controller’s portal.

  4. The database is free of tampering and existing attacks at the very beginning.

Workflow of proposed system

In the context of the GDPR, there are four main entities, namely the Data Controller, Data Owner, Data Protection Authority, and Data Processor. The Data Controller defines the purpose of collecting and processing the data, that is, why and how the data will be processed. Generally, the entity that collects the data is known as the Data Controller. A Data Owner is a person whose data are processed. The Data Protection Authority is an entity that guides the Data Controller in their duties. An entity that processes data on behalf of the Data Controller is known as a Data Processor or data consumer. The primary objective of our work is to provide Data Controllers with an automated tool to detect data breaches instantly. This will help them identify malicious activities and mitigate them before they escalate further. It will also support the Data Controller in performing detailed analyses and preparing breach reports concerning the reported personal data breaches. The proposed system’s step-by-step execution is described below:

  1. Data protection authorities assign a Data Controller to do data entries inside their user data repository. The Data Controller accepts the Data Owner’s personally identifiable information (PII) such as first name, last name, email, phone number, age, address, and a dedicated Ethereum Request for Comments 20 (ERC-20) compliant wallet address; all are mandatory. The submitter, Data Owner, gives this information verbally by visiting the Data Controller’s office to be registered.

  2. Once the Data Controller receives the PII data, they enter these details into a dedicated application called “App1 Data Entry”, which is built with Node.js/MySQL and centrally controlled by the Authority/Data Controller.

  3. This submission creates a “consent dynamic link” and sends it to the Data Owner’s email address or mobile phone. This link opens a consent panel called “Dapp2 Data Consent”.

  4. The Data Owner receives this link, which allows them to see their details on the page. To verify these details, they are asked to use the Ethereum account that they submitted at registration time. The Data Owner uses their MetaMask Ethereum-compliant wallet to sign these details. The text is signed using the private key of the Data Owner, and they are only allowed to sign if the address selected in MetaMask is the same address that was given to the Data Controller.

  5. The “Dapp2 Data Consent” application sends this data signature hash and all relevant PII data to be stored inside the MySQL database (off-chain). The Authority’s Data Controller verifies all data that have been given consent for the blockchain network.

  6. The system assigns the new record a unique auto-incremented primary key and a timestamp.

  7. The application creates a new transaction packet to send to the Ethereum blockchain network. This transaction packet contains a Secure Hash Algorithm 256-bit (SHA256) hash (proof of record) created from a string embedding the entire new record’s details, such as first name, last name, email, phone number, age, address, and a dedicated ERC-20 compliant wallet address. The salt for this hash is the user’s public wallet address.

  8. To submit this new transaction on the blockchain network, the application “App1 Data Entry” retrieves the address of the latest version of the smart contract. Technically, to obtain the latest smart contract address, the “Main Data Relay Contract” must be called, which acts as a registry of all old or recent versions of the “Main Data Contract”. Since smart contracts are immutable, we cannot change any existing smart contract, but to add a new feature, we can create a new one. That is why relay or registry patterns can help organize new contracts without worrying about their address. Algorithm 3 presents how the main data relay activity takes place.

  9. Once the application receives the latest correct “Main Data Contract” address, it sends the transaction packet containing the user identification number (UID) (the record’s primary key), the timestamp, and the SHA256 hash (proof of record) to the smart contract function called “AddRecordProof”. This function is highly restricted and can only be called by the authorized person (the Data Controller in our case) using the Data Controller’s wallet key pair (public and private key cryptography). Functional access controls are defined in the “Security Layer Contract”, which has different security functions such as ownable and breaker. The concrete steps of the security layer activity are shown in Algorithm 4.

  10. The “Main Data Contract” uses its storage library contract, called “Data Storage Layer Contract”, to store these data inside the smart contract. The “Data Storage Layer Contract” is independent of its caller. Separating the data storage layer makes it possible to add or change variables or structures in later contract versions. Because no variable of a deployed contract can be changed, creating a storage library using a dynamic array helps control the situation. On a successful transaction, Ethereum generates a new hash, called the “Transaction hash”. The Authority must pay the gas fees in order to record these data. Algorithm 1 presents how the main data storage activity takes place, while the main data contract activity is presented in Algorithm 2.

  11. The third app, called “App3 Data Breach Viewer”, is developed solely to view the authenticity of records. It shows all the user records in a grid view, each with a “Verify” button. Clicking this button sends a request to the “Main Data Contract” function called “getRecordProof”, using a newly generated SHA256 hash as the off-chain record identifier. This function returns to the application the “Proof of Record” SHA256 hash that was previously registered by the Data Controller.

  12. The application backend retrieves the off-chain data record from the MySQL database and generates a new proof of record using SHA256. The proof of record from the smart contract and the one computed from the off-chain record must be equal for the record to be considered valid. Otherwise, the record is considered invalid or tampered with.

  13. The compromised records are sent as a notification to the Data Controller. The compromised database can be restored to its original state using backup databases in the data center.
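Steps 7, 11, 12 and 13 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the field order, the `|` separator, the way the wallet address is prepended as a salt, and keying proofs directly by UID are all choices made for this example, and the sample records are made up. A plain dictionary stands in for the on-chain storage.

```python
import hashlib

# Assumed field order for the string built from the record's details.
FIELDS = ("first_name", "last_name", "email", "phone", "age", "address", "wallet")


def proof_of_record(record: dict) -> str:
    """SHA-256 proof over the record's fields, salted with the wallet address."""
    payload = "|".join(str(record[name]) for name in FIELDS)
    salted = record["wallet"].lower() + "|" + payload
    return hashlib.sha256(salted.encode("utf-8")).hexdigest()


def find_tampered(off_chain_rows: dict, on_chain_proofs: dict) -> list:
    """Return the UIDs whose recomputed proof differs from the stored proof."""
    return [uid for uid, row in off_chain_rows.items()
            if on_chain_proofs.get(uid) != proof_of_record(row)]


# Made-up off-chain rows keyed by UID (the record's primary key).
rows = {
    1: {"first_name": "Alice", "last_name": "Doe", "email": "alice@example.com",
        "phone": "555-0100", "age": 34, "address": "1 Main St", "wallet": "0xA11CE"},
    2: {"first_name": "Carol", "last_name": "Roe", "email": "carol@example.com",
        "phone": "555-0101", "age": 51, "address": "2 Side St", "wallet": "0xCA401"},
}

# Registration: proofs are computed once and stored (here, in a dict).
chain = {uid: proof_of_record(row) for uid, row in rows.items()}

# An insider later edits one off-chain row without touching the stored proof.
rows[2]["phone"] = "555-9999"

print(find_tampered(rows, chain))  # [2]
```

Because the stored proof cannot be rewritten by the insider, any change to an off-chain field produces a mismatch for that row only, which is what allows compromised records to be listed individually.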

Experimental Results and Discussion

The evaluation of our proposed system is presented in this section. We measured the execution time of the system’s key functionalities (data submission, consent acquisition, blockchain data migration, and data verification). To verify the functionality of the proposed system, we created seven sets of tests, as discussed in the subsequent section. Additionally, we evaluated our system against the GDPR data protection principles. The simulation setup and simulation results of the testing system are described in ‘Simulation Setup’ and ‘Simulation Results’, respectively, while the GDPR-based evaluation is presented in ‘GDPR Compliance’.

    
 
 
Algorithm 2 Main data contract
    Input: Data storage layer contract C_dsl
    procedure MAIN_DATA_CONTRACT(C_md.c)
        Data_Storage_LIB_address ← address.Data_Storage_contract()

        constructor(address storage_Data)
            Data_Storage_contract ← storage_Data()

        function add_Record_Proof(proof: memory, proof_Id_Hash: memory)
            return Data_Storage_contract.setProof(proof, proof_Id_Hash)
        end function

        function get_Record_Proof(proof_Id_Hash: memory)
            return Data_Storage_contract.getProof(proof_Id_Hash)
        end function

        function get_Storage_Address()
            return address(Data_Storage_contract)
        end function
    end procedure
 
Algorithm 3 Main data relay contract
    Input: Security layer contract C_sl
    procedure MAIN_DATA_RELAY_CONTRACT(C_mdr.c)
        uint latest_version ← 0
        mapping(uint => address) version_history

        function Update_Version(Contract_Address)
            version_history[latest_version] ← Contract_Address
            latest_version ← latest_version + 1
        end function

        function get_Contract_By_Version_Number(uint version)
            if version >= latest_version then
                return version_history[latest_version - 1]
            else
                return version_history[version]
            end if
        end function

        function get_Version_Number()
            return latest_version
        end function
    end procedure
 
Algorithm 4 Security layer contract
    Input: Data owner wallet address DO_do.address, Data controller wallet address DC_dc.address
    procedure OWNER_CONTRACT(C_o.c)
        event OwnerSet(oldOwner: indexed, newOwner: indexed)

        modifier isOwner()
            require(msg.sender == owner, "CallerIsNotOwner")

        constructor()
            owner ← msg.sender
            emit OwnerSet(address(0), owner)

        function change_Owner(new_Owner)
            emit OwnerSet(owner, new_Owner)
            owner ← new_Owner
        end function

        function get_Owner()
            return owner
        end function
    end procedure

Simulation setup

The experiments were conducted on a Hewlett-Packard laptop with a 1.50 GHz Intel Core i7 processor and 8 GB of RAM running Windows 11. Table 3 presents the tools and technologies that were used to develop this system.

Table 3:
Tools and technologies utilized.

| Tools | Description |
| --- | --- |
| Solidity language | A programming language for writing smart contracts, used here to store the proof and evidence of each record inside the blockchain network. |
| NodeJS and Truffle | NodeJS supports the Truffle scripts and framework for building Ethereum-based smart contracts. |
| NodeJS Express, MySQL | MySQL is used for database storage, and Express is used as the web application server framework. |
| MetaMask wallet and HDWallet | Two different browser-based wallets are used for interacting with the smart contract. |
| SHA256 libraries | Hashing libraries are used to generate the SHA256 (Handschuh, 2005) hashes inside the NodeJS server-side backend. |
| Ganache | The Ganache simulator is used for local testing on the computer before moving to a testnet. |
| Ethereum testnet | The Ropsten/Goerli test network is used to deploy and simulate the smart contract. |
| Mocha and Chai | The JavaScript unit testing frameworks Mocha and Chai are used to test the smart contract’s input and output. |
| Truffle Console | This console is used for immediate testing and deployment in the testing and non-testing environments. |
| SMTP Mailing | Mail notifications are used for sending alerts in the system. |
| HTML5 and JavaScript Apps (UI/UX) | Three applications are developed using HTML5 and JavaScript: a) data breach viewer, b) data entry panel, and c) data consent panel. |

DOI: 10.7717/peerj-cs.1882/table-3

Simulation results

We built a blockchain network to set up the environment using the Truffle console simulator. Ten accounts are created by the Truffle blockchain console node and then assigned to the Data Controller and Data Owners. The accounts created on the Ethereum network are displayed in Fig. 3. The network requires that each participant create an account with a private key and an account ID. The smart contract is deployed once the network is operational, as seen in Fig. 4. Once a single instance of a smart contract has been deployed on the network and added to the ledger, it is accessible to all network members, so every participant sees the same contract and its current state. Because the deployed contract code cannot be changed, this supports the security of the network. The transaction hash, block number, and contract address are the key elements to note in the deployment output.

Figure 3: Accounts created by truffle console simulator.

Figure 4: The result of deploying smart contracts.

After deployment, we can test the contract’s methods using the Truffle test functionality. Truffle is an Ethereum framework for developing, compiling, and deploying Solidity contracts, and its automated tests use the Mocha framework. Fig. 5 depicts the execution of the Solidity contract as it is tested on our network after deployment. The test output shows how long (in milliseconds) each contract method takes to execute.

Figure 5: Network and contract testing.

To verify the functionality of the proposed system, we created the following six sets of tests. System performance in terms of execution time for each test (gas consumption, breach detection/data integrity, consistency, access control/consent authenticity, immutability and time consumption) is shown in Figs. 6 to 11.

Figure 6: Execution time for test 1.

Figure 7: Execution time for test 2.

Figure 8: Execution time for test 3.

Figure 9: Execution time for test 4.

Figure 10: Execution time for test 5.

Figure 11: Execution time for test 6.

Gas consumption (test 1): To test Ethereum gas consumption, we have used a unit test framework written in JavaScript on each contract call. Using specific scenarios and parameters, we can see each contract’s time consumption, gas consumption, and speed.

Breach detection/data integrity (test 2): To verify data integrity, i.e., that a database record has not been tampered with, we retrieved the off-chain user record, computed R1 (the SHA256 hash of the first name, last name, phone, etc.), and compared it with the evidence hash H1 from the smart contract. If R1 and H1 are equal, the record is considered authentic.

Consistency (test 3): To examine the consistency of the system, we have checked whether or not the registry contract can retrieve the latest “Main Data Contract” address.

The Truffle Automated JavaScript testing framework is used in the scenario where one contract is deployed and inserted inside the registry contract. After calling the “GetLatestAddress” of the registry contract, it must return the newly deployed address of the “Main Data Contract”.
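
The consistency check can be illustrated with a small Python model of the registry logic in Algorithm 3. This is a sketch rather than the deployed Solidity contract: the method names follow Algorithm 3, the addresses are hypothetical placeholders, and the explicit error for an empty registry is an assumption that Algorithm 3 does not spell out.

```python
class MainDataRelay:
    """Python model of the version registry described in Algorithm 3."""

    def __init__(self) -> None:
        self.latest_version = 0
        self.version_history: dict[int, str] = {}

    def update_version(self, contract_address: str) -> None:
        # Store the new address under the current counter, then advance it.
        self.version_history[self.latest_version] = contract_address
        self.latest_version += 1

    def get_contract_by_version_number(self, version: int) -> str:
        if self.latest_version == 0:
            # Assumption: fail loudly when nothing has been registered yet.
            raise LookupError("no contract version registered")
        if version >= self.latest_version:
            # Out-of-range requests fall back to the most recent version.
            return self.version_history[self.latest_version - 1]
        return self.version_history[version]

    def get_version_number(self) -> int:
        return self.latest_version


# Hypothetical addresses of two successive "Main Data Contract" deployments.
relay = MainDataRelay()
relay.update_version("0xMainDataV1")
relay.update_version("0xMainDataV2")

latest = relay.get_contract_by_version_number(relay.get_version_number())
print(latest)  # prints: 0xMainDataV2
```

Asking for any version number at or beyond the counter returns the newest address, which is the behavior the consistency test relies on.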

Access control and consent authenticity (test 4): To ensure security, every method in the “Main Data Contract” is restricted to the Authority’s wallet address. The transaction is rejected if anyone calls “AddDataByAuthority” from an unauthorized address. Furthermore, to verify the authenticity of the data owner and the consent, we checked whether the user has provided consent: the requested data are signed with the data owner’s private key, and the signature is then checked against the data owner’s public address to confirm its validity.
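
The access-control half of this test can be sketched in Python as an owner check modeled on the isOwner modifier of Algorithm 4. The class name, the method `add_data_by_authority`, the addresses, and the choice to guard `change_owner` with the same check are illustrative assumptions; the signature verification of consent is deliberately left out because it depends on the wallet library in use.

```python
class OwnerGuardedContract:
    """Python model of an owner-restricted contract (after Algorithm 4)."""

    def __init__(self, deployer: str) -> None:
        self.owner = deployer
        # Mirrors "emit OwnerSet(address(0), owner)" in the constructor.
        self.events = [("OwnerSet", None, deployer)]
        self.records: dict[str, str] = {}

    def _require_owner(self, sender: str) -> None:
        # Equivalent of require(msg.sender == owner, "CallerIsNotOwner").
        if sender != self.owner:
            raise PermissionError("CallerIsNotOwner")

    def add_data_by_authority(self, sender: str, proof_id_hash: str, proof: str) -> None:
        self._require_owner(sender)
        self.records[proof_id_hash] = proof

    def change_owner(self, sender: str, new_owner: str) -> None:
        # Assumption: ownership transfer is itself restricted to the owner.
        self._require_owner(sender)
        self.events.append(("OwnerSet", self.owner, new_owner))
        self.owner = new_owner


# Hypothetical wallet addresses.
contract = OwnerGuardedContract("0xAuthority")
contract.add_data_by_authority("0xAuthority", "id-1", "proof-1")

try:
    contract.add_data_by_authority("0xMallory", "id-2", "proof-2")
    rejected = False
except PermissionError:
    rejected = True
print(rejected)  # prints: True
```

The rejected call leaves the stored records unchanged, which is the outcome the test expects for an unauthorized address.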

Immutability (test 5): To check immutability, we verified that the main data contract holds the valid address of the storage contract. Since all data are stored independently in a separate contract rather than in the main contract, the address of the storage contract is compared with the storage address recorded inside the main contract.
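
The relationship between the two contracts checked here can be modeled in a few lines of Python. The sketch follows the function names of Algorithm 2 but is only an in-memory illustration; the addresses and the second "version" of the main contract are hypothetical and serve to show why keeping storage separate preserves the recorded proofs.

```python
class DataStorageLayer:
    """Stand-in for the separately deployed storage contract."""

    def __init__(self, address: str) -> None:
        self.address = address
        self._proofs: dict[str, str] = {}

    def set_proof(self, proof: str, proof_id_hash: str) -> bool:
        self._proofs[proof_id_hash] = proof
        return True

    def get_proof(self, proof_id_hash: str) -> str:
        return self._proofs.get(proof_id_hash, "")


class MainDataContract:
    """Stand-in for the main contract, which only delegates to storage."""

    def __init__(self, storage: DataStorageLayer) -> None:
        self._storage = storage

    def add_record_proof(self, proof: str, proof_id_hash: str) -> bool:
        return self._storage.set_proof(proof, proof_id_hash)

    def get_record_proof(self, proof_id_hash: str) -> str:
        return self._storage.get_proof(proof_id_hash)

    def get_storage_address(self) -> str:
        return self._storage.address


# Hypothetical deployment: one storage contract, two main-contract versions.
storage = DataStorageLayer("0xStorage")
main_v1 = MainDataContract(storage)
main_v1.add_record_proof("proof-1", "id-1")

main_v2 = MainDataContract(storage)  # redeployed logic, same storage

# Test 5: the address held by the main contract must match the storage contract.
address_matches = main_v2.get_storage_address() == storage.address
print(address_matches, main_v2.get_record_proof("id-1"))  # prints: True proof-1
```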

Time consumption (test 6): To calculate the execution time for writing data to the blockchain, we ran the addproof function for multiple records and measured the time consumed by each contract function call.

GDPR compliance

To determine whether the system is GDPR compliant, we assessed it against the following GDPR clauses. In Table 4, we have mapped our system features to each clause and compared them with related work. The proposed system offers capabilities to comply with the GDPR from an application standpoint, for the following reasons:

Art. 6 “Lawfulness of processing”: According to the criteria of the GDPR, the proposed system provides a setting to aid Data Owners in maintaining control over their personal data. This framework aims to meet the primary GDPR requirements, according to which Data Controllers will be able to request consent from Data Owners and inform them clearly and transparently about the data to be managed, as well as the purpose of processing.

Art. 7, 8 “Conditions for consent” & “Conditions applicable to child’s consent in relation to information society services”: Our proposed system can manage the data consent that allows Data Owners to easily control consent regarding access to their personal data and exercise their rights defined by GDPR.

Art. 12–14 “Transparent information, communication, and modalities for the exercise of the rights of the data subject” & “Information to be provided where personal data are collected from the data subject” & “Information to be provided where personal data have not been obtained from the data subject”: The proposed system always requires Data Owners to sign off on blockchain data migration and consent requests. On each blockchain data migration event, the data are signed with the Data Owner’s private key, and the signature is then checked against the Data Owner’s public address to confirm its validity. Data Owners can only sign if they select the same address in MetaMask that was given to the Data Controller.

Table 4:
Comparison of our proposed system with reviewed state-of-the-art.

| GDPR principles | Hu et al. (2020) | Srivastava et al. (2021) | Srivastava et al. (2019) | Our work | In our proposed system |
| --- | --- | --- | --- | --- | --- |
| Lawfulness of processing | No | No | No | Yes | processing of personal data is lawful and transparent because the Data Owner has given her/his consent for processing the data and Data Owners have full permission to grant or revoke consent at any time. |
| Consent conditions | No | No | No | Yes | |
| Transparent information and communication | Yes | No | No | Yes | |
| Right of access | No | Yes | Yes | Yes | no one can alter the Data Controller’s privileges to do all CRUD activities on her personal data as stated in the default policy. |
| Right to rectification | No | No | No | Yes | |
| Right to be informed | No | No | No | Yes | the platform always needs the data owner’s signature before collecting data or obtaining consent. |
| Right to erasure | No | No | No | Yes | as personal data are stored off chain, Data Controller can erase the data as requested by Data Owner. |
| Right to restriction of processing | No | No | No | Yes | Data Owners have full permission to grant or revoke consent anytime. |
| Right to data portability | No | No | No | Yes | |
| Notification of breaches | Yes | No | Yes | Yes | This system keeps every record’s proof on the network. This allows the system to detect any alteration in the database record. |
| Purpose and storage limitation | No | No | No | Yes | personal information is only gathered for clear, legitimate, and stated purposes and is never used in a way that is inconsistent with those purposes. Additionally, the data are only maintained for as long as it is absolutely essential for the goals being pursued. |
| Data minimization | No | No | No | Yes | only data that is relevant, appropriate, and limited to what is needed for the desired outcome is processed. |

DOI: 10.7717/peerjcs.1882/table-4

Art. 15, 16 “Right of access by the data subject” & “Right to rectification”: In our proposed system, no one can alter the data without the Data Owner’s permission. It allows Data Owners to request a copy of their data. It assists them in understanding how their data are being processed. If they find their data inaccurate or incomplete, they have the right to update and rectify the data.

Art. 17 “Right to erasure (right to be forgotten)”: When blockchain is used for data storage, the question of whether the platform is GDPR-compliant arises because distributed ledgers are immutable, which means that data written to them can never be deleted. Our system fulfills this principle since personal data are kept off-chain and the Data Controller can delete them at the Data Owner’s request.

Art. 18 “Right to restriction of processing”: In our system, Data Owners have complete permissions to manage consent, including the ability to grant or revoke consent at any time and from any location by using the consent grant, withdraw, and reject functions. Once consent is withdrawn, no one can perform the activities (Create, Read, Update, Delete) on their personal data that were set in the default policy when consent was granted.

Art. 20 “Right to data portability”: The proposed system gives Data Owners the ability to access and download their personal information from the database. Data Owners can also request that the Data Controller send them a copy of their stored personal information.

Art. 33 “Notification of a personal data breach to the supervisory authority”: Our system stores Data Owners’ data off-chain. It generates irreversible evidence with a timestamp and stores it on the blockchain network. Every datum is turned into an evidence hash and stored, together with the unique ID of the off-chain data, inside the smart storage contract. Because the system keeps the proof of every record on the network, it can detect any alteration of a database record. This supports the Data Controller in detecting data breaches and notifying the data protection authorities accordingly.

Conclusions and Future Work

In this article, a GDPR-compliant data breach detection system is proposed that leverages the benefits of blockchain technology. Smart contracts are created for specific objectives, and their gas usage and cost are analyzed. Each function consumes a variable amount of gas, depending on the logic and complexity of the operations it carries out. Moreover, the proposed system achieves data integrity, security, access control, transparency, and authenticity of the data owner and consent. In the simulation results, the execution time of the different tests is plotted to represent the performance of the proposed system. The results indicate that the proposed system is capable of achieving the insider threat mitigation goal. Future work will involve implementing a data breach severity assessment mechanism using semantic web technologies (ontologies). The severity levels (high, medium, low) will help the Data Controller and the Data Protection Authorities in notifying the affected Data Owners. Furthermore, with elderly users in mind, some biometric traits will be incorporated into parts of our proposed system, which may also open avenues for researchers working in security and biometrics.

Supplemental Information

Smart contract code for data storage layer

DOI: 10.7717/peerj-cs.1882/supp-1

Smart contract code for security layer

DOI: 10.7717/peerj-cs.1882/supp-2

Smart contract code for main data contract

DOI: 10.7717/peerj-cs.1882/supp-3

Smart contract code for data relay

DOI: 10.7717/peerj-cs.1882/supp-4