<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[sebzz]]></title><description><![CDATA[Software Engineer by choice]]></description><link>https://blogs.cmseb.in</link><generator>RSS for Node</generator><lastBuildDate>Sun, 17 May 2026 00:18:49 GMT</lastBuildDate><atom:link href="https://blogs.cmseb.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Build your own key value store]]></title><description><![CDATA[A key value store is a kind of NoSQL database that stores data in key value format. The key is a unique identifier. This type of data model is commonly used for caching, making Redis a good example. Other examples include RocksDB and LevelDB. Th...]]></description><link>https://blogs.cmseb.in/build-your-own-key-value-store</link><guid isPermaLink="true">https://blogs.cmseb.in/build-your-own-key-value-store</guid><category><![CDATA[Databases]]></category><category><![CDATA[Programming Blogs]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sun, 06 Apr 2025 18:30:52 GMT</pubDate><content:encoded><![CDATA[<p>A key value store is a kind of NoSQL database that stores data in key value format. The key is a unique identifier. This type of data model is commonly used for caching, making Redis a good example. Other examples include RocksDB and LevelDB. This is just an outline of my thinking process while building one. I got this idea from the book Designing Data-Intensive Applications.</p>
<p>Building a real key-value store is a hard problem, but like any complex system, it can be broken down into simpler parts. I’ve been fascinated by database systems for a while and always wanted to build one. The idea felt out of reach, though, until I decided to start with the most basic version possible and build up from there. This post walks through how to build that minimal starting point.</p>
<p>I want my key-value store to have the following features:</p>
<ol>
<li><p>The data should be available across restarts, which means that we will be writing the contents to a file.</p>
</li>
<li><p>Since we will be writing to a file, we need to decide on a format that allows both the reader and writer to understand a common language.</p>
</li>
<li><p>The key value store should support basic indexes.</p>
</li>
</ol>
<p>An append-only file (AOF) is a type of file where data is only added to the end. Once data is written, it cannot be deleted or edited. This keeps things simple, as we don’t need to worry about restructuring the file while editing. In case the system crashes while writing, we can simply ignore the corrupt trailing data. Storage devices like HDDs and SSDs are also often optimised for sequential reads and writes.</p>
<p>As far as the format is concerned, let us choose a simple one for a key value pair — a record where <code>:</code> is the delimiter and <code>\n</code> is used to denote the end of the value.</p>
<p><code>&lt;key&gt;&lt;delimiter&gt;&lt;value&gt;&lt;\n&gt;</code></p>
<p>Even though the proposed solution is simple, it is not without its own problems.</p>
<ol>
<li><p>What if someone wants to save <code>:</code> or <code>\n</code> in the key or value? Now the client needs to handle the escaping logic.</p>
</li>
<li><p>And what if someone wants to delete? We cannot delete a record, as this is an append-only file.</p>
</li>
</ol>
<p>Given that engineering is all about tradeoffs, I am well aware that the client can add custom logic and get over these problems easily. Here we would be trading off a bit of usability for simplicity, which might not be a good idea in the real world but works for us.</p>
<p>Now we have simplified the problem from writing a key value store to writing two functions: one that reads a file in a specific format and returns the value for a key, and another that writes a key-value pair to the file in that format.</p>
<p>To get a value for a key, we just need to write a function that reads from a file. Sounds simple, right? It is! Still, there are a few things worth remembering. As we are writing to an append-only file, there is no way for the program to prevent duplicate keys—and if there are duplicate keys, what should we return? Thinking back, an append-only file might not have been such a good idea, right? If there are multiple values for the same key, it just means the user tried to edit the value. Oh wait, if the user tried to edit the value, we don’t really care about the previous one. We can just show the user the latest value. Being an append-only file actually makes this easier for us: read the file from the end, and when the program first encounters the key, return that value. Append-only file to the rescue.</p>
<p>Writing is even easier: just encode the key and value in the above format and append it to the end of the file.</p>
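<p>To make this concrete, here is a minimal sketch in TypeScript (Node.js). The file path and function names are my own choices for illustration, not a prescribed API:</p>
<pre><code class="lang-ts">import * as fs from "node:fs";

const DB_PATH = "./kv.db"; // hypothetical location of our append-only file

// Append a record in the &lt;key&gt;:&lt;value&gt;\n format.
function set(key: string, value: string): void {
  fs.appendFileSync(DB_PATH, `${key}:${value}\n`);
}

// Scan the records from the end so the latest write for a key wins.
function get(key: string): string | undefined {
  if (!fs.existsSync(DB_PATH)) return undefined;
  const lines = fs.readFileSync(DB_PATH, "utf8").split("\n");
  for (let i = lines.length - 1; i &gt;= 0; i--) {
    const sep = lines[i].indexOf(":");
    if (sep !== -1 &amp;&amp; lines[i].slice(0, sep) === key) {
      return lines[i].slice(sep + 1);
    }
  }
  return undefined;
}

set("name", "sebzz");
set("name", "sebin"); // an "edit" is just another append
console.log(get("name")); // "sebin": the latest value wins
</code></pre>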
<p>Your basic key value store is ready. You start using it, and soon you have millions of key values. You notice that some key lookups are slow while others are very fast. This is not a good experience for the user, and we are asked to debug the issue. Computers are not magic; there is some logic behind any issue you face. Soon we realise that newer keys are faster to read and older keys are slower. Why, you might ask? The answer lies in how we read the value: if you remember, we read the file from the end, and the values at the end are the newer ones. Now what?</p>
<p>We add an index. Just like any book has an index, we are going to build one for our key value store. Now we need to choose the right data structure for the index. Which data structure is most similar to a key value store? A hashmap. This data structure provides O(1) lookup time. For people who don’t know what O(1) means, it just means it's fast and the speed doesn’t grow with the size of the data. That makes it an ideal candidate for this scenario.</p>
<p>Now we need to modify the get and set functions. While setting a value, we also need to update the index with the key and an offset as the value. An offset just tells you where to start looking for the value rather than scanning the entire file. To get a value, check the index and jump to the position it points to.</p>
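<p>Here is a hedged sketch of how the indexed versions might look, again in TypeScript; the in-memory Map, the 4 KB read buffer, and the function names are all my own assumptions for illustration:</p>
<pre><code class="lang-ts">import * as fs from "node:fs";

const DB_PATH = "./kv.db"; // hypothetical file location
const index = new Map(); // key -&gt; byte offset of its latest record

function setIndexed(key: string, value: string): void {
  // The new record starts at the current end of the file.
  const offset = fs.existsSync(DB_PATH) ? fs.statSync(DB_PATH).size : 0;
  fs.appendFileSync(DB_PATH, `${key}:${value}\n`);
  index.set(key, offset);
}

function getIndexed(key: string): string | undefined {
  const offset = index.get(key);
  if (offset === undefined) return undefined;
  // Jump straight to the record instead of scanning the whole file.
  const fd = fs.openSync(DB_PATH, "r");
  const buf = Buffer.alloc(4096); // assumes a record fits in 4 KB
  const bytesRead = fs.readSync(fd, buf, 0, buf.length, offset);
  fs.closeSync(fd);
  const line = buf.toString("utf8", 0, bytesRead).split("\n")[0];
  return line.slice(line.indexOf(":") + 1);
}
</code></pre>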
<p>Can you see the problem with this solution? The hashmap will be empty if the key-value store restarts. That's easy: just rebuild the index from scratch every time the system restarts and we are good to go. But there is a small problem now: the start-up time of the database will be slower. That is again a tradeoff we are willing to take for now.</p>
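<p>Rebuilding is a single pass over the file; since later records overwrite earlier index entries, the index ends up pointing at the latest value for every key. A sketch, continuing the example above:</p>
<pre><code class="lang-ts">// Replay the whole file once at startup to repopulate the in-memory index.
function rebuildIndex(): void {
  if (!fs.existsSync(DB_PATH)) return;
  let offset = 0;
  for (const line of fs.readFileSync(DB_PATH, "utf8").split("\n")) {
    const sep = line.indexOf(":");
    if (sep !== -1) index.set(line.slice(0, sep), offset);
    offset += Buffer.byteLength(line, "utf8") + 1; // +1 for the \n
  }
}
</code></pre>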
<p>But hey… we’re not done yet. This was the simplest possible key-value store — and we made it work! But real-world databases aren’t this chill. So let’s end this blog with a few juicy questions to leave you thinking:</p>
<ul>
<li><p>What is a database that doesn’t even offer full CRUD functionality out of the box? And yet, people still use it?</p>
</li>
<li><p>What if the user wants to store the delimiter in the key or value itself? Uh oh...</p>
</li>
<li><p>What happens when the index becomes so large that it no longer fits in memory? Can we still do O(1) lookups?</p>
</li>
<li><p>What if the file becomes so massive that it doesn’t fit on a single disk? What then? Sharding? Splitting? Crying?</p>
</li>
</ul>
<p>Ok bie.</p>
]]></content:encoded></item><item><title><![CDATA[How Databases Handle Multiple Transactions]]></title><description><![CDATA[Concurrency control is a set of techniques by which a database system handles concurrently executing transactions. It is one of the key components in the transaction manager of a database system. It is tasked with ensuring that the concurrent transacti...]]></description><link>https://blogs.cmseb.in/concrrency-control</link><guid isPermaLink="true">https://blogs.cmseb.in/concrrency-control</guid><category><![CDATA[DBMS]]></category><category><![CDATA[DBMS-Basics]]></category><category><![CDATA[DBMS Architecture]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Thu, 06 Feb 2025 14:30:07 GMT</pubDate><content:encoded><![CDATA[<p>Concurrency control is a set of techniques by which a database system handles concurrently executing transactions. It is one of the key components in the transaction manager of a database system. It is tasked with ensuring that concurrent transactions follow the <a target="_blank" href="https://sebzz.hashnode.dev/acid-properties-of-a-relational-database">ACID</a> principles that a relational database system guarantees.</p>
<h3 id="heading-why-do-we-need-concurrency-control">Why do we need Concurrency Control?</h3>
<p>To understand the need for concurrency control, we first need to define what a schedule and a serial schedule are. A schedule is a list of operations required to execute a set of transactions from a database perspective. When transactions are executed one after another without interleaving, such schedules are called serial schedules. Ideally, all schedules should be serial because this ensures strict adherence to the ACID (Atomicity, Consistency, Isolation, Durability) principles. However, modern computers have multiple cores, and databases are often distributed across multiple servers. This means that database systems need to handle multiple transactions concurrently to improve performance and resource utilization. A straightforward way to enforce a serial schedule is to wait for each transaction to commit before starting the next one. However, this approach has a major drawback—it leads to underutilized resources. In high-traffic scenarios, it would significantly increase transaction processing time, making the system inefficient.</p>
<h3 id="heading-key-conflict-scenarios">Key Conflict Scenarios</h3>
<p>A transaction can have multiple read and write operations. For simplicity, let us assume that each transaction contains either a read or a write operation. Now, when two transactions run concurrently, four possible scenarios can occur:</p>
<ol>
<li><p>Read - Read (RR)</p>
</li>
<li><p>Read - Write (RW)</p>
</li>
<li><p>Write - Read (WR)</p>
</li>
<li><p>Write - Write (WW)</p>
</li>
</ol>
<p>We don’t need to worry about Read-Read (RR) transactions since they do not cause any conflicts. Reading the same data multiple times does not modify it, so it has no impact on consistency. However, the other three cases can lead to concurrency issues, so let’s explore them with examples.</p>
<h5 id="heading-read-write">Read - Write</h5>
<p>Occurs when Transaction T1 reads a value, and Transaction T2 writes to it before T1 finishes. Example:</p>
<ul>
<li><p>T1: Reads balance = 1000 from an account.</p>
</li>
<li><p>T2: Updates balance = 1200.</p>
</li>
<li><p>T1: Still believes balance = 1000, leading to outdated or inconsistent data.</p>
</li>
</ul>
<h5 id="heading-write-read">Write - Read</h5>
<p>Occurs when Transaction T1 writes a value, and Transaction T2 reads it before T1 commits. Example:</p>
<ul>
<li><p>T1: Writes balance = 1200 but hasn’t committed yet.</p>
</li>
<li><p>T2: Reads balance = 1200, assuming it’s final.</p>
</li>
<li><p>T1: Later rolls back, meaning T2 read an uncommitted value, leading to the dirty read problem.</p>
</li>
</ul>
<h5 id="heading-write-write">Write - Write</h5>
<p>Occurs when both transactions write to the same value, potentially leading to lost updates. Example:</p>
<ul>
<li><p>T1: Updates balance = 1100.</p>
</li>
<li><p>T2: Updates balance = 900, unaware of T1’s update.</p>
</li>
<li><p>The final value could be either 1100 or 900, depending on which write happens last, potentially leading to lost data.</p>
</li>
</ul>
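<p>To see the write-write hazard concretely, here is a toy TypeScript simulation (the balances and delays are made up): two "transactions" read the same balance and write back independently, so one update is lost.</p>
<pre><code class="lang-ts">// A toy in-memory "database" with no locking and no versioning.
let balance = 1000;

async function transfer(delta: number, delayMs: number): Promise&lt;void&gt; {
  const snapshot = balance; // read
  await new Promise((resolve) =&gt; setTimeout(resolve, delayMs)); // interleaving window
  balance = snapshot + delta; // write based on a possibly stale read
}

async function main(): Promise&lt;void&gt; {
  // T1 adds 100, T2 subtracts 100; they run concurrently.
  await Promise.all([transfer(100, 10), transfer(-100, 20)]);
  console.log(balance); // 900: T1's update was silently lost
}

main();
</code></pre>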
<h3 id="heading-understanding-concurrency-control-in-database-systems">Understanding Concurrency Control in Database Systems</h3>
<p>One of the key deciding factors of a database system or database engine is how it manages concurrency control. Efficient concurrency control ensures data consistency and integrity while allowing multiple transactions to execute simultaneously. There are three widely used concurrency control mechanisms: Optimistic Concurrency Control (OCC), Multi-Version Concurrency Control (MVCC), and Pessimistic Concurrency Control (PCC).</p>
<p><strong>Optimistic Concurrency Control (OCC)</strong><br />Optimistic Concurrency Control operates under the assumption that transaction conflicts are rare. Instead of blocking execution, OCC allows transactions to execute concurrently and validates their serializability before committing the results. This approach is widely used in modern database engines, such as WiredTiger, which employs OCC for document-level concurrency.<br />OCC generally follows three phases:</p>
<ol>
<li><p>Read Phase: The transaction collects dependencies (read sets) and potential side effects (write sets) while reading data.</p>
</li>
<li><p>Validation Phase: Before committing, the system checks if any concurrent transactions violate serializability. If conflicts are detected, the transaction is aborted.</p>
</li>
<li><p>Write Phase: If no conflicts are found, the transaction is committed, and changes are applied to the database state.</p>
</li>
</ol>
<p>This approach is particularly effective for workloads with infrequent conflicts, as it reduces the overhead of locking mechanisms.</p>
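<p>The three phases can be sketched in a few lines of TypeScript. This only shows the general OCC idea under my own simplifications (per-record version numbers, single-threaded validation), not how any particular engine such as WiredTiger implements it:</p>
<pre><code class="lang-ts">// Toy store: each key maps to { value, version }.
const store = new Map();
store.set("balance", { value: 1000, version: 0 });

interface Txn {
  readSet: Map&lt;string, number&gt;;  // key -&gt; version observed (read phase)
  writeSet: Map&lt;string, number&gt;; // key -&gt; new value to apply
}

function read(txn: Txn, key: string): number {
  const rec = store.get(key);
  txn.readSet.set(key, rec.version); // remember what we depended on
  return rec.value;
}

function commit(txn: Txn): boolean {
  // Validation phase: abort if anything we read has changed since.
  for (const [key, version] of txn.readSet) {
    if (store.get(key).version !== version) return false;
  }
  // Write phase: apply the changes and bump versions.
  for (const [key, value] of txn.writeSet) {
    const rec = store.get(key);
    store.set(key, { value, version: rec.version + 1 });
  }
  return true;
}
</code></pre>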
<p><strong>Multi-Version Concurrency Control (MVCC)</strong><br />Multi-Version Concurrency Control (MVCC) allows multiple versions of the same data to exist simultaneously, ensuring a consistent view of the database at a specific point in time. MVCC is commonly used in databases like PostgreSQL and MySQL (InnoDB) to enhance performance and reduce contention.<br />MVCC can be implemented in different ways, including:</p>
<ul>
<li><p>Validation techniques, where only one of the conflicting transactions is allowed to commit.</p>
</li>
<li><p>Lockless techniques, such as timestamp ordering, which ensures transactions execute in a predefined sequence.</p>
</li>
<li><p>Lock-based approaches, such as two-phase locking (2PL), where locks are acquired and released in phases to maintain consistency.</p>
</li>
</ul>
<p>MVCC provides a non-blocking read mechanism, making it an excellent choice for high-concurrency environments where multiple transactions need to read data without waiting for write locks to be released.</p>
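<p>A hedged sketch of the core MVCC idea: writers append new versions instead of overwriting, and a reader sees only the versions committed at or before its snapshot timestamp. Real engines differ widely in the details (visibility rules, garbage collection, and so on):</p>
<pre><code class="lang-ts">// Each key keeps a history of versions tagged with a commit timestamp.
const versions = new Map(); // key -&gt; array of { ts, value }
let clock = 0;

function write(key: string, value: number): void {
  clock += 1;
  const history = versions.get(key) ?? [];
  history.push({ ts: clock, value }); // append, never overwrite
  versions.set(key, history);
}

// A read at snapshot `ts` returns the newest version visible at that time,
// without blocking any concurrent writer.
function readAt(key: string, ts: number): number | undefined {
  const history = versions.get(key) ?? [];
  for (let i = history.length - 1; i &gt;= 0; i--) {
    if (history[i].ts &lt;= ts) return history[i].value;
  }
  return undefined;
}

write("balance", 1000);  // committed at ts = 1
const snapshot = clock;  // a reader takes its snapshot here
write("balance", 1200);  // committed at ts = 2
console.log(readAt("balance", snapshot)); // 1000: the old version is still visible
</code></pre>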
<p><strong>Pessimistic Concurrency Control (PCC)</strong><br />Pessimistic Concurrency Control, also known as Conservative Concurrency Control, operates by blocking or aborting transactions as soon as a potential conflict is detected. Unlike OCC, which assumes conflicts are rare, PCC proactively prevents them by restricting access to shared resources.<br />PCC can be implemented using two main approaches:</p>
<ol>
<li><p>Lock-Based Approach: Transactions acquire locks on database records, preventing other transactions from modifying locked records until the lock is released.</p>
</li>
<li><p>Non-Lock-Based Approach: Instead of using explicit locks, the system maintains a list of read and write operations and restricts execution based on these dependencies.</p>
</li>
</ol>
<p>While PCC ensures strict consistency, it can lead to performance bottlenecks due to deadlocks, where two or more transactions wait indefinitely for each other to release locks. Proper deadlock detection and resolution mechanisms are essential to mitigate this issue.</p>
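<p>And a toy sketch of the lock-based flavour, with one exclusive lock per key. There is no waiting queue and no deadlock detection here; real engines do far more:</p>
<pre><code class="lang-ts">// One exclusive lock per key; a transaction must acquire it before writing.
const locks = new Map(); // key -&gt; id of the transaction holding the lock

function acquire(txnId: string, key: string): boolean {
  const holder = locks.get(key);
  if (holder !== undefined &amp;&amp; holder !== txnId) {
    return false; // conflict: the caller must wait or abort
  }
  locks.set(key, txnId);
  return true;
}

function release(txnId: string, key: string): void {
  if (locks.get(key) === txnId) locks.delete(key);
}

console.log(acquire("T1", "balance")); // true: T1 holds the lock
console.log(acquire("T2", "balance")); // false: T2 must wait
release("T1", "balance");
console.log(acquire("T2", "balance")); // true: now T2 may proceed
</code></pre>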
<hr />
<h4 id="heading-conclusion"><strong>Conclusion</strong></h4>
<p>Concurrency control is a critical aspect of database systems, influencing their efficiency, scalability, and consistency. <strong>Optimistic Concurrency Control (OCC)</strong> is best suited for workloads with minimal conflicts, <strong>Multi-Version Concurrency Control (MVCC)</strong> balances performance and consistency with non-blocking reads, and <strong>Pessimistic Concurrency Control (PCC)</strong> is ideal for highly transactional environments where preventing conflicts is a priority.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding the Need for Virtual DOM]]></title><description><![CDATA[Introduction
When I started to get into frontend development I just wanted to learn React, so I jumped straight into coding without knowing why React exists or what problem it solves. As a result I failed to appreciate the library. Now tha...]]></description><link>https://blogs.cmseb.in/understanding-the-need-for-virtual-dom</link><guid isPermaLink="true">https://blogs.cmseb.in/understanding-the-need-for-virtual-dom</guid><category><![CDATA[React]]></category><category><![CDATA[DOM manipulation]]></category><category><![CDATA[JavaScript]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sat, 31 Aug 2024 13:11:32 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>When I started to get into frontend development I just wanted to learn React, so I jumped straight into coding without knowing why React exists or what problem it solves. As a result, I failed to appreciate the library. Now that I have worked with it for a few years, I am trying to answer these questions for myself and documenting the answers for the future. For anyone trying to understand the need for React, I will try to explain my understanding of the library.</p>
<h2 id="heading-what-is-a-dom">What is a DOM</h2>
<p>DOM stands for <strong>Document Object Model</strong>; it is the browser's representation of your HTML code as a tree structure. Each element in the HTML document is represented as a node in this tree. The DOM also allows your JavaScript to interact with your HTML document, letting you manipulate content.</p>
<h3 id="heading-steps-involved-in-paint-of-layout">Steps involved in painting the layout</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725451527293/6ef287d1-7003-4a47-b395-30aadab633d9.png" alt /></p>
<p>After data is fetched in chunks of 8 KB, a content tree is created by parsing the HTML document and converting it into DOM nodes. Then a render tree is constructed by parsing the CSS together with the content tree. After the render tree is constructed, each node is given the exact coordinates where it should appear on the screen. Then the back-end UI layer is traversed and painted. For every small change, the browser uses a dirty bit system: a DOM node that changes marks itself dirty, and an incremental layout change is triggered. This incremental change is done by repaint and reflow.</p>
<h4 id="heading-repaint">Repaint</h4>
<p>A repaint occurs when changes are made to the appearance of elements that affect their visibility but do not affect the layout.</p>
<h4 id="heading-reflow">Reflow</h4>
<p>Reflow means re-calculating the positions and geometries of elements in the document. A reflow happens when changes are made to elements that affect the layout of part of or the whole page. The reflow of an element will cause the subsequent reflow of all its child and ancestor elements in the DOM.</p>
<p>Both reflow and repaint are expensive operations.</p>
<h2 id="heading-virtual-dom-vdom">Virtual DOM (VDOM)</h2>
<p>Virtual DOM is a lightweight in memory copy of the real DOM. It is a clever hack that some libraries use to make UI updates more efficient.</p>
<h4 id="heading-how-is-virtual-dom-represented">How is virtual DOM represented</h4>
<p>React's virtual DOM is a "<strong>virtual</strong>" representation of a user interface (a tree, where each element is a node that holds an object), which is preserved in memory and synchronised with the browser's DOM via React's ReactDOM library.</p>
<p>Below is a code snippet for a basic React component. The component increments the value of the variable <code>count</code> by one.</p>
<pre><code class="lang-jsx"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
 <span class="hljs-keyword">const</span> [count, setCount] = useState(<span class="hljs-number">0</span>);

 <span class="hljs-keyword">return</span> (
   <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
     <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Counter: {count}<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
     <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onClick</span>=<span class="hljs-string">{()</span> =&gt;</span> setCount(count + 1)}&gt;Increment<span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
   <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
 );
}
</code></pre>
<p>Below we can see how React represents this component as a plain JavaScript object.</p>
<pre><code class="lang-json">{
 <span class="hljs-attr">"type"</span>: <span class="hljs-string">"div"</span>,
 <span class="hljs-attr">"props"</span>: {},
 <span class="hljs-attr">"children"</span>: [
   {
     <span class="hljs-attr">"type"</span>: <span class="hljs-string">"h1"</span>,
     <span class="hljs-attr">"props"</span>: {},
     <span class="hljs-attr">"children"</span>: [
       {
         <span class="hljs-attr">"type"</span>: <span class="hljs-string">"TEXT_ELEMENT"</span>,
         <span class="hljs-attr">"props"</span>: {
           <span class="hljs-attr">"nodeValue"</span>: <span class="hljs-string">"Counter: 0"</span>
         }
       }
     ]
   },
   {
     <span class="hljs-attr">"type"</span>: <span class="hljs-string">"button"</span>,
     <span class="hljs-attr">"props"</span>: {
       <span class="hljs-attr">"onClick"</span>: <span class="hljs-string">"setCount(count + 1)"</span>
     },
     <span class="hljs-attr">"children"</span>: [
       {
         <span class="hljs-attr">"type"</span>: <span class="hljs-string">"TEXT_ELEMENT"</span>,
         <span class="hljs-attr">"props"</span>: {
           <span class="hljs-attr">"nodeValue"</span>: <span class="hljs-string">"Increment"</span>
         }
       }
     ]
   }
 ]
}
</code></pre>
<h3 id="heading-how-virtual-dom-is-faster">How virtual DOM is faster</h3>
<p>As we have seen in the sections above, manipulating the DOM is an expensive operation when you are making a bunch of changes, because manipulating native JavaScript objects is much faster than manipulating the DOM. When a change is requested by the UI, it is first applied to the virtual DOM. Multiple changes are batched together, and the smallest number of updates required to synchronise the real DOM with the virtual DOM is computed and applied. This is done by finding the difference between the DOM and the virtual DOM. This reduction in the number of operations is what makes React faster.</p>
<h2 id="heading-reconciliation-using-diffing-of-virtual-dom">Reconciliation using diffing of virtual DOM</h2>
<p>When there are changes, React walks the component tree to reconcile nested components. It compares component types, props, and keys to determine whether a component needs to be updated, added, or removed, and creates another version of the virtual DOM. With two versions of the virtual DOM, it uses a diffing algorithm to identify the differences between them, trying to minimize the number of changes needed. The algorithm assumes that elements of different types will result in different trees, and elements that don't need to be checked can be marked as static. If the root element type changes, the old tree is discarded and a new one is built, effectively performing a full rebuild of the tree. If the element type remains the same, React compares the attributes of both versions and updates only the nodes with changes, without altering the tree structure. The component will be updated in the next lifecycle call.</p>
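<p>To make the idea concrete, here is a heavily simplified, hypothetical sketch of type-based diffing in TypeScript. This is not React's actual reconciler, which also handles keys, component types, and batching:</p>
<pre><code class="lang-ts">interface VNode {
  type: string;
  props: Record&lt;string, unknown&gt;;
  children: VNode[];
}

type Patch =
  | { kind: "REPLACE"; next: VNode }                  // different types: rebuild subtree
  | { kind: "PROPS"; next: Record&lt;string, unknown&gt; }  // same type: patch attributes
  | { kind: "NONE" };

function diff(oldNode: VNode, newNode: VNode): Patch {
  // Elements of different types are assumed to produce different trees.
  if (oldNode.type !== newNode.type) {
    return { kind: "REPLACE", next: newNode };
  }
  // Same type: update only the attributes that actually differ.
  if (JSON.stringify(oldNode.props) !== JSON.stringify(newNode.props)) {
    return { kind: "PROPS", next: newNode.props };
  }
  return { kind: "NONE" }; // children would be diffed recursively
}
</code></pre>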
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>React’s Virtual DOM provides a more efficient way to handle UI updates by minimizing the costly operations associated with manipulating the real DOM. By working with a lightweight, in-memory representation of the DOM, React can batch updates, find the minimal set of changes required, and apply them in an optimized manner. This approach not only improves performance but also simplifies the development process by abstracting the complexities of direct DOM manipulation. For developers, understanding this process can deepen their appreciation of React’s capabilities and why it’s a popular choice for building modern, dynamic web applications.</p>
<hr />
<h4 id="heading-links">Links</h4>
<p><a target="_blank" href="https://web.dev/articles/howbrowserswork#the_main_flow">https://web.dev/articles/howbrowserswork#the_main_flow</a> <a target="_blank" href="https://www.geeksforgeeks.org/what-is-diffing-algorithm/">https://www.geeksforgeeks.org/what-is-diffing-algorithm/</a> <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work">https://developer.mozilla.org/en-US/docs/Web/Performance/How_browsers_work</a> <a target="_blank" href="https://dev.to/gopal1996/understanding-reflow-and-repaint-in-the-browser-1jbg">https://dev.to/gopal1996/understanding-reflow-and-repaint-in-the-browser-1jbg</a> <a target="_blank" href="https://www.freecodecamp.org/news/what-is-the-virtual-dom-in-react/">https://www.freecodecamp.org/news/what-is-the-virtual-dom-in-react/</a> <a target="_blank" href="https://refine.dev/blog/react-virtual-dom/#drawbacks-in-updating-the-dom">https://refine.dev/blog/react-virtual-dom/#drawbacks-in-updating-the-dom</a> <a target="_blank" href="https://dev.to/geraldhamiltonwicks/understanding-diffing-algorithm-in-react-5581">https://dev.to/geraldhamiltonwicks/understanding-diffing-algorithm-in-react-5581</a></p>
<hr />
<h5 id="heading-connect-with-me">Connect with me</h5>
<ul>
<li><p>Email: <a target="_blank" href="mailto:sebinsebzz2002@gmail.com">sebinsebzz2002@gmail.com</a></p>
</li>
<li><p>GitHub: <a target="_blank" href="http://github.com/sebzz2k2">github.com/sebzz2k2</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[ACID properties of a relational database]]></title><description><![CDATA[Any operation that can possibly access or modify the contents of a database is called a transaction. Databases follow ACID properties in order to maintain the consistency of data before and after a given transaction. ACID is an acronym for Atomicity, ...]]></description><link>https://blogs.cmseb.in/acid-properties-of-a-relational-database</link><guid isPermaLink="true">https://blogs.cmseb.in/acid-properties-of-a-relational-database</guid><category><![CDATA[Databases]]></category><category><![CDATA[General Programming]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sat, 20 Jul 2024 16:05:28 GMT</pubDate><content:encoded><![CDATA[<p>Any operation that can possibly access or modify the contents of a database is called a transaction. Databases follow <strong>ACID</strong> properties in order to maintain the consistency of data before and after a given transaction. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. These principles are essential for ensuring reliable and secure database transactions.</p>
<h2 id="heading-a-atomicity">A - Atomicity</h2>
<p>Atomicity is the property of a database by which a transaction is guaranteed either to complete fully or to be rolled back to its initial state.</p>
<h3 id="heading-features-of-atomicity">Features of Atomicity</h3>
<ul>
<li><p>Serializability ensures that a series of operations requested by a single user appear as a single operation to an outside observer, such as another process or query.</p>
</li>
<li><p>Recoverability guarantees that a database engine will not produce partial results; a transaction will either complete fully or fail entirely.</p>
</li>
</ul>
<h3 id="heading-how-is-atomicity-achieved">How is Atomicity achieved</h3>
<p>One of the easiest ways to achieve atomicity is to use undo logs and redo logs. An <strong>undo log</strong> is a collection of undo records associated with a single read-write transaction. It contains information about how to undo the latest change a transaction made to a primary key index. If another transaction needs to see the original state, it is retrieved from the undo log. Another way is by using a <strong>redo log</strong>: a disk-based data structure which is used during crash recovery to correct data written by incomplete transactions. <strong>Two-phase commit</strong> is a common way to achieve atomicity in distributed database systems; using it, a commit occurs only when all involved systems say yes.</p>
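<p>As a toy illustration of the undo-log idea (entirely in memory, nothing like a real engine): record the old value before each change so an abort can restore the initial state.</p>
<pre><code class="lang-ts">const db = new Map();
db.set("balance", 1000);

// Undo log: one entry per change, holding the value to restore on abort.
const undoLog: { key: string; oldValue: number }[] = [];

function txnWrite(key: string, value: number): void {
  undoLog.push({ key, oldValue: db.get(key) }); // log before modifying
  db.set(key, value);
}

function rollback(): void {
  // Undo in reverse order, restoring each original value.
  while (undoLog.length &gt; 0) {
    const entry = undoLog.pop()!;
    db.set(entry.key, entry.oldValue);
  }
}

txnWrite("balance", 1200);
rollback();
console.log(db.get("balance")); // 1000: back to the initial state
</code></pre>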
<h2 id="heading-c-consistency">C - Consistency</h2>
<p>A database system must remain consistent before and after a transaction. Keeping data consistent means that any change of data in a single table must be reflected across all linked tables and entities as well. Only valid data will be written to the database; if data breaks the rules of consistency, that transaction will be rolled back. A consistent read means that the transaction is presented with a snapshot of the database as it existed at the point the read transaction started; writes that occurred after that point will not be presented. As it is difficult to ensure business-logic consistency automatically, one must enforce consistency with serialisable transactions or with explicit blocking locks.</p>
<h2 id="heading-i-isolation">I - Isolation</h2>
<p>Isolation is the state of separation. A good database must allow multiple transactions to be executed simultaneously without the data of one having an impact on another.</p>
<h2 id="heading-levels-of-isolation">Levels of isolation</h2>
<ol>
<li><p>The <strong>Read uncommitted isolation</strong> level can see uncommitted data from other transactions.</p>
</li>
<li><p>The <strong>Read committed isolation</strong> level can only see data committed before the query began. It never sees uncommitted data or changes committed by concurrent transactions during the query's execution.</p>
</li>
<li><p>The <strong>Repeatable Read isolation</strong> level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed by concurrent transactions during the transaction's execution. However, each query does see the effects of previous updates executed within its own transaction, even though they are not yet committed.</p>
</li>
<li><p>The <strong>Serializable isolation</strong> level provides the strictest transaction isolation. This level emulates serial transaction execution for all committed transactions; as if transactions had been executed one after another, serially, rather than concurrently.</p>
</li>
</ol>
<h3 id="heading-the-phenomena-which-are-prohibited-at-various-level-of-isolation-are">The phenomena which are prohibited at various levels of isolation are</h3>
<ol>
<li><p>A <strong>dirty read</strong> occurs when the database reads uncommitted data.</p>
</li>
<li><p>Different values being retrieved from the same row in a database during the same transaction is known as a <strong>non-repeatable read</strong>.</p>
</li>
<li><p>Different collections of rows being returned for the same query within the same transaction is known as a <strong>phantom read</strong>.</p>
</li>
<li><p>The result of successfully committing a group of transactions being inconsistent with all possible orderings of running those transactions one at a time is called a <strong>serialization anomaly</strong>.</p>
</li>
</ol>
<p>The table below summarises which phenomena are possible at each isolation level:</p>
<table>
<thead>
<tr><th>Isolation level</th><th>Dirty read</th><th>Non-repeatable read</th><th>Phantom read</th><th>Serialization anomaly</th></tr>
</thead>
<tbody>
<tr><td><strong>Read Uncommitted</strong></td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
<tr><td><strong>Read Committed</strong></td><td>No</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
<tr><td><strong>Repeatable Read</strong></td><td>No</td><td>No</td><td>Yes</td><td>Yes</td></tr>
<tr><td><strong>Serializable</strong></td><td>No</td><td>No</td><td>No</td><td>No</td></tr>
</tbody>
</table>
<p>When simultaneous processes or users manipulate data and the data does not end up with any inconsistency, it is thanks to a feature called concurrency control. It is one of the ways a database guarantees isolation.</p>
<ul>
<li><p>Pessimistic concurrency control is when the database assumes that anything that can go wrong will go wrong. This approach prevents conflicts before they even occur. The database uses read and write locks to avoid this situation.</p>
</li>
<li><p>A read lock allows multiple transactions to read the same item but not write to it, whereas a write lock is an exclusive lock that only one transaction can hold at a time. This blocks other transactions from updating the same database item.</p>
</li>
<li><p>By assuming something can always go wrong, the database does not allow a transaction to read or write uncommitted database items.</p>
</li>
<li><p>Optimistic concurrency control is when transactions do not obtain locks on the data they read or write. Unlike the pessimistic approach, the database checks for conflicts at the end of the transaction.</p>
</li>
<li><p>The repeatable read isolation level can be achieved with the optimistic approach.</p>
</li>
</ul>
<h2 id="heading-d-durability">D - Durability</h2>
<p>Durability is the property of a database that guarantees that once data is committed, it is not lost and is permanently available on disk.</p>
<h3 id="heading-how-does-database-achieve-durability">How does a database achieve durability</h3>
<ol>
<li><p>Write-ahead logging is when data is written to the redo log before the transaction is applied, making the changes permanent. In case of failure, the database system replays the redo log.</p>
</li>
<li><p>We can also periodically write the state of the database to disk; these snapshots are called checkpoints. They reduce the effort required for data recovery.</p>
</li>
<li><p>RAID (Redundant Array of Inexpensive Disks) is used to integrate several drives into a single logical unit. RAID can be used to implement redundancy and ensure that data is durable even in the event of a disk failure.</p>
</li>
<li><p>Storing multiple copies of the same data redundantly can also help in case of system failures.</p>
</li>
</ol>
<hr />
<h4 id="heading-links">Links</h4>
<ul>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/how-databases-guarantee-isolation/">https://www.freecodecamp.org/news/how-databases-guarantee-isolation/</a></p>
</li>
<li><p><a target="_blank" href="http://www0.cs.ucl.ac.uk/staff/B.Karp/0133/f2021/lectures/0133-lecture6-2PC.pdf">http://www0.cs.ucl.ac.uk/staff/B.Karp/0133/f2021/lectures/0133-lecture6-2PC.pdf</a></p>
</li>
<li><p><a target="_blank" href="https://www.javatpoint.com/implementation-of-atomicity-and-durability-in-dbms">https://www.javatpoint.com/implementation-of-atomicity-and-durability-in-dbms</a></p>
</li>
<li><p><a target="_blank" href="https://www.webopedia.com/definitions/atomic-operation/">https://www.webopedia.com/definitions/atomic-operation/</a></p>
</li>
<li><p><a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/">https://dev.mysql.com/doc/refman/8.0/en/</a></p>
</li>
<li><p><a target="_blank" href="https://www.postgresql.org/docs/current/app-psql.html">https://www.postgresql.org/docs/current/app-psql.html</a></p>
</li>
<li><p><a target="_blank" href="https://www.geeksforgeeks.org/acid-properties-in-dbms/">https://www.geeksforgeeks.org/acid-properties-in-dbms/</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Gitguard: where sloppy commits meet their match]]></title><description><![CDATA[As software development teams grow larger and more diverse, maintaining a clean and organized codebase becomes increasingly important. One aspect of this organization involves creating meaningful and easy-to-understand commit messages. Enter Conventi...]]></description><link>https://blogs.cmseb.in/gitguard</link><guid isPermaLink="true">https://blogs.cmseb.in/gitguard</guid><category><![CDATA[GitHub]]></category><category><![CDATA[Git]]></category><category><![CDATA[commit messages]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Fri, 26 Jan 2024 14:39:31 GMT</pubDate><content:encoded><![CDATA[<p>As software development teams grow larger and more diverse, maintaining a clean and organized codebase becomes increasingly important. One aspect of this organization involves creating meaningful and easy-to-understand commit messages. Enter Conventional Commits and GitGuard, two tools that aim to simplify the process of managing commit messages while promoting best practices within your codebase.</p>
<h4 id="heading-what-are-conventional-commits">What are Conventional Commits?</h4>
<p>Conventional Commits is a specification that outlines guidelines for writing commit messages. These guidelines help create a common standard for developers to understand the changes introduced in a codebase. A typical Conventional Commit message consists of three parts:</p>
<pre><code class="lang-bash">&lt;<span class="hljs-built_in">type</span>&gt;(&lt;scope&gt;): &lt;message&gt;
</code></pre>
<ol>
<li><p><strong>Type</strong>: Categorizes the commit into different types, indicating the nature of the change (e.g., <code>feat</code>, <code>fix</code>, <code>chore</code>, etc.).</p>
</li>
<li><p><strong>Scope</strong>: Specifies the module, component, or area of the project affected by the commit (optional).</p>
</li>
<li><p><strong>Message</strong>: Provides a concise and clear description of the changes made.</p>
</li>
</ol>
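<p>For example, a few commit messages that follow this convention (the scopes here are invented for illustration):</p>
<pre><code class="lang-bash">git commit -m "feat(auth): add OAuth2 login flow"
git commit -m "fix(parser): handle empty input without crashing"
git commit -m "chore: bump dependencies"
</code></pre>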
<p>By using Conventional Commits, you can benefit from:</p>
<ol>
<li><p>Automated semantic versioning: Tools can analyze commit messages to determine version increments based on the types of changes introduced.</p>
</li>
<li><p>Meaningful release notes: Consistent commit messages make it easier to generate relevant information about new features, bug fixes, and other changes during release preparation.</p>
</li>
<li><p>Improved project understanding: Following a convention like Conventional Commits enhances the overall comprehension of the project history, facilitating better collaboration among developers, contributors, and maintainers.</p>
</li>
</ol>
<p>However, implementing Conventional Commits manually can be time-consuming and prone to errors. That's where GitGuard comes in.</p>
<h4 id="heading-introducing-gitguard">Introducing GitGuard</h4>
<p><a target="_blank" href="https://github.com/segin-GH/gitGuard">GitGuard</a> is an open-source tool that enforces the Conventional Commits specification, ensuring that every commit message is clear, concise, and follows the established standards. Developed by Segin GH, GitGuard is language-agnostic, compatible with Git, and highly customizable, making it an excellent choice for any project. Some key benefits of using GitGuard include:</p>
<ul>
<li><p>Language agnosticism: GitGuard works seamlessly with various programming languages, including Python, Rust, and JavaScript, making it an ideal tool for multi-language projects.</p>
</li>
<li><p>Git integration: GitGuard integrates with Git's commit-msg hook, acting as a sidekick for your commits, ensuring they're always up to par.</p>
</li>
<li><p>Customizable rules: Although not yet available, GitGuard will soon allow you to set your own commit commandments, tailoring it to your project's needs.</p>
</li>
<li><p>Cross-platform compatibility: Tested on Linux, GitGuard is working towards compatibility with Windows and macOS, ensuring it's accessible to all.</p>
</li>
<li><p>Feedback friendly: GitGuard offers gentle nudges instead of hard knocks, helping you correct your commit messages without causing frustration.</p>
</li>
<li><p>Easy setup: With a simple installation process, you can quickly integrate GitGuard into your workflow and start enjoying better commit messages.</p>
</li>
</ul>
<h4 id="heading-how-gitguard-works">How GitGuard Works</h4>
<p>Whenever a developer makes changes and commits them using Git, the commit-msg hook is triggered and GitGuard lints the commit message. If the checks pass, the commit proceeds; otherwise, an error message appears, and the commit is halted.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1705944058480/661944a6-6838-4e50-99c8-7d91deb5b01b.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-how-to-install-gitguard">How to Install GitGuard</h4>
<ol>
<li>Download GitGuard in the root of your repository:</li>
</ol>
<pre><code class="lang-bash">wget https://github.com/segin-GH/gitGuard/raw/main/dist/gitguard.zip
</code></pre>
<ol>
<li>Unzip the downloaded file:</li>
</ol>
<pre><code class="lang-bash">unzip gitguard.zip
</code></pre>
<ol>
<li>Remove the zip file:</li>
</ol>
<pre><code class="lang-bash">rm gitguard.zip
</code></pre>
<ol>
<li>Change to the <code>.gitguard</code> directory:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> .gitguard
</code></pre>
<ol>
<li>Install GitGuard:</li>
</ol>
<pre><code class="lang-bash">./gitguard.py
</code></pre>
<p>With GitGuard installed, you can now enjoy the benefits of having clear and consistent commit messages throughout your codebase. As a bonus, you'll also learn more about the importance of commit messages and how to structure them effectively.</p>
<hr />
<p>Gitguard Repo: <a target="_blank" href="https://github.com/segin-GH/gitGuard">https://github.com/segin-GH/gitGuard</a></p>
<p>Website URL: <a target="_blank" href="https://gitguard.segin.in">https://gitguard.segin.in</a></p>
]]></content:encoded></item><item><title><![CDATA[How to manage dependencies in a monorepo]]></title><description><![CDATA[Embarking on a personal project can be a rollercoaster ride of excitement and challenges, often punctuated by moments of both confusion and enlightenment. In my latest endeavor, I delved into the intricate world of project structures, navigating from...]]></description><link>https://blogs.cmseb.in/dependencies-in-monorepo</link><guid isPermaLink="true">https://blogs.cmseb.in/dependencies-in-monorepo</guid><category><![CDATA[turborepo]]></category><category><![CDATA[monorepo]]></category><category><![CDATA[Developer Tools]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Thu, 11 Jan 2024 15:15:42 GMT</pubDate><content:encoded><![CDATA[<p>Embarking on a personal project can be a rollercoaster ride of excitement and challenges, often punctuated by moments of both confusion and enlightenment. In my latest endeavor, I delved into the intricate world of project structures, navigating from the tangled simplicity of a monolithic single-folder approach to the organized efficiency of workspaces using Turborepo. This journey, filled with its trials and triumphs, has been a profound learning experience, reshaping my understanding of effective project management.</p>
<p>Through this article, I aim to share my transformative journey, highlighting the key decisions, obstacles, and revelations that marked my path from a chaotic single-folder system to the streamlined clarity of Turborepo workspaces. It's a tale of growth, resilience, and the relentless pursuit of efficiency in the ever-evolving landscape of software development.</p>
<p>In the early stages of my project, I decided to add everything into a single folder, a move reminiscent of a monolithic structure. However, this decision came with its own set of challenges. I encountered shared dependencies, such as a logger intended for both the front end and back end. Determined to address this, I created a 'libs' folder, envisioning it as a centralized hub for common elements. It was supposed to work, but it did not. Why, you might ask? The libs folder had its own node_modules folder, and the frontend and backend each had their own node_modules as well. So if I wanted to use something like winston, a popular logging module in JavaScript, I had to install winston in every place I wanted to use it, even though it was supposed to be a dependency of the libs folder.</p>
<p>Recognizing the need for a more nuanced approach, I opted to compartmentalize my project. This led to the creation of separate folders for each service, hoping to bring order to the chaos. Unfortunately, this only escalated the complexity of my project, prompting me to explore the use of Git submodules. In hindsight, the decision to create four distinct folders for four services, tethered by submodules, was not the panacea I had hoped for, as I had to update the submodules manually or build a CI/CD pipeline to update each one.</p>
<p>The realization struck when I acknowledged that even with Git submodules, I would still face the arduous task of rewriting code. Frustration mounting, I sought a more efficient alternative, stumbling upon node workspaces.</p>
<p>Workspaces emerged as the beacon of simplicity in managing dependencies. With its introduction, I could seamlessly organize my project, creating a dedicated folder for my Object-Relational Mapping (ORM). This proved to be a game-changer, as the ORM became a shared resource, effortlessly integrated into multiple subprojects. The ability to employ the ORM across various subprojects eliminated the cumbersome process of rewriting code for each service.</p>
<h4 id="heading-what-are-workspaces">What are workspaces?</h4>
<p>Workspaces allow developers to manage multiple packages within a single repository. They enable developers to organize their codebase into separate packages, each with its own <code>package.json</code> file, while still maintaining centralized control over shared dependencies. In a workspace setup, the root directory of the project contains a <code>package.json</code> file which lists all the workspaces or packages in the project. Each workspace is a separate folder within the root directory, containing its own <code>package.json</code> file that specifies its dependencies and scripts.</p>
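<p>For instance, a root <code>package.json</code> declaring npm or yarn workspaces might look like this (the folder names are made up):</p>
<pre><code class="lang-json">{
  "name": "my-monorepo",
  "private": true,
  "workspaces": [
    "apps/frontend",
    "apps/backend",
    "libs/logger"
  ]
}
</code></pre>
<p>With this in place, the packages are linked locally, so something like the logger can be imported by both apps without publishing it or duplicating node_modules folders.</p>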
<h4 id="heading-how-do-workspaces-simplify-dependency-management">How do workspaces simplify dependency management?</h4>
<ol>
<li><p><strong>Shared dependencies</strong>: Workspaces allow developers to define shared dependencies in the root <code>package.json</code> file. These dependencies are automatically linked to each workspace, ensuring that all packages use the same version of the dependency. This eliminates the need for manual dependency management and reduces the risk of version conflicts.</p>
</li>
<li><p><strong>Local package installation</strong>: When installing a package using npm or yarn, workspaces automatically install the package in the appropriate workspace folder, ensuring that the package is available locally for the project. This reduces the overall disk space usage and improves the performance of the build process.</p>
</li>
<li><p><strong>Automatic linking</strong>: Workspaces automatically link packages within the project, ensuring that each package can access other packages without the need for manual installation or symlinking. This simplifies the development process and reduces the risk of errors due to misconfigured symlinks.</p>
</li>
</ol>
<h4 id="heading-tools-for-managing-workspaces">Tools for managing workspaces</h4>
<p>While workspaces provide a solid foundation for managing dependencies and organizing codebases, there are several tools available that can further simplify the development process:</p>
<ol>
<li><p><strong>Lerna</strong>: Lerna is a popular tool for managing large-scale monorepo projects. It provides features such as automatic package versioning, shared dependencies, and script execution across multiple packages.</p>
</li>
<li><p><strong>Yarn Workspaces</strong>: Yarn Workspaces is an extension of the Yarn package manager that provides additional features for managing workspaces, such as automatic linking and local package installation.</p>
</li>
<li><p><strong>Turborepo</strong>: Turborepo is a high-performance monorepo tool that provides features such as caching, parallel execution, and remote caching. It is designed to handle large-scale monorepo projects with thousands of packages and dependencies.</p>
</li>
</ol>
<p>I used Turborepo as it was fast and easy to set up. The official documentation of <a target="_blank" href="https://turbo.build/repo">turborepo</a> was easy to understand, and setting up a Turborepo was as simple as running the command <code>npx create-turbo@latest</code> on the CLI. It supports many flavors of JavaScript/TypeScript projects, such as Next.js, Remix, React Native, and so on.</p>
<p>In hindsight, my initial confusion between monorepo and monolithic architecture served as a catalyst for exploring more effective project structures. The journey from a monolithic single folder to a workspace-driven project was marked by trial and error, but the lessons learned were invaluable.</p>
<p>Workspaces not only addressed my immediate concerns but also paved the way for a more scalable and maintainable project structure. The ability to manage dependencies seamlessly not only simplified the development process but also enhanced the overall efficiency of my project.</p>
<p>As developers, embracing new tools and methodologies is crucial for staying ahead in the ever-evolving landscape of software development. My experience with workspaces serves as a testament to the importance of adaptability and the constant pursuit of more efficient solutions.</p>
<p>In retrospect, my journey from the monolithic confusion of a single folder to the organized world of workspaces has been an enlightening one. It taught me the importance of choosing the right tools and structures for efficient project management. Workspaces, particularly those managed with tools like Turborepo, not only resolved my immediate challenges but also set a foundation for a scalable and maintainable approach to handling complex projects. This experience has underscored the value of adaptability and the continuous search for better solutions in the dynamic field of software development. As I share this journey, I hope it serves as an inspiration for others to explore, learn, and adapt, finding their path to streamlined project management and enhanced productivity.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring Git's Magic: How Merkle Trees Power Version]]></title><description><![CDATA[If you've ever scratched your head wondering how Git, the go-to tool for version control, manages to keep each commit as unique as a snowflake, you're in for a treat. Today, we're pulling back the curtain to reveal the wizardry behind Git - and it's ...]]></description><link>https://blogs.cmseb.in/exploring-git-merkle-trees</link><guid isPermaLink="true">https://blogs.cmseb.in/exploring-git-merkle-trees</guid><category><![CDATA[Git]]></category><category><![CDATA[Cryptography]]></category><category><![CDATA[Hashing]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Tue, 26 Dec 2023 12:23:27 GMT</pubDate><content:encoded><![CDATA[<p>If you've ever scratched your head wondering how Git, the go-to tool for version control, manages to keep each commit as unique as a snowflake, you're in for a treat. Today, we're pulling back the curtain to reveal the wizardry behind Git - and it's all about Merkle Trees and the nifty process of hashing.</p>
<h3 id="heading-hashing">Hashing</h3>
<p>First off, what exactly is a hash? Imagine taking a bunch of data and squishing it into a fixed-length string of characters - that's hashing in a nutshell. It's like a culinary recipe that always gives you the same cake, no matter how many times you bake it, as long as you follow the recipe to the end. And the beauty of it? It's a one-way street – you can't un-bake the cake to get the original ingredients. Popular hashing algorithms include MD5, SHA, RIPEMD-160 and so on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703094314181/f7332b71-4b46-431a-b096-6d45b3811c90.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-merkle-tree">Merkle Tree</h3>
<p>The <strong>Merkle tree</strong> or <strong>hash tree</strong> was first conceptualized by Ralph Merkle, an American computer scientist and mathematician, in 1979. A Merkle tree is a tree of hashes built from the bottom up, where the leaf nodes (child nodes with no further children) are hashes of data blocks. The parent of the leaf nodes is the hash of the concatenation (joining strings end-to-end) of the hashes of its child nodes. This step is repeated until we are left with a single node, the root node.</p>
<p>In the first illustration, the root node represents the initial state of the folder, with the hash of each file in the folder used to build the tree. In the second illustration, we can see that the root node has changed, indicating that at least one file in the folder has been modified. This change in the root node causes the entire tree to change, as the parent nodes must also be updated to reflect the new hashes of their child nodes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703094027348/0b9ee07a-1991-4b48-a0b6-3b8e32a276c8.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703094059553/3e47b538-6524-45dc-a70b-f2afa51ac689.png" alt class="image--center mx-auto" /></p>
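<p>To make the construction concrete, here is a minimal TypeScript sketch using Node's built-in crypto module. It illustrates only the bottom-up hashing; Git's real object model (blobs, trees, commits) is richer than this:</p>
<pre><code class="lang-ts">import { createHash } from "node:crypto";

function sha1(data: string): string {
  return createHash("sha1").update(data).digest("hex");
}

// Build a Merkle root bottom-up: hash the leaves, then repeatedly hash the
// concatenation of sibling hashes until a single root remains.
function merkleRoot(blocks: string[]): string {
  let level = blocks.map(sha1);
  while (level.length &gt; 1) {
    const next: string[] = [];
    for (let i = 0; i &lt; level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // duplicate the last hash if odd
      next.push(sha1(left + right));
    }
    level = next;
  }
  return level[0];
}

console.log(merkleRoot(["file-a", "file-b", "file-c"]));
// Change any one block and the root changes too:
console.log(merkleRoot(["file-a", "file-b", "file-C"]));
</code></pre>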
<p>Git uses Merkle trees to efficiently track changes to files in a repository. Each file in Git is represented as a leaf node in a Merkle tree, with the SHA-1 hash of the file's contents used as the node's value. The hash is computed based on the file's contents, as well as the file's name and any relevant metadata, such as the file permissions.</p>
<p>When a file is modified, even slightly, the SHA-1 hash of the file will change, which in turn will cause the hashes of the parent nodes in the Merkle tree to change. This allows Git to quickly and efficiently identify which files have been modified, added, or deleted in a repository.</p>
<p>In addition to tracking changes to individual files, Git also uses Merkle trees to track changes to entire directories and even the entire repository. This is done by creating a separate Merkle tree for each directory and subdirectory in the repository, with the root node of each tree representing the contents of that directory. The root nodes of these trees are then combined into a higher-level Merkle tree, which represents the entire repository.</p>
<p>By using Merkle trees in this way, Git is able to efficiently track changes to large repositories with thousands or even millions of files. When a developer makes a change to a file, Git can quickly and efficiently identify the changes and update the Merkle tree accordingly.</p>
<hr />
<h4 id="heading-for-extended-reading">For extended reading:</h4>
<ul>
<li><p><a target="_blank" href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects">https://git-scm.com/book/en/v2/Git-Internals-Git-Objects</a></p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/Merkle_tree">https://en.wikipedia.org/wiki/Merkle_tree</a></p>
</li>
<li><p><a target="_blank" href="https://www.okta.com/identity-101/hashing-algorithms/">https://www.okta.com/identity-101/hashing-algorithms/</a></p>
</li>
</ul>
<hr />
<h3 id="heading-connect-with-me">Connect with Me</h3>
<ul>
<li><p>Email: <a target="_blank" href="mailto:sebinsebzz2002@gmail.com">sebinsebzz2002@gmail.com</a></p>
</li>
<li><p>GitHub: <a target="_blank" href="http://github.com/sebzz2k2">github.com/sebzz2k2</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[HTTPS is not secure enough?]]></title><description><![CDATA[In a recent intriguing case in Kerala, India, a leading news channel reported a unique instance where law enforcement successfully identified a criminal through IP address tracking. The accused, involved in a child abduction case, had shown a Tom and...]]></description><link>https://blogs.cmseb.in/https-is-not-secure-enough</link><guid isPermaLink="true">https://blogs.cmseb.in/https-is-not-secure-enough</guid><category><![CDATA[https]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sat, 09 Dec 2023 09:32:32 GMT</pubDate><content:encoded><![CDATA[<p>In a recent intriguing case in Kerala, India, a leading news channel reported a unique instance where law enforcement successfully identified a criminal through IP address tracking. The accused, involved in a child abduction case, had shown a Tom and Jerry cartoon to the kidnapped child. Upon the child's rescue, authorities used the child's recollection of the cartoon to trace the specific video link. Although Google initially declined to cooperate, the cyber police were able to obtain the necessary details from the Internet Service Provider (ISP). This case highlights the critical role of digital footprints and the importance of understanding internet protocols like HTTPS in ensuring online security and aiding law enforcement.</p>
<h2 id="heading-the-significance-of-https">The Significance of HTTPS</h2>
<p>This situation raises important questions about privacy and data security, especially in the context of HTTPS (Hypertext Transfer Protocol Secure). HTTPS is the secure version of HTTP, which is the primary protocol used to send data between a web browser and a website. HTTPS is encrypted to increase the security of data transfer. This encryption makes it more challenging for unauthorized parties to intercept any data being transferred, including search terms and other sensitive information.</p>
<p>When a user searches for something on YouTube or any other website using HTTPS, the specific search terms they use are encrypted. This means that while any intermediary – such as an Internet Service Provider (ISP) or a potential attacker – can see that a connection is being made to YouTube, they cannot see the specific content of the search. They can only see the domain name (like <a target="_blank" href="http://youtube.com">youtube.com</a>), which typically leaks through the DNS lookup and the TLS Server Name Indication (SNI) field, not the specific path or query, as the breakdown below illustrates.</p>
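<p>Roughly, for a search URL like the one in this case, an on-path observer's view splits as follows:</p>
<pre><code class="lang-plaintext">https://www.youtube.com/results?search_query=tom+and+jerry
|----- visible -------||------------- encrypted ---------|
 (host: DNS + TLS SNI)  (path and query inside the TLS tunnel)
</code></pre>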
<h2 id="heading-the-technical-challenge">The Technical Challenge</h2>
<p>In light of this, the claim that the police were able to obtain the exact search terms from the ISP seems technically questionable. Major companies like Google, which owns YouTube, have robust encryption protocols to protect user data, including search queries. Therefore, without direct access to YouTube's server logs or without cooperation from the company itself, it would be extremely difficult, if not impossible, for an external party to determine the exact search terms used by an individual.</p>
<h2 id="heading-privacy-data-security-and-law-enforcement">Privacy, Data Security, and Law Enforcement</h2>
<p>This case, therefore, brings to the forefront the ongoing debate around digital privacy, data security, and the capabilities of law enforcement in the digital age. It underscores the need for a clear understanding of how data encryption works and the legal frameworks governing access to digital information. While law enforcement agencies need certain capabilities to pursue criminal investigations, there is also a paramount need to safeguard individual privacy rights and data security in an increasingly digital world.</p>
<h2 id="heading-the-strengths-of-https">The Strengths of HTTPS</h2>
<p>It's crucial to understand that ISPs, in a typical scenario, cannot view the contents of HTTPS-encrypted requests, such as the body of a request which might contain sensitive information like credit card details. This is due to several key features of HTTPS:</p>
<ol>
<li><p><strong>Encryption for Confidentiality</strong>: HTTPS encrypts data during transmission, converting readable data into an undecipherable format using cryptographic keys. This process ensures that even if data is intercepted while it travels across the internet, it remains unreadable to anyone who does not have the corresponding decryption key. This encryption is what keeps sensitive information like credit card numbers safe when you shop online.</p>
</li>
<li><p><strong>Data Integrity Protection</strong>: Another critical aspect of HTTPS is its ability to protect the integrity of data. This means that the data sent or received is not altered, deleted, or tampered with during transmission. Data integrity checks are vital because they ensure that the information you send and receive is exactly as intended, without any unauthorized modifications.</p>
</li>
<li><p><strong>Authentication of Communication Parties</strong>: HTTPS also plays a crucial role in authenticating the legitimacy of websites. For instance, when you visit a website like Amazon, HTTPS helps verify that you are indeed on the correct website and not a fraudulent one designed to look similar. This is done through SSL/TLS certificates issued by trusted certificate authorities. These certificates serve as a stamp of approval, confirming that the website is legitimate and that your communication with it is secure. The short sketch after this list shows this check in action.</p>
</li>
</ol>
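<p>To make the third point concrete, here is a minimal Node.js sketch (an illustration, not production code) that opens a TLS connection and inspects the certificate the server presents:</p>
<pre><code class="lang-javascript">// Open a TLS connection and look at the certificate chain: this is
// the same authentication step a browser performs behind the padlock.
const tls = require('tls');

const socket = tls.connect(
  { host: 'www.youtube.com', port: 443, servername: 'www.youtube.com' },
  () =&gt; {
    const cert = socket.getPeerCertificate();
    console.log('Issued to  :', cert.subject.CN);   // who the certificate identifies
    console.log('Issued by  :', cert.issuer.CN);    // the certificate authority
    console.log('Valid until:', cert.valid_to);
    console.log('Trusted    :', socket.authorized); // true only if a trusted CA signed it
    socket.end();
  }
);
</code></pre>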
<h2 id="heading-accessing-encrypted-data">Accessing Encrypted Data</h2>
<p>The scenarios in which a government entity or an ISP can access the contents of an individual's HTTPS-encrypted requests are indeed limited. They typically involve either obtaining server logs directly from the service provider or having access to the encryption keys. In the context of the case in Kerala, where police claimed to have identified a suspect based on their YouTube search queries, the methods for obtaining such specific data are constrained by these technical boundaries.</p>
<p>Given the robust encryption provided by HTTPS, the most plausible ways for law enforcement to access an individual's search history would be through direct access to the user's device, such as examining the YouTube history on the accused's phone. Another method, though less likely here given that the report says Google declined to cooperate, would be obtaining server logs from the service provider. However, this would typically require legal procedures and the willingness of the provider to comply.</p>
<h2 id="heading-balancing-privacy-and-law-enforcement">Balancing Privacy and Law Enforcement</h2>
<p>In conclusion, this case underscores the complexities and challenges that arise in the intersection of technology, privacy, and law enforcement. While the capabilities of HTTPS in protecting user data are formidable, they also present hurdles for legal investigations. This situation highlights the ongoing need for a balanced approach that respects individual privacy and data security while providing law enforcement with the tools necessary for effective and lawful investigations. As technology continues to evolve, so too must our understanding and frameworks for managing these critical and often competing priorities.</p>
<hr />
<h3 id="heading-below-are-the-links">For extended reading:</h3>
<ul>
<li><p><a target="_blank" href="https://www.onmanorama.com/news/kerala/2023/12/02/kollam-child-abduction-tom-and-jerry-cartoon-helped-police-identify-suspects.html">https://www.onmanorama.com/news/kerala/2023/12/02/kollam-child-abduction-tom-and-jerry-cartoon-helped-police-identify-suspects.html</a></p>
</li>
<li><p><a target="_blank" href="https://www.thesslstore.com/blog/is-https-secure-a-look-at-how-secure-https-is/">https://www.thesslstore.com/blog/is-https-secure-a-look-at-how-secure-https-is/</a></p>
</li>
<li><p><a target="_blank" href="https://security.stackexchange.com/questions/233795/are-url-parameters-of-get-and-post-requests-over-https-secure">https://security.stackexchange.com/questions/233795/are-url-parameters-of-get-and-post-requests-over-https-secure</a></p>
</li>
</ul>
<hr />
<h3 id="heading-connect-with-me">Connect with Me</h3>
<ul>
<li><p>Email: <a target="_blank" href="mailto:sebinsebzz2002@gmail.com">sebinsebzz2002@gmail.com</a></p>
</li>
<li><p>GitHub: <a target="_blank" href="http://github.com/sebzz2k2">github.com/sebzz2k2</a></p>
</li>
<li><p>Feel free to leave a comment below.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[How to add multiple ssh keys for a single github account]]></title><description><![CDATA[Hey there, fellow code wranglers! Today, I'm going to spill the beans on a little trick that saved my bacon during my early developer internships. Picture this: you're happily using your personal GitHub account with SSH, and boom! Suddenly, you need ...]]></description><link>https://blogs.cmseb.in/multiple-ssh-for-github</link><guid isPermaLink="true">https://blogs.cmseb.in/multiple-ssh-for-github</guid><category><![CDATA[Devops]]></category><category><![CDATA[ssh-keys]]></category><category><![CDATA[ssh]]></category><category><![CDATA[SSH Git]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sat, 02 Dec 2023 17:59:06 GMT</pubDate><content:encoded><![CDATA[<p>Hey there, fellow code wranglers! Today, I'm going to spill the beans on a little trick that saved my bacon during my early developer internships. Picture this: you're happily using your personal GitHub account with SSH, and boom! Suddenly, you need to juggle multiple SSH keys. Sound familiar?</p>
<h4 id="heading-step-by-step-guide-to-adding-your-first-ssh-key-yes-its-a-breeze">Step-by-Step Guide to Adding Your First SSH Key (Yes, it's a breeze!)</h4>
<ol>
<li><p><strong>Fire up that Terminal:</strong></p>
</li>
<li><p><strong>Generate your SSH key:</strong> Replace <code>&lt;your_email_id&gt;</code> with your GitHub email and <code>id_rsa_name</code> with the file name you want for the key, then run this nifty little command:</p>
<pre><code class="lang-bash"> ssh-keygen -t rsa -b 4096 -C “&lt;your_email_id&gt;” -f ~/.ssh/id_rsa_name
</code></pre>
</li>
<li><p><strong>Start your ssh-agent:</strong> This might vary based on your setup, but generally, you can kick it off with:</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">eval</span> <span class="hljs-string">"<span class="hljs-subst">$(ssh-agent -s)</span>"</span>
</code></pre>
<p> (Pro tip: you only need to do this once per terminal session.)</p>
</li>
<li><p><strong>Add that SSH key to your ssh-agent:</strong> Replace <code>id_rsa_name</code> if yours is different. Here's the command:</p>
<pre><code class="lang-bash"> ssh-add ~/.ssh/id_rsa_name
</code></pre>
</li>
<li><p><strong>Config time:</strong> Append an entry to your SSH config file, replacing <code>name</code> in the <code>Host</code> line and the file name in <code>IdentityFile</code> accordingly:</p>
<pre><code class="lang-bash"> cat &gt;&gt; ~/.ssh/config &lt;&lt; SSHConfigEoF
 Host name.github.com
   HostName github.com
   PreferredAuthentications publickey
   IdentityFile ~/.ssh/id_rsa_name
SSHConfigEoF
</code></pre>
</li>
<li><p><strong>Hook it up with GitHub:</strong> Add your SSH key to your GitHub account. Check out their guide if you need a hand. <a target="_blank" href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account">adding ssh to Github</a></p>
</li>
<li><p><strong>Rinse and Repeat for More Keys:</strong> When cloning repos, use a slight twist in the command: instead of <code>git clone git@github.com:&lt;user&gt;/&lt;repo_name&gt;</code>, run <code>git clone git@name.github.com:&lt;user&gt;/&lt;repo_name&gt;</code>, where <code>name</code> is the <code>Host</code> alias you set in step 5.</p>
</li>
</ol>
<h4 id="heading-quick-security-check-permissions-matter">Quick Security Check: Permissions Matter!</h4>
<ul>
<li><p><strong>Config file (</strong><code>rw-r--r--</code>): You can read and write; others can just peek.</p>
</li>
<li><p><strong>Private key (</strong><code>id_rsa</code>, <code>rw-------</code>): Eyes only for you – full access, but keep it under wraps!</p>
</li>
<li><p><strong>Public key (</strong><code>id_rsa.pub</code>, <code>rw-r--r--</code>): You can read and write; others can just peek. (The commands after this list set all three.)</p>
</li>
</ul>
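<p>If your files were created with looser permissions, a couple of <code>chmod</code> commands will bring them in line (using the file names from the steps above):</p>
<pre><code class="lang-bash">chmod 644 ~/.ssh/config           # rw-r--r-- : readable config
chmod 600 ~/.ssh/id_rsa_name      # rw------- : private key, owner only
chmod 644 ~/.ssh/id_rsa_name.pub  # rw-r--r-- : public key is safe to share
</code></pre>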
<hr />
<p>There you have it, folks – a quick trip through the world of multiple SSH keys. Whether you're a seasoned dev or just dipping your toes in the tech pool, remember: there's always a way around those pesky tech hurdles.</p>
<p><a target="_blank" href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux">generating a new SSH key (GitHub docs)</a></p>
<hr />
<h3 id="heading-connect-with-me">Connect with Me</h3>
<ul>
<li><p>Email: <a target="_blank" href="mailto:sebinsebzz2002@gmail.com">sebinsebzz2002@gmail.com</a></p>
</li>
<li><p>GitHub: <a target="_blank" href="http://github.com/sebzz2k2">github.com/sebzz2k2</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[API optimization using SQL Pagination]]></title><description><![CDATA[Pagination, the process of dividing a large set of query results into manageable chunks or pages, is a crucial technique in web development and database management. Particularly useful when dealing with large datasets, pagination ensures that loading...]]></description><link>https://blogs.cmseb.in/api-optimization-using-sql-pagination</link><guid isPermaLink="true">https://blogs.cmseb.in/api-optimization-using-sql-pagination</guid><category><![CDATA[SQL]]></category><category><![CDATA[psql]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[DBMS]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Sat, 18 Nov 2023 11:30:45 GMT</pubDate><content:encoded><![CDATA[<p>Pagination, the process of dividing a large set of query results into manageable chunks or pages, is a crucial technique in web development and database management. Particularly useful when dealing with large datasets, pagination avoids loading all of the data at once, which is often impractical.</p>
<h3 id="heading-uses-of-pagination">Uses of Pagination</h3>
<p><strong>1. Web and Mobile Applications</strong></p>
<p>Web and mobile applications frequently utilize pagination to display content in an organized manner. This includes search results, product listings on e-commerce sites, social media feeds, and blog posts.</p>
<p><strong>2. Content Management Systems (CMS)</strong></p>
<p>CMS platforms leverage pagination to manage articles, posts, and media effectively. It aids in structuring content, thereby enhancing user accessibility and navigation.</p>
<p><strong>3. Forums and Comment Sections</strong></p>
<p>In forums and comment sections, pagination plays a pivotal role by displaying the most relevant comments first, facilitating efficient loading and improved user experience.</p>
<p><strong>4. E-mail Clients and Messaging Applications</strong></p>
<p>E-mail clients use pagination to prioritize recent emails, thereby improving user engagement and experience.</p>
<p><strong>5. APIs and Data Feeds</strong></p>
<p>Many web APIs implement pagination in their responses to optimize server load and enhance data retrieval performance.</p>
<hr />
<h3 id="heading-pagination-methods"><strong>Pagination Methods</strong></h3>
<p><strong>1. Limit and Offset Method</strong></p>
<p>This method involves SQL queries that use <code>LIMIT</code> and <code>OFFSET</code> clauses to control the number of rows returned.</p>
<p><strong>Example Query:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> employees <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> employee_id <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span> <span class="hljs-keyword">OFFSET</span> <span class="hljs-number">0</span>;
</code></pre>
<p>This query fetches the first 10 records from the employees table. For the next 10 records, change the <code>OFFSET</code> to 10. In general, <code>OFFSET = (page - 1) * page_size</code>, as the example below shows.</p>
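<p>For instance, page 3 at 10 rows per page translates to:</p>
<pre><code class="lang-sql">-- OFFSET = (3 - 1) * 10 = 20: skip the first two pages
SELECT * FROM employees ORDER BY employee_id LIMIT 10 OFFSET 20;
</code></pre>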
<p><strong>Advantages</strong></p>
<ul>
<li><p>Simple to implement and understand.</p>
</li>
<li><p>Allows direct page navigation.</p>
</li>
<li><p>Predictable and consistent pagination behaviour.</p>
</li>
</ul>
<p><strong>Disadvantages</strong></p>
<ul>
<li><p>Performance issues with large datasets, since the database still scans past all skipped rows.</p>
</li>
<li><p>Inconsistencies when rows are inserted or deleted between page requests.</p>
</li>
<li><p>Limited suitability for datasets with frequently changing order.</p>
</li>
</ul>
<hr />
<p><strong>2. Keyset Pointer (Cursor-Based Pagination)</strong></p>
<p>This method uses a unique key as a pointer to navigate through datasets, ideal for large datasets and real-time data.</p>
<p><strong>Example Query:</strong></p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> posts <span class="hljs-keyword">WHERE</span> created_at &gt; <span class="hljs-string">'2023-01-01T00:00:00'</span> <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> created_at <span class="hljs-keyword">ASC</span> <span class="hljs-keyword">LIMIT</span> <span class="hljs-number">10</span>;
</code></pre>
<p>Fetches the first 10 posts created after January 1, 2023. The <code>created_at</code> value of the last post serves as the new pointer for subsequent queries.</p>
<p><strong>Advantages</strong></p>
<ul>
<li><p>More efficient for large datasets.</p>
</li>
<li><p>Resilient to data modifications.</p>
</li>
<li><p>Consistent performance with real-time data.</p>
</li>
</ul>
<p><strong>Disadvantages</strong></p>
<ul>
<li><p>More complex navigation (no direct jump to page N).</p>
</li>
<li><p>Relies on a unique, sequential column; see the tie-breaker sketch below.</p>
</li>
<li><p>Higher initial learning curve.</p>
</li>
</ul>
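<p>One caveat worth spelling out: a timestamp alone is rarely unique, so two posts created in the same instant can be skipped or duplicated across pages. A common fix, sketched below against the same hypothetical <code>posts</code> table using PostgreSQL's row-value comparison, is to add a unique column such as <code>id</code> as a tie-breaker. The <code>(created_at, id)</code> pair of the last row on a page becomes the cursor for the next one:</p>
<pre><code class="lang-sql">-- '2023-01-01T00:00:00' and 41 are the created_at and id of the last
-- row already seen (illustrative values)
SELECT * FROM posts
WHERE (created_at, id) &gt; ('2023-01-01T00:00:00', 41)
ORDER BY created_at ASC, id ASC
LIMIT 10;
</code></pre>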
<hr />
<h3 id="heading-connect-with-me">Connect with Me</h3>
<ul>
<li><p>Email: <a target="_blank" href="mailto:sebinsebzz2002@gmail.com">sebinsebzz2002@gmail.com</a></p>
</li>
<li><p>GitHub: <a target="_blank" href="http://github.com/sebzz2k2">github.com/sebzz2k2</a></p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[N+1 Query Minimization]]></title><description><![CDATA[In the ever-evolving world of technology, speed and efficiency are key pillars in providing a seamless user experience. When it comes to processing time-series data through an Express endpoint, the initial setup can be the difference between lightnin...]]></description><link>https://blogs.cmseb.in/api-optimization-using-sql-n1-optimization</link><guid isPermaLink="true">https://blogs.cmseb.in/api-optimization-using-sql-n1-optimization</guid><category><![CDATA[SQL]]></category><category><![CDATA[MySQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[psql]]></category><dc:creator><![CDATA[Sebin]]></dc:creator><pubDate>Tue, 31 Oct 2023 19:59:59 GMT</pubDate><content:encoded><![CDATA[<p>In the ever-evolving world of technology, speed and efficiency are key pillars in providing a seamless user experience. When it comes to processing time-series data through an Express endpoint, the initial setup can be the difference between lightning-fast results and frustrating delays. In this blog post, we embark on a journey to uncover the challenges faced by an Express endpoint tasked with handling vast amounts of time-series data, and how we transformed it into a high-performance system.</p>
<p><strong>The Original Approach:</strong></p>
<p>Our story begins with an Express endpoint tasked with querying extensive time-series data spanning over seven days. The endpoint needed to serve multiple users simultaneously, but there was a problem – the processing time was far from optimal, often taking several seconds to complete. To understand the root of the issue, let's delve into the original approach.</p>
<p><strong>Step 1: Querying Tenant-Specific Customer Data</strong></p>
<p>The first step involved querying the database to obtain all customers associated with a particular tenant. The query looked like this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> customer <span class="hljs-keyword">WHERE</span> tenant_id=<span class="hljs-string">'some-uuid'</span>
</code></pre>
<p>The response provided an array of customer records, with names such as ["Cust1", "Cust2", ..., "Cust n"].</p>
<p><strong>Step 2: Iterating Through Customers</strong></p>
<p>Next, a for loop was used to iterate through each customer in the array. For each customer, another query was executed to retrieve devices related to that customer:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> device <span class="hljs-keyword">WHERE</span> customer=<span class="hljs-string">'custId'</span>
</code></pre>
<p>The response yielded an array of device records, with names such as ["device1", "device2", ..., "device n"].</p>
<p><strong>Step 3: Iterating Through Devices</strong></p>
<p>Continuing down the rabbit hole, another for loop was employed to iterate through each device in the array. For each device, yet another query was executed to retrieve data for that device:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> device_attribute <span class="hljs-keyword">WHERE</span> deviceId=<span class="hljs-string">'deviceId'</span>
</code></pre>
<p>Finally, the code would look something like this:</p>
<pre><code class="lang-js"><span class="hljs-keyword">const</span> customerQuery = <span class="hljs-string">`SELECT * FROM customer WHERE tenant_id='<span class="hljs-subst">${tenantId}</span>'`</span>;
<span class="hljs-keyword">const</span> customers = <span class="hljs-keyword">await</span> database.query(customerQuery);
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> customer <span class="hljs-keyword">of</span> customers) {
  <span class="hljs-keyword">const</span> customerId = customer.id;
  <span class="hljs-keyword">const</span> deviceQuery = <span class="hljs-string">`SELECT * FROM device WHERE customer='<span class="hljs-subst">${customerId}</span>'`</span>;
  <span class="hljs-keyword">const</span> devices = <span class="hljs-keyword">await</span> database.query(deviceQuery);
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> device <span class="hljs-keyword">of</span> devices) {
    <span class="hljs-keyword">const</span> deviceId = device.id;
    <span class="hljs-keyword">const</span> dataQuery = <span class="hljs-string">`SELECT * FROM device_attribute WHERE deviceId='<span class="hljs-subst">${deviceId}</span>'`</span>;
    <span class="hljs-keyword">const</span> deviceAttributes = <span class="hljs-keyword">await</span> database.query(dataQuery);
  }
}
</code></pre>
<p><strong>Challenges:</strong></p>
<p>This approach presented several challenges that hindered performance optimization:</p>
<ol>
<li><p><strong>Multiple Database Queries</strong>: The approach required multiple database queries, causing a significant overhead in terms of database interactions. For C customers with D devices each, that is 1 + C + C*D round-trips.</p>
</li>
<li><p><strong>Nested Loops</strong>: The nested loop structure compounded the problem, as it added complexity and slowed down the processing.</p>
</li>
</ol>
<p><strong>Optimizing the Endpoint:</strong></p>
<p>To enhance the Express endpoint's performance, a more efficient strategy was employed, which involved the following key changes.</p>
<p><strong>Change 1: Use the IN Clause</strong></p>
<p>In the optimized code snippet, we first retrieve the customer IDs and use the <code>IN</code> clause to query devices for those customers. This approach minimizes the number of queries executed, improving efficiency and potentially reducing database load. (The per-device attribute lookup still loops here; Change 2 removes that as well.) Here's how it looks:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> customerQuery = <span class="hljs-string">`SELECT * FROM customer WHERE tenant_id='<span class="hljs-subst">${tenantId}</span>'`</span>;
<span class="hljs-keyword">const</span> customers = <span class="hljs-keyword">await</span> database.query(customerQuery);
<span class="hljs-keyword">const</span> customerIds = customers.map(<span class="hljs-function"><span class="hljs-params">customer</span> =&gt;</span> customer.id);
<span class="hljs-comment">// Quote each id so string/UUID keys remain valid SQL</span>
<span class="hljs-keyword">const</span> idList = customerIds.map(<span class="hljs-function"><span class="hljs-params">id</span> =&gt;</span> <span class="hljs-string">`'<span class="hljs-subst">${id}</span>'`</span>).join(<span class="hljs-string">', '</span>);
<span class="hljs-keyword">const</span> deviceQuery = <span class="hljs-string">`SELECT * FROM device WHERE customer IN (<span class="hljs-subst">${idList}</span>)`</span>;
<span class="hljs-keyword">const</span> devices = <span class="hljs-keyword">await</span> database.query(deviceQuery);
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> device <span class="hljs-keyword">of</span> devices) {
  <span class="hljs-keyword">const</span> deviceId = device.id;
  <span class="hljs-keyword">const</span> dataQuery = <span class="hljs-string">`SELECT * FROM device_attribute WHERE deviceId='<span class="hljs-subst">${deviceId}</span>'`</span>;
  <span class="hljs-keyword">const</span> deviceAttributes = <span class="hljs-keyword">await</span> database.query(dataQuery);
}
</code></pre>
<p>Using the <code>IN</code> clause can be faster because:</p>
<ul>
<li><p>It reduces the number of round-trips between your application and the database, which can be a significant performance improvement, especially when dealing with a large number of IDs.</p>
</li>
<li><p>Databases are optimized to efficiently process <code>IN</code> clauses, which allows them to perform the operation in a more optimized manner.</p>
</li>
<li><p>It simplifies your code by allowing you to consolidate the queries into one, making it easier to manage and read.</p>
</li>
</ul>
<p><strong>Change 2: Using JOINs</strong></p>
<p>By using SQL JOINs, you can retrieve all the necessary data in a single query, which can significantly reduce the number of database interactions and potentially improve query performance. This approach is generally more efficient and optimized for database operations. Here's an example:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> query = <span class="hljs-string">`
  SELECT c.*, d.*, da.*
  FROM customer AS c
  JOIN device AS d ON c.id = d.customer
  JOIN device_attribute AS da ON d.id = da.deviceId
  WHERE c.tenant_id = '<span class="hljs-subst">${tenantId}</span>'
`</span>;
<span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> database.query(query);
</code></pre>
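<p>One hygiene note: the snippets above interpolate values straight into the SQL string for brevity. In real code, prefer parameterized queries; the sketch below assumes a driver with node-postgres-style <code>$1</code> placeholders:</p>
<pre><code class="lang-javascript">// Parameterized version: the driver escapes tenantId for us, closing
// the injection hole that template-literal interpolation leaves open.
const query = `
  SELECT c.*, d.*, da.*
  FROM customer AS c
  JOIN device AS d ON c.id = d.customer
  JOIN device_attribute AS da ON d.id = da.deviceId
  WHERE c.tenant_id = $1
`;
const result = await database.query(query, [tenantId]);
</code></pre>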
<p><strong>Conclusion:</strong></p>
<p>By streamlining the data retrieval process, reducing database queries, and optimizing the overall workflow, the Express endpoint was transformed into a much faster and more responsive system. This optimization not only improved the user experience but also reduced the strain on the database, leading to better overall system performance.</p>
]]></content:encoded></item></channel></rss>