Skip to main content

CD-Stream:CDC Replicator Tool & Cons on ETL pipelines


Just another day at the work place;

5 minutes post the boot:

You hear everyone complain that the production database is slow. You quickly start to investigate; exploring all possible outcomes on the dashboards.. 

Could it have been the long-running slow query which you had raised a ticket for the production support to fix?.. Or Is it one of the queries run based on an un-indexed column?



6th Minute and 15 minutes down the lane:

Next you hear the fellow data-analysts lament over their failed reports. 

You now realize that your CPU had taken a humongous amount of query load and you understand that your relational database system has gone for a toss into an eternal slumber.

And all of this due to a slow running query of your ETL pipeline..!! Ding. Ding.. Ding...!! We have a winner!!!

Alright, let's phrase it this way. 

Probably you did/used one of the following:

- SELECT * from production_database.table where updated_at between x and y;
- Airflow pipelines
- Bulk exports and Dumps once in every few minutes
- Long running and forgotten Zombie Crontabs


 Let's put it this way.. DB Size <500 GB; it's OK to do selects but if it exceeds >500GB, unless you have cuts in the budget, do not ever do a bulk select and transfer it over the wire to the destination database in the form of a pipeline.

Enter CDC:

CDC A.K.A Change Data Capture is there to assist this data wrangling exercise and if your data is rapidly growing and if your BI/BA need an access to that sweet-sweet DWH (data warehouse); THIS right here, is the way to go.

You meant the replication? - I don't wanna get stuck with my OLTP DB Engine:

Well, you are not alone. For many reasons, data-wranglers generally don't prefer a similar database engine as that of the OLTP. The main reasons include query performance, need for triggers and ease of re-running transformation jobs. But, doing so, requires setting up of airflow clusters and setting up connectors to transform and load.


Wouldn't it be great if you could do all this without doing a bulk select from the production database? Of course, you can..

CD-Stream is a cross-database CDC driven replicator tool that currently supports replication between MySQL and Postgres. 

The tool runs queues to process the information occurring in the binary logs of the source database and replicates it across to a destination database of entirely different engine.

https://github.com/datawrangl3r/cd-stream 
Post the setup, as given in the project page: CD-Stream; there's a directory called 'sample' in the project which contains some of the intensive DDL and Data Insertion scripts, for you to evaluate and exercise.

And that, People; is how you wrangle production data!!!

Comments

Popular posts from this blog

ES Index - S3 Snapshot & Restoration:

The question is.. What brings you here? Fed up with all the searches on how to back-up and restore specific indices? 

Fear not, for your search quest ends here.!

After going through a dozens of tiny gists and manual pages, here it is.. We've done all the heavy-lifting for you.



The following tutorial was tested on elasticsearch V5.4.0

And before we proceed, remember:

Do's:

Make sure that the elasticsearch version of the backed-up cluster/node <= Restoring Cluster's version.

Dont's:

Unless it's highly necessary;

curl -XDELETE 'http://localhost:9200/nameOfTheIndex

      - deletes a specific index

Especially not, when you are drunk!:

curl -XDELETE 'http://localhost:9200/_all

      - deletes all indexes (This is where the drunk part comes in..!!)



Step1:Install S3 plugin Support:        sudo bin/elasticsearch-plugin install repository-s3
                                  (or)
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-s3

Depends on w…

Flyway - Database Migrations made easy & How not to accidentally Roleback all of your migrations

Flyway - by boxfuse: Is a schema migration tool and it acts more of like a version control for your relational databases.

If you are manually executing your sql scripts or if your administrator is manually executing the sql scripts, on your production or UAT environment, you definitely need this tool to be setup in all of your environments.

Before we proceed:

Statutory Warning: 

Never ever execute the following command, be it your production or UAT environment:

$ flyway clean   # Do not execute this, ever!!!!

Wondering what it does? It roles back whatever table migrations/changes you have done through flyway, along with their data. 

In short, Don't ever execute this command.

Now that we are done with all the warnings:


Installation:It is fairly straight forward:
Run the above command in a shell prompt.
Running the above creates a directory called as flyway-x.x.x/
Inside this directory are many other directories of which, the two most import directories are:
 conf/ - Configuration for eac…

ELK Stack... Not!!! FEK, it is.!!! Fluentd, Elasticsearch & Kibana

If you are here, you probably know what elasticsearch is and at some point, trying to get into the mix. You were searching for the keywords "logging and elasticsearch" or perhaps, "ELK"; and probably ended up here. Well, you might have to take the following section with a pinch of salt, especially the "ELK Stack"  fam.
At least from my experience, working for start-ups teaches oneself, a lot of lessons and one of the vast challenges include minimizing the resource utilization bottlenecks. On one hand, the logging and real-time application tracking is mandatory; while on the the other hand, there's a bottle neck in the allocated system resource, which is probably an amazon EC2 instance with 4Gigs of RAM.
ELK Stack 101: Diving in, ELK => Elasticsearch, Logstash and Kibana. Hmm, That doesn't add up; don't you think? Elasticsearch stores the reformed log inputs, Logstash chops up the textual logs and transforms them to facilitate query, deriva…