Position : Data Engineer
Location : Dallas, TX
Duration : 6+ months


Experience level needed is 6-7 years
Roles and responsibilities shall include, but are not limited to:
Profiling of data from various sources using Shell scripts, Python, PowerBI, PostgreSQL, Oracle, and SQL Server in an AWS environment - 20%.
Involved in profiling and understanding the data sets received from various sources using a variety of tools such as PowerBI and Shell and Python scripts.
Create Python scripts to analyze the entire data set and extract required information such as the total number of attributes, mean, median, and NULL percentage.
Build logic to extract information such as the percentage of missing values, the list of data types present, cardinality, the distinct count per attribute, unique-value percentages, and the percentages of the top ten values per attribute.
Create scripts for profiling numerical attributes to extract statistics such as standard deviation, mean, Q1, Q3, 5th percentile, 95th percentile, range, and variance (a minimal profiling sketch follows this list).
Create visualizations such as pie charts and bar graphs on the source data using PowerBI to help make critical decisions and prepare the data ingestion strategy.
Use PGSQL to write procedures that produce high-level profiling information for large data sets.
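
A minimal sketch of the kind of profiling script described above, assuming the source extract has been loaded into a pandas DataFrame; the file name and column handling are illustrative, not the project's actual code.

# Minimal profiling sketch: summarizes each attribute of a source extract with pandas.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    report = {"total_rows": len(df), "total_attributes": df.shape[1], "columns": {}}
    for col in df.columns:
        s = df[col]
        info = {
            "dtype": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "distinct_count": int(s.nunique(dropna=True)),
            "unique_pct": round(s.nunique(dropna=True) / max(len(s), 1) * 100, 2),
            "top10_value_pct": (s.value_counts(normalize=True).head(10) * 100).round(2).to_dict(),
        }
        if pd.api.types.is_numeric_dtype(s):
            info.update({
                "mean": s.mean(), "median": s.median(), "std": s.std(),
                "q1": s.quantile(0.25), "q3": s.quantile(0.75),
                "p5": s.quantile(0.05), "p95": s.quantile(0.95),
                "range": s.max() - s.min(),
            })
        report["columns"][col] = info
    return report

if __name__ == "__main__":
    print(profile(pd.read_csv("source_extract.csv")))  # placeholder file name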

Prepare high-level ETL (Extract, Transform, and Load) source-to-target mapping specifications - 15%.
Because large amounts of data arrive from various sources, the same information may appear in different representations; source attributes are therefore mapped to target attributes to create a centralized data set.
Use Levenshtein distance to calculate the distance between source and target attribute names and map them to each other (see the sketch after this list).
Store the mappings in a database table using Postgres procedures and automate the process for every new source.
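
A minimal sketch, under the assumption that the mapping is driven purely by edit distance between attribute names; the example column names are hypothetical.

# Minimal sketch of Levenshtein-based source-to-target attribute mapping.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def map_attributes(source_cols, target_cols):
    # Map each source attribute to the target attribute with the smallest edit distance.
    return {src: min(target_cols, key=lambda tgt: levenshtein(src.lower(), tgt.lower()))
            for src in source_cols}

if __name__ == "__main__":
    print(map_attributes(["prop_addr", "list_price"],
                         ["property_address", "listing_price", "zip_code"]))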

Involved in writing complex data scripts (primarily SQL) for ETL and the data warehouse - 20%.
Create complex SQL scripts to fetch the source data and ingest it into the data warehouse.
Create Python scripts to preprocess the data before ingesting it into the data warehouse, for example standardizing data formats, normalizing dates, and trimming strings.
Create JSON handlers that pick up SQL scripts and execute them based on custom settings in the ETL (a minimal sketch follows this list).
Create scheduled jobs in Jenkins to execute the JSON and SQL and deploy them on AWS.
Create branches and push the code to a centralized repository using Git, thereby maintaining version control across the project.
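
A minimal sketch of a JSON-driven handler that picks up SQL files and executes them; the config layout, file names, and connection details are illustrative assumptions rather than the project's actual format.

# Minimal sketch: execute SQL files listed in a JSON settings file against Postgres.
import json
import psycopg2

def run_etl(config_path: str) -> None:
    with open(config_path) as fh:
        config = json.load(fh)  # e.g. {"dsn": "dbname=warehouse", "steps": [{"sql_file": "load.sql", "enabled": true}]}
    conn = psycopg2.connect(config["dsn"])
    try:
        with conn, conn.cursor() as cur:   # one transaction for all enabled steps
            for step in config["steps"]:
                if not step.get("enabled", True):
                    continue               # honor custom settings: skip disabled steps
                with open(step["sql_file"]) as sql_fh:
                    cur.execute(sql_fh.read())
    finally:
        conn.close()

if __name__ == "__main__":
    run_etl("etl_config.json")  # placeholder settings file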

Involved in data quality tasks - test plan preparation, test case creation, execution, and test report creation - 10%.
Create and execute unit test cases for every scenario during development of the SQL and JSON.
After data ingestion, create and execute scripts using Postgres procedures and built-in functions to compare total counts, the number of attributes, and various other parameters to ensure the quality of the ingested data (see the sketch after this list).
Create automated test scripts to analyze and compare the full data set between the various sources and the target database tables.
Generate statistics in the form of graphs, bar charts, and PowerPoint presentations, which technical leaders use when presenting to customers.
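
A minimal sketch of the kind of post-ingestion check described above, comparing row and column counts between a staging table and the warehouse table; the connection string and table names are placeholders.

# Minimal sketch: compare row counts (and target column count) after ingestion.
import psycopg2

def check_counts(dsn: str, source_table: str, target_table: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {source_table}")
        src_rows = cur.fetchone()[0]
        cur.execute(f"SELECT COUNT(*) FROM {target_table}")
        tgt_rows = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FROM information_schema.columns WHERE table_name = %s",
                    (target_table,))
        tgt_cols = cur.fetchone()[0]
    status = "PASS" if src_rows == tgt_rows else "FAIL"
    print(f"{status}: source rows={src_rows}, target rows={tgt_rows}, target columns={tgt_cols}")

if __name__ == "__main__":
    check_counts("dbname=warehouse", "listings_staging", "listings")  # placeholder names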

Involved in troubleshooting and determining the best resolution for data issues and anomalies - 10%.
Scan the entire data warehouse and identify data discrepancies through reverse engineering.
Execute the SQL and JSON files that have already been used for ingestion and reverse engineer them to verify the source data against the target data.
Involved in group discussions and brainstorming sessions to validate and fix data anomalies across the whole data warehouse.
Involved in creating a framework that runs continuous jobs and monitors data ingestion by storing metadata for every transaction applied to the data warehouse (a minimal sketch follows).
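
A minimal sketch of the metadata logging such a monitoring framework might perform, recording one audit row per ingestion transaction; the etl_audit table definition is an assumption made for illustration.

# Minimal sketch: store metadata for every ingestion transaction in an audit table.
import psycopg2
from datetime import datetime, timezone

AUDIT_DDL = """
CREATE TABLE IF NOT EXISTS etl_audit (
    id           SERIAL PRIMARY KEY,
    source_name  TEXT,
    target_table TEXT,
    row_count    BIGINT,
    status       TEXT,
    loaded_at    TIMESTAMPTZ
)
"""

def log_transaction(dsn, source_name, target_table, row_count, status):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(AUDIT_DDL)
        cur.execute(
            "INSERT INTO etl_audit (source_name, target_table, row_count, status, loaded_at) "
            "VALUES (%s, %s, %s, %s, %s)",
            (source_name, target_table, row_count, status, datetime.now(timezone.utc)),
        )

if __name__ == "__main__":
    log_transaction("dbname=warehouse", "mls_feed", "listings", 12345, "SUCCESS")  # placeholders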

Providing timely reports, visualizations, and presentations to executives for critical decisions - 15%.
For every data source, involved in generating deeper insights and useful information related to real estate requirements.
Based on how useful this information is, important decisions are made regarding the source.
Involved in creating intuitive graphs and analysis reports that portray the information contained in the data (a minimal plotting sketch follows).
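
A minimal plotting sketch of the kind of chart that might feed such a report; the metric and values are illustrative placeholders rather than real report data.

# Minimal sketch: render a bar chart for an executive report with matplotlib.
import matplotlib.pyplot as plt

sources = ["Source A", "Source B", "Source C"]       # hypothetical data sources
missing_pct = [2.5, 11.0, 0.8]                       # hypothetical missing-value percentages

plt.bar(sources, missing_pct)
plt.ylabel("Missing values (%)")
plt.title("Data completeness by source")
plt.savefig("completeness_report.png", dpi=150)      # image can be dropped into a slide deck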

Design and support for environments in DEV, QA, and Production for various ongoing projects - 10%.
Involved in deploying the data to multiple environments sequentially to fully prepare for production.
Involved in integrating the data into the DEV environment to build the required logic and make changes to the existing framework, ensuring that all the data is being ingested.
Involved in populating the data into the QA environment to perform deep quality assurance tasks from source to target.
Part of the team responsible for productionizing the whole data warehouse, where various consumers can access the data through platforms such as APIs or Kafka (a minimal publishing sketch follows).
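
A minimal sketch of publishing warehouse records to Kafka for downstream consumers, using the kafka-python library; the broker address, topic name, and payload are assumptions for illustration.

# Minimal sketch: publish an ingested record to a Kafka topic for consumers.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"listing_id": 101, "price": 350000, "status": "active"}  # illustrative payload
producer.send("warehouse.listings", value=record)                  # hypothetical topic name
producer.flush()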


