We are seeking a highly skilled Senior Data Engineer with expertise in SQL, Python, and PySpark, and with extensive practical experience using AWS services. The ideal candidate will have a strong background in data engineering, with a focus on building scalable and efficient data pipelines. The person in this role will ingest, process, and serve a wide array of healthcare data, including but not limited to eligibility, claims, payments, and risk adjustment data.
KEY DUTIES AND RESPONSIBILITIES:
* Design, develop, and maintain robust data pipelines using Python and PySpark to process large volumes of healthcare data efficiently in a multitenant analytics platform.
* Collaborate with cross-functional teams to understand data requirements, implement data models, and ensure data integrity throughout the pipeline.
* Optimize data workflows for performance and scalability, considering factors such as data volume, velocity, and variety.
* Implement best practices for data ingestion, transformation, and storage in AWS services such as S3, Glue, EMR, Athena, and Redshift.
* Model data in relational databases (e.g., PostgreSQL, MySQL) and file-based data stores to support data processing requirements.
* Design and implement ETL processes using Python and PySpark to extract, transform, and load data from various sources into target databases (a minimal sketch follows this list).
* Troubleshoot and enhance existing ETL jobs and processing scripts to improve the efficiency and reliability of data pipelines.
* Develop monitoring and alerting mechanisms to proactively identify and address data quality issues and performance bottlenecks (a data-quality check is also sketched below).
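To make the day-to-day work concrete, here is a minimal PySpark extract-transform-load sketch of the kind described above. The bucket names, prefixes, column names, and file layout are hypothetical placeholders for illustration, not details of our platform:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-etl").getOrCreate()

# Extract: read raw claims files landed in S3 (hypothetical bucket/prefix).
claims = (
    spark.read.option("header", "true")
    .csv("s3://example-raw-bucket/claims/2024/")
)

# Transform: normalize types, derive a partition key, and drop duplicate claims.
cleaned = (
    claims
    .withColumn("claim_amount", F.col("claim_amount").cast("double"))
    .withColumn("service_date", F.to_date("service_date", "yyyy-MM-dd"))
    .withColumn("service_month", F.date_format("service_date", "yyyy-MM"))
    .dropDuplicates(["claim_id"])
)

# Load: write partitioned Parquet back to S3, where downstream consumers
# (e.g., Athena or Redshift Spectrum) can query it in place.
(
    cleaned.write.mode("overwrite")
    .partitionBy("service_month")
    .parquet("s3://example-curated-bucket/claims/")
)
```

In practice a job like this would be scheduled and catalogued (e.g., via Glue) rather than run ad hoc, but the extract/transform/load shape stays the same.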
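Likewise, the monitoring duty above might start with a simple data-quality gate. This is a sketch under assumptions: the table path, required columns, and the 1% null-rate threshold are all illustrative, and a production version would feed an alerting system rather than just raising an exception:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-quality-check").getOrCreate()
df = spark.read.parquet("s3://example-curated-bucket/claims/")  # hypothetical path

# Row-count guard: an empty table usually means upstream ingestion failed,
# not that the data legitimately changed.
row_count = df.count()
if row_count == 0:
    raise ValueError("claims table is empty; check upstream ingestion")

# Null-rate check on columns downstream consumers depend on.
required = ["claim_id", "member_id", "claim_amount"]
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required]
).first()

for column in required:
    null_rate = null_counts[column] / row_count
    if null_rate > 0.01:  # illustrative threshold
        # In production this would page on-call or raise a CloudWatch alarm;
        # failing the job here makes the pipeline surface the issue.
        raise ValueError(f"{column} null rate {null_rate:.2%} exceeds 1% threshold")
```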
EDUCATION AND EXPERIENCE:
* Minimum of 5 years of experience in data engineering, with a focus on building and optimizing data pipelines.
* Expertise in Python programming and hands-on experience with SQL and PySpark for data processing and analysis.
* Proficiency with Python libraries for numerical computing and data processing (e.g., NumPy, pandas, SciPy, PyTorch, PyArrow).
* Strong understanding of AWS services and experience in deploying data solutions on cloud platforms.
* Experience working with healthcare data, including but not limited to eligibility, claims, payments, and risk adjustment datasets.
* Expertise in modeling data in relational databases (e.g., PostgreSQL, MySQL) and file-based data stores, along with a solid grasp of ETL processes and data warehousing concepts.
* Proven track record of designing, implementing, and troubleshooting ETL processes and processing scripts using Python and PySpark.
* Excellent problem-solving skills and the ability to work independently as well as part of a team.
* Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
* Relevant AWS or data engineering certifications are a plus.