Li Qingwang - Software Development Engineer, Cisco
Introduction
Hello everyone, I am Li Qingwang, a software development engineer at Cisco. Our team has been using Apache DolphinScheduler to build our own big data scheduling platform for nearly three years. From the initial version 2.0.3 to today, we have grown together with the community. What I am sharing today is based on secondary development of version 3.1.1, with some new features added that are not yet included in the community version.
Today, I will share how we used Apache DolphinScheduler to build our big data platform, how we submit tasks and deploy them on AWS, and some of the challenges we encountered along with our solutions.
Architecture design and adjustment
Initially, all our services were deployed on Kubernetes (K8s), including the API, Alert, ZooKeeper (ZK), Master, and Worker components.
Big data processing tasks
We have carried out secondary development on tasks such as Spark, ETL, and Flink:
ETL tasks: Our team has developed a simple drag-and-drop tool that allows users to quickly generate ETL tasks.
Spark support: The early version only supported Spark on YARN. Through secondary development, we made it support running on K8s as well. The latest community version now supports Spark on K8s.
Flink support: Similarly, we added Flink on K8s streaming tasks, as well as support for SQL tasks and Python tasks on K8s.
Support for jobs on AWS
As our business expands and data policies change, we face the challenge of having to run data tasks in different regions. This requires us to build an architecture that can support multiple clusters. The following is a detailed description of our solution and implementation process.
Our current architecture consists of a centralized control endpoint, a single Apache DolphinScheduler service, which manages multiple clusters distributed across different geographic locations, such as the EU and the US, to comply with local data policies and isolation requirements.
Architecture Adjustment
To meet this need, we made the following adjustments:
Keeping Apache DolphinScheduler services centrally managed: Our DolphinScheduler service is still deployed in Cisco's self-built Webex DC, maintaining centralized and consistent management.
Support for AWS EKS clusters: At the same time, we have expanded the capabilities of the architecture to support multiple AWS EKS clusters. This way, we can meet new business requirements for running tasks on EKS clusters without affecting the operation and data isolation of other Webex DC clusters.
Through this design, we can flexibly respond to different business needs and technical challenges while ensuring data isolation and policy compliance.
Next, we will introduce how to handle the technical implementation and resource dependencies of Apache DolphinScheduler when running tasks in Cisco Webex DC.
Resource dependencies and storage
Since all our tasks run on Kubernetes (K8s), the following points are crucial for us:
Docker images
Storage location: Previously, all our Docker images were stored in a Docker repository at Cisco.
Image management: These images provide the necessary operating environment and dependencies for the various services and tasks we run.
Resource files and dependencies
JAR packages and configuration files: We use an Amazon S3 bucket as the resource storage center for users' JAR packages and any dependent configuration files.
Secure resource management: Sensitive information, including database passwords, Kafka encryption information, and user-dependent keys, is stored in Cisco's Vault service.
Secure access and rights management
To access the S3 Bucket, we need to configure and manage AWS credentials:
IAM account configuration
Credentials management: We use IAM accounts to manage access permissions to AWS resources, including access keys and secret keys.
K8s integration: These credentials are stored in a Kubernetes Secret and referenced by the API service to securely access the S3 bucket (a minimal sketch follows this list).
Permission control and resource isolation: Through IAM accounts, we can achieve fine-grained permission control to ensure data security and business compliance.
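As a rough illustration of how those credentials are consumed, here is a minimal sketch using the AWS SDK for Java v2: the access key and secret key injected from the Kubernetes Secret are read from the standard environment variables and used to build an S3 client. The bucket name, object key, local path, and region are hypothetical.

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.nio.file.Paths;

public class S3ResourceClient {

    /** Build an S3 client from the IAM access keys mounted via a Kubernetes Secret. */
    public static S3Client fromK8sSecretEnv() {
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
        return S3Client.builder()
                .region(Region.US_EAST_1) // hypothetical region
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create(accessKey, secretKey)))
                .build();
    }

    public static void main(String[] args) {
        try (S3Client s3 = fromK8sSecretEnv()) {
            // Download a user resource (for example a job JAR) to the local working directory.
            s3.getObject(GetObjectRequest.builder()
                            .bucket("dolphinscheduler-resources") // hypothetical bucket
                            .key("jobs/example-etl.jar")          // hypothetical key
                            .build(),
                    Paths.get("/tmp/example-etl.jar"));
        }
    }
}
```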
IAM account access key expiration issue and solutions
While using IAM accounts to manage AWS resources, we faced the issue of access key expiration. Here is a detailed description of how we dealt with this challenge.
Access key expiration issue
Key rotation period: The IAM account's AWS access keys are usually set to expire automatically every 90 days to enhance system security.
Task impact: Once the keys expire, all tasks that rely on them to access AWS resources can no longer run, so we have to update the keys in a timely manner to maintain business continuity.
To address this, we set up periodic restarts for the affected tasks along with corresponding monitoring. If a problem occurs with an AWS account before the expiration date, the responsible developers are notified to handle it.
Support for AWS EKS
As our business expanded to AWS EKS, we needed to make a series of adjustments to our existing architecture and security measures.
For example, as mentioned earlier, we previously kept our Docker images in Cisco's own Docker repo, but now we need to push them to ECR (Amazon Elastic Container Registry).
Support for multiple S3 Buckets
Due to the decentralization of AWS clusters and the data isolation requirements of different businesses, we need to support multiple S3 Buckets to meet the data storage requirements of different clusters:
Correspondence between cluster and bucket: Each cluster will access its corresponding S3 Bucket to ensure data locality and compliance.
Policy changes: We need to adjust our storage access strategy to support reading and writing data across multiple S3 buckets; each business party accesses its own corresponding bucket.
Changes to password management tools
To improve security, we migrated from Cisco's self-built Vault service to AWS's Secrets Manager (ASM):
Use of ASM: ASM provides a more integrated solution for managing passwords and keys for AWS resources.
We use IAM Roles and Service Accounts to enhance Pod security (a minimal sketch follows the steps below):
Create an IAM Role and Policy: First, create an IAM Role and bind the necessary Policy to it to ensure that only the necessary permissions are granted.
Bind Kubernetes Service Account: Then create a Kubernetes Service Account and associate it with the IAM Role.
Pod permission integration: When running a Pod, by associating it with a Service Account, the Pod can directly obtain the required AWS credentials through the IAM Role to access the necessary AWS resources.
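From the task's point of view, nothing special is required in code once IAM Roles for Service Accounts (IRSA) is in place: EKS injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the Pod, and the AWS SDK exchanges the projected token for temporary credentials. A minimal sketch with the AWS SDK for Java v2 (the region is an assumption):

```java
import software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class IrsaS3ClientFactory {

    /** S3 client whose credentials come from the IAM Role bound to the Pod's Service Account. */
    public static S3Client create() {
        return S3Client.builder()
                .region(Region.US_EAST_1) // hypothetical region
                // Reads AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE injected by EKS and
                // assumes the role automatically, so no long-lived access key is needed.
                .credentialsProvider(WebIdentityTokenFileCredentialsProvider.create())
                .build();
    }
}
```

Because these credentials are short-lived and refreshed by the SDK, there is no 90-day key to rotate and no task to restart.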
These adjustments not only improved the scalability and flexibility of our system, but also strengthened the overall security architecture, ensuring that operations in the AWS environment are both efficient and secure. They also eliminated the problem of having to restart tasks when access keys automatically expire.
Optimize resource management and storage processes
To simplify the deployment process, we plan to push the Docker image directly to ECR instead of going through a secondary transfer:
Direct push: Modify the current packaging process so that the Docker image is pushed directly to ECR after being built, reducing time delays and potential error points.
Change Implementation
Code-level adjustments: We modified the DolphinScheduler code to support multiple S3 clients and added cache management for them (see the sketch after this list).
Resource management UI adjustments: Users can select the AWS bucket they want to operate on through the interface.
Resource Access: The modified Apache DolphinScheduler service can now access multiple S3 Buckets, allowing flexible management of data between different AWS clusters.
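A minimal sketch of the caching idea, assuming one client per bucket keyed by bucket name; the BucketConfig type and the way regions and credentials are resolved are simplified assumptions rather than the actual DolphinScheduler change.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MultiBucketS3ClientCache {

    /** Which region a bucket lives in; in practice this comes from the resource-center config. */
    public record BucketConfig(String bucketName, Region region) {}

    private final Map<String, S3Client> clients = new ConcurrentHashMap<>();

    /** Return a cached S3 client for the bucket, creating it on first use. */
    public S3Client clientFor(BucketConfig config) {
        return clients.computeIfAbsent(config.bucketName(),
                name -> S3Client.builder()
                        .region(config.region())
                        // Credentials come from the default provider chain
                        // (IRSA on EKS, or the injected keys in Webex DC).
                        .build());
    }

    /** Close all cached clients, for example on service shutdown. */
    public void closeAll() {
        clients.values().forEach(S3Client::close);
        clients.clear();
    }
}
```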
AWS resource management and permission isolation
Integrate with AWS Secrets Manager (ASM)
We have extended Apache DolphinScheduler to support AWS Secrets Manager (ASM), allowing users to select secrets in different cluster types (a retrieval sketch follows this list):
ASM Function Integration
User interface improvements: The DolphinScheduler user interface now displays the different secret types and lets users select among them.
Automatic key management: At runtime, the secret selected by the user is stored in a file whose path is mapped into the Pod through an environment variable, ensuring the key is used safely.
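A minimal retrieval sketch using the AWS SDK v2 Secrets Manager client: fetch the secret the user selected, write it to a file, and expose that file path to the task Pod through an environment variable. The secret name, target directory, and region are hypothetical.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AsmSecretFetcher {

    /** Fetch a secret from ASM and persist it as a file for the task Pod to consume. */
    public static Path fetchToFile(String secretName, String targetDir) throws IOException {
        try (SecretsManagerClient asm = SecretsManagerClient.builder()
                .region(Region.US_EAST_1) // hypothetical region
                .build()) {
            String secretValue = asm.getSecretValue(
                    GetSecretValueRequest.builder().secretId(secretName).build()
            ).secretString();
            // The resulting file path is what gets mapped into the Pod as an
            // environment variable (e.g. DB_CREDENTIALS_FILE).
            Path target = Paths.get(targetDir, secretName.replace('/', '_'));
            return Files.writeString(target, secretValue);
        }
    }

    public static void main(String[] args) throws IOException {
        fetchToFile("prod/kafka/credentials", "/etc/secrets"); // hypothetical names
    }
}
```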
Dynamic resource configuration and initialization service (Init Container)
To manage and initialize AWS resources more flexibly, we implemented an initialization step based on a Kubernetes init container (a sketch follows this list):
Resource pull: Before the Pod's main container runs, the init container automatically pulls the S3 resources configured by the user and places them in the specified directory.
Key and configuration management: According to the configuration, the init container checks and pulls the secret information from ASM, stores it in a file, and maps the file path through environment variables for use by the Pod.
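A sketch of the pull step, assuming the job's resources live in a single bucket and are written to a shared volume that the main container also mounts; the bucket, keys, and mount path are illustrative.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class InitContainerResourcePuller {

    /** Download each configured S3 resource into the directory shared with the main container. */
    public static void pullResources(S3Client s3, String bucket, List<String> keys) {
        Path mountDir = Paths.get("/opt/dolphinscheduler/resources"); // hypothetical shared volume
        for (String key : keys) {
            Path target = mountDir.resolve(Paths.get(key).getFileName());
            // JARs, config files, etc. are placed where the task expects to find them.
            s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), target);
        }
    }
}
```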
Application of Terraform in resource creation and management
We use Terraform to automate the configuration and management of AWS resources, simplifying resource allocation and permission settings:
Automatic resource configuration: Use Terraform to create the required AWS resources such as S3 Bucket and ECR Repo.
IAM policy and role management: Automatically create IAM policies and roles to ensure that each business unit can access the resources it needs on demand.
Privilege isolation and security
We use sophisticated permission isolation strategies to ensure that different business units operate in independent namespaces, avoiding resource access conflicts and security risks:
Implementation details
Service Account Creation and Binding: Create an independent Service Account for each business unit and bind it to the IAM role.
Namespace isolation: Each Service Account operates within its specified namespace and accesses its corresponding AWS resources through its IAM role.
Improvements in cluster support and permission control
Cluster type expansion
We added a new field, cluster type, to support different types of K8s clusters, including not only standard Webex DC clusters and AWS EKS clusters, but also specific clusters with higher security requirements (a small dispatch sketch follows this list):
Cluster type management
Cluster type field: By introducing the cluster type field, we can easily manage and extend support for different K8s clusters.
Code-level customization: To address the unique needs of specific clusters, we can make code-level modifications to ensure that their security and configuration requirements are met when running jobs on these clusters.
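A hedged sketch of how such a field can drive cluster-specific behaviour, using the container image registry as an example (Webex DC clusters pull from the internal repo, EKS clusters from ECR); the enum values and registry hosts are illustrative assumptions, not the actual implementation.

```java
public class ClusterTypeDispatch {

    /** Hypothetical cluster types behind the new field. */
    public enum ClusterType { WEBEX_DC, AWS_EKS, HIGH_SECURITY }

    /** Resolve which container registry a job image should be pulled from. */
    public static String imageRegistryFor(ClusterType type) {
        return switch (type) {
            case WEBEX_DC      -> "docker.internal.example.com";                  // hypothetical internal repo
            case AWS_EKS       -> "123456789012.dkr.ecr.us-east-1.amazonaws.com"; // hypothetical ECR registry
            case HIGH_SECURITY -> "registry.hardened.example.com";                // hypothetical hardened registry
        };
    }
}
```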
Enhanced permission control system (Auth system)
We have developed an Auth system specifically for fine-grained permission control, including permission management between projects, resources, and namespaces:
Permission management function
Project and resource permissions: Users can control permissions at the project level; once they have project permissions, they have access to all resources under that project.
Namespace permission control: Ensure that a specific team can only run the jobs of its project in the specified namespace, thereby ensuring the isolation of running resources.
For example, team A can only run its project's jobs in its own namespace A, and a user from team B cannot run jobs in team A's namespaces (a sketch of this check follows).
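A hedged sketch of that check as it might look before job submission; the AuthService interface is illustrative and not the real Auth system API.

```java
public class NamespacePermissionCheck {

    /** Illustrative view of the Auth system's namespace rule. */
    public interface AuthService {
        boolean isNamespaceAllowed(String project, String cluster, String namespace);
    }

    /** Reject submissions into namespaces the project is not authorized to use. */
    public static void checkBeforeSubmit(AuthService auth,
                                         String project, String cluster, String namespace) {
        if (!auth.isNamespaceAllowed(project, cluster, namespace)) {
            // e.g. a user from team B trying to submit into team A's namespace ends up here
            throw new IllegalStateException("Project " + project
                    + " is not allowed to run jobs in " + cluster + "/" + namespace);
        }
    }
}
```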
AWS resource management and permission application
We manage permissions and access control for AWS resources through the Auth system and other tools, making resource allocation more flexible and secure:
Multiple AWS Account Support: In the Auth system, you can manage multiple AWS accounts and bind different AWS resources such as S3 Bucket, ECR, and ASM.
Resource mapping and permission application: Users can map existing AWS resources and apply for permissions in the system, so that they can easily select the resources they need to access when running a job.
Service Account Management and Permission Binding
To better manage service accounts and their permissions, we implemented the following features:
Service Account Binding and Management
Service Account uniqueness: Each Service Account is bound to a specific cluster, namespace, and project name to ensure its uniqueness (a minimal data-model sketch follows this list).
Permission binding interface: Users can bind the Service Account to specific AWS resources, such as S3, ASM, or ECR, on the interface to achieve precise control of permissions.
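A minimal data-model sketch of this binding, with illustrative field names.

```java
import java.util.Set;

public class ServiceAccountBinding {

    /** Uniqueness key: one Service Account per cluster + namespace + project. */
    public record ServiceAccountKey(String cluster, String namespace, String project) {}

    /** AWS resources the Service Account's IAM Role has been granted on the interface. */
    public record AwsGrants(Set<String> s3Buckets, Set<String> ecrRepos, Set<String> asmSecrets) {}
}
```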
Simplify operations and resource synchronization
I have covered a lot just now, but in practice the operation is relatively simple for users; the entire application process is a one-time job. To further improve the user experience of Apache DolphinScheduler in the AWS environment, we have taken a series of measures to simplify the operation process and enhance resource synchronization.
Let me summarize for you:
Simplified user interface
In DolphinScheduler, users can easily configure the specific cluster and namespace where their jobs run:
Cluster and namespace selection
Cluster Selection: When users submit a job, they can select the cluster where they want the job to run.
Namespace configuration: Depending on the selected cluster, the user also needs to specify the namespace where the job will run.
Service Account and Resource Selection
Service Account Display: The page will automatically display the corresponding Service Account based on the selected project, cluster, and namespace.
Resource access configuration: Users can select the S3 Bucket, ECR address, and ASM key associated with the service account through the drop-down list.
Future Outlook
There are still some areas of the current design that can be optimized to make job submission easier for users and to simplify operation and maintenance:
Image push optimization: Consider skipping Cisco's transit packaging process and pushing images directly to ECR, especially for EKS-specific image modifications.
One-click synchronization function: We plan to develop a one-click synchronization function, allowing users to automatically synchronize a resource package uploaded to an S3 Bucket to other S3 Buckets, reducing the work of repeated uploading.
Automatic mapping to the Auth system: After AWS resources are created through Terraform, the system will automatically map them into the permission management system, so users no longer need to enter resources manually.
Optimized permission control: Through automated resource and permission management, user operations become simpler and the complexity of setup and management is reduced.
With these improvements, we expect to help users use Apache DolphinScheduler to more effectively deploy and manage their jobs, both on Webex DC and EKS, while improving the efficiency and security of resource management.