Staff Software Engineer - AI Infrastructure
Published: 2025-08-14
Job details
XPENG is a leading smart technology company at the forefront of innovation, integrating advanced AI and autonomous driving technologies into its vehicles, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics. With a strong focus on intelligent mobility, XPENG is dedicated to reshaping the future of transportation through cutting-edge R&D in AI, machine learning, and smart connectivity.
About the Role
We are looking for a versatile Machine Learning Infrastructure Engineer to join XPeng’s Fuyao AI Platform team, which builds the core AI infrastructure powering autonomous driving, robotics, and intelligent cockpit applications. You will build and optimize next-generation AI infrastructure spanning dataloaders, dataset and data production systems, large-scale inference, and distributed compute platforms, with a strong focus on efficiency, scalability, and reliability.
Job Responsibilities
Contribute to one or more of the following areas:
- Design and optimize large-scale data processing, production, and loading pipelines, supporting heterogeneous data types (images, videos, point clouds, sensor streams, etc.).
- Build and maintain high-performance dataset management and loading frameworks, ensuring low-latency, high-throughput pipelines for training and inference.
- Develop and optimize distributed compute and inference systems, including scheduling, resource utilization, and performance tuning.
- Collaborate with cross-functional teams (e.g. Algorithms, Data Lakehouse) to translate requirements into production-ready infrastructure solutions.
- Continuously monitor, profile, and eliminate bottlenecks across the AI data, inference, and compute stack.
Basic Qualifications
- Master’s degree in Computer Science, Software Engineering, or equivalent experience.
- 5+ years of experience in large-scale data processing or ML infrastructure.
- Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
- Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
- Familiarity with Linux file systems and network I/O optimization for distributed or object storage.
- Strong communication skills and ability to work cross-functionally in fast-paced environments.
- Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.
- Familiarity with the autonomous driving industry and enthusiasm for its challenges.
- Experience with distributed computing frameworks such as Ray, Flink or Spark.
- Experience in building and scaling ML infrastructure in cloud-native environments.
- Experience in any of the following areas:
- Large-scale deep learning training or inference optimization focused on scalability and model acceleration.
- Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
- Large-scale data loading frameworks (PyTorch DataLoader, Hugging Face Datasets).