ChemAudit: Chemical Structure Validation & Quality Assessment Platform

1. Introduction
- ChemAudit is a web platform that streamlines cheminformatics workflows for drug discovery, machine learning dataset curation, and generative chemistry evaluation.
- The project presents a cohesive ecosystem that covers structure validation, standardization, profiling, curation, and detailed analytics, all wrapped in an accessible user interface and robust API.
- Built with modern technologies to support scalable research and production environments, ChemAudit integrates a suite of tools and libraries commonly used in the field, including a powerful RDKit foundation for chemical informatics, a FastAPI backend, and a React-based frontend.
- The platform emphasizes modularity and extensibility, enabling teams to tailor validation profiles, scoring schemes, and standardization rules to their specific research pipelines.
- Visual emphasis is on clarity and traceability: every action in the pipeline is designed to produce auditable results with explicit provenance, and export options ensure downstream interoperability with other systems and data stores.
- Badges for license, release, and documentation status accompany the project and signal that it is open and actively developed.
2. Visual introduction and key imagery
- The project page opens with a banner image that conveys structure validation, data curation, and analytical workflows.
- Dashboard visuals provide a quick sense of how information is organized, including single-molecule validation views and batch-processing dashboards.
- The imagery underscores an integrated experience where chemistry, data science, and software engineering converge in a single environment.
3. Core features at a glance
- The platform centers on a lifecycle approach: Validate • Standardize • Score • Profile • Curate • Analyze.
- This lifecycle supports end-to-end cheminformatics tasks, from raw molecular representations to ML-ready datasets and high-quality, auditable reports.
3.1 Structure validation
- ChemAudit performs comprehensive chemical structure analysis with an emphasis on accuracy and reliability.
- Validation checks cover valence, connectivity and aromaticity, stereochemistry, ring system assignments, and atom/bond type verification.
- Users can apply configurable validation profiles that include multiple presets to suit various research needs and quality gates.
- The result is a robust, per-molecule quality assessment that helps identify structural anomalies and data quality issues early in the workflow.
3.2 Structural alerts
- The Structural Alerts screen cross-checks compounds against established catalogs of problematic substructures.
- Alerts include PAINS (pan-assay interference compounds), Brenk filters, Kazius- and NIBR-style filter sets, and other curated catalogs, enabling rapid triage of potentially problematic chemotypes.
- The system supports deduplication of alerts across catalogs to avoid redundant flagging and to highlight the most impactful structural concerns.
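One way to picture cross-catalog deduplication is to merge alerts that flag the same atoms while remembering every catalog that raised them. The sketch below is illustrative only: the record layout, the alert names, and the choice of "same matched atoms" as the merge key are assumptions, not ChemAudit's documented behavior.

```python
from collections import defaultdict

# Hypothetical alert records: (catalog, alert name, matched atom indices).
alerts = [
    ("PAINS", "quinone_A", (3, 4, 5, 6, 7, 8)),
    ("BRENK", "quinone", (3, 4, 5, 6, 7, 8)),
    ("BRENK", "aldehyde", (12, 13)),
]


def deduplicate_alerts(alerts):
    """Merge alerts that flag the same atom set, keeping every source catalog."""
    merged = defaultdict(list)
    for catalog, name, atoms in alerts:
        key = frozenset(atoms)  # assumption: same atoms means same concern
        merged[key].append((catalog, name))
    return [
        {"atoms": sorted(key), "sources": sources}
        for key, sources in merged.items()
    ]


unique_alerts = deduplicate_alerts(alerts)
for alert in unique_alerts:
    print(alert)
```

Keeping the list of sources on each merged alert preserves provenance: a reviewer still sees that two catalogs agreed on the same substructure.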
3.3 ML-readiness scoring
- A dedicated scoring module assesses compounds for machine learning readiness across multiple dimensions.
- The scoring covers descriptor calculability, fingerprint validation, and overall molecular complexity, helping to prioritize molecules for model training and evaluation.
- Profile-based scoring enables the use of custom presets to align with particular ML tasks or datasets, fostering consistent data quality across projects.
3.4 Standardization pipeline
- The standardization component aligns molecules to a common representation compatible with major data sources and models.
- Features include salt stripping, neutralization, tautomer canonicalization, and stereochemistry normalization.
- Cross-pipeline comparison (RDKit vs ChEMBL) ensures consistency and helps resolve discrepancies between data sources.
- The pipeline is designed to be ChEMBL-compatible, facilitating downstream data sharing and interoperability.
3.5 Data preparation suite
- QSAR-ready pipeline: a 10-step curation process that produces machine-learning-ready structures.
- Steps include desalting, neutralization, tautomer canonicalization, and InChIKey deduplication, with change tracking at each stage.
- The pipeline supports multiple presets (QSAR-2D, QSAR-3D, Minimal) and batch processing via a Celery-based architecture with WebSocket progress updates.
- Structure-filter funnels validate generative model outputs (e.g., from REINVENT) through multi-stage filtering and scoring that refines the generated molecules.
- The system offers an optional novelty check via ChEMBL Tanimoto similarity, supporting REINVENT-compatible scoring endpoints and interactive funnels that visualize drop-off points.
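The idea of per-step change tracking followed by deduplication can be sketched without any chemistry toolkit. Everything below is a simplified stand-in: the two string-based steps are crude placeholders for real desalting and normalization, and deduplication keys on the curated string, whereas the pipeline described above keys on InChIKey.

```python
def strip_whitespace(smiles):
    """Placeholder cleanup step."""
    return smiles.strip()


def keep_largest_fragment(smiles):
    """Crude desalting stand-in: keep the longest dot-separated fragment."""
    return max(smiles.split("."), key=len)


# Hypothetical two-step pipeline; the real one is described as having 10 steps.
STEPS = [("strip_whitespace", strip_whitespace), ("desalt", keep_largest_fragment)]


def curate(smiles, steps=STEPS):
    """Apply each step in order and record only the steps that changed the input."""
    changes = []
    current = smiles
    for name, step in steps:
        updated = step(current)
        if updated != current:
            changes.append((name, current, updated))
        current = updated
    return current, changes


def curate_dataset(smiles_list):
    """Curate every entry, then keep the first occurrence of each result."""
    seen = set()
    unique, audit = [], []
    for raw in smiles_list:
        final, changes = curate(raw)
        duplicate = final in seen
        audit.append({"input": raw, "output": final,
                      "changes": changes, "duplicate": duplicate})
        if not duplicate:
            seen.add(final)
            unique.append(final)
    return unique, audit


unique, audit = curate_dataset(["CCO.Cl", " CCO", "c1ccccc1.[Na+]"])
print(unique)
for entry in audit:
    print(entry)
```

The audit list is what makes the result traceable: each input keeps a record of which steps altered it and whether it was dropped as a duplicate.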
3.6 Dataset auditing and diagnostics
- The dataset audit accepts an uploaded dataset and returns comprehensive health scores that highlight structural issues, standardization inconsistencies, and distributional properties.
- Contradictory labels (the same InChIKey associated with opposite activity) are detected, enabling robust quality control for curated datasets.
- Full curation reports and dataset diffs help track changes across curation cycles, supporting reproducibility and auditability.
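Contradictory-label detection reduces to grouping rows by InChIKey and reporting keys that carry more than one distinct label. A minimal sketch, assuming a dataset of (InChIKey, label) pairs with string labels; the row layout and label values are hypothetical:

```python
from collections import defaultdict

# Hypothetical dataset rows: (InChIKey, activity label).
rows = [
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "active"),
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "inactive"),
    ("UHOVQNZJYSORNB-UHFFFAOYSA-N", "inactive"),
    ("UHOVQNZJYSORNB-UHFFFAOYSA-N", "inactive"),
]


def find_contradictions(rows):
    """Return {InChIKey: sorted labels} for keys with conflicting labels."""
    labels_by_key = defaultdict(set)
    for inchikey, label in rows:
        labels_by_key[inchikey].add(label)
    return {
        key: sorted(labels)
        for key, labels in labels_by_key.items()
        if len(labels) > 1
    }


conflicts = find_contradictions(rows)
print(conflicts)
```

Exact duplicates with the same label (the second key above) are not contradictions and are left for the deduplication step to handle.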
3.7 Compound profiler
- The profiler aggregates a suite of ligand and compound-level metrics to guide selection and optimization.
- Property Forecast Index (PFI), ligand efficiency variants (LE, LLE, LELP), and other efficiency metrics inform decision-making.
- Desirability frameworks and three independent SA (synthetic accessibility) comparisons feed into a unified assessment.
- The 3D shape analysis module uses PMI (principal moment of inertia) plots derived from ETKDGv3/MMFF94 conformers to evaluate spatial compatibility and scaffold diversity.
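For readers unfamiliar with the efficiency metrics named above, the sketch below uses their commonly cited literature definitions: LE is approximately 1.37 x pIC50 per heavy atom, LLE is pIC50 minus logP, LELP is logP divided by LE, and PFI is chromatographic logD at pH 7.4 plus the aromatic ring count. Whether ChemAudit uses exactly these forms is an assumption, and the input values are invented for illustration.

```python
def ligand_efficiency(pic50, heavy_atoms):
    """LE in kcal/mol per heavy atom; 1.37 approximates 2.303*R*T near 300 K."""
    return 1.37 * pic50 / heavy_atoms


def lipophilic_ligand_efficiency(pic50, logp):
    """LLE: potency corrected for lipophilicity."""
    return pic50 - logp


def lelp(logp, le):
    """LELP: lipophilicity paid per unit of ligand efficiency."""
    return logp / le


def property_forecast_index(chrom_logd, aromatic_rings):
    """PFI: chromatographic logD (pH 7.4) plus number of aromatic rings."""
    return chrom_logd + aromatic_rings


# Invented example compound.
pic50, heavy_atoms, logp, chrom_logd, aromatic_rings = 8.0, 25, 3.0, 2.5, 2

le = ligand_efficiency(pic50, heavy_atoms)
lle = lipophilic_ligand_efficiency(pic50, logp)
lelp_value = lelp(logp, le)
pfi = property_forecast_index(chrom_logd, aromatic_rings)
print(f"LE={le:.3f} LLE={lle:.1f} LELP={lelp_value:.2f} PFI={pfi:.1f}")
```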
3.8 Safety assessment
- The platform integrates multiple safety-oriented analyses to identify liabilities early in development.
- CYP soft spots and other SMARTS-based metabolic liability patterns provide atom-level highlighting to guide metabolic risk mitigation.
- hERG liability scoring uses a multi-factor amphiphile model to assess cardiac safety concerns.
- The Beyond Rule of 5 (bRo5) framework expands evaluation into extended chemical space for larger or more flexible molecules.
- REOS (Rapid Elimination Of Swill) filtering quickly removes compounds with undesirable properties or functional groups, streamlining the screening process.
- A deduplication mechanism ensures that the same safety concerns aren’t redundantly reported across multiple catalogs.
3.9 Database integrations and identifier resolution
- The system supports universal identifier resolution for SMILES, InChI, InChIKey, CAS, common name, ChEMBL ID, PubChem CID, and more.
- Cross-reference capabilities span PubChem, ChEMBL, COCONUT, Wikidata, ChEBI, UniChem, and more, enabling cross-database lookups and data enrichment.
- Side-by-side stereochemistry-aware diffing and SureChEMBL patent lookups via UniChem are provided to support patent clearance and literature-aware workflows.
3.10 Batch analytics
- Batch analytics include Butina clustering with configurable distance thresholds, interactive scaffold and taxonomy analyses, and a SMARTS-based chemical taxonomy with extensive classification rules.
- Registration hash collision detection (RDKit RegistrationHash v2 with tautomer support) helps ensure data integrity across datasets.
- MCS (maximum common substructure) comparisons enable side-by-side molecule analysis with Tanimoto similarity, common substructures, and property deltas.
- Click-to-filter drill-down features enable interactive exploration of analytics charts, and shareable permalink reports with auto-snapshot persistence support collaboration.
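Tanimoto similarity and Butina clustering are easy to illustrate on toy data. The sketch below represents fingerprints as plain sets of on-bit indices and implements the usual Butina procedure (rank molecules by neighbor count, then let each unassigned molecule in that order seed a cluster). It is a pure-Python illustration, not ChemAudit's code; a production system would more likely rely on RDKit's fingerprint and clustering utilities.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_cluster(fps, distance_threshold):
    """Cluster fingerprints; each cluster lists its centroid index first."""
    n = len(fps)
    # Neighbors are molecules within the distance threshold (distance = 1 - similarity).
    neighbors = [
        {j for j in range(n)
         if j != i and 1.0 - tanimoto(fps[i], fps[j]) <= distance_threshold}
        for i in range(n)
    ]
    # Candidates with the most neighbors become centroids first; ties go to the lower index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    unassigned = set(range(n))
    clusters = []
    for i in order:
        if i not in unassigned:
            continue
        members = [i] + sorted(neighbors[i] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


# Toy fingerprints: three similar molecules and two unrelated ones.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {10, 11, 12}, {10, 11, 13}]
clusters = butina_cluster(fps, distance_threshold=0.3)
print(clusters)
```

Lowering the distance threshold produces more, tighter clusters; raising it merges neighbors into fewer, broader ones, which is why a configurable threshold matters for scaffold-level versus series-level views.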
3.11 Export system
- An auditable export system provides a consolidated 78-column audit trail across six sections: Validation, Deep Validation, Scoring, Safety, Compound Profile, and Standardization.
- A declarative registry drives all exporters, ensuring consistency across formats.
- Supported formats include CSV (all 78 columns), Excel (single or multi-sheet layouts, optional 2D structure depictions, and conditional formatting), JSON (nested structure), SDF (optional full audit data toggle), and PDF (batch reports with optional full audit data toggle).
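A declarative registry means the column list is defined once and every exporter reads from it, so formats cannot drift apart. The sketch below shows the pattern with three made-up columns; the column names, section assignments, and record keys are assumptions for illustration, not entries from the actual 78-column registry.

```python
import csv
import io
import json

# Hypothetical registry entries: (section, column name).
COLUMN_REGISTRY = [
    ("Validation", "is_valid"),
    ("Scoring", "ml_readiness"),
    ("Safety", "alert_count"),
]


def to_csv(records):
    """Flat export: one header row, columns in registry order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column for _, column in COLUMN_REGISTRY])
    for record in records:
        writer.writerow([record.get(column, "") for _, column in COLUMN_REGISTRY])
    return buffer.getvalue()


def to_json(records):
    """Nested export: columns grouped under their registry section."""
    nested = []
    for record in records:
        entry = {}
        for section, column in COLUMN_REGISTRY:
            entry.setdefault(section, {})[column] = record.get(column)
        nested.append(entry)
    return json.dumps(nested, indent=2)


records = [{"is_valid": True, "ml_readiness": 0.87, "alert_count": 1}]
csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

Adding a column then means adding one registry entry; both exporters pick it up without further changes.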
4. Quick Start and deployment options
- Quick Start emphasizes Docker-based deployment as the recommended route for developers and teams seeking a turnkey setup.
- The Quick Start workflow includes cloning the repository, starting services with docker-compose, and viewing logs for troubleshooting.
- Access points include a Web UI at http://localhost:3002, API documentation at http://localhost:8001/docs, the MCP server at http://localhost:8001/mcp, and a monitoring endpoint at http://localhost:9090.
- For production deployment, an interactive deploy script guides users through profile selection to balance capacity, memory, and compute resources. Available profiles cover small, medium, large, xl, and coconut configurations, with explicit constraints on maximum molecules, file sizes, and worker counts.
- The deployment guide references a detailed documentation site that covers environment setup, scaling, and security considerations.
5. Command line interface and batch processing
- ChemAudit ships with a command-line interface (CLI) that provides four subcommands: validate, score, standardize, and profile.
- Examples include validating a molecule from SMILES, scoring a structure, standardizing a SMILES string, and profiling a molecule.
- The CLI supports offline mode (--local) and remote API usage (--server), with output formats selectable as json or table.
- Batch processing capabilities enable file-based workflows (CSV or SDF) with progress streaming via WebSockets and various export options for results.
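To make the CLI shape concrete, here is a hypothetical argparse layout with the four subcommands and the two execution modes. Only the subcommand names, `--local`, `--server`, and the json/table choices come from the description above. The program name, the positional SMILES argument, the `--format` flag name, and treating the two modes as mutually exclusive are assumptions, so consult the project's own CLI help for the real interface.

```python
import argparse

SUBCOMMANDS = ("validate", "score", "standardize", "profile")


def build_parser():
    """Hypothetical parser mirroring the described CLI; not ChemAudit's code."""
    parser = argparse.ArgumentParser(prog="chemaudit")  # program name assumed
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in SUBCOMMANDS:
        cmd = subparsers.add_parser(name)
        cmd.add_argument("smiles", help="input structure as SMILES (assumed)")
        mode = cmd.add_mutually_exclusive_group()
        mode.add_argument("--local", action="store_true", help="run offline")
        mode.add_argument("--server", metavar="URL", help="use a remote API")
        # The flag name --format is an assumption; the source only lists the choices.
        cmd.add_argument("--format", choices=("json", "table"), default="json")
    return parser


args = build_parser().parse_args(["validate", "CCO", "--local", "--format", "table"])
print(args)
```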
5.1 Batch processing and progress
- Large-scale processing leverages asynchronous task queues for scalability.
- Real-time progress updates keep users informed of batch status and intermediate results.
- The batch workflow is designed to handle practical file sizes up to a gigabyte, with molecule counts reaching into the hundreds of thousands per batch depending on profile constraints.
6. Screenshots and visual walkthroughs
- The interface includes a range of visual views to communicate results clearly:
- Single Molecule Validation: displays validation outcomes and remedial suggestions for a specific molecule.
- Batch Processing: shows the workflow for processing large datasets with progress indicators.
- Scoring Dashboard: visualizes ML-readiness and related scores in an interpretable dashboard.
- Database Lookup: demonstrates cross-database enrichment and integration views for identifiers and metadata.
- Supporting images illustrate the dashboard and key screens, helping users navigate the toolset and interpret results rapidly.
7. Tech stack and architectural overview
- Frontend: built with React, TypeScript, Vite, and Tailwind to deliver a responsive, modern UI.
- Backend: Python and FastAPI provide a fast, scalable API surface; RDKit serves as the cheminformatics engine underpinning structure handling, validation, and standardization; Celery handles background tasks for long-running workflows, enabling reliable batch processing.
- Database and storage: PostgreSQL provides relational data storage with transactional integrity and ID management; Redis supports fast caching and message brokering for batch jobs and session state.
- Infrastructure: Docker provides containerized deployment for reproducibility and portability; Nginx serves as a reverse proxy and load balancer to distribute traffic and improve security.
- Monitoring and observability: Prometheus collects metrics, and Grafana provides visualization dashboards for system health and performance.
- AI integration: an MCP (Model Context Protocol) server lets AI assistants call ChemAudit tools directly, enabling AI-assisted workflows without bespoke tool integrations.
- This stack is complemented by an ecosystem of open-source tools and databases that ChemAudit connects to or mirrors for validation, standardization, and data enrichment.
8. MCP Server and AI-assisted workflows
- The MCP Server exposes a Model Context Protocol endpoint enabling AI assistants to access ChemAudit capabilities directly.
- The MCP integration automates tool discovery, generating approximately 68 tools from the existing API surface, eliminating the need to run a separate server for AI interactions.
- Quick setup instructions show how to configure AI clients to connect to the MCP server, including sample configurations for Claude, Cursor, Windsurf, and similar assistants.
- Tool categories available through MCP include Validation, Scoring, Standardization, Alerts & Safety, Compound Profiler, Identifier Resolution, Database Integrations, Diagnostics, QSAR-Ready, Structure Filter, Dataset Intelligence, Batch & Export, and Scoring Profiles.
- Security considerations ensure that admin endpoints remain inaccessible via MCP, with runtime checks to prevent leakage of admin tags into the MCP allowlist.
- Example AI interactions illustrate practical use cases, such as validating a molecule and checking PAINS alerts, resolving identifiers across multiple databases, and running the QSAR-ready pipeline on a given SMILES string.
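The admin-exclusion rule can be sketched as a tag allowlist with a runtime guard. The route names, tag names, and the rule "expose a route only if every one of its tags is allowlisted" are all assumptions made for this illustration; the source only states that admin endpoints stay out of MCP and that a runtime check prevents admin tags from entering the allowlist.

```python
# Hypothetical route table: (tool name, tags). Not ChemAudit's real endpoints.
ROUTES = [
    ("validate_molecule", ["validation"]),
    ("check_alerts", ["alerts"]),
    ("rotate_api_keys", ["admin"]),
    ("purge_batches", ["batch", "admin"]),
]

ADMIN_TAGS = frozenset({"admin"})


def checked_allowlist(allowed_tags):
    """Runtime guard: refuse any allowlist that contains an admin tag."""
    leaked = ADMIN_TAGS & set(allowed_tags)
    if leaked:
        raise ValueError(f"admin tags leaked into MCP allowlist: {sorted(leaked)}")
    return frozenset(allowed_tags)


def select_tools(routes, allowed_tags):
    """Expose a route only if it has tags and every one of them is allowlisted."""
    allowed = checked_allowlist(allowed_tags)
    return [name for name, tags in routes if tags and set(tags) <= allowed]


tools = select_tools(ROUTES, ["validation", "alerts", "batch"])
print(tools)
```

Requiring every tag to be allowlisted, rather than any one of them, keeps mixed routes such as the hypothetical `purge_batches` hidden even though `batch` itself is permitted.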
9. Project structure and repository organization
- The project is organized into a clear directory hierarchy to separate concerns and facilitate development:
- backend contains the core application, with modules for routes, configuration, data schemas, services, and templates.
- frontend houses the React client, including components, pages, API clients, hooks, and type definitions.
- client provides a Python client library to interact with the backend programmatically.
- docs-site and docs provide documentation resources and guides for setup, usage, deployment, troubleshooting, and API references.
- nginx and docker-compose configurations support orchestration and reverse proxying.
- security policies, CI/CD workflows, and test suites support ongoing quality assurance and governance.
- The backend module is further broken down into service areas such as validation, scoring, alerts, profiler, safety, diagnostics, qsarready, structurefilter, dataset_intelligence, analytics, integrations, and export.
- The frontend modules map to user-facing experiences, including 15 route pages, reusable components, and a robust API client to connect with the backend.
10. Security and governance
- ChemAudit emphasizes defense-in-depth security:
- API key authentication guarded by Redis-backed key management.
- Rate limiting with per-IP and per-key tiering and progressive IP banning.
- Session isolation using HttpOnly cookies and PostgreSQL row-level security.
- WebSocket ownership checks to prevent cross-session data access.
- CSRF protection and Content Security Policy headers.
- Secret scanning via CI pipelines with Gitleaks.
- A Security policy document is maintained to guide vulnerability reporting and responsible disclosure.
- Regular security reviews and dependency checks help mitigate common web and data-security risks.
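Rate limiting with progressive banning can be illustrated with a small in-memory limiter. This is a sketch under stated assumptions: it uses a sliding window and doubles the ban on each repeated violation, keeps state in process memory where the platform is described as Redis-backed, and takes an injectable clock so the example is deterministic. The limits and durations are invented.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter whose bans double on each repeated violation."""

    def __init__(self, limit, window, base_ban, clock=time.monotonic):
        self.limit = limit        # max requests allowed per window
        self.window = window      # window length in seconds
        self.base_ban = base_ban  # duration of the first ban in seconds
        self.clock = clock
        self.hits = defaultdict(deque)
        self.strikes = defaultdict(int)
        self.banned_until = defaultdict(float)

    def allow(self, client):
        now = self.clock()
        if now < self.banned_until[client]:
            return False
        hits = self.hits[client]
        while hits and hits[0] <= now - self.window:
            hits.popleft()  # drop requests that fell out of the window
        if len(hits) >= self.limit:
            self.strikes[client] += 1
            ban = self.base_ban * 2 ** (self.strikes[client] - 1)
            self.banned_until[client] = now + ban
            return False
        hits.append(now)
        return True


# Deterministic demo using a fake clock instead of real time.
fake_now = [0.0]
limiter = RateLimiter(limit=2, window=10.0, base_ban=5.0, clock=lambda: fake_now[0])
first_burst = [limiter.allow("10.0.0.1") for _ in range(3)]
fake_now[0] = 6.0
retry_too_soon = limiter.allow("10.0.0.1")
fake_now[0] = 20.0
after_cooldown = limiter.allow("10.0.0.1")
print(first_burst, retry_too_soon, after_cooldown)
```

Per-key tiering would fit the same structure by looking up `limit` per API key instead of using one global value.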
11. Testing, quality assurance, and contribution workflow
- The project adopts a rigorous testing regime:
- Backend tests run with pytest to validate server logic, routes, and data handling.
- Frontend tests use npm test to ensure UI components and interactions function correctly.
- Coverage reports are generated to monitor test depth across the codebase.
- Contributions are encouraged:
- A straightforward workflow includes forking the repository, creating feature branches, submitting PRs, and aligning with project guidelines.
- Clear guidance exists for adding tests, maintaining code quality, and participating in the open-source development process.
12. Documentation and learning resources
- The documentation suite provides multiple entry points:
- Getting Started: installation and initial setup guidance.
- User Guide: comprehensive usage instructions for day-to-day operations.
- API Reference: thorough REST API documentation for programmatic access.
- Deployment: production deployment guidance, scaling considerations, and environment-specific configurations.
- Troubleshooting: common issues and practical solutions.
- An interactive API documentation experience is available at the local deployment docs endpoint, enabling hands-on exploration of endpoints and payloads.
13. Licensing, acknowledgments, and open-source ecosystem
- The project is distributed under the MIT License, reflecting an emphasis on openness and collaboration.
- Acknowledgments highlight the influence and integration with a range of community-driven resources:
- RDKit: a foundational cheminformatics toolkit.
- ChEMBL: a key bioactivity database used for validation and benchmarking.
- PubChem: a primary database for chemical information.
- COCONUT: a natural products database for cross-referencing.
- ChEBI: Chemical Entities of Biological Interest.
- UniChem: cross-reference mapping service.
- Wikidata: open knowledge base for enrichment.
- SureChEMBL: patent chemistry database for intellectual property awareness.
- The acknowledgment section emphasizes the broad open-source ecosystem that ChemAudit builds upon, illustrating a collaborative approach to scientific software.
14. Maintenance, support, and community engagement
- The project maintains release notes, test coverage, and documentation as part of ongoing maintenance.
- Community involvement is encouraged through contributing guidelines, issue reporting, and pull requests.
- The visual and branding elements convey a product that is actively maintained and aligned with modern software development practices.
15. Visual assets and attribution notes
- The product page and documentation make use of a set of logos and images to illustrate capabilities and partnerships.
- Where applicable, the imagery grounds the description in concrete visuals such as dashboards, single-molecule views, batch processing scenes, and integration screens.
- Users should refer to the provided assets for context on how and where visuals appear within the user interface and reporting artifacts.
16. Summary overview
- ChemAudit presents an end-to-end platform designed to validate, standardize, score, profile, curate, and analyze chemical data.
- Its architecture emphasizes modularity, scalability, and interoperability, enabling teams to build robust cheminformatics pipelines aligned with drug discovery and ML-driven research.
- The combination of powerful validation checks, structural alert catalogs, ML-readiness scoring, and comprehensive data preparation tools makes it suitable for research groups, pharmaceutical settings, and academic laboratories seeking rigorous data quality and reproducibility.
- The MCP integration extends the platform into AI-assisted workflows, enabling seamless collaboration with AI assistants and automated tool invocation.
- The project’s open-source ethos, extensive documentation, and active community support provide a foundation for ongoing innovation, contribution, and collaboration in the cheminformatics ecosystem.
17. Quick reference: access points and essentials
- Web UI: http://localhost:3002
- API Documentation: http://localhost:8001/docs
- MCP Server: http://localhost:8001/mcp
- Metrics: http://localhost:9090
- Core licensing: MIT
- Primary tech stack: React, FastAPI, RDKit, PostgreSQL, Redis, Docker
- Notable capabilities: 15+ structure validation checks, comprehensive structural alerts, ML-readiness scoring, 10-step QSAR-ready data preparation, 6-stage generative-model funnels, safety liability detection, cross-database integration, batch analytics, and multi-format exports.
18. Visual wrap-up
- The platform is designed to be both developer-friendly and scientist-friendly, balancing programmatic access with an intuitive user interface.
- The collaborative nature of the project is reinforced by its open documentation, transparent release practices, and a broad ecosystem of open data resources and chemical databases.
- By combining rigorous chemical validation with scalable data workflows and AI-ready tooling, ChemAudit aims to accelerate reliable cheminformatics work, improve model training data, and support reproducible science in drug discovery and beyond.
Repository: https://github.com/Kohulan/ChemAudit