.. _multi-project:

Multi-Project Setups
====================

Cosmos supports multi-project dbt architectures in which multiple dbt projects reference each other's models. This is commonly achieved using `dbt-loom <https://github.com/nicholasyager/dbt-loom>`__, a Python package that enables cross-project references by injecting models from upstream projects into downstream projects.

This allows you to:

- Split large dbt projects into smaller, focused domain projects
- Share common staging models across multiple downstream projects
- Maintain clear boundaries between data domains while still allowing cross-project references

Cosmos works with dbt-loom out of the box, automatically handling external node references.

How dbt-loom Works
------------------

dbt-loom enables cross-project references by:

1. Reading the ``manifest.json`` from upstream dbt projects
2. Injecting the upstream models' metadata into the downstream project's namespace
3. Allowing cross-project references using the dbt Mesh syntax: ``{{ ref('upstream_project', 'model_name') }}``

How Cosmos Handles dbt-loom
---------------------------

When Cosmos parses a dbt project that uses dbt-loom, it encounters two types of nodes:

1. **Local nodes**: Models that exist as files in the current project
2. **External nodes**: Models injected by dbt-loom from upstream projects (no local file path)

Cosmos automatically:

- **Skips external nodes** during DAG generation (they don't have file paths)
- **Creates Airflow tasks only for local nodes** in each project
- **Maintains proper dependency tracking** within each project

This means no special configuration is required: Cosmos works with dbt-loom projects automatically.

Requirements
------------

For dbt-loom to work with Cosmos:

1. **For DAG parsing**: The upstream project's ``manifest.json`` must be accessible
2. **For task execution**: The downstream project must be able to query upstream tables
3. **dbt-loom installation**: dbt-loom must be installed in the same Python virtual environment as the dbt executable used by Cosmos. This applies whether you're using a system-wide dbt installation or a project-specific virtual environment via ``ExecutionConfig`` (see the sketch at the end of this section).

The upstream manifest can be generated by running any dbt command that parses the project:

.. code-block:: bash

    cd dbt/upstream   # path to the upstream project

    dbt parse    # Fastest - just generates manifest
    # or
    dbt compile  # Also generates compiled SQL
    # or
    dbt ls       # Lists resources and generates manifest
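To illustrate the third requirement, the sketch below points Cosmos at a dbt executable that lives in a virtual environment where dbt-loom is installed alongside dbt. The paths, profile name, and schedule are illustrative assumptions rather than required values:

.. code-block:: python

    from pathlib import Path

    from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig

    # Assumed layout: the downstream project plus a dedicated dbt virtualenv,
    # e.g. created with: python -m venv dbt_venv && dbt_venv/bin/pip install dbt-postgres dbt-loom
    DOWNSTREAM_PATH = Path("/usr/local/airflow/dbt/downstream")
    DBT_EXECUTABLE = "/usr/local/airflow/dbt_venv/bin/dbt"

    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
        profile_config=ProfileConfig(
            profile_name="downstream",
            target_name="dev",
            profiles_yml_filepath=DOWNSTREAM_PATH / "profiles.yml",
        ),
        # Use the dbt binary from the virtualenv that also contains dbt-loom
        execution_config=ExecutionConfig(dbt_executable_path=DBT_EXECUTABLE),
        schedule="@daily",
    )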
Configuration Example
---------------------

Project Structure
~~~~~~~~~~~~~~~~~

A typical dbt-loom setup has an upstream project and one or more downstream projects:

.. code-block:: text

    dbt/
    ├── upstream/                  # Upstream project (staging, intermediate)
    │   ├── dbt_project.yml
    │   ├── profiles.yml
    │   ├── models/
    │   │   ├── staging/
    │   │   │   └── stg_customers.sql
    │   │   └── intermediate/
    │   │       └── int_customer_orders.sql
    │   └── target/
    │       └── manifest.json      # Required by downstream projects
    │
    └── downstream/                # Downstream project (marts, reports)
        ├── dbt_project.yml
        ├── profiles.yml
        ├── dbt_loom.config.yml    # Points to upstream manifest
        ├── dependencies.yml       # Includes dbt-loom package
        └── models/
            └── fct_revenue.sql    # References upstream models

Upstream Project Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The upstream project exposes models as ``public`` for cross-project access:

**dbt_project.yml**:

.. code-block:: yaml

    name: 'upstream'
    version: '1.0.0'
    config-version: 2
    profile: 'upstream'

    models:
      upstream:
        staging:
          +materialized: view
          +access: public  # Required for dbt-loom
        intermediate:
          +materialized: view
          +access: public

Downstream Project Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The downstream project configures dbt-loom to read from the upstream manifest. Note that dbt-loom is installed as a Python package (``pip install dbt-loom``), not as a dbt package.

**dbt_loom.config.yml**:

.. code-block:: yaml

    manifests:
      - name: upstream
        type: file
        config:
          # Use environment variable for flexibility
          path: '{{ env_var("UPSTREAM_MANIFEST_PATH", "../upstream/target/manifest.json") }}'

    enable_telemetry: false

**Model using cross-project ref**:

.. code-block:: sql

    -- fct_revenue.sql
    select
        c.customer_id,
        c.customer_name,
        sum(o.amount) as total_revenue
    from {{ ref('upstream', 'stg_customers') }} c
    left join {{ ref('upstream', 'int_customer_orders') }} o
        on c.customer_id = o.customer_id
    group by 1, 2

Cosmos DAG Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

You can use either separate DAGs or a combined DAG with task groups.

.. note::

    Each project can use a **different profile configuration**, allowing you to:

    - Write to different schemas (e.g., ``platform`` vs ``finance``)
    - Use different databases or data warehouses entirely
    - Use different credentials or Airflow connections per project

    This flexibility is useful when different teams own different projects or when data needs to flow across database boundaries.

    **Important**: The downstream profile must have read access to the tables/views created by the upstream project. Ensure appropriate grants or cross-database access is configured.

**Option 1: Combined DAG with Task Groups using dbt ls Load Mode (Recommended)**

.. literalinclude:: ../../dev/dags/cross_project_dbt_ls_dag.py
    :language: python
    :start-after: [START cross_project_dbt_ls_dag]
    :end-before: [END cross_project_dbt_ls_dag]

**Option 2: Combined DAG with Task Groups using Manifest Load Mode**

This option uses pre-generated ``manifest.json`` files for faster DAG parsing (no ``dbt ls`` execution required).

.. literalinclude:: ../../dev/dags/cross_project_manifest_dag.py
    :language: python
    :start-after: [START cross_project_manifest_dag]
    :end-before: [END cross_project_manifest_dag]

.. note::

    **Prerequisites for Manifest Load Mode**:

    - Generate ``manifest.json`` for both projects before deploying (``dbt compile`` or ``dbt parse``)
    - For remote manifests (S3/GCS/Azure), configure the appropriate Airflow connection and use ``manifest_conn_id``
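The two options above are included from the Cosmos example DAGs. As a rough, minimal sketch of the same idea (assuming ``UPSTREAM_PATH`` and ``DOWNSTREAM_PATH`` point at the two projects, and using the placeholder ``ProfileConfig(...)`` style of the other examples on this page), a combined DAG chains two task groups so the upstream models are built before the downstream project runs:

.. code-block:: python

    from airflow import DAG

    from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

    with DAG(
        dag_id="cross_project_dag",
        schedule="@daily",
        # ... start_date and other DAG arguments
    ) as dag:
        upstream_tg = DbtTaskGroup(
            group_id="upstream",
            project_config=ProjectConfig(dbt_project_path=UPSTREAM_PATH),
            profile_config=ProfileConfig(...),
        )

        downstream_tg = DbtTaskGroup(
            group_id="downstream",
            project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
            profile_config=ProfileConfig(...),
        )

        # Ensure upstream tables exist before the downstream project references them
        upstream_tg >> downstream_tg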
**Option 3: Separate DAGs with Assets (Airflow 3) / Datasets (Airflow 2.4+)**

Cosmos automatically emits assets from each task when ``emit_datasets=True`` (the default). You can use these assets to trigger downstream DAGs.

.. figure:: /_static/cross_projects_assets_view.png
    :alt: Cross-project assets view in Airflow

    Assets emitted by dbt models, showing the OpenLineage-based URIs.

.. figure:: /_static/cross_project_asset_triggered_dag.png
    :alt: Asset-triggered DAG in Airflow

    A downstream DAG triggered by upstream model assets.

.. note::

    Airflow 3 renamed "Datasets" to "Assets". The functionality is the same, but the import changes from ``from airflow.datasets import Dataset`` to ``from airflow.sdk import Asset``.

**Understanding Asset URIs in Cosmos**

Cosmos uses OpenLineage to extract lineage information and generates Asset URIs based on the actual database tables. The URI format follows the pattern:

.. code-block:: text

    {db_type}://{host}:{port}/{database}/{schema}/{table_name}

For example, a Postgres model ``stg_customers`` in schema ``platform`` generates:

.. code-block:: text

    postgres://postgres:5432/postgres/platform/stg_customers

**Example: Trigger Downstream DAG on Specific Upstream Models (Airflow 3)**

.. code-block:: python

    from airflow.sdk import Asset

    from cosmos import DbtDag, ProfileConfig, ProjectConfig, RenderConfig

    # Define assets using the OpenLineage-based URIs
    # Format: {db_type}://{host}:{port}/{database}/{schema}/{table_name}
    UPSTREAM_CUSTOMERS = Asset("postgres://postgres:5432/postgres/platform/stg_customers")
    UPSTREAM_ORDERS = Asset(
        "postgres://postgres:5432/postgres/platform/int_customer_orders"
    )

    # Upstream DAG - tasks automatically emit assets
    upstream_dag = DbtDag(
        dag_id="upstream_dag",
        project_config=ProjectConfig(dbt_project_path=UPSTREAM_PATH),
        profile_config=ProfileConfig(...),
        render_config=RenderConfig(
            emit_datasets=True,  # Default - each task emits an asset
        ),
        schedule="@daily",
    )

    # Downstream DAG triggers when specific upstream models complete
    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
        profile_config=ProfileConfig(...),
        schedule=[UPSTREAM_CUSTOMERS, UPSTREAM_ORDERS],  # Triggers on upstream completion
    )

**Example: Using AssetAlias (Airflow 3) / DatasetAlias (Airflow 2.10+)**

AssetAlias provides more flexible asset matching using URI patterns:

.. code-block:: python

    from airflow.sdk import AssetAlias

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    # Downstream DAG triggers on any asset matching the alias pattern
    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
        profile_config=ProfileConfig(...),
        schedule=[
            AssetAlias(
                name="postgres://postgres:5432/postgres/platform/int_customer_orders"
            )
        ],
    )

**Example: Manual Asset for DAG-Level Dependency**

If you want a single asset to represent the entire upstream DAG completion, add a final task that emits a custom asset:

.. code-block:: python

    from airflow import DAG
    from airflow.sdk import Asset
    from airflow.operators.empty import EmptyOperator

    from cosmos import DbtDag, DbtTaskGroup, ProfileConfig, ProjectConfig

    UPSTREAM_COMPLETE = Asset("upstream_platform_complete")

    with DAG(
        dag_id="upstream_dag",
        schedule="@daily",
        # ...
    ) as upstream_dag:
        upstream_tasks = DbtTaskGroup(
            group_id="upstream_platform",
            project_config=ProjectConfig(dbt_project_path=UPSTREAM_PATH),
            profile_config=ProfileConfig(...),
        )

        # Final task that emits a single "completion" asset
        mark_complete = EmptyOperator(
            task_id="mark_complete",
            outlets=[UPSTREAM_COMPLETE],
        )

        upstream_tasks >> mark_complete

    # Downstream DAG triggers on the completion asset
    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
        profile_config=ProfileConfig(...),
        schedule=[UPSTREAM_COMPLETE],
    )
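On Airflow 2.4 through 2.x, the same scheduling patterns work with Datasets instead of Assets; only the import changes, as noted above. A minimal sketch of the model-level trigger, reusing the placeholder paths and profiles from the examples above:

.. code-block:: python

    from airflow.datasets import Dataset

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    # Same OpenLineage-based URI as in the Airflow 3 example above
    UPSTREAM_CUSTOMERS = Dataset("postgres://postgres:5432/postgres/platform/stg_customers")

    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(dbt_project_path=DOWNSTREAM_PATH),
        profile_config=ProfileConfig(...),
        schedule=[UPSTREAM_CUSTOMERS],  # Triggers when the upstream model's dataset is updated
    )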
**Disabling Asset Emission**

To disable automatic asset emission:

.. code-block:: python

    from cosmos import DbtDag, RenderConfig

    dag = DbtDag(
        dag_id="my_dag",
        render_config=RenderConfig(emit_datasets=False),
        # ...
    )

Cross-Project Sources
---------------------

dbt-loom handles **model references** but does not directly support cross-project source references (``{{ source('upstream_project', 'table') }}``). Here are the recommended patterns:

**Pattern 1: Wrap Sources in Staging Models (Recommended)**

Define sources in the upstream project and expose them via staging models:

.. code-block:: text

    upstream_platform/
    └── models/
        └── staging/
            ├── sources.yml          # Source definition
            └── stg_raw_orders.sql   # Staging model wrapping the source

**sources.yml** (upstream):

.. code-block:: yaml

    version: 2

    sources:
      - name: raw_data
        schema: raw
        tables:
          - name: orders

**stg_raw_orders.sql** (upstream):

.. code-block:: sql

    {{ config(materialized='view', access='public') }}

    select * from {{ source('raw_data', 'orders') }}

Now the downstream project references the staging model instead of the source:

.. code-block:: sql

    -- downstream model
    select * from {{ ref('upstream_platform', 'stg_raw_orders') }}

**Pattern 2: Duplicate Source Definitions**

If you must reference the same raw table in multiple projects, define the source in each project:

.. code-block:: yaml

    # In both upstream and downstream projects
    version: 2

    sources:
      - name: shared_raw_data
        database: "{{ env_var('RAW_DATABASE') }}"
        schema: raw
        tables:
          - name: orders

This approach requires keeping source definitions in sync across projects.

Cross-Project Macros
--------------------

dbt-loom does **not** handle macro sharing. Macros are resolved at compile time within each project. Here are the recommended patterns for sharing macros:

**Pattern 1: Create a Shared dbt Package (Recommended)**

Create a separate dbt package containing shared macros and install it in all projects:

.. code-block:: text

    shared_macros/                   # Shared package (separate repo)
    ├── dbt_project.yml
    └── macros/
        ├── generate_schema_name.sql
        ├── cents_to_dollars.sql
        └── hash_columns.sql

**dbt_project.yml** (shared package):

.. code-block:: yaml

    name: 'company_shared_macros'
    version: '1.0.0'
    config-version: 2

Install in each project via **packages.yml** or **dependencies.yml**:

.. code-block:: yaml

    packages:
      # From git repository
      - git: "https://github.com/your-org/company-shared-macros.git"
        revision: v1.0.0

      # Or from local path (for development)
      - local: ../shared_macros

Use the macro with the package prefix:

.. code-block:: sql

    select
        {{ company_shared_macros.cents_to_dollars('amount_cents') }} as amount_dollars
    from {{ ref('orders') }}

**Pattern 2: Copy Macros to Each Project**

For simpler setups, copy commonly used macros to each project. This is easier to maintain for a small number of macros but doesn't scale well.

**Pattern 3: Override dbt Built-in Macros Consistently**

If you override dbt built-in macros (like ``generate_schema_name``), ensure the override is consistent across all projects:

.. code-block:: sql

    -- macros/generate_schema_name.sql (same in all projects)
    {% macro generate_schema_name(custom_schema_name, node) -%}
        {%- if custom_schema_name -%}
            {{ custom_schema_name | trim }}
        {%- else -%}
            {{ target.schema }}
        {%- endif -%}
    {%- endmacro %}

**Macro Sharing Summary**

.. list-table::
    :widths: 30 35 35
    :header-rows: 1

    * - Approach
      - Pros
      - Cons
    * - Shared dbt Package
      - Single source of truth, versioned
      - Requires package management setup
    * - Copy Macros
      - Simple, no dependencies
      - Hard to keep in sync
    * - Consistent Overrides
      - Works for built-in macros
      - Limited to override scenarios

Troubleshooting
---------------

**Error: "The path does not exist" for manifest.json**

This occurs when dbt-loom can't find the upstream manifest. Solutions:

1. Use an absolute path in ``dbt_loom.config.yml``
2. Set the ``UPSTREAM_MANIFEST_PATH`` environment variable (see the sketch below)
3. Ensure the upstream project has been parsed (run ``dbt parse``)
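If you prefer to set the variable from Airflow rather than in the shell, one place to do it is the project configuration. This is a hedged sketch: it assumes a recent Cosmos version in which ``ProjectConfig`` accepts ``env_vars``, and the absolute manifest path shown is purely illustrative:

.. code-block:: python

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    downstream_dag = DbtDag(
        dag_id="downstream_dag",
        project_config=ProjectConfig(
            dbt_project_path=DOWNSTREAM_PATH,
            # Hypothetical absolute path to the upstream manifest on the Airflow workers;
            # dbt_loom.config.yml falls back to the relative default when this is unset
            env_vars={
                "UPSTREAM_MANIFEST_PATH": "/usr/local/airflow/dbt/upstream/target/manifest.json"
            },
        ),
        profile_config=ProfileConfig(...),
    )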
**Error: "unsupported operand type(s) for /: 'PosixPath' and 'NoneType'"**

This occurred in older Cosmos versions when external nodes (from dbt-loom) did not have file paths. It is fixed in Cosmos 1.13.0+, which automatically skips nodes without file paths.

**Error: "Table does not exist" during execution**

The upstream tables must exist in the database before running downstream models:

1. Ensure the upstream project has been executed (not just parsed)
2. Verify both projects can access the same database/schemas
3. Check that cross-database access is configured if using different databases

Best Practices
--------------

1. **Use environment variables** for manifest paths to support different environments
2. **Chain task groups** (same DAG) or **use assets** (separate DAGs) to ensure proper execution order
3. **Mark upstream models as public** using ``+access: public``
4. **Generate manifests in CI** to ensure they're always available
5. **Use persistent storage** (not in-memory databases) for cross-project data sharing
6. **For asset-based scheduling**, use a completion marker task or depend on specific model assets
7. **Consider AssetAlias** (Airflow 3) / **DatasetAlias** (Airflow 2.10+) for more flexible asset matching

Limitations
-----------

- dbt-loom external nodes are skipped during Cosmos DAG generation (by design)
- Cross-project lineage is not yet visualized in Airflow's lineage view
- DAGs cannot have ``outlets`` directly; use a completion marker task or rely on task-level assets
- Asset URIs are auto-generated based on OpenLineage and may change if database connection details change