Collect & Create Stage

Collect & Create Stage — Detailed Summary Table

Category

Details & Actionable Guidance (UCPH · reNEW · EU funders)

Stage Purpose

Actively generate, gather, or receive data and documents while ensuring outputs are organized, secure, documented, and reusable from day one.

Triggers & Timing

Project start; first data acquisition; first instrument run; receipt of third‑party/clinical data; new collaborator/site added; immediately if a project is already underway but lacks structure.

Scope (What’s in)

Raw data, processed/derived datasets, lab protocols, code/pipelines, instrument logs, readme/data dictionary, consent forms/DUAs (if applicable), and any externally supplied datasets.

Directory Structure (Foundational)

Establish a standard hierarchy at the outset. Example (project root):
/project_name/
  01_raw/
  02_intermediate/
  03_processed/
  04_analysis/
  05_docs/
  06_metadata/
  07_code/
  08_reports/
  09_sharing/
Use leading numbers to enforce order and one structure across all sites in reNEW collaborations.
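The hierarchy above could be created programmatically so that every site starts from the same tree. The following is a minimal sketch, not an official reNEW tool: the function name `create_project_skeleton` and the constant `STANDARD_FOLDERS` are hypothetical, and only the folder names come from the example.

```python
from pathlib import Path

# Folder names copied from the example hierarchy; adjust if your project differs.
STANDARD_FOLDERS = [
    "01_raw", "02_intermediate", "03_processed", "04_analysis",
    "05_docs", "06_metadata", "07_code", "08_reports", "09_sharing",
]


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the standard folders under `root` and return their paths.

    Safe to re-run: existing folders are left untouched.
    """
    created = []
    for name in STANDARD_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in create_project_skeleton(Path("project_name")):
        print(folder)
```

Because `mkdir` is called with `exist_ok=True`, the script can also be run on a project that is already underway to add any missing folders without touching existing data.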

File Naming Conventions

Use descriptive, deterministic names: <YYYYMMDD>_<project/assay>_<sampleID>_<step>_<v01>.<ext>.
Do: 20250731_wgs_S123_align_v01.bam
Do: 20250731_mic_ROI45_raw_v01.tif
Avoid: spaces and special characters. Use _ or - as separators, always include the version and date, and keep names within roughly 32–50 characters when possible.
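A naming convention like this is easiest to keep when it can be checked automatically. The sketch below shows one possible check, under stated assumptions: the regular expression is an interpretation of the pattern above (letters, digits, and hyphens inside each field, a two-digit version), the 50-character cap is the upper end of the suggested range, and the names `NAME_PATTERN`, `MAX_LENGTH`, and `is_compliant` are hypothetical.

```python
import re
from datetime import datetime

# Assumed reading of <YYYYMMDD>_<project/assay>_<sampleID>_<step>_<v01>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})_"
    r"(?P<assay>[A-Za-z0-9-]+)_"
    r"(?P<sample>[A-Za-z0-9-]+)_"
    r"(?P<step>[A-Za-z0-9-]+)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[A-Za-z0-9.]+)$"
)
MAX_LENGTH = 50  # assumption: upper end of the suggested length range


def is_compliant(filename: str) -> bool:
    """Return True if `filename` follows the assumed naming pattern."""
    match = NAME_PATTERN.match(filename)
    if match is None or len(filename) > MAX_LENGTH:
        return False
    try:
        # Reject strings that look like dates but are not (e.g. 20250230).
        datetime.strptime(match["date"], "%Y%m%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for name in ["20250731_wgs_S123_align_v01.bam", "final data (2).xlsx"]:
        print(name, is_compliant(name))
```

The same function could back a pre-commit hook or a periodic audit script; how strictly to treat the length limit is a team decision.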

Metadata & Documentation (Minimum Set)

Maintain a README per folder; a data dictionary per dataset (variables, units, ranges, missing codes); acquisition metadata (instrument, software, version, parameters); provenance (who/when/how). Store in 06_metadata/ and alongside data.

Templates & Starter Kits

Provide team‑wide templates: README_template.md, data_dictionary_template.csv, folder_structure.json, naming_convention.md, and a project setup checklist. Keep in a shared reNEW template registry.

Data Quality & Validation

Define QC gates at ingest: file integrity (checksums), schema/format validation (CSV delimiters, header checks), range checks, minimal completeness (required columns/metadata). Record QC outcomes in qc_log.csv.
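As an illustration of two of these gates (file integrity and required-column checks), the sketch below computes a SHA-256 checksum, verifies the CSV header, and appends the outcome to a log file. It is a simplified example rather than a complete validator: the function names and the log columns are assumptions, and range checks are left out for brevity.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_columns(path: Path, required: list[str], delimiter: str = ",") -> list[str]:
    """Return the required column names that are absent from the CSV header."""
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return [col for col in required if col not in header]


def run_ingest_qc(path: Path, required: list[str],
                  expected_sha256: Optional[str], log_path: Path) -> dict:
    """Run the checks and append one row to the QC log (hypothetical layout)."""
    checksum = sha256_of(path)
    missing = missing_columns(path, required)
    result = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "file": path.name,
        "sha256": checksum,
        # With no expected checksum, the value is only recorded, not compared.
        "checksum_ok": expected_sha256 is None or checksum == expected_sha256,
        "missing_columns": ";".join(missing),
    }
    result["passed"] = result["checksum_ok"] and not missing
    write_header = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(result))
        if write_header:
            writer.writeheader()
        writer.writerow(result)
    return result
```

A call such as `run_ingest_qc(Path("01_raw/plate1.csv"), ["sample_id", "value"], None, Path("qc_log.csv"))` would record the checksum on first ingest; the file and column names in that call are placeholders.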

Storage & Access (UCPH/reNEW)

Store working data on approved services: ERDA, high‑security storage for sensitive data, or managed SharePoint/Teams for collaborative docs. Configure automatic daily snapshots and off‑site backup (3‑2‑1 principle where feasible).

Security & GDPR

For personal/clinical/confidential data: apply least‑privilege access, encrypt at rest/in transit, keep consent/DP in 05_docs/ethics/, and maintain a data‑handling record. Pseudonymize at collection where possible; separate identifiers from research data.
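One common way to pseudonymize at collection is to replace each direct identifier with a keyed hash and keep the lookup table apart from the research data. The sketch below uses HMAC-SHA256 from the Python standard library; it is an illustration under assumptions, not a UCPH-approved procedure. The function names, the `P` prefix, and the 12-character truncation are hypothetical choices, and the secret key is assumed to be stored separately under restricted access.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash.

    The same identifier and key always give the same pseudonym; without
    the key, the pseudonym cannot be recomputed from the identifier.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncation keeps IDs short but raises collision risk for very large cohorts.
    return "P" + digest.hexdigest()[:length]


def split_identifiers(records: list[dict], id_field: str, secret_key: bytes):
    """Return (research_rows, key_table) with the identifier removed from the rows."""
    research_rows = []
    key_table = {}
    for record in records:
        pseudo_id = pseudonymize(record[id_field], secret_key)
        key_table[pseudo_id] = record[id_field]
        row = {k: v for k, v in record.items() if k != id_field}
        row["pseudo_id"] = pseudo_id
        research_rows.append(row)
    return research_rows, key_table
```

In this design, `research_rows` would go into the project tree while `key_table` and the key stay in the high-security location, so re-identification requires access to both.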

Formats & Interoperability

Prefer open, analysis‑friendly formats (CSV/TSV, JSON, TIFF, HDF5). If proprietary formats are unavoidable at collection, document exact software/versions and plan conversion steps to open formats at the earliest safe point.

Versioning & Provenance

Distinguish raw vs processed (01_raw/ vs 03_processed/). Use semantic or incremental versions (v01, v02), or tag releases in Git for code/pipelines. Record transformations (script name, parameters, input→output mapping) in provenance_log.md.
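Recording transformations is more likely to happen if each pipeline step can log itself with one call. The sketch below appends one line per transformation to a Markdown log; the entry layout and the function name `record_transformation` are assumptions, since the original only names the fields to capture (script name, parameters, input-to-output mapping).

```python
from datetime import datetime, timezone
from pathlib import Path


def record_transformation(log_path, script: str, parameters: dict,
                          inputs: list[str], outputs: list[str]) -> str:
    """Append one provenance entry to the log and return the written line."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    # Sort parameters so identical runs produce identical entries.
    params = ", ".join(f"{k}={v}" for k, v in sorted(parameters.items())) or "none"
    entry = (
        f"- {timestamp} | script: {script} | parameters: {params} | "
        f"inputs: {', '.join(inputs)} -> outputs: {', '.join(outputs)}\n"
    )
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


if __name__ == "__main__":
    # Hypothetical file and script names, for illustration only.
    print(record_transformation(
        "provenance_log.md",
        script="align.py",
        parameters={"threads": 4},
        inputs=["01_raw/20250731_wgs_S123_raw_v01.fastq"],
        outputs=["03_processed/20250731_wgs_S123_align_v01.bam"],
    ))
```

Appending rather than rewriting keeps the log chronological and avoids losing earlier entries when several steps run in sequence.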

Collaboration Practices

Centralize shared conventions; publish the structure & naming in the project README. Use controlled shared spaces (ERDA/SharePoint) with clear write/read roles. Schedule periodic structure audits (bi‑weekly in early project).

Automation & Tooling

Provide small utilities: create_project_skeleton.sh, new_experiment.py (auto‑stamps folders/files), validate_dataset.py (schema/QC checks), and checksum_all.sh. Run pre‑commit hooks to prevent non‑compliant names.

Compliance & Reporting (EU/Horizon Europe)

Maintain a living DMP (update when collection methods, collaborators, or data risks change). Track funder IDs/grants in metadata; ensure traceability from raw → published dataset for reproducibility and audits.

Roles & Responsibilities

PI: approves structure/naming, ensures resources/compliance. Data Steward/Manager: owns templates, QC, training, and audits. Researchers/Analysts: apply conventions, complete metadata/QC logs. IT: storage, access control, backups, monitoring.

Quick Wins (Week 1–2)

1) Adopt the standard folder schema. 2) Publish naming convention and README templates. 3) Turn on backups/snapshots. 4) Implement checksum/QC on ingest. 5) Hold a 30‑minute training for the team.

KPIs / Health Checks

% files compliant with naming; % datasets with README & data dictionary; time‑to‑locate file (target ≤ 1 min); % folders passing QC; # of restore tests per quarter; # of GDPR events (target 0).
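Several of these indicators can be computed directly from the folder tree. As one example, the sketch below estimates README coverage: the share of folders that contain at least one file and also contain a file whose name starts with "README". That definition and the function name `readme_coverage` are assumptions; teams may prefer to count per dataset rather than per folder.

```python
from pathlib import Path


def _has_file(folder: Path) -> bool:
    """True if the folder directly contains at least one regular file."""
    return any(item.is_file() for item in folder.iterdir())


def _has_readme(folder: Path) -> bool:
    """True if the folder directly contains a file named README* (any case)."""
    return any(
        item.is_file() and item.name.upper().startswith("README")
        for item in folder.iterdir()
    )


def readme_coverage(root: Path) -> float:
    """Percentage of non-empty folders under `root` (inclusive) with a README."""
    candidates = [root, *root.rglob("*")]
    folders = [d for d in candidates if d.is_dir() and _has_file(d)]
    if not folders:
        return 100.0  # nothing to document yet
    documented = sum(1 for d in folders if _has_readme(d))
    return 100.0 * documented / len(folders)
```

Running this on a schedule and tracking the value over time gives a simple trend line for the documentation health check.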

Risks & Mitigations

Drift from conventions → monthly audits & automated validators. Loss/corruption → tested restores, 3‑2‑1 backup. GDPR breach → encryption, access logs, steward sign‑off. Tool lock‑in → early open‑format exports + format notes.

Deliverables (Outputs)

Standardized folder tree, naming guide, README/data dictionary, QC logs, provenance records, and access controls documented—ensuring every file is clearly labeled, traceable, and ready for downstream analysis/sharing.
