You've just received an email that your data is ready to download. This page explains how to download CAT sequencing data, what files you will receive, how long data are retained, and what to do if you need help.
How Do I Download My Data?
When your data are ready, you will receive a “Data Ready” email with the exact data path and credentials. You can download using SFTP or rsync. Data are typically available for about 3 months on fast SSD storage, then moved to slower storage for roughly another 3 months. Please download and back up your data as soon as possible.
How Do I Use SFTP (GUI)?
We recommend Cyberduck (Windows/Mac) or FileZilla. Server: fastq.ucsf.edu, Port: 22. Username and password are included in the Data Ready email.
Short video walkthrough coming here.
How Do I Use SFTP (Command Line)?
sftp <username>@fastq.ucsf.edu # enter password
Useful sftp commands:
- ls # list remote directory
- cd DIR # change remote directory
- pwd # show remote working directory
- lls # list local directory
- lcd DIR # change local directory
- lpwd # show local working directory
- get filename # download one file
- get -r directory # download a whole directory
- exit # quit SFTP
How Do I Use rsync? (Recommended for Large Transfers)
rsync -a --append-verify --partial --info=progress2 --no-o --no-g --no-perms --compress-level=0 \
<username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
To list files without copying:
rsync -av --list-only <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
Note: “Could not chdir to home directory …” can be ignored.
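For unattended transfers, the rsync command above can be wrapped in a small retry loop. The following is a minimal Python sketch, not a CAT-provided tool: the function names, the retry count, and the wait time are illustrative assumptions, and the rsync flags are copied from the command above.

```python
import subprocess
import time


def build_rsync_cmd(username, remote_path, local_dest, host="fastq.ucsf.edu"):
    """Return the rsync command from this page as an argument list."""
    return [
        "rsync", "-a", "--append-verify", "--partial", "--info=progress2",
        "--no-o", "--no-g", "--no-perms", "--compress-level=0",
        f"{username}@{host}:{remote_path}", local_dest,
    ]


def download_with_retries(cmd, attempts=5, wait_seconds=30):
    """Re-run the command until it exits with status 0 or attempts run out.

    attempts and wait_seconds are arbitrary illustrative defaults.
    """
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"attempt {attempt} failed")
        if attempt < attempts:
            time.sleep(wait_seconds)
    return False
```

A call such as download_with_retries(build_rsync_cmd("<username>", "<PATH_TO_DATA>", "<LOCAL_DESTINATION>")) resumes partial files on each attempt because of --partial and --append-verify. rsync may prompt for your password on every attempt.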
Data Integrity (Checksums)
Your data folder will include a checksum file (for example fastq_checksums.md5 or checksums.md5). After download, verify integrity with:
md5sum -c fastq_checksums.md5
If you are on macOS and don’t have md5sum, install coreutils or use an equivalent tool to compare hashes.
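Where md5sum is not available, the same check can be done with the Python standard library. This is a minimal sketch that assumes the checksum file uses the usual md5sum layout (a hex digest, whitespace, then a file path) and that paths are relative to the folder containing the checksum file. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return (name, status) for each entry in an md5sum-style file.

    Status is "OK", "FAILED", or "MISSING".
    """
    checksum_file = Path(checksum_file)
    results = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        target = checksum_file.parent / name
        if not target.is_file():
            results.append((name, "MISSING"))
        elif md5_of(target) == expected.lower():
            results.append((name, "OK"))
        else:
            results.append((name, "FAILED"))
    return results
```

Running verify_checksums("fastq_checksums.md5") from inside the downloaded data folder and checking that every status is "OK" gives the same assurance as md5sum -c.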
Data Retention and Regeneration
Data are initially available from fast storage at /volume1/SSD/illumina/run_id. After approximately 3 weeks, data are moved to slower storage (/volume2/tortoise/run_id or /volume3/slow/run_id) and then deleted after approximately three months. Raw run data are retained for approximately 6 months. You can request regeneration through iLabs for a $100 fee (per project). We recommend backing up in at least two locations immediately after download.
We only guarantee data availability for approximately 6 months total.
Missing Samples (Barcodes)
If expected samples are missing from your results, the most common causes are an incorrect sample sheet (wrong barcodes or orientation) or low/absent representation in the library. Before contacting CAT, review Reports/Top_Unknown_Barcodes.csv to see barcode pairs that are present in significant quantity but may be missing from the results. Also review the SampleSheet.csv found in the bcl-convert outputs. The pipeline attempts to identify and add these barcodes automatically, but review is still recommended.
How Sample Names and Barcodes Are Edited
Sample IDs are sanitized for compatibility: Unicode is converted to ASCII, punctuation is removed, whitespace is converted to underscores, and only letters, numbers, underscores, and dashes are allowed. If two Sample IDs collide after sanitization, a short hash is appended (_hXXXXXX) to make them unique. Barcodes are normalized to uppercase A/C/G/T only. N characters are interpreted as UMI bases and are output, and invalid entries are rejected. Some workflows may omit i5 or enforce a fixed index length.
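The rules above can be sketched in Python as follows. This is an approximation for checking your own IDs before submission, not the CAT pipeline's code. The function names, the use of SHA-1 for the suffix, the choice to suffix every colliding ID, and the collapsing of whitespace runs into one underscore are all assumptions.

```python
import hashlib
import re
import unicodedata


def sanitize_sample_id(raw):
    """Apply the sanitization rules described above to one Sample ID."""
    # Unicode -> ASCII: decompose accents, drop anything that cannot be encoded.
    ascii_text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    # Whitespace -> underscore (collapsing runs is an assumption).
    underscored = re.sub(r"\s+", "_", ascii_text.strip())
    # Keep only letters, numbers, underscores, and dashes.
    return re.sub(r"[^A-Za-z0-9_-]", "", underscored)


def sanitize_all(raw_ids):
    """Sanitize a list of IDs, appending _hXXXXXX where results collide."""
    cleaned = [sanitize_sample_id(raw) for raw in raw_ids]
    counts = {}
    for name in cleaned:
        counts[name] = counts.get(name, 0) + 1
    result = []
    for raw, name in zip(raw_ids, cleaned):
        if counts[name] > 1:
            # Assumption: six hex characters of SHA-1 over the original ID.
            suffix = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:6]
            result.append(f"{name}_h{suffix}")
        else:
            result.append(name)
    return result


def normalize_barcode(raw):
    """Uppercase a barcode and reject anything other than A/C/G/T/N."""
    barcode = raw.strip().upper()
    if not re.fullmatch(r"[ACGTN]+", barcode):
        raise ValueError(f"invalid barcode: {raw!r}")
    return barcode
```

For example, sanitize_sample_id("Café sample #1") returns "Cafe_sample_1", and sanitize_all(["A B", "A_B"]) returns two distinct names that both start with "A_B_h". Exact duplicate IDs would still collide in this sketch.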
What Is the CAT Project Report?
Each project output includes a PDF report (<PROJECT_ID>_project_report.pdf) that summarizes project-level sequencing quality and demultiplexing performance. The report is intended to help labs quickly identify project-specific issues and improve future library prep, indexing, and pooling.
Use the report in this order:
- Lane Summary
- Sample Summary
- Reads per Sample
- Top Unknown Barcodes
- Glossary
How Should I Interpret the Project Report?
In most cases, project-level issues are related to sample sheet metadata or library quality, not instrument failure.
Common patterns:
- Incorrect sample sheet barcodes are a common cause of poor demultiplexing.
- Increasing barcode mismatch to 2 is generally not recommended.
- Lower demultiplexing and lower quality metrics are often driven by library quantity, quality, and fragment size.
- Flowcell-level issues usually affect many/all lanes, not just one or a few lanes. CAT will proactively notify customers when a true flowcell-wide issue is observed.
Lane Summary: What to Look For
The Lane Summary reports lane-level reads, bases, %Q30, PhiX alignment, lane-level index hopping estimate, and % Demuxed.
The sample distribution plots (Reads (M), %Q30, Mean Q) show one point per sample.
Focus on spread:
- Tight spread usually indicates consistent sample prep/library quality/pooling.
- Wide spread or outliers usually indicate sample-specific issues.
Sample Summary and Undetermined Reads
Sample Summary includes per-sample reads, quality, and index match stats, plus Undetermined reads (reads not assigned to expected sample barcodes).
Undetermined reads can include PhiX unless PhiX is explicitly listed in the SampleSheet.
For NovaSeq X demultiplexing behavior and library/index effects, see:
Effects of library quality and indexes on demultiplexing NovaSeq X Series data
Top Unknown Barcodes: How to Use It
Top Unknown Barcodes helps troubleshoot missing/low samples by showing barcode pairs observed by the instrument that were not assigned as expected.
Use this section to:
- Compare observed index/index2 pairs against your submitted SampleSheet
- Check for orientation/index-entry mistakes
- Spot technical artifacts (for example Poly-G or adapter-like sequences)
Only top entries are shown in the PDF; use Reports/Top_Unknown_Barcodes.csv in your project folder for the full list.
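The orientation check in the list above can be done by hand or with a short script. The sketch below compares unknown (index, index2) pairs against expected pairs after reverse-complementing or swapping the indexes. It is an illustrative helper with hypothetical function names, not part of the CAT pipeline.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T/N sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orientation_hits(expected_pairs, unknown_pairs):
    """Find unknown pairs that match an expected pair once transformed.

    Returns a list of (unknown_pair, expected_pair, description).
    """
    expected = set(expected_pairs)
    variants = [
        ("index2 reverse-complemented",
         lambda i7, i5: (i7, reverse_complement(i5))),
        ("index reverse-complemented",
         lambda i7, i5: (reverse_complement(i7), i5)),
        ("both reverse-complemented",
         lambda i7, i5: (reverse_complement(i7), reverse_complement(i5))),
        ("index and index2 swapped",
         lambda i7, i5: (i5, i7)),
    ]
    hits = []
    for pair in unknown_pairs:
        for description, transform in variants:
            candidate = transform(*pair)
            if candidate in expected:
                hits.append((pair, candidate, description))
    return hits
```

A hit suggests that the matching sample sheet entry was submitted in the wrong orientation. The unknown pairs can be read from Reports/Top_Unknown_Barcodes.csv with Python's csv module; check the file's header row for the actual column names.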
Barcode Collisions and Automatic Corrections
Some barcodes are too similar (or contain Ns) and may be filtered to avoid misassignment. The CAT pipeline uses the run's top emitted barcodes to infer dominant unexpected barcodes, attempts reverse-complement corrections when needed, and may write a corrected sample sheet. If corrections are unsafe, it skips changes and reports a warning.
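One common way to reason about "too similar" barcodes is Hamming distance: with a mismatch tolerance of m per index, a read can be ambiguous between two barcodes whose distance is 2m or less. The sketch below illustrates that general idea for equal-length barcodes. It is not the actual logic of the CAT pipeline or bcl-convert, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def colliding_pairs(barcodes, mismatches=1):
    """Return barcode pairs close enough to be ambiguous at this tolerance.

    Two barcodes can both lie within `mismatches` of the same read
    when their Hamming distance is at most 2 * mismatches.
    """
    return [
        (a, b, hamming(a, b))
        for a, b in combinations(barcodes, 2)
        if hamming(a, b) <= 2 * mismatches
    ]
```

Raising the tolerance from 1 to 2 widens the collision threshold from a distance of 2 to a distance of 4, which is one reason a mismatch setting of 2 is rarely a safe choice.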
bcl2fastq vs. bcl-convert
CAT now uses Illumina’s bcl-convert instead of bcl2fastq. The output is still FASTQ, but report/log layout is updated (e.g., Reports/ and Logs/ folders), sample sheet handling is stricter, and some file naming details may differ. If your pipeline depends on the older layout, contact CAT for guidance.
What Is in My Project Folder?
Illumina projects typically include FASTQ files, Reports/, Logs/, and a SampleSheet.csv.
PacBio projects typically include BAM files (and indexes), run metadata, and summary/report files.
Common Errors and Quick Fixes
- Permission denied or login failures: confirm you are using the correct username/password from the Data Ready email.
- Slow or interrupted transfers: use rsync with --partial and --append-verify to resume.
- Partial files: re-run rsync or verify with checksums.
- Missing expected samples: review Top_Unknown_Barcodes.csv and your submitted barcodes.
Contact Us
If you need assistance from the CAT Core, please email one of the following:
- [email protected] if there is an issue with data generation. Please review the CAT Data FAQ before emailing.
- [email protected] for assistance with project and data analysis.
- [email protected] for assistance from the CAT web lab.
Glossary
- FASTQ: Text format containing sequencing reads and quality scores (usually compressed as .fastq.gz).
- BAM: Binary alignment format commonly used for PacBio data; includes read sequence and metadata.
- Demultiplexing: Assigning reads to samples using barcode/index sequences.
- Lane: Physical lane on a flow cell; a run can have multiple lanes.
- Index (Barcode): Short sequence used to identify samples pooled in a run.