You've just received an email that your data is ready to download. This page explains how to download CAT sequencing data, what files you will receive, how long data are retained, and what to do if you need help.
How Do I Download My Data?
When your data are ready, you will receive a “Data Ready” email with the exact data path and credentials. You can download using SFTP or rsync. Data are typically available for about 3 months on fast SSD storage, then moved to slower storage for roughly another 3 months. Please download and back up your data as soon as possible.
How Do I Use SFTP (GUI)?
We recommend Cyberduck (Windows/Mac) or FileZilla. Server: fastq.ucsf.edu, Port: 22. Username and password are included in the Data Ready email.
How Do I Use SFTP (Command Line)?
sftp <username>@fastq.ucsf.edu # enter password
Useful SFTP commands:
- ls # list remote directory
- cd DIR # change remote directory
- pwd # show remote working directory
- lls # list local directory
- lcd DIR # change local directory
- lpwd # show local working directory
- get filename # download one file
- get -r directory # download a whole directory
- exit # quit SFTP
How Do I Use rsync? (Recommended for Large Transfers)
rsync -a --append-verify --partial --info=progress2 --no-o --no-g --no-perms --compress-level=0 \
<username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
To list files without copying:
rsync -av --list-only <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
Note: “Could not chdir to home directory …” can be ignored.
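For long transfers over an unreliable connection, the rsync command above can be wrapped in a small retry loop. The following is a minimal sketch, not part of the CAT tooling: the function name `retry`, the attempt count, and the delay are illustrative assumptions. With password login, rsync will likely prompt for the password again on each attempt.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0
# or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example call (placeholders come from your Data Ready email):
# retry 5 60 rsync -a --append-verify --partial --info=progress2 \
#   --no-o --no-g --no-perms --compress-level=0 \
#   <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
```

Because `--partial` and `--append-verify` are kept, each retry continues from the files already on disk instead of starting over.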
Data Integrity (Checksums)
Your data folder will include a checksum file (for example fastq_checksums.md5 or checksums.md5). After download, verify integrity with:
md5sum -c fastq_checksums.md5
If you are on macOS and don’t have md5sum, install coreutils or use an equivalent tool to compare hashes.
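As one possible way to handle the macOS case, the sketch below picks whichever MD5 tool is installed. The function name `verify_md5` is hypothetical. It assumes you run it from inside the downloaded data folder, that the checksum file uses the standard `md5sum` layout (hash, whitespace, file name), and that Homebrew's coreutils is the source of `gmd5sum`.

```shell
# verify_md5 CHECKSUM_FILE
# Hypothetical helper: verifies every file listed in CHECKSUM_FILE.
verify_md5() {
  checksum_file=$1
  if command -v md5sum >/dev/null 2>&1; then
    md5sum -c "$checksum_file"
  elif command -v gmd5sum >/dev/null 2>&1; then
    # Homebrew coreutils installs md5sum under this name
    gmd5sum -c "$checksum_file"
  elif command -v md5 >/dev/null 2>&1; then
    # Built-in macOS tool: compare each hash by hand
    status=0
    while read -r expected path; do
      path=${path#\*}   # drop the "*" binary-mode marker if present
      if [ "$(md5 -q "$path" 2>/dev/null)" = "$expected" ]; then
        echo "$path: OK"
      else
        echo "$path: FAILED"
        status=1
      fi
    done < "$checksum_file"
    return "$status"
  else
    echo "no MD5 tool found" >&2
    return 2
  fi
}
```

Call it as `verify_md5 fastq_checksums.md5`; a non-zero exit status means at least one file did not match.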
Data Retention and Regeneration
Data are initially available from fast storage at /volume1/SSD/illumina/run_id. After approximately 3 weeks, they are moved to slower storage (/volume2/tortoise/run_id or /volume3/slow/run_id) and then deleted after approximately three months. Raw run data are retained for approximately 6 months. You can request regeneration through iLabs for a $100 fee per project. We recommend backing up in at least two locations immediately after download.
We only guarantee data availability for approximately 6 months total.
Missing Samples (Barcodes)
If expected samples are missing from your results, the most common causes are an incorrect sample sheet (wrong barcodes or wrong orientation) or low or absent representation in the library. Before contacting CAT, review Reports/Top_Unknown_Barcodes.csv for barcode pairs that are present in significant quantity but missing from the results, and review the SampleSheet.csv found in the bcl-convert outputs. The pipeline attempts to identify and add such barcodes automatically, but review is still recommended.
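A quick way to check for an orientation problem is to search the unknown-barcode report for both a submitted barcode and its reverse complement. The sketch below is an illustration only: `revcomp` and `find_barcode` are hypothetical helper names, and the search matches whole lines so it makes no assumption about the column layout of the CSV.

```shell
# revcomp BARCODE
# Prints the reverse complement of a DNA barcode.
revcomp() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
    awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# find_barcode BARCODE CSV_FILE
# Prints lines of CSV_FILE containing BARCODE in either orientation.
find_barcode() {
  grep -i -F -e "$1" -e "$(revcomp "$1")" "$2"
}

# Example call:
# find_barcode AACGTTGC Reports/Top_Unknown_Barcodes.csv
```

If only the reverse-complemented form appears with a high read count, the submitted barcode was probably entered in the wrong orientation.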
How Sample Names and Barcodes Are Edited
Sample IDs are sanitized for compatibility: Unicode is converted to ASCII, punctuation is removed, whitespace is converted to underscores, and only letters, numbers, underscores, and dashes are allowed. If two Sample IDs collide after sanitization, a short hash is appended (_hXXXXXX) to make them unique. Barcodes are normalized to uppercase A/C/G/T only; N characters are interpreted as UMI and are output, and invalid entries are rejected. Some workflows may omit i5 or enforce a fixed index length.
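To preview roughly how a name or barcode might come out, the rules above can be approximated in a few lines of shell. This is a sketch under stated assumptions, not the CAT pipeline's code: `sanitize_id` and `normalize_barcode` are hypothetical names, non-ASCII characters are simply dropped rather than transliterated, and the collision hash is not reproduced because its algorithm is not described here.

```shell
# sanitize_id SAMPLE_ID
# Approximation: whitespace -> "_", then keep only letters,
# digits, underscores, and dashes (non-ASCII bytes are dropped).
sanitize_id() {
  printf '%s' "$1" | tr '[:space:]' '_' | LC_ALL=C tr -cd 'A-Za-z0-9_-'
  printf '\n'
}

# normalize_barcode BARCODE
# Uppercases the barcode and rejects anything other than A/C/G/T/N.
normalize_barcode() {
  bc=$(printf '%s' "$1" | tr 'acgtn' 'ACGTN')
  case "$bc" in
    ''|*[!ACGTN]*)
      echo "invalid barcode: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$bc"
}
```

For example, `sanitize_id 'Sample 1 (A)'` prints `Sample_1_A`, and `normalize_barcode ACGU` fails with an error message.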
Barcode Collisions and Automatic Corrections
Some barcodes are too similar (or contain Ns) and may be filtered to avoid misassignment. The CAT pipeline uses the run's top emitted barcodes to infer dominant unexpected barcodes, attempts reverse-complement corrections when needed, and may write a corrected sample sheet. If corrections are unsafe, it skips changes and reports a warning.
bcl2fastq vs. bcl-convert
CAT now uses Illumina’s bcl-convert instead of bcl2fastq. The output is still FASTQ, but report/log layout is updated (e.g., Reports/ and Logs/ folders), sample sheet handling is stricter, and some file naming details may differ. If your pipeline depends on the older layout, contact CAT for guidance.
What Is in My Project Folder?
Illumina projects typically include FASTQ files, Reports/, Logs/, and a SampleSheet.csv.
PacBio projects typically include BAM files (and indexes), run metadata, and summary/report files.
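After downloading an Illumina project, a simple sanity check is to count the reads in each FASTQ file. The sketch below assumes the usual four-line FASTQ record layout and gzip compression; `count_reads` is a hypothetical helper name.

```shell
# count_reads FILE.fastq.gz
# Prints the number of reads, assuming 4 lines per FASTQ record.
count_reads() {
  lines=$(gzip -cd "$1" | wc -l)
  echo $((lines / 4))
}

# Example: report the read count for every FASTQ file in a folder
# for f in *.fastq.gz; do
#   printf '%s\t%s\n' "$f" "$(count_reads "$f")"
# done
```

For paired-end data, the R1 and R2 files of a sample should report the same count.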
Common Errors and Quick Fixes
- Permission denied or login failures: confirm you are using the correct username/password from the Data Ready email.
- Slow or interrupted transfers: use rsync with --partial and --append-verify to resume.
- Partial files: re-run rsync or verify with checksums.
- Missing expected samples: review Top_Unknown_Barcodes.csv and your submitted barcodes.
Contact Us
If you need assistance from the CAT Core, please email one of the following:
- [email protected] if there is an issue with data generation. Please review the CAT Data FAQ before emailing.
- [email protected] for assistance with project and data analysis.
- [email protected] for assistance from the CAT web lab.
Glossary
- FASTQ: Text format containing sequencing reads and quality scores (usually compressed as .fastq.gz).
- BAM: Binary alignment format commonly used for PacBio data; includes read sequence and metadata.
- Demultiplexing: Assigning reads to samples using barcode/index sequences.
- Lane: Physical lane on a flow cell; a run can have multiple lanes.
- Index (Barcode): Short sequence used to identify samples pooled in a run.