CAT Data FAQ

You've just received an email that your data is ready to download. This page explains how to download CAT sequencing data, what files you will receive, how long data are retained, and what to do if you need help.

How Do I Download My Data?

When your data are ready, you will receive a “Data Ready” email with the exact data path and credentials. You can download using SFTP or rsync. Data are typically available for about 3 months on fast SSD storage, then moved to slower storage for roughly another 3 months. Please download and back up your data as soon as possible.

How Do I Use SFTP (GUI)?

We recommend Cyberduck (Windows/Mac) or FileZilla. Server: fastq.ucsf.edu, Port: 22. Username and password are included in the Data Ready email.

How Do I Use SFTP (Command Line)?

sftp <username>@fastq.ucsf.edu   # enter your password when prompted

Useful sftp commands:

  • ls # list remote directory
  • cd DIR # change remote directory
  • pwd # show remote working directory
  • lls # list local directory
  • lcd DIR # change local directory
  • lpwd # show local working directory
  • get filename # download one file
  • get -r directory # download a whole directory
  • exit # quit SFTP

How Do I Use rsync? (Recommended for Large Transfers)

rsync -a --append-verify --partial --info=progress2 --no-o --no-g --no-perms --compress-level=0 \
    <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
  

To list files without copying:

rsync -av --list-only <username>@fastq.ucsf.edu:<PATH_TO_DATA> <LOCAL_DESTINATION>
  

Note: “Could not chdir to home directory …” can be ignored.
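For very large or unreliable transfers, it can help to wrap the rsync command above in a small retry loop. The following is a minimal sketch, not a CAT-provided tool: the function names build_rsync_command and run_with_retries, the retry count, and the wait time are all illustrative assumptions. The rsync flags are copied from the command shown above.

```python
import subprocess
import time


def build_rsync_command(username, remote_path, local_dest, host="fastq.ucsf.edu"):
    """Build the rsync argument list using the flags recommended in this FAQ."""
    return [
        "rsync", "-a", "--append-verify", "--partial", "--info=progress2",
        "--no-o", "--no-g", "--no-perms", "--compress-level=0",
        f"{username}@{host}:{remote_path}", local_dest,
    ]


def run_with_retries(command, attempts=5, wait_seconds=30, runner=subprocess.run):
    """Run the command until it exits with status 0; return the attempt number.

    The attempts and wait_seconds defaults are arbitrary illustrative values.
    """
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(wait_seconds)  # pause before resuming the transfer
    raise RuntimeError(f"rsync still failing after {attempts} attempts")
```

A call such as run_with_retries(build_rsync_command("<username>", "<PATH_TO_DATA>", "<LOCAL_DESTINATION>")) would re-run rsync after each failure. Because the process inherits your terminal, rsync will prompt for the password on each attempt.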

Data Integrity (Checksums)

Your data folder will include a checksum file (for example fastq_checksums.md5 or checksums.md5). After download, verify integrity with:

md5sum -c fastq_checksums.md5
  

If you are on macOS and don’t have md5sum, install coreutils or use an equivalent tool to compare hashes.
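If md5sum is not available, the same check can be done with Python's standard library. This is a minimal sketch that assumes the checksum file uses the usual md5sum layout (one "<hash>  <filename>" entry per line, with paths relative to the checksum file). The function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return a list of (filename, problem) tuples; an empty list means all files match."""
    manifest = Path(manifest_path)
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary-mode entries with '*'
        target = manifest.parent / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif md5_of_file(target) != expected.lower():
            failures.append((name, "mismatch"))
    return failures
```

For example, verify_manifest("fastq_checksums.md5") run from the download folder returns an empty list when every listed file is present and intact.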

Data Retention and Regeneration

Data are initially available from fast storage at /volume1/SSD/illumina/run_id. After approximately 3 weeks, data are moved to slower storage at /volume2/tortoise/run_id or /volume3/slow/run_id, and are deleted after approximately 3 months. Raw run data are retained for approximately 6 months. You can request regeneration through iLabs for a $100 fee per project. We recommend backing up your data in at least two locations immediately after download.

We only guarantee data availability for approximately 6 months total.

Missing Samples (Barcodes)

If expected samples are missing from your results, the most common causes are an incorrect sample sheet (wrong barcodes or orientation) or low/absent representation in the library. Before contacting CAT, review Reports/Top_Unknown_Barcodes.csv to see barcode pairs that are present in significant quantity but may be missing from the results. Also review the SampleSheet.csv found in the bcl-convert outputs. The pipeline attempts to identify and add such barcodes automatically, but review is still recommended.

How Sample Names and Barcodes Are Edited

Sample IDs are sanitized for compatibility: Unicode is converted to ASCII, punctuation is removed, whitespace is converted to underscores, and only letters, numbers, underscores, and dashes are allowed. If two Sample IDs collide after sanitization, a short hash is appended (_hXXXXXX) to make them unique. Barcodes are normalized to uppercase A/C/G/T only, N characters are interpreted as UMI bases and are included in the output, and invalid entries are rejected. Some workflows may omit i5 or enforce a fixed index length.
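To preview how your Sample IDs might look after sanitization, the rules above can be approximated as follows. This is an illustrative sketch, not the CAT pipeline's code. The order of the steps, the use of SHA-1 over the original ID for the six-character suffix, and all function names are assumptions. UMI handling of N bases is not modeled, so N is simply accepted.

```python
import hashlib
import re
import unicodedata
from collections import Counter


def sanitize_sample_id(raw):
    """Convert to ASCII, turn whitespace into underscores, drop other characters."""
    ascii_text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    underscored = re.sub(r"\s+", "_", ascii_text.strip())
    return re.sub(r"[^A-Za-z0-9_-]", "", underscored)


def make_unique(raw_ids):
    """Sanitize all IDs; append _hXXXXXX to any that collide after sanitization."""
    cleaned = [sanitize_sample_id(raw) for raw in raw_ids]
    counts = Counter(cleaned)
    result = []
    for raw, clean in zip(raw_ids, cleaned):
        if counts[clean] > 1:
            # Assumption: the suffix is derived from the original, unsanitized ID.
            suffix = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:6]
            clean = f"{clean}_h{suffix}"
        result.append(clean)
    return result


def normalize_barcode(raw):
    """Uppercase a barcode and reject anything outside A/C/G/T/N."""
    barcode = raw.strip().upper()
    if not barcode or not set(barcode) <= set("ACGTN"):
        raise ValueError(f"invalid barcode: {raw!r}")
    return barcode
```

Under these assumptions, "Café sample #1" becomes Cafe_sample_1, and the IDs "a b" and "a_b" both sanitize to a_b and therefore each receive a distinct _h suffix.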

What Is the CAT Project Report?

Each project output includes a PDF report (<PROJECT_ID>_project_report.pdf) that summarizes project-level sequencing quality and demultiplexing performance. The report is intended to help labs quickly identify project-specific issues and improve future library prep, indexing, and pooling.

Use the report in this order:

  • Lane Summary
  • Sample Summary
  • Reads per Sample
  • Top Unknown Barcodes
  • Glossary

How Should I Interpret the Project Report?

In most cases, project-level issues are related to sample sheet metadata or library quality, not instrument failure.

Common patterns:

  • Incorrect sample sheet barcodes are a common cause of poor demultiplexing.
  • Increasing barcode mismatch to 2 is generally not recommended.
  • Lower demultiplexing and lower quality metrics are often driven by library quantity, quality, and fragment size.
  • Flowcell-level issues usually affect many/all lanes, not just one or a few lanes. CAT will proactively notify customers when a true flowcell-wide issue is observed.

Lane Summary: What to Look For

The Lane Summary reports lane-level reads, bases, %Q30, PhiX alignment, lane-level index hopping estimate, and % Demuxed.

The sample distribution plots (Reads (M), %Q30, Mean Q) show one point per sample.
Focus on spread:

  • Tight spread usually indicates consistent sample prep/library quality/pooling.
  • Wide spread or outliers usually indicate sample-specific issues.
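If you want to put a number on the spread, one simple option is the coefficient of variation of reads per sample, together with a list of samples well below the median. This is a sketch only: the report does not define these metrics, and the low_fraction cutoff of 0.5 is an arbitrary illustrative value, not a CAT threshold.

```python
import statistics


def read_spread(reads_by_sample, low_fraction=0.5):
    """Return (coefficient of variation, samples below low_fraction * median)."""
    counts = list(reads_by_sample.values())
    mean = statistics.mean(counts)
    cv = statistics.pstdev(counts) / mean if mean else float("nan")
    median = statistics.median(counts)
    low = sorted(name for name, count in reads_by_sample.items()
                 if count < low_fraction * median)
    return cv, low
```

A coefficient of variation near zero corresponds to a tight spread. Samples returned in the second value are candidates for a closer look at library quality or pooling.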

Sample Summary and Undetermined Reads

Sample Summary includes per-sample reads, quality, and index match stats, plus Undetermined reads (reads not assigned to expected sample barcodes).

Undetermined reads can include PhiX unless PhiX is explicitly listed in the SampleSheet.
For NovaSeq X demultiplexing behavior and library/index effects, see:
Effects of library quality and indexes on demultiplexing NovaSeq X Series data

Top Unknown Barcodes: How to Use It

Top Unknown Barcodes helps troubleshoot missing/low samples by showing barcode pairs observed by the instrument that were not assigned as expected.

Use this section to:

  • Compare observed index/index2 pairs against your submitted SampleSheet
  • Check for orientation/index-entry mistakes
  • Spot technical artifacts (for example Poly-G or adapter-like sequences)

Only top entries are shown in the PDF; use Reports/Top_Unknown_Barcodes.csv in your project folder for the full list.
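The CSV can also be inspected programmatically. The sketch below assumes the file has columns named index, index2, and "# Reads"; check the header row of your own file and adjust if it differs. It lists the most abundant unknown barcode pairs and marks entries where either index consists entirely of G bases.

```python
import csv


def top_unknown_barcodes(csv_path, limit=10, reads_column="# Reads"):
    """Return (index, index2, reads, is_poly_g) tuples, most abundant first."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: int(row[reads_column]), reverse=True)
    report = []
    for row in rows[:limit]:
        index = row["index"]
        index2 = row.get("index2") or ""  # index2 may be absent for single-index runs
        poly_g = any(seq and set(seq) == {"G"} for seq in (index, index2))
        report.append((index, index2, int(row[reads_column]), poly_g))
    return report
```

Rows are not aggregated across lanes, so the same pair may appear more than once in a multi-lane file.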

Barcode Collisions and Automatic Corrections

Some barcodes are too similar (or contain Ns) and may be filtered to avoid misassignment. The CAT pipeline uses the run's top emitted barcodes to infer dominant unexpected barcodes, attempts reverse-complement corrections when needed, and may write a corrected sample sheet. If corrections are unsafe, it skips the changes and reports a warning.
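You can run a similar orientation check on your own barcodes. The sketch below is not the CAT pipeline's logic. It only illustrates the idea of testing whether an expected index pair is absent from the observed pairs while the same pair with a reverse-complemented index2 is present. The function names and the choice to flip only index2 are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def suggest_index2_fix(expected_pairs, observed_pairs):
    """Map sample name -> (index, flipped index2) where only the flipped pair was observed."""
    observed = set(observed_pairs)
    suggestions = {}
    for sample, (index, index2) in expected_pairs.items():
        if (index, index2) in observed:
            continue  # pair seen as submitted; nothing to suggest
        flipped = (index, reverse_complement(index2))
        if flipped in observed:
            suggestions[sample] = flipped
    return suggestions
```

Here expected_pairs would come from your submitted sample sheet and observed_pairs from Reports/Top_Unknown_Barcodes.csv. A non-empty result suggests that index2 was entered in the wrong orientation for those samples.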

bcl2fastq vs. bcl-convert

CAT now uses Illumina’s bcl-convert instead of bcl2fastq. The output is still FASTQ, but report/log layout is updated (e.g., Reports/ and Logs/ folders), sample sheet handling is stricter, and some file naming details may differ. If your pipeline depends on the older layout, contact CAT for guidance.

What Is in My Project Folder?

Illumina projects typically include FASTQ files, Reports/, Logs/, and a SampleSheet.csv.

PacBio projects typically include BAM files (and indexes), run metadata, and summary/report files.

Common Errors and Quick Fixes

  • Permission denied or login failures: confirm you are using the correct username/password from the Data Ready email.
  • Slow or interrupted transfers: use rsync with --partial and --append-verify to resume.
  • Partial files: re‑run rsync or verify with checksums.
  • Missing expected samples: review Top_Unknown_Barcodes.csv and your submitted barcodes.

Contact Us

If you need assistance from the CAT Core please email either:

Glossary

  • FASTQ: Text format containing sequencing reads and quality scores (usually compressed as .fastq.gz).
  • BAM: Binary alignment format commonly used for PacBio data; includes read sequence and metadata.
  • Demultiplexing: Assigning reads to samples using barcode/index sequences.
  • Lane: Physical lane on a flow cell; a run can have multiple lanes.
  • Index (Barcode): Short sequence used to identify samples pooled in a run.

Referencing the CAT Core and Publications

 

When submitting manuscripts, please acknowledge the CAT by including the text: “Sequencing was performed at the UCSF CAT, supported by UCSF PBBR, RRP IMIA, and NIH 1S10OD028511-01 grants.”

To aid our facility in future grant submissions, it helps to reference labs and publications that have benefited from our core. Please notify us of your achievement by emailing the PMID number, lab, and piece of equipment used in the paper to catcore [at] ucsf.edu.