Shell-Code-Large is a large-scale corpus of Shell scripting source code comprising approximately 640,000 code samples stored in JSON Lines (.jsonl) format. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, DevOps automation, cloud infrastructure engineering, system administration, and software engineering automation.
By providing a high-volume, language-specific corpus focused exclusively on Shell scripting, Shell-Code-Large enables systematic experimentation in automation workflows, deployment pipelines, infrastructure management, and command-line tooling. These domains remain foundational to Linux systems, cloud-native platforms, CI/CD environments, and modern DevOps practices.
Shell-Code-Large addresses the need for a dedicated Shell-focused dataset at substantial scale, enabling targeted research into scripting patterns, command composition, workflow orchestration, infrastructure automation, and operational engineering practices.
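Because the corpus is stored as JSON Lines, each line of a file is one standalone JSON object. The sketch below shows one way to stream such a file without loading it all into memory. The function names `parse_jsonl` and `iter_jsonl_file` are illustrative, and the record schema is not described here, so the code makes no assumption about field names.

```python
import json
from typing import Any, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, which some exporters leave at the end of a file.
            continue
        yield json.loads(line)


def iter_jsonl_file(path: str) -> Iterator[Any]:
    """Stream records from a .jsonl file at `path` (a hypothetical local path)."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_jsonl(handle)
```

Iterating with `for record in iter_jsonl_file(path)` keeps memory use roughly constant, which matters for a corpus of about 640,000 samples.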
How to tell your chat LLM to write more naturally: add the text below to its personality.

Prefer specific facts over vague importance. Do not inflate significance with phrases like “plays a pivotal role” or “marks a turning point.” Use numbers, dates, mechanisms, or measurable outcomes. Example: replace “the system changed logistics” with “the system reduced container dwell time from 6.2 to 4.1 days.”
Avoid promotional language. Keep a neutral tone. Do not use adjectives such as vibrant, groundbreaking, renowned, innovative, or powerful. Use plain wording.
Limit AI-typical vocabulary such as crucial, pivotal, intricate, tapestry, underscore, highlighting, emphasizing, showcasing, fostering, or enhance. Prefer simpler words.
Avoid generic commentary and vague attribution. Do not write “this reflects broader trends,” “experts say,” or “researchers suggest” unless a named source is given.
Avoid formulaic structures such as “not only X but also Y” or “despite its success it faces challenges.” Use direct explanations.
Use lists sparingly. Prefer short paragraphs unless bullets improve clarity. Avoid triple-adjective patterns.
Prefer simple sentences like “X is Y” or “the system uses Z.” Minimize formatting. Avoid emojis, decorative headings, and excessive bold.
Remove sentences that add no information. Avoid generic endings such as “in conclusion” or “overall.” Use concrete examples, real actors, workflows, and technologies when possible. Write like technical documentation or a research summary, not marketing or blog prose.
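One way to check model output against the vocabulary rules above is a small word-list scan. This is a minimal sketch, not part of the prompt itself: the word list is copied from the two rules on promotional and AI-typical vocabulary, and the name `flag_words` is hypothetical. It matches exact word forms only, so "underscores" or "enhanced" would not be flagged.

```python
import re

# Words taken from the promotional-language and AI-typical-vocabulary rules above.
FLAGGED = (
    "vibrant", "groundbreaking", "renowned", "innovative", "powerful",
    "crucial", "pivotal", "intricate", "tapestry", "underscore",
    "highlighting", "emphasizing", "showcasing", "fostering", "enhance",
)

# Whole-word, case-insensitive match on the exact forms listed.
_PATTERN = re.compile(r"\b(" + "|".join(FLAGGED) + r")\b", re.IGNORECASE)


def flag_words(text: str) -> list:
    """Return the sorted, lower-cased set of flagged words found in `text`."""
    return sorted({match.group(1).lower() for match in _PATTERN.finditer(text)})
```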
**A collection of 8 code models (3B–20B) trained to behave like a security reviewer.**
## The Problem
Code assistants frequently recommend patterns that pass tests but fail security review: string-built SQL, brittle auth logic, unsafe parsing, and insecure defaults. I built SecureCode to address this gap.
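To make the first item concrete, the sketch below contrasts a string-built SQL query with a parameterized one using Python's standard `sqlite3` module. The table, rows, and function names are invented for illustration and are not taken from the SecureCode dataset.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two example users (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # String-built SQL: the input becomes part of the query text,
    # so a quote character in `name` can change the query's logic.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver passes `name` as data, never as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Both functions return the same row for the input `alice`, so a functional test would pass either one. With the input `x' OR '1'='1`, the unsafe version returns every row and the parameterized version returns none.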
You are a senior application security engineer. Review the code below.
Output:
(1) findings with severity,
(2) likely exploit scenarios (high level),
(3) secure rewrite,
(4) defense-in-depth recommendations,
(5) regression tests/checks.
Code: `...`
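The review prompt above can be filled in programmatically before it is sent to a model. The following is a minimal sketch that assumes the snippet replaces the `...` placeholder. The names `REVIEW_PROMPT_TEMPLATE` and `build_review_prompt` are hypothetical, and no particular inference API is assumed.

```python
# Template text copied from the prompt above; {code} marks where the snippet goes.
REVIEW_PROMPT_TEMPLATE = """You are a senior application security engineer. Review the code below.
Output:
(1) findings with severity,
(2) likely exploit scenarios (high level),
(3) secure rewrite,
(4) defense-in-depth recommendations,
(5) regression tests/checks.
Code: `{code}`"""


def build_review_prompt(code: str) -> str:
    """Return the review prompt with `code` inserted in place of the placeholder."""
    # str.format only interprets braces in the template, not in the argument,
    # so snippets that contain { or } are inserted unchanged.
    return REVIEW_PROMPT_TEMPLATE.format(code=code)
```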
## Dataset Coverage
SecureCode covers both traditional and emerging security domains:

- **Traditional web security** (OWASP Top 10 2021)
- **AI/ML security** (OWASP LLM Top 10 2025): prompt injection, RAG poisoning, model extraction, agentic AI patterns
## We Want Your Feedback
We're looking for real-world contributions:
- **Real snippets**: Share code that "slipped through review once" (sanitized is fine)
- **False positives/negatives**: What didn't work as expected?
- **CVE-grounded examples**: New vulnerability patterns you've encountered
**Please include**: language/framework + what the correct remediation looks like in your environment.
---
**Have contributions or suggestions?** I'd be happy to hear them. Thanks for your support!