Model Code Guideline
Below is a set of guidelines on how to structure the codebase so that we can consistently transform raw data into production-ready features and quickly deploy models to production environments. The primary goal is to ensure reusability, maintainability, and consistency of feature transformations.
1. Data Processing
- Please Keep Data Transformations in Their Own Module.
  - Why?
    - Separating transformations from model training logic makes it easier to reuse these transformations for inference.
    - This separation also keeps your code cleaner and more maintainable.
  - How to do it?
    - If you write Python scripts, create a dedicated Python module (e.g., `data_processing.py` or `feature.py`) containing functions or classes for each transformation.
    - If you use notebooks, create a separate section in the notebook that contains functions or classes for each transformation.
    - Each data-processing function or class should have a clearly defined input (e.g., a DataFrame or multiple DataFrames) and output (e.g., a transformed DataFrame).
  - Example:
```python
# data_processing.py or data-processing section
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """
    Applies developer-specific transformations on the DataFrame
    and returns the result.
    """
    # ...
    return transformed_df
```
- Please Use the Same Code for Training and Inference.
  - Why?
    - If you replicate transformations by rewriting them in separate scripts, you risk introducing mismatches or errors.
    - A common approach is to define transformations in one place and call them from both your training code and your inference code.
  - How to do it?
    - During training, call the transformations from your data-processing module before training the model.
    - During inference, import the same data-processing functions and apply them to incoming data.
```python
# train.py or train section
from data_processing import clean_text, encode_categorical

def train_model(training_data):
    df = load_data(training_data)
    df['text_clean'] = clean_text(df['text'])
    df_encoded = encode_categorical(df['category'])
    # ...
    # Train your model using the processed features
    model.fit(df_encoded, df['target'])
```
```python
# inference.py or inference section
from data_processing import clean_text, encode_categorical

def predict(model, new_data):
    df = load_data(new_data)
    df['text_clean'] = clean_text(df['text'])
    df_encoded = encode_categorical(df['category'])
    # ...
    return model.predict(df_encoded)
```
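For concreteness, the shared module referenced above might look like the following minimal sketch. The function names `clean_text` and `encode_categorical` come from the examples above; the cleaning and encoding logic shown here is illustrative, not prescriptive.

```python
# data_processing.py: one possible implementation of the shared
# transformations imported by both train.py and inference.py.
# (The cleaning/encoding logic below is illustrative only.)
import pandas as pd

def clean_text(text: pd.Series) -> pd.Series:
    # Example cleaning: lowercase and strip surrounding whitespace.
    return text.str.lower().str.strip()

def encode_categorical(categories: pd.Series) -> pd.DataFrame:
    # Example encoding: one-hot encode the categorical column.
    return pd.get_dummies(categories)
```

Note that stateful transformations such as one-hot encoding generally need their fitted state (e.g., the set of known categories) persisted alongside the model, so that inference produces the same feature columns as training.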
2. Model Development
Please separate your training logic from your inference (i.e., prediction) logic by developing at least two distinct scripts/modules:
- Training Script / Section
  - Purpose: Orchestrates data loading (training set), applies any data preprocessing or feature engineering, and trains the model.
  - Outputs:
    - A trained model artifact (e.g., a `.pth`, `.pt`, or `.h5` file, depending on your framework).
    - Optional: logs or metrics for evaluation (e.g., validation accuracy, loss).
- Inference (Prediction) Script / Section
  - Purpose: Loads the trained model artifact, applies the same data preprocessing steps (in the same order and format) as in training, then runs forward passes to get predictions (see the save/load sketch after this list).
  - Inputs: The trained model artifact plus any new (unlabeled) input data.
  - Outputs: Predicted labels, numeric values, or other model-specific outputs.
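As a minimal sketch of the artifact hand-off between the two scripts, assuming PyTorch (the `model.pth` path and helper names are placeholders; adapt the save/load calls to your framework):

```python
import torch

# train.py: persist the trained model artifact at the end of training.
def save_artifact(model, path: str = "model.pth"):
    # Saving the state_dict (rather than pickling the whole model)
    # keeps the artifact robust to refactors of the training code.
    torch.save(model.state_dict(), path)

# inference.py: restore the artifact before running predictions.
def load_artifact(model_class, path: str = "model.pth"):
    model = model_class()                  # re-instantiate the same architecture
    model.load_state_dict(torch.load(path))
    model.eval()                           # disable dropout / batch-norm updates
    return model
```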
3. What to upload
Archive your project folder into a `.zip` containing:
- If you use notebooks, include the following sections in the notebook. If you write Python scripts, include the following script files.
  - Data processing module
    - This ensures no training-serving skew in data transformations.
  - Model training module
  - Model inference/prediction module that handles model loading and inference logic (e.g., `inference.py` or `predict.py`):
    - Loads the model artifacts.
    - Applies the same data transformations used in training (at least, the ones necessary for inference).
    - Exposes an entry point (e.g., a function) to receive input data and produce predictions.
- Dependency Definitions
  - List all Python libraries and versions in a `requirements.txt` file (see the example layout after this list).
- Model Artifacts (weights/files)
  - Optional, as long as your results are reproducible by running your code.
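As a concrete sketch, an uploaded project might be laid out as follows; the file names echo the examples above and are placeholders, not required names:

```
project.zip
├── data_processing.py   # shared transformations
├── train.py             # training entry point
├── inference.py         # model loading + prediction entry point
├── requirements.txt     # pinned dependencies
└── model.pth            # optional artifact, if results are reproducible
```

The `requirements.txt` should pin each library to the version you developed against, e.g. (versions here are illustrative):

```
pandas==2.1.0
scikit-learn==1.3.2
torch==2.1.0
```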
By following these guidelines, you’ll end up with a more robust, maintainable, and production-friendly workflow—allowing us to integrate the models quickly while ensuring stable, consistent predictions.