Locoshop
Montreal, Quebec, Canada
Founder
Preferred learners
  • Anywhere
  • Academic experience
Categories
Computer science & IT, Data analysis, Website development, Software development, Machine learning, Artificial intelligence
Project scope
What is the main goal for this project?

Locoshop aggregates product data from millions of e-commerce websites globally, enabling consumers to efficiently find the products they seek.


However, we face a significant challenge: accurately matching identical products when the product information varies across different stores.


Currently, we utilize Elasticsearch with keyword fuzziness to match products. This method depends heavily on the similarity of product information entered by different stores, leading to limitations in accuracy.
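
As a rough illustration only (not our production query), a fuzzy title search of this kind might look like the sketch below. The field name `title` and the helper `build_fuzzy_title_query` are assumptions made for the example; the `match` query with a `fuzziness` setting is standard Elasticsearch Query DSL.

```python
import json


def build_fuzzy_title_query(title: str, size: int = 10) -> dict:
    """Build an Elasticsearch request body for a fuzzy match on a title field."""
    return {
        "size": size,
        "query": {
            "match": {
                "title": {                # hypothetical field name
                    "query": title,
                    "fuzziness": "AUTO",  # allowed edit distance scales with term length
                    "operator": "and",    # every term must match, exactly or fuzzily
                }
            }
        },
    }


body = build_fuzzy_title_query("Nike Air Zoom Pegasus 40")
print(json.dumps(body, indent=2))
```

Fuzziness tolerates small spelling differences within a term, but it does not help when two stores describe the same product with different words, which is the limitation described above.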


To enhance our capability to match identical products with high certainty, we need to develop more sophisticated matching techniques. This is crucial, especially under the following conditions:


  • Product descriptions vary significantly from store to store.
  • Product images differ across stores.
  • UPC (or Manufacturer Product ID) codes are not consistently available.


We are exploring advanced solutions, including machine learning algorithms that can analyze images and text to recognize products regardless of description variability, and data enrichment techniques to compensate for missing information like UPC codes.


This will enable us to meet our desired use cases with the required accuracy.


What tasks will learners need to complete to achieve the project goal?

To match identical products across e-commerce platforms despite varying descriptions, differing images, and occasionally missing UPC codes, learners will need to complete the activities below.


These tasks draw on both technical and analytical skills and cover data collection, preprocessing, model development, and testing. Here is a structured breakdown:


Data Collection and Integration:

  • Gather product data from various e-commerce sources.
  • Ensure the collection of a variety of fields such as names, descriptions, images, prices, and available identification codes (like UPC).
  • Develop mechanisms to continuously ingest and update the data as product listings change over time.


Data Cleaning and Preprocessing:

  • Normalize text data (product descriptions and titles) to a consistent format.
  • Handle missing values, especially for critical fields like UPC codes.
  • Use image preprocessing techniques to standardize product images for analysis.
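
The text-normalization and missing-value steps above could start from something like the following minimal sketch. The function names, the list of "missing" tokens, and the assumption that titles are in Latin script are all illustrative choices, not part of our existing pipeline.

```python
import re
import unicodedata

# Strings that some stores use in place of a real value (assumed list).
MISSING_TOKENS = {"", "n/a", "na", "none", "null", "-"}


def normalize_title(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Assumes Latin-script titles: anything else is replaced by a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return " ".join(cleaned.split())


def clean_optional(value):
    """Return a stripped string, or None when the value is effectively missing."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_TOKENS else text


raw = {"title": "  Café-Crème  MAKER, 2-Pack! ", "upc": "N/A"}
clean = {"title": normalize_title(raw["title"]), "upc": clean_optional(raw["upc"])}
print(clean)  # {'title': 'cafe creme maker 2 pack', 'upc': None}
```

Representing missing UPC codes explicitly as `None` lets later stages decide whether to fall back to text and image matching rather than comparing placeholder strings.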


Feature Engineering:

  • Extract features from text data using NLP techniques such as TF-IDF, word embeddings, or BERT-like models.
  • Develop image feature extraction methods using CNNs or pre-trained models like ResNet.
  • Create composite features that might help in matching products, such as category-specific attributes.
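
To make the text-feature idea concrete, here is a small pure-Python sketch of TF-IDF vectors compared by cosine similarity. It is a teaching example with made-up titles and a simple whitespace tokenizer; in practice learners would more likely use an established library or embedding model.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example titles from three hypothetical stores.
titles = [
    "apple iphone 15 pro 256gb black",
    "iphone 15 pro 256 gb black titanium apple",
    "samsung galaxy s24 ultra 256gb black",
]
vecs = tfidf_vectors(titles)
same_product = cosine(vecs[0], vecs[1])
different_product = cosine(vecs[0], vecs[2])
print(round(same_product, 3), round(different_product, 3))
```

The two iPhone listings score higher than the iPhone and Samsung pair even though word order and spacing differ, which is the property a feature set for matching needs.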


Developing Matching Algorithms:

  • Implement and test various matching algorithms, from simple fuzzy matching to more complex machine learning models.
  • Explore and integrate advanced matching techniques such as Siamese networks for images and semantic similarity models for text.
  • Evaluate the use of ensemble methods to combine results from different models for improved accuracy.
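
As a starting baseline for the algorithms above, the sketch below combines two simple similarity signals into one weighted score. The weights, the 0.6 threshold, and the function names are placeholders chosen for illustration; a learned model such as a Siamese network would replace or feed into this kind of scoring.

```python
from difflib import SequenceMatcher


def title_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct tokens the two titles have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def ensemble_score(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Weighted average of the two signals (weights are placeholders)."""
    w_ratio, w_jaccard = weights
    return w_ratio * title_ratio(a, b) + w_jaccard * token_jaccard(a, b)


def is_match(a: str, b: str, threshold: float = 0.6) -> bool:
    """Declare a match when the combined score reaches the threshold."""
    return ensemble_score(a, b) >= threshold


store_a = "Sony WH-1000XM5 Wireless Headphones Black"
store_b = "Sony WH-1000XM5 Headphones Black Wireless"
store_c = "Bose QuietComfort 45 Headphones White"
print(is_match(store_a, store_b), is_match(store_a, store_c))
```

Keeping each signal as a separate function makes it straightforward to add image or embedding similarity later and to tune the weights on labeled pairs.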


Testing and Validation:

  • Split the data into training, validation, and test sets to evaluate the effectiveness of the matching models.
  • Use metrics such as precision, recall, and F1-score to assess performance.
  • Conduct manual reviews to validate matches in cases where automatic measures are inconclusive.
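
The metrics above can be computed directly from sets of predicted and manually labeled matching pairs. The following is a minimal sketch; the listing IDs are invented for the example.

```python
def pair_metrics(predicted: set, actual: set):
    """Precision, recall, and F1 for predicted match pairs against labeled pairs."""
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)
    false_neg = len(actual - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each pair is (listing ID in store A, listing ID in store B); IDs are invented.
actual = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

precision, recall, f1 = pair_metrics(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predictions are correct (precision 2/3) and two of four true matches are found (recall 1/2), giving an F1 of 4/7.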


System Integration:

  • Develop a system architecture that integrates the matching models into the existing Elasticsearch setup.
  • Ensure that the system can handle large volumes of data and requests efficiently.
  • Implement robust error handling and logging mechanisms.


User Interface and Experience:

  • Design interfaces that allow users to review and correct matches if necessary.
  • Implement feedback mechanisms to continuously improve the matching algorithms based on user inputs.


Performance Tuning and Optimization:

  • Optimize the matching algorithms and system configuration for speed and accuracy.
  • Test scalability and make necessary adjustments to handle peak loads.


Deployment and Monitoring:

  • Deploy the system in a production environment.
  • Monitor the system’s performance and make iterative improvements based on real-world usage patterns and feedback.


Documentation and Training:

  • Document the system design, algorithms used, and operational procedures.
  • Train team members on system maintenance, troubleshooting, and future development.


These activities will provide a comprehensive approach to tackling the challenge of product matching across diverse e-commerce platforms, ultimately enhancing the accuracy and reliability of the search and match functionality.


How will you support learners in completing the project?

Our CTO will be available to answer any code- or database-related questions.

Learners will also have access to internal product databases upon request.


What skills or technologies will help learners to complete the project?


Natural Language Processing (NLP) techniques:

  • Term Frequency-Inverse Document Frequency (TF-IDF)
  • Word embeddings
  • BERT-like models, which process and analyze large amounts of natural language data


Open-source image recognition libraries:

  • OpenCV: Optimized for real-time applications, supports extensive image processing operations.
  • TensorFlow: Comprehensive machine learning platform, ideal for building and training CNNs for image tasks.
  • Keras: High-level neural networks API, runs on top of TensorFlow, designed for fast experimentation.
  • PyTorch: Flexible and fast, favored in research for developing complex machine learning models.
  • Darknet: Underpins YOLO, known for real-time object detection with direct GPU computation.
  • SimpleCV: User-friendly framework for building computer vision applications, integrates well with OpenCV.
  • scikit-image


UPC database APIs for data enrichment:


  • GS1 UPC API (https://www.gs1us.org/tools/gs1-us-data-hub/product/view-use)
  • Crowdsourced UPC databases (https://go-upc.com/)
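
Before sending a code to either service, it can be worth checking that it is a well-formed UPC-A number. The sketch below implements the standard UPC-A check-digit rule; it does not call the APIs listed above, since their request formats are not covered here, and the function names are illustrative.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit from the first 11 digits."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """True when the code has 12 digits and its last digit is the correct check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])


print(is_valid_upc_a("036000291452"))  # True
print(is_valid_upc_a("036000291453"))  # False
```

Rejecting malformed codes early avoids wasted lookups and keeps mistyped UPCs from being treated as strong evidence of a match.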


Each of the NLP techniques listed above processes text differently and is chosen based on the specific needs of the task, such as the level of semantic understanding required, the computational resources available, and the nature of the language data.

Supported causes
Sustainable cities and communities
About the company

Locoshop is the world's first global search engine dedicated to helping shoppers locate the brand-name products they want in nearby stores, boutiques, and shopping centers.