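Today's arXiv Picks | 23 Top-Conference Papers: ICASSP / ICCV / CIKM / ICME / AAAI

About #Today's arXiv Picks

This is a column from 「AI 学术前沿」 (AI Academic Frontier), in which the editors select high-quality papers from arXiv each day and share them with readers.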

VerbCL: A Dataset of Verbatim Quotes for Highlight Extraction in Case Law

Comment: CIKM 2021, Resource Track

Link: http://arxiv.org/abs/2108.10120

Abstract

Citing legal opinions is a key part of legal argumentation, an expert task that requires retrieval, extraction and summarization of information from court decisions. The identification of legally salient parts in an opinion for the purpose of citation may be seen as a domain-specific formulation of a highlight extraction or passage retrieval task. As similar tasks in other domains such as web search show significant attention and improvement, progress in the legal domain is hindered by the lack of resources for training and evaluation. This paper presents a new dataset that consists of the citation graph of court opinions, which cite previously published court opinions in support of their arguments. In particular, we focus on the verbatim quotes, i.e., where the text of the original opinion is directly reused. With this approach, we explain the relative importance of different text spans of a court opinion by showcasing their usage in citations, and measuring their contribution to the relations between opinions in the citation graph. We release VerbCL, a large-scale dataset derived from CourtListener and introduce the task of highlight extraction as a single-document summarization task based on the citation graph establishing the first baseline results for this task on the VerbCL dataset.

Exploring Simple 3D Multi-Object Tracking for Autonomous Driving

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.10312

Abstract

3D multi-object tracking in LiDAR point clouds is a key ingredient for self-driving vehicles. Existing methods are predominantly based on the tracking-by-detection pipeline and inevitably require a heuristic matching step for the detection association. In this paper, we present SimTrack to simplify the hand-crafted tracking paradigm by proposing an end-to-end trainable model for joint detection and tracking from raw point clouds. Our key design is to predict the first-appear location of each object in a given snippet to get the tracking identity and then update the location based on motion estimation. In the inference, the heuristic matching step can be completely waived by a simple read-off operation. SimTrack integrates the tracked object association, newborn object detection, and dead track killing in a single unified model. We conduct extensive evaluations on two large-scale datasets: nuScenes and Waymo Open Dataset. Experimental results reveal that our simple approach compares favorably with the state-of-the-art methods while ruling out the heuristic matching rules.

Ranking Models in Unlabeled New Environments

Comment: 13 pages, 10 figures, ICCV 2021

Link: http://arxiv.org/abs/2108.10310

Abstract

Consider a scenario where we are supplied with a number of ready-to-use models trained on a certain source domain and hope to directly apply the most appropriate ones to different target domains based on the models' relative performance. Ideally we should annotate a validation set for model performance assessment on each new target environment, but such annotations are often very expensive. Under this circumstance, we introduce the problem of ranking models in unlabeled new environments. For this problem, we propose to adopt a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment, and use the performance rankings on the proxy sets as surrogates. We first select labeled datasets as the proxy. Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings. Motivated by this, we further propose to search the proxy set by sampling images from various datasets that have similar distributions as the target. We analyze the problem and its solutions on the person re-identification (re-ID) task, for which sufficient datasets are publicly available, and show that a carefully constructed proxy set effectively captures relative performance ranking in new environments. Code is available at https://github.com/sxzrt/Proxy-Set.
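
Editor's sketch: one way to check whether a proxy set "well reflects the true model rankings" is to compare the model ranking it induces with the ranking on a labeled target set using a rank correlation. The snippet below is a minimal illustration, not the authors' code; the choice of Spearman correlation, the function names, and the accuracy numbers are all assumptions made for the example.

```python
def ranks(values):
    """Return 1-based ranks (1 = highest value); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Made-up accuracies for four hypothetical models (illustration only).
proxy_acc = [0.71, 0.64, 0.80, 0.55]   # measured on the labeled proxy set
target_acc = [0.62, 0.58, 0.75, 0.41]  # would normally be unknown
rho = spearman(proxy_acc, target_acc)
```

A correlation close to 1 means the proxy set orders the models the same way the target domain would, which is the property the abstract asks of a good proxy.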

Towards Balanced Learning for Instance Recognition

Comment: Accepted by IJCV. Journal extension of paper arXiv:1904.02701

Link: http://arxiv.org/abs/2108.10175

Abstract

Instance recognition is rapidly advanced along with the developments of various deep convolutional neural networks. Compared to the architectures of networks, the training process, which is also crucial to the success of detectors, has received relatively less attention. In this work, we carefully revisit the standard training practice of detectors, and find that the detection performance is often limited by the imbalance during the training process, which generally consists in three levels – sample level, feature level, and objective level. To mitigate the adverse effects caused thereby, we propose Libra R-CNN, a simple yet effective framework towards balanced learning for instance recognition. It integrates IoU-balanced sampling, balanced feature pyramid, and objective re-weighting, respectively for reducing the imbalance at sample, feature, and objective level. Extensive experiments conducted on MS COCO, LVIS and Pascal VOC datasets prove the effectiveness of the overall balanced design.
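
Editor's sketch: the abstract names IoU-balanced sampling without giving details. The snippet below illustrates the general idea of drawing training candidates evenly across IoU ranges instead of uniformly at random. The equal-width bins, the per-bin quota, the top-up rule, and the function name `iou_balanced_sample` are assumptions for illustration, not the authors' implementation.

```python
import random


def iou_balanced_sample(candidates, num_samples, num_bins=3, max_iou=0.5, seed=0):
    """Pick candidates so that each IoU bin contributes roughly equally.

    candidates: list of (box_id, iou) pairs with 0 <= iou < max_iou.
    """
    rng = random.Random(seed)
    width = max_iou / num_bins
    bins = [[] for _ in range(num_bins)]
    for box_id, iou in candidates:
        idx = min(int(iou / width), num_bins - 1)
        bins[idx].append(box_id)

    per_bin = num_samples // num_bins
    chosen, leftovers = [], []
    for members in bins:
        rng.shuffle(members)
        chosen.extend(members[:per_bin])      # even quota from every bin
        leftovers.extend(members[per_bin:])

    # If some bins were too small, top up from the remaining candidates.
    rng.shuffle(leftovers)
    chosen.extend(leftovers[:num_samples - len(chosen)])
    return chosen


# Ten low-IoU candidates and only a few higher-IoU ones (made-up values).
cands = [(i, 0.05) for i in range(10)] + [(10, 0.2), (11, 0.25), (12, 0.4), (13, 0.45)]
picked = iou_balanced_sample(cands, 6)
```

With uniform random sampling the rare higher-IoU candidates would usually be under-represented; here every one of them is selected because each bin gets the same quota.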

ODAM: Object Detection, Association, and Mapping using Posed RGB Video

Comment: Accepted in ICCV 2021 as oral

Link: http://arxiv.org/abs/2108.10165

Abstract

Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and the object scale prior. We validate the proposed system on ScanNet where we show a significant improvement over existing RGB-only methods.

Deep Relational Metric Learning

Comment: Accepted to ICCV 2021. Source code available at https://github.com/zbr17/DRML

Link: http://arxiv.org/abs/2108.10026

Abstract

This paper presents a deep relational metric learning (DRML) framework for image clustering and retrieval. Most existing deep metric learning methods learn an embedding space with a general objective of increasing interclass distances and decreasing intraclass distances. However, the conventional losses of metric learning usually suppress intraclass variations which might be helpful to identify samples of unseen classes. To address this problem, we propose to adaptively learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions. We further employ a relational module to capture the correlations among each feature in the ensemble and construct a graph to represent an image. We then perform relational inference on the graph to integrate the ensemble and obtain a relation-aware embedding to measure the similarities. Extensive experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.

BiaSwap: Removing dataset bias with bias-tailored swapping augmentation

Comment: Accepted to ICCV'21

Link: http://arxiv.org/abs/2108.10008

Abstract

Deep neural networks often make decisions based on the spurious correlations inherent in the dataset, failing to generalize in an unbiased data distribution. Although previous approaches pre-define the type of dataset bias to prevent the network from learning it, recognizing the bias type in the real dataset is often prohibitive. This paper proposes a novel bias-tailored augmentation-based approach, BiaSwap, for learning debiased representation without requiring supervision on the bias type. Assuming that the bias corresponds to the easy-to-learn attributes, we sort the training images based on how much a biased classifier can exploit them as a shortcut and divide them into bias-guiding and bias-contrary samples in an unsupervised manner. Afterwards, we integrate the style-transferring module of the image translation model with the class activation maps of such biased classifier, which makes it possible to primarily transfer the bias attributes learned by the classifier. Therefore, given the pair of bias-guiding and bias-contrary, BiaSwap generates the bias-swapped image which contains the bias attributes from the bias-contrary images, while preserving bias-irrelevant ones in the bias-guiding images. Given such augmented images, BiaSwap demonstrates the superiority in debiasing against the existing baselines over both synthetic and real-world datasets. Even without careful supervision on the bias, BiaSwap achieves a remarkable performance on both unbiased and bias-guiding samples, implying the improved generalization capability of the model.

Image coding for machines: an end-to-end learned approach

Comment: Added typo fixes since the version accepted in IEEE ICASSP 2021

Link: http://arxiv.org/abs/2108.09993

Abstract

Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images. Given the dramatic explosion in the number of images generated per day, a question arises: how much better would an image codec targeting machine-consumption perform against state-of-the-art codecs targeting human-consumption? In this paper, we propose an image codec for machines which is neural network (NN) based and end-to-end learned. In particular, we propose a set of training strategies that address the delicate problem of balancing competing loss functions, such as computer vision task losses, image distortion losses, and rate loss. Our experimental results show that our NN-based codec outperforms the state-of-the-art Versatile Video Coding (VVC) standard on the object detection and instance segmentation tasks, achieving -37.87% and -32.90% of BD-rate gain, respectively, while being fast thanks to its compact size. To the best of our knowledge, this is the first end-to-end learned machine-targeted image codec.

Learned Image Coding for Machines: A Content-Adaptive Approach

Comment: Added some typo fixes since the accepted version in ICME 2021

Link: http://arxiv.org/abs/2108.09992

Abstract

Today, according to the Cisco Annual Internet Report (2018-2023), the fastest-growing category of Internet traffic is machine-to-machine communication. In particular, machine-to-machine communication of images and videos represents a new challenge and opens up new perspectives in the context of data compression. One possible solution approach consists of adapting current human-targeted image and video coding standards to the use case of machine consumption. Another approach consists of developing completely new compression paradigms and architectures for machine-to-machine communications. In this paper, we focus on image compression and present an inference-time content-adaptive finetuning scheme that optimizes the latent representation of an end-to-end learned image codec, aimed at improving the compression efficiency for machine-consumption. The conducted experiments show that our online finetuning brings an average bitrate saving (BD-rate) of -3.66% with respect to our pretrained image codec. In particular, at low bitrate points, our proposed method results in a significant bitrate saving of -9.85%. Overall, our pretrained-and-then-finetuned system achieves -30.54% BD-rate over the state-of-the-art image/video codec Versatile Video Coding (VVC).

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.09980

Abstract

Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning. This paper presents a new algorithm called Token-Aware Cascade contrastive learning (TACo) that improves contrastive learning using two novel techniques. The first is the token-aware contrastive loss which is computed by taking into account the syntactic classes of words. This is motivated by the observation that for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual contents in the video than the function words. Second, a cascade sampling method is applied to generate a small set of hard negative examples for efficient loss estimation for multi-modal fusion layers. To validate the effectiveness of TACo, in our experiments we finetune pretrained models for a set of downstream tasks including text-video retrieval (YouCook2, MSR-VTT and ActivityNet), video action step localization (CrossTask), video action segmentation (COIN). The results show that our models attain consistent improvements across different experimental settings over previous methods, setting new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.

Learning Signed Distance Field for Multi-view Surface Reconstruction

Comment: ICCV 2021 (Oral)

Link: http://arxiv.org/abs/2108.09964

Abstract

Recent works on implicit neural representations have shown promising results for multi-view surface reconstruction. However, most approaches are limited to relatively simple geometries and usually require clean object masks for reconstructing complex and concave objects. In this work, we introduce a novel neural surface reconstruction framework that leverages the knowledge of stereo matching and feature consistency to optimize the implicit surface representation. More specifically, we apply a signed distance field (SDF) and a surface light field to represent the scene geometry and appearance respectively. The SDF is directly supervised by geometry from stereo matching, and is refined by optimizing the multi-view feature consistency and the fidelity of rendered images. Our method is able to improve the robustness of geometry estimation and support reconstruction of complex scene topologies. Extensive experiments have been conducted on DTU, EPFL and Tanks and Temples datasets. Compared to previous state-of-the-art methods, our method achieves better mesh reconstruction in wide open scenes without masks as input.

Voxel-based Network for Shape Completion by Leveraging Edge Generation

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.09936

Abstract

Deep learning technique has yielded significant improvements in point cloud completion with the aim of completing missing object shapes from partial inputs. However, most existing methods fail to recover realistic structures due to over-smoothing of fine-grained details. In this paper, we develop a voxel-based network for point cloud completion by leveraging edge generation (VE-PCN). We first embed point clouds into regular voxel grids, and then generate complete objects with the help of the hallucinated shape edges. This decoupled architecture together with a multi-scale grid feature learning is able to generate more realistic on-surface details. We evaluate our model on the publicly available completion datasets and show that it outperforms existing state-of-the-art approaches quantitatively and qualitatively. Our source code is available at https://github.com/xiaogangw/VE-PCN.

SegMix: Co-occurrence Driven Mixup for Semantic Segmentation and Adversarial Robustness

Comment: Under submission at IJCV (BMVC 2020 Extension). arXiv admin note: substantial text overlap with arXiv:2008.05667

Link: http://arxiv.org/abs/2108.09929

Abstract

In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on (i) categorical clustering or (ii) the co-occurrence likelihood of categories. We then train a feature binding network which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation and saliency network while simultaneously increasing robustness to adversarial attacks.

A Weakly Supervised Amodal Segmenter with Boundary Uncertainty Estimation

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.09897

Abstract

This paper addresses weakly supervised amodal instance segmentation, where the goal is to segment both visible and occluded (amodal) object parts, while training provides only ground-truth visible (modal) segmentations. Following prior work, we use data manipulation to generate occlusions in training images and thus train a segmenter to predict amodal segmentations of the manipulated data. The resulting predictions on training images are taken as the pseudo-ground truth for the standard training of Mask-RCNN, which we use for amodal instance segmentation of test images. For generating the pseudo-ground truth, we specify a new Amodal Segmenter based on Boundary Uncertainty estimation (ASBU) and make two contributions. First, while prior work uses the occluder's mask, our ASBU uses the occlusion boundary as input. Second, ASBU estimates an uncertainty map of the prediction. The estimated uncertainty regularizes learning such that lower segmentation loss is incurred on regions with high uncertainty. ASBU achieves significant performance improvement relative to the state of the art on the COCOA and KINS datasets in three tasks: amodal instance segmentation, amodal completion, and ordering recovery.

CANet: A Context-Aware Network for Shadow Removal

Comment: This paper was accepted to the IEEE International Conference on Computer Vision (ICCV), Montreal, Canada, Oct 11-17, 2021

Link: http://arxiv.org/abs/2108.09894

Abstract

In this paper, we propose a novel two-stage context-aware network named CANet for shadow removal, in which the contextual information from non-shadow regions is transferred to shadow regions at the embedded feature spaces. At Stage-I, we propose a contextual patch matching (CPM) module to generate a set of potential matching pairs of shadow and non-shadow patches. Combined with the potential contextual relationships between shadow and non-shadow regions, our well-designed contextual feature transfer (CFT) mechanism can transfer contextual information from non-shadow to shadow regions at different scales. With the reconstructed feature maps, we remove shadows at L and A/B channels separately. At Stage-II, we use an encoder-decoder to refine current results and generate the final shadow removal results. We evaluate our proposed CANet on two benchmark datasets and some real-world shadow images with complex scenes. Extensive experimental results strongly demonstrate the efficacy of our proposed CANet and exhibit superior performance to state-of-the-arts.

Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency

Comment: Accepted at IEEE ICCV 2021

Link: http://arxiv.org/abs/2108.09891

Abstract

The success of deep neural networks (DNNs) has promoted the widespread applications of person re-identification (ReID). However, ReID systems inherit the vulnerability of DNNs to malicious attacks of visually inconspicuous adversarial perturbations. Detection of adversarial attacks is, therefore, a fundamental requirement for robust ReID systems. In this work, we propose a Multi-Expert Adversarial Attack Detection (MEAAD) approach to achieve this goal by checking context inconsistency, which is suitable for any DNN-based ReID systems. Specifically, three kinds of context inconsistencies caused by adversarial attacks are employed to learn a detector for distinguishing the perturbed examples, i.e., a) the embedding distances between a perturbed query person image and its top-K retrievals are generally larger than those between a benign query image and its top-K retrievals, b) the embedding distances among the top-K retrievals of a perturbed query image are larger than those of a benign query image, c) the top-K retrievals of a benign query image obtained with multiple expert ReID models tend to be consistent, which is not preserved when attacks are present. Extensive experiments on the Market1501 and DukeMTMC-ReID datasets show that, as the first adversarial attack detection approach for ReID, MEAAD effectively detects various adversarial attacks and achieves high ROC-AUC (over 97.5%).
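
Editor's sketch: the three context cues listed in the abstract can each be turned into a number per query. The snippet below is a rough illustration under stated assumptions, not the MEAAD implementation: it uses Euclidean distance for (a) and (b) and mean pairwise Jaccard overlap for (c), and the function name `context_features` is hypothetical. The learned detector that consumes these values is not shown.

```python
from itertools import combinations
from math import dist


def context_features(query_emb, topk_embs, topk_ids_per_expert):
    """Return three scalar cues for one query.

    query_emb: embedding of the query image (sequence of floats).
    topk_embs: embeddings of its top-K retrievals from one model.
    topk_ids_per_expert: one list of retrieved gallery ids per expert model.
    """
    # (a) mean distance between the query and its top-K retrievals
    query_to_topk = sum(dist(query_emb, e) for e in topk_embs) / len(topk_embs)

    # (b) mean pairwise distance among the top-K retrievals
    pairs = list(combinations(topk_embs, 2))
    among_topk = sum(dist(a, b) for a, b in pairs) / len(pairs)

    # (c) agreement of the top-K id sets across experts (mean Jaccard overlap)
    id_sets = [set(ids) for ids in topk_ids_per_expert]
    expert_pairs = list(combinations(id_sets, 2))
    agreement = sum(len(a & b) / len(a | b) for a, b in expert_pairs) / len(expert_pairs)

    return query_to_topk, among_topk, agreement


# Tiny made-up example: 2-D embeddings, K = 2, two expert models.
features = context_features(
    query_emb=(0.0, 0.0),
    topk_embs=[(3.0, 4.0), (0.0, 0.0)],
    topk_ids_per_expert=[[1, 2, 3], [2, 3, 4]],
)
```

Following the abstract, a perturbed query would be expected to show larger values for the first two cues and lower agreement for the third than a benign query.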

Influence-guided Data Augmentation for Neural Tensor Completion

Comment: Accepted for publication at 30th ACM International Conference on Information and Knowledge Management (ACM CIKM 2021). Code and data: https://github.com/srijankr/DAIN

Link: http://arxiv.org/abs/2108.10248

Abstract

How can we predict missing values in multi-dimensional data (or tensors) more accurately? The task of tensor completion is crucial in many applications such as personalized recommendation, image and video restoration, and link prediction in social networks. Many tensor factorization and neural network-based tensor completion algorithms have been developed to predict missing entries in partially observed tensors. However, they can produce inaccurate estimations as real-world tensors are very sparse, and these methods tend to overfit on the small amount of data. Here, we overcome these shortcomings by presenting a data augmentation technique for tensors. In this paper, we propose DAIN, a general data augmentation framework that enhances the prediction accuracy of neural tensor completion methods. Specifically, DAIN first trains a neural model and finds tensor cell importances with influence functions. After that, DAIN aggregates the cell importance to calculate the importance of each entity (i.e., an index of a dimension). Finally, DAIN augments the tensor by weighted sampling of entity importances and a value predictor. Extensive experimental results show that DAIN outperforms all data augmentation baselines in terms of enhancing imputation accuracy of neural tensor completion on four diverse real-world tensors. Ablation studies of DAIN substantiate the effectiveness of each component of DAIN. Furthermore, we show that DAIN scales near linearly to large datasets.
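
Editor's sketch: the aggregation and sampling steps described in the abstract can be pictured with a small example. The snippet below is an illustration only, not the DAIN code: it assumes cell importances are already available, aggregates them per entity by summing absolute values (an assumed rule), and samples the index of a new cell dimension by dimension with those weights. The value predictor that would fill in the new cell is omitted, and both function names are hypothetical.

```python
import random
from collections import defaultdict


def entity_importance(cell_importance, num_dims):
    """Aggregate per-cell scores into per-entity scores for each dimension."""
    per_dim = [defaultdict(float) for _ in range(num_dims)]
    for index, score in cell_importance.items():
        for d, entity in enumerate(index):
            per_dim[d][entity] += abs(score)  # assumed rule: sum of magnitudes
    return per_dim


def sample_new_cell(per_dim, rng):
    """Draw one entity per dimension, weighted by aggregated importance."""
    return tuple(
        rng.choices(list(imp.keys()), weights=list(imp.values()), k=1)[0]
        for imp in per_dim
    )


# Made-up importance scores for three observed cells of a 3-way tensor.
cells = {(0, 0, 0): 0.5, (0, 1, 0): -0.25, (1, 1, 1): 1.0}
importance = entity_importance(cells, num_dims=3)
new_index = sample_new_cell(importance, random.Random(0))
```

Entities that appear in many influential cells receive larger weights, so augmented cells concentrate on the parts of the tensor that matter most to the trained model.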

Integrating Transductive And Inductive Embeddings Improves Link Prediction Accuracy

Comment: 5 Pages, Accepted by CIKM 2021

Link: http://arxiv.org/abs/2108.10108

Abstract

In recent years, inductive graph embedding models, viz., graph neural networks (GNNs) have become increasingly accurate at link prediction (LP) in online social networks. The performance of such networks depends strongly on the input node features, which vary across networks and applications. Selecting appropriate node features remains application-dependent and generally an open question. Moreover, owing to privacy and ethical issues, use of personalized node features is often restricted. In fact, many publicly available data from online social network do not contain any node features (e.g., demography). In this work, we provide a comprehensive experimental analysis which shows that harnessing a transductive technique (e.g., Node2Vec) for obtaining initial node representations, after which an inductive node embedding technique takes over, leads to substantial improvements in link prediction accuracy. We demonstrate that, for a wide variety of GNN variants, node representation vectors obtained from Node2Vec serve as high quality input features to GNNs, thereby improving LP performance.

On the Acceleration of Deep Neural Network Inference using Quantized Compressed Sensing

Comment: 3 pages, no figures, paper accepted at Black In AI at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada

Link: http://arxiv.org/abs/2108.10101

Abstract

Accelerating deep neural network (DNN) inference on resource-limited devices is one of the most important barriers to ensuring a wider and more inclusive adoption. To alleviate this, DNN binary quantization for faster convolution and memory savings is one of the most promising strategies despite its serious drop in accuracy. The present paper therefore proposes a novel binary quantization function based on quantized compressed sensing (QCS). Theoretical arguments conjecture that our proposal preserves the practical benefits of standard methods, while reducing the quantization error and the resulting drop in accuracy.

APObind: A Dataset of Ligand Unbound Protein Conformations for Machine Learning Applications in De Novo Drug Design

Comment: The 2021 ICML Workshop on Computational Biology

Link: http://arxiv.org/abs/2108.09926

Abstract

Protein-ligand complex structures have been utilised to design benchmark machine learning methods that perform important tasks related to drug design such as receptor binding site detection, small molecule docking and binding affinity prediction. However, these methods are usually trained on only ligand bound (or holo) conformations of the protein and therefore are not guaranteed to perform well when the protein structure is in its native unbound conformation (or apo), which is usually the conformation available for a newly identified receptor. A primary reason for this is that the local structure of the binding site usually changes upon ligand binding. To facilitate solutions for this problem, we propose a dataset called APObind that aims to provide apo conformations of proteins present in the PDBbind dataset, a popular dataset used in drug design. Furthermore, we explore the performance of methods specific to three use cases on this dataset, through which, the importance of validating them on the APObind dataset is demonstrated.

Automatic Speech Recognition using limited vocabulary: A survey

Comment: 20 pages, 9 figures, 6 tables, submitted to IEEE ACCESS for possible publication

Link: http://arxiv.org/abs/2108.10254

Abstract

Automatic Speech Recognition (ASR) is an active field of research due to its huge number of applications and the proliferation of interfaces or computing devices that can support speech processing. But the bulk of applications is based on well-resourced languages that overshadow under-resourced ones. Yet ASR represents an undeniable means to promote such languages, especially when designing human-to-human or human-to-machine systems involving illiterate people. An approach to design an ASR system targeting under-resourced languages is to start with a limited vocabulary. ASR using a limited vocabulary is a subset of the speech recognition problem that focuses on the recognition of a small number of words or sentences. This paper aims to provide a comprehensive view of mechanisms behind ASR systems as well as techniques, tools, projects, recent contributions, and possibly future directions in ASR using a limited vocabulary. This work consequently provides a way to go when designing an ASR system using a limited vocabulary. Although an emphasis is put on limited vocabulary, most of the tools and techniques reported in this survey apply to ASR systems in general.

Farsighted Probabilistic Sampling based Local Search for (Weighted) Partial MaxSAT

Comment: Submitted to AAAI 2022

Link: http://arxiv.org/abs/2108.09988

Abstract

Partial MaxSAT (PMS) and Weighted Partial MaxSAT (WPMS) are both practical generalizations to the typical combinatorial problem of MaxSAT. In this work, we propose an effective farsighted probabilistic sampling based local search algorithm called FPS for solving these two problems, denoted as (W)PMS. The FPS algorithm replaces the mechanism of flipping a single variable per iteration step, that is widely used in existing (W)PMS local search algorithms, with the proposed farsighted local search strategy, and provides higher-quality local optimal solutions. The farsighted strategy employs the probabilistic sampling technique that allows the algorithm to look-ahead widely and efficiently. In this way, FPS can provide more and better search directions and improve the performance without reducing the efficiency. Extensive experiments on all the benchmarks of (W)PMS problems from the incomplete track of recent four years of MaxSAT Evaluations demonstrate that our method significantly outperforms SATLike3.0, the state-of-the-art local search algorithm, for solving both the PMS and WPMS problems. We furthermore do comparison with the extended solver of SATLike, SATLike-c, which is the champion of three categories among the total four (PMS and WPMS categories, each associated with two time limits) of the incomplete track in the recent MaxSAT Evaluation (MSE2021). We replace the local search component in SATLike-c with the proposed farsighted sampling local search approach, and the resulting solver FPS-c also outperforms SATLike-c for solving both the PMS and WPMS problems.

Detection of Illicit Drug Trafficking Events on Instagram: A Deep Multimodal Multilabel Learning Approach

Comment: Accepted by CIKM 2021

Link: http://arxiv.org/abs/2108.08920

Abstract

Social media such as Instagram and Twitter have become important platforms for marketing and selling illicit drugs. Detection of online illicit drug trafficking has become critical to combat the online trade of illicit drugs. However, the legal status often varies spatially and temporally; even for the same drug, federal and state legislation can have different regulations about its legality. Meanwhile, more drug trafficking events are disguised as a novel form of advertising commenting leading to information heterogeneity. Accordingly, accurate detection of illicit drug trafficking events (IDTEs) from social media has become even more challenging. In this work, we conduct the first systematic study on fine-grained detection of IDTEs on Instagram. We propose to take a deep multimodal multilabel learning (DMML) approach to detect IDTEs and demonstrate its effectiveness on a newly constructed dataset called multimodal IDTE (MM-IDTE). Specifically, our model takes text and image data as the input and combines multimodal information to predict multiple labels of illicit drugs. Inspired by the success of BERT, we have developed a self-supervised multimodal bidirectional transformer by jointly fine-tuning pretrained text and image encoders. We have constructed a large-scale dataset MM-IDTE with manually annotated multiple drug labels to support fine-grained detection of illicit drugs. Extensive experimental results on the MM-IDTE dataset show that the proposed DMML methodology can accurately detect IDTEs even in the presence of special characters and style changes attempting to evade detection.
